Why Big5 index contains unmappable characters? #293

Open
Mingun opened this issue Aug 21, 2022 · 6 comments

Comments

Mingun commented Aug 21, 2022

I am trying to generate every character that a particular encoding supports, in order to produce test files for quick-xml. Using the encoding_rs crate, I found that some code points declared in https://github.com/whatwg/encoding/blob/main/indexes.json for the Big5 encoding are actually output as HTML numeric character references (&#...;). Digging into that, I realized that such output is generated when a character is unmappable in the encoding.

So the question is: what is the rationale for including in the index characters that are unmappable in the encoding? I cannot find the answer at https://encoding.spec.whatwg.org/. It describes how to deal with this strange index, but does not explain why the index is so strange.

Member

hsivonen commented Aug 23, 2022

The Big5 encoder and decoder are asymmetric (like the EUC-JP encoder and decoder). The visualizations visualize what can be decoded. The spec excludes part of the decoding space from round-tripping via the encoder in order for HTML form submission not to generate extension-range bytes that some server-side recipients may not support.
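To make the asymmetry concrete, here is a minimal sketch of the pointer arithmetic as I read it in the Encoding Standard's Big5 decoder and encoder. The function and constant names are made up for illustration and are not taken from encoding_rs or the spec; only the byte ranges and the `(0xA1 - 0x81) * 157` threshold come from the spec text.

```rust
/// Number of trail-byte slots per lead byte in the Big5 index.
const TRAILS_PER_LEAD: u32 = 157;
/// First pointer the encoder considers: (0xA1 - 0x81) * 157.
const FIRST_ENCODABLE_POINTER: u32 = (0xA1 - 0x81) * TRAILS_PER_LEAD;

/// Decoder side: map a lead/trail byte pair to an index pointer.
/// Returns `None` when either byte is outside the Big5 lead/trail ranges.
fn pointer_from_bytes(lead: u8, trail: u8) -> Option<u32> {
    if !(0x81..=0xFE).contains(&lead) {
        return None;
    }
    let offset: u32 = match trail {
        0x40..=0x7E => 0x40,
        0xA1..=0xFE => 0x62,
        _ => return None,
    };
    Some((lead as u32 - 0x81) * TRAILS_PER_LEAD + (trail as u32 - offset))
}

/// Encoder side: index entries below the threshold are ignored,
/// so the characters stored there can be decoded but never encoded.
fn encoder_may_use(pointer: u32) -> bool {
    pointer >= FIRST_ENCODABLE_POINTER
}
```

Under these assumptions, any byte pair with a lead byte in 0x81..=0xA0 decodes to a pointer that the encoder refuses to use, which would explain the `&#...;` output for those characters.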

For EUC-JP, the asymmetry is based on historical experience. For Big5, it is by prudent analogy with the problem initially seen with EUC-JP. Also, for Big5, the exclusion is questionable and possibly, by accident, excludes less than what was intended: the encoder only excludes the extension part below the original Big5 range but doesn't exclude the other extension part above the original Big5 range.
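The one-sided nature of the exclusion can be checked with a small sketch that lists the lead bytes whose entire row of pointers sits below the encoder threshold. The helper name is hypothetical, and the code makes no claim about where exactly the original Big5 range ends; it only shows that the threshold is a lower bound with no matching upper bound.

```rust
/// Lead bytes whose whole row of 157 pointers falls below the
/// encoder threshold of (0xA1 - 0x81) * 157.
fn leads_skipped_by_encoder() -> Vec<u8> {
    let threshold: u32 = (0xA1 - 0x81) * 157;
    (0x81u8..=0xFE)
        // Last pointer in a row is the row's first pointer plus 156.
        .filter(|&lead| (u32::from(lead) - 0x81) * 157 + 156 < threshold)
        .collect()
}
```

The result covers only the low lead bytes; nothing near 0xFE is skipped, so any extension entries stored at high lead bytes remain visible to the encoder.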

Author

Mingun commented Aug 23, 2022

Well, that information should probably be included somewhere in the spec, perhaps here:

encoding/encoding.bs

Lines 959 to 961 in 4f549cd

<p class="note no-backref">All <a lt=index>indexes</a> are also available as a non-normative
<a href=indexes.json>indexes.json</a> resource. (<a>Index gb18030 ranges</a> has a slightly
different format here, to be able to represent ranges.)

because it was a little surprising when I used indexes.json for my own purposes.

Member

annevk commented Dec 16, 2024

What information exactly? That you cannot use indexes without the corresponding encoder/decoder definitions?
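For anyone consuming indexes.json directly, a rough sketch of what "using the index together with the encoder definition" could look like for Big5 follows. It assumes the index has already been loaded into a slice where the position is the pointer and `None` stands for a null entry; that in-memory shape and the names `big5_encoder_pointer`, `ENCODER_START`, and `USE_LAST_POINTER` are my own choices, not part of the spec or of encoding_rs. The skip threshold and the six code points that take the last matching pointer follow my reading of the "index Big5 pointer" definition.

```rust
/// Entries before this pointer are ignored by the encoder.
const ENCODER_START: usize = (0xA1 - 0x81) * 157;
/// Code points for which the encoder takes the last matching pointer.
const USE_LAST_POINTER: [u32; 6] = [0x2550, 0x255E, 0x2561, 0x256A, 0x5341, 0x5345];

/// Returns the pointer the encoder would pick for `code_point`,
/// or `None` if the encoder treats the code point as unmappable.
fn big5_encoder_pointer(index: &[Option<u32>], code_point: u32) -> Option<usize> {
    let mut matches = index
        .iter()
        .enumerate()
        .skip(ENCODER_START)
        .filter(|(_, entry)| **entry == Some(code_point))
        .map(|(pointer, _)| pointer);
    if USE_LAST_POINTER.contains(&code_point) {
        matches.last()
    } else {
        matches.next()
    }
}
```

To enumerate only the characters that survive encoding, one could call this for every code point found in the index and keep those that return `Some`. A code point that appears only below the threshold still decodes, but yields `None` here.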

Author

Mingun commented Dec 16, 2024

The information that the index contains unmappable characters. All other indexes simply do not include characters that cannot be represented in the encoding, but the Big5 index does.

Member

annevk commented Dec 16, 2024

I don't think that's true? As pointed out above EUC-JP is also not symmetric. gb18030 isn't either. Assuming that's what you mean by "unmappable character".

Author

Mingun commented Dec 16, 2024

Yes, some indexes contain characters that are relevant only to one direction (encoding or decoding), but that is not reflected anywhere in the documentation.
