Why does the Big5 index contain unmappable characters? #293
Comments
The Big5 encoder and decoder are asymmetric (like the EUC-JP encoder and decoder). The visualizations show what can be decoded. The spec excludes part of the decoding space from round-tripping via the encoder so that HTML form submission does not generate extension-range bytes that some server-side recipients may not support. For EUC-JP, the asymmetry is based on historical experience. For Big5, it is a prudent analogy to the problem initially seen with EUC-JP. Also, the exclusion for Big5 is questionable and possibly, by accident, excludes less than was intended: the encoder only excludes the extension part below the original Big5 range but doesn't exclude the other extension part above the original Big5 range.
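To make the asymmetry described above concrete, here is a minimal Python sketch. It assumes the threshold `(0xA1 - 0x81) * 157` that the Encoding Standard uses when computing the index Big5 pointer for the encoder. `TOY_INDEX` and all function names are hypothetical: the entry at pointer 100 is a made-up placeholder, not real index data, while pointer 5024 corresponds to bytes `A1 40` (U+3000). The sketch deliberately omits trail-byte validation, the decoder's two-code-point special cases, and the encoder's "last pointer" rule for a few duplicated code points.

```python
# Pointers below this value lie in the extension area that the encoder skips.
ENCODER_MIN_POINTER = (0xA1 - 0x81) * 157  # 5024

# Toy stand-in for index Big5 (pointer -> code point).
# Pointer 100 -> U+E000 is a made-up placeholder, NOT real index data.
TOY_INDEX = {100: 0xE000, 5024: 0x3000}


def pointer_to_bytes(pointer):
    """Turn an index pointer into a lead/trail byte pair."""
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    offset = 0x40 if trail < 0x3F else 0x62
    return bytes([lead, trail + offset])


def bytes_to_pointer(data):
    """Inverse of pointer_to_bytes (no validation of the trail byte)."""
    lead, trail = data
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)


def decode_pair(data, index=TOY_INDEX):
    """The decoder consults the whole index, including low pointers."""
    code_point = index.get(bytes_to_pointer(data))
    return None if code_point is None else chr(code_point)


def encode_char(ch, index=TOY_INDEX):
    """The encoder ignores every entry below ENCODER_MIN_POINTER."""
    candidates = [
        pointer
        for pointer, code_point in index.items()
        if code_point == ord(ch) and pointer >= ENCODER_MIN_POINTER
    ]
    return pointer_to_bytes(min(candidates)) if candidates else None
```

With this toy index, `decode_pair(b"\x81\xc6")` yields the placeholder character, but `encode_char` returns `None` for that same character, whereas U+3000 round-trips through `A1 40`. Anyone reading indexes.json directly would need an equivalent filter to get only the code points the encoder can produce.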
Well, that information should probably be included somewhere in the spec, probably here (Lines 959 to 961 in 4f549cd), because it was a little surprising when I used indexes.json for my own purposes.
What information exactly? That you cannot use indexes without the corresponding encoder/decoder definitions?
The information that the index contains unmappable characters. All other indexes simply do not include characters that cannot be represented in the encoding, but that is not the case for Big5.
I don't think that's true? As pointed out above, EUC-JP is also not symmetric. gb18030 isn't either. Assuming that's what you mean by "unmappable character".
Yes, some indexes contain characters that are only relevant for encoding/decoding, but that is not reflected anywhere in the documentation.
I am trying to generate all of the characters that a particular encoding supports, in order to produce test files for quick-xml. I found that, using the encoding_rs crate, some code points declared in https://github.com/whatwg/encoding/blob/main/indexes.json for the Big5 encoding are actually represented as HTML references (`&#...;`). Digging into that, I realized that such output is generated when a character is unmappable by the encoding.

So the question is: what is the rationale for including in the index characters that are unmappable by the encoding? I cannot find the answer on https://encoding.spec.whatwg.org/. It describes how to deal with that strange index, but does not explain why the index is so strange.
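For readers unfamiliar with the `&#...;` output mentioned above, the following small Python sketch shows the same kind of fallback: a character the target encoding cannot represent is replaced with a decimal numeric character reference. This is only an analogy. It uses Python's built-in `xmlcharrefreplace` error handler with ASCII, not encoding_rs and not the WHATWG Big5 index, and `encode_with_ncr` is a hypothetical helper name.

```python
def encode_with_ncr(text, encoding):
    """Encode text, replacing unmappable characters with &#NNNN; references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# U+20AC (EURO SIGN) has no ASCII representation, so it becomes &#8364;.
print(encode_with_ncr("a\u20ac", "ascii"))  # b'a&#8364;'
```

Seeing such a reference in encoder output is therefore a sign that the encoder treated the character as unmappable, even if the character appears in the index used for decoding.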