Why Big5 index contains unmappable characters? #293

Mingun · 2022-08-21T09:36:37Z

I am trying to generate every character that a particular encoding supports, in order to produce test files for quick-xml. Using the encoding_rs crate, I found that some code points declared in https://github.com/whatwg/encoding/blob/main/indexes.json for the Big5 encoding are actually output as HTML numeric character references (&#...;). Digging into that, I realized that such output is generated when a character is unmappable in the encoding.

So the question is: what is the rationale for including characters in the index that the encoding cannot map? I cannot find the answer at https://encoding.spec.whatwg.org/. It describes how to deal with this strange index, but does not explain why the index is so strange.
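To make the symptom above concrete, here is a minimal sketch of how the `&#...;` replacements could be spotted in already-encoded output. It assumes the bytes come from an encoder that substitutes decimal numeric character references for unmappable characters, as described above; the function name `numeric_char_refs` is made up for this illustration and is not part of encoding_rs.

```rust
/// Collects the values of decimal numeric character references (`&#1234;`)
/// found in a byte buffer. Hypothetical helper, for illustration only.
fn numeric_char_refs(bytes: &[u8]) -> Vec<u32> {
    let mut found = Vec::new();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'&' && bytes[i + 1] == b'#' {
            let start = i + 2;
            let mut end = start;
            // Consume the run of ASCII digits after "&#".
            while end < bytes.len() && bytes[end].is_ascii_digit() {
                end += 1;
            }
            // A reference needs at least one digit and a closing ';'.
            if end > start && end < bytes.len() && bytes[end] == b';' {
                // The slice holds only ASCII digits, so it is valid UTF-8;
                // values that overflow u32 are skipped.
                if let Ok(text) = std::str::from_utf8(&bytes[start..end]) {
                    if let Ok(value) = text.parse::<u32>() {
                        found.push(value);
                    }
                }
                i = end + 1;
                continue;
            }
        }
        i += 1;
    }
    found
}
```

For example, `numeric_char_refs(b"A&#12345;B")` yields `vec![12345]`. Note that a reference that was already present literally in the input text looks the same in the output, so this check is only reliable when the input is known not to contain such sequences.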


hsivonen · 2022-08-23T08:00:06Z

The Big5 encoder and decoder are asymmetric (like the EUC-JP encoder and decoder). The visualizations visualize what can be decoded. The spec excludes part of the decoding space from round-tripping via the encoder in order for HTML form submission not to generate extension-range bytes that some server-side recipients may not support.

For EUC-JP, the asymmetry is based on historical experience. For Big5, it is by prudent analogy with the problem initially seen with EUC-JP. Also, the exclusion for Big5 is questionable and possibly, by accident, excludes less than what was intended: the encoder only excludes the extension part below the original Big5 range but doesn't exclude the other extension part above the original Big5 range.
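The asymmetry can be illustrated with the pointer arithmetic alone. The following is a sketch based on my reading of the Big5 decoder and encoder steps in the Encoding Standard, where the encoder skips every index entry whose pointer is less than (0xA1 - 0x81) × 157. The function and constant names are illustrative, and the lookup between pointers and code points in the index is deliberately left out.

```rust
/// Pointers below this value are ignored by the encoder (assumption taken
/// from the Big5 encoder steps in the Encoding Standard).
const ENCODER_MIN_POINTER: u32 = (0xA1 - 0x81) * 157; // 5024

/// Decoder direction: maps a lead/trail byte pair to an index pointer.
/// Returns None when the bytes are outside the ranges the decoder accepts.
fn pointer_from_bytes(lead: u8, trail: u8) -> Option<u32> {
    if !(0x81..=0xFE).contains(&lead) {
        return None;
    }
    let offset: u32 = match trail {
        0x40..=0x7E => 0x40,
        0xA1..=0xFE => 0x62,
        _ => return None,
    };
    Some((lead as u32 - 0x81) * 157 + (trail as u32 - offset))
}

/// Encoder direction: maps a pointer back to a lead/trail byte pair.
/// Returns None for pointers the encoder excludes, which is where the
/// asymmetry with the decoder comes from.
fn bytes_from_pointer(pointer: u32) -> Option<(u8, u8)> {
    if pointer < ENCODER_MIN_POINTER {
        return None; // extension range below the original Big5 range
    }
    let lead = pointer / 157 + 0x81;
    if lead > 0xFE {
        return None; // beyond the last possible lead byte
    }
    let trail = pointer % 157;
    let offset = if trail < 0x3F { 0x40 } else { 0x62 };
    Some((lead as u8, (trail + offset) as u8))
}
```

With these definitions, the byte pair 0x87 0x40 decodes to pointer 942, but `bytes_from_pointer(942)` returns `None`, so a character that exists only in that part of the index can be decoded yet never encoded. Pointers at or above 5024, including those in the upper extension range, still round-trip.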

Mingun · 2022-08-23T15:11:34Z

Well, that information should probably be included somewhere in the spec, perhaps here

encoding/encoding.bs

Lines 959 to 961 in 4f549cd

    
    <p class="note no-backref">All <a lt=index>indexes</a> are also available as a non-normative
    <a href=indexes.json>indexes.json</a> resource. (<a>Index gb18030 ranges</a> has a slightly
    different format here, to be able to represent ranges.)

because it was a little surprising when I used indexes.json for my own purposes.

annevk · 2024-12-16T09:09:42Z

What information exactly? That you cannot use indexes without the corresponding encoder/decoder definitions?

Mingun · 2024-12-16T15:44:41Z

The information that the index contains unmappable characters. All other indexes simply do not include characters that cannot be represented in the encoding, but the Big5 index does.

annevk · 2024-12-16T16:03:49Z

I don't think that's true? As pointed out above EUC-JP is also not symmetric. gb18030 isn't either. Assuming that's what you mean by "unmappable character".

Mingun · 2024-12-16T16:06:47Z

Yes, some indexes contain characters that are relevant only for encoding or only for decoding, but that is not reflected anywhere in the documentation.
