Unicode identifiers in the WAT format #1843

xfq · 2024-11-11T05:59:18Z

With the annotations proposal (Wasm 3) we now support string-escaping arbitrary Unicode as identifiers, so I think we can close this.

We (W3C i18n WG) have two questions about the resolution:

1. Why are Unicode identifiers not allowed directly in the WebAssembly text format (i.e., string-escaping seems to be required)? Although web developers usually don't read them, devtools developers, Wasm module authors, or WebAssembly compiler developers might read them and find Unicode identifiers useful. Escapes will make the identifiers unreadable. See https://github.com/unicode-org/message-format-wg/blob/5f6657b54f60b35a8fb17653942551ebf0b862ca/spec/message.abnf#L56 for an example of a language supporting Unicode identifiers, using XML-Name related restrictions.
2. Why is it only supported in Wasm 3, but not Wasm 2 (which is not CR yet)?

rossberg · 2024-11-11T14:47:41Z

Why are Unicode identifiers not allowed directly in the WebAssembly text format (i.e., string-escaping seems to be required)? Although web developers usually don't read them, devtools developers, Wasm module authors, or WebAssembly compiler developers might read them and find Unicode identifiers useful. Escapes will make the identifiers unreadable. See https://github.com/unicode-org/message-format-wg/blob/5f6657b54f60b35a8fb17653942551ebf0b862ca/spec/message.abnf#L56 for an example of a language supporting Unicode identifiers, using XML-Name related restrictions.

The new syntax merely requires delimiting identifiers with quote characters. Escapes are not necessary, except for exceptional cases of names that wouldn't even be allowable as unquoted identifiers, such as ones themselves containing quotes or control characters.
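For illustration, here is a minimal sketch of how a tool might read both plain and quoted identifiers. It assumes the quoted form is `$"..."` and that the escape set is the one used by text-format strings (`\t`, `\n`, `\r`, `\"`, `\'`, `\\`, `\u{hex}`, and two-digit hex bytes); that escape set is stated from memory rather than taken from this thread. The function names are hypothetical, and underscores inside `\u{...}` are left out for brevity.

```python
import string

# Characters assumed to be allowed in an unquoted identifier (after the '$').
IDCHARS = set(string.ascii_letters + string.digits + "!#$%&'*+-./:<=>?@\\^_`|~")
SIMPLE_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", '"': '"', "'": "'", "\\": "\\"}
HEX = set(string.hexdigits)


def _read_escape(src: str, pos: int) -> tuple[int, bytes]:
    """Read one escape body starting just after the backslash."""
    if pos >= len(src):
        raise ValueError("unterminated escape")
    ch = src[pos]
    if ch in SIMPLE_ESCAPES:
        return pos + 1, SIMPLE_ESCAPES[ch].encode("utf-8")
    if src.startswith("u{", pos):
        end = src.find("}", pos)
        digits = src[pos + 2:end] if end != -1 else ""
        if not digits or any(c not in HEX for c in digits):
            raise ValueError("malformed \\u{...} escape")
        # chr() rejects values above U+10FFFF; encode() rejects surrogates.
        return end + 1, chr(int(digits, 16)).encode("utf-8")
    pair = src[pos:pos + 2]
    if len(pair) == 2 and all(c in HEX for c in pair):
        return pos + 2, bytes([int(pair, 16)])  # raw byte, validated later
    raise ValueError("unknown escape")


def _read_quoted(src: str, pos: int) -> tuple[str, int]:
    """Read the body of a quoted identifier; pos is just after the opening quote."""
    raw = bytearray()
    while pos < len(src):
        ch = src[pos]
        if ch == '"':
            if not raw:
                raise ValueError("empty identifier")
            # The collected bytes must form valid UTF-8 as a whole.
            return raw.decode("utf-8"), pos + 1
        if ch == "\\":
            pos, chunk = _read_escape(src, pos + 1)
            raw += chunk
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            raise ValueError("control characters must be escaped")
        else:
            raw += ch.encode("utf-8")
            pos += 1
    raise ValueError("unterminated quoted identifier")


def read_identifier(src: str, pos: int = 0) -> tuple[str, int]:
    """Return (name, position after the identifier)."""
    if pos >= len(src) or src[pos] != "$":
        raise ValueError("identifier must start with '$'")
    pos += 1
    if pos < len(src) and src[pos] == '"':
        return _read_quoted(src, pos + 1)
    start = pos
    while pos < len(src) and src[pos] in IDCHARS:
        pos += 1
    if pos == start:
        raise ValueError("empty identifier")
    return src[start:pos], pos
```

With this sketch, `read_identifier('$"变量" rest')` yields the name `变量` without any escapes in the source; an escape is only needed for something like an embedded quote, as in `$"a\"b"`. No Unicode property tables are consulted at any point.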

The Wasm text format is a lightweight interchange format that is used by a wide variety of tools, with varying degrees of complexity and resource constraints, on a wide range of platforms, from the Web to small embedded systems. Undelimited Unicode identifiers, if handled properly according to Unicode UAX #31, would add substantial complexity to both the specification and implementations: Unicode's definition of an identifier is complicated and requires Unicode property tables to handle. The burden would fall on every tool that processes the Wasm text format, and it is unlikely that all of them would implement it, causing fragmentation. In contrast, to understand quoted identifiers, tools merely need to implement UTF-8 decoding, which is a few lines of code.
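To give a sense of scale for the UTF-8 side, the following is a rough sketch of a strict decoder written by hand, rather than code from any particular Wasm tool. The function name `decode_utf8` is hypothetical. It rejects overlong encodings, surrogates, and values above U+10FFFF, and needs no data tables.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into code points, raising ValueError on malformed input."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # Lead byte decides: initial bits, continuation count, smallest legal value.
        if b0 < 0x80:
            cp, need, lowest = b0, 0, 0
        elif 0xC2 <= b0 <= 0xDF:
            cp, need, lowest = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, need, lowest = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, need, lowest = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + need >= n and need > 0:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, need + 1):
            b = data[i + k]
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + k}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < lowest:
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f"invalid code point at offset {i}")
        out.append(cp)
        i += need + 1
    return out
```

For example, `decode_utf8("naïve_变量".encode("utf-8"))` returns the same code points as `[ord(c) for c in "naïve_变量"]`. The whole check depends only on byte ranges, which is the contrast with the property tables that UAX #31 classification would require.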

As UAX #31 itself admits:

"The disadvantage of working with the lexical classes defined previously is the storage space needed for the detailed definitions, plus the fact that with each new version of the Unicode Standard new characters are added, which an existing parser would not be able to recognize. In other words, the recommendations based on that table are not upwardly compatible."

Unfortunately, the alternative it suggests (negative character classification) also has serious problems, such as reserving the entire code space for identifiers, and hence turning many future extensions to the language's lexical syntax that would otherwise be conservative into breaking changes.

Why is it only supported in Wasm 3, but not Wasm 2 (which is not CR yet)?

Simply because it did not make the feature cut, which already happened in 2021. But Wasm 3 is essentially done at this point, so it will be pushed into the process immediately after Wasm 2 is published.

xfq added the i18n-needs-resolution label (an issue the Internationalization Group has raised and is looking for a response on) Nov 11, 2024

xfq mentioned this issue Nov 11, 2024

Unicode identifiers in the WAT format w3c/i18n-activity#1935

Open

plehegar mentioned this issue Dec 13, 2024

CR Request for WebAssembly Core - wasm-core w3c/transitions#651

Open
