You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I might be wrong but I believe there is a regression of issue #887. Most probably it was not really solved in the first place. Note that in #1556, which was solved with the same PR, the XML document is in version 1.1. XML 1.1 does permit escaped characters in the range [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] . XML 1.0 on the other hand, does not permit ASCII control characters like . An XML parser should raise an error (and most of them actually do) if those characters appear in a document with an XML declaration of version 1.0. This holds both if the character is included in binary (e.g. "\u000c") or as an escaped sequence (e.g. ""). Furthermore, if I a am not mistaken, XHTML is a reformulation of XML version 1.0 [sic] and not 1.1. Finally, the w3c validator raises an error in all the variants of XHTML if the input includes the escape sequence .
If this is considered out of the scope of jsoup that's fine but developers should know that the output might not be a valid XML document even when OutputSettings.syntax is Syntax.xml. Another option would be to discard those characters when Syntax.xml is used. To be honest it's not clear if Syntax.xml refers to XML version 1.0 or 1.1. In the code of Entities.java though there is a reference to the spec of XML 1.0. Maybe the name of the constant should have been Syntax.xhtml and not Syntax.xml. Nevertheless, if it were to choose between the two versions of XML most probably it is more coherent to go for 1.0 since jsoup is an (X)HTML parser not an XML one.
I can help prepare a PR if someone can decide what the expected behaviour should be.
A subtle detail in the test above is that if prettyPrint is true  will be replaced with a space but other control characters forbidden in XML documents (e.g. ) will not. The underlying problem is not really solved if pretty printing is enabled.
Thanks for the review. For max compatibility / least surprise within the existing solution, my inclination is that we should align this on XML 1.0 and the rules around it. The EscapeMode xhtml can be the owner of that rule. If we later want to add XML 1.1, we could perhaps make a new escape mode xml1.1 or similar.
The Syntax.xml is not supposed to be tied to HTML / XHTML but just be XML. And not include any of the e.g. HTML ellisions or specific serialization rules.
Potentially we could infer the escape mode to use by the setting of the syntax, or the meta-data in the document, with the ability to override.
Background
I might be wrong but I believe there is a regression of issue #887. Most probably it was not really solved in the first place. Note that in #1556, which was solved with the same PR, the XML document is in version 1.1. XML 1.1 does permit escaped characters in the range
[#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
. XML 1.0 on the other hand, does not permit ASCII control characters like
. An XML parser should raise an error (and most of them actually do) if those characters appear in a document with an XML declaration of version 1.0. This holds both if the character is included in binary (e.g."\u000c"
) or as an escaped sequence (e.g.""
). Furthermore, if I a am not mistaken, XHTML is a reformulation of XML version 1.0 [sic] and not 1.1. Finally, the w3c validator raises an error in all the variants of XHTML if the input includes the escape sequence
.If this is considered out of the scope of jsoup that's fine but developers should know that the output might not be a valid XML document even when
OutputSettings.syntax
isSyntax.xml
. Another option would be to discard those characters whenSyntax.xml
is used. To be honest it's not clear ifSyntax.xml
refers to XML version 1.0 or 1.1. In the code of Entities.java though there is a reference to the spec of XML 1.0. Maybe the name of the constant should have beenSyntax.xhtml
and notSyntax.xml
. Nevertheless, if it were to choose between the two versions of XML most probably it is more coherent to go for 1.0 since jsoup is an (X)HTML parser not an XML one.I can help prepare a PR if someone can decide what the expected behaviour should be.
Desired behaviour
A subtle detail in the test above is that if
prettyPrint
is true
will be replaced with a space but other control characters forbidden in XML documents (e.g.
) will not. The underlying problem is not really solved if pretty printing is enabled.Observed behaviour
The text was updated successfully, but these errors were encountered: