URI and IRI support in Python is a dumpster fire of sadness. It is "good enough" for most day-to-day purposes, but for a project specifically focusing on correctness and compliance, that's not really good enough. Considering some of the more commonly used libraries:
urllib doesn't even pretend to be compliant, and automatically encodes things that should be errors.
urllib3 also auto-encodes instead of detecting errors.
rfc3986 comes a lot closer to compliance, but still does undesirable auto-encoding, although in some circumstances you can work around that with more elaborate use of its validation class.
rfc3986 2.0.0 also emits deprecation warnings when you use certain non-deprecated functions. I have contributed a fix for this, but it's unclear when they will get around to publishing a new release.
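As a small illustration of the permissiveness described above, here is a sketch using only the standard library. The example URL is made up. A literal space is not a legal URI character under RFC 3986, yet parsing succeeds without complaint, and quoting quietly repairs the input rather than reporting it.

```python
from urllib.parse import quote, urlsplit

# Hypothetical input: a literal space is not allowed in a URI (RFC 3986).
bad = "http://example.com/a b"

# urlsplit() does not validate; it returns the path unchanged and raises nothing.
parts = urlsplit(bad)
print(repr(parts.path))

# quote() percent-encodes the space, which hides the original error.
encoded_path = quote(parts.path)
print(encoded_path)
```

Whether this counts as a bug depends on the use case; for a compliance-focused tool the silent repair is the problem.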
For IRIs, all of the above means that non-URI Unicode characters are encoded, rendering the IRI unreadable as an IRI. It's not entirely clear if there's any need for true IRI support for OAS 3.x compliance, but presumably it will be a concern for 4.0. The options for supporting unencoded IRIs are very sketchy:
The standard library has no IRI support
rfc3986.IRIReference is experimental, and incomprehensibly does the same encoding as rfc3986.URIReference
rfc3987 is a completely separate package from rfc3986; it handles Unicode correctly but is otherwise very minimal and poorly documented, and is still too permissive about some things
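To make the readability problem concrete, here is a minimal sketch of what percent-encoding does to a non-ASCII path. It uses only the standard library, and the path is a made-up example; it is not meant to reproduce the exact behavior of any of the packages above.

```python
from urllib.parse import quote, unquote

# Hypothetical IRI path containing U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
iri_path = "/caf\u00e9"

# Percent-encoding the UTF-8 bytes gives a valid URI path that is no longer
# readable as an IRI.
uri_path = quote(iri_path, safe="/")
print(uri_path)

# The mapping is reversible, but only if the caller knows to decode it.
round_tripped = unquote(uri_path)
print(round_tripped == iri_path)
```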
None of this even gets into scheme-specific parsing, or media-type-specific fragment parsing.
To the best of my knowledge, the only code that handles everything correctly is the abnf package, which includes ABNF support for RFC 3986 and 3987, but produces a huge parse tree as a result and is probably significantly slower (although I haven't measured it). Its error reporting is also incomprehensible.
To add further confusion:
rdflib wants rdflib.URIRef instances (which are actually IRIs and not URI-references, but are called URIRef for historical reasons), which subclass from str.
jschon has its own jschon.URI class, which is a wrapper around rfc3986.URIReference.
So at minimum there are two URI-ish classes floating around the code, one of which wraps a third, none of which are from the standard library, and all of which have different interfaces.
┻━┻︵ \(`Д´)/ ︵ ┻━┻
Clearly the only answer is to write another library. More seriously, wrap this dumpster fire in a facade so that we can be less fiddly but still change implementations until finding the right correctness/convenience/performance balance. We are unlikely to know what constitutes "reasonable" performance until the system is near-complete and can process enormous specs like the GitHub API. A facade will make it feasible to quickly measure different alternative implementations.
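A rough sketch of what such a facade might look like follows. Every name in it (`Parts`, `Backend`, `RegexBackend`, `URI`, `URISyntaxError`) is hypothetical and does not come from any existing package. The backend shown uses the component-splitting regex from RFC 3986 Appendix B plus a coarse character check; it is deliberately not a full implementation of the grammar, just a placeholder that a stricter backend could replace without changing callers.

```python
import re
from dataclasses import dataclass
from typing import Optional, Protocol


class URISyntaxError(ValueError):
    """Raised instead of silently re-encoding invalid input."""


@dataclass(frozen=True)
class Parts:
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


class Backend(Protocol):
    def parse(self, value: str) -> Parts: ...


# Component regex from RFC 3986, Appendix B.
_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)
# Coarse check only: one reserved/unreserved character or one %XX escape.
_TOKEN = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2}")


class RegexBackend:
    """Placeholder backend; a stricter or faster one can be swapped in."""

    def parse(self, value: str) -> Parts:
        pos = 0
        while pos < len(value):
            m = _TOKEN.match(value, pos)
            if m is None:
                raise URISyntaxError(
                    f"invalid character or escape at offset {pos}: "
                    f"{value[pos:pos + 3]!r}"
                )
            pos = m.end()
        g = _APPENDIX_B.match(value)
        return Parts(g.group(2), g.group(4), g.group(5), g.group(7), g.group(9))


class URI:
    # Class-level backend so implementations can be swapped for benchmarking.
    backend: Backend = RegexBackend()

    def __init__(self, value: str) -> None:
        self._value = value
        self.parts = self.backend.parse(value)

    def __str__(self) -> str:
        return self._value

    @property
    def is_uri(self) -> bool:
        """True for a URI with a scheme, False for a relative reference."""
        return self.parts.scheme is not None


u = URI("https://example.com/a?b=c#/d")
print(u.parts, u.is_uri)
print(URI("../x").is_uri)
```

Because callers only see `URI` and `Parts`, measuring an alternative implementation should amount to assigning a different object to `URI.backend`.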
Requirements:
easily work with URIs (or IRIs) as distinct from URI-references (or IRI-references)
easily work with query strings, specifically application/x-www-form-urlencoded-style strings
easy integration with templated URLs as defined by OAS (which are essentially a subset of RFC 6570)
easy manipulation of JSON Pointer fragments (probably using the jschon.JSONPointer class, although there are at least three other JSON Pointer packages as alternatives; however, we also need Relative JSON Pointer support, which is rarer but well-supported by jschon.RelativeJSONPointer, to which I contributed fixes in jschon 0.10.0)
possibly allow for faster manipulation while deferring validation to the last step
reliable validation, at least of the full generic syntax, without silencing errors through aggressive re-encoding
excellent error reporting UX for validation errors
support for important scheme-specific syntax: http(s), file, urn (especially urn:uuid), and possibly tag and mailto. This can be deferred and added as the use cases arise, but it needs to be planned.
support for important media-type specific fragment syntax (JSON Pointer for OAS media types, application/schema+json, and application/yaml; JSON Schema plain name fragments, possibly other things to support External Documentation links?)
reasonable performance on large specifications (e.g. GitHub, which is >200K lines in YAML)
smooth interoperability with rdflib.URIRef and jschon.URI, as well as plain strings
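Two of the requirements above, form-urlencoded query strings and JSON Pointer fragments, can be sketched with the standard library alone. The helper names `query_pairs` and `pointer_tokens` are hypothetical, and the example URIs are made up. A real facade would more likely delegate pointer handling to jschon.JSONPointer; this only shows the shape of the convenience being asked for.

```python
from typing import List, Tuple
from urllib.parse import parse_qsl, unquote, urlencode, urlsplit


def query_pairs(uri: str) -> List[Tuple[str, str]]:
    """Decode an application/x-www-form-urlencoded query into ordered pairs."""
    return parse_qsl(urlsplit(uri).query, keep_blank_values=True)


def pointer_tokens(uri: str) -> List[str]:
    """Split a JSON Pointer fragment into reference tokens (RFC 6901)."""
    # The fragment is percent-decoded first, then parsed as a pointer.
    fragment = unquote(urlsplit(uri).fragment)
    if fragment == "":
        return []  # the empty pointer refers to the whole document
    if not fragment.startswith("/"):
        raise ValueError(f"not a JSON Pointer fragment: {fragment!r}")
    # RFC 6901 unescaping order: "~1" before "~0".
    return [t.replace("~1", "/").replace("~0", "~") for t in fragment[1:].split("/")]


pairs = query_pairs("https://example.com/search?a=1&b=two+words&b=")
print(pairs)
print(urlencode(pairs))

tokens = pointer_tokens("openapi.yaml#/paths/~1pets~1%7Bid%7D/get")
print(tokens)
```

A plain-name fragment such as `#foo` raises `ValueError` here, which is one place where media-type-specific fragment handling would need to take over.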
Looking more closely at rfc3987, the function I tried defaults to non-validating parsing, but it's trivial to enable validation, which seems to be appropriately strict. It seems like the best candidate for validation, but it does not offer any sort of convenience class packaging up relevant functionality.