URI/IRI library #7

Closed
handrews opened this issue Apr 30, 2023 · 2 comments

@handrews

handrews commented Apr 30, 2023

URI and IRI support in Python is a dumpster fire of sadness. It is "good enough" for most day-to-day purposes, but not for a project that specifically focuses on correctness and compliance. Considering some of the more commonly used libraries:

  • urllib doesn't even pretend to be compliant, and automatically encodes things that should be errors.
  • urllib3 also auto-encodes instead of detecting errors
  • rfc3986 comes a lot closer to compliance, but still does undesirable auto-encoding, although in some circumstances you can work around that with more elaborate use of its validation class
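
As a small illustration of the leniency described above, here is a sketch using only the standard library. The example string is made up. `urllib.parse.urlsplit` splits a string containing characters that RFC 3986 does not allow unencoded without raising anything, and `urllib.parse.quote` encodes them silently rather than reporting a problem:

```python
from urllib.parse import quote, urlsplit

# A space and "|" are not allowed unencoded by the RFC 3986 generic syntax.
bad = "http://example.com/a b|c"

parts = urlsplit(bad)             # no exception is raised
path = parts.path                 # the invalid characters pass through as-is
encoded = quote(path, safe="/")   # encoded silently instead of being flagged

print(repr(path))
print(repr(encoded))
```

Neither call gives the caller any signal that the input was not a valid URI in the first place.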

rfc3986 2.0.0 also emits deprecation warnings when you use certain non-deprecated functions. I have contributed a fix for this, but it's unclear when they will get around to publishing a new release.

For IRIs, all of the above means that non-URI unicode characters are encoded, rendering the IRI unreadable as an IRI. It's not entirely clear if there's any need for true IRI support for OAS 3.x compliance, but presumably it will be a concern for 4.0. The options for supporting unencoded IRIs are very sketchy:

  • The standard library has no IRI support
  • rfc3986.IRIReference is experimental, and incomprehensibly does the same encoding as rfc3986.URIReference
  • rfc3987 is a completely separate package from rfc3986; it handles unicode correctly but is otherwise very minimal and poorly documented, and is still too permissive about some things
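
To make "encoded, rendering the IRI unreadable" concrete, here is a sketch of what percent-encoding does to non-ASCII path characters. It uses the standard library's `quote` purely as a stand-in for the auto-encoding behavior, and the path is a made-up example:

```python
from urllib.parse import quote, unquote

# Hypothetical IRI path containing non-ASCII characters.
iri_path = "/café/日本語"

# Percent-encode the UTF-8 bytes of every non-ASCII character.
uri_path = quote(iri_path, safe="/")

# The mapping can be reversed, but the encoded form is no longer
# readable as an IRI.
round_trip = unquote(uri_path)

print(uri_path)
print(round_trip == iri_path)
```

The result is a valid URI path, but a library that always produces it cannot be used to carry an IRI around in its original form.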

None of this even gets into scheme-specific parsing, or media-type-specific fragment parsing.

To the best of my knowledge, the only code that handles everything correctly is the abnf package, which includes ABNF support for RFC 3986 and 3987, but produces a huge parse tree as a result and is probably significantly slower (although I haven't measured it). Its error reporting is also incomprehensible.

To add further confusion:

  • rdflib wants rdflib.URIRef instances (which are actually IRIs and not URI-references, but are called URIRef for historical reasons), which subclass from str.
  • jschon has its own jschon.URI class which is a wrapper around rfc3986.URIReference

So at minimum there are two URI-ish classes floating around the code, one of which wraps a third, none of which are in the standard library, and all of which have different interfaces.

┻━┻︵ \(`Д´)/ ︵ ┻━┻

Clearly the only answer is to write another library. More seriously, wrap this dumpster fire in a facade so that we can be less fiddly but still change implementations until finding the right correctness/convenience/performance balance. We are unlikely to know what constitutes "reasonable" performance until the system is near-complete and can process enormous specs like the GitHub API. A facade will make it feasible to quickly measure different alternative implementations.
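
A minimal sketch of what such a facade could look like, using only the standard library. All names here (`URIFacade`, `strict_check`, `no_check`, the `backends` registry) are hypothetical and not taken from any existing package. The "strict" backend is only a character-level check against the RFC 3986 repertoire, not a full grammar validation. The point is that the backend can be swapped without changing callers:

```python
import re
import string
from typing import Callable, Dict

# Hypothetical backend type: a callable that raises ValueError on bad input.
Backend = Callable[[str], None]

# unreserved + gen-delims + sub-delims + "%" from RFC 3986.
_ALLOWED_CHARS = frozenset(
    string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
)
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def strict_check(value: str) -> None:
    """Report disallowed characters instead of re-encoding them."""
    for index, char in enumerate(value):
        if char not in _ALLOWED_CHARS:
            raise ValueError(
                f"character {char!r} at index {index} is not allowed in a URI"
            )
    bad = _BAD_PERCENT.search(value)
    if bad:
        raise ValueError(f"malformed percent-encoding at index {bad.start()}")


def no_check(value: str) -> None:
    """Placeholder backend that accepts anything (validation deferred)."""


class URIFacade:
    """Holds a string and delegates validation to a swappable backend."""

    backends: Dict[str, Backend] = {"strict": strict_check, "none": no_check}

    def __init__(self, value: str, backend: str = "strict") -> None:
        self.backends[backend](value)
        self._value = value

    def __str__(self) -> str:
        return self._value
```

Registering a wrapper around rfc3986, rfc3987, or abnf under another key in `backends` would then allow the same calling code to be timed against each implementation.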

Requirements:

  • easily work with URIs (or IRIs) as distinct from URI-references (or IRI-references)
  • easily work with query strings, specifically application/x-www-form-urlencoded-style strings
  • easy integration with templated URLs as defined by OAS (which are essentially a subset of RFC 6570)
  • easy manipulation of JSON Pointer fragments (probably using the jschon.JSONPointer class, although at least three other JSON Pointer packages exist as alternatives; we also need Relative JSON Pointer support, which is rarer but well supported by jschon.RelativeJSONPointer, to which I contributed fixes in jschon 0.10.0)
  • possibly allow for faster manipulation while deferring validation to the last step
  • reliable validation, at least of the full generic syntax, without silencing errors through aggressive re-encoding
  • excellent error reporting UX for validation errors
  • support for important scheme-specific syntax (http(s), file, urn especially urn:uuid, and possibly tag and mailto - this can be deferred and added as the use cases arise, but it needs to be planned)
  • support for important media-type specific fragment syntax (JSON Pointer for OAS media types, application/schema+json, and application/yaml; JSON Schema plain name fragments, possibly other things to support External Documentation links?)
  • reasonable performance on large specifications (e.g. GitHub, which is >200K lines in YAML)
  • smooth interoperability with rdflib.URIRef and jschon.URI, as well as plain strings
handrews self-assigned this Apr 30, 2023
@handrews

handrews commented May 2, 2023

Looking more closely at rfc3987, the function I tried defaults to non-validating parsing, but it's trivial to enable validation, which seems to be appropriately strict. It seems like the best candidate for validation, but it does not offer any sort of convenience class packaging up relevant functionality.

@handrews

New classes wrapping rfc3987 with a more rfc3986-ish interface were included in PR #16.

handrews added the "uri iri jsonptr" label (Modules for working with URIs/IRIs/JSONPtrs and templates of those things) May 29, 2023