Support any table nesting level in SQL queries (i.e. SELECT * FROM one.two.three.four.five)
#13822
Comments
I'm happy to work on this implementation, assuming I get consensus that this is a good idea.
Ordinary (single level) schema name can contain a dot.
IIUC, from internal API perspective this is exactly what's going to happen. I'm not sure about dev-ex consequences of doing so though. Would every catalog be responsible for identifier re-parsing (a task solved by SQL parser today)? I understand all you want is for SQL parser/analyzer to desugar
The dotted syntax is not only for resolving table names; it's also used for column resolution:
SELECT unqualified_column, one.two.three.four.five.qualified_column, one.two.three.four.five.another_column.a.nested.field
FROM one.two.three.four.five
Yes, that is my main goal. Basically sugar around what we can already do today with
I don't think the catalog would need to do anything special, as long as it follows the same rule for registering the schemas available. But it's possible I'm missing something.
I see, that would make parsing more complicated.
Not sure if they can be distinguished.
Yes, this depends on the position: whether it's in the relation position or the expression position. In the relation position, column resolution can be ignored.
Correct, my proposal was basically to establish a convention for distinguishing them. Currently it returns an error, so adding this as a convention seemed relatively harmless. However, I think column resolution shoots this down as a general-purpose feature. I still want to be able to customize the
I'll leave this open for a day to see if anyone has a better idea and/or has an idea on how the column parsing could be handled and/or how to customize the
This convention seems to be consistent with current DataFusion behavior. For example, "one.two.three" is not parsed as:
Considering nested fields, we need to be careful with column parsing, such as establishing reasonable search terms and priorities in generate_schema_search_terms.
DataFusion CLI v43.0.0
> create table a(b struct<c int>);
0 row(s) fetched.
# c is a field name of struct
> select a.b.c from a;
+--------+
| a.b[c] |
+--------+
+--------+
0 row(s) fetched.
Elapsed 0.017 seconds.
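To make the ambiguity discussed above concrete, here is a minimal, standalone Rust sketch that enumerates the possible readings of a dotted identifier in expression position. The `Candidate` type and `candidates` function are hypothetical names introduced for illustration; this is not DataFusion's `generate_schema_search_terms`, and the "longest qualifier first" ordering is only an assumed priority.

```rust
/// One possible reading of a dotted identifier in expression position.
/// Hypothetical type for illustration; not part of DataFusion.
#[derive(Debug, PartialEq)]
struct Candidate {
    /// Relation path (catalog/schema/table parts); may be empty.
    qualifier: Vec<String>,
    /// The identifier treated as the column name.
    column: String,
    /// Struct field path below the column; may be empty.
    nested: Vec<String>,
}

/// Enumerate every way to split `idents` into qualifier / column / nested
/// fields, trying the longest qualifier first (an assumed priority).
fn candidates(idents: &[&str]) -> Vec<Candidate> {
    (0..idents.len())
        .rev()
        .map(|q| Candidate {
            qualifier: idents[..q].iter().map(|s| s.to_string()).collect(),
            column: idents[q].to_string(),
            nested: idents[q + 1..].iter().map(|s| s.to_string()).collect(),
        })
        .collect()
}

fn main() {
    // `a.b.c` has three readings; the CLI session above corresponds to the
    // one with qualifier [a], column b, nested field [c].
    for c in candidates(&["a", "b", "c"]) {
        println!("{:?}", c);
    }
}
```

With arbitrarily deep table names, the number of readings grows with the number of identifiers, which is why the relation position (where no column or field split is needed) is the easier case.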
Is your feature request related to a problem or challenge?
Currently DataFusion only supports queries with 3 levels of nesting, i.e. `SELECT * FROM catalog.schema.table`. Many catalog providers (e.g. Iceberg) allow an arbitrary level of nesting, for example:
.
├── benchmarks
│   └── tpcds
│       ├── foo
│       └── bar
├── spice
│   ├── tpch
│   │   ├── orders
│   │   └── customers
│   ├── info
│   └── extra
│       └── tpch_orders_metadata
└── one
    └── two
        └── three
            └── four
                └── five
Attempting to represent this in DataFusion is tricky and several of the alternatives I considered (see below) are poor UX.
Describe the solution you'd like
I would like to be able to write a catalog provider that allows users to select any of the tables in the Iceberg catalog with a natural dot-separated syntax.
`sqlparser-rs` already supports parsing this, creating an `ObjectName`, which is a `Vec<Ident>`.
There is a function in DataFusion, `idents_to_table_reference`, that is responsible for transforming the `Vec<Ident>` into a `TableReference`.
Its current implementation looks like:
Instead of erroring on >3 idents, I propose that we concatenate all of the "middle" namespaces into the `schema` part:
That would allow us to do `SELECT * FROM one.two.three.four.five` and have it be converted into a DF TableReference with:
catalog: one
schema: two.three.four
table: five
And any catalog provider implementations can know to create "schemas" to match this format to support arbitrarily nested namespaces.
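As a rough illustration of the proposed convention, the following standalone Rust sketch collapses a list of identifiers into three parts. `FullTableRef` and `collapse_idents` are hypothetical names introduced here; this is not DataFusion's `idents_to_table_reference`, and it deliberately ignores quoting, identifier normalization, and names with fewer than three parts.

```rust
/// Simplified stand-in for a fully qualified table reference.
/// Hypothetical type for illustration; not DataFusion's `TableReference`.
#[derive(Debug, PartialEq)]
struct FullTableRef {
    catalog: String,
    schema: String,
    table: String,
}

/// Collapse three or more identifiers into catalog / schema / table.
/// The first identifier is the catalog, the last is the table, and every
/// "middle" identifier is joined with '.' to form the schema.
/// Returns `None` for fewer than three identifiers (out of scope here).
fn collapse_idents(idents: &[&str]) -> Option<FullTableRef> {
    if idents.len() < 3 {
        return None;
    }
    let (catalog, rest) = idents.split_first()?;
    let (table, middle) = rest.split_last()?;
    Some(FullTableRef {
        catalog: catalog.to_string(),
        schema: middle.join("."),
        table: table.to_string(),
    })
}

fn main() {
    let r = collapse_idents(&["one", "two", "three", "four", "five"]).unwrap();
    // catalog = one, schema = two.three.four, table = five
    println!("{:?}", r);
}
```

Under this sketch a plain three-part name such as `catalog.schema.table` maps to the same parts as before, so only names with more than three identifiers behave differently.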
Describe alternatives you've considered
In my implementation, when integrating with a catalog provider, I've considered concatenating the middle namespaces with a `_` in the schema part so that `SELECT * FROM one.two.three.four.five` would become `SELECT * FROM one.two_three_four.five`.
I've also considered concatenating the middle namespaces with `.` and not changing DataFusion, but that would require doing `SELECT * FROM one."two.three.four".five`, which is also not an ideal UX.
Another alternative, if we didn't want this to be the default, is to allow users to customize the behavior of the `object_name_to_table_reference` function somehow.
Additional context
No response