Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support any table nesting level in SQL queries (i.e SELECT * FROM one.two.three.four.five) #13822

Open
phillipleblanc opened this issue Dec 18, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@phillipleblanc
Copy link
Contributor

Is your feature request related to a problem or challenge?

Currently DataFusion only support queries with 3-levels of nesting, i.e. SELECT * FROM catalog.schema.table. Many catalog providers (i.e. Iceberg) allow any arbitrary level of nesting, i.e.:

.
├── benchmarks
│   └── tpcds
│       ├── foo
│       └── bar
├── spice
│   ├── tpch
│   │   ├── orders
│   │   └── customers
│   ├── info
│   └── extra
│       └── tpch_orders_metadata
└── one
    └── two
        └── three
            └── four
                └── five

Attempting to represent this in DataFusion is tricky and several of the alternatives I considered (see below) are poor UX.

Describe the solution you'd like

I would like to be able to write a catalog provider that allows users to select any of the tables in the Iceberg catalog with a natural dot separated syntax. sqlparser-rs already supports parsing this, creating an ObjectName which is a Vec<Ident>.

There is a function in DataFusion idents_to_table_reference that is responsible for transforming the Vec<Ident> into a TableReference.

Its current implement looks like:

pub(crate) fn idents_to_table_reference(
    idents: Vec<Ident>,
    enable_normalization: bool,
) -> Result<TableReference> {
    let mut taker = IdentTaker::new(idents, enable_normalization);

    match taker.len() {
        1 => {
            let table = taker.take();
            Ok(TableReference::bare(table))
        }
        2 => {
            let table = taker.take();
            let schema = taker.take();
            Ok(TableReference::partial(schema, table))
        }
        3 => {
            let table = taker.take();
            let schema = taker.take();
            let catalog = taker.take();
            Ok(TableReference::full(catalog, schema, table))
        }
        _ => plan_err!(
            "Unsupported compound identifier '{}'. Expected 1, 2 or 3 parts, got {}",
            taker,
            taker.len()
        ),
    }
}

Instead of erroring on >3 idents, I propose that we concatenate all of the "middle" namespaces into the schema part:

        // This won't compile as is, but demonstrates the idea
        _ => {
            // Concatenate all of the middle identifiers with a dot and put into the "schema" field
            let table = taker.take(enable_normalization);
            let schema = idents
                .iter()
                .skip(1)
                .take(idents.len() - 2)
                .map(|ident| {
                    IdentNormalizer::new(enable_normalization).normalize(ident.clone())
                })
                .collect::<Vec<_>>()
                .join(".");
            let catalog = taker.take(enable_normalization);
            Ok(TableReference::full(catalog, schema, table))
        }

That would allow us to do SELECT * FROM one.two.three.four.five and have it be converted into a DF TableReference with:
catalog: one
schema: two.three.four
table: five

And any catalog provider implementations can know to create "schemas" to match this format to support arbitrarily nested namespaces.

Describe alternatives you've considered

I've considered in my implementation when integrating with a catalog provider to concatenate the middle namespaces with a _ in the schema part so that SELECT * FROM one.two.three.four.five would become SELECT * FROM one.two_three_four.five.

I've also considered concatenating the middle namspaces with . and not changing DataFusion, but that would require doing SELECT * FROM one."two.three.four".five which is also not an ideal UX.

Another alternative if we didn't want this to be the default, is to allow users to customize the behavior of the object_name_to_table_reference function somehow.

Additional context

No response

@phillipleblanc phillipleblanc added the enhancement New feature or request label Dec 18, 2024
@phillipleblanc
Copy link
Contributor Author

I'm happy to work on this implementation, assuming I get consensus that this is a good idea.

@findepi
Copy link
Member

findepi commented Dec 18, 2024

Ordinary (single level) schema name can contain a dot.

I've also considered concatenating the middle namspaces with . and not changing DataFusion, but that would require doing SELECT * FROM one."two.three.four".five which is also not an ideal UX.

IIUC, from internal API perspective this is exactly what's going to happen. I'm not sure about dev-ex consequences of doing so though. Would every catalog be responsible for identifier re-parsing (a task solved by SQL parser today)?

I understand all you want is for SQL parser/analyzer to desugar one.two.three.four.five into one."two.three.four".five` so that query author doesn't need to do anything.

There is a function in DataFusion idents_to_table_reference that is responsible for transforming the Vec into a TableReference.

the dotted syntax is not only for resolving table names, it's also used for column resolution:

SELECT unqualified_column, one.two.three.four.five.qualified_column, one.two.three.four.five.another_column.a.nested.field
FROM one.two.three.four.five

@phillipleblanc
Copy link
Contributor Author

phillipleblanc commented Dec 18, 2024

I understand all you want is for SQL parser/analyzer to desugar one.two.three.four.five into one."two.three.four".five` so that query author doesn't need to do anything.

Yes, that is my main goal. Basically sugar around what we can already do today with one."two.three.four".five.

Would every catalog be responsible for identifier re-parsing (a task solved by SQL parser today)?

I don't think the catalog would need to do anything special, as long as it follows the same rule for registering the schemas available. But its possible I'm missing something.

the dotted syntax is not only for resolving table names, it's also used for column resolution

I see, that would make parsing more complicated.

@jonahgao
Copy link
Member

SELECT * FROM one.two.three.four.five can also be resolved as:
catalog: None
schema: one.two.three.four
table: five

Not sure if they can be distinguished.

@jonahgao
Copy link
Member

the dotted syntax is not only for resolving table names, it's also used for column resolution:

Yes, this depends on the position: whether it's in the relation position or the expression position. In the relation position, column resolution can be ignored.

@phillipleblanc
Copy link
Contributor Author

SELECT * FROM one.two.three.four.five can also be resolved as: catalog: None schema: one.two.three.four table: five

Not sure if they can be distinguished.

Correct, my proposal was to basically establish a convention on how to distinguish them. Currently it returns an error, so adding this as a convention seemed relatively harmless.

However, I think the column resolution shoots this down as a general purpose feature. I still want to be able to customize the object name -> table reference resolution to implement this, but I might need to just keep that in my fork.

@phillipleblanc
Copy link
Contributor Author

I'll leave this open for a day to see if anyone has a better idea and/or has an idea on how the column parsing could be handled and/or how to customize the object name -> table reference in a nice way. If nothing, I'll close it and implement it in my fork instead.

@jonahgao
Copy link
Member

Correct, my proposal was to basically establish a convention on how to distinguish them. Currently it returns an error, so adding this as a convention seemed relatively harmless.

This convention seems to be consistent with the current DataFusion, for example, "one.two.three" is not parsed as:
catalog: None
schema: one.two
table: three

Considering nested fields, we need to be careful with column parsing, such as establishing a reasonable search terms and priority in generate_schema_search_terms.

DataFusion CLI v43.0.0
> create table a(b struct<c int>);
0 row(s) fetched.

# c is a field name of struct
> select a.b.c from a;
+--------+
| a.b[c] |
+--------+
+--------+
0 row(s) fetched.
Elapsed 0.017 seconds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants