[SPARK-50618] Make DataFrameReader and DataStreamReader leverage the analyzer more #49238

Open: brkyvz wants to merge 5 commits into master

brkyvz (Contributor) commented Dec 19, 2024

What changes were proposed in this pull request?

Introduces two logical nodes:

  • UnresolvedDataSource
  • UnresolvedJDBCRelation

DataFrameReader and DataStreamReader now create these unresolved nodes and call the analyzer to resolve them, instead of building resolved relations themselves. Resolution happens in the ResolveDataSource rule, which now holds the logic that previously lived in DataFrameReader and DataStreamReader.
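
To illustrate the pattern described above, here is a minimal, self-contained Python sketch of a reader that emits a placeholder node and an analyzer rule that resolves it. It is a toy model, not Spark code: `ToyReader`, `ResolvedRelation`, `resolve_data_source`, and `analyze` are hypothetical names, and the fields on `UnresolvedDataSource` are assumptions rather than the node's actual definition in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnresolvedDataSource:
    """Placeholder built by the reader: records what to load, resolves nothing."""
    source: str          # format name, e.g. "csv" (assumed field)
    paths: tuple = ()    # input paths (assumed field)
    options: tuple = ()  # sorted (key, value) pairs (assumed field)


@dataclass(frozen=True)
class ResolvedRelation:
    """Hypothetical stand-in for the relation the analyzer produces."""
    source: str
    paths: tuple
    options: tuple


def resolve_data_source(plan):
    """Toy counterpart of a ResolveDataSource-style rule."""
    if isinstance(plan, UnresolvedDataSource):
        return ResolvedRelation(plan.source, plan.paths, plan.options)
    return plan  # anything else is left untouched


def analyze(plan, rules):
    """Apply each rule once, in order (a real analyzer runs to a fixed point)."""
    for rule in rules:
        plan = rule(plan)
    return plan


class ToyReader:
    """Hypothetical reader whose load() only builds the unresolved node."""

    def __init__(self):
        self._source = "parquet"
        self._options = {}

    def format(self, source):
        self._source = source
        return self

    def option(self, key, value):
        self._options[key] = value
        return self

    def load(self, *paths):
        opts = tuple(sorted(self._options.items()))
        return UnresolvedDataSource(self._source, tuple(paths), opts)


unresolved = ToyReader().format("csv").option("header", "true").load("/tmp/example.csv")
resolved = analyze(unresolved, [resolve_data_source])
```

The point of the sketch is the division of labor: the reader only describes the source, and everything that decides how it resolves lives in a rule the analyzer runs.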

Some logic for parsing text-based formats from an existing Dataset still remains in the readers. I will refactor this in a subsequent PR.

Why are the changes needed?

DataFrameReader and DataStreamReader typically create already-analyzed relations in their respective .load() methods.

As a result, the set of Catalyst rules applied to a query plan depends on the API used, for example SQL versus Python or SQL versus Scala.

The goal of this Jira is to refactor the DataFrameReader and DataStreamReader classes so that they create unresolved plans, which are then analyzed as part of Catalyst.
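
The inconsistency can be shown with a small Python toy model. It assumes a hypothetical analyzer with two rules, one of which (`normalize_format`) only matches unresolved nodes; neither the rule nor the dict-based plan representation comes from Spark. A plan that arrives already resolved skips that rule, while a plan that arrives unresolved goes through it.

```python
def normalize_format(plan):
    """Hypothetical rule that only fires on unresolved nodes."""
    if plan["node"] == "unresolved":
        return {**plan, "format": plan["format"].lower()}
    return plan


def resolve(plan):
    """Hypothetical rule that turns an unresolved node into a relation."""
    if plan["node"] == "unresolved":
        return {**plan, "node": "relation"}
    return plan


RULES = [normalize_format, resolve]


def analyze(plan):
    """Apply each rule once, in order."""
    for rule in RULES:
        plan = rule(plan)
    return plan


def eager_load(fmt):
    """Reader resolves the relation itself, so the plan arrives pre-resolved."""
    return {"node": "relation", "format": fmt}


def lazy_load(fmt):
    """Reader emits an unresolved node and leaves resolution to the analyzer."""
    return {"node": "unresolved", "format": fmt}


eager_plan = analyze(eager_load("JSON"))  # normalize_format never matches
lazy_plan = analyze(lazy_load("JSON"))    # both rules match
```

In this model the two plans differ only because one entry point bypassed a rule, which is the kind of API-dependent behavior the refactoring aims to remove.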

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests. New tests will be added.

Was this patch authored or co-authored using generative AI tooling?

No
