Writing to multiple GeoParquet files will not output _metadata #1296
Comments
We have to implement an output committer for GeoParquet to merge the per-file footers into the _metadata summary file.
@Kontinuation Thanks. I think it would be totally fine to leave off the geo metadata. Since GeoParquet doesn't define these single _metadata summary files, I don't think it would be any issue at all. Of course, in the future the spec may standardize on a definition, but for now the summary file will only be used for row group filtering, and the geo metadata is not needed.
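For context, once a _metadata summary exists, a reader can use it for row group filtering along these lines; a minimal sketch with pyarrow, where the path and column name are illustrative:

```python
import pyarrow.dataset as ds

# Assemble the dataset from the single _metadata footer instead of
# scanning every data file; a filter can then prune row groups using
# the column statistics stored in that footer.
dataset = ds.parquet_dataset("/tmp/points_geoparquet/_metadata")
table = dataset.to_table(filter=ds.field("population") > 1_000_000)
```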
Expected behavior
When writing out a GeoParquet dataframe that results in multiple files, the _metadata summary file is not created even when Parquet is configured to write it.
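The reproduction snippet was not preserved in this copy of the issue; a minimal sketch of the kind of write involved, assuming a Sedona-enabled Spark session `spark` and a dataframe `df` with a geometry column (the path, record limit, and property value are illustrative):

```python
# Parquet's summary files are controlled by the Hadoop property
# parquet.summary.metadata.level (NONE / COMMON_ONLY / ALL).
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.summary.metadata.level", "ALL")

# Cap records per file so the write produces more than one file.
df.write.format("geoparquet") \
    .option("maxRecordsPerFile", 1000) \
    .mode("overwrite") \
    .save("/tmp/points_geoparquet")
```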
If the number of records exceeds maxRecordsPerFile so that more than one file is written, the _metadata and _common_metadata files will not be written. When there are few enough records that only one file is written, then _metadata and _common_metadata will be created. However, if I change the above write to use parquet instead of geoparquet:
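A sketch of the plain parquet variant (again, not the original snippet):

```python
df.write.format("parquet") \
    .option("maxRecordsPerFile", 1000) \
    .mode("overwrite") \
    .save("/tmp/points_parquet")
```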
Then _metadata and _common_metadata will be written even with multiple files. Is there a setting or other way to enable writing the common metadata files?

I'd like to write these files because readers such as pyarrow can then open the full dataset without scanning every file's footer, which can be time-consuming for large datasets.
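Until Sedona writes these summary files itself, one possible workaround is to build _metadata after the fact with pyarrow by merging the per-file footers; a sketch, assuming the illustrative output directory used above:

```python
import glob
import os

import pyarrow.parquet as pq

out_dir = "/tmp/points_geoparquet"  # illustrative output directory
paths = sorted(glob.glob(os.path.join(out_dir, "*.parquet")))

# Read each file's footer, record its path relative to the dataset
# root, and merge all row groups into one FileMetaData object.
# Note: the merged footer keeps the first file's key-value metadata
# (including any geo entry).
merged = None
for path in paths:
    md = pq.read_metadata(path)
    md.set_file_path(os.path.relpath(path, out_dir))
    if merged is None:
        merged = md
    else:
        merged.append_row_groups(md)

merged.write_metadata_file(os.path.join(out_dir, "_metadata"))

# _common_metadata carries only the schema, with no row groups.
pq.write_metadata(pq.read_schema(paths[0]),
                  os.path.join(out_dir, "_common_metadata"))
```

With the summary file in place, pyarrow can assemble the dataset from that single footer rather than opening every file, as in the filtering sketch above.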
Settings
Sedona version = 3.4.1
Apache Spark version = 3.4.1
Environment = Databricks