WIP: intermediary file format specification #108

Open
3 tasks
gessulat opened this issue Sep 21, 2023 · 0 comments


gessulat commented Sep 21, 2023

@sambenfredj's pull request introduces streaming at several places in the workflow, but the intermediary file formats it relies on are not yet specified or documented. In addition, switching to a binary format such as partitioned pyarrow datasets would speed up IO.

Schemas will be defined here after @wfondrie's switch to polars.

Tasks

  • document where we need intermediary files
  • document how the files relate to input files, to each other, and to output files (e.g. how should they be joined?)
  • specify columns, their data types, and potential indexes on columns
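
As a starting point for the last two tasks, a specification could be written down in a library-independent form before the polars schemas exist. The sketch below is only one possible shape: the `Column` class, the `PSM_SCORES` spec, and every column name and type in it are hypothetical placeholders, not a proposal for the actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: type
    is_key: bool = False  # True if the column is part of the join key


# Hypothetical spec for one intermediary file; names and types are placeholders.
PSM_SCORES = (
    Column("input_file", str, is_key=True),
    Column("psm_id", int, is_key=True),
    Column("score", float),
)


def join_keys(spec):
    """Names of the columns that files sharing this spec are joined on."""
    return [col.name for col in spec if col.is_key]


def validate(rows, spec):
    """Return a list of problems; an empty list means the rows match the spec."""
    expected = {col.name: col.dtype for col in spec}
    problems = []
    for i, row in enumerate(rows):
        if set(row) != set(expected):
            problems.append(
                f"row {i}: expected columns {sorted(expected)}, got {sorted(row)}"
            )
            continue
        for name, dtype in expected.items():
            if not isinstance(row[name], dtype):
                problems.append(f"row {i}: {name!r} should be {dtype.__name__}")
    return problems
```

Marking key columns in the spec itself keeps the answer to "how should they be joined?" next to the column definitions, so the same description can later be translated into a polars or pyarrow schema.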