Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect file order in DocumentDataset.read_ #487

Merged

Conversation

praateekmahajan
Copy link
Collaborator

@praateekmahajan praateekmahajan commented Jan 17, 2025

Description

By removing the sorted(..) in read_data_files_per_partition we can be sure that the input order provided by user is respected. And in case a directory is provided as an input then we are ls-ing and then sorting before reading.

Solves #486

Current Code Flow

Current flow is

  1. DocumentDataset.read_json/parquet calls _read_json_or_parquet
  2. _read_json_or_parquet which calls read_data
    • _read_json_or_parquet accepts input_files that could be list[files], list[dir] or file or dir
    • if it's a dir or list[dir] it calls get_all_files_paths_under which has an implicit sort
  3. read_data accepts list[files] and calls read_data_blocksize or read_data_files_per_partition`
    1. read_data_files_per_partition performs a sort
    2. read_data_blocksize doesn't perform a sort

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good with me, was actually wondering why that was there earlier today lol.

@praateekmahajan praateekmahajan merged commit 7a49ebb into NVIDIA:main Jan 17, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants