Make add_filename str/bool #465

Open: wants to merge 3 commits into base: main

Conversation

praateekmahajan
Collaborator

Description

Inspired by dask.dataframe.read_json, whose include_path_column parameter is documented as:

include_path_column : bool or str, optional

Include a column with the file path where each row in the dataframe originated.
If True, a new column is added to the dataframe called path. If str, sets new column name. Default is False.

We also had a bug in our code: we passed include_path_column=True and then renamed the resulting column from path to file_name. As a result, any input that already had a path column raised an error.
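The bool-or-str behavior described above can be sketched roughly as follows. This is only an illustration, not the PR's implementation: the body of `_resolve_filename_col` is not shown in this conversation, so the helper name `resolve_filename_col_sketch` is hypothetical, and the default column name `file_name` is an assumption taken from the rename mentioned in the description.

```python
from typing import Union

# Assumed default column name, based on the path -> file_name rename
# mentioned in the description. Not confirmed by the diff shown here.
DEFAULT_FILENAME_COL = "file_name"


def resolve_filename_col_sketch(add_filename: Union[bool, str]) -> Union[str, bool]:
    """Hypothetical helper: map a bool-or-str flag to a column name or False."""
    # Handle bool explicitly so True/False never fall through to other branches.
    if isinstance(add_filename, bool):
        return DEFAULT_FILENAME_COL if add_filename else False
    # A string is taken as the caller's chosen column name.
    if isinstance(add_filename, str):
        return add_filename
    raise TypeError(
        f"add_filename must be bool or str, got {type(add_filename).__name__}"
    )


print(resolve_filename_col_sketch(True))      # default column name
print(resolve_filename_col_sketch("source"))  # caller-chosen column name
print(resolve_filename_col_sketch(False))     # no filename column
```

Under these assumptions, the resolved value could be passed straight through as include_path_column, so the reader writes the final column name itself and no rename from path is needed.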

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

praateekmahajan added the gpuci (Run GPU CI/CD on PR) label on Jan 3, 2025
@@ -273,6 +273,18 @@ def _set_torch_to_use_rmm():
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)


def _resolve_filename_col(filename: Union[bool, str]) -> Union[str, bool]:
Collaborator

I like this a lot, thanks! Was wondering if other files from #449 (like doc_dataset.py) need to be updated as well?
