Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add from_parquet to dataloader #483

Conversation

JamesHWade
Copy link
Contributor

A parquet file loader would be a convenient addition to the dataloader class. I particularly like that parquet files preserve types better than a csv file.

The primary change is the addition of the from_parquet() function:

def from_parquet(self, file_path: str, fields: list[str] = None, input_keys: tuple[str] = ()) -> list[dspy.Example]:
        dataset = load_dataset("parquet", data_files=file_path)["train"]

        if not fields:
            fields = list(dataset.features)

        return [dspy.Example({field: row[field] for field in fields}).with_inputs(input_keys) for row in dataset]

The rest of the changes were caused by the ruff precommit hook.

@krypticmouse
Copy link
Collaborator

krypticmouse commented Mar 2, 2024

There might be a conflict to be resolved but other than that this looks good!

@okhat
Copy link
Collaborator

okhat commented Mar 2, 2024

happy to merge once no conflict! thank you @JamesHWade !

@JamesHWade JamesHWade force-pushed the feat/add-from-parquet-to-dataloader branch from 25b7e5f to ad77d35 Compare March 3, 2024 02:56
@JamesHWade
Copy link
Contributor Author

should be good to go. thanks!

@krypticmouse
Copy link
Collaborator

@JamesHWade There seems to be some conflicts still. Aside from that, any reason to remove typing types in favor of python type?

I don't oppose the idea as such, just wanted to know the reasoning behind it. Thanks for contributing!

@JamesHWade
Copy link
Contributor Author

I'll fix. No reason for the type change. They were added by Ruff automatically with the precommit hook, but happy to revert to python types.

@JamesHWade JamesHWade force-pushed the feat/add-from-parquet-to-dataloader branch from ad77d35 to 910606a Compare March 4, 2024 14:25
@JamesHWade
Copy link
Contributor Author

Should be merge-ready now. I just skipped pre-commit this time.

@krypticmouse
Copy link
Collaborator

No worries, thanks for informing! I actually have no issues with python types either!

Some tests seem to be failing on this though. Will take a look soon!

@JamesHWade
Copy link
Contributor Author

I think it was a temporary network error. It failed to checkout the repo for testing.

@JamesHWade
Copy link
Contributor Author

Very odd. Here is what I get if I try to merge into main on my fork:

image

@JamesHWade
Copy link
Contributor Author

Sorry for the trouble with this PR. For reasons unclear to me, the checkout@v4 action is look for my branch within this repo. I'm not sure how to fix it other than to abandon this PR and create a new one. @krypticmouse, LMK if that sounds good to you and I'll create that quickly.

@arnavsinghvi11
Copy link
Collaborator

@JamesHWade just following up on this PR. feel free to close this and open the new one without conflicts! Thanks

@JamesHWade
Copy link
Contributor Author

Closing to resubmit as separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants