1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py #78

Merged: 13 commits into dev from chjun-0127 on Jan 28, 2025

Conversation

@chjuncn (Collaborator) commented on Jan 27, 2025

Support this workflow:

```python
import pandas as pd
import palimpzest as pz

df = pd.read_csv("testdata/enron-tiny.csv")
qr2 = pz.Dataset(df)
qr2 = qr2.add_columns({
    "sender": ("The email address of the sender", "string"),
    "subject": ("The subject of the email", "string"),
    "date": ("The date the email was sent", "string"),
})

qr3 = qr2.filter("It is an email").filter("It has Vacation in the subject")

output_df = qr3.run().to_df()
```

Currently we have to do:

```python
records, _ = qr3.run()
outputDf = DataRecord.to_df(records)
```

I'll try to make qr3.run().to_df() work in another PR.
Update run() to return a DataRecordCollection, so that it will be easier for us to support more features on run()'s output.

We support to_df() in this change.

I'll send out follow-up commits to update the other demos.
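
For context, a minimal sketch of what a result wrapper along these lines could look like; the class shape and the record-level to_dict() helper are illustrative assumptions, not the actual palimpzest implementation:

```python
# Illustrative sketch only: the class shape and helper names are assumptions,
# not palimpzest's actual implementation.
from dataclasses import dataclass, field
from typing import Any, Iterator

import pandas as pd


@dataclass
class DataRecordCollection:
    """Wraps the records produced by run() so more output helpers can be added later."""

    records: list[Any] = field(default_factory=list)

    def to_df(self, project_cols: list[str] | None = None) -> pd.DataFrame:
        # Assumes each record can expose its fields as a dict (e.g. via a to_dict() helper).
        rows = [record.to_dict() for record in self.records]
        df = pd.DataFrame(rows)
        return df[project_cols] if project_cols else df

    def __iter__(self) -> Iterator[Any]:
        # Lets callers write `for record in qr3.run():` over batch or streaming output.
        return iter(self.records)
```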
@chjuncn changed the base branch from main to dev on January 27, 2025 06:52
The code will run into an issue if we don't return any stats for this function in the following check:

```
                            max_quality_record_set = self.pick_highest_quality_output(all_source_record_sets)
                            if (
                                not prev_logical_op_is_filter
                                or (
                                    prev_logical_op_is_filter
                                    and max_quality_record_set.record_op_stats[0].passed_operator
                                )
```
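
Presumably the failure is the `[0]` index on an empty record_op_stats list; a defensive version of that check could look like the sketch below (illustrative only, not necessarily the exact change in this PR):

```python
# Illustrative guard, not necessarily the exact fix: avoid indexing record_op_stats
# when the chosen record set carries no stats at all.
op_stats = max_quality_record_set.record_op_stats
if not prev_logical_op_is_filter or (op_stats and op_stats[0].passed_operator):
    ...  # keep the record set, as in the original branch
```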
Update to record.to_df(records: list[DataRecord], project_cols: list[str] | None = None), which is consistent with the other functions in this class.
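
For illustration, a usage sketch of the updated signature; here records is assumed to be a list[DataRecord], and the column names come from the add_columns example above:

```python
# Hypothetical usage of the updated classmethod; column names are from the demo above.
full_df = DataRecord.to_df(records)                                     # every field
slim_df = DataRecord.to_df(records, project_cols=["sender", "subject", "date"])
```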
@chjuncn changed the title from "Support add_columns in Dataset. Support demo in df-newinterface.py" to "1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py" on Jan 27, 2025
People used to get the plan from the streaming processor's execute(), which is not good practice.

I updated plan_str to plan_stats, and callers now need to get the physical plan from the processor.

Consider better ways to provide the executed physical plan to DataRecordCollection, possibly from stats.
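
Roughly, the intended shift looks like the sketch below; the attribute and method names here are assumptions, not the settled API:

```python
# Sketch only; attribute/method names are assumptions.
#
# Before (discouraged): callers unpacked the plan straight out of the streaming
# processor's execute() return value.
#
# After: the run() output carries stats, and the executed physical plan is
# obtained from the processor itself.
result = qr3.run()                    # a DataRecordCollection
stats = result.plan_stats             # execution statistics attached to the output
plan = processor.get_physical_plan()  # hypothetical accessor on the processor
```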
@mdr223 linked an issue on Jan 27, 2025 that may be closed by this pull request
@mdr223 (Collaborator) left a comment:
I've left a few comments:

  • A few need to be addressed before merging (e.g. the comment re: ONE_TO_MANY cardinality in add_columns())
  • Some others are more style suggestions (or questions) which you can ignore if you prefer

I'm pre-emptively approving this PR so that once you make the changes in your timezone you can merge the PR without needing to wait for me to wake up :)

Review comments were left on:

  • src/palimpzest/core/lib/schemas.py
  • src/palimpzest/sets.py (outdated)
  • testdata/convert_enron_to_csv.py (outdated)
  • testdata/enron-tiny.csv (outdated)
  • demos/bdf-suite.py
  • demos/bdf-usecase3.py
  • demos/fever-demo.py
1. add cardinality param in add_columns
2. remove extra testdata files
3. add __iter__ in DataRecordCollection to help iterate over streaming output (usage sketch below)
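
Putting those pieces together, a hedged usage sketch; the Cardinality import path and the enum value shown are assumptions:

```python
# Hypothetical usage; the import path and enum values are assumptions.
from palimpzest.constants import Cardinality  # assumed location of the enum

qr2 = qr2.add_columns(
    {"sender": ("The email address of the sender", "string")},
    cardinality=Cardinality.ONE_TO_ONE,  # ONE_TO_MANY also supported, per the review comment
)

# __iter__ on DataRecordCollection lets demos stream over results directly.
for record in qr3.run():
    print(record)
```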
@chjuncn merged commit 37b8835 into dev on Jan 28, 2025
sivaprasadsudhir pushed a commit to sivaprasadsudhir/palimpzest that referenced this pull request Jan 28, 2025
…emo in df-newinterface.py (mitdbg#78)

* Support add_columns in Dataset. Support demo in df-newinterface.py

Currently we have to do:

```python
records, _ = qr3.run()
outputDf = DataRecord.to_df(records)
```

I'll try to make qr3.run().to_df() work in another PR.

* ruff check --fix

* Support run().to_df()

Update run() to return a DataRecordCollection, so that it will be easier for us to support more features on run()'s output.

We support to_df() in this change.

I'll send out follow-up commits to update the other demos.

* run check --fix

* fix typo in DataRecordCollection

* Update records.py

* fix tiny bug in mab processor.

The code will run into an issue if we don't return any stats for this function in the following check:

```
                            max_quality_record_set = self.pick_highest_quality_output(all_source_record_sets)
                            if (
                                not prev_logical_op_is_filter
                                or (
                                    prev_logical_op_is_filter
                                    and max_quality_record_set.record_op_stats[0].passed_operator
                                )
```

* update record.to_df interface

Update to record.to_df(records: list[DataRecord], project_cols: list[str] | None = None), which is consistent with the other functions in this class.

* Update demo for the new execute() output format

* better way to get plan from output.run()

* fix getting plan from DataRecordCollection.

People used to get the plan from the streaming processor's execute(), which is not good practice.

I updated plan_str to plan_stats, and callers now need to get the physical plan from the processor.

Consider better ways to provide the executed physical plan to DataRecordCollection, possibly from stats.

* Update df-newinterface.py

* update code based on comments from Matt.

1. add cardinality param in add_columns
2. remove extra testdata files
3. add __iter__ in DataRecordCollection to help iterate over streaming output
mdr223 added a commit that referenced this pull request Jan 28, 2025
* update README

* 1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py (#78)

* see if copilot just saved me 20 minutes

* fix package name

* use sed to get version from pyproject.toml

* bump project version; keep docs behind to test ci pipeline

* bumping docs version to match code version

* use new __iter__ method in demos where possible

* add type hint for output of __iter__; use __iter__ in unit tests

* Update download-testdata.sh (#89)

Added enron-tiny.csv

---------

Co-authored-by: Matthew Russo <[email protected]>
Co-authored-by: Gerardo Vitagliano <[email protected]>
@chjuncn deleted the chjun-0127 branch on February 1, 2025 02:00
Successfully merging this pull request may close these issues:

  • Improve syntax for quickstart demo(s)