1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py #78

Merged: 13 commits into dev from chjun-0127 on Jan 28, 2025

Conversation

@chjuncn (Collaborator) commented on Jan 27, 2025

Support this workflow:

```python
import pandas as pd
import palimpzest as pz

df = pd.read_csv("testdata/enron-tiny.csv")
qr2 = pz.Dataset(df)
qr2 = qr2.add_columns({
    "sender": ("The email address of the sender", "string"),
    "subject": ("The subject of the email", "string"),
    "date": ("The date the email was sent", "string"),
})

qr3 = qr2.filter("It is an email").filter("It has Vacation in the subject")

output_df = qr3.run().to_df()
```

Currently we have to do:

```python
records, _ = qr3.run()
outputDf = DataRecord.to_df(records)
```

I'll try to make qr3.run().to_df() work in another PR.
Update run() to return a DataRecordCollection, so that it will be easier for us to support more features on run()'s output.

We support to_df() in this change.

I'll send out follow-up commits to update the other demos.
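
For context, a minimal sketch of what a result wrapper along these lines could look like; the class shape and the record-level to_dict() helper are illustrative assumptions, not the actual palimpzest implementation:

```python
# Illustrative sketch only: the class shape and helper names are assumptions,
# not palimpzest's actual implementation.
from dataclasses import dataclass, field
from typing import Any, Iterator

import pandas as pd


@dataclass
class DataRecordCollection:
    """Wraps the records produced by run() so more output helpers can be added later."""

    records: list[Any] = field(default_factory=list)

    def to_df(self, project_cols: list[str] | None = None) -> pd.DataFrame:
        # Assumes each record can expose its fields as a dict (e.g. via a to_dict() helper).
        rows = [record.to_dict() for record in self.records]
        df = pd.DataFrame(rows)
        return df[project_cols] if project_cols else df

    def __iter__(self) -> Iterator[Any]:
        # Lets callers write `for record in qr3.run():` over batch or streaming output.
        return iter(self.records)
```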
@chjuncn changed the base branch from main to dev on January 27, 2025 06:52
The code will run into an issue if we don't return any stats for this function in the following check:

```
                            max_quality_record_set = self.pick_highest_quality_output(all_source_record_sets)
                            if (
                                not prev_logical_op_is_filter
                                or (
                                    prev_logical_op_is_filter
                                    and max_quality_record_set.record_op_stats[0].passed_operator
                                )
```
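
Presumably the failure is the `[0]` index on an empty record_op_stats list; a defensive version of that check could look like the sketch below (illustrative only, not necessarily the exact change in this PR):

```python
# Illustrative guard, not necessarily the exact fix: avoid indexing record_op_stats
# when the chosen record set carries no stats at all.
op_stats = max_quality_record_set.record_op_stats
if not prev_logical_op_is_filter or (op_stats and op_stats[0].passed_operator):
    ...  # keep the record set, as in the original branch
```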
Update to record.to_df(records: list[DataRecord], project_cols: list[str] | None = None), which is consistent with the other functions in this class.
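
For illustration, a usage sketch of the updated signature; here records is assumed to be a list[DataRecord], and the column names come from the add_columns example above:

```python
# Hypothetical usage of the updated classmethod; column names are from the demo above.
full_df = DataRecord.to_df(records)                                     # every field
slim_df = DataRecord.to_df(records, project_cols=["sender", "subject", "date"])
```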
@chjuncn changed the title from "Support add_columns in Dataset. Support demo in df-newinterface.py" to "1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py" on Jan 27, 2025
People used to get the plan from the streaming processor's execute(), which is not good practice.

I updated plan_str to plan_stats, and callers now need to get the physical plan from the processor.

Consider better ways to provide the executed physical plan to DataRecordCollection, possibly from stats.
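
Roughly, the intended shift looks like the sketch below; the attribute and method names here are assumptions, not the settled API:

```python
# Sketch only; attribute/method names are assumptions.
#
# Before (discouraged): callers unpacked the plan straight out of the streaming
# processor's execute() return value.
#
# After: the run() output carries stats, and the executed physical plan is
# obtained from the processor itself.
result = qr3.run()                    # a DataRecordCollection
stats = result.plan_stats             # execution statistics attached to the output
plan = processor.get_physical_plan()  # hypothetical accessor on the processor
```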
@mdr223 linked an issue on Jan 27, 2025 that may be closed by this pull request
@mdr223 (Collaborator) left a comment:
I've left a few comments:

  • A few need to be addressed before merging (e.g. the comment re: ONE_TO_MANY cardinality in add_columns())
  • Some others are more style suggestions (or questions) which you can ignore if you prefer

I'm pre-emptively approving this PR so that once you make the changes in your timezone you can merge the PR without needing to wait for me to wake up :)

Review comments were left on:

  • src/palimpzest/core/lib/schemas.py
  • src/palimpzest/sets.py (outdated)
  • testdata/convert_enron_to_csv.py (outdated)
  • testdata/enron-tiny.csv (outdated)
  • demos/bdf-suite.py
  • demos/bdf-usecase3.py
  • demos/fever-demo.py
1. add cardinality param in add_columns
2. remove extra testdata files
3. add __iter__ in DataRecordCollection to help iterate over streaming output (usage sketch below)
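
Putting those pieces together, a hedged usage sketch; the Cardinality import path and the enum value shown are assumptions:

```python
# Hypothetical usage; the import path and enum values are assumptions.
from palimpzest.constants import Cardinality  # assumed location of the enum

qr2 = qr2.add_columns(
    {"sender": ("The email address of the sender", "string")},
    cardinality=Cardinality.ONE_TO_ONE,  # ONE_TO_MANY also supported, per the review comment
)

# __iter__ on DataRecordCollection lets demos stream over results directly.
for record in qr3.run():
    print(record)
```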
@chjuncn merged commit 37b8835 into dev on Jan 28, 2025
sivaprasadsudhir pushed a commit to sivaprasadsudhir/palimpzest that referenced this pull request Jan 28, 2025
…emo in df-newinterface.py (mitdbg#78)

* Support add_columns in Dataset. Support demo in df-newinterface.py

Currently we have to do:

```python
records, _ = qr3.run()
outputDf = DataRecord.to_df(records)
```

I'll try to make qr3.run().to_df() work in another PR.

* ruff check --fix

* Support run().to_df()

Update run() to return a DataRecordCollection, so that it will be easier for us to support more features on run()'s output.

We support to_df() in this change.

I'll send out follow-up commits to update the other demos.

* run check --fix

* fix typo in DataRecordCollection

* Update records.py

* fix tiny bug in mab processor.

The code will run into an issue if we don't return any stats for this function in the following check:

```
                            max_quality_record_set = self.pick_highest_quality_output(all_source_record_sets)
                            if (
                                not prev_logical_op_is_filter
                                or (
                                    prev_logical_op_is_filter
                                    and max_quality_record_set.record_op_stats[0].passed_operator
                                )
```

* update record.to_df interface

Update to record.to_df(records: list[DataRecord], project_cols: list[str] | None = None), which is consistent with the other functions in this class.

* Update demo for the new execute() output format

* better way to get plan from output.run()

* fix getting plan from DataRecordCollection.

People used to get the plan from the streaming processor's execute(), which is not good practice.

I updated plan_str to plan_stats, and callers now need to get the physical plan from the processor.

Consider better ways to provide the executed physical plan to DataRecordCollection, possibly from stats.

* Update df-newinterface.py

* update code based on comments from Matt.

1. add cardinality param in add_columns
2. remove extra testdata files
3. add __iter__ in DataRecordCollection to help iterate over streaming output
mdr223 added a commit that referenced this pull request Jan 28, 2025
* update README

* 1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py (#78)

* see if copilot just saved me 20 minutes

* fix package name

* use sed to get version from pyproject.toml

* bump project version; keep docs behind to test ci pipeline

* bumping docs version to match code version

* use new __iter__ method in demos where possible

* add type hint for output of __iter__; use __iter__ in unit tests

* Update download-testdata.sh (#89)

Added enron-tiny.csv

---------

Co-authored-by: Matthew Russo <[email protected]>
Co-authored-by: Gerardo Vitagliano <[email protected]>
@chjuncn deleted the chjun-0127 branch on February 1, 2025 02:00
Successfully merging this pull request may close these issues:

  • Improve syntax for quickstart demo(s)