forked from mitdbg/palimpzest
Commit
1. support add_columns in Dataset; 2. support run().to_df(); 3. add demo in df-newinterface.py (mitdbg#78)

* Support add_columns in Dataset. Support demo in df-newinterface.py

  Currently we have to do

      records, _ = qr3.run()
      outputDf = DataRecord.to_df(records)

  I'll try to make qr3.run().to_df() work in another PR.

* ruff check --fix

* Support run().to_df()

  Update run() to return a DataRecordCollection, so that it will be easier for us to support more features for the run() output. We support to_df() in this change. I'll send out follow-up commits to update the other demos.

* run check --fix

* fix typo in DataRecordCollection

* Update records.py

* fix tiny bug in mab processor.

  The code runs into an issue if we don't return any stats for this function in

  ```
  max_quality_record_set = self.pick_highest_quality_output(all_source_record_sets)
  if (
      not prev_logical_op_is_filter
      or (
          prev_logical_op_is_filter
          and max_quality_record_set.record_op_stats[0].passed_operator
      )
  ```

* update record.to_df interface

  Update to record.to_df(records: list[DataRecord], project_cols: list[str] | None = None), which is consistent with the other functions in this class.

* Update demo for the new execute() output format

* better way to get plan from output.run()

* fix getting plan from DataRecordCollection.

  People used to get the plan from execute() of the streaming processor, which is not good practice. I changed plan_str to plan_stats, and callers now need to get the physical plan from the processor. Consider better ways to provide the executed physical plan to DataRecordCollection, possibly from stats.

* Update df-newinterface.py

* update code based on comments from Matt.

  1. add cardinality param in add_columns
  2. remove extra testdata files
  3. add __iter__ in DataRecordCollection to help iterate over streaming output.
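To make the interface described in the commit message concrete, here is a minimal sketch of a collection object that supports both `to_df()` and iteration. It is not the Palimpzest implementation: `SketchRecord` and `SketchRecordCollection` are hypothetical stand-ins, and only the `to_df(records, project_cols=None)` signature, the `plan_stats` name, and the presence of `__iter__` are taken from the message above. How the real classes store fields or statistics is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Iterator

import pandas as pd


@dataclass
class SketchRecord:
    """Hypothetical stand-in for a DataRecord: a bag of named field values."""

    fields: dict[str, Any]

    @staticmethod
    def to_df(
        records: list[SketchRecord], project_cols: list[str] | None = None
    ) -> pd.DataFrame:
        # Build one row per record; an empty input yields an empty frame.
        if not records:
            return pd.DataFrame(columns=project_cols or [])
        df = pd.DataFrame([r.fields for r in records])
        # Optionally keep only the requested columns, in the requested order.
        return df[project_cols] if project_cols else df


@dataclass
class SketchRecordCollection:
    """Hypothetical stand-in for the object returned by run()."""

    records: list[SketchRecord]
    plan_stats: Any = None  # assumption: an opaque statistics object

    def __iter__(self) -> Iterator[SketchRecord]:
        # Lets callers loop over results directly, e.g. for streaming output.
        return iter(self.records)

    def to_df(self, project_cols: list[str] | None = None) -> pd.DataFrame:
        # Delegate to the record-level helper so both entry points agree.
        return SketchRecord.to_df(self.records, project_cols)


collection = SketchRecordCollection(
    [
        SketchRecord({"sender": "a@example.com", "subject": "Vacation", "body": "hi"}),
        SketchRecord({"sender": "b@example.com", "subject": "Vacation days", "body": "yo"}),
    ]
)
print(collection.to_df(project_cols=["sender", "subject"]))
for record in collection:
    print(record.fields["subject"])
```

Returning a single object instead of a `(records, stats)` tuple is what allows the chained `run().to_df()` call; new accessors can then be added without changing the return shape again.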
1 parent c3cbca9, commit 3275864
Showing 23 changed files with 277 additions and 111 deletions.
@@ -0,0 +1,23 @@
```python
import pandas as pd

import palimpzest as pz
from palimpzest.query.processor.config import QueryProcessorConfig

df = pd.read_csv("testdata/enron-tiny.csv")
qr2 = pz.Dataset(df)
qr2 = qr2.add_columns({"sender": ("The email address of the sender", "string"),
                       "subject": ("The subject of the email", "string"),
                       "date": ("The date the email was sent", "string")})
qr3 = qr2.filter("It is an email").filter("It has Vacation in the subject")

config = QueryProcessorConfig(
    verbose=True,
    execution_strategy="pipelined_parallel",
)

output = qr3.run(config)
output_df = output.to_df()
print(output_df)

output_df = output.to_df(project_cols=["sender", "subject", "date"])
print(output_df)
```