
with_columns for Pandas #1209

Merged
3 commits merged into DAGWorks-Inc:main from feat/with_columns on Nov 6, 2024

Conversation

Contributor

@jernejfrank commented Oct 30, 2024

Creating the option to use with_columns on Pandas data frames.

Partially addressing #1158.

Changes

  • added new decorator

How I tested this

  • unit tests
  • e2e
  • example

Notes

Known issues:

  • config.when does not work for individual nodes within the with_columns subdag.

The other thing is that, at the moment, this extracts the relevant columns from the dataframe and then appends them, so execution is eager.

Polars and PySpark can do with_columns lazily and use their optimizers under the hood. Pandas, to my knowledge, doesn't have that, but there are two ways we could mimic laziness:

  1. using eval:
     df.eval('new_column = f(previous_columns)')
  2. using assign:
     df.assign(new_column=lambda df: f(df[previous_columns]))
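A minimal sketch of both options (the column names here are made up, and f stands for whatever per-column function would be applied):

import pandas as pd

df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6]})

# Option 1: eval evaluates an expression string against the frame and returns a new frame.
df_eval = df.eval("new_column = col_1 + col_2")

# Option 2: assign evaluates a callable against the frame, also returning a new frame
# and leaving the original intact.
df_assign = df.assign(new_column=lambda d: d["col_1"] + d["col_2"])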

My reasoning so far is that anybody concerned with latency won't be using Pandas in the first place, so this seems like overkill.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Contributor

@ellipsis-dev bot left a comment

❌ Changes requested. Reviewed everything up to 0c41502 in 50 seconds

More details
  • Looked at 755 lines of code in 7 files
  • Skipped 2 files when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. tests/function_modifiers/test_recursive.py:778
  • Draft comment:
    Typo in function name. Consider renaming substract_1_from_2 to subtract_1_from_2 for clarity and correctness. This applies to other occurrences as well.
  • Reason this comment was not posted:
    Marked as duplicate.
2. hamilton/function_modifiers/recursive.py:778
  • Draft comment:
    Typo in function name. Consider renaming substract_1_from_2 to subtract_1_from_2 for clarity and correctness. This applies to other occurrences as well.
  • Reason this comment was not posted:
    Marked as duplicate.
3. tests/resources/with_columns.py:10
  • Draft comment:
    Typo in function name substract_1_from_2. It should be subtract_1_from_2. This typo is present in multiple places, including the function definition and its usage in decorators.
  • Reason this comment was not posted:
    Marked as duplicate.
4. tests/function_modifiers/test_recursive.py:778
  • Draft comment:
    Typo in function name substract_1_from_2. It should be subtract_1_from_2. This typo is present in multiple places, including the function definition and its usage in decorators.
  • Reason this comment was not posted:
    Marked as duplicate.

Workflow ID: wflow_tdIW9EERoU9TNVSK



return pd.DataFrame({"col_1": [1, 2, 3, 4], "col_2": [11, 12, 13, 14], "col_3": [1, 1, 1, 1]})


def substract_1_from_2(col_1: pd.Series, col_2: pd.Series) -> pd.Series:
Contributor

Typo in function name. Consider renaming substract_1_from_2 to subtract_1_from_2 for clarity and correctness.

Suggested change
def substract_1_from_2(col_1: pd.Series, col_2: pd.Series) -> pd.Series:
def subtract_1_from_2(col_1: pd.Series, col_2: pd.Series) -> pd.Series:


import pytest

from hamilton import ad_hoc_utils, graph
from hamilton import ad_hoc_utils, driver, graph, node
Collaborator

Could use

from hamilton import ..., node as hamilton_node

because the name node is overridden elsewhere (e.g., line 187)

Collaborator

@elijahbenizzy left a comment

Looking good! Main comments about how to do the assign-style operations. Definitely on the right track!

@@ -88,6 +88,7 @@

subdag = recursive.subdag
parameterized_subdag = recursive.parameterized_subdag
with_columns = recursive.with_columns
Collaborator

This should probably live in a pandas extension -- while pandas has, historically, been the dependency for Hamilton, we want to move it out (and this is very pandas-specific logic)



# TODO: Copied here from h_spark, needs refactoring
def prune_nodes(nodes: List[node.Node], select: Optional[List[str]] = None) -> List[node.Node]:
Collaborator

Yep, would make this live here and have pandas/spark refer to it


@staticmethod
def _check_for_duplicates(nodes_: List[node.Node]) -> bool:
"Ensures that we don't run into name clashing of columns and group operations."
Collaborator

Might be worth expanding the docstring a bit more on what this means -- I'm getting a little lost in the details (e.g., is this a user error? What's the remediation?). More for internal docs.
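For instance, the docstring could spell out the failure mode and the remediation (hypothetical wording, just to illustrate the level of detail):

@staticmethod
def _check_for_duplicates(nodes_: List[node.Node]) -> bool:
    """Ensures that we don't run into name clashing of columns and group operations.

    Returns True if two generated nodes share the same name, which would make the
    final column assignment ambiguous. This indicates a user error -- rename one of
    the colliding functions/columns (or namespace them) to resolve it.
    """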

*load_from: Union[Callable, ModuleType],
columns_to_pass: Union[str, List[str]] = None,
pass_dataframe_as: str = None,
select: Union[str, List[str]] = None,
Collaborator

So I think we should keep this consistent with the h_spark with_columns -- maybe we add a single-string ability for that? Or we remove it for this?

Contributor Author

Will make it consistent with Spark -- in case this becomes a pain point, we can always add it later on.

:param namespace: The namespace of the nodes, so they don't clash with the global namespace
and so this can be reused. If its left out, there will be no namespace (in which case you'll want
to be careful about repeating it/reusing the nodes in other parts of the DAG.)
:param config_required: the list of config keys that are required to resolve any functions. Pass in None\
Collaborator

This should solve the problem of config wiring through...

self.namespace = namespace
# This never gets used within the class, but pyspark had it so keeping it here in case we
# need to access it from somewhere outside
self.upstream_dependency = pd.DataFrame
Collaborator

Let's remove this for now? Should also be able to remove it from h_spark -- it's an artifact and was left in erroneously.

out = []
for column in self.initial_schema:

def extractor_fn(
Collaborator

Ooh we should be able to share logic with extract_columns... Might be good to leave this as a TODO note

else:
# If we don't have a specified dataframe we assume it's the first argument
sig = inspect.signature(fn)
inject_parameter = list(sig.parameters.values())[0].name
Collaborator

Should be able to validate the type of the parameter as well?
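A minimal sketch of such a check (assuming we require the injected parameter to be annotated as pd.DataFrame; fn and inject_parameter follow the snippet above):

import inspect
import pandas as pd

sig = inspect.signature(fn)
first_param = list(sig.parameters.values())[0]
inject_parameter = first_param.name
# Hypothetical guard: fail early if the first parameter isn't typed as a DataFrame.
if first_param.annotation is not pd.DataFrame:
    raise ValueError(
        f"First parameter '{inject_parameter}' of '{fn.__name__}' must be annotated as pd.DataFrame."
    )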

def new_callable(**kwargs) -> Any:
df = kwargs[upstream_node]
for column in self.select:
df[column] = kwargs[column]
Collaborator

Hmm, we're mutating this... We should be careful, as that breaks assumptions.

The nice thing about pandas is that the dataframes shouldn't mess with memory if you create new ones; the series are what take up the memory. So we should be able to continually modify it by creating new ones (I think...)

Collaborator

Might be worth using assign or something like that... Looking here (you mentioned similar things in your comments on the code): https://stackoverflow.com/questions/72291290/how-to-create-new-column-dynamically-in-pandas-like-we-do-in-pyspark-withcolumn
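A non-mutating, assign-based version of the callable could look roughly like this (a sketch only; upstream_node and self.select are assumed to come from the enclosing scope, as in the snippet above):

def new_callable(**kwargs) -> Any:
    df = kwargs[upstream_node]
    # assign returns a new (shallow-copied) frame instead of mutating df in place;
    # existing columns with the same name are overwritten.
    return df.assign(**{column: kwargs[column] for column in self.select})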

Collaborator

Looking at the way Narwhals implements the Polars .with_columns() for other backends can also be informative.

Contributor Author

@jernejfrank Nov 2, 2024

I looked a bit deeper:

  1. assign makes a copy of the data frame (it should be shallow unless the flag is set to copy="deep"). From this thread my impression is that assign was created for chaining operations together while leaving the original data frame intact.
  2. Narwhals uses concat(dfs, axis=1, **extra_kwargs) under the hood, and this does come with a warning: "concat() makes a full copy of the data, and iteratively reusing concat() can create unnecessary copies. Collect all DataFrame or Series objects in a list before using concat()." However, for our use case we should only need to do this once, so that should also be ok.

And I tested it -- concat seems to be the way to go in terms of performance:

[EDIT] concat is annoying because you would need to filter out the individual columns that have changed / are selected by the user and add a safeguard, since concat would otherwise just duplicate them in the data frame (a sketch of that filtering follows the benchmark script below). .assign() seems to handle this automatically (probably where the discrepancy is coming from).

%timeit pd.concat([df:(10000, 5000), cols:2000], axis=1)
280 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df:(10000, 5000).assign(**cols:2000)
4.44 s ± 765 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%memit pd.concat([df:(10000, 5000), cols:2000], axis=1)
peak memory: 2107.36 MiB, increment: 0.20 MiB

%memit df:(10000, 5000).assign(**cols:2000)
peak memory: 2525.92 MiB, increment: 37.08 MiB
# Benchmark script used for the numbers above (IPython, with memory_profiler installed).
%load_ext memory_profiler
import numpy as np
import pandas as pd
import warnings

warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

# Base frame: pd_size rows by pd_size // 2 integer columns.
pd_size = 10000
df = pd.DataFrame(np.random.randint(0, pd_size, size=(pd_size, pd_size // 2)))

# New columns to append: once as a list of Series (for concat)
# and once as a dict keyed by new column names (for assign).
cols = []
for i in range(pd_size // 10):
    cols.append(np.random.rand(pd_size))

concat_cols = [pd.Series(col) for col in cols]
assign_cols = {f"{pd_size + i + 1}": col for i, col in enumerate(cols)}


print(f'\n%timeit pd.concat([df:{df.shape}, cols:{len(concat_cols)}], axis=1)')
temp_df = [df.copy()]
temp_df.extend(concat_cols)
%timeit l=pd.concat(temp_df, axis=1)

print(f'\n%timeit df:{df.shape}.assign(**cols:{len(assign_cols)})')
temp_df = df.copy()
%timeit temp_df.assign(**assign_cols)


print(f'\n%memit pd.concat([df:{df.shape}, cols:{len(concat_cols)}], axis=1)')
temp_df = [df.copy()]
temp_df.extend(concat_cols)
%memit l=pd.concat(temp_df, axis=1)

print(f'\n%memit df:{df.shape}.assign(**cols:{len(assign_cols)})')
temp_df = df.copy()
%memit temp_df.assign(**assign_cols)
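For the duplicate-column issue mentioned in the edit above, the extra filtering that concat would need could look roughly like this (hypothetical helper, not part of the PR):

import pandas as pd

def append_columns(df: pd.DataFrame, new_cols: dict) -> pd.DataFrame:
    # concat would duplicate any column that already exists in df, so drop the
    # overlapping ones first; assign handles this overwrite automatically.
    base = df.drop(columns=[c for c in new_cols if c in df.columns])
    return pd.concat([base, pd.DataFrame(new_cols)], axis=1)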

pd.testing.assert_frame_equal(result, expected_df)


def test_end_to_end_with_columns_pass_dataframe():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other tests you may want to consider

  1. Inputs from outside the subdag
  2. Inputs from external stuff
  3. Config

@jernejfrank force-pushed the feat/with_columns branch 3 times, most recently from b83f57e to fcf911f on November 3, 2024 at 23:11
Collaborator

@elijahbenizzy left a comment

This looks good -- let's rebase, I'll take one more look tomorrow morning, then we can merge and release!

This functionality will be shared with pandas and polars, so putting it
in a central place.
Extended with_columns to be usable on pandas dataframes.

Pandas does not have a native implementation of with_columns. This
builds one by using the existing extract_columns decorator to create
a node for each specified column in the dataframe; a subdag is then
built the usual way, and at the end the selected end nodes get appended
to a copy of the original dataframe using pandas.assign -- which creates
a shallow copy of the original dataframe. If columns with the same name
are selected as end nodes, they override the existing columns.
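Roughly, usage looks like this (a sketch based on the tests and example in this PR; the exact import path may differ by version):

import pandas as pd
from hamilton.function_modifiers import with_columns

# Column-level transform: takes existing columns as pd.Series, returns a new column.
def col_3(col_1: pd.Series, col_2: pd.Series) -> pd.Series:
    return col_1 - col_2

@with_columns(col_3, columns_to_pass=["col_1", "col_2"], select=["col_3"])
def final_df(initial_df: pd.DataFrame) -> pd.DataFrame:
    # Receives the original dataframe with the selected columns appended.
    return initial_df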
@elijahbenizzy merged commit 5f3ac72 into DAGWorks-Inc:main on Nov 6, 2024
24 checks passed
@zilto mentioned this pull request on Nov 7, 2024
@jernejfrank deleted the feat/with_columns branch on November 12, 2024