Add column-wise transforms & refactor TableVectorizer #902

jeromedockes · 2024-05-02T15:12:46Z

closes #874, #886, #894, #877, #848, #904, #905, #830, #626, #870

This is the last part of the changes outlined in #877 (the first two parts have been merged in #895 and #888)
The main addition is OnEachColumn, a transformer that applies a transformation independently to each column in a dataframe, and is used to refactor the TableVectorizer and ensure it does consistent operations across calls to transform.
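
To make the idea concrete, here is a minimal sketch of a per-column wrapper that fits one independent copy of a transformer on each column and reuses that fitted state in every later call to transform. It is not skrub's implementation or API: `PerColumn` and `Standardize` are hypothetical names, and a plain dict of lists stands in for a dataframe.

```python
from copy import deepcopy


class Standardize:
    """Toy single-column transformer: centers and scales a list of numbers."""

    def fit(self, column):
        n = len(column)
        self.mean_ = sum(column) / n
        variance = sum((x - self.mean_) ** 2 for x in column) / n
        # Guard against constant columns to avoid dividing by zero.
        self.scale_ = variance ** 0.5 or 1.0
        return self

    def transform(self, column):
        return [(x - self.mean_) / self.scale_ for x in column]


class PerColumn:
    """Hypothetical stand-in for the idea behind OnEachColumn (not skrub's API)."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, table):
        # One independent, fitted copy of the transformer per column name.
        self.transformers_ = {
            name: deepcopy(self.transformer).fit(values)
            for name, values in table.items()
        }
        return self

    def transform(self, table):
        # Reuse the state learned in fit, so a given column gets the same
        # operation on every call to transform.
        return {
            name: fitted.transform(table[name])
            for name, fitted in self.transformers_.items()
        }


train = {"age": [20.0, 30.0, 40.0], "salary": [1000.0, 2000.0, 3000.0]}
new = {"age": [30.0], "salary": [4000.0]}

per_column = PerColumn(Standardize()).fit(train)
result = per_column.transform(new)
```

Because the statistics are stored per column during fit, transforming a new table never re-derives them from the new data, which is the consistency property described above.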

TheooJ

Third pass, thank you @jeromedockes !

jeromedockes

thanks @TheooJ

glemaitre · 2024-05-28T12:29:40Z

I'll make a pass now.

glemaitre

Just ignore my comments about the example style. I think we should handle that in another PR.

glemaitre · 2024-05-28T12:47:17Z

examples/01_encodings.py

 #
-# Let's first retrieve the dataset:
+# Let's first retrieve the dataset, using one of the downloaders from the :mod:`skrub.datasets` module.


Do you want to make the example black-compliant now (lines under 88 characters), or run the tool automatically in another PR?
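
As a rough way to see which lines would exceed that limit, the sketch below does a plain length check on the two versions of the comment from the diff above. It is only an illustration, not black itself; `find_long_lines` is a hypothetical helper.

```python
MAX_LINE_LENGTH = 88  # black's default line length


def find_long_lines(source, limit=MAX_LINE_LENGTH):
    """Return (line_number, length) pairs for lines longer than ``limit``."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]


example = "\n".join(
    [
        "# Let's first retrieve the dataset:",
        "# Let's first retrieve the dataset, using one of the downloaders from the "
        ":mod:`skrub.datasets` module.",
    ]
)
long_lines = find_long_lines(example)
```

Only the second, newly added comment line is reported here.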

glemaitre · 2024-05-28T12:48:22Z

examples/01_encodings.py

 ###############################################################################
-# A simple prediction pipeline
-# ----------------------------
+# Easily encoding a dataframe


If we change the example, I would probably use the # %% delimiter nowadays.

Suggested change

-###############################################################################
-# A simple prediction pipeline
-# ----------------------------
-# Easily encoding a dataframe
+# %%
+# Easily encoding a dataframe
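
For context, this is a small sketch of what an example script looks like with the `# %%` delimiter, assuming the examples are rendered with Sphinx-Gallery. The script content is made up for illustration and is not taken from `examples/01_encodings.py`.

```python
"""
Toy example script
==================

Module docstring that serves as the title and introduction of the page.
"""

# %%
# Easily encoding a dataframe
# ---------------------------
# Comment lines that directly follow a ``# %%`` marker are rendered as text.

values = ["a", "b", "a"]

# %%
# A new ``# %%`` marker starts the next text block. It plays the same role
# as the full line of ``#`` characters used in the older style.

n_unique = len(set(values))
```

The same markers are also recognized as cell boundaries by several editors, so the script can be stepped through interactively.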

glemaitre · 2024-05-28T12:49:15Z

examples/01_encodings.py


 from skrub.datasets import fetch_employee_salaries

 dataset = fetch_employee_salaries()
+employees, salaries = dataset.X, dataset.y
+employees

 ###############################################################################


Suggested change

-###############################################################################
+# %%

glemaitre · 2024-05-28T12:49:27Z

examples/01_encodings.py


-X = dataset.X
-y = dataset.y
+###############################################################################


Suggested change

-###############################################################################
+# %%

glemaitre · 2024-05-28T12:50:16Z

examples/01_encodings.py

 ###############################################################################
-# We observe diverse columns in the dataset:
-#   - binary (``'gender'``),
-#   - numerical (``'employee_annual_salary'``),
-#   - categorical (``'department'``, ``'department_name'``, ``'assignment_category'``),
-#   - datetime (``'date_first_hired'``)
-#   - dirty categorical (``'employee_position_title'``, ``'division'``).
-#
-# Using skrub's |TableVectorizer|, we can now already build a machine-learning
-# pipeline and train it:
+# From our 8 columns, the |TableVectorizer| has extracted 143 numerical
+# features. Most of them are one-hot encoded representations of the categorical
+# features. For example, we can see that 3 columns ``'gender_F'``, ``'gender_M'``,
+# ``'gender_nan'`` were created to encode the ``'gender'`` column.
+
+###############################################################################
+# By performing appropriate transformations on our complex data, the |TableVectorizer| produced numeric features that we can use for machine-learning:

 from sklearn.ensemble import HistGradientBoostingRegressor


Suggested change

-###############################################################################
-# We observe diverse columns in the dataset:
-#   - binary (``'gender'``),
-#   - numerical (``'employee_annual_salary'``),
-#   - categorical (``'department'``, ``'department_name'``, ``'assignment_category'``),
-#   - datetime (``'date_first_hired'``)
-#   - dirty categorical (``'employee_position_title'``, ``'division'``).
-#
-# Using skrub's |TableVectorizer|, we can now already build a machine-learning
-# pipeline and train it:
-# From our 8 columns, the |TableVectorizer| has extracted 143 numerical
-# features. Most of them are one-hot encoded representations of the categorical
-# features. For example, we can see that 3 columns ``'gender_F'``, ``'gender_M'``,
-# ``'gender_nan'`` were created to encode the ``'gender'`` column.
-###############################################################################
-# By performing appropriate transformations on our complex data, the |TableVectorizer| produced numeric features that we can use for machine-learning:
-from sklearn.ensemble import HistGradientBoostingRegressor
+# %%
+# From our 8 columns, the |TableVectorizer| has extracted 143 numerical
+# features. Most of them are one-hot encoded representations of the categorical
+# features. For example, we can see that 3 columns ``'gender_F'``, ``'gender_M'``,
+# ``'gender_nan'`` were created to encode the ``'gender'`` column.
+#
+# By performing appropriate transformations on our complex data, the |TableVectorizer| produced numeric features that we can use for machine-learning:
+from sklearn.ensemble import HistGradientBoostingRegressor
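
To illustrate how one column can turn into the three features named in the example text, here is a toy one-hot encoder in plain Python. It is a sketch only and not how the TableVectorizer is implemented; `one_hot` is a hypothetical helper, and the sample values are invented.

```python
def one_hot(name, values):
    """One-hot encode a column, treating None as its own 'nan' category."""
    labels = ["nan" if value is None else str(value) for value in values]
    categories = sorted(set(labels) - {"nan"})
    if "nan" in labels:
        # Missing values get a dedicated indicator column, listed last.
        categories.append("nan")
    return {
        f"{name}_{category}": [float(label == category) for label in labels]
        for category in categories
    }


gender = ["F", "M", None, "F"]
encoded = one_hot("gender", gender)
```

Each output column is named after the input column and one of its categories, so a column with many distinct values produces many features.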

glemaitre · 2024-05-28T12:53:08Z

examples/01_encodings.py

 ###############################################################################
-# The simple pipeline applied on this complex dataset gave us very good results.
+# We can see that this new pipeline achieves a similar score but is fitted much faster.
+# This is mostly due to replacing |GapEncoder| with |MinHashEncoder| (however this makes the features less interpretable).

 ###############################################################################


Suggested change

-###############################################################################
-# The simple pipeline applied on this complex dataset gave us very good results.
-# We can see that this new pipeline achieves a similar score but is fitted much faster.
-# This is mostly due to replacing |GapEncoder| with |MinHashEncoder| (however this makes the features less interpretable).
-###############################################################################
+# %%
+# We can see that this new pipeline achieves a similar score but is fitted much faster.
+# This is mostly due to replacing |GapEncoder| with |MinHashEncoder| (however this makes the features less interpretable).
+#
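
The trade-off mentioned in that comment can be illustrated with a toy min-hash over character n-grams. This is a simplified sketch of the general technique, not skrub's MinHashEncoder; all function names and parameters are hypothetical.

```python
import hashlib


def ngrams(text, n=3):
    """Set of character n-grams of ``text``, padded with one space on each side."""
    padded = f" {text} "
    return {padded[i : i + n] for i in range(max(len(padded) - n + 1, 1))}


def stable_hash(token, seed):
    """Deterministic 64-bit hash of ``token`` that depends on ``seed``."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def min_hash(text, n_components=4):
    """One feature per seed: the smallest hash value among the string's n-grams."""
    grams = ngrams(text)
    return [min(stable_hash(g, seed) for g in grams) for seed in range(n_components)]


features = min_hash("police officer")
```

Nothing is learned from the data, so the encoding is cheap to compute. The resulting numbers are raw hash values, which is why such features are harder to interpret than learned topics.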

glemaitre · 2024-05-28T12:53:30Z

examples/01_encodings.py

-pipeline = make_pipeline(TableVectorizer(), regressor)
-pipeline.fit(X, y)
+pipeline = make_pipeline(vectorizer, regressor)
+pipeline.fit(employees, salaries)

 ###############################################################################


Suggested change

-###############################################################################
+# %%

glemaitre · 2024-05-28T12:54:22Z

examples/03_datetime_encoder.py

    remainder="drop",
 )

 X_enc = encoder.fit_transform(X)
-pprint(encoder.get_feature_names_out())
+# pprint(encoder.get_feature_names_out())


We should remove it then.

glemaitre · 2024-05-28T13:15:46Z

skrub/_selectors/_base.py

@@ -85,6 +85,7 @@ def cols(*columns):
    >>> s.all() & ['kind', 'ID']
    (all() & cols('kind', 'ID'))

+    # noqa


What is the reason for noqa?

glemaitre · 2024-05-28T13:20:19Z

skrub/_datetime_encoder.py

+    Here we can see the input to ``transform`` has been converted back to the
+    timezone used during ``fit`` and that we get the same result for "hour".
+
+    # noqa


OK, so this is to avoid the check on the docstring. I assume that we can clean it up afterwards.
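
The behaviour described in that docstring excerpt can be sketched with the standard library alone. This is an assumption-laden illustration of the idea (convert back to the timezone seen during fit before extracting "hour"), not the DatetimeEncoder's code; the offsets and dates are invented.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset timezones keep the sketch independent of any tz database.
fit_tz = timezone(timedelta(hours=2))
other_tz = timezone(timedelta(hours=-5))

seen_in_fit = datetime(2024, 5, 28, 15, 0, tzinfo=fit_tz)
# The same instant, expressed in a different timezone at transform time.
seen_in_transform = seen_in_fit.astimezone(other_tz)

# Extracting the hour directly would depend on the input's timezone.
naive_hour = seen_in_transform.hour
# Converting back to the timezone remembered from fit restores the same "hour".
consistent_hour = seen_in_transform.astimezone(fit_tz).hour
```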

glemaitre

So this is actually looking good.

jeromedockes · 2024-05-28T13:35:26Z

Oops, so sorry @glemaitre, I should have said so, but I think @GaelVaroquaux was planning to review it as well ... @GaelVaroquaux, LMK if you want to review; maybe the easiest way will be to revert the merge commit and open a new PR to un-revert.

jeromedockes · 2024-05-28T13:35:41Z

Thanks a lot for the review @glemaitre !

GaelVaroquaux · 2024-05-28T13:39:11Z

No, no, it's good to have merged. I can give feedback via issues.

Hurray for merge. Thanks a lot to everyone involved!!

glemaitre · 2024-05-28T13:43:22Z

Thanks @GaelVaroquaux. We will address the subsequent issues. Let's roll ;)

jeromedockes · 2024-05-28T16:11:52Z

> No, no, it's good to have merged. I can give feedback via issues.

ok, thanks. there will be a few follow-up PRs in any case, @TheooJ and I are going to open a couple of issues

jeromedockes added 5 commits May 2, 2024 16:57

add column-wise transforms & refactor TV

258647f

docstring validation

e06d996

add changelog

97d2371

do not force categories in ToCategorical

ebeb1ae

remove pandasconvertdtypes step

9f16bff

jeromedockes force-pushed the columnwise-transformations branch from f8636a4 to 9f16bff Compare May 3, 2024 12:09

jeromedockes added 5 commits May 3, 2024 14:29

null value & string detection for pandas object column

af0db0f

fix test

3c6c1e4

copy=False in make_dataframe_like

120ad28

copy=False in concat pandas

42dbe1a

no convert_dtypes()

cc37b63

TheooJ mentioned this pull request May 3, 2024

Dispatch correct backend in tests #903

Merged

jeromedockes changed the title ~~Add column-wise transforms & refactor TableVectorizer~~ [WIP] Add column-wise transforms & refactor TableVectorizer May 6, 2024

jeromedockes added 17 commits May 6, 2024 12:03

pandas numpy vs nullable dtypes in tests

8747231

update selectors tests

d0b995d

finish updating tests

3d4cd2b

faster is_string for pandas object columns

87e5a3c

Revert "faster is_string for pandas object columns"

4fdf48c

This reverts commit 87e5a3c.

Merge remote-tracking branch 'upstream/main' into columnwise-transfor…

93f7088

…mations

better handling of string & object columns in CleanNullStrings

63fd3a6

skip doctest when polars not installed

370a4c4

adjustments to to_datetime + add docstring

52e6673

udpate tests

f1600c8

iter

642b759

treat bool as numeric

8ce8add

use ToFloat instead of ToNumeric

cf8706d

remove unused function to_numeric

5db0c1d

convert pandas string columns to str (object)

73d71c7

small speedup string dtype checks

3fd749e

add ToCategorical docstring

792ba91

TheooJ approved these changes May 27, 2024

Apply suggestions from code review

4c7a761

Co-authored-by: Théo Jolivet <[email protected]>

jeromedockes commented May 27, 2024

jeromedockes added 5 commits May 27, 2024 17:12

address review comments

f6e3137

check_is_fitted

0341f94

.

b539ea7

adress review comment

bbc4c84

formatting

12df400

glemaitre self-requested a review May 28, 2024 12:30

glemaitre reviewed May 28, 2024

glemaitre approved these changes May 28, 2024

glemaitre merged commit 5b30ddd into skrub-data:main May 28, 2024
26 checks passed

glemaitre added a commit that referenced this pull request May 28, 2024

Revert "FEA/MAINT Add column-wise transforms & refactor TableVectoriz…

5303306

…er (#902)" This reverts commit 5b30ddd.

glemaitre mentioned this pull request May 28, 2024

Revert "Add column-wise transforms & refactor TableVectorizer" #913

Closed

jeromedockes mentioned this pull request May 28, 2024

TableVectoriser's "numerical_transformer" does not accept Pipelines #886

Closed

TheooJ mentioned this pull request May 28, 2024

Drop numpy array input support for TableVectorizer #830

Closed

This was referenced May 28, 2024

Polars deprecation inbound #894

Closed

[WIP] Column-wise transformations on part of a dataframe #877

Closed

[WIP] Add parsers #848

Closed

TheooJ mentioned this pull request May 28, 2024

Allow sorting transformers in TableVectorizer #626

Closed

jeromedockes mentioned this pull request May 28, 2024

TableVectorizer raises when a categorical column contains pd.NA #905

Closed

TheooJ mentioned this pull request May 28, 2024

Consider casting to float32 by default in TableVectorizer #870

Closed

jeromedockes mentioned this pull request May 28, 2024

Improve coverage of main features #371

Open

9 tasks

TheooJ mentioned this pull request Jun 11, 2024

[ENH] Drop numpy array input support #831

Closed
