Add column-wise transforms & refactor TableVectorizer #902

Merged
merged 116 commits into from
May 28, 2024
Changes from 111 commits
258647f
add column-wise transforms & refactor TV
jeromedockes May 2, 2024
e06d996
docstring validation
jeromedockes May 2, 2024
97d2371
add changelog
jeromedockes May 2, 2024
ebeb1ae
do not force categories in ToCategorical
jeromedockes May 3, 2024
9f16bff
remove pandasconvertdtypes step
jeromedockes May 3, 2024
af0db0f
null value & string detection for pandas object column
jeromedockes May 3, 2024
3c6c1e4
fix test
jeromedockes May 3, 2024
120ad28
copy=False in make_dataframe_like
jeromedockes May 3, 2024
42dbe1a
copy=False in concat pandas
jeromedockes May 3, 2024
cc37b63
no convert_dtypes()
jeromedockes May 3, 2024
8747231
pandas numpy vs nullable dtypes in tests
jeromedockes May 6, 2024
d0b995d
update selectors tests
jeromedockes May 6, 2024
3d4cd2b
finish updating tests
jeromedockes May 6, 2024
87e5a3c
faster is_string for pandas object columns
jeromedockes May 6, 2024
4fdf48c
Revert "faster is_string for pandas object columns"
jeromedockes May 6, 2024
93f7088
Merge remote-tracking branch 'upstream/main' into columnwise-transfor…
jeromedockes May 7, 2024
63fd3a6
better handling of string & object columns in CleanNullStrings
jeromedockes May 7, 2024
370a4c4
skip doctest when polars not installed
jeromedockes May 7, 2024
52e6673
adjustments to to_datetime + add docstring
jeromedockes May 7, 2024
f1600c8
udpate tests
jeromedockes May 7, 2024
642b759
iter
jeromedockes May 7, 2024
8ce8add
treat bool as numeric
jeromedockes May 8, 2024
cf8706d
use ToFloat instead of ToNumeric
jeromedockes May 8, 2024
5db0c1d
remove unused function to_numeric
jeromedockes May 8, 2024
73d71c7
convert pandas string columns to str (object)
jeromedockes May 8, 2024
3fd749e
small speedup string dtype checks
jeromedockes May 8, 2024
792ba91
add ToCategorical docstring
jeromedockes May 8, 2024
3fed421
split cleancategories and tocategorical
jeromedockes May 11, 2024
a2050f0
iter string dtype t object
jeromedockes May 13, 2024
9f1816e
fix rename categories for old pandas versions
jeromedockes May 13, 2024
bd90c72
convert all remaining columns to strings
jeromedockes May 13, 2024
0384f5e
remove remainder transformer
jeromedockes May 13, 2024
d25a079
update tests
jeromedockes May 13, 2024
27321e4
doctst output formatting
jeromedockes May 13, 2024
66719a4
doctst output
jeromedockes May 13, 2024
acd103f
always cast to float32 in ToFloat
jeromedockes May 13, 2024
1773e66
detail
jeromedockes May 13, 2024
d86394f
add column_kinds_
jeromedockes May 13, 2024
f3c2686
_
jeromedockes May 13, 2024
f611718
naming
jeromedockes May 13, 2024
f02af18
add to_float32 in input_to_processing_steps_
jeromedockes May 14, 2024
7606602
iter docstring
jeromedockes May 14, 2024
d5b0aef
add SingleColumnTransformer base class & move RejectColumn definition
jeromedockes May 14, 2024
92ee574
update doctests
jeromedockes May 14, 2024
9653dbb
do not allow single column transformers to reject a column by default
jeromedockes May 14, 2024
4015142
improve docstrings
jeromedockes May 14, 2024
1278908
cleanup
jeromedockes May 14, 2024
a14e11c
cleanup
jeromedockes May 14, 2024
efa28e3
improve method order + set_output api compat
jeromedockes May 14, 2024
97539ab
compatibility with set_output api
jeromedockes May 14, 2024
47a0e64
docstrings + weekday starts at 1
jeromedockes May 15, 2024
2abbdea
iter example 1
jeromedockes May 15, 2024
ba1348f
detail
jeromedockes May 15, 2024
09069c9
better error message when a single-column transformer is applied to a…
jeromedockes May 16, 2024
30d3927
make the DatetimeEncoder single-column & remove EncodeDatetime
jeromedockes May 16, 2024
bb4b636
update docstrings & joiner
jeromedockes May 16, 2024
d6b6403
only check single-column once
jeromedockes May 16, 2024
15020b3
update example
jeromedockes May 16, 2024
bab7693
rename datetime_format -> format
jeromedockes May 16, 2024
1fff5de
add some to_datetime tests
jeromedockes May 16, 2024
6b3496a
tz tests
jeromedockes May 16, 2024
eecc1fe
fill_nulls for polars dataframes
jeromedockes May 17, 2024
23fb464
Apply suggestions from code review
jeromedockes May 17, 2024
3aa994e
typo
jeromedockes May 17, 2024
b5f20e0
adapt test for older polars
jeromedockes May 17, 2024
9cba72a
old pandas datetime formats
jeromedockes May 22, 2024
c916605
Apply suggestions from code review
jeromedockes May 22, 2024
8161b20
created_by private
jeromedockes May 22, 2024
d2a9083
add check_input tests
jeromedockes May 22, 2024
2222c21
address some review comments
jeromedockes May 22, 2024
778aff4
fix test
jeromedockes May 22, 2024
b6b85f0
Apply suggestions from code review
jeromedockes May 22, 2024
52c77e3
address more review comments
jeromedockes May 22, 2024
5228e03
more review comments
jeromedockes May 22, 2024
42f17b8
fix docstring
jeromedockes May 22, 2024
60a6f9a
simplify datetime format guessing
jeromedockes May 22, 2024
e5685f5
smaller + seeded datetime sample
jeromedockes May 22, 2024
afb1884
rename column_kinds_ and add column_to_kind_
jeromedockes May 22, 2024
bc23d0d
minor changes to tv docstring
jeromedockes May 22, 2024
d65ec3a
show tablevectorizer specialized for hgb
jeromedockes May 22, 2024
5bedfdc
docstrings
jeromedockes May 22, 2024
a8ece56
doc
jeromedockes May 22, 2024
6c5208e
note about single-column transformers in docstrings
jeromedockes May 22, 2024
8efbd44
address review comment: unify to_float32 implementations
jeromedockes May 22, 2024
25bcbb3
fix to_float32
jeromedockes May 22, 2024
7b81a5d
Apply suggestions from code review
jeromedockes May 23, 2024
1f1b93e
code review comments
jeromedockes May 23, 2024
3f1582b
detail
jeromedockes May 23, 2024
dc22de2
singlecolumntransformer tests
jeromedockes May 23, 2024
71aee50
oneachcolumn tests
jeromedockes May 23, 2024
33c00d8
more tests
jeromedockes May 23, 2024
53d8e90
add check_is_fitted checks
jeromedockes May 23, 2024
8472061
on_subframe tests
jeromedockes May 24, 2024
3c84103
more to_datetime tests
jeromedockes May 24, 2024
a50fa00
remove unused function
jeromedockes May 24, 2024
6e356b2
clean_categories tests
jeromedockes May 24, 2024
7e291f4
add test_clean_null_strings
jeromedockes May 24, 2024
dca3d65
add test_to_categorical
jeromedockes May 24, 2024
c9b0fb6
add test_to_float32
jeromedockes May 24, 2024
7e654dd
more tests
jeromedockes May 24, 2024
1586558
misc testing
jeromedockes May 24, 2024
7763031
fix test
jeromedockes May 24, 2024
935e69d
add notes and checks that transformer parameters output dataframes
jeromedockes May 24, 2024
88a2ea0
details
jeromedockes May 27, 2024
0c90757
_
jeromedockes May 27, 2024
2afaa4d
add test with supervised encoder
jeromedockes May 27, 2024
111a733
line lengths in docstrings
jeromedockes May 27, 2024
18aeec5
rename day_of_the_week → weekday
jeromedockes May 27, 2024
008b5e2
iter example
jeromedockes May 27, 2024
d841526
fix test
jeromedockes May 27, 2024
4c7a761
Apply suggestions from code review
jeromedockes May 27, 2024
f6e3137
address review comments
jeromedockes May 27, 2024
0341f94
check_is_fitted
jeromedockes May 27, 2024
b539ea7
.
jeromedockes May 27, 2024
bbc4c84
adress review comment
jeromedockes May 28, 2024
12df400
formatting
jeromedockes May 28, 2024
8 changes: 8 additions & 0 deletions CHANGES.rst
@@ -14,6 +14,14 @@ It is currently undergoing fast development and backward compatibility is not en

 Major changes
 -------------
+* The :class:`TableVectorizer` now consistently applies the same transformation
+  across different calls to `transform`. There also have been some breaking
+  changes to its functionality: (i) all transformations are now applied
+  independently to each column, i.e. it does not perform multivariate
+  transformations (ii) in ``specific_transformers`` the same column may not be
+  used twice (go through 2 different transformers).
+  :pr:`902` by :user:`Jérôme Dockès <jeromedockes>`.
+
 * Added the :class:`MultiAggJoiner` that allows to augment a main table with
   multiple auxiliary tables. :pr:`876` by :user:`Théo Jolivet <TheooJ>`.

11 changes: 10 additions & 1 deletion doc/api.rst
@@ -83,6 +83,7 @@ This page lists all available functions and classes of `skrub`.
    GapEncoder
    MinHashEncoder
    SimilarityEncoder
+   ToCategorical

 .. raw:: html

@@ -98,10 +99,18 @@ This page lists all available functions and classes of `skrub`.

 .. autosummary::
    :toctree: generated/
-   :template: function.rst
+   :template: class.rst
    :nosignatures:
    :caption: Converting datetime columns in a table

+   ToDatetime
+
+
+.. autosummary::
+   :toctree: generated/
+   :template: function.rst
+   :nosignatures:
+
    to_datetime

 .. raw:: html
226 changes: 168 additions & 58 deletions examples/01_encodings.py

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions examples/03_datetime_encoder.py
@@ -88,12 +88,12 @@

 encoder = make_column_transformer(
     (OneHotEncoder(handle_unknown="ignore"), ["city"]),
-    (DatetimeEncoder(add_day_of_the_week=True, resolution="minute"), ["date.utc"]),
+    (DatetimeEncoder(add_weekday=True, resolution="minute"), "date.utc"),
     remainder="drop",
 )

 X_enc = encoder.fit_transform(X)
-pprint(encoder.get_feature_names_out())
+# pprint(encoder.get_feature_names_out())

Review comment (Member): We should remove it then.


 ###############################################################################
 # We see that the encoder is working as expected: the ``"date.utc"`` column has
@@ -119,7 +119,7 @@
 # Here, for example, we want it to extract the day of the week.

 table_vec = TableVectorizer(
-    datetime_transformer=DatetimeEncoder(add_day_of_the_week=True),
+    datetime_transformer=DatetimeEncoder(add_weekday=True),
 ).fit(X)
 pprint(table_vec.get_feature_names_out())
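
The hunks above rename `add_day_of_the_week` to `add_weekday`, and one commit in this PR ("docstrings + weekday starts at 1") indicates that weekday numbering now starts at 1. As a rough illustration of what such an option produces, here is a sketch using only the standard library. It assumes ISO numbering (Monday = 1, Sunday = 7), which the diff itself does not confirm; `extract_features` and its feature names are hypothetical and are not skrub's API.

```python
from datetime import datetime, timezone


def extract_features(name, value, add_weekday=False):
    """Hypothetical helper: split one datetime into numeric features."""
    features = {
        f"{name}_year": value.year,
        f"{name}_month": value.month,
        f"{name}_day": value.day,
        f"{name}_hour": value.hour,
        f"{name}_total_seconds": value.timestamp(),
    }
    if add_weekday:
        # Assumption: ISO numbering, Monday = 1 ... Sunday = 7.
        features[f"{name}_weekday"] = value.isoweekday()
    return features


merged = datetime(2024, 5, 28, 12, 0, tzinfo=timezone.utc)  # a Tuesday
features = extract_features("date.utc", merged, add_weekday=True)
print(features["date.utc_weekday"])
```

Under a convention that starts at 0, the same date would map to a different number, so code that reads the weekday column (such as the bar plot in `examples/08_join_aggregation.py`) may need its comments or labels checked after the rename.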

@@ -257,7 +257,7 @@
 from sklearn.inspection import permutation_importance

 table_vec = TableVectorizer(
-    datetime_transformer=DatetimeEncoder(add_day_of_the_week=True),
+    datetime_transformer=DatetimeEncoder(add_weekday=True),
 )

 # In this case, we don't use a pipeline, because we want to compute the
@@ -280,8 +280,8 @@
y="importances", x="feature_names", title="Feature Importances", figsize=(12, 9)
)
plt.tight_layout()
plt.show()

###############################################################################
# We can see that the total seconds since Epoch and the hour of the day
# are the most important feature, which seems reasonable.
#
22 changes: 11 additions & 11 deletions examples/08_join_aggregation.py
@@ -88,9 +88,8 @@


 table_vectorizer = TableVectorizer(
-    datetime_transformer=DatetimeEncoder(add_day_of_the_week=True)
+    datetime_transformer=DatetimeEncoder(add_weekday=True)
 )
-table_vectorizer.set_output(transform="pandas")
 X_date_encoded = table_vectorizer.fit_transform(X)
 X_date_encoded.head()

@@ -103,19 +102,19 @@


 def make_barplot(x, y, title):
+    fig, ax = plt.subplots(layout="constrained")
     norm = plt.Normalize(y.min(), y.max())
     cmap = plt.get_cmap("magma")

-    sns.barplot(x=x, y=y, palette=cmap(norm(y)))
-    plt.title(title)
-    plt.xticks(rotation=30)
-    plt.ylabel(None)
-    plt.tight_layout()
+    sns.barplot(x=x, y=y, palette=cmap(norm(y)), ax=ax)
+    ax.set_title(title)
+    ax.set_xticks(ax.get_xticks(), labels=ax.get_xticklabels(), rotation=30)
+    ax.set_ylabel(None)


 # O is Monday, 6 is Sunday

-daily_volume = X_date_encoded["timestamp_day_of_week"].value_counts().sort_index()
+daily_volume = X_date_encoded["timestamp_weekday"].value_counts().sort_index()

 make_barplot(
     x=daily_volume.index,
@@ -287,9 +286,10 @@ def baseline_r2(X, y, train_idx, test_idx):

 # we only keep the 5 out of 10 last results
 # because the initial size of the train set is rather small
-sns.boxplot(results.tail(5), palette="magma")
-plt.ylabel("R2 score")
-plt.title("Hyper parameters grid-search results")
+fig, ax = plt.subplots(layout="constrained")
+sns.boxplot(results.tail(5), palette="magma", ax=ax)
+ax.set_ylabel("R2 score")
+ax.set_title("Hyper parameters grid-search results")
 plt.tight_layout()

 ###############################################################################
6 changes: 5 additions & 1 deletion skrub/__init__.py
@@ -5,7 +5,7 @@

 from ._agg_joiner import AggJoiner, AggTarget
 from ._check_dependencies import check_dependencies
-from ._datetime_encoder import DatetimeEncoder, to_datetime
+from ._datetime_encoder import DatetimeEncoder
 from ._deduplicate import compute_ngram_distance, deduplicate
 from ._fuzzy_join import fuzzy_join
 from ._gap_encoder import GapEncoder
@@ -16,6 +16,8 @@
 from ._select_cols import DropCols, SelectCols
 from ._similarity_encoder import SimilarityEncoder
 from ._table_vectorizer import TableVectorizer
+from ._to_categorical import ToCategorical
+from ._to_datetime import ToDatetime, to_datetime

 check_dependencies()

@@ -25,6 +27,7 @@

 __all__ = [
     "DatetimeEncoder",
+    "ToDatetime",
     "Joiner",
     "fuzzy_join",
     "GapEncoder",
@@ -34,6 +37,7 @@
     "TableVectorizer",
     "deduplicate",
     "compute_ngram_distance",
+    "ToCategorical",
     "to_datetime",
     "AggJoiner",
     "MultiAggJoiner",
182 changes: 182 additions & 0 deletions skrub/_check_input.py
@@ -0,0 +1,182 @@
+import warnings
+
+import numpy as np
+from sklearn.base import BaseEstimator, TransformerMixin
+
+from . import _dataframe as sbd
+from . import _join_utils, _utils
+from ._dispatch import dispatch
+
+__all__ = ["CheckInputDataFrame"]
+
+
+def _column_names_to_strings(column_names):
+    non_string = [c for c in column_names if not isinstance(c, str)]
+    if not non_string:
+        return column_names
+    warnings.warn(
+        f"Some column names are not strings: {non_string}. All column names"
+        " must be strings; converting to strings."
+    )
+    return list(map(str, column_names))
+
+
+def _deduplicated_column_names(column_names):
+    duplicates = _utils.get_duplicates(column_names)
+    if not duplicates:
+        return column_names
+    warnings.warn(
+        f"Found duplicated column names: {duplicates}. Please make sure column names"
+        " are unique. Renaming columns that have duplicated names."
+    )
+    return _join_utils.pick_column_names(column_names)
+
+
+def _cleaned_column_names(column_names):
+    return _deduplicated_column_names(_column_names_to_strings(column_names))
+
+
+@dispatch
+def _check_not_pandas_sparse(df):
+    pass
+
+
+@_check_not_pandas_sparse.specialize("pandas")
+def _check_not_pandas_sparse_pandas(df):
+    import pandas as pd
+
+    sparse_cols = [
+        col for col in df.columns if isinstance(df[col].dtype, pd.SparseDtype)
+    ]
+    if sparse_cols:
+        raise TypeError(
+            f"Columns {sparse_cols} are sparse Pandas series, but dense "
+            "data is required. Use ``df[col].sparse.to_dense()`` to convert "
+            "a series from sparse to dense."
+        )
+
+
+def _check_is_dataframe(df):
+    if not sbd.is_dataframe(df):
+        raise TypeError(
+            "Only pandas and polars DataFrames are supported. Cannot handle X of"
+            f" type: {type(df)}."
+        )
+
+
+def _collect_lazyframe(df):
+    if not sbd.is_lazyframe(df):
+        return df
+    warnings.warn(
+        "At the moment, skrub only works on eager DataFrames, calling collect()."
+    )
+    return sbd.collect(df)
+
+
+class CheckInputDataFrame(TransformerMixin, BaseEstimator):
+    """Check the dataframe entering a skrub pipeline.
+
+    This transformer ensures that:
+
+    - The input is a dataframe.
+        - Numpy arrays are converted to pandas dataframes with a warning.
+    - The dataframe library is the same during ``fit`` and ``transform``, e.g.
+      fitting on a polars dataframe and then transforming a pandas dataframe is
+      not allowed.
+        - A TypeError is raised otherwise.
+    - Column names are unique strings.
+        - Non-strings are cast to strings.
+        - A random suffix is added to duplicated names.
+        - If either of these operations is needed, a warning is emitted.
+        - Only applies to pandas; polars column names are always unique strings.
+    - The input is not sparse.
+        - A TypeError is raised otherwise.
+    - The input is not a ``LazyFrame``.
+        - A ``LazyFrame`` is ``collect``ed with a warning.
+    - The column names are the same during ``fit`` and ``transform``.
+        - A ValueError is raised otherwise.
+
+    Attributes
+    ----------
+    module_name_ : str
+        The name of the dataframe module, 'polars' or 'pandas'.
+    feature_names_in_ : list
+        The column names of the input (before cleaning).
+    n_features_in_ : int
+        The number of input columns.
+    feature_names_out_ : list of str
+        The column names after converting to string and deduplication.
+    """
+
+    def fit(self, X, y=None):
+        self.fit_transform(X, y)
+        return self
+
+    def fit_transform(self, X, y=None):
+        del y
+        X = self._handle_array(X)
+        _check_is_dataframe(X)
+        self.module_name_ = sbd.dataframe_module_name(X)
+        # TODO check schema (including dtypes) not just names.
+        # Need to decide how strict we should be about types
+        column_names = sbd.column_names(X)
+        self.feature_names_in_ = column_names
+        self.n_features_in_ = len(column_names)
+        self.feature_names_out_ = _cleaned_column_names(column_names)
+        if sbd.column_names(X) != self.feature_names_out_:
+            X = sbd.set_column_names(X, self.feature_names_out_)
+        _check_not_pandas_sparse(X)
+        X = _collect_lazyframe(X)
+        return X
+
+    def transform(self, X):
+        X = self._handle_array(X)
+        _check_is_dataframe(X)
+        module_name = sbd.dataframe_module_name(X)
+        if module_name != self.module_name_:
+            raise TypeError(
+                f"Pipeline was fitted to a {self.module_name_} dataframe "
+                f"but is being applied to a {module_name} dataframe. "
+                "This is likely to produce errors and is not supported."
+            )
+        column_names = sbd.column_names(X)
+        if column_names != self.feature_names_in_:
+            import difflib
+
+            diff = "\n".join(
+                difflib.Differ().compare(self.feature_names_in_, column_names)
+            )
+            message = (
+                f"Columns of dataframes passed to fit() and transform() differ:\n{diff}"
+            )
+            raise ValueError(message)
+        if sbd.column_names(X) != self.feature_names_out_:
+            X = sbd.set_column_names(X, self.feature_names_out_)
+        _check_not_pandas_sparse(X)
+        X = _collect_lazyframe(X)
+        return X
+
+    def _handle_array(self, X):
+        if not isinstance(X, np.ndarray):
+            return X
+        if X.ndim != 2:
+            raise ValueError(
+                "Input should be a DataFrame. Found an array with incompatible shape:"
+                f" {X.shape}."
+            )
+        warnings.warn(
+            "Only pandas and polars DataFrames are supported, but input is a Numpy"
+            " array. Please convert Numpy arrays to DataFrames before passing them to"
+            " skrub transformers. Converting to pandas DataFrame with columns"
+            " ['0', '1', …]."
+        )
+        import pandas as pd
+
+        columns = list(map(str, range(X.shape[1])))
+        X = pd.DataFrame(X, columns=columns)
+        return X
+
+    # set_output api compatibility
+
+    def get_feature_names_out(self):
+        return self.feature_names_out_
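
Two of the checks in `CheckInputDataFrame` can be tried in isolation: cleaning column names (cast to strings, then rename duplicates), and reporting a column mismatch between `fit` and `transform` with `difflib.Differ`. The sketch below uses only the standard library and operates on plain lists of names. `clean_column_names` and `check_same_columns` are hypothetical stand-ins, and the deterministic `__dupN` suffix replaces the random suffix that the docstring describes, so this is an approximation rather than the code under review.

```python
import difflib
import warnings
from collections import Counter


def clean_column_names(column_names):
    """Hypothetical stand-in: cast names to str, then rename duplicates."""
    names = [str(c) for c in column_names]
    if names != list(column_names):
        warnings.warn("Some column names are not strings; converting to strings.")
    if len(set(names)) == len(names):
        return names
    warnings.warn("Found duplicated column names; renaming duplicates.")
    seen = Counter()
    result = []
    for name in names:
        seen[name] += 1
        # Keep the first occurrence unchanged and suffix the later ones.
        # Simplification: the PR's docstring mentions a random suffix instead.
        result.append(name if seen[name] == 1 else f"{name}__dup{seen[name] - 1}")
    return result


def check_same_columns(fitted_names, new_names):
    """Raise ValueError with a line-by-line diff if the column names differ."""
    if list(new_names) == list(fitted_names):
        return
    diff = "\n".join(difflib.Differ().compare(list(fitted_names), list(new_names)))
    raise ValueError(
        f"Columns of dataframes passed to fit() and transform() differ:\n{diff}"
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = clean_column_names([0, "a", "a"])
print(cleaned)

try:
    check_same_columns(["a", "b"], ["a", "c"])
except ValueError as error:
    message = str(error)
    print(message)
```

The diff in the error message marks names only present at fit time with `-` and names only present at transform time with `+`, which is usually quicker to read than two printed lists.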