Add column-wise transforms & refactor TableVectorizer #902
Merged: glemaitre merged 116 commits into skrub-data:main from jeromedockes:columnwise-transformations on May 28, 2024
258647f add column-wise transforms & refactor TV (jeromedockes)
e06d996 docstring validation (jeromedockes)
97d2371 add changelog (jeromedockes)
ebeb1ae do not force categories in ToCategorical (jeromedockes)
9f16bff remove pandasconvertdtypes step (jeromedockes)
af0db0f null value & string detection for pandas object column (jeromedockes)
3c6c1e4 fix test (jeromedockes)
120ad28 copy=False in make_dataframe_like (jeromedockes)
42dbe1a copy=False in concat pandas (jeromedockes)
cc37b63 no convert_dtypes() (jeromedockes)
8747231 pandas numpy vs nullable dtypes in tests (jeromedockes)
d0b995d update selectors tests (jeromedockes)
3d4cd2b finish updating tests (jeromedockes)
87e5a3c faster is_string for pandas object columns (jeromedockes)
4fdf48c Revert "faster is_string for pandas object columns" (jeromedockes)
93f7088 Merge remote-tracking branch 'upstream/main' into columnwise-transfor… (jeromedockes)
63fd3a6 better handling of string & object columns in CleanNullStrings (jeromedockes)
370a4c4 skip doctest when polars not installed (jeromedockes)
52e6673 adjustments to to_datetime + add docstring (jeromedockes)
f1600c8 udpate tests (jeromedockes)
642b759 iter (jeromedockes)
8ce8add treat bool as numeric (jeromedockes)
cf8706d use ToFloat instead of ToNumeric (jeromedockes)
5db0c1d remove unused function to_numeric (jeromedockes)
73d71c7 convert pandas string columns to str (object) (jeromedockes)
3fd749e small speedup string dtype checks (jeromedockes)
792ba91 add ToCategorical docstring (jeromedockes)
3fed421 split cleancategories and tocategorical (jeromedockes)
a2050f0 iter string dtype t object (jeromedockes)
9f1816e fix rename categories for old pandas versions (jeromedockes)
bd90c72 convert all remaining columns to strings (jeromedockes)
0384f5e remove remainder transformer (jeromedockes)
d25a079 update tests (jeromedockes)
27321e4 doctst output formatting (jeromedockes)
66719a4 doctst output (jeromedockes)
acd103f always cast to float32 in ToFloat (jeromedockes)
1773e66 detail (jeromedockes)
d86394f add column_kinds_ (jeromedockes)
f3c2686 _ (jeromedockes)
f611718 naming (jeromedockes)
f02af18 add to_float32 in input_to_processing_steps_ (jeromedockes)
7606602 iter docstring (jeromedockes)
d5b0aef add SingleColumnTransformer base class & move RejectColumn definition (jeromedockes)
92ee574 update doctests (jeromedockes)
9653dbb do not allow single column transformers to reject a column by default (jeromedockes)
4015142 improve docstrings (jeromedockes)
1278908 cleanup (jeromedockes)
a14e11c cleanup (jeromedockes)
efa28e3 improve method order + set_output api compat (jeromedockes)
97539ab compatibility with set_output api (jeromedockes)
47a0e64 docstrings + weekday starts at 1 (jeromedockes)
2abbdea iter example 1 (jeromedockes)
ba1348f detail (jeromedockes)
09069c9 better error message when a single-column transformer is applied to a… (jeromedockes)
30d3927 make the DatetimeEncoder single-column & remove EncodeDatetime (jeromedockes)
bb4b636 update docstrings & joiner (jeromedockes)
d6b6403 only check single-column once (jeromedockes)
15020b3 update example (jeromedockes)
bab7693 rename datetime_format -> format (jeromedockes)
1fff5de add some to_datetime tests (jeromedockes)
6b3496a tz tests (jeromedockes)
eecc1fe fill_nulls for polars dataframes (jeromedockes)
23fb464 Apply suggestions from code review (jeromedockes)
3aa994e typo (jeromedockes)
b5f20e0 adapt test for older polars (jeromedockes)
9cba72a old pandas datetime formats (jeromedockes)
c916605 Apply suggestions from code review (jeromedockes)
8161b20 created_by private (jeromedockes)
d2a9083 add check_input tests (jeromedockes)
2222c21 address some review comments (jeromedockes)
778aff4 fix test (jeromedockes)
b6b85f0 Apply suggestions from code review (jeromedockes)
52c77e3 address more review comments (jeromedockes)
5228e03 more review comments (jeromedockes)
42f17b8 fix docstring (jeromedockes)
60a6f9a simplify datetime format guessing (jeromedockes)
e5685f5 smaller + seeded datetime sample (jeromedockes)
afb1884 rename column_kinds_ and add column_to_kind_ (jeromedockes)
bc23d0d minor changes to tv docstring (jeromedockes)
d65ec3a show tablevectorizer specialized for hgb (jeromedockes)
5bedfdc docstrings (jeromedockes)
a8ece56 doc (jeromedockes)
6c5208e note about single-column transformers in docstrings (jeromedockes)
8efbd44 address review comment: unify to_float32 implementations (jeromedockes)
25bcbb3 fix to_float32 (jeromedockes)
7b81a5d Apply suggestions from code review (jeromedockes)
1f1b93e code review comments (jeromedockes)
3f1582b detail (jeromedockes)
dc22de2 singlecolumntransformer tests (jeromedockes)
71aee50 oneachcolumn tests (jeromedockes)
33c00d8 more tests (jeromedockes)
53d8e90 add check_is_fitted checks (jeromedockes)
8472061 on_subframe tests (jeromedockes)
3c84103 more to_datetime tests (jeromedockes)
a50fa00 remove unused function (jeromedockes)
6e356b2 clean_categories tests (jeromedockes)
7e291f4 add test_clean_null_strings (jeromedockes)
dca3d65 add test_to_categorical (jeromedockes)
c9b0fb6 add test_to_float32 (jeromedockes)
7e654dd more tests (jeromedockes)
1586558 misc testing (jeromedockes)
7763031 fix test (jeromedockes)
935e69d add notes and checks that transformer parameters output dataframes (jeromedockes)
88a2ea0 details (jeromedockes)
0c90757 _ (jeromedockes)
2afaa4d add test with supervised encoder (jeromedockes)
111a733 line lengths in docstrings (jeromedockes)
18aeec5 rename day_of_the_week → weekday (jeromedockes)
008b5e2 iter example (jeromedockes)
d841526 fix test (jeromedockes)
4c7a761 Apply suggestions from code review (jeromedockes)
f6e3137 address review comments (jeromedockes)
0341f94 check_is_fitted (jeromedockes)
b539ea7 . (jeromedockes)
bbc4c84 adress review comment (jeromedockes)
12df400 formatting (jeromedockes)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,184 @@ | ||
import warnings | ||
|
||
import numpy as np | ||
from sklearn.base import BaseEstimator, TransformerMixin | ||
from sklearn.utils.validation import check_is_fitted | ||
|
||
from . import _dataframe as sbd | ||
from . import _join_utils, _utils | ||
from ._dispatch import dispatch | ||
|
||
__all__ = ["CheckInputDataFrame"] | ||
|
||
|
||
def _column_names_to_strings(column_names): | ||
non_string = [c for c in column_names if not isinstance(c, str)] | ||
if not non_string: | ||
return column_names | ||
warnings.warn( | ||
f"Some column names are not strings: {non_string}. All column names" | ||
" must be strings; converting to strings." | ||
) | ||
return list(map(str, column_names)) | ||
|
||
|
||
def _deduplicated_column_names(column_names): | ||
duplicates = _utils.get_duplicates(column_names) | ||
if not duplicates: | ||
return column_names | ||
warnings.warn( | ||
f"Found duplicated column names: {duplicates}. Please make sure column names" | ||
" are unique. Renaming columns that have duplicated names." | ||
) | ||
return _join_utils.pick_column_names(column_names) | ||
|
||
|
||
def _cleaned_column_names(column_names): | ||
return _deduplicated_column_names(_column_names_to_strings(column_names)) | ||
|
||
|
||
@dispatch | ||
def _check_not_pandas_sparse(df): | ||
pass | ||
|
||
|
||
@_check_not_pandas_sparse.specialize("pandas") | ||
def _check_not_pandas_sparse_pandas(df): | ||
import pandas as pd | ||
|
||
sparse_cols = [ | ||
col for col in df.columns if isinstance(df[col].dtype, pd.SparseDtype) | ||
] | ||
if sparse_cols: | ||
raise TypeError( | ||
f"Columns {sparse_cols} are sparse Pandas series, but dense " | ||
"data is required. Use ``df[col].sparse.to_dense()`` to convert " | ||
"a series from sparse to dense." | ||
) | ||
|
||
|
||
def _check_is_dataframe(df): | ||
if not sbd.is_dataframe(df): | ||
raise TypeError( | ||
"Only pandas and polars DataFrames are supported. Cannot handle X of" | ||
f" type: {type(df)}." | ||
) | ||
|
||
|
||
def _collect_lazyframe(df): | ||
if not sbd.is_lazyframe(df): | ||
return df | ||
warnings.warn( | ||
"At the moment, skrub only works on eager DataFrames, calling collect()." | ||
) | ||
return sbd.collect(df) | ||
|
||
|
||
class CheckInputDataFrame(TransformerMixin, BaseEstimator): | ||
"""Check the dataframe entering a skrub pipeline. | ||
|
||
This transformer ensures that: | ||
|
||
- The input is a dataframe. | ||
- Numpy arrays are converted to pandas dataframes with a warning. | ||
- The dataframe library is the same during ``fit`` and ``transform``, e.g. | ||
fitting on a polars dataframe and then transforming a pandas dataframe is | ||
not allowed. | ||
- A TypeError is raised otherwise. | ||
- Column names are unique strings. | ||
- Non-strings are cast to strings. | ||
- A random suffix is added to duplicated names. | ||
- If either of these operations is needed, a warning is emitted. | ||
- Only applies to pandas; polars column names are always unique strings. | ||
- The input is not sparse. | ||
- A TypeError is raised otherwise. | ||
- The input is not a ``LazyFrame``. | ||
- A ``LazyFrame`` is ``collect``ed with a warning. | ||
- The column names are the same during ``fit`` and ``transform``. | ||
- A ValueError is raised otherwise. | ||
|
||
Attributes | ||
---------- | ||
module_name_ : str | ||
The name of the dataframe module, 'polars' or 'pandas'. | ||
feature_names_in_ : list | ||
The column names of the input (before cleaning). | ||
n_features_in_ : int | ||
The number of input columns. | ||
feature_names_out_ : list of str | ||
The column names after converting to string and deduplication. | ||
""" | ||
|
||
def fit(self, X, y=None): | ||
self.fit_transform(X, y) | ||
return self | ||
|
||
def fit_transform(self, X, y=None): | ||
del y | ||
X = self._handle_array(X) | ||
_check_is_dataframe(X) | ||
self.module_name_ = sbd.dataframe_module_name(X) | ||
# TODO check schema (including dtypes) not just names. | ||
# Need to decide how strict we should be about types | ||
column_names = sbd.column_names(X) | ||
self.feature_names_in_ = column_names | ||
self.n_features_in_ = len(column_names) | ||
self.feature_names_out_ = _cleaned_column_names(column_names) | ||
if sbd.column_names(X) != self.feature_names_out_: | ||
X = sbd.set_column_names(X, self.feature_names_out_) | ||
_check_not_pandas_sparse(X) | ||
X = _collect_lazyframe(X) | ||
return X | ||
|
||
def transform(self, X): | ||
check_is_fitted(self, "module_name_") | ||
X = self._handle_array(X) | ||
_check_is_dataframe(X) | ||
module_name = sbd.dataframe_module_name(X) | ||
if module_name != self.module_name_: | ||
raise TypeError( | ||
f"Pipeline was fitted to a {self.module_name_} dataframe " | ||
f"but is being applied to a {module_name} dataframe. " | ||
"This is likely to produce errors and is not supported." | ||
) | ||
column_names = sbd.column_names(X) | ||
if column_names != self.feature_names_in_: | ||
import difflib | ||
|
||
diff = "\n".join( | ||
difflib.Differ().compare(self.feature_names_in_, column_names) | ||
) | ||
message = ( | ||
f"Columns of dataframes passed to fit() and transform() differ:\n{diff}" | ||
|
||
) | ||
raise ValueError(message) | ||
if sbd.column_names(X) != self.feature_names_out_: | ||
X = sbd.set_column_names(X, self.feature_names_out_) | ||
_check_not_pandas_sparse(X) | ||
X = _collect_lazyframe(X) | ||
return X | ||
|
||
def _handle_array(self, X): | ||
if not isinstance(X, np.ndarray): | ||
return X | ||
if X.ndim != 2: | ||
raise ValueError( | ||
"Input should be a DataFrame. Found an array with incompatible shape:" | ||
f" {X.shape}." | ||
) | ||
warnings.warn( | ||
"Only pandas and polars DataFrames are supported, but input is a Numpy" | ||
" array. Please convert Numpy arrays to DataFrames before passing them to" | ||
" skrub transformers. Converting to pandas DataFrame with columns" | ||
" ['0', '1', …]." | ||
) | ||
import pandas as pd | ||
|
||
columns = list(map(str, range(X.shape[1]))) | ||
X = pd.DataFrame(X, columns=columns) | ||
return X | ||
|
||
# set_output api compatibility | ||
|
||
def get_feature_names_out(self): | ||
return self.feature_names_out_ |
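The column-name handling in the diff above (stringify, deduplicate, then compare names between `fit` and `transform`) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's implementation: `clean_column_names` and `column_diff` are hypothetical helpers, and the random-suffix scheme is an assumption standing in for `_join_utils.pick_column_names`, whose exact behavior is not shown in this diff.

```python
import difflib
import secrets
import warnings


def clean_column_names(column_names):
    """Hypothetical stand-in for _cleaned_column_names: stringify, then deduplicate."""
    names = list(column_names)
    if any(not isinstance(c, str) for c in names):
        warnings.warn("Some column names are not strings; converting to strings.")
        names = [str(c) for c in names]
    if len(set(names)) == len(names):
        return names
    warnings.warn("Found duplicated column names; renaming duplicates.")
    taken = set()
    result = []
    for name in names:
        candidate = name
        # Assumed scheme: the first occurrence keeps its name,
        # repeats get a random suffix until the name is unused.
        while candidate in taken:
            candidate = f"{name}__{secrets.token_hex(4)}"
        taken.add(candidate)
        result.append(candidate)
    return result


def column_diff(fitted_names, new_names):
    """Line-by-line diff of two column lists, as used in the ValueError message."""
    return "\n".join(difflib.Differ().compare(fitted_names, new_names))


if __name__ == "__main__":
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        print(clean_column_names([0, "a", "a"]))
    print(column_diff(["a", "b"], ["a", "c"]))
```

In the diff produced by `difflib.Differ`, lines prefixed with `- ` are columns seen only during `fit` and lines prefixed with `+ ` are columns seen only during `transform`, which is what makes the error message in `transform()` readable for wide tables.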
We should remove it then.