Added DropNullColumn transformer to remove columns that contain only nulls #1115

Merged 64 commits on Nov 18, 2024
Commits
bee630f
Adding code for DropNull
rcap107 Oct 17, 2024
ccc9a02
Fixed line
rcap107 Oct 17, 2024
d3f9c90
renamed script
rcap107 Oct 17, 2024
9d42b95
Added new common functions for drop and is_all_null
rcap107 Oct 17, 2024
f249982
Fixed code
rcap107 Oct 17, 2024
a1caf39
Added test for dropcol
rcap107 Oct 17, 2024
b0e3235
Removing dev script
rcap107 Oct 17, 2024
90be825
Update skrub/tests/test_dropnulls.py
rcap107 Oct 21, 2024
55764a8
Renamed file
rcap107 Oct 21, 2024
c8fdaaa
Renamed file
rcap107 Oct 21, 2024
0cdc0bd
Formatting
rcap107 Oct 21, 2024
34c0095
Merge branch 'drop_null_columns' of https://github.com/rcap107/skrub …
rcap107 Oct 21, 2024
430c8e3
Rename file
rcap107 Oct 21, 2024
80bd408
Added docstrings
rcap107 Oct 21, 2024
e2ca33f
Fixing imports and refactoring names
rcap107 Oct 21, 2024
4dbba09
Formatting
rcap107 Oct 21, 2024
7d6f8ce
Updated changelog.
rcap107 Oct 21, 2024
4771d18
Formatting
rcap107 Oct 21, 2024
f0b521a
Removing function because it was not needed
rcap107 Oct 21, 2024
ea9893b
Updated test
rcap107 Oct 21, 2024
c73db7e
Merge branch 'main' into drop_null_columns
rcap107 Oct 21, 2024
09cf9c7
Improving tests
rcap107 Oct 21, 2024
4e4f255
Merge branch 'drop_null_columns' of https://github.com/rcap107/skrub …
rcap107 Oct 21, 2024
754e2ef
Updated test
rcap107 Oct 22, 2024
acafac6
Merge remote-tracking branch 'main_repo/main' into drop_null_columns
rcap107 Oct 22, 2024
4b0aa1c
Fixed is_all_null based on comments
rcap107 Oct 22, 2024
35f8909
Renaming files for consistency
rcap107 Oct 22, 2024
b4e419f
Removing init
rcap107 Oct 22, 2024
75f1110
Moving DropNullColumn after CleanNullStrings
rcap107 Oct 22, 2024
e499dc1
Moved check on drop from transform to fit_transform
rcap107 Oct 22, 2024
c296829
Fixed changelog
rcap107 Oct 22, 2024
ee6b7b5
Moved tests and improved coverage
rcap107 Oct 22, 2024
92210b7
Moved tv test to the proper file
rcap107 Oct 24, 2024
4cad44a
Updated test to make it make sense
rcap107 Oct 24, 2024
836a636
Improving comment
rcap107 Oct 24, 2024
4ec95d6
Improving comment
rcap107 Oct 24, 2024
3c25b84
Removed unneeded code
rcap107 Oct 24, 2024
8638516
Changed default value to True
rcap107 Oct 24, 2024
e70f513
Formatting
rcap107 Oct 24, 2024
6083567
Added back code that should have been there in the first place
rcap107 Oct 24, 2024
a543044
Changed the default parameter
rcap107 Oct 24, 2024
92f5430
Changed to use df interface
rcap107 Oct 24, 2024
24b18ba
Merge remote-tracking branch 'main_repo/main' into drop_null_columns
rcap107 Oct 24, 2024
62ef9d6
Fixed docstring.
rcap107 Oct 24, 2024
53cb8bd
Update skrub/_drop_null_column.py
jeromedockes Oct 24, 2024
7af96ca
Renaming transformer to DropColumnIfNull.
rcap107 Oct 25, 2024
11908b3
Merge branch 'drop_null_columns' of https://github.com/rcap107/skrub …
rcap107 Oct 25, 2024
2499a37
Update skrub/_dataframe/tests/test_common.py
rcap107 Oct 29, 2024
548b792
Removed a coverage file
rcap107 Oct 29, 2024
58feaed
Fix formatting of docstring
rcap107 Oct 29, 2024
5a6539c
Formatting
rcap107 Oct 29, 2024
36c46d4
Whoops
rcap107 Oct 29, 2024
98b6c10
Altering the code to add different options and changing the default
rcap107 Nov 8, 2024
399954a
Improvements to formatting and docstring.
rcap107 Nov 8, 2024
32ca7a0
Adding error checking
rcap107 Nov 8, 2024
b311317
Updated documentation
rcap107 Nov 8, 2024
a04fb50
Fixed tests
rcap107 Nov 8, 2024
7b635ef
Changing exception
rcap107 Nov 8, 2024
43a61d4
Revert "Changing exception"
rcap107 Nov 18, 2024
c48a63d
Revert "Fixed tests"
rcap107 Nov 18, 2024
5704ebf
Revert "Updated documentation"
rcap107 Nov 18, 2024
ab5af46
Revert "Adding error checking"
rcap107 Nov 18, 2024
3f69bde
Revert "Improvements to formatting and docstring."
rcap107 Nov 18, 2024
801d745
Revert "Altering the code to add different options and changing the d…
rcap107 Nov 18, 2024
3 changes: 3 additions & 0 deletions CHANGES.rst
@@ -61,6 +61,9 @@ Minor changes
is now always visible when scrolling the table. :pr:`1102` by :user:`Jérôme
Dockès <jeromedockes>`.

* Added a ``DropColumnIfNull`` transformer that drops columns that contain only null
  values. :pr:`1115` by :user:`Riccardo Cappuzzo <riccardocappuzzo>`.

Bug fixes
---------

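The behavior described by the changelog entry can be sketched without skrub at all. Below is a minimal pandas-only stand-in (the helper name `drop_all_null_columns` is ours, not skrub's) showing what "drop columns that contain only nulls" means at the dataframe level:

```python
import pandas as pd

def drop_all_null_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only the columns that have at least one non-null value,
    # mirroring what DropColumnIfNull decides column by column.
    return df.loc[:, df.notna().any()]

df = pd.DataFrame({"idx": [1, 2, 3], "all_null": [None, None, None]})
print(list(drop_all_null_columns(df).columns))  # ['idx']
```

In the actual PR this decision is made one column at a time by a transformer, so it can participate in the `TableVectorizer` preprocessing chain rather than being a one-off dataframe pass.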
23 changes: 23 additions & 0 deletions skrub/_dataframe/_common.py
@@ -74,6 +74,7 @@
"to_datetime",
"is_categorical",
"to_categorical",
"is_all_null",
#
# Inspecting, selecting and modifying values
#
@@ -841,6 +842,28 @@
return _cast_polars(col, pl.Categorical())


@dispatch
def is_all_null(col):
    raise NotImplementedError()


@is_all_null.specialize("pandas", argument_type="Column")
def _is_all_null_pandas(col):
    return all(is_null(col))


@is_all_null.specialize("polars", argument_type="Column")
def _is_all_null_polars(col):
    # Column type is Null
    if col.dtype == pl.Null:
        return True

    # Column type is not Null, but all values are nulls: more efficient
    if col.null_count() == col.len():
        return True

    # Column type is not Null, not all values are null (check if NaN etc.): slower
    return all(is_null(col))


#
# Inspecting, selecting and modifying values
# ==========================================
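The pandas branch of `is_all_null` works because `isna()` treats every null flavor uniformly. A standalone sketch of that semantics (our own function, not the skrub dispatch machinery):

```python
import numpy as np
import pandas as pd

def is_all_null(col: pd.Series) -> bool:
    # pandas' isna() flags NaN, None and pd.NA alike, so this mirrors
    # the `all(is_null(col))` fallback used in the implementation above.
    return bool(col.isna().all())

print(is_all_null(pd.Series([np.nan, None, pd.NA])))   # True
print(is_all_null(pd.Series(["almost", None, None])))  # False
```

The polars version above adds two fast paths first (dtype `Null`, then `null_count() == len()`) because the element-wise fallback is the slowest check; only columns whose nulls are NaN-like rather than true polars nulls reach it.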
27 changes: 27 additions & 0 deletions skrub/_dataframe/tests/test_common.py
@@ -557,6 +557,33 @@ def test_to_categorical(df_module):
assert list(s.cat.categories) == list("ab")


def test_is_all_null(df_module):
    """Check that is_all_null is evaluating null counts correctly."""
    # Check that all null columns are marked as "all null"
    assert ns.is_all_null(df_module.make_column("all_null", [None, None, None]))
    assert ns.is_all_null(df_module.make_column("all_nan", [np.nan, np.nan, np.nan]))
    assert ns.is_all_null(
        df_module.make_column("all_nan_or_null", [np.nan, np.nan, None])
    )

    # Check that the other columns are *not* marked as "all null"
    assert not ns.is_all_null(
        df_module.make_column("almost_all_null", ["almost", None, None])
    )
    assert not ns.is_all_null(
        df_module.make_column("almost_all_nan", [2.5, None, None])
    )


def test_is_all_null_polars(pl_module):
    """Special case for polars: column is full of nulls but doesn't have dtype Null."""
    col = pl_module.make_column("col", [1, None, None])
    col = col[1:]

    assert ns.is_all_null(col)


# Inspecting, selecting and modifying values
# ==========================================
#
48 changes: 48 additions & 0 deletions skrub/_drop_column_if_null.py
@@ -0,0 +1,48 @@
# drop columns that contain all null values
from sklearn.utils.validation import check_is_fitted

from . import _dataframe as sbd
from ._on_each_column import SingleColumnTransformer

__all__ = ["DropColumnIfNull"]


class DropColumnIfNull(SingleColumnTransformer):
    """Drop a single column if it contains only Null, NaN, or a mixture of null
    values. If at least one non-null value is found, the column is kept."""

    def fit_transform(self, column, y=None):
        """Fit the encoder and transform a column.

        Parameters
        ----------
        column : Pandas or Polars series. The input column to check.
        y : None. Ignored.

        Returns
        -------
        The input column, or an empty list if the column contains only null values.
        """
        del y

        self.drop_ = sbd.is_all_null(column)

        return self.transform(column)

    def transform(self, column):
        """Transform a column.

        Parameters
        ----------
        column : Pandas or Polars series. The input column to check.

        Returns
        -------
        column
            The input column, or an empty list if the column contains only null values.
        """
        check_is_fitted(self)

        if self.drop_:
            return []
        return column
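The transformer's contract is small: decide at fit time, replay the decision at transform time. A standalone pandas-only stand-in (not skrub's class — it drops the `sbd` dispatch layer and the scikit-learn fitted check) makes the shape of that contract easy to see:

```python
import pandas as pd

class DropColumnIfNullSketch:
    """Pandas-only stand-in for the transformer above: remember at fit
    time whether the column was entirely null, then either drop it
    (return an empty list) or pass it through on transform."""

    def fit_transform(self, column: pd.Series, y=None):
        # Record the decision so transform() replays it on new data.
        self.drop_ = bool(column.isna().all())
        return self.transform(column)

    def transform(self, column: pd.Series):
        return [] if self.drop_ else column

t = DropColumnIfNullSketch()
print(t.fit_transform(pd.Series([None, None])))  # []
```

Returning `[]` (rather than an empty Series) is how single-column transformers in this codebase signal "this column produces no output columns".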
29 changes: 21 additions & 8 deletions skrub/_table_vectorizer.py
@@ -16,6 +16,7 @@
from ._clean_categories import CleanCategories
from ._clean_null_strings import CleanNullStrings
from ._datetime_encoder import DatetimeEncoder
from ._drop_column_if_null import DropColumnIfNull
from ._gap_encoder import GapEncoder
from ._on_each_column import SingleColumnTransformer
from ._select_cols import Drop
@@ -191,6 +192,9 @@ class TableVectorizer(TransformerMixin, BaseEstimator):
similar functionality to what is offered by scikit-learn's
:class:`~sklearn.compose.ColumnTransformer`.

drop_null_columns : bool, default=True
If set to `True`, columns that contain only null values are dropped.

n_jobs : int, default=None
Number of jobs to run in parallel.
``None`` means 1 unless in a joblib ``parallel_backend`` context.
@@ -309,12 +313,13 @@ class TableVectorizer(TransformerMixin, BaseEstimator):

Before applying the main transformer, the ``TableVectorizer`` applies
several preprocessing steps, for example to detect numbers or dates that are
represented as strings. By default, columns that contain only null values are
dropped. Moreover, a final post-processing step is applied to all
non-categorical columns in the encoder's output to cast them to float32.
We can inspect all the processing steps that were applied to a given column:

>>> vectorizer.all_processing_steps_['B']
[CleanNullStrings(), DropColumnIfNull(), ToDatetime(), DatetimeEncoder(), {'B_day': ToFloat32(), 'B_month': ToFloat32(), ...}]

Note that as the encoder (``DatetimeEncoder()`` above) produces multiple
columns, the last processing step is not described by a single transformer
@@ -323,7 +328,7 @@
``all_processing_steps_`` is useful to inspect the details of the
choices made by the ``TableVectorizer`` during preprocessing, for example:

>>> vectorizer.all_processing_steps_['B'][2]
ToDatetime()
>>> _.format_
'%d/%m/%Y'
@@ -389,7 +394,7 @@
``ToDatetime()``:

>>> vectorizer.all_processing_steps_
{'A': [Drop()], 'B': [OrdinalEncoder()], 'C': [CleanNullStrings(), DropColumnIfNull(), ToFloat32(), PassThrough(), {'C': ToFloat32()}]}

Specifying several ``specific_transformers`` for the same column is not allowed.

@@ -412,6 +417,7 @@ def __init__(
numeric=NUMERIC_TRANSFORMER,
datetime=DATETIME_TRANSFORMER,
specific_transformers=(),
drop_null_columns=True,
n_jobs=None,
):
self.cardinality_threshold = cardinality_threshold
@@ -425,6 +431,7 @@
self.datetime = _utils.clone_if_default(datetime, DATETIME_TRANSFORMER)
self.specific_transformers = specific_transformers
self.n_jobs = n_jobs
self.drop_null_columns = drop_null_columns

def fit(self, X, y=None):
"""Fit transformer.
@@ -536,13 +543,19 @@ def add_step(steps, transformer, cols, allow_reject=False):
        cols = s.all() - self._specific_columns

        self._preprocessors = [CheckInputDataFrame()]

        transformer_list = [CleanNullStrings()]
        if self.drop_null_columns:
            transformer_list.append(DropColumnIfNull())

        transformer_list += [
            ToDatetime(),
            ToFloat32(),
            CleanCategories(),
            ToStr(),
        ]

        for transformer in transformer_list:
            add_step(self._preprocessors, transformer, cols, allow_reject=True)

        self._encoders = []
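The `drop_null_columns` flag simply splices one extra step into the preprocessing chain. A standalone sketch of that assembly logic (the step names stand in for the real transformer instances):

```python
def build_preprocessor_names(drop_null_columns: bool = True) -> list:
    # Mirrors the assembly order in TableVectorizer.fit above.
    # DropColumnIfNull runs right after CleanNullStrings, so that
    # null-like strings ("N/A", "NULL", ...) are converted to real
    # nulls before the all-null check sees the column.
    steps = ["CheckInputDataFrame", "CleanNullStrings"]
    if drop_null_columns:
        steps.append("DropColumnIfNull")
    steps += ["ToDatetime", "ToFloat32", "CleanCategories", "ToStr"]
    return steps

print(build_preprocessor_names(drop_null_columns=False))
```

The ordering matters: one of the PR's commits is literally "Moving DropNullColumn after CleanNullStrings", which is what this sketch encodes.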
62 changes: 62 additions & 0 deletions skrub/tests/test_drop_column_if_null.py
@@ -0,0 +1,62 @@
import numpy as np
import pytest

from skrub import _dataframe as sbd
from skrub._drop_column_if_null import DropColumnIfNull


@pytest.fixture
def drop_null_table(df_module):
    return df_module.make_dataframe(
        {
            "idx": [1, 2, 3],
            "value_nan": [np.nan, np.nan, np.nan],
            "value_null": [None, None, None],
            "value_almost_nan": [2.5, np.nan, np.nan],
            "value_almost_null": ["almost", None, None],
            "mixed_null": [None, np.nan, None],
        }
    )


def test_single_column(drop_null_table, df_module):
    """Check that null columns are dropped and non-null columns are kept."""
    dn = DropColumnIfNull()
    assert dn.fit_transform(sbd.col(drop_null_table, "value_nan")) == []
    assert dn.fit_transform(sbd.col(drop_null_table, "value_null")) == []
    assert dn.fit_transform(sbd.col(drop_null_table, "mixed_null")) == []

    df_module.assert_column_equal(
        dn.fit_transform(sbd.col(drop_null_table, "idx")),
        df_module.make_column("idx", [1, 2, 3]),
    )
    df_module.assert_column_equal(
        dn.fit_transform(sbd.col(drop_null_table, "value_almost_nan")),
        df_module.make_column("value_almost_nan", [2.5, np.nan, np.nan]),
    )
    df_module.assert_column_equal(
        dn.fit_transform(sbd.col(drop_null_table, "value_almost_null")),
        df_module.make_column("value_almost_null", ["almost", None, None]),
    )
42 changes: 41 additions & 1 deletion skrub/tests/test_table_vectorizer.py
@@ -164,6 +164,28 @@ def _get_datetimes_dataframe():
)


def _get_missing_values_dataframe(categorical_dtype="object"):
    """
    Creates a simple DataFrame with some columns that contain only missing values.
    We'll use different types of missing values (np.nan, pd.NA, None)
    to test how the vectorizer handles full null columns with mixed null values.
    """
    return pd.DataFrame(
        {
            "int": pd.Series([15, 56, pd.NA, 12, 44], dtype="Int64"),
            "all_null": pd.Series(
                [None, None, None, None, None], dtype=categorical_dtype
            ),
            "all_nan": pd.Series(
                [np.nan, np.nan, np.nan, np.nan, np.nan], dtype="Float64"
            ),
            "mixed_nulls": pd.Series(
                [np.nan, None, pd.NA, "NULL", "NA"], dtype=categorical_dtype
            ),
        }
    )


def test_fit_default_transform():
X = _get_clean_dataframe()
vectorizer = TableVectorizer()
@@ -506,8 +528,11 @@ def test_changing_types(X_train, X_test, expected_X_out):
    """
    table_vec = TableVectorizer(
        # only extract the total seconds
        datetime=DatetimeEncoder(resolution=None),
        # True by default
        drop_null_columns=False,
    )

    table_vec.fit(X_train)
    X_out = table_vec.transform(X_test)
    assert (X_out.isna() == expected_X_out.isna()).all().all()

Review thread on ``drop_null_columns=False``:

rcap107 (Contributor, author): I set this to false to keep the original behavior with no DropNullColumns. Given that the default value is True, should I change the test so that the "default behavior" is what is tested here?

Reviewer (Member): I think it's ok the way you did it
@@ -734,3 +759,18 @@ def test_supervised_encoder(df_module):
    y = np.random.default_rng(0).normal(size=sbd.shape(X)[0])
    tv = TableVectorizer(low_cardinality=TargetEncoder())
    tv.fit_transform(X, y)


def test_drop_null_column():
    """Check that all null columns are dropped, and no more."""
    # Don't drop null columns
    X = _get_missing_values_dataframe()
    tv = TableVectorizer(drop_null_columns=False)
    transformed = tv.fit_transform(X)

    assert sbd.shape(transformed) == sbd.shape(X)

    # Drop null columns
    tv = TableVectorizer(drop_null_columns=True)
    transformed = tv.fit_transform(X)
    assert sbd.shape(transformed) == (sbd.shape(X)[0], 1)