
Added categorical encoding function #127

Closed

Conversation

Nandha951

Description

This pull request introduces a new categorical encoding feature to the project. The function supports One-Hot Encoding and Ordinal Encoding for categorical variables, allowing users to efficiently transform categorical data into numerical formats. This addition is designed to enhance data preprocessing capabilities within the framework.

The changes include:

  1. A new Python file categorical_encoding.py under the src/koheesio/ directory.
  2. Implementation of a flexible categorical_encoding function.
  3. Addition of an EncodingConfig class for user-configurable options (e.g., encoding type); see the usage sketch below.
  4. Comprehensive unit tests under tests/test_categorical_encoding.py.
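
A rough usage sketch based on the description above; the exact function signature and the EncodingConfig fields shown here are assumptions for illustration, not necessarily the final implementation:

```python
import pandas as pd

# hypothetical usage of the proposed API; names and parameters are illustrative only
from koheesio.categorical_encoding import EncodingConfig, categorical_encoding

df = pd.DataFrame({"color": ["red", "green", "blue"], "size": ["S", "M", "L"]})

# One-Hot Encoding of the "color" column
one_hot_config = EncodingConfig(encoding_type="one-hot", columns=["color"])
encoded_df = categorical_encoding(df, one_hot_config)

# Ordinal Encoding of the "size" column with an explicit category order
ordinal_config = EncodingConfig(
    encoding_type="ordinal",
    columns=["size"],
    ordinal_mapping={"size": {"S": 0, "M": 1, "L": 2}},
)
encoded_df = categorical_encoding(df, ordinal_config)
```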

Related Issue

This pull request addresses the need for robust categorical data transformation functionality as identified in the enhancement discussions.


Motivation and Context

This change is required to handle categorical data during data preprocessing for machine learning or analytics workflows. The new feature provides:

  • A One-Hot Encoding option for creating binary columns for each category while avoiding the dummy variable trap.
  • An Ordinal Encoding option to assign numerical values to categories, useful for models that can process ordinal relationships.

This functionality improves the versatility and usability of the Koheesio framework in real-world scenarios.
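
For illustration, the two encoding options above roughly correspond to the following plain-pandas operations (a sketch of the underlying transformations, not the code added in this PR):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-Hot Encoding: one binary column per category; drop_first avoids the dummy variable trap
one_hot = pd.get_dummies(df, columns=["color"], drop_first=True)

# Ordinal Encoding: map each category to an integer rank
ordinal = df.assign(color=df["color"].map({"red": 0, "green": 1, "blue": 2}))
```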


How Has This Been Tested?

The implementation was tested using unit tests created in tests/test_categorical_encoding.py. The tests include:

  1. One-Hot Encoding:
    • Validated the creation of binary columns for multiple categorical variables.
    • Verified the exclusion of original categorical columns.
  2. Ordinal Encoding:
    • Ensured that the correct numerical mappings were assigned to categories.
    • Checked for consistency when applied to multiple variables.

All tests were executed in the local environment using Python's unittest framework, and they passed successfully without affecting other parts of the codebase.


Screenshots (if appropriate):

N/A


Types of Changes

  • New feature (non-breaking change which adds functionality)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@Nandha951 Nandha951 requested a review from a team as a code owner November 25, 2024 21:32

@dannymeijer dannymeijer left a comment


First of all, thank you for your contribution, it's very much appreciated.

Please see the comments I gave above.

A few things:

  • Your code should be adjusted to use the Koheesio Step class, preferably a pandas-adjusted version of the Transformation class we have in the spark module
  • Your code should be moved to an appropriate module
  • The extra dependency you're introducing is not part of the pyproject atm
  • Also, I would like you to add extra documentation (a module docstring) to explain your use case; add some examples as well once you've adjusted your code

I would love to discuss with you the intent of what you are trying to achieve. Feel free to reach out in a DM/email - my contact information is in my profile (LinkedIn, email).

src/koheesio/categorical_encoding.py: 4 resolved review threads (outdated)
@dannymeijer

Please also see: #129


@dannymeijer dannymeijer left a comment


Starting to look really good! :)
I left some detailed comments on what I'd like to see changed

Comment on lines +44 to +45
"pandas>=1.5.0",
"scikit-learn>=1.2.0"

@dannymeijer dannymeijer Dec 1, 2024


I don't want to make these top-level dependencies, plus we already have pandas as an extra dependency.
Let's make an extra called "ml" and put scikit-learn in there. That way you can install the extra dependencies as koheesio[pandas,ml]
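
For illustration, the extra could be declared in pyproject.toml roughly like this (a minimal sketch using standard PEP 621 optional dependencies; the actual koheesio pyproject may organize its extras differently):

```toml
[project.optional-dependencies]
pandas = ["pandas>=1.5.0"]
ml = ["scikit-learn>=1.2.0"]
```

which would make the combination installable as `pip install "koheesio[pandas,ml]"`.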

import pandas as pd
from pydantic import BaseModel

class PandasCategoricalEncoding(BaseModel):

make this class a PandasStep: from koheesio.pandas import PandasStep

"""

columns: List[str]
encoding_type: str = "one-hot"

change the type to Literal["one-hot", "ordinal"], that way you don't need the extra check you put in the __init__ method
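
For illustration, a minimal standalone sketch (not the PR's class) of why the manual check becomes unnecessary: a pydantic Literal field rejects invalid values at construction time.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class Example(BaseModel):
    encoding_type: Literal["one-hot", "ordinal"] = "one-hot"


Example(encoding_type="ordinal")    # accepted
try:
    Example(encoding_type="label")  # not in the Literal -> pydantic raises
except ValidationError as exc:
    print(exc)
```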

drop_first: bool = True
ordinal_mapping: Dict[str, Dict] = None

def __init__(self, **kwargs):

This becomes obsolete if you make the type a Literal as stated above

Comment on lines +27 to +30
columns: List[str]
encoding_type: str = "one-hot"
drop_first: bool = True
ordinal_mapping: Dict[str, Dict] = None

Make all these Fields, example:

from koheesio.models import Field
...
class PandasCategoricalEncoding(PandasStep):

    columns: List[str] = Field(..., description="...")
    encoding_type: Literal["one-hot", "ordinal"] = Field(default="one-hot", description="...")
    ...


(and of course add appropriate description to each)

Comment on lines +37 to +57
def execute(self, data: pd.DataFrame) -> pd.DataFrame:
    """
    Executes the categorical encoding transformation on the provided dataset.

    Parameters
    ----------
    data : pd.DataFrame
        The input dataset to encode.

    Returns
    -------
    pd.DataFrame
        The dataset with the specified categorical columns encoded.
    """
    if self.encoding_type == 'one-hot':
        data = pd.get_dummies(data, columns=self.columns, drop_first=self.drop_first)
    elif self.encoding_type == 'ordinal':
        for column in self.columns:
            if column in data.columns and self.ordinal_mapping and column in self.ordinal_mapping:
                data[column] = data[column].map(self.ordinal_mapping[column]).fillna(-1).astype(int)
    return data

@dannymeijer dannymeijer Dec 1, 2024


a few things about execute (for when you change to Step):

  • execute takes no arguments, instead add the DataFrame as one of the input Fields; add something like this df: Optional[pd.DataFrame] = Field(default=None, description="...") - I will explain why you want this as Optional in a bit
  • execute is expected to deal with input (your Fields) and generate Output
  • this Output does not need to be returned explicitly, the Step parent-class takes care of this.
  • instead, add a .transform method that can take a DataFrame as input

This means you can change your code like this:

  1. add an Output class
  2. add a .transform method
  3. update your execute method accordingly

Should look something like this (of course add docstrings and things like that):

class PandasCategoricalEncoding(PandasStep):
    ...

    class Output(PandasStep.Output):
        df: pd.DataFrame = Field(..., description="output pandas DataFrame")

    def transform(self, df: Optional[pd.DataFrame] = None):
        # pandas DataFrames don't support truthiness checks, so compare against None explicitly
        self.df = df if df is not None else self.df
        if self.df is None:
            raise RuntimeError("No valid Dataframe was passed")
        self.execute()
        return self.output.df

    def execute(self) -> Output:
        if self.encoding_type == 'one-hot':
            self.output.df = pd.get_dummies(self.df, columns=self.columns, drop_first=self.drop_first)
        elif self.encoding_type == 'ordinal':
            data = self.df
            for column in self.columns:
                if column in data.columns and self.ordinal_mapping and column in self.ordinal_mapping:
                    data[column] = data[column].map(self.ordinal_mapping[column]).fillna(-1).astype(int)
            self.output.df = data

You can then interact with your class like this:

encoding_step = PandasCategoricalEncoding(
    columns=["color"],
    encoding_type="one-hot",
    drop_first=False  # Adjusted to match expected columns
)
encoded_data = encoding_step.transform(self.data)

This way, the interface matches what we do for Spark.
Note: I will work on making Transformation base classes for Pandas in a separate PR.

For reference:

class Transformation(SparkStep, ABC):
    """Base class for all transformations

    Concept
    -------
    A Transformation is a Step that takes a DataFrame as input and returns a DataFrame as output. The DataFrame is
    transformed based on the logic implemented in the `execute` method. Any additional parameters that are needed for
    the transformation can be passed to the constructor.

    Parameters
    ----------
    df : Optional[DataFrame]
        The DataFrame to apply the transformation to. If not provided, the DataFrame has to be passed to the
        transform-method.

    Example
    -------
    ### Implementing a transformation using the Transformation class:

    ```python
    from koheesio.steps.transformations import Transformation
    from pyspark.sql import functions as f


    class AddOne(Transformation):
        target_column: str = "new_column"

        def execute(self):
            self.output.df = self.df.withColumn(
                self.target_column, f.col("old_column") + 1
            )
    ```

    In the example above, the `execute` method is implemented to add 1 to the values of the `old_column` and store the
    result in a new column called `new_column`.

    ### Using the transformation:

    In order to use this transformation, we can call the `transform` method:

    ```python
    from pyspark.sql import SparkSession

    # create a DataFrame with 3 rows
    df = SparkSession.builder.getOrCreate().range(3)

    output_df = AddOne().transform(df)
    ```

    The `output_df` will now contain the original DataFrame with an additional column called `new_column` with the
    values of `old_column` + 1.

    __output_df:__

    |id|new_column|
    |--|----------|
    | 0| 1|
    | 1| 2|
    | 2| 3|

    ...

    ### Alternative ways to use the transformation:

    Alternatively, we can pass the DataFrame to the constructor and call the `execute` or `transform` method without
    any arguments:

    ```python
    output_df = AddOne(df).transform()
    # or
    output_df = AddOne(df).execute().output.df
    ```

    > Note: that the transform method was not implemented explicitly in the AddOne class. This is because the `transform`
    method is already implemented in the `Transformation` class. This means that all classes that inherit from the
    Transformation class will have the `transform` method available. Only the execute method needs to be implemented.

    ### Using the transformation as a function:

    The transformation can also be used as a function as part of a DataFrame's `transform` method:

    ```python
    input_df = spark.range(3)

    output_df = input_df.transform(AddOne(target_column="foo")).transform(
        AddOne(target_column="bar")
    )
    ```

    In the above example, the `AddOne` transformation is applied to the `input_df` DataFrame using the `transform`
    method. The `output_df` will now contain the original DataFrame with an additional columns called `foo` and
    `bar', each with the values of `id` + 1.
    """

    df: Optional[DataFrame] = Field(default=None, description="The Spark DataFrame")

    @abstractmethod
    def execute(self) -> SparkStep.Output:
        """Execute on a Transformation should handle self.df (input) and set self.output.df (output)

        This method should be implemented in the child class. The input DataFrame is available as `self.df` and the
        output DataFrame should be stored in `self.output.df`.

        For example:
        ```python
        def execute(self):
            self.output.df = self.df.withColumn(
                "new_column", f.col("old_column") + 1
            )
        ```

        The transform method will call this method and return the output DataFrame.
        """
        # self.df  # input dataframe
        # self.output.df  # output dataframe
        self.output.df = ...  # implement the transformation logic
        raise NotImplementedError

    def transform(self, df: Optional[DataFrame] = None) -> DataFrame:
        """Execute the transformation and return the output DataFrame

        Note: when creating a child from this, don't implement this transform method. Instead, implement execute!

        See Also
        --------
        `Transformation.execute`

        Parameters
        ----------
        df: Optional[DataFrame]
            The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor
            will be used.

        Returns
        -------
        DataFrame
            The transformed DataFrame
        """
        self.df = df or self.df
        if not self.df:
            raise RuntimeError("No valid Dataframe was passed")
        self.execute()
        return self.output.df

    def __call__(self, *args, **kwargs):
        """Allow the class to be called as a function.

        This is especially useful when using a DataFrame's transform method.

        Example
        -------
        ```python
        input_df = spark.range(3)

        output_df = input_df.transform(AddOne(target_column="foo")).transform(
            AddOne(target_column="bar")
        )
        ```

        In the above example, the `AddOne` transformation is applied to the `input_df` DataFrame using the `transform`
        method. The `output_df` will now contain the original DataFrame with an additional columns called `foo` and
        `bar', each with the values of `id` + 1.
        """
        return self.transform(*args, **kwargs)


making df an Optional type allows us to either give the df as an argument when initializing the class, or pass it through transform - this is exactly how we do it inside the Spark module at the moment
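
For example, with df as an optional field, both of these hypothetical calls would work (a sketch only, mirroring the Spark usage shown above):

```python
# supply the DataFrame when constructing the step, then call transform() with no arguments...
output_df = PandasCategoricalEncoding(df=df, columns=["color"]).transform()

# ...or construct without a DataFrame and pass it to transform() later
output_df = PandasCategoricalEncoding(columns=["color"]).transform(df)
```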

import unittest
import pandas as pd

class TestPandasCategoricalEncoding(unittest.TestCase):

don't use unittest - use pytest.

  1. get rid of the unittest.TestCase (just let it be a regular class)
  2. change your self.assert... (from unittest) to regular python assert (this is how pytest works)
  3. get rid of your setUp - just make the input dataframe a module level variable, OR make it a fixture (a bit overkill for your purpose here)
  4. change your code to match the interface I proposed above (a rough sketch follows below)
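
A rough sketch of what the pytest-style test could look like under those suggestions (module-level input data, plain asserts, and the transform interface proposed above; the import path and names are illustrative):

```python
import pandas as pd

from koheesio.pandas.categorical_encoding import PandasCategoricalEncoding

# module-level input data instead of unittest's setUp
data = pd.DataFrame({"color": ["red", "green", "blue"]})


class TestPandasCategoricalEncoding:
    def test_one_hot_encoding(self):
        encoding_step = PandasCategoricalEncoding(
            columns=["color"], encoding_type="one-hot", drop_first=False
        )
        encoded = encoding_step.transform(data)

        # plain asserts instead of unittest's self.assert* methods
        assert "color_red" in encoded.columns
        assert "color" not in encoded.columns
```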

Comment on lines +2 to +5
import pandas as pd
from src.koheesio.pandas.categorical_encoding import PandasCategoricalEncoding
import unittest
import pandas as pd

@dannymeijer dannymeijer Dec 1, 2024


you're importing pandas twice. Also, import pandas through the koheesio module to avoid conflict:
from koheesio.pandas import pandas as pd

from koheesio.steps import Step

from typing import List, Dict
import pandas as pd

import pandas through the koheesio module (as stated above) to avoid conflict:
from koheesio.pandas import pandas as pd

Please run isort (through make fmt or hatch fmt, or run ruff)

@dannymeijer dannymeijer added this to the 0.10.0 milestone Dec 1, 2024
@dannymeijer

Since no response has been provided in several weeks, I am closing this PR. Please re-open or submit a new contribution request once you feel ready to do so and once the concerns have been addressed.
