
Added categorical encoding function #127

Closed

Conversation

Nandha951

Description

This pull request introduces a new categorical encoding feature to the project. The function supports One-Hot Encoding and Ordinal Encoding for categorical variables, allowing users to efficiently transform categorical data into numerical formats. This addition is designed to enhance data preprocessing capabilities within the framework.

The changes include:

  1. A new Python file categorical_encoding.py under the src/koheesio/ directory.
  2. Implementation of a flexible categorical_encoding function.
  3. Addition of an EncodingConfig class for user-configurable options (e.g., encoding type); see the usage sketch below.
  4. Comprehensive unit tests under tests/test_categorical_encoding.py.
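
A rough usage sketch based on the description above; the exact function signature and the EncodingConfig fields shown here are assumptions for illustration, not necessarily the final implementation:

```python
import pandas as pd

# hypothetical usage of the proposed API; names and parameters are illustrative only
from koheesio.categorical_encoding import EncodingConfig, categorical_encoding

df = pd.DataFrame({"color": ["red", "green", "blue"], "size": ["S", "M", "L"]})

# One-Hot Encoding of the "color" column
one_hot_config = EncodingConfig(encoding_type="one-hot", columns=["color"])
encoded_df = categorical_encoding(df, one_hot_config)

# Ordinal Encoding of the "size" column with an explicit category order
ordinal_config = EncodingConfig(
    encoding_type="ordinal",
    columns=["size"],
    ordinal_mapping={"size": {"S": 0, "M": 1, "L": 2}},
)
encoded_df = categorical_encoding(df, ordinal_config)
```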

Related Issue

This pull request addresses the need for robust categorical data transformation functionality as identified in the enhancement discussions.


Motivation and Context

This change is required to handle categorical data during data preprocessing for machine learning or analytics workflows. The new feature provides:

  • A One-Hot Encoding option for creating binary columns for each category while avoiding the dummy variable trap.
  • An Ordinal Encoding option to assign numerical values to categories, useful for models that can process ordinal relationships.

This functionality improves the versatility and usability of the Koheesio framework in real-world scenarios.
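
For illustration, the two encoding options above roughly correspond to the following plain-pandas operations (a sketch of the underlying transformations, not the code added in this PR):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-Hot Encoding: one binary column per category; drop_first avoids the dummy variable trap
one_hot = pd.get_dummies(df, columns=["color"], drop_first=True)

# Ordinal Encoding: map each category to an integer rank
ordinal = df.assign(color=df["color"].map({"red": 0, "green": 1, "blue": 2}))
```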


How Has This Been Tested?

The implementation was tested using unit tests created in tests/test_categorical_encoding.py. The tests include:

  1. One-Hot Encoding:
    • Validated the creation of binary columns for multiple categorical variables.
    • Verified the exclusion of original categorical columns.
  2. Ordinal Encoding:
    • Ensured that the correct numerical mappings were assigned to categories.
    • Checked for consistency when applied to multiple variables.

All tests were executed in the local environment using Python's unittest framework, and they passed successfully without affecting other parts of the codebase.


Screenshots (if appropriate):

N/A


Types of Changes

  • New feature (non-breaking change which adds functionality)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@Nandha951 Nandha951 requested a review from a team as a code owner November 25, 2024 21:32

@dannymeijer dannymeijer left a comment


First of all, thank you for your contribution, it's very much appreciated.

Please see the comments I gave above.

A few things:

  • Your code should be adjusted to use the Koheesio Step class, preferably a pandas-adjusted version of the Transformation class we have in the spark module
  • Your code should be moved to an appropriate module
  • The extra dependency you're introducing is not part of the pyproject atm
  • Also, I would like you to add extra documentation (a module docstring) to explain your use case; add some examples as well once you've adjusted your code

I would love to discuss with you the intent of what you are trying to achieve. Feel free to reach out in a DM/email - my contact information is in my profile (LinkedIn, email).

src/koheesio/categorical_encoding.py: 4 resolved review threads (outdated)
@dannymeijer

Please also see: #129


@dannymeijer dannymeijer left a comment


Starting to look really good! :)
I left some detailed comments on what I'd like to see changed

Comment on lines +44 to +45
"pandas>=1.5.0",
"scikit-learn>=1.2.0"

@dannymeijer dannymeijer Dec 1, 2024


I don't want to make these top-level dependencies, plus we already have pandas as an extra dependency.
Let's make an extra called "ml" and put scikit-learn in there. That way you can install the extra dependencies as koheesio[pandas,ml]
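
For illustration, the extra could be declared in pyproject.toml roughly like this (a minimal sketch using standard PEP 621 optional dependencies; the actual koheesio pyproject may organize its extras differently):

```toml
[project.optional-dependencies]
pandas = ["pandas>=1.5.0"]
ml = ["scikit-learn>=1.2.0"]
```

which would make the combination installable as `pip install "koheesio[pandas,ml]"`.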

import pandas as pd
from pydantic import BaseModel

class PandasCategoricalEncoding(BaseModel):

make this class a PandasStep: from koheesio.pandas import PandasStep

"""

columns: List[str]
encoding_type: str = "one-hot"

change the type to Literal["one-hot", "ordinal"], that way you don't need the extra check you put in the __init__ method
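
For illustration, a minimal standalone sketch (not the PR's class) of why the manual check becomes unnecessary: a pydantic Literal field rejects invalid values at construction time.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class Example(BaseModel):
    encoding_type: Literal["one-hot", "ordinal"] = "one-hot"


Example(encoding_type="ordinal")    # accepted
try:
    Example(encoding_type="label")  # not in the Literal -> pydantic raises
except ValidationError as exc:
    print(exc)
```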

drop_first: bool = True
ordinal_mapping: Dict[str, Dict] = None

def __init__(self, **kwargs):

This becomes obsolete if you make the type a Literal as stated above

Comment on lines +27 to +30
columns: List[str]
encoding_type: str = "one-hot"
drop_first: bool = True
ordinal_mapping: Dict[str, Dict] = None

Make all these Fields, example:

from koheesio.models import Field
...
class PandasCategoricalEncoding(PandasStep):

    columns: List[str] = Field(..., description="...")
    encoding_type: Literal["one-hot", "ordinal"] = Field(default="one-hot", description="...")
    ...


(and of course add appropriate description to each)

Comment on lines +37 to +57
def execute(self, data: pd.DataFrame) -> pd.DataFrame:
    """
    Executes the categorical encoding transformation on the provided dataset.

    Parameters
    ----------
    data : pd.DataFrame
        The input dataset to encode.

    Returns
    -------
    pd.DataFrame
        The dataset with the specified categorical columns encoded.
    """
    if self.encoding_type == 'one-hot':
        data = pd.get_dummies(data, columns=self.columns, drop_first=self.drop_first)
    elif self.encoding_type == 'ordinal':
        for column in self.columns:
            if column in data.columns and self.ordinal_mapping and column in self.ordinal_mapping:
                data[column] = data[column].map(self.ordinal_mapping[column]).fillna(-1).astype(int)
    return data

@dannymeijer dannymeijer Dec 1, 2024


a few things about execute (for when you change to Step):

  • execute takes no arguments, instead add the DataFrame as one of the input Fields; add something like this df: Optional[pd.DataFrame] = Field(default=None, description="...") - I will explain why you want this as Optional in a bit
  • execute is expected to deal with input (your Fields) and generate Output
  • this Output does not need to be returned explicitly, the Step parent-class takes care of this.
  • instead, add a .transform method that can take a DataFrame as input

This means you can change your code like this:

  1. add an Output class
  2. add a .transform method
  3. update your execute method accordingly

Should look something like this (of course add docstrings and things like that):

class PandasCategoricalEncoding(PandasStep):
    ...

    class Output(PandasStep.Output):
        df: pd.DataFrame = Field(..., description="output pandas DataFrame")

    def transform(self, df: Optional[pd.DataFrame] = None):
        # pandas DataFrames don't support truthiness checks, so compare against None explicitly
        self.df = df if df is not None else self.df
        if self.df is None:
            raise RuntimeError("No valid Dataframe was passed")
        self.execute()
        return self.output.df

    def execute(self) -> Output:
        if self.encoding_type == 'one-hot':
            self.output.df = pd.get_dummies(self.df, columns=self.columns, drop_first=self.drop_first)
        elif self.encoding_type == 'ordinal':
            data = self.df
            for column in self.columns:
                if column in data.columns and self.ordinal_mapping and column in self.ordinal_mapping:
                    data[column] = data[column].map(self.ordinal_mapping[column]).fillna(-1).astype(int)
            self.output.df = data

You can then interact with your class like this:

encoding_step = PandasCategoricalEncoding(
    columns=["color"],
    encoding_type="one-hot",
    drop_first=False  # Adjusted to match expected columns
)
encoded_data = encoding_step.transform(self.data)

This way, the interface matches what we do for Spark.
Note: I will work on making Transformation base classes for Pandas in a separate PR.

For reference:

class Transformation(SparkStep, ABC):
    """Base class for all transformations

    Concept
    -------
    A Transformation is a Step that takes a DataFrame as input and returns a DataFrame as output. The DataFrame is
    transformed based on the logic implemented in the `execute` method. Any additional parameters that are needed for
    the transformation can be passed to the constructor.

    Parameters
    ----------
    df : Optional[DataFrame]
        The DataFrame to apply the transformation to. If not provided, the DataFrame has to be passed to the
        transform-method.

    Example
    -------
    ### Implementing a transformation using the Transformation class:

    ```python
    from koheesio.steps.transformations import Transformation
    from pyspark.sql import functions as f


    class AddOne(Transformation):
        target_column: str = "new_column"

        def execute(self):
            self.output.df = self.df.withColumn(
                self.target_column, f.col("old_column") + 1
            )
    ```

    In the example above, the `execute` method is implemented to add 1 to the values of the `old_column` and store the
    result in a new column called `new_column`.

    ### Using the transformation:

    In order to use this transformation, we can call the `transform` method:

    ```python
    from pyspark.sql import SparkSession

    # create a DataFrame with 3 rows
    df = SparkSession.builder.getOrCreate().range(3)

    output_df = AddOne().transform(df)
    ```

    The `output_df` will now contain the original DataFrame with an additional column called `new_column` with the
    values of `old_column` + 1.

    __output_df:__

    |id|new_column|
    |--|----------|
    | 0| 1|
    | 1| 2|
    | 2| 3|

    ...

    ### Alternative ways to use the transformation:

    Alternatively, we can pass the DataFrame to the constructor and call the `execute` or `transform` method without
    any arguments:

    ```python
    output_df = AddOne(df).transform()
    # or
    output_df = AddOne(df).execute().output.df
    ```

    > Note: that the transform method was not implemented explicitly in the AddOne class. This is because the `transform`
    method is already implemented in the `Transformation` class. This means that all classes that inherit from the
    Transformation class will have the `transform` method available. Only the execute method needs to be implemented.

    ### Using the transformation as a function:

    The transformation can also be used as a function as part of a DataFrame's `transform` method:

    ```python
    input_df = spark.range(3)

    output_df = input_df.transform(AddOne(target_column="foo")).transform(
        AddOne(target_column="bar")
    )
    ```

    In the above example, the `AddOne` transformation is applied to the `input_df` DataFrame using the `transform`
    method. The `output_df` will now contain the original DataFrame with an additional columns called `foo` and
    `bar', each with the values of `id` + 1.
    """

    df: Optional[DataFrame] = Field(default=None, description="The Spark DataFrame")

    @abstractmethod
    def execute(self) -> SparkStep.Output:
        """Execute on a Transformation should handle self.df (input) and set self.output.df (output)

        This method should be implemented in the child class. The input DataFrame is available as `self.df` and the
        output DataFrame should be stored in `self.output.df`.

        For example:
        ```python
        def execute(self):
            self.output.df = self.df.withColumn(
                "new_column", f.col("old_column") + 1
            )
        ```

        The transform method will call this method and return the output DataFrame.
        """
        # self.df  # input dataframe
        # self.output.df  # output dataframe
        self.output.df = ...  # implement the transformation logic
        raise NotImplementedError

    def transform(self, df: Optional[DataFrame] = None) -> DataFrame:
        """Execute the transformation and return the output DataFrame

        Note: when creating a child from this, don't implement this transform method. Instead, implement execute!

        See Also
        --------
        `Transformation.execute`

        Parameters
        ----------
        df: Optional[DataFrame]
            The DataFrame to apply the transformation to. If not provided, the DataFrame passed to the constructor
            will be used.

        Returns
        -------
        DataFrame
            The transformed DataFrame
        """
        self.df = df or self.df
        if not self.df:
            raise RuntimeError("No valid Dataframe was passed")
        self.execute()
        return self.output.df

    def __call__(self, *args, **kwargs):
        """Allow the class to be called as a function.

        This is especially useful when using a DataFrame's transform method.

        Example
        -------
        ```python
        input_df = spark.range(3)

        output_df = input_df.transform(AddOne(target_column="foo")).transform(
            AddOne(target_column="bar")
        )
        ```

        In the above example, the `AddOne` transformation is applied to the `input_df` DataFrame using the `transform`
        method. The `output_df` will now contain the original DataFrame with an additional columns called `foo` and
        `bar', each with the values of `id` + 1.
        """
        return self.transform(*args, **kwargs)


making df an Optional type allows us to either give the df as an argument when initializing the class, or pass it through transform - this is exactly how we do it inside the Spark module at the moment
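
For example, with df as an optional field, both of these hypothetical calls would work (a sketch only, mirroring the Spark usage shown above):

```python
# supply the DataFrame when constructing the step, then call transform() with no arguments...
output_df = PandasCategoricalEncoding(df=df, columns=["color"]).transform()

# ...or construct without a DataFrame and pass it to transform() later
output_df = PandasCategoricalEncoding(columns=["color"]).transform(df)
```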

import unittest
import pandas as pd

class TestPandasCategoricalEncoding(unittest.TestCase):

don't use unittest - use pytest.

  1. get rid of the unittest.TestCase (just let it be a regular class)
  2. change your self.assert... (from unittest) to regular python assert (this is how pytest works)
  3. get rid of your setUp - just make the input dataframe a module level variable, OR make it a fixture (a bit overkill for your purpose here)
  4. change your code to match the interface I proposed above (a rough sketch follows below)
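
A rough sketch of what the pytest-style test could look like under those suggestions (module-level input data, plain asserts, and the transform interface proposed above; the import path and names are illustrative):

```python
import pandas as pd

from koheesio.pandas.categorical_encoding import PandasCategoricalEncoding

# module-level input data instead of unittest's setUp
data = pd.DataFrame({"color": ["red", "green", "blue"]})


class TestPandasCategoricalEncoding:
    def test_one_hot_encoding(self):
        encoding_step = PandasCategoricalEncoding(
            columns=["color"], encoding_type="one-hot", drop_first=False
        )
        encoded = encoding_step.transform(data)

        # plain asserts instead of unittest's self.assert* methods
        assert "color_red" in encoded.columns
        assert "color" not in encoded.columns
```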

Comment on lines +2 to +5
import pandas as pd
from src.koheesio.pandas.categorical_encoding import PandasCategoricalEncoding
import unittest
import pandas as pd

@dannymeijer dannymeijer Dec 1, 2024


you're importing pandas twice. Also, import pandas through the koheesio module to avoid conflict:
from koheesio.pandas import pandas as pd

from koheesio.steps import Step

from typing import List, Dict
import pandas as pd

import pandas through the koheesio module (as stated above) to avoid conflict:
from koheesio.pandas import pandas as pd

Please run isort (through make fmt or hatch fmt, or run ruff)

@dannymeijer dannymeijer added this to the 0.10.0 milestone Dec 1, 2024
@dannymeijer

Since no response has been provided in several weeks, I am closing this PR. Please re-open or submit a new contribution request once you feel ready to do so and once the concerns have been addressed.
