StringNormalizer creates ValueError: Expected 2D array, got 1D array instead #443

Open
woodly0 opened this issue Mar 5, 2025 · 5 comments

@woodly0

woodly0 commented Mar 5, 2025

Hello Sir,
long time no see. Hope you're doing well!

I am trying to use the StringNormalizer prior to an sklearn preprocessor (doesn't matter which one) and there is an exception thrown upon fit_transform(). Consider the following example:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import (
    ReplaceTransformer,
    LookupTransformer,
    StringNormalizer,
)
from sklearn.preprocessing import TargetEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

X = pd.DataFrame(
    {
        "something": [1, 2, -3, 40, 5],
        "color": ["yellow ", "white", "BLACK", " green", "red"],
        "phone_number": [
            "41764509925",
            "41780000000",
            pd.NA,
            "07821572861",
            "0041442516185",
        ],
    }
)

y = pd.Series([1, 0, 0, 1, 0], name="target")

mapper = DataFrameMapper(
    [
        (
            ["color"],
            [
                StringNormalizer(function="lower", trim_blanks=True),  
                TargetEncoder(random_state=0),
            ],
        ),
        (
            ["phone_number"],
            [
                SimpleImputer(
                    missing_values=pd.NA, strategy="constant", fill_value="missing"
                ),
                ReplaceTransformer(r"^\d+000000", "missing"),
                ReplaceTransformer(r"^(41|0041|0)7\d{8}$", "mobile"),
                LookupTransformer(
                    {"missing": "missing", "mobile": "mobile"}, default_value="fix"
                ),
                OneHotEncoder(
                    categories="auto",
                    dtype=int,
                    handle_unknown="ignore",
                    sparse_output=False,
                ),
            ],
        ),
    ],
    df_out=True,
    default=False,
)

mapper.fit_transform(X, y)

The last line throws the following error:

ValueError: ['color']: Expected 2D array, got 1D array instead: array=['yellow' 'white' 'black' 'green' 'red'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The transformation of the phone_number feature has no direct link with the issue. But since it seems a bit rocky, I thought you might give me some feedback on how to improve it anyway.

Thank you a lot in advance!

@vruusmann
Member

I am trying to use the StringNormalizer prior to an sklearn preprocessor (doesn't matter which one) and there is an exception thrown upon fit_transform().

Do you think the following to_1d() function call is at fault?
https://github.com/jpmml/sklearn2pmml/blob/0.113.0/sklearn2pmml/preprocessing/__init__.py#L641

The StringNormalizer business logic isn't that complicated. You can easily replace it with an equivalent ExpressionTransformer instance:

from sklearn2pmml.preprocessing import ExpressionTransformer

mapper = DataFrameMapper(
    [
        (
            ["color"],
            [
                #StringNormalizer(function="lower", trim_blanks=True),  
                ExpressionTransformer("(X[0].lower()).strip()"),
                TargetEncoder(random_state=0),
            ],
        ),
    ],
    df_out=True,
    default=False,
)

Alternatively, you can do exactly what the Python error message tells you to do - reshape the data container to (-1, 1) shape between StringNormalizer and TargetEncoder steps:

from sklearn2pmml.util import Reshaper

mapper = DataFrameMapper(
    [
        (
            ["color"],
            [
                StringNormalizer(function="lower", trim_blanks=True),
                Reshaper((-1, 1)),
                TargetEncoder(random_state=0),
            ],
        ),
    ],
    df_out=True,
    default=False,
)
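
To make the shape difference concrete, here is a minimal NumPy-only sketch. It uses the normalized values from the error message and does not involve sklearn2pmml or DataFrameMapper; the variable names are illustrative.

```python
import numpy as np

# 1D array, like the one reported in the error message
colors_1d = np.array(["yellow", "white", "black", "green", "red"])

# (-1, 1) means "as many rows as needed, exactly one column"
colors_2d = colors_1d.reshape(-1, 1)

print(colors_1d.shape)  # (5,)
print(colors_2d.shape)  # (5, 1)
```

The second form, one row per sample and one column per feature, is the layout the error message asks for.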

@vruusmann
Member

The transformation of the phone_number feature has no direct link with the issue. But since it seems a bit rocky, I thought you might give me some feedback on how to improve it anyway.

This sub-sequence catches my eye:

SimpleImputer(
    missing_values=pd.NA, strategy="constant", fill_value="missing"
),
ReplaceTransformer(r"^\d+000000", "missing"),
ReplaceTransformer(r"^(41|0041|0)7\d{8}$", "mobile"),
LookupTransformer(
    {"missing": "missing", "mobile": "mobile"}, default_value="fix"
)

I would try to replace it with a single ExpressionTransformer step. The expr element would be a standalone Python function (ie. not an in-line expression), which contains a very straightforward if-elif-else statement (one exit for missing, mobile and fix each).

The JPMML-Python library that handles Python-to-PMML expression translation supports RegEx replace functionality using re.sub, pcre.sub or pcre2.substitute functions. Also, the leading SimpleImputer step can be replaced with the ExpressionTransformer.map_missing_to attribute.

I'm not going to write any Python code for you this time. I'll tag this issue, and perhaps sometime in the not-so-distant future I'll write an example about it for the JPMML documentation site (currently under construction).

@vruusmann
Member

vruusmann commented Mar 6, 2025

The JPMML-Python library that handles Python-to-PMML expression translation supports RegEx replace functionality using re.sub, pcre.sub or pcre2.substitute functions.

Since you'll be using RegEx in if-elif-else conditions, you need the "search" functionality (not the "replace" functionality).

The above pipeline fragment can be simplified to a simple Python UDF:

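import re

def phone_number_type(phone_number):
  # Leading digits followed by six zeros
  if re.search(r"^\d+000000", phone_number):
    return "missing"
  # Prefix 41, 0041 or 0, then a 7 and exactly eight more digits
  elif re.search(r"^(41|0041|0)7\d{8}$", phone_number):
    return "mobile"
  else:
    return "fixed"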

Then you'd pass a reference to this Python UDF to ExpressionTransformer, and you'd be all set!
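
Before wiring the function into a pipeline, it can be checked in plain Python. The sketch below restates the function from this comment (with `import re`, raw-string patterns, and a consistently spelled parameter) and runs it on the non-missing phone numbers from the original example. The missing row is left out because `re.search` needs a string; in the pipeline, missing values would be handled separately, for example via the `map_missing_to` attribute mentioned earlier in the thread. The `samples` and `results` names are illustrative.

```python
import re

def phone_number_type(phone_number):
    # Leading digits followed by six zeros
    if re.search(r"^\d+000000", phone_number):
        return "missing"
    # Prefix 41, 0041 or 0, then a 7 and exactly eight more digits
    elif re.search(r"^(41|0041|0)7\d{8}$", phone_number):
        return "mobile"
    else:
        return "fixed"

# Non-missing values from the "phone_number" column of the example DataFrame
samples = ["41764509925", "41780000000", "07821572861", "0041442516185"]
results = {number: phone_number_type(number) for number in samples}

for number, kind in results.items():
    print(number, "->", kind)
# 41764509925 -> mobile
# 41780000000 -> missing
# 07821572861 -> fixed
# 0041442516185 -> fixed
```

Note that "07821572861" falls through to "fixed": after the leading 0 and 7 it has nine remaining digits, one more than the pattern allows.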

@woodly0
Author

woodly0 commented Mar 6, 2025

Thanks for your quick response!

Do you think the following to_1d() function call is at fault?

I do not think this function is at fault, because you use it in many other transformer classes. If I understand your code correctly, I suggest that a simple return Xt.reshape((-1, 1)) at the end of the StringNormalizer.transform() function would solve the issue.
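
The idea behind this suggestion can be sketched as follows. This is a hypothetical stand-in written with NumPy only, not the actual sklearn2pmml implementation; the function name `normalize_column` and its internals are assumptions made for illustration.

```python
import numpy as np

def normalize_column(X, function="lower", trim_blanks=True):
    """Hypothetical stand-in for a string-normalizing transform (not sklearn2pmml code)."""
    # Flatten the input to 1D so each value can be processed on its own
    values = np.asarray(X, dtype=object).ravel()
    out = []
    for value in values:
        text = str(value)
        if function == "lower":
            text = text.lower()
        elif function == "upper":
            text = text.upper()
        if trim_blanks:
            text = text.strip()
        out.append(text)
    Xt = np.array(out, dtype=object)
    # Return a single-column 2D array so a downstream scikit-learn step accepts it
    return Xt.reshape((-1, 1))

normalized = normalize_column(["yellow ", "white", "BLACK", " green", "red"])
print(normalized.shape)  # (5, 1)
```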

Thank you for proposing working alternatives!

I would try to replace it with a single ExpressionTransformer step. The expr element would be a standalone Python function (ie. not an in-line expression), which contains a very straightforward if-elif-else statement (one exit for missing, mobile and fix each).

OK, I will look into Python UDF in conjunction with the ExpressionTransformer. Haven't really explored this option so far. Are these capable of accepting 2 parameters/columns as input? If so, that would solve my need of RegEx-matching one column with another.

@vruusmann
Member

OK, I will look into Python UDF in conjunction with the ExpressionTransformer. Are these capable of accepting 2 parameters/columns as input?

IIRC, you define the Python UDF in terms of scalar values, and the ExpressionTransformer.transform(X) method automatically handles the input column -> scalar -> output column mapping.
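
The scalar-in, scalar-out idea can be illustrated in plain Python. Whether ExpressionTransformer accepts a two-argument UDF like this is not confirmed in this thread, so treat the sketch as a conceptual illustration only: the function `matches_pattern`, the sample patterns, and the row-wise loop are all assumptions, and the loop merely stands in for whatever column-to-scalar mapping the transformer performs.

```python
import re

def matches_pattern(value, pattern):
    """Scalar function: True when `value` matches the regex held in `pattern`."""
    return re.search(pattern, value) is not None

# Two "columns": one with values, one with a per-row pattern (illustrative data)
values = ["41764509925", "07821572861", "0041442516185"]
patterns = [r"^41", r"^41", r"^0041"]

# Apply the scalar function row by row
flags = [matches_pattern(v, p) for v, p in zip(values, patterns)]
print(flags)  # [True, False, True]
```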

Just get some experiments going, and ask for my assistance here if you get completely stuck somewhere.

Right now, the only reference to this topic is this:
https://openscoring.io/blog/2023/03/09/sklearn_udf_expression_transformer/
