`StringNormalizer` creates ValueError: Expected 2D array, got 1D array instead #443

woodly0 · 2025-03-05T16:37:30Z

Hello Sir,
long time no see. Hope you're doing well!

I am trying to use the StringNormalizer prior to an sklearn preprocessor (doesn't matter which one) and there is an exception thrown upon fit_transform(). Consider the following example:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import (
    ReplaceTransformer,
    LookupTransformer,
    StringNormalizer,
)
from sklearn.preprocessing import TargetEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

X = pd.DataFrame(
    {
        "something": [1, 2, -3, 40, 5],
        "color": ["yellow ", "white", "BLACK", " green", "red"],
        "phone_number": [
            "41764509925",
            "41780000000",
            pd.NA,
            "07821572861",
            "0041442516185",
        ],
    }
)

y = pd.Series([1, 0, 0, 1, 0], name="target")

mapper = DataFrameMapper(
    [
        (
            ["color"],
            [
                StringNormalizer(function="lower", trim_blanks=True),  
                TargetEncoder(random_state=0),
            ],
        ),
        (
            ["phone_number"],
            [
                SimpleImputer(
                    missing_values=pd.NA, strategy="constant", fill_value="missing"
                ),
                ReplaceTransformer(r"^\d+000000", "missing"),
                ReplaceTransformer(r"^(41|0041|0)7\d{8}$", "mobile"),
                LookupTransformer(
                    {"missing": "missing", "mobile": "mobile"}, default_value="fix"
                ),
                OneHotEncoder(
                    categories="auto",
                    dtype=int,
                    handle_unknown="ignore",
                    sparse_output=False,
                ),
            ],
        ),
    ],
    df_out=True,
    default=False,
)

mapper.fit_transform(X, y)

The last line throws the following error:

ValueError: ['color']: Expected 2D array, got 1D array instead: array=['yellow' 'white' 'black' 'green' 'red'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The transformation of the phone_number feature has no direct link with the issue. But since it seems a bit rocky, I thought you might give me some feedback on how to improve it anyway.

Thank you a lot in advance!


vruusmann · 2025-03-05T20:59:22Z

I am trying to use the StringNormalizer prior to an sklearn preprocessor (doesn't matter which one) and there is an exception thrown upon fit_transform().

Do you think the following to_1d() function call is at fault?
https://github.com/jpmml/sklearn2pmml/blob/0.113.0/sklearn2pmml/preprocessing/__init__.py#L641

The StringNormalizer business logic isn't that complicated. You can easily replace it with an equivalent ExpressionTransformer instance:

from sklearn2pmml.preprocessing import ExpressionTransformer

mapper = DataFrameMapper(
    [
        (
            ["color"],
            [
                #StringNormalizer(function="lower", trim_blanks=True),  
                ExpressionTransformer("(X[0].lower()).strip()"),
                TargetEncoder(random_state=0),
            ],
        ),
    ],
    df_out=True,
    default=False,
)
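For reference, the in-line expression operates on one scalar cell value at a time. A plain-Python sketch of what it computes on the sample `color` column (no sklearn2pmml involved; `normalize` is a hypothetical helper name used only for this illustration):

```python
# Sample values from the "color" column of the original post.
colors = ["yellow ", "white", "BLACK", " green", "red"]

def normalize(value):
    # Same operations as the expression "(X[0].lower()).strip()",
    # applied to a single string value.
    return value.lower().strip()

normalized = [normalize(c) for c in colors]
print(normalized)  # ['yellow', 'white', 'black', 'green', 'red']
```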

Alternatively, you can do exactly what the Python error message tells you to do - reshape the data container to (-1, 1) shape between StringNormalizer and TargetEncoder steps:

from sklearn2pmml.util import Reshaper

mapper = DataFrameMapper(
    [
        (
            ["color"],
            [
                StringNormalizer(function="lower", trim_blanks=True),
                Reshaper((-1, 1)),
                TargetEncoder(random_state=0),
            ],
        ),
    ],
    df_out=True,
    default=False,
)
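To see what the reshape does in isolation, here is a minimal NumPy sketch. It only illustrates the (-1, 1) reshape on the array from the error message; it does not use `Reshaper` itself, and the variable names are made up for the example:

```python
import numpy as np

# 1D array, like the one shown in the error message.
Xt = np.array(["yellow", "white", "black", "green", "red"])
print(Xt.shape)  # (5,)

# Column vector: five samples, one feature.
# The -1 lets NumPy infer the number of rows.
Xt_2d = Xt.reshape((-1, 1))
print(Xt_2d.shape)  # (5, 1)
```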

vruusmann · 2025-03-05T21:09:43Z

The transformation of the phone_number feature has no direct link with the issue. But since it seems a bit rocky, I thought you might give me some feedback on how to improve it anyway.

This sub-sequence catches my eye:

SimpleImputer(
    missing_values=pd.NA, strategy="constant", fill_value="missing"
),
ReplaceTransformer(r"^\d+000000", "missing"),
ReplaceTransformer(r"^(41|0041|0)7\d{8}$", "mobile"),
LookupTransformer(
    {"missing": "missing", "mobile": "mobile"}, default_value="fix"
)

I would try to replace it with a single ExpressionTransformer step. The expr element would be a standalone Python function (ie. not an in-line expression), which contains a very straightforward if-elif-else statement (one exit for missing, mobile and fix each).

The JPMML-Python library that handles Python-to-PMML expression translation supports RegEx replace functionality using re.sub, pcre.sub or pcre2.substitute functions. Also, the leading SimpleImputer step can be replaced with the ExpressionTransformer.map_missing_to attribute.

Not going to write any Python code for you this time. I'll tag this issue, and perhaps I'll write an example about it sometime in the not-so-distant future on the JPMML documentation site (under construction right now).

vruusmann · 2025-03-06T06:55:04Z

The JPMML-Python library that handles Python-to-PMML expression translation supports RegEx replace functionality using re.sub, pcre.sub or pcre2.substitute functions.

Since you'll be using RegEx in if-elif-else conditions, then you need to be using the "search" functionality (not "replace" functionality).

The above pipeline fragment can be simplified to a simple Python UDF:

import re

def phone_number_type(phone_number):
  if re.search(r"^\d+000000", phone_number):
    return "missing"
  elif re.search(r"^(41|0041|0)7\d{8}$", phone_number):
    return "mobile"
  else:
    return "fixed"

Then you'd pass a reference to this Python UDF to ExpressionTransformer, and you'd be all set!
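A quick sanity check of the UDF in plain Python, using the sample values from the original post. The `None` branch is an assumption that stands in for the missing-value handling (SimpleImputer or map_missing_to); it is not part of the UDF above, and nothing here touches sklearn2pmml:

```python
import re

def phone_number_type(phone_number):
    # Assumption: a missing value arrives as None and is labeled "missing".
    if phone_number is None:
        return "missing"
    if re.search(r"^\d+000000", phone_number):
        return "missing"
    elif re.search(r"^(41|0041|0)7\d{8}$", phone_number):
        return "mobile"
    else:
        return "fixed"

# Sample values from the original post, with pd.NA replaced by None.
phone_numbers = ["41764509925", "41780000000", None, "07821572861", "0041442516185"]

types = [phone_number_type(p) for p in phone_numbers]
print(types)  # ['mobile', 'missing', 'missing', 'fixed', 'fixed']
```

Note that "07821572861" comes out as "fixed": after the leading "07" there are nine digits, and the pattern requires exactly eight.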

woodly0 · 2025-03-06T08:17:24Z

Thanks for your quick response!

Do you think the following to_1d() function call is at fault?

I do not think this function is at fault, because you use it in many other transformer classes. If I understand your code correctly, I suggest that a simple return Xt.reshape((-1, 1)) at the end of the StringNormalizer.transform() function would solve the issue.

Thank you for proposing working alternatives!

I would try to replace it with a single ExpressionTransformer step. The expr element would be a standalone Python function (ie. not an in-line expression), which contains a very straightforward if-elif-else statement (one exit for missing, mobile and fix each).

OK, I will look into Python UDF in conjunction with the ExpressionTransformer. Haven't really explored this option so far. Are these capable of accepting 2 parameters/columns as input? If so, that would solve my need of RegEx-matching one column with another.

vruusmann · 2025-03-06T08:44:03Z

OK, I will look into Python UDF in conjunction with the ExpressionTransformer. Are these capable of accepting 2 parameters/columns as input?

IIRC, you define the Python UDF in terms of scalar values, and the ExpressionTransformer.transform(X) method automatically handles the input column -> scalar -> output column mapping.

Just get some experiments going, and ask for my assistance here if you get completely stuck somewhere.

Right now, the only reference to this topic is this:
https://openscoring.io/blog/2023/03/09/sklearn_udf_expression_transformer/
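The column -> scalar -> output column mapping can be pictured in plain Python as below. This is only a sketch of the idea for a two-argument UDF: `matches_pattern` and the sample rows are hypothetical, and whether ExpressionTransformer and the PMML translation accept a RegEx pattern taken from a second column is not confirmed anywhere in this thread.

```python
import re

def matches_pattern(value, pattern):
    # Scalar UDF: both arguments are single cell values from the same row.
    return re.search(pattern, value) is not None

# Hypothetical two-column input; each row is (value, pattern).
rows = [
    ("41764509925", r"^417\d{8}$"),
    ("0041442516185", r"^417\d{8}$"),
    ("red", r"^r"),
]

# Row-wise application: one scalar result per input row.
result = [matches_pattern(value, pattern) for value, pattern in rows]
print(result)  # [True, False, True]
```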
