-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix header detection for tables with sparse numerical data #77
Conversation
Conflicting PR. Removed from build OMERO-plugins-push#1169. See the console output for more details.
--conflicts |
Conflicting PR. Removed from build OMERO-plugins-push#1176. See the console output for more details.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review summary
In conclusion, what was described in the PR description is reasonable, valid, and in my opinion necessary.
Following the steps described, without the PR there are cases where the script will fail (with a ValueError
exception) in the presence of a sparse numeric column and the lack of --allow-nan
. This PR fixes that behavior such that in the case of a sparse numeric column you can choose to have the sparse numeric column read as 'd'
(DoubleColumn
) or 's'
(StringColumn
) by passing --allow-nan
or not
My current understanding is that nan values or empty values will be processed regardless of if they're in a numeric or string column. And the default is to read the nan values as string type.
If so I'd suggest adding something like "without --allow-nan
, by default, the column will be read as a string column in the presence of nans in the column" somewhere either in the help of the argument or the README.
Review details
Using sparse_string_column.csv
(4th column sparse):
Without PR detected ['s', 'd', 'l', 's', 's']
. Added empty string to final result.
With PR changes, detected ['s', 'd', 'l', 's', 's']
and with --allow-nan
, similary detected ['s', 'd', 'l', 's', 's']
. Added empty string to final result.
Using sparse_numeric_column.csv (2nd and 4th column sparse)
Without PR detected ['s', 'd', 'd', 's', 's']
, however will get the following exception:
ValueError: Empty Double or Long value. Use --allow_nan to convert to NaN
With PR changes, detected ['s', 's', 'd', 's', 's']
which a dded empty string to final result and with --allow-nan
detected ['s', 'd', 'd', 's', 's']
which added 'nan'
to final result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
Thanks @muhanadz. The new column detection behavior is now released as |
Fixes #76
Reproducible scenario
First create a minimal dataset/image hierarchy e.g. as follows:
CSV files with sparse string data such as the one below are correctly handled by the current HEAD of
omero-metadata
.The columns with missing values are mapped as
s/StringCOlumn
and the missing value are turned into empty strings where runningomero metadata populate --file sparse_string_column.csv $dataset
e.g.CSV files with sparse numerical columns such as the one below currently fail during the population command:
Here, the
meas1
column is currently mapped into ad
header type/DoubleColumn
by thepandas
detection logic, With the defaultomero metadata populate --file command, the table population fails with
ValueError: Empty Double or Long value. Use --allow_nan to convert to NaN`.Proposed changes
Since the library already includes some logic allowing the user to control whether NaN values are allowed in the OMERO.table (introduced in #60), this PR proposes the following changes
MetadataControl.detect_headers
API supports an extrakeep_default_na
argument (True
by default). Its value is forwarded to
pandas.read_csvand determines controls how pandas should handle NA (missing) values and whether the column is detected as
dvs
s` (see https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)MetadataControl.detect_headers
API with different combination of tables/argumentsargs.allow_nan
(False
by default) as pass it askeep_default_na
toMetadataControl.detect_headers
fe73a17 adds a cosmetic change defining GNU-style aliases of the command-line arguments (
--manual-header
,--allow-nan
) using hyphen as separator. The existing underscore separated flags are preserved.Testing
With these changes, annotating of sparse CSV tables using the default header detection should be functional in all cases.
if the tabular data is dense or containing sparse string columns, the behavior of the command should be unchanged
if the tabular data is dense or containing sparse numerical columns,, the behavior of the command will depend on the
--allow-nan
flagwill detect the sparse numeric column as a
StringColumn
and store the missing values as empty stringswill detect the sparse numeric column as a
DoubleColumn
and store missing values asnan