Some parsing issues #34

emiglietta · 2024-05-01T00:11:40Z

well_id appears to be parsed incorrectly in some cases, particularly for the embedding.parquet files from sources 1, 2 and 7 of the JUMP dataset: the well_id listed in the index is actually the previous "segment" of the key.
I used this filter to check it:

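# `index` is the polars LazyFrame holding the index (polars imported as pl).
# Keep one example row per (dataset, leaf node, source) among rows
# that are NOT flagged as parsing errors, then inspect well_id against key.
df = (
    index
    .unique(subset="well_id")
    .filter(pl.col("is_parsing_error").eq(False))
    .select("well_id", "key", "dataset_id", "source_id", "leaf_node")
    .unique(subset=["dataset_id", "leaf_node", "source_id"])
    .collect(streaming=True)
)
df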
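
To put a number on the shift, something like the following could be run over the `key` / `well_id` pairs returned above. This is only a rough sketch: `well_offset` is a hypothetical helper, the example keys are invented (the real key layout may well differ), and it assumes wells are named A01 to P24 as on a 384-well plate.

```python
import re
from typing import Optional

# Assumption: wells are named "A01".."P24" (384-well layout).
WELL_RE = re.compile(r"^[A-P](0[1-9]|1[0-9]|2[0-4])$")


def well_offset(key: str, well_id: str) -> Optional[int]:
    """Return how many segments the parsed well_id sits before the first
    well-like segment of the key (0 means it was parsed correctly).
    Returns None if either one cannot be located in the key."""
    segments = key.split("/")
    if well_id not in segments:
        return None
    parsed_at = segments.index(well_id)
    for i, segment in enumerate(segments):
        if WELL_RE.match(segment):
            return i - parsed_at
    return None


# Invented rows for illustration only; not taken from the index.
rows = [
    {"key": "src/batch/plateA/A01/embedding.parquet", "well_id": "A01"},
    {"key": "src/batch/extra/plateB/B02/embedding.parquet", "well_id": "plateB"},
]
offsets = [well_offset(r["key"], r["well_id"]) for r in rows]
print(offsets)
```

An offset of 1 on rows where is_parsing_error is False would match the behaviour described above, i.e. the parser picking up the segment just before the well.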

Source 15 of the JUMP dataset seems to have a different key structure from the other sources, which leads to a number of parsing errors (none of which are flagged as such in the is_parsing_error column of the index):

The dataset_id is 'jump' and not 'cpg0016-jump'
There are 460 unique 'plate_id' values from that source, but only 183 of those follow the expected structure.

# There are 460 plate_ids for source 15 in JUMP. Are there really 460 plates?
# Also, plate_id varies in structure!
df = (index
      .filter(pl.col("dataset_id").eq("jump"))
      .filter(pl.col("source_id").eq("source_15"))
      .unique(subset=["plate_id"])
      .select(pl.col(["plate_id"]))
      .collect(streaming=True)
      )
df

# For source 15 in JUMP, are there really 460 plates?
# Only 183 unique plate_ids match the regex for the expected plate name structure.
df = (index
      .filter(pl.col("dataset_id").eq("jump"))
      .filter(pl.col("source_id").eq("source_15"))
      .filter(pl.col("plate_id").str.contains("^PE(P|C)[0-9]{8}$"))
      .select(pl.col(["plate_id"]).unique().sort())
      .collect(streaming=True)
      )
df
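
To see what the plate_ids that do not match look like, they could be grouped by "shape" once collected into a plain Python list. A minimal sketch, assuming the same regex as above; `shape` and `summarize` are hypothetical helpers and the example values are made up, not taken from the index.

```python
import re
from collections import Counter

# Same pattern as in the polars filter above.
PLATE_RE = re.compile(r"^PE(P|C)[0-9]{8}$")


def shape(value: str) -> str:
    """Collapse a string to its structure: digits -> 9, letters -> A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", value))


def summarize(plate_ids):
    """Count ids matching the expected pattern and tally the shapes of the rest."""
    matching = [p for p in plate_ids if PLATE_RE.match(p)]
    others = Counter(shape(p) for p in plate_ids if not PLATE_RE.match(p))
    return len(matching), others


# Made-up values for illustration only.
example_ids = ["PEP00000001", "PEC00000002", "plate_3", "2023-01-01", "PEP123"]
n_ok, other_shapes = summarize(example_ids)
print(n_ok)
print(other_shapes.most_common())
```

With the real list (e.g. `df["plate_id"].to_list()` from the first source 15 query), the most common shapes should show quickly which other key segments are being picked up as plate_id.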
