Some parsing issues #34

Open
emiglietta opened this issue May 1, 2024 · 0 comments

There appear to be some issues when parsing well_id, particularly in the embedding.parquet files from sources 1, 2 and 7 of the JUMP dataset. The well_id listed in the index corresponds to the previous "segment" of the key.
I used this filtering to check it:

df = (
    index
    .unique(subset="well_id")
    .filter(pl.col("is_parsing_error").eq(False)) 
    .select("well_id", "key", "dataset_id", "source_id", "leaf_node")
    .unique(subset=["dataset_id","leaf_node","source_id"])
    .collect(streaming=True)
    )
df
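
A rough way to list which parsed well_id values are suspicious is to test them against the shape of a well position. The sketch below assumes 384-well plates with wells named A01 to P24, which is not stated in the index, and the sample values are made up for illustration. In practice the list could come from something like df["well_id"].to_list().

```python
import re

# Assumption: 384-well plates, so wells run from A01 to P24.
WELL_RE = re.compile(r"[A-P](0[1-9]|1[0-9]|2[0-4])")


def looks_like_well(well_id: str) -> bool:
    """Return True if the string has the shape of a 384-well position."""
    return WELL_RE.fullmatch(well_id) is not None


# Hypothetical values, not taken from the real index.
sample_well_ids = ["A01", "P24", "embedding", "Q01", "B25"]

# Anything that does not look like a well is a candidate parsing error.
suspicious = [w for w in sample_well_ids if not looks_like_well(w)]
print(suspicious)
```

Values that fail this check while is_parsing_error is False would be the rows worth inspecting.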

Source 15 of the JUMP dataset seems to have a different key structure than the other sources, which leads to a number of parsing errors that are not flagged as such in the is_parsing_error column of the index:

  • The dataset_id is 'jump' and not 'cpg0016-jump'
  • There are 460 unique 'plate_id' values from that source, but only 183 of those follow the expected structure.
# There are 460 plate_ids for source 15 in JUMP, are there really 460 plates? 
# also, plate_id varies in structure!
df = (index
      .filter(pl.col("dataset_id").eq("jump"))
      .filter(pl.col("source_id").eq("source_15"))
      .unique(subset=["plate_id"])
      .select(pl.col(["plate_id"]))
      .collect(streaming=True)
      )
df
# For source 15 in JUMP, are there really 460 plates? 
# There are 183 unique ones matching the regex for the plate name structure
df = (index
      .filter(pl.col("dataset_id").eq("jump"))
      .filter(pl.col("source_id").eq("source_15"))
      .filter(pl.col("plate_id").str.contains("^PE(P|C)[0-9]{8}$"))
      .select(pl.col(["plate_id"]).unique().sort())
      .collect(streaming=True)
      )
df
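
To see how the plate_id values vary in structure, it may help to collapse each value to its "shape" (letters become A, digits become 9) and count the shapes. This is only a sketch: the sample IDs are invented, and the real values would come from the plate_id column, for example via df["plate_id"].to_list().

```python
import re
from collections import Counter


def shape(plate_id: str) -> str:
    """Collapse a string to its shape: letters -> 'A', digits -> '9'."""
    collapsed = re.sub(r"[A-Za-z]", "A", plate_id)
    return re.sub(r"[0-9]", "9", collapsed)


# Hypothetical values, not taken from the real index.
sample_plate_ids = ["PEP00000001", "PEC00000002", "PEP0000003", "not_a_plate"]

# Count how many plate_ids share each shape; rare shapes are likely mis-parsed.
counts = Counter(shape(p) for p in sample_plate_ids)
for s, n in counts.most_common():
    print(n, s)
```

IDs matching the regex above all share the shape AAA99999999, so any other shape points at a key that was split differently.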