Sometimes empty string as input moves entire names #16

Open
Rekyt opened this issue Feb 14, 2023 · 6 comments
@Rekyt

Rekyt commented Feb 14, 2023

I also ran into a strange bug when the input contains an empty string (different from #14).

Sometimes, if the dataset is "big enough", submitting a data.frame that contains an empty string drops that name from the input and assigns the next name to its ID instead. This causes many issues if you want to match the names back through the ID.

See my reproducible example:

library("dplyr")
#> Warning: package 'dplyr' was built under R version 4.2.1
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

ex_data = structure(
  list(
    ID = c(
      "splot-1", "splot-2", "splot-3", "splot-4", 
      "splot-5", "splot-6", "splot-7", "splot-8", "splot-9", "splot-10", 
      "splot-11", "splot-12", "splot-13", "splot-14", "splot-15", "splot-16", 
      "splot-17", "splot-18", "splot-19", "splot-20", "splot-21", "splot-22", 
      "splot-23", "splot-24", "splot-25", "splot-26", "splot-27", "splot-28", 
      "splot-29", "splot-30"
    ),
    Name_submitted = c(
      "Chlorophytum platt", 
      "Echinochloa", "Indigofera lange bl-stiele", "Polygala", "", 
      "Species", "Fabaceae", "Stumpf", "Schwarz pkt.", "-wirtelig b", 
      "Borstig", "-kantig", "Glatt langweilig", "Herzblatt", "Oval", 
      "Versetzt lanzettlich", "Species", "Kn verdichtet", "Wirtelig nadelig", 
      "Aa", "Aaaaaaa", "Aa achalensis", "Aa argyrolepis", "Aa aurantiaca", 
      "Aa calceata", "Achnella", "Aa colombiana", "Aa denticulata", 
      "Aa erosa", "Aa fiebrigii"
    )
  ),
  row.names = c(NA, 30L),
  class = "data.frame"
)

# Note that name splot-5 is empty
head(ex_data)
#>        ID             Name_submitted
#> 1 splot-1         Chlorophytum platt
#> 2 splot-2                Echinochloa
#> 3 splot-3 Indigofera lange bl-stiele
#> 4 splot-4                   Polygala
#> 5 splot-5                           
#> 6 splot-6                    Species

# Also many of the names are quite messy but some are genuine
ex_data[["Name_submitted"]]
#>  [1] "Chlorophytum platt"         "Echinochloa"               
#>  [3] "Indigofera lange bl-stiele" "Polygala"                  
#>  [5] ""                           "Species"                   
#>  [7] "Fabaceae"                   "Stumpf"                    
#>  [9] "Schwarz pkt."               "-wirtelig b"               
#> [11] "Borstig"                    "-kantig"                   
#> [13] "Glatt langweilig"           "Herzblatt"                 
#> [15] "Oval"                       "Versetzt lanzettlich"      
#> [17] "Species"                    "Kn verdichtet"             
#> [19] "Wirtelig nadelig"           "Aa"                        
#> [21] "Aaaaaaa"                    "Aa achalensis"             
#> [23] "Aa argyrolepis"             "Aa aurantiaca"             
#> [25] "Aa calceata"                "Achnella"                  
#> [27] "Aa colombiana"              "Aa denticulata"            
#> [29] "Aa erosa"                   "Aa fiebrigii"

# There are 30 names from 'splot-1' to 'splot-30'
dim(ex_data)
#> [1] 30  2

# Matching
ex_match = TNRS::TNRS(ex_data)

# Only 28 rows
dim(ex_match)
#> [1] 28 45

# Two IDs are not found 'splot-6' and 'splot-17'
setdiff(ex_data$ID, ex_match$ID)
#> [1] "splot-6"  "splot-17"

# Both correspond to names that are 'Species'
ex_data %>%
  filter(ID %in% setdiff(ex_data$ID, ex_match$ID))
#>         ID Name_submitted
#> 1  splot-6        Species
#> 2 splot-17        Species

# However, 'Species' has been matched as 'splot-5'
head(ex_match)[,1:3]
#>        ID             Name_submitted Overall_score
#> 1 splot-1         Chlorophytum platt     0.5000000
#> 2 splot-2                Echinochloa     1.0000000
#> 3 splot-3 Indigofera lange bl-stiele     0.4004262
#> 4 splot-4                   Polygala     1.0000000
#> 5 splot-5                    Species            NA
#> 6 splot-7                   Fabaceae     1.0000000

# And regular 'splot-5' is nowhere to be seen
ex_match[,1:5] %>%
  filter(ID == "splot-5")
#>        ID Name_submitted Overall_score Name_matched_id     Name_matched
#> 1 splot-5        Species            NA                 [No match found]

Created on 2023-02-14 with reprex v2.0.2

I dug into the issue by looking at the query sent through TNRS_core(), and it all seems fine.
The data JSON looks like this:

[["splot-1","Chlorophytum platt"],["splot-2","Echinochloa"],["splot-3","Indigofera lange bl-stiele"],
["splot-4","Polygala"],["splot-5",""],["splot-6","Species"],["splot-7","Fabaceae"],["splot-8","Stumpf"],
["splot-9","Schwarz pkt."],["splot-10","-wirtelig b"],["splot-11","Borstig"],["splot-12","-kantig"],
["splot-13","Glatt langweilig"],["splot-14","Herzblatt"],["splot-15","Oval"],["splot-16","Versetzt lanzettlich"],
["splot-17","Species"],["splot-18","Kn verdichtet"],["splot-19","Wirtelig nadelig"],
["splot-20","Aa"],["splot-21","Aaaaaaa"],["splot-22","Aa achalensis"],["splot-23","Aa argyrolepis"],
["splot-24","Aa aurantiaca"],["splot-25","Aa calceata"],["splot-26","Achnella"],
["splot-27","Aa colombiana"],["splot-28","Aa denticulata"],["splot-29","Aa erosa"],["splot-30","Aa fiebrigii"]] 

So the request is perfectly fine, but the API returns the same table as above, with the names moved up.
This seems to be an issue with the API rather than with the query.
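
One way to spot this kind of shift is to compare, ID by ID, the name that was submitted with the name that came back. The following is a minimal sketch in base R. The helper `check_id_alignment()` and the toy data are hypothetical and only mimic the pattern reported above; they are not real TNRS output.

```r
# Hypothetical helper: for each submitted ID, compare the submitted name
# with the Name_submitted returned under the same ID.
check_id_alignment <- function(submitted, returned) {
  merged <- merge(
    submitted[, c("ID", "Name_submitted")],
    returned[, c("ID", "Name_submitted")],
    by = "ID", all.x = TRUE, suffixes = c("_in", "_out")
  )
  merged$status <- ifelse(
    is.na(merged$Name_submitted_out), "missing",
    ifelse(merged$Name_submitted_in == merged$Name_submitted_out,
           "ok", "shifted")
  )
  # Keep only the rows where something went wrong
  merged[merged$status != "ok", ]
}

# Toy input: the second name is an empty string
toy_in <- data.frame(
  ID = c("id-1", "id-2", "id-3"),
  Name_submitted = c("Polygala", "", "Species"),
  stringsAsFactors = FALSE
)

# Toy result imitating the reported behaviour: "Species" moved up to "id-2"
toy_out <- data.frame(
  ID = c("id-1", "id-2"),
  Name_submitted = c("Polygala", "Species"),
  stringsAsFactors = FALSE
)

problems <- check_id_alignment(toy_in, toy_out)
print(problems)
```

With the data from the reprex, the equivalent call would be `check_id_alignment(ex_data, ex_match)`, assuming the result keeps the `ID` and `Name_submitted` columns shown above.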

@ojalaquellueva
Member

@Rekyt @bmaitner As with #14 and #15, the source of this issue (and any potential fix) is almost certainly the perl controller. I really wanted to avoid messing with the controller, but re-assigning IDs is a serious issue. I'll take a look.

@Rekyt
Author

Rekyt commented Feb 14, 2023

Thank you again for your quick answers @ojalaquellueva!
And good luck with the controller...

@ojalaquellueva ojalaquellueva self-assigned this Feb 14, 2023
@ojalaquellueva ojalaquellueva added the bug Something isn't working label Feb 14, 2023
@ojalaquellueva
Member

ojalaquellueva commented Feb 16, 2023

@Rekyt I am still scoping this out. Assuming I can isolate and replicate the issue within the perl controller, I will transfer the issue to https://github.com/ojalaquellueva/TNRSbatch. In the meantime, you can bypass issues #14, #15 and #16 by processing your names as follows:

  1. Exclude any names that are all whitespace, NULL, NA, or an empty string
  2. Take the unique remaining names, extract them to a new data frame, and assign each one a new integer ID (e.g., "unique.name.ID")
  3. Submit the pre-processed names + unique.name.IDs to the TNRS
  4. After processing, transfer the results back to the original data frame by joining on the (now unique) Name_submitted in the TNRS results.
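
The four steps above could be sketched in base R roughly as follows. This is only an illustration: the `raw` data frame is a small excerpt of the reprex data, the call to `TNRS::TNRS()` is commented out because it needs network access, and `tnrs_result` is a hypothetical stand-in with a placeholder column, not real TNRS output.

```r
# Small excerpt of the reprex data, including the empty name
raw <- data.frame(
  ID = c("splot-4", "splot-5", "splot-6", "splot-17"),
  Name_submitted = c("Polygala", "", "Species", "Species"),
  stringsAsFactors = FALSE
)

# Step 1: drop names that are NA, empty, or whitespace only
# (a data frame column cannot hold NULL, so NA covers that case here)
keep <- !is.na(raw$Name_submitted) & nzchar(trimws(raw$Name_submitted))
valid <- raw[keep, ]

# Step 2: keep unique names and give each one a new integer ID
unique_names <- unique(valid$Name_submitted)
to_submit <- data.frame(
  unique.name.ID = seq_along(unique_names),
  Name_submitted = unique_names,
  stringsAsFactors = FALSE
)

# Step 3: submit the pre-processed names (requires network access)
# tnrs_result <- TNRS::TNRS(to_submit)

# Hypothetical stand-in so the sketch runs offline; NA is a placeholder
tnrs_result <- data.frame(
  Name_submitted = to_submit$Name_submitted,
  Name_matched = NA_character_,
  stringsAsFactors = FALSE
)

# Step 4: join the results back onto the original rows by name;
# all.x = TRUE keeps the excluded (empty) names, with NA results
final <- merge(raw, tnrs_result, by = "Name_submitted", all.x = TRUE)
print(final)
```

Joining on `Name_submitted` rather than on the ID means the original IDs never pass through the service, so they cannot be reassigned there.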

@ojalaquellueva
Member

@bmaitner See my recommendation above. This is how I have always pre-processed names, and explains why I never noticed the issues spotted by @Rekyt. For now, I suggest that both of us add these pre-processing recommendations to our documentation.

@Rekyt
Author

Rekyt commented Feb 16, 2023

Thanks for the very detailed process! Will follow right away :)

@bmaitner
Collaborator

Added a note to the readme. Thanks for catching this, @Rekyt
