Reproducibility of WAT results #458
Comments
Thank you for using GERBIL 🙂 I hope we can find the difference together 👍

For A2KB, we send a request to "https://wat.d4science.org/wat/tag/tag". Apart from the document text and our API key, we do not use any additional parameters. I assume that the difference comes from how we make use of confidence scores: we choose the confidence threshold that gives us the best Micro F1 score. You can find the chosen threshold in the "confidence threshold" column of the results. If you forward the confidence scores, too, you should achieve the same results.

The received Wikipedia article title is used to directly create a DBpedia IRI. With our sameAs retrieval approach described in our journal paper, we should end up with a set of IRIs including both the DBpedia and the Wikipedia IRIs.

I hope that this issue didn't consume a lot of your time. Please let us know if you think that the behavior of GERBIL is unreasonable and should be changed or improved. 🙂
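Roughly, the selection works like the following simplified sketch; `micro_f1` here stands in for GERBIL's actual matching and scoring code, so the names are illustrative rather than the real implementation:

```python
# Simplified illustration: pick the confidence threshold that maximizes Micro F1.
# `micro_f1` is a stand-in for GERBIL's real evaluation code, not actual GERBIL code.
def best_threshold(scored_docs, gold_docs, micro_f1):
    # scored_docs: per document, a list of (annotation, confidence score) pairs
    candidates = sorted({score for doc in scored_docs for _, score in doc})
    best_t, best_f1 = 0.0, -1.0
    for t in [0.0] + candidates:
        # keep only annotations whose score reaches the candidate threshold
        filtered = [[a for a, s in doc if s >= t] for doc in scored_docs]
        f1 = micro_f1(filtered, gold_docs)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```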
Thanks a lot for the quick reply! My first intuition is that setting the confidence threshold individually for each benchmark gives systems like WAT, which delegate the task of finding a good confidence threshold to the user, an unfair advantage over other systems. I personally don't think it would be unfair to take the results that the API outputs as they are, without any filtering at all, since these are the results a user can expect if they don't do any additional tweaking. Right now, the results are the upper bound of what a user can expect from the linker (without changing the API parameters).

It's an interesting problem and very relevant for me, as I'm currently writing an analysis and comparison of different entity linkers, so I also need to figure out how best to deal with this... I would love to hear your point of view on it!

Again, thank you for the quick reply and clarifications, it really spared me a headache!
Yes, I am also slightly unhappy with the way we implemented the comparison. I think that we could offer much more information and insights to the user about the confidence scores and their impact on the evaluation scores. While I agree with your negative points (results become an upper bound; the comparison can be seen as unfair since we use our knowledge about the test set gold standard to find the confidence threshold), I would like to point out that previous works had a "barrier" between systems with and without confidence scores, and we tried to get rid of this separation. I also think that the confidence score is actually a nice additional feature. On the other hand, I understand the argument that a user may not make use of it 😉

With respect to your comparison of linkers, I guess the main goal has to be the fairness of the comparison. There could be different ways to handle it (I do not know the exact context of your work, so my suggestions might be wrong 😅):
Your work sounds very interesting and I would like to know more about it. Feel free to write me a mail if you have questions or if you would like to discuss how we could support your work.
Thanks a lot for your input on this! One more aspect of this: the only reference I found in the GERBIL wiki was here: https://github.com/dice-group/gerbil/wiki/Experiment-types. Please let me know if I missed anything; I only skimmed through the papers and documentation. Given that this behavior is, from what I can tell, currently not well documented, my assumption is that authors often (or at least sometimes) are not aware that if they provide a score to GERBIL, it will be used to tune the results using knowledge about the test data (please let me know if you think otherwise). I think it would help a lot if you made it clearer in GERBIL's documentation how provided scores are used. However, I still think it would be fairer to force systems (or the user) to decide on a threshold (I agree that making no use of the scores at all can also be problematic).

Thanks a lot for your recommendations, and I'll definitely get back to your offer when more questions come up! I will probably do something along the lines of 2.3: use a fixed threshold based on the recommended threshold from the paper or API documentation and then compare it to the upper-bound results, roughly as in the sketch below.
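A rough sketch of what I mean, assuming annotations come back as `(span, uri, score)` tuples; the threshold value is only a placeholder, since I'd take the actual number from the WAT paper or API documentation:

```python
# Use a fixed, pre-chosen confidence threshold instead of tuning it on the test set.
RECOMMENDED_THRESHOLD = 0.3  # placeholder; take the real value from the paper/API docs

def filter_by_threshold(annotations, threshold=RECOMMENDED_THRESHOLD):
    # annotations: list of (span, uri, score) tuples returned by the linker
    return [(span, uri, score) for span, uri, score in annotations if score >= threshold]
```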
|
Dear authors,
First of all, thank you for the great work you do in making entity linking results more comparable.
My question is specifically about GERBIL's WAT annotator:
I get different results when selecting WAT as an annotator in the A2KB task versus when I use my own NIF API which simply forwards requests from GERBIL to the official WAT API.
My setup is as follows:
I built my own NIF API which forwards the text GERBIL posts to the WAT API at https://wat.d4science.org/wat/tag/tag.
I do not provide any additional parameters to the WAT API.
I take the result from the WAT API, extract the span start and end from the `start` and `end` fields, and the entity title from the `title` field. I create an entity URI from the title in Python, roughly as in the sketch below.
Then I send the span and the entity URI back to GERBIL.
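For illustration, a stripped-down sketch of the forwarding step; the request parameters (`text`, `gcube-token`), the response fields (`annotations`, `rho`), and the Wikipedia-URL construction are my assumptions here and may differ slightly from my actual code and from the official WAT API:

```python
# Sketch only: parameter names, response fields, and the URI construction below
# are assumptions and may need adjusting against the official WAT API documentation.
import requests
from urllib.parse import quote

WAT_URL = "https://wat.d4science.org/wat/tag/tag"
API_KEY = "<gcube token>"  # placeholder

def annotate(text):
    resp = requests.get(WAT_URL, params={"text": text, "gcube-token": API_KEY})
    resp.raise_for_status()
    results = []
    for ann in resp.json().get("annotations", []):
        span = (ann["start"], ann["end"])   # character offsets of the mention
        title = ann["title"]                # predicted Wikipedia article title
        # build an entity URI from the title (one possible construction)
        uri = "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))
        results.append((span, uri, ann.get("rho")))  # rho = WAT's confidence score
    return results
```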
The results I get using this approach differ from those I get when simply selecting WAT as annotator in GERBIL. On KORE50 for example, I get a Micro InKB F1 score of 0.5512 using my NIF API and 0.5781 when selecting WAT as annotator.
See this experiment: http://gerbil.aksw.org/gerbil/experiment?id=202409170001
I was wondering if GERBIL sets any additional parameters in the call to the API or filters the returned entities by score using a threshold. Looking at the GERBIL code, I didn't see any of that though.
Can you confirm that GERBIL does not use additional API parameters and does not filter results by score? This would already help me to narrow down the problem.
I just realized that the results for the recognition task are the same, so the problem might be in the URI matching.
How exactly does GERBIL create URIs from the Wikipedia titles predicted by WAT?
Any other hints to where this discrepancy could come from are highly appreciated.
Many thanks in advance!