Speed up parsing of annotations (XMLs) #70
Merged
Hi @martvanrijthoven,
I have some huge XML files in which I store my slide-level detection inference (>100,000 cells/detections). I noticed that loading these from ASAP-formatted XMLs can take a very long time, and that loading them from JSONs takes about as long. (I was under the impression that converting the XMLs to JSON via `scripts/convert_asapxml_to_json.py` would result in a speed-up, but perhaps not at the scales I am working at.)

Here is a code snippet for making a large annotations file:
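A minimal sketch of such a generator, using only the standard library. The element and attribute names (`ASAP_Annotations`, `Annotation`, `Coordinate`, `AnnotationGroups`, and so on) are assumptions based on the usual ASAP XML layout, and `write_dot_annotations`, the `cell` label, and the slide size are hypothetical names and values chosen for illustration:

```python
import random
import xml.etree.ElementTree as ET


def write_dot_annotations(path, n_points, slide_size=100_000, label="cell", seed=0):
    """Write n_points random 'Dot' annotations to an ASAP-style XML file.

    The XML layout is an assumption about the ASAP format, not taken from WSD.
    """
    rng = random.Random(seed)
    root = ET.Element("ASAP_Annotations")
    annotations = ET.SubElement(root, "Annotations")

    for i in range(n_points):
        annotation = ET.SubElement(
            annotations,
            "Annotation",
            Name=f"Annotation {i}",
            Type="Dot",
            PartOfGroup=label,
            Color="#F4FA58",
        )
        coordinates = ET.SubElement(annotation, "Coordinates")
        # A dot annotation carries a single coordinate.
        ET.SubElement(
            coordinates,
            "Coordinate",
            Order="0",
            X=f"{rng.uniform(0, slide_size):.2f}",
            Y=f"{rng.uniform(0, slide_size):.2f}",
        )

    # Declare the single annotation group that all dots belong to.
    groups = ET.SubElement(root, "AnnotationGroups")
    group = ET.SubElement(groups, "Group", Name=label, PartOfGroup="None", Color="#F4FA58")
    ET.SubElement(group, "Attributes")

    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
```

Calling `write_dot_annotations("large_annotations.xml", 100_000)` should produce a file at roughly the scale described above.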
I did some profiling and found that, during initialization of the `WholeSlideAnnotation` object, a significant amount of time is spent inserting points into the rtree used for `WholeSlideAnnotation.select_annotation` calls.

With some digging I found out that you can instantiate an `rtree.index` from a stream, and that this offers a significant speed-up. Now the biggest time sink is initializing GEOS objects in Shapely.
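The difference between the two ways of filling the index can be sketched as follows. This is an illustration against the `rtree` package only, not the actual change to `WholeSlideAnnotation`; `make_points`, `build_by_insert`, and `build_from_stream` are hypothetical helper names, and the points are random stand-ins for detections:

```python
import random

from rtree import index


def make_points(n, size=100_000, seed=0):
    """Generate n random (x, y) points as stand-ins for cell detections."""
    rng = random.Random(seed)
    return [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n)]


def build_by_insert(points):
    """Fill the index one item at a time."""
    idx = index.Index()
    for i, (x, y) in enumerate(points):
        # A point is stored as a degenerate box: (minx, miny, maxx, maxy).
        idx.insert(i, (x, y, x, y))
    return idx


def build_from_stream(points):
    """Bulk-load the index by passing a generator to the constructor."""

    def stream():
        for i, (x, y) in enumerate(points):
            # Each item is (id, bounds, obj); obj is unused here.
            yield (i, (x, y, x, y), None)

    return index.Index(stream())
```

Both builders should answer `intersection` queries identically; only the construction cost differs. Wrapping each call in `timeit.timeit(..., number=...)` is one way to compare them on your own data.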
Some benchmarking with timeit confirms the optimization boost:
Cool crisp 40% speedup 😎
Let me know what you think! Btw, I read that Shapely 2.0 contains a lot of optimizations, so we can probably get this time down even further. Do you know what the biggest hurdle would be for moving to Shapely 2.0 in WSD? Is this even feasible/desirable?