
Speed up parsing of annotations (XMLs) #70

Merged: 1 commit merged into DIAGNijmegen:main on Oct 27, 2024

Conversation

@leandervaneekelen (Collaborator) commented on Oct 24, 2024

Hi @martvanrijthoven,

I have some huge XML files in which I store my slide-level detection inference (>100,000 cells/detections). I noticed that loading these from ASAP-formatted XMLs can take a very long time, and that loading them from JSONs takes about as long (I was under the impression that converting the XMLs to JSON via scripts/convert_asapxml_to_json.py would result in a speed-up, but perhaps not at the scale I am working at).

Here is a code snippet for making a large annotations file:

# Generate a random whole-slide annotation file with 100,000 point annotations
import random

from shapely.geometry import Point
from wholeslidedata.annotation.types import PointAnnotation
from wholeslidedata.annotation.labels import Label
from wholeslidedata.interoperability.asap.annotationwriter import write_asap_annotation

seed = 0
random.seed(seed)  # fixed seed for reproducibility

n_points = int(1e5)
points = (Point(random.randint(0, 100), random.randint(0, 100)) for _ in range(n_points))
label = Label("0", 0)
annotations = [PointAnnotation(i, label, p) for i, p in enumerate(points)]
write_asap_annotation(annotations, "/tmp/random_annotations.xml")
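As a quick sanity check (my addition for illustration; I am assuming WholeSlideAnnotation exposes the loaded annotations via its annotations attribute), the generated file can be read back and counted:

from wholeslidedata.annotation.wholeslideannotation import WholeSlideAnnotation

wsa = WholeSlideAnnotation("/tmp/random_annotations.xml")
print(len(wsa.annotations))  # expected: 100000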

I did some profiling and found that, during initialization of the WholeSlideAnnotation object, a significant amount of time is spent inserting points into the rtree used for WholeSlideAnnotation.select_annotation calls:

# Profile construction of a WholeSlideAnnotation from the generated XML
from wholeslidedata.annotation.wholeslideannotation import WholeSlideAnnotation
import cProfile
import pstats
from pstats import SortKey

cProfile.run('WholeSlideAnnotation("/tmp/random_annotations.xml")', filename='profile_baseline')
p = pstats.Stats('./profile_baseline')
p.strip_dirs().sort_stats(SortKey.TIME).print_stats()

         14714081 function calls (14714075 primitive calls) in 27.252 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   161690   11.174    0.000   14.100    0.000 index.py:415(insert)
   161690    2.093    0.000    2.130    0.000 types.py:76(__init__)
   161690    1.354    0.000    1.438    0.000 point.py:196(geos_point_from_py)
   161690    1.178    0.000    1.939    0.000 index.py:339(get_coordinate_pointers)
   161690    0.829    0.000    1.894    0.000 coords.py:69(__getitem__)
   323380    0.518    0.000    0.555    0.000 coords.py:44(_update)
   323380    0.511    0.000    0.723    0.000 index.py:1558(get_dimension)
   161690    0.492    0.000    4.250    0.000 types.py:128(bounds)
   161690    0.436    0.000    0.581    0.000 predicates.py:23(__call__)
   161691    0.387    0.000    0.538    0.000 index.py:1537(get_index_type)
        1    0.379    0.379   18.730   18.730 selector.py:21(__init__)
   161690    0.379    0.000    0.632    0.000 coords.py:48(__len__)
   485070    0.364    0.000    0.364    0.000 base.py:224(empty)
   161690    0.360    0.000    0.360    0.000 {built-in method numpy.array}
        1    0.344    0.344    7.294    7.294 parser.py:101(parse)
   323380    0.321    0.000    0.321    0.000 base.py:69(geometry_type_name)
   161690    0.321    0.000    1.010    0.000 base.py:696(is_empty)
        6    0.306    0.051    0.333    0.056 parser.py:108(<listcomp>)

With some digging I found that you can instantiate an rtree index from a stream of items and that this offers a significant speed-up (a minimal sketch of the idea follows the new profile below):

Thu Oct 24 14:26:58 2024    ./profile

         13743958 function calls (13743952 primitive calls) in 13.411 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   161690    1.269    0.000    1.346    0.000 point.py:196(geos_point_from_py)
   161690    1.094    0.000    1.111    0.000 types.py:76(__init__)
   161691    0.924    0.000    5.288    0.000 index.py:1239(py_next_item)
   161690    0.671    0.000    1.519    0.000 coords.py:69(__getitem__)
        2    0.583    0.291    5.871    2.935 index.py:1410(__init__)
   161696    0.429    0.000    0.495    0.000 labels.py:25(__init__)
   323380    0.416    0.000    0.451    0.000 coords.py:44(_update)
   323381    0.406    0.000    0.406    0.000 __init__.py:506(cast)
        1    0.377    0.377    6.469    6.469 parser.py:101(parse)
   161690    0.353    0.000    3.281    0.000 types.py:128(bounds)
   161690    0.347    0.000    0.421    0.000 index.py:1182(deinterleave)
   323380    0.323    0.000    0.323    0.000 base.py:69(geometry_type_name)
   161690    0.315    0.000    0.438    0.000 predicates.py:23(__call__)
   161690    0.309    0.000    0.805    0.000 labels.py:133(_label_from_dict)
        6    0.304    0.051    0.330    0.055 parser.py:108(<listcomp>)
   161690    0.292    0.000    5.321    0.000 types.py:64(create)
   161690    0.288    0.000    0.507    0.000 coords.py:48(__len__)
   485070    0.283    0.000    0.283    0.000 base.py:224(empty)
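For context, here is a minimal sketch of the stream-based construction. It illustrates the idea rather than the exact diff in this PR, and it assumes each annotation exposes a shapely-style bounds tuple, as types.py:128(bounds) in the profiles suggests:

from rtree import index

def build_rtree(annotations):
    # rtree's bulk-loading path expects (id, (xmin, ymin, xmax, ymax), obj) tuples
    def stream():
        for i, annotation in enumerate(annotations):
            yield (i, annotation.bounds, None)

    # Baseline: one insert() call per annotation (the index.py:415(insert) hotspot):
    #   idx = index.Index()
    #   for i, annotation in enumerate(annotations):
    #       idx.insert(i, annotation.bounds)

    # Optimized: hand the constructor a stream, which triggers bulk loading
    # (the index.py:1410(__init__) entry in the profile above):
    return index.Index(stream())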

Now the biggest time sink is constructing GEOS objects in Shapely.

Some benchmarking with timeit confirms the optimization boost:

from wholeslidedata.annotation.wholeslideannotation import WholeSlideAnnotation
%timeit -r 5 -n 5 WholeSlideAnnotation("/tmp/random_annotations.xml")
  • original: 11.3 s ± 457 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
  • optimized: 6.71 s ± 136 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

Cool crisp 40% speedup 😎

Let me know what you think! By the way, I read that Shapely 2.0 contains a lot of optimizations, so we can probably get this time down even further. Do you know what the biggest hurdle would be for moving to Shapely 2.0 in WSD? Is this even feasible/desirable?
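To illustrate why Shapely 2.0 could help here (a hypothetical example, not part of this PR): it can construct geometries in bulk from NumPy arrays, which avoids the per-object construction cost that shows up as point.py:196(geos_point_from_py) in the profiles above.

import numpy as np
import shapely  # requires shapely >= 2.0

rng = np.random.default_rng(0)
coords = rng.integers(0, 100, size=(100_000, 2))

# One vectorized call creates all 100k points at once,
# instead of 100k separate GEOS constructions
points = shapely.points(coords)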

@coveralls

Pull Request Test Coverage Report for Build 11500271529

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file is covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.02%) to 72.611%

Totals Coverage Status
Change from base Build 11386276311: -0.02%
Covered Lines: 2219
Relevant Lines: 3056

💛 - Coveralls

@martvanrijthoven (Collaborator)

Hi Leander,

This is amazing, a very nice speedup. Thank you so much!
I think there are no hurdles to moving to Shapely 2.0, and I think it should already work.

@martvanrijthoven merged commit 41e501c into DIAGNijmegen:main on Oct 27, 2024
2 of 3 checks passed