Bulk process many slides (e.g. 30k slides) #223

usuyama · 2022-09-26T04:02:37Z

Has anyone tried running HistoQC on PySpark/Databricks?

I'm thinking about ways to run HistoQC on a large dataset like all 30k slides from TCGA.

I guess it should work with some modifications (data-loading/library-versions/etc.), but wonder if anyone in the community has experience.

choosehappy · 2022-09-28T09:41:03Z

Interesting! As far as I know, not yet. That said, we were just awarded some additional funding specifically for scaling up our histotools suite (histoqc.com, patchsorter.com, quickannotator.com, cohortfinder.com). I think we'll end up using the Ray distributed computing framework, which is now sufficiently mature for this sort of thing :)

There has been some work with HistoQC on reading files from cloud storage: https://github.com/ap--/HistoQC/tree/feature/cloud-support-via-tiffslide#accessing-cloud-storage

If you have any thoughts/comments, I would love to hear them!
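
For anyone experimenting before a Ray- or Spark-based version exists, the core of a bulk run is splitting the slide list into batches and mapping a per-batch worker over them. The following is only a minimal sketch of that idea using the standard library's thread pool as a stand-in for a real cluster scheduler. `make_batches`, `run_batches`, and `fake_qc` are hypothetical names, and `fake_qc` does not call HistoQC; a real worker would invoke it on each batch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List


def make_batches(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` slide paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def run_batches(
    paths: Iterable[str],
    batch_size: int,
    qc_batch: Callable[[List[str]], Dict],
    max_workers: int = 4,
) -> List[Dict]:
    """Apply `qc_batch` to every batch in parallel; results keep batch order."""
    batches = list(make_batches(paths, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(qc_batch, batches))


def fake_qc(batch: List[str]) -> Dict:
    # Hypothetical placeholder: a real worker would run HistoQC on the batch
    # and return (or upload) its results.
    return {"n_slides": len(batch), "first": batch[0]}


if __name__ == "__main__":
    slides = [f"slide_{i:05d}.svs" for i in range(10)]
    for result in run_batches(slides, batch_size=4, qc_batch=fake_qc):
        print(result)
```

The same per-batch function could in principle be handed to a distributed scheduler instead of the thread pool; the batching logic would stay unchanged.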

ap-- · 2023-02-09T10:28:42Z

> I'm thinking about ways to run HistoQC on a large dataset like all 30k slides from TCGA.

A use case like this was exactly why I started prototyping running HistoQC directly from cloud storage. The idea was to prototype a pipeline for continuous quality monitoring. Since the slide scanners in this workflow would automatically upload the scanned images to cloud buckets anyway, it made sense to run the QC tests in the cloud too.

Sadly, it never got past the PoC linked in the fork above, due to lack of time on my end.
But I believe that once more of the legacy pathology slide formats are supported via tiffslide, it becomes a viable option to default to tiffslide instead of openslide for a cloud-native implementation of HistoQC.

Cheers,
Andreas

choosehappy · 2023-02-14T10:49:03Z

Very interesting, thanks for the information!

We're now in the process of hiring someone to do the scalability work mentioned above; I could also imagine a tiffslide-integrated version. There are increasingly other formats that we need to be able to support, like DICOM, so a generic, abstracted approach to loading WSIs can address a lot of these points.
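
To make the "generic abstracted approach" a bit more concrete, here is one possible shape for it: the QC code depends only on a small reader interface, and backends are registered per file extension. This is a sketch under assumptions, not HistoQC's actual design. `SlideReader`, `register_backend`, `open_slide`, and `DummyReader` are all hypothetical names, and `DummyReader` stands in for real openslide, tiffslide, or DICOM backends.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict, Protocol, Tuple


class SlideReader(Protocol):
    """Hypothetical minimal interface a QC pipeline could depend on."""

    @property
    def dimensions(self) -> Tuple[int, int]: ...

    def read_region(self, location: Tuple[int, int], size: Tuple[int, int]) -> bytes: ...


# Maps a lowercase file extension to a factory that opens that format.
_BACKENDS: Dict[str, Callable[[str], SlideReader]] = {}


def register_backend(extension: str, factory: Callable[[str], SlideReader]) -> None:
    """Associate a file extension (e.g. ".svs") with a reader factory."""
    _BACKENDS[extension.lower()] = factory


def open_slide(path: str) -> SlideReader:
    """Pick a backend from the file extension; works for local paths and URLs."""
    ext = PurePosixPath(path).suffix.lower()
    try:
        factory = _BACKENDS[ext]
    except KeyError:
        raise ValueError(f"no backend registered for {ext!r}") from None
    return factory(path)


class DummyReader:
    """Stand-in backend used only to make the sketch runnable."""

    def __init__(self, path: str) -> None:
        self.path = path

    @property
    def dimensions(self) -> Tuple[int, int]:
        return (1024, 768)

    def read_region(self, location: Tuple[int, int], size: Tuple[int, int]) -> bytes:
        width, height = size
        return bytes(width * height)  # zero-filled placeholder pixels


# A real setup would register format-specific readers here instead.
register_backend(".svs", DummyReader)
register_backend(".dcm", DummyReader)

if __name__ == "__main__":
    slide = open_slide("s3://bucket/example.svs")
    print(slide.dimensions, len(slide.read_region((0, 0), (4, 2))))
```

With this layout, adding a new format means registering one more factory; the QC modules themselves would not need to change.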
