My gut feeling is that most of the checks will need a combination of SPARQL and Python logic to implement (efficiently at least, or at all).
I'd originally imagined the EDAM verification would be a series of QC steps - i.e. a set of invoked scripts or queries - which would return 0 (no error) or 1, 2 or 3 (INFO, WARN or ERROR), with ERROR causing a build fail. But now I'm not sure ...
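For concreteness, the return-code contract above could be as simple as this (a sketch only - the names `NOERR`/`INFO`/`WARN`/`ERROR` and the `build_fails` helper are illustrative, not an agreed API):

```python
# Sketch of the per-check return-code contract described above:
# 0 = no error, 1 = INFO, 2 = WARN, 3 = ERROR; only ERROR fails the build.
NOERR, INFO, WARN, ERROR = 0, 1, 2, 3

def build_fails(check_results):
    """The build fails iff any check reported ERROR (3)."""
    return max(check_results, default=NOERR) == ERROR

print(build_fails([NOERR, WARN]))   # a WARN alone does not fail the build
print(build_fails([INFO, ERROR]))   # a single ERROR fails it
```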
Just loading EDAM.owl into a graph (without any subsequent processing) takes under a minute on my (very fast) workstation. But if that is multiplied by 30 (now, and more in future) then this doesn't scale so well. But is this OK?
If it's not OK, we might instead need a single overarching script (denoted as `src/edamverify.py` here) which returns 0-3 (as above) and which invokes the individual QC checks - collating their return values & validation outputs into a single error file. That would allow a single load function - but implies either (or perhaps a combination of):
- a monolithic Jupyter notebook
- a modular Python structure / library
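To make the "single load function" idea concrete, here is a rough sketch of what `src/edamverify.py` could look like. Everything below is hypothetical: the two checks are placeholders rather than real EDAM checks, and `load_graph` is stubbed so the sketch is self-contained (in practice it would parse EDAM.owl once, e.g. into an rdflib graph):

```python
# Sketch of src/edamverify.py: load the ontology once, run every QC
# check against the shared graph, collate severities into one report,
# and return the worst severity (0-3) as the overall exit status.

def load_graph():
    # Stub. In practice this is the slow step (parsing EDAM.owl into a
    # graph), done exactly once instead of once per check.
    return {"concepts": ["topic_0003", "operation_0004"]}

def check_nonempty(graph):
    # Placeholder check: ERROR (3) if the ontology has no concepts.
    return (3, "no concepts found") if not graph["concepts"] else (0, "ok")

def check_id_format(graph):
    # Placeholder check: WARN (2) on odd-looking identifiers.
    bad = [c for c in graph["concepts"] if "_" not in c]
    return (2, f"odd ids: {bad}") if bad else (0, "ok")

def main():
    graph = load_graph()                       # single load
    checks = [check_nonempty, check_id_format]
    results = [(c.__name__, *c(graph)) for c in checks]
    for name, code, msg in results:            # collated report
        print(f"{name}: {code} ({msg})")
    return max(code for _, code, _ in results)

exit_code = main()  # 0-3; in CI, ERROR (3) would fail the build
```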
My other gut feeling is that (especially for queries that aren't amenable to SPARQL) we'll need convenience functions which can be reused by multiple checks. Which then leads us into the territory of writing an EDAM Python library - which is something I've been mulling for a while and could be extremely useful in its own right.
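One shape such a library could take (again purely illustrative - the `EdamOntology` class, its methods and the triple format are made up for the sake of the sketch): expensive indexes over the graph get built once, lazily, and are then shared by every check that needs them.

```python
from functools import cached_property

class EdamOntology:
    """Hypothetical wrapper an EDAM Python library might provide:
    expensive indexes are built once and reused by multiple checks."""

    def __init__(self, triples):
        # triples: (subject, predicate, object) tuples, e.g. from a parse.
        self.triples = triples

    @cached_property
    def labels(self):
        # Built on first access, then cached and shared by all checks.
        return {s: o for s, p, o in self.triples if p == "rdfs:label"}

    def label_of(self, concept):
        return self.labels.get(concept)

onto = EdamOntology([
    ("topic_0003", "rdfs:label", "Topic"),
    ("operation_0004", "rdfs:label", "Operation"),
])
print(onto.label_of("topic_0003"))  # → Topic
```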
What do you think? I will continue experimenting to see what can be SPARQLed, but before investing very heavily in time, I want us to discuss and agree on a sensible architecture that is efficient in the long term.
opinions please @hmenager @albangaignard @hansioan @matuskalas @veitveit
Cheers!