Experiment with supporting eXist-db #1570
Comments
Find paragraphs containing the word "paphlagonian" marked as a noun:

```xquery
xquery version "3.1";
for $par in collection("/db/books")//p[.//noun[starts-with(lower-case(.), "paphlagonian")]]
return <result doc="{util:document-name(root($par))}">{$par}</result>
```

But this returns two results for the same document.
|
I had a similar query:

```xquery
<regions>
  <h1>{ count(collection("...?select=Med_*.xml")//region) } regions defined </h1>
  <ul> {
    for $med in collection("...?select=Med_*.xml")/meditatio
    order by number($med/@id)
    return
      <li> Meditatio {data($med/@id)}
        <ul>
        {
          for $reg in $med//region
          return <li id="{data($reg/@id)}"> {data($reg/@name)} </li>
        }
        </ul>
      </li>
  }</ul>
</regions>
```

I just iterate over the documents' root elements (one element per document). However, per root element there could still be several occurrences of the same element.
|
What about group by? https://stackoverflow.com/questions/14030255/xquery-group-by-and-count |
Thanks, group by does it!

```xquery
xquery version "3.1";
for $par in collection("/db/books")//p[.//noun[starts-with(lower-case(.), "paphlagonian")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>
```
|
Search for "paphlagonian" as an adjective and "soul" as a noun in the same paragraph:

```xquery
xquery version "3.1";
for $par in collection("/db/books")//p[.//adj[starts-with(lower-case(.), "paphlagonian")] and .//noun[starts-with(lower-case(.), "soul")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>
```
|
Uploading books to eXist-db:
|
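For reference, one way to load a document into a collection from within eXist itself is the xmldb:store function. This is only a sketch; the collection, file name, and content below are placeholders:

```xquery
xquery version "3.1";

(: Sketch: create /db/books if it doesn't exist yet, then store one placeholder document into it. :)
let $collection :=
    if (xmldb:collection-available("/db/books"))
    then "/db/books"
    else xmldb:create-collection("/db", "books")
return
    xmldb:store($collection, "example.xml", <book><p>Placeholder content</p></book>)
```

xmldb:store also accepts string and binary content.
|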
With all the books uploaded, the query in #1570 (comment) takes 8 seconds. Knora did it in 1 second, using Lucene to optimise the query. I'm going to see if I can do a similar optimisation in eXist-db. |
It looks like eXist-db can use Lucene to optimise the query, but there's a limitation: you have to configure the Lucene index (in a configuration file), specifying the names of the XML elements whose content you want to index: http://exist-db.org/exist/apps/doc/lucene.xml

So for example, I could create an index on all the adj and noun elements. |
Or I guess I could just make an index on the root XML element, which is in effect what Knora does. |
Ah, wait a minute, these configuration "files" are actually XML documents. So it wouldn't be a problem to create one per project. |
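For illustration, a per-project index configuration along those lines might look roughly like this. It's a sketch based on the Lucene documentation linked above, assuming we index the adj and noun elements used in the queries in this thread:

```xml
<!-- Stored as collection.xconf under /db/system/config, mirroring the path of the
     target collection, e.g. /db/system/config/db/books/collection.xconf -->
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <!-- Index the text content of these elements so ft:query() can match them -->
            <text qname="adj"/>
            <text qname="noun"/>
        </lucene>
    </index>
</collection>
```

After storing or changing the configuration, the collection has to be reindexed for the index to take effect.
|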
With Lucene indexes on the adj and noun elements, these searches are much faster:
|
Searches optimised with project-specific Lucene indexes

Search for "paphlagonian" as an adjective:

```xquery
for $par in collection("/db/books")//p[.//adj[ft:query(., "paphlagonian")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>
```

Search for "soul" as a noun:

```xquery
for $par in collection("/db/books")//p[.//noun[ft:query(., "soul")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>
```

Search for "paphlagonian" as an adjective and "soul" as a noun in the same paragraph:

```xquery
for $par in collection("/db/books")//p[.//adj[ft:query(., "paphlagonian")] and .//noun[ft:query(., "soul")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>
```

Search for "full" as an adjective:

```xquery
for $par in collection("/db/books")//p[.//adj[ft:query(., "full")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>
```
|
So is it more efficient than Knora? |
Search for "full" and "Euchenor" in the same paragraph:

```xquery
for $par in collection("/db/books")//p[.//adj[ft:query(., "full")] and ..//noun[ft:query(., "Euchenor")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>
```
The query doesn't finish, and eXist becomes unresponsive.
|
eXist then eventually restarts. Aha, now I see that there is a bug in my query (..//noun instead of .//noun). |
Search for full and Euchenor in the same paragraph (fixed query): 1 result in 1.6 seconds:

```xquery
for $par in collection("/db/books")//p[.//adj[ft:query(., "full")] and .//noun[ft:query(., "Euchenor")]]
group by $doc := util:document-name(root($par))
return <result doc="{$doc}">{$par}</result>
```

This is the query that was very slow in Knora here: dhlab-basel/knora-large-texts#2 (comment)
|
Yes, in this use case:
To make this efficient in eXist, you have to configure a Lucene index specifically for each tag that you want to search for. Even so, it's not lightning-fast: a typical query takes 1-2 seconds. But this seems OK to me. Probably if I split up each book into many small fragments (a few pages each) and stored the texts in the triplestore that way, Gravsearch would be able to do these kinds of queries more efficiently. In practice, I think there are other good reasons for doing that (e.g. it makes it easier to display and edit the texts). |
If I split each book into small fragments (e.g. 1000 words each), I can make a structure where each fragment is a separate resource. |
Also, it takes a lot longer to import all this data into Knora than to import it into eXist. I'm importing the same books as in #1570 (comment) into Knora (fragmented), and it looks like it's going to take about 3 hours. |
To make the import faster, I added a config option to prevent Knora from verifying the data after it's written. |
I did some tests with the books split into 1000-word fragments, and it doesn't make the queries noticeably faster. |
Would the Lucene connector allow for more flexibility? |
Looking at that now. |
It looks like the GraphDB Lucene connector can't do this. It just indexes whatever strings you give it, but in this case we would want to index substrings at specific positions, and it doesn't seem to have that feature. The only way I can see to do this would be to store the substrings themselves in the standoff tags. But this would mean duplicating a lot of text in the triplestore, and would make importing data even slower. |
A drawback of XQuery is that I don't see any way to search within overlapping hierarchies. For example, given this document from one of our tests:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lg xmlns="http://www.example.org/ns1" xmlns:ns2="http://www.example.org/ns2">
    <l>
        <seg foo="x" ns2:bar="y">Scorn not the sonnet;</seg>
        <ns2:s sID="s02"/>critic, you have frowned,</l>
    <l>Mindless of its just honours;<ns2:s eID="s02"/>
        <ns2:s sID="s03"/>with this key</l>
    <l>Shakespeare unlocked his heart;<ns2:s eID="s03"/>
        <ns2:s sID="s04"/>the melody</l>
    <l>Of this small lute gave ease to Petrarch's wound.<ns2:s eID="s04"/>
    </l>
</lg>
```

I can't figure out any way to find the content between a pair of ns2:s milestones (e.g. the sentence delimited by sID="s03" and eID="s03").
|
I think I found an eXist-db function for this: |
That function seems to be buggy: |
More implementations here: |
In any case, I would expect this to be very slow, because you can't make a Lucene index for the content between two milestones, only for the content of an ordinary XML element. |
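To make the problem concrete, here is one hand-rolled way to pull out the text between two milestones, using document-order comparisons rather than an index. It's only a sketch: the storage path is hypothetical, and it doesn't handle nesting or missing end tags:

```xquery
xquery version "3.1";
declare namespace ns2 = "http://www.example.org/ns2";

(: Get the text between <ns2:s sID="s03"/> and <ns2:s eID="s03"/> in the sonnet
   document above. The document path is a placeholder. :)
let $doc   := doc("/db/test/sonnet.xml")
let $start := $doc//ns2:s[@sID = "s03"]
let $end   := $doc//ns2:s[@eID = "s03"]
return
    (: text nodes that follow the start milestone and precede the end milestone in document order :)
    string-join($doc//text()[. >> $start and . << $end], "")
```

A query like this has to walk the document's text nodes and can't benefit from a Lucene index, which is the concern raised above.
|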
Full-text search in different open-source XML databases:
This means that to support multiple XML databases, we would need to develop something like Gravsearch for XQuery, and generate the appropriate syntax for each database. There is also an XQuery and XPath Full Text 3.0 spec, but I haven't found any implementations. |
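As a rough illustration of the kind of translation such a layer would have to do, here is the same search in eXist's ft:query syntax and in the W3C Full Text "contains text" syntax. This is only a sketch; I haven't checked which databases accept the second form or how their behaviour differs:

```xquery
(: eXist-db: full-text search via the ft extension module :)
for $par in collection("/db/books")//p[.//noun[ft:query(., "soul")]]
return $par
```

```xquery
(: W3C XQuery and XPath Full Text syntax for the same search :)
for $par in collection("/db/books")//p[.//noun[. contains text "soul"]]
return $par
```
|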
MarkLogic is a commercial database server that supports both RDF and XML. You can mix SPARQL and XQuery in a single query: https://docs.marklogic.com/guide/semantics/semantic-searches#id_77935 |
@benjamingeer there is also an RDF+SPARQL plugin for eXist-db that integrates Apache Jena - https://github.com/ljo/exist-sparql You might also be interested in FusionDB as an alternative to eXist-db; it is 100% API compatible with eXist-db. |
Very interesting!
Thanks!
Lukas Rosenthaler
|