Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-3026: spatial index per graph and kryo serialization #3027

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Aklakan
Copy link
Contributor

@Aklakan Aklakan commented Feb 21, 2025

GitHub issue resolved #3026

Pull request Description: Adds support for spatial index per graph and kryo serialization.

  • It will take some time to clean this PR and disentangle it with some unrelated changes.

  • Tests are included.
  • Documentation change and updates are provided for the Apache Jena website
  • Commits have been squashed to remove intermediate development commit messages.
  • Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

@Aklakan Aklakan marked this pull request as draft February 21, 2025 20:46
@afs
Copy link
Member

afs commented Feb 26, 2025

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 3 times, most recently from 0829aa5 to 01bad5d Compare February 26, 2025 12:51
@Aklakan
Copy link
Contributor Author

Aklakan commented Feb 26, 2025

It should build now. Its about 90% done. One issue is, that @LorenzBuehmann used ShapeSerde from sedona-spark which actually only does JTS geometry de-/serialization. Probably it is faster than other approaches (otherwise it wouldn't exist) - but it causes enforcer conflicts and draws in quite a lot of dependencies for what it does.

Right now I switched to JTS native WKB serialization which makes the build work normally (in principle, the geometry-serializer classname could be made part of a spatial index metadata header - such a header does not exist yet).
I am waiting for input from Lorenz about the performance differences between the serializers.

I also added a little compatibility layer for the old index IO which used java serialization. So old index files should continue to load. Saving indexes will use the new serializer.

@afs
Copy link
Member

afs commented Feb 26, 2025

dependencies

The dependencies will need to be checked - kryo is 3-clause BSD so that has implications.

(I haven't looked at the new dependencies, just kyro.)

Apache Sedona has some entries in its LICENSE
https://github.com/apache/sedona/blob/ebd6f67c5029554055db6b81ffba5d3ecaad3b62/LICENSE

enforcer conflicts

These are a nuisance because they are not guaranteed to be maintained by dependency updates. 😞

@Aklakan
Copy link
Contributor Author

Aklakan commented Feb 26, 2025

About performance differences, @SimonBin found this article: https://kontinuation.one/posts/2022/12/improving-geometry-serde-performance-of-apache-sedona/

image

This suggests that Sedona's ShapeSerde is generally a lot faster. Not sure if the current implementation is perhaps already even one of the optimized variants.

@Aklakan
Copy link
Contributor Author

Aklakan commented Feb 26, 2025

Some notes about the remaining issues I am working on to resolve.

I am a bit worried that a hard dependency on sedona-spark may cause too many issues, so I am trying to decouple the geometry serde (serializer/deserializer) from the rest of the spatial-index serde.

For this, I added a plain text JSON header field to the spatial index format with a field for the geometry serde. This field can be read and used to configure the remaining kryo serde. The index reader makes a lookup with Class.forName() and tries to create an instance via the default constructor.

With this approach it would be possible to use the slower JTS-based geometry serialization for now and have a compatible upgrade path to a faster approach at a later point.

Perhaps the geometrySerde header field is not needed, because I see that in Sedona the implementation is serializable and apparently their approach is to just read/write the geometry serde instance out using plain Java serialization during the kryo serialization process. So I need to check what is the best way, but something along these lines should work to support the same spatial-index serde with different geometry serdes in a compatible way.

This snippet is an example for how the header could be set up:

// SpatialIndexIoKryo.java

    public static void writeToOutputStream(OutputStream os, SpatialIndexPerGraph index) {
        // geometrySerde could later be switched to sedona's ShapeSerde.
        // (Need to check whether I can readily reuse an interface
        // from JTS/Sedona instead of my own GeometrySerdeAdapter)
        GeometrySerdeAdapter geometrySerde = new GeometrySerdeAdapterJtsWkb(); 

        JsonObject header = new JsonObject();
        header.addProperty("type", "jena-spatial-index");
        header.addProperty("version", "2.0.0");
        header.addProperty("geometrySerde", geometrySerde.getClass().getName());

        Kryo kryo = new Kryo();
        JtsKryoRegistrator.registerClasses(kryo, geometrySerde);
        try (Output output = new Output(os)) {
            writeJson(output, header);
            writeIndex(kryo, output, index);
            output.flush();
        }

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 2 times, most recently from b0a9b8b to ae4ba13 Compare February 27, 2025 21:00
Comment on lines 31 to 32
/* This file is an adapted copy of org.locationtech.jts.index.strtree.IndexSerde. */
/* The change is, that geometry read/write is delegated to a GeometrySerdeAdapter instance. */
Copy link
Contributor Author

@Aklakan Aklakan Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is copy&adapt from sedona-common 1.7.0 (latest)- they actually used this package hack to access non-visible stuff. in the long run, jts should fix this design.

@Aklakan
Copy link
Contributor Author

Aklakan commented Feb 27, 2025

So I had to copy and adapt a total of 3 1 classes from sedona-common:

They all have an Apache V2 header - so it should be fine?
I also excluded nearly all dependencies from sedona-commons and the tests work.

@Aklakan
Copy link
Contributor Author

Aklakan commented Feb 28, 2025

The spatial index does not store any geometries - just their envelopes and feature nodes - so geometry serialization performance is not an issue in the first place 🤦‍♂️ - but good having talked about it 😀

  • I managed to remove sedona as a dependency alltogether - only sedona's copy for the STRtree serializer (from JTS) is needed.
  • Kryo is needed.
  • I added tests for the corner case where the same feature exists with different geometries in different graphs.
    For the original spatial index implementation, some results are incorrect - for the new one they are correct.
  • The kryo serializer for RDF1.2 Nodes is now also tested.

@Aklakan Aklakan force-pushed the coypu-5.4.0 branch 2 times, most recently from 52317c9 to fcae47a Compare March 2, 2025 02:08
@Aklakan
Copy link
Contributor Author

Aklakan commented Mar 2, 2025

I now also added the benchmark that evaluates loading a spatial index from disk - so w.r.t. index loading during server startup, the kryo version is (still) significantly faster.

Benchmark                   (param0_jenaVersion)  (param1_geometryMixes)  Mode  Cnt  Score   Error  Units
BenchmarkSpatialIndex.load               current                    1000  avgt    5  0.002 ± 0.001   s/op
BenchmarkSpatialIndex.load               current                   10000  avgt    5  0.006 ± 0.001   s/op
BenchmarkSpatialIndex.load               current                  100000  avgt    5  0.064 ± 0.004   s/op

BenchmarkSpatialIndex.load                 5.1.0                    1000  avgt    5  0.062 ± 0.009   s/op
BenchmarkSpatialIndex.load                 5.1.0                   10000  avgt    5  0.565 ± 0.471   s/op
BenchmarkSpatialIndex.load                 5.1.0                  100000  avgt    5  7.439 ± 0.442   s/op

A geometry mix comprises 8 different geometry types (point, linestring, linearring, polygon, multipoint, multilinestring, multipolygon, geometrycollection). So the actual number of envelopes in the spatial index is 8 x geometryMixes.

@Aklakan
Copy link
Contributor Author

Aklakan commented Mar 3, 2025

@afs What would be the best way to add a Fuseki module for re-building the index for selected graphs? In an ideal world, the spatial index would be synchronized with the dataset, but for the time being we created a little Fuseki-Mod, where one can select the graphs for which to rebuild the index. Back then we forked your afs/fuseki-mods git repo - but this no longer exists. At first glance, it seems they were moved to jena-fuseki2/jena-fuseki-main/src/main/java/org/apache/jena/fuseki/mod/*. I suppose the mod could be added to jena-geosparql, but I think the geosparql support should be just a fuseki mod (IIRC jena-geosparql is not a mod) - perhaps with a bundle release for convenience, but not as a fork - so perhaps there is a better way to handle such a mod.

This is what the spatial index rebuilder looks like. The html is currently a single self-contained file (ai generated) - so no extra frontend build is needed.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spatial Index improvements (index-per-graph + kryo)
3 participants