qlever index for Wikidata doing nothing #139

Open · michaelbrunnbauer opened this issue Feb 20, 2025 · 9 comments

@michaelbrunnbauer

As indexing the olympics dataset works and as the Wikidata dump is much bigger than I thought (already taking up 25% of available disk space in compressed form), I suspect hardware requirements are not met.

Indexing just gets stuck from the start, though, and I can see no error messages.

-Output (top shows no CPU or I/O activity whatsoever):

Command: index

echo '{ "languages-internal": [], "prefixes-external": [""], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 5000000 }' > wikidata.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.wikidata adfreiburg/qlever -c 'ulimit -Sn 1048576; IndexBuilderMain -i wikidata -s wikidata.settings.json -f <(lbzcat -n 4 latest-all.ttl.bz2) -g - -F ttl -p true -f <(lbzcat -n 1 latest-lexemes.ttl.bz2) -g - -F ttl -f <(cat dcatap.nt) -g - -F nt --stxxl-memory 10G | tee wikidata.index-log.txt'

2025-02-20 13:23:49.889 - INFO: QLever IndexBuilder, compiled on Wed Feb 19 16:11:23 UTC 2025 using git hash caaf76
2025-02-20 13:23:49.890 - INFO: You specified "locale = en_US" and "ignore-punctuation = 1"
2025-02-20 13:23:49.890 - INFO: You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2025-02-20 13:23:49.890 - INFO: You specified "num-triples-per-batch = 5,000,000", choose a lower value if the index builder runs out of memory
2025-02-20 13:23:49.890 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2025-02-20 13:23:49.890 - INFO: Processing triples from 3 input streams ...
2025-02-20 13:23:49.891 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
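
For reference, besides watching top on the host, the container can also be inspected directly with standard Docker commands (nothing QLever-specific):

docker stats --no-stream qlever.index.wikidata   # one-shot CPU/memory/I/O counters for the container
docker top qlever.index.wikidata                 # processes running inside the container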

-Software:

Ubuntu 24.04.2 LTS
Python 3.12.3
qlever 0.5.18

-Hardware:

13th Gen Intel(R) Core(TM) i5-13500 with 64 GB RAM and a 512 GB NVMe SSD

-Steps to reproduce:

apt-get install build-essential python3.12-venv docker.io lbzip2 unzip
adduser qlever
usermod -aG docker qlever
su - qlever
python3 -m venv qlever
cd qlever
./bin/pip install qlever
mkdir wikidata
cd wikidata
../bin/qlever setup-config wikidata
../bin/qlever get-data
../bin/qlever index

@hannahbast
Member

@michaelbrunnbauer Can you please send the output of qlever system-info?

@michaelbrunnbauer
Author

Command: system-info

Show system information and Qleverfile

System Information

Version: 0.5.18 (qlever --version)
OS: Linux (Ubuntu 24.04.2 LTS)
Arch: x86_64
Host: ac
RAM: 62.6 GB total, 61.5 GB available
CPU: 14 Cores, 20 Threads @ 4.28 GHz
CWD: /home/qlever/qlever
Disk space in /dev/md2 @ / is (ext4): 300.80 GB free / 435.53 GB total
User and group on host: uid=1000(qlever) gid=1000(qlever) groups=1000(qlever),100(users),110(docker)
User and group in container: uid=1000(ubuntu) gid=1000(ubuntu) groups=1000(ubuntu)

Contents of Qleverfile

No Qleverfile found

@michaelbrunnbauer
Author

I sometimes get this instead of the hanging behaviour. Something very timing-sensitive must be going on?

Command: index

echo '{ "languages-internal": [], "prefixes-external": [""], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 5000000 }' > wikidata.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.wikidata adfreiburg/qlever -c 'ulimit -Sn 1048576; IndexBuilderMain -i wikidata -s wikidata.settings.json -f <(lbzcat -n 4 latest-all.ttl.bz2) -g - -F ttl -p true -f <(lbzcat -n 1 latest-lexemes.ttl.bz2) -g - -F ttl -f <(cat dcatap.nt) -g - -F nt --stxxl-memory 10G | tee wikidata.index-log.txt'

2025-02-20 16:29:17.368 - INFO: QLever IndexBuilder, compiled on Wed Feb 19 16:11:23 UTC 2025 using git hash caaf76
2025-02-20 16:29:17.368 - INFO: You specified "locale = en_US" and "ignore-punctuation = 1"
2025-02-20 16:29:17.368 - INFO: You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2025-02-20 16:29:17.368 - INFO: You specified "num-triples-per-batch = 5,000,000", choose a lower value if the index builder runs out of memory
2025-02-20 16:29:17.368 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2025-02-20 16:29:17.368 - INFO: Processing triples from 3 input streams ...
2025-02-20 16:29:17.369 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2025-02-20 16:29:18.323 - ERROR: Parse error at byte position 320: @prefix or @base directives need to be at the beginning of the file when using the parallel parser. Use '--parse-parallel false' if you can't guarantee this. If the reason for this error is that the input is a concatenation of Turtle files, each of which has the prefixes at the beginning, you should feed the files to QLever separately instead of concatenated
The next 500 bytes are:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wikibase: <http://wikiba.se/ontology#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
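
To inspect the region around the reported byte position yourself, something like this should work (the offsets just mirror the "position 320" and "next 500 bytes" from the error message):

lbzcat latest-all.ttl.bz2 | head -c 820 | tail -c 500   # bytes 321-820 of the decompressed dump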

@joka921
Member

joka921 commented Feb 20, 2025

Thank you very much for that information. It seems that there is either a bug in the parser or an error in the input files,
which leads to the exception, and the exception sometimes leads to a deadlock instead of being reported.
Can you please paste the beginning of latest-all.ttl.bz2, e.g. via lbzcat latest-all.ttl.bz2 | head -n 30?
Best regards

@michaelbrunnbauer
Author

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wikibase: <http://wikiba.se/ontology#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix data: <https://www.wikidata.org/wiki/Special:EntityData/> .
@prefix s: <http://www.wikidata.org/entity/statement/> .
@prefix ref: <http://www.wikidata.org/reference/> .
@prefix v: <http://www.wikidata.org/value/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix wdtn: <http://www.wikidata.org/prop/direct-normalized/> .
@prefix p: <http://www.wikidata.org/prop/> .
@prefix ps: <http://www.wikidata.org/prop/statement/> .
@prefix psv: <http://www.wikidata.org/prop/statement/value/> .
@prefix psn: <http://www.wikidata.org/prop/statement/value-normalized/> .
@prefix pq: <http://www.wikidata.org/prop/qualifier/> .
@prefix pqv: <http://www.wikidata.org/prop/qualifier/value/> .
@prefix pqn: <http://www.wikidata.org/prop/qualifier/value-normalized/> .
@prefix pr: <http://www.wikidata.org/prop/reference/> .
@prefix prv: <http://www.wikidata.org/prop/reference/value/> .
@prefix prn: <http://www.wikidata.org/prop/reference/value-normalized/> .
@prefix wdno: <http://www.wikidata.org/prop/novalue/> .

@hannahbast
Member

@michaelbrunnbauer I just tried this myself and had the same problem.

@RobinTF I investigated and found that ad-freiburg/qlever#1807 broke the index build. Can you please have a look?

@michaelbrunnbauer While we are fixing this, you can just use one of the Docker images from two days ago or earlier. For example, adfreiburg/qlever:pr-1816 should work (edit the Qleverfile or call qlever index with the option --image adfreiburg/qlever:pr-1816).
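
Concretely, the one-off variant is just:

qlever index --image adfreiburg/qlever:pr-1816

To make the pin stick across commands, you can instead set IMAGE = adfreiburg/qlever:pr-1816 in the [runtime] section of the Qleverfile (assuming the standard Qleverfile layout).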

@RobinTF

RobinTF commented Feb 21, 2025

@joka921 You adjusted the code in ad-freiburg/qlever#1807 to ensure the first error is always the one getting reported. Is there a chance that change is involved here, a deadlock maybe? I just had a second look and didn't see anything out of the ordinary; the added parallelParser_.waitUntilFinished(); looks suspicious, but it works in the unit tests.

hannahbast added a commit to ad-freiburg/qlever that referenced this issue Feb 21, 2025
@michaelbrunnbauer
Author

michaelbrunnbauer commented Feb 21, 2025

I can confirm that with the option --image adfreiburg/qlever:pr-1816, the indexing actually starts.

Should I even try to let it finish with 512 GB of disk space?

@hannahbast
Member

@michaelbrunnbauer The total size of the index file for Wikidata will be around 430 GB in the end, which is very compact. During the index building, you will need more than that, so I doubt that 512 GB of disk space will be sufficient.

How about buying a larger disk? For example, a 2 TB NVMe SSD is really cheap these days, and even 4 TB or 8 TB are pretty affordable. 512 GB is really little.
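
If you do try anyway, it is worth keeping an eye on the free space while the build runs, e.g. with something like:

watch -n 60 df -h /home/qlever/qlever/wikidata   # the index directory from the steps above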
