qlever index for Wikidata doing nothing #139

Open · michaelbrunnbauer opened this issue Feb 20, 2025 · 9 comments

@michaelbrunnbauer

As indexing the olympics dataset works and as the Wikidata dump is much bigger than I thought (already taking up 25% of available disk space in compressed form), I suspect hardware requirements are not met.

Indexing just gets stuck from the start, though, and I can see no error messages.

-Output (top shows no CPU or I/O activity whatsoever):

Command: index

echo '{ "languages-internal": [], "prefixes-external": [""], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 5000000 }' > wikidata.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.wikidata adfreiburg/qlever -c 'ulimit -Sn 1048576; IndexBuilderMain -i wikidata -s wikidata.settings.json -f <(lbzcat -n 4 latest-all.ttl.bz2) -g - -F ttl -p true -f <(lbzcat -n 1 latest-lexemes.ttl.bz2) -g - -F ttl -f <(cat dcatap.nt) -g - -F nt --stxxl-memory 10G | tee wikidata.index-log.txt'

2025-02-20 13:23:49.889 - INFO: QLever IndexBuilder, compiled on Wed Feb 19 16:11:23 UTC 2025 using git hash caaf76
2025-02-20 13:23:49.890 - INFO: You specified "locale = en_US" and "ignore-punctuation = 1"
2025-02-20 13:23:49.890 - INFO: You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2025-02-20 13:23:49.890 - INFO: You specified "num-triples-per-batch = 5,000,000", choose a lower value if the index builder runs out of memory
2025-02-20 13:23:49.890 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2025-02-20 13:23:49.890 - INFO: Processing triples from 3 input streams ...
2025-02-20 13:23:49.891 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
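
For reference, besides watching top on the host, the container can also be inspected directly with standard Docker commands (nothing QLever-specific):

docker stats --no-stream qlever.index.wikidata   # one-shot CPU/memory/I/O counters for the container
docker top qlever.index.wikidata                 # processes running inside the container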

-Software:

Ubuntu 24.04.2 LTS
Python 3.12.3
qlever 0.5.18

-Hardware:

13th Gen Intel(R) Core(TM) i5-13500 with 64 GB RAM and a 512 GB NVMe SSD

-Steps to reproduce:

apt-get install build-essential python3.12-venv docker.io lbzip2 unzip
adduser qlever
usermod -aG docker qlever
su - qlever
python3 -m venv qlever
cd qlever
./bin/pip install qlever
mkdir wikidata
cd wikidata
../bin/qlever setup-config wikidata
../bin/qlever get-data
../bin/qlever index

@hannahbast
Member

@michaelbrunnbauer Can you please send the output of qlever system-info?

@michaelbrunnbauer
Author

Command: system-info

Show system information and Qleverfile

System Information

Version: 0.5.18 (qlever --version)
OS: Linux (Ubuntu 24.04.2 LTS)
Arch: x86_64
Host: ac
RAM: 62.6 GB total, 61.5 GB available
CPU: 14 Cores, 20 Threads @ 4.28 GHz
CWD: /home/qlever/qlever
Disk space in /dev/md2 @ / is (ext4): 300.80 GB free / 435.53 GB total
User and group on host: uid=1000(qlever) gid=1000(qlever) groups=1000(qlever),100(users),110(docker)
User and group in container: uid=1000(ubuntu) gid=1000(ubuntu) groups=1000(ubuntu)

Contents of Qleverfile

No Qleverfile found

@michaelbrunnbauer
Author

I sometimes get this instead of the hanging behaviour. Something very timing-sensitive must be going on?

Command: index

echo '{ "languages-internal": [], "prefixes-external": [""], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 5000000 }' > wikidata.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.wikidata adfreiburg/qlever -c 'ulimit -Sn 1048576; IndexBuilderMain -i wikidata -s wikidata.settings.json -f <(lbzcat -n 4 latest-all.ttl.bz2) -g - -F ttl -p true -f <(lbzcat -n 1 latest-lexemes.ttl.bz2) -g - -F ttl -f <(cat dcatap.nt) -g - -F nt --stxxl-memory 10G | tee wikidata.index-log.txt'

2025-02-20 16:29:17.368 - INFO: QLever IndexBuilder, compiled on Wed Feb 19 16:11:23 UTC 2025 using git hash caaf76
2025-02-20 16:29:17.368 - INFO: You specified "locale = en_US" and "ignore-punctuation = 1"
2025-02-20 16:29:17.368 - INFO: You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2025-02-20 16:29:17.368 - INFO: You specified "num-triples-per-batch = 5,000,000", choose a lower value if the index builder runs out of memory
2025-02-20 16:29:17.368 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2025-02-20 16:29:17.368 - INFO: Processing triples from 3 input streams ...
2025-02-20 16:29:17.369 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2025-02-20 16:29:18.323 - ERROR: Parse error at byte position 320: @prefix or @base directives need to be at the beginning of the file when using the parallel parser. Use '--parse-parallel false' if you can't guarantee this. If the reason for this error is that the input is a concatenation of Turtle files, each of which has the prefixes at the beginning, you should feed the files to QLever separately instead of concatenated
The next 500 bytes are:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wikibase: <http://wikiba.se/ontology#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
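
To inspect the region around the reported byte position yourself, something like this should work (the offsets just mirror the "position 320" and "next 500 bytes" from the error message):

lbzcat latest-all.ttl.bz2 | head -c 820 | tail -c 500   # bytes 321-820 of the decompressed dump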

@joka921
Member

joka921 commented Feb 20, 2025

Thank you very much for that information. It seems that there is either a bug in the parser or an error in the input files,
which leads to the exception, and the exception sometimes leads to a deadlock instead of being reported.
Can you please paste the beginning of latest-all.ttl.bz2, e.g. via lbzcat latest-all.ttl.bz2 | head -n 30?
Best regards

@michaelbrunnbauer
Author

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix wikibase: <http://wikiba.se/ontology#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix data: <https://www.wikidata.org/wiki/Special:EntityData/> .
@prefix s: <http://www.wikidata.org/entity/statement/> .
@prefix ref: <http://www.wikidata.org/reference/> .
@prefix v: <http://www.wikidata.org/value/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix wdtn: <http://www.wikidata.org/prop/direct-normalized/> .
@prefix p: <http://www.wikidata.org/prop/> .
@prefix ps: <http://www.wikidata.org/prop/statement/> .
@prefix psv: <http://www.wikidata.org/prop/statement/value/> .
@prefix psn: <http://www.wikidata.org/prop/statement/value-normalized/> .
@prefix pq: <http://www.wikidata.org/prop/qualifier/> .
@prefix pqv: <http://www.wikidata.org/prop/qualifier/value/> .
@prefix pqn: <http://www.wikidata.org/prop/qualifier/value-normalized/> .
@prefix pr: <http://www.wikidata.org/prop/reference/> .
@prefix prv: <http://www.wikidata.org/prop/reference/value/> .
@prefix prn: <http://www.wikidata.org/prop/reference/value-normalized/> .
@prefix wdno: <http://www.wikidata.org/prop/novalue/> .

@hannahbast
Member

@michaelbrunnbauer I just tried this myself and had the same problem.

@RobinTF I investigated and found that ad-freiburg/qlever#1807 broke the index build. Can you please have a look?

@michaelbrunnbauer While we are fixing this, you can just use one of the Docker images from two days ago or earlier. For example, adfreiburg/qlever:pr-1816 should work (edit the Qleverfile or call qlever index with the option --image adfreiburg/qlever:pr-1816).
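
Concretely, the one-off variant is just:

qlever index --image adfreiburg/qlever:pr-1816

To make the pin stick across commands, you can instead set IMAGE = adfreiburg/qlever:pr-1816 in the [runtime] section of the Qleverfile (assuming the standard Qleverfile layout).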

@RobinTF

RobinTF commented Feb 21, 2025

@joka921 You adjusted the code in ad-freiburg/qlever#1807 to ensure the first error is always the one getting reported. Is there a chance that change is involved here, a deadlock maybe? I just had a second look and didn't see anything out of the ordinary; the added parallelParser_.waitUntilFinished(); looks suspicious, but it works in the unit tests.

hannahbast added a commit to ad-freiburg/qlever that referenced this issue Feb 21, 2025
@michaelbrunnbauer
Author

michaelbrunnbauer commented Feb 21, 2025

I can confirm that with the option --image adfreiburg/qlever:pr-1816, the indexing actually starts.

Should I even try to let it finish with 512 GB of disk space?

@hannahbast
Member

@michaelbrunnbauer The total size of the index file for Wikidata will be around 430 GB in the end, which is very compact. During the index building, you will need more than that, so I doubt that 512 GB of disk space will be sufficient.

How about buying a larger disk? For example, a 2 TB NVMe SSD is really cheap these days, and even 4 TB or 8 TB are pretty affordable. 512 GB is really little.
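
If you do try anyway, it is worth keeping an eye on the free space while the build runs, e.g. with something like:

watch -n 60 df -h /home/qlever/qlever/wikidata   # the index directory from the steps above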
