ClueWeb22 #210
Excellent, thanks @heinrichreimer! A while back I requested that they include offset files to facilitate random lookups, and it looks like it made it into the final spec! This will make adding the datasets much easier, since we won't need to save zlib states and release our own checkpoint files.
Sean is correct. Each warc.gz file is compressed by record and has a
companion offset file. To get the HTML of a specific document, open the
appropriate .warc.offset file (can be determined from the ClueWeb
docid), fseek to find the byte offsets of the start/end of the document
(also determined from the ClueWeb docid), open the .warc.gz file, fseek
to the start of the document, read the bytes, and uncompress them.
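The lookup procedure above can be sketched in a few lines of Python. This is only an illustration of the approach, not ClueWeb22's official reader: the fixed-width ASCII offset encoding (`OFFSET_WIDTH`), the file paths, and the function name are assumptions to be checked against the actual spec and sample data.

```python
import gzip

OFFSET_WIDTH = 10  # assumption: each entry in the .warc.offset file is a fixed-width ASCII integer

def read_record(warc_gz_path, offset_path, record_index):
    """Fetch and decompress a single record from a per-record-gzipped WARC file."""
    # Read the start/end byte offsets of the record from the companion
    # .warc.offset file (offset i marks where record i begins).
    with open(offset_path, "rb") as offsets:
        offsets.seek(record_index * OFFSET_WIDTH)
        start = int(offsets.read(OFFSET_WIDTH))
        end = int(offsets.read(OFFSET_WIDTH))
    # Seek directly to the record in the .warc.gz file and read its bytes.
    with open(warc_gz_path, "rb") as warc:
        warc.seek(start)
        compressed = warc.read(end - start)
    # Each record is compressed independently, so it decompresses on its own.
    return gzip.decompress(compressed)
```

Because every record is a self-contained gzip member, this avoids streaming through the whole file or saving mid-stream zlib state, which is what made random access painful for the earlier ClueWebs.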
We can provide data samples if you need them.
We are trying to apply this or a similar architecture to all other types
of data in the dataset, so that everything can be accessed quickly given
a ClueWeb docid.
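As a sketch of the docid-driven access described above, here is one hypothetical way to resolve a ClueWeb22 docid to its files and record index. The docid shape assumed here (`clueweb22-<subdir>-<file>-<record>`, as in `clueweb22-en0000-00-00000`) and the directory layout are inferred from sample docids, not taken from the official spec, so treat both as assumptions.

```python
import re

# Assumed docid shape, e.g. "clueweb22-en0000-00-00000":
# language+stream subdirectory, two-digit file number, five-digit record number.
DOCID_RE = re.compile(r"^clueweb22-([a-z]{2}\d{4})-(\d{2})-(\d{5})$")

def docid_to_location(docid, root="ClueWeb22"):
    """Return (warc_gz_path, offset_path, record_index) for a docid."""
    match = DOCID_RE.match(docid)
    if match is None:
        raise ValueError(f"unrecognized ClueWeb22 docid: {docid!r}")
    subdir, file_num, record = match.groups()
    base = f"{root}/{subdir}/{subdir}-{file_num}"
    return f"{base}.warc.gz", f"{base}.warc.offset", int(record)
```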
Best,
Jamie
Thanks for explaining the file structure!
We're still in the process of requesting the data here. A sample would indeed be helpful for getting started. @heinrichreimer -- I've got a pretty busy couple of weeks coming up, would you be able to take a stab at the implementation?
Sure, I'll try my best. I guess most of the code can be "recycled" from ClueWeb12 anyway.
Awesome, thanks! The most challenging bit is doing lookups, but with the offset file that's included, this should be much easier. Feel free to reach out if you have problems/questions/etc. Thanks!
As ClueWeb22 also features language tags and is structured in a way that allows efficient filtering by language, I'll also include language-specific subsets.
Great, thanks. This is aligned with
As the categories are subsets of the larger ones, I've now also added "views" that can, for example, be used to just parse the plain text from the B category. The keys would be
Dataset Information:
ClueWeb22 is the newest in the Lemur Project's ClueWeb line of datasets that support research on information retrieval, natural language processing and related human language technologies. This new dataset is being developed by the Lemur Project with significant assistance and support from Microsoft Corporation.
The ClueWeb22 dataset has several novel characteristics compared with earlier ClueWeb datasets.
Authors: Arnold Overwijk, Chenyan Xiong (@xiongchenyan), Jamie Callan (@jamiecallan), Cameron VandenBerg, Xiao Lucy Liu
Links to Resources:
Dataset ID(s) & supported entities:
- `clueweb22/a`: 200M docs, queries, qrels, scoreddocs?
- `clueweb22/b`: 2B docs, queries?, qrels?, scoreddocs?
- `clueweb22/l`: 10B docs, queries?, qrels?, scoreddocs?

Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
- [ ] Dataset definition (in `ir_datasets/datasets/clueweb22.py`)
- [ ] Tests (in `tests/integration/clueweb22.py`)
- [ ] Metadata (generated using the `ir_datasets generate_metadata` command, should appear in `ir_datasets/etc/metadata.json`)
- [ ] Documentation (in `ir_datasets/etc/clueweb22.yaml`)
- [ ] Downloads (in `ir_datasets/etc/downloads.json`). Manual download required.
- [ ] Download verification action (in `.github/workflows/verify_downloads.yml`). Only one needed per `topid`.
- [ ] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in `downloads.json`.

Additional comments/concerns/ideas/etc.
The dataset is planned to be used for shared tasks in the near future.
I also personally think it is of very high value to have this in ir_datasets.
Open Questions
- Is `VDOM-Paragraph` the same as `VDOM-Passage` in the WARC headers?
- What does `?` mean in the inlink format anchor type description?