Dataset Generation
The warc-hadoop-indexer project also contains metadata extraction (MDX) tools that can be applied as follows.
Breaking the whole 1996-2013 collection into six chunks (at least 86,000 ARCs or WARCs per chunk), we run the WARCMDXGenerator over each chunk to create a set of MDX metadata objects, stored in a manner that allows further processing.
It works by running the same indexer over the resources as in the Solr-indexing workflow, but rather than sending the data to Solr, each record is converted to a simple map of properties and stored in map-reduce-friendly sequence files. Because the sequence files are sorted by the hash of the entity body, this provides a means for re-duplication of revisit records to be carried out.
Invocation looked like this:

```
hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-2.2.0-BETA-6-SNAPSHOT-job.jar uk.bl.wa.hadoop.mapreduce.mdx.WARCMDXGenerator -i mdx_arcs_b -o mdx-arcs-b
```

where `mdx_arcs_b` is a list of the ARC and WARC files to process. The resulting sequence files are deposited in `mdx-arcs-b`.
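Such an input list can be assembled, and the output spot-checked, directly from HDFS. A minimal sketch, assuming the source files live under a hypothetical `/data/collections/1996-2013` directory and that the job uses standard Hadoop part-file naming:

```
# Build the input list: one ARC/WARC path per line (the source
# directory here is hypothetical):
hadoop fs -ls -R /data/collections/1996-2013 \
  | grep -E '\.w?arc\.gz$' | awk '{print $NF}' > mdx_arcs_b

# After the job completes, decode a few output records; `hadoop fs -text`
# understands SequenceFiles and prints one key/value pair per line:
hadoop fs -text mdx-arcs-b/part-r-00000 | head -n 5
```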
The time taken to process the six chunks is indicated here (although note that other tasks were running on the cluster at various times, hence the variability).
| chunk | time | records | NULLs | Errors | HDFS bytes read | MDX seq. size (bytes) |
|---|---|---|---|---|---|---|
| a | 33hrs, 57mins, 1sec | 538,761,419 | 116,127,700 | 0 | 5,661,783,544,289 (5.66 TB) | 186,053,418,102 |
| b | 19hrs, 23mins, 6sec | 475,516,515 | 77,813,176 | 0 | 6,279,242,837,014 | 158,101,113,403 |
| c | 18hrs, 56mins, 11sec | 539,713,722 | 93,696,252 | 0 | 5,802,813,832,422 | 180,454,498,713 |
| d | 28hrs, 56mins, 11sec | 524,143,344 | 89,077,559 | 8 | 5,918,383,825,363 | 177,392,768,820 |
| e | 35hrs, 26mins, 40sec | 501,582,602 | 101,396,811 | 1 | 6,505,417,693,908 | 180,565,740,110 |
| f | 72hrs, 19mins, 35sec | 1,353,142,719 | 332,045,791 | 14 | 29,129,360,605,968 | 462,439,132,652 |
So the MDX output is roughly 2-3% of the size of the input data.
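That ratio can be checked with a quick size comparison (a sketch using standard `hadoop fs` options, against the chunk b paths from the run above):

```
# Total size of one chunk's MDX output:
hadoop fs -du -s mdx-arcs-b
# 158,101,113,403 bytes of MDX against the 6,279,242,837,014 HDFS bytes
# read for chunk b gives roughly 2.5%.
```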
The chunks are then merged into one set, sorted by hash, with revisit records re-duplicated. For chunks e and f:

```
hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-2.2.0-BETA-6-SNAPSHOT-job.jar uk.bl.wa.hadoop.mapreduce.mdx.MDXSeqMerger -i mdx-arcs-ef-parts -o mdx-arcs-ef -r 50
```
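Here `mdx-arcs-ef-parts` names the merger's input. A sketch of assembling it, assuming the merger's `-i` option, like the generator's, takes a list of input paths (an assumption; the chunk output directory names are also illustrative):

```
# List the MDX sequence files from the e and f chunk outputs into a
# single input list for the merger:
hadoop fs -ls 'mdx-arcs-e/part-*' 'mdx-arcs-f/part-*' \
  | grep 'part-' | awk '{print $NF}' > mdx-arcs-ef-parts
```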
Note that when merging just chunks e and f (the only ones containing WARCs), there were 162,103,566 revisits out of 1,854,725,336 records, but 2,482,073 of them (about 1.5%) were unresolved.
For the full merge across all six chunks:

```
hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-2.2.0-BETA-6-SNAPSHOT-job.jar uk.bl.wa.hadoop.mapreduce.mdx.MDXSeqMerger -i mdx-a-f -o mdx-merged -r 50
```

This took 11hrs, 50mins, 39sec, and reported the following job counters:
| counter | map | reduce | total |
|---|---|---|---|
| NUM_RESOLVED_REVISITS | 0 | 159,621,493 | 159,621,493 |
| NUM_REVISITS | 0 | 162,103,566 | 162,103,566 |
| NUM_RECORDS | 0 | 3,932,860,344 | 3,932,860,344 |
| NUM_UNRESOLVED_REVISITS | 0 | 2,482,073 | 2,482,073 |
| FILE_BYTES_READ | 1,379,467,889,493 | 2,940,283,760,167 | 4,319,751,649,660 |
| HDFS_BYTES_READ | 1,348,985,804,233 | 0 | 1,348,985,804,233 |
| FILE_BYTES_WRITTEN | 2,717,555,784,717 | 2,940,286,352,547 | 5,657,842,137,264 |
| HDFS_BYTES_WRITTEN | 0 | 1,354,124,391,916 | 1,354,124,391,916 |
At this point, we can run other profilers and samplers over the merged, re-duplicated MDX files and generate a range of datasets for researchers.
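For example, a crude random sample can be pulled straight off the merged set for inspection or for seeding a derived dataset. A sketch, assuming the hash-keyed property-map records described above (note this streams the full output, so it is slow but memory-safe):

```
# Stream decoded MDX records from the merged set and reservoir-sample
# 1,000 of them for downstream profiling:
hadoop fs -text 'mdx-merged/part-*' | shuf -n 1000 > mdx-sample.tsv
```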