
The warc-hadoop-indexer project also contains metadata extraction (MDX) tools that can be applied as follows.

Step 1 - create metadata extractions for every resource

Breaking the whole 1996-2013 collection into six chunks (at least 86,000 ARCs or WARCs per chunk), we then run the WARCMDXGenerator over each chunk to create a set of MDX metadata objects, stored in a form that allows further processing.

It works by running the same indexer over the resources as we use in the Solr-indexing workflow, but rather than sending the data to Solr, each record is converted to a simple map of properties and stored in map-reduce-friendly sequence files. Because the sequence files are sorted by the hash of the entity body, this provides a means for re-duplication to be carried out.

Invocation looked like this:

hadoop jar warc-hadoop-indexer/target/warc-hadoop-indexer-2.2.0-BETA-6-SNAPSHOT-job.jar uk.bl.wa.hadoop.mapreduce.mdx.WARCMDXGenerator -i mdx_arcs_b -o mdx-arcs-b

where mdx_arcs_b is a list of the ARC and WARC files to process. The sequence files are then deposited in mdx-arcs-b.
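For anyone wanting to inspect the output directly, here is a minimal sketch of reading one of the deposited sequence files with the standard Hadoop API. The class name is made up for illustration, and it assumes both keys and values are stored as Text, with the key holding the hash and the value the serialised property map; the actual writable types used by WARCMDXGenerator may differ.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical helper: dumps key/value pairs from one MDX sequence file,
// e.g. a part file under mdx-arcs-b.
public class MDXSequenceFileDump {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            // Assumed types: key = hash, value = serialised MDX property map.
            Text key = new Text();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        }
    }
}
```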

| chunk | time | records | NULLs | Errors | HDFS bytes read |
|-------|------|---------|-------|--------|-----------------|
| a | 33hrs, 57mins, 1sec | 538,761,419 | 116,127,700 | 0 | 5,661,783,544,289 (5.66 TB) |
| b | | | | | |
| c | | | | | |
| d | 28hrs, 56mins, 11sec | 524,143,344 | 89,077,559 | 8 | 5,918,383,825,363 |
| e | 35hrs, 26mins, 40sec | 501,582,602 | 101,396,811 | 1 | 6,505,417,693,908 |
| f | 72hrs, 19mins, 35sec | 1,353,142,719 | 332,045,791 | 14 | 29,129,360,605,968 |

Step 2 - merge the entire set of MDX sequence files

This step merges all of the chunks into one set, sorted by hash, and re-duplicates the revisit records.

Note that when merging just chunks e and f (which are the only ones containing WARCs), there were 162,103,566 revisits out of 1,854,725,336 records, but 2,482,073 were unresolved!
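To make the re-duplication step concrete, here is a minimal sketch of a reducer that resolves revisit records against a full record sharing the same payload hash. This is not the actual class used by the merge job; the Text key/value types, the record-type marker and the merge behaviour are assumptions for illustration only.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Groups MDX records by payload hash and resolves revisit records against
// the first full record seen for that hash.
public class ReduplicatingReducer extends Reducer<Text, Text, Text, Text> {

    // Hypothetical marker: the real MDX records carry a record-type field.
    private static boolean isRevisit(String record) {
        return record.contains("\"recordType\":\"revisit\"");
    }

    @Override
    protected void reduce(Text hash, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String original = null;
        List<String> revisits = new ArrayList<>();
        for (Text value : values) {
            String record = value.toString();
            if (isRevisit(record)) {
                revisits.add(record);
            } else {
                if (original == null) {
                    original = record;
                }
                // Full records pass through unchanged.
                context.write(hash, value);
            }
        }
        for (String revisit : revisits) {
            if (original != null) {
                // Hypothetical merge: in practice the properties extracted from
                // the original capture are copied into the revisit's property map.
                context.write(hash, new Text(revisit + "\t<resolved>"));
            } else {
                // No full record with this hash: the revisit remains unresolved,
                // as with the 2,482,073 unresolved revisits noted above.
                context.write(hash, new Text(revisit));
            }
        }
    }
}
```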

Step 3 - run statistics and sample generators over the merged MDX sequence files

At this point, we can run other profilers and samplers over the merged, re-duplicated MDX files and generate a range of datasets for researchers.
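As one example of the kind of sampler this could mean, the sketch below draws a uniform random (reservoir) sample of records from a merged MDX sequence file. The class name, the Text key/value types and the command-line arguments are assumptions, not part of the project.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Draws a reservoir sample of MDX records from one sequence file, so a
// researcher can inspect a manageable subset of the merged output.
public class MDXSampler {
    public static void main(String[] args) throws IOException {
        Path path = new Path(args[0]);
        int sampleSize = Integer.parseInt(args[1]);
        List<String> reservoir = new ArrayList<>(sampleSize);
        Random random = new Random();
        long seen = 0;
        try (SequenceFile.Reader reader = new SequenceFile.Reader(new Configuration(),
                SequenceFile.Reader.file(path))) {
            Text key = new Text();
            Text value = new Text();
            while (reader.next(key, value)) {
                seen++;
                if (reservoir.size() < sampleSize) {
                    reservoir.add(value.toString());
                } else {
                    // Replace an existing element with probability sampleSize / seen.
                    long j = (long) (random.nextDouble() * seen);
                    if (j < sampleSize) {
                        reservoir.set((int) j, value.toString());
                    }
                }
            }
        }
        reservoir.forEach(System.out::println);
    }
}
```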