Improve the JSONL output #307

Open
12 tasks
anjackson opened this issue Mar 27, 2023 · 0 comments
anjackson commented Mar 27, 2023

Following #299

  • Name fields so the common ones are consistent with the CDXJ specification, i.e. make this an extension of CDXJ.
  • Document usage, noting that dynamic content can't be easily extracted because it's dynamic.
  • Store content_first_bytes without spaces? Store content_ffb as well or instead of content_first_bytes?
  • Consider allowing payload inclusion if small, e.g. smaller HTML files or initial binary chunk.
  • Consider extending the API so consumers can use the reference (name/offset) to get the payload InputStream.
  • Include warcinfo records in the JSONL output (currently skipped by the windex.extract).
  • Should boiler_pipe extraction be used?
  • Should extracted links be normalised?
  • Should image and/or PDF analysis be enabled?
  • Should the original payload be included if small enough? Or just for text?
  • Should there be an option to only output the term frequency or collocation statistics of the text, so we can do this for everything? Perhaps that's better as a post-processing step?
  • Both the Tika configuration extract_all_metadata and the experimental WARCStats code show there are lots of other metadata fields that might be of interest. These could be stored in some kind of hash, but note that Parquet/Avro schema reflection does not support hashes directly. The MementoRecord class illustrates that the Memento bean could be implemented on top of an extensible hash-map, which might make dynamic Parquet schema generation possible.
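
To make the last point a bit more concrete, here is a rough Python sketch of the "extensible hash-map plus dynamic schema" idea. It is not the MementoRecord implementation: `CORE_FIELDS`, `infer_columns` and `to_rows` are hypothetical names made up for illustration, and the "schema" is just an ordered list of column names rather than a real Parquet or Avro schema.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical set of fixed fields; the real core field names are still to be decided.
CORE_FIELDS = ("url", "timestamp", "digest")


def infer_columns(records: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the core fields followed by the sorted union of all other keys seen."""
    extras = set()
    for record in records:
        extras.update(key for key in record if key not in CORE_FIELDS)
    return list(CORE_FIELDS) + sorted(extras)


def to_rows(records: Iterable[Dict[str, Any]], columns: List[str]) -> List[List[Any]]:
    """Flatten each map-backed record into a fixed-width row, using None for missing keys."""
    return [[record.get(column) for column in columns] for record in records]


records = [
    {"url": "a", "timestamp": "1", "digest": "d", "HTTP-Server": "Apache"},
    {"url": "b", "timestamp": "2", "digest": "e", "HTTP-ETag": "x"},
]
columns = infer_columns(records)
rows = to_rows(records, columns)
```

The trade-off is that the column list is only known after a pass over the records, so this probably suits a post-processing step better than streaming output.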

Example WARC Stats code output:

INFO  WARCStatsToolIntegrationTest - {"timestamp":"20080430204830","url":"http:\/\/www.archive.org\/services\/collection-rss.php","source-file":"hdfs:\/\/localhost:58536\/user\/anj\/inputs\/IAH-20080430204825-00000-blackbook-truncated.warc.gz","content-type":"application\/http; msgtype=response","content-length":"50832","length":"50831","source-offset":"18283","HEADER-reader-identifier":"IAH-20080430204825-00000-blackbook-truncated.warc.gz","HEADER-WARC-Payload-Digest":"sha1:JXXJNHJX4GEM44C4NOM3RJWKMKVBIGHF","HEADER-WARC-IP-Address":"207.241.229.39","HEADER-absolute-offset":"18283","HEADER-WARC-Target-URI":"http:\/\/www.archive.org\/services\/collection-rss.php","HEADER-WARC-Date":"2008-04-30T20:48:30Z","HEADER-Content-Length":"50832","HEADER-WARC-Record-ID":"<urn:uuid:8399ab93-1fee-4787-aa60-0f1ce83cb885>","HEADER-WARC-Type":"response","HEADER-Content-Type":"application\/http; msgtype=response","record-type":"warc.response","digest":"sha1:JXXJNHJX4GEM44C4NOM3RJWKMKVBIGHF","status-code":"200","HTTP-Date":"Wed, 30 Apr 2008 20:48:29 GMT","HTTP-Server":"Apache\/2.0.54 (Ubuntu) PHP\/5.0.5-2ubuntu1.4 mod_ssl\/2.0.54 OpenSSL\/0.9.7g","HTTP-X-Powered-By":"PHP\/5.0.5-2ubuntu1.4","HTTP-Connection":"close","HTTP-Content-Type":"text\/xml","host":"www.archive.org","year":"2008"}
INFO  WARCStatsToolIntegrationTest - {"timestamp":"20080430204825","url":"http:\/\/www.archive.org\/robots.txt","source-file":"hdfs:\/\/localhost:58536\/user\/anj\/inputs\/IAH-20080430204825-00000-blackbook-truncated.warc.gz","content-type":"application\/http; msgtype=response","content-length":"782","length":"781","source-offset":"707","HEADER-reader-identifier":"IAH-20080430204825-00000-blackbook-truncated.warc.gz","HEADER-WARC-Payload-Digest":"sha1:SUCGMUVXDKVB5CS2NL4R4JABNX7K466U","HEADER-WARC-IP-Address":"207.241.229.39","HEADER-absolute-offset":"707","HEADER-WARC-Target-URI":"http:\/\/www.archive.org\/robots.txt","HEADER-WARC-Date":"2008-04-30T20:48:25Z","HEADER-Content-Length":"782","HEADER-WARC-Record-ID":"<urn:uuid:e7c9eff8-f5bc-4aeb-b3d2-9d3df99afb30>","HEADER-WARC-Type":"response","HEADER-Content-Type":"application\/http; msgtype=response","record-type":"warc.response","digest":"sha1:SUCGMUVXDKVB5CS2NL4R4JABNX7K466U","status-code":"200","HTTP-Date":"Wed, 30 Apr 2008 20:48:24 GMT","HTTP-Server":"Apache\/2.0.54 (Ubuntu) PHP\/5.0.5-2ubuntu1.4 mod_ssl\/2.0.54 OpenSSL\/0.9.7g","HTTP-Last-Modified":"Sat, 02 Feb 2008 19:40:44 GMT","HTTP-ETag":"\"47c3-1d3-11134700\"","HTTP-Accept-Ranges":"bytes","HTTP-Content-Length":"467","HTTP-Connection":"close","HTTP-Content-Type":"text\/plain; charset=UTF-8","host":"www.archive.org","year":"2008"}
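
As a sketch of what the field renaming could look like as a post-processing step, the Python below strips the log prefix from lines like the ones above, parses the JSON, and renames a few keys. The target names (`status`, `offset`, `filename`) are my assumption about commonly used CDXJ field names and should be checked against the CDXJ specification; `RENAMES`, `parse_log_line` and `rename_fields` are hypothetical helpers, not part of the existing code.

```python
import json
from typing import Any, Dict

# Assumed mapping from the WARCStats field names to CDXJ-style names.
# Verify the right-hand side against the CDXJ specification before relying on it.
RENAMES = {
    "status-code": "status",
    "source-offset": "offset",
    "source-file": "filename",
}


def parse_log_line(line: str) -> Dict[str, Any]:
    """Drop everything before the first '{' (the logger prefix) and parse the JSON."""
    return json.loads(line[line.index("{"):])


def rename_fields(record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename known keys; keys without a mapping pass through unchanged."""
    return {RENAMES.get(key, key): value for key, value in record.items()}


def to_jsonl_line(line: str) -> str:
    """Convert one log line into one JSONL line with renamed fields."""
    return json.dumps(rename_fields(parse_log_line(line)))
```

Note that `content-type` is deliberately left unmapped: in the sample output it holds the WARC record type (`application/http; msgtype=response`), not the payload media type, so mapping it to a CDXJ media-type field would be misleading.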