The milestones for releases are held as GitHub issues now.
- Add the Wikipedia-Miner as an entity extractor?
- Apache Stanbol may be a useful base:
- Add in enhanced Tika features as used in the format profiler (e.g. DOC generator app, PDF issues, no recursion, etc.)
- Explicitly handle crawl referrers somehow? (i.e. do we want SolrFields.REFERRER?)
- Canonicalize outlinks and hosts?
- Deduplicating Solr indexer: key on content hash and populate Solr once per hash, with multiple crawl dates? That requires URL+content hash. Also hash only and cross reference? Same as ?
- NOTE that this only works when all sources are processed in the same Hadoop job.
- It is therefore not really scalable, nor easily scalable in Solr, as grouping does not work across shards.
- Facets like log(size), or small, medium, large, to boost longer texts
- Support a publication_date? Or an interval: published_after, published_before?
- BBC Use: (DONE)
- Other publisher-based examples may be found here:
- PDF, can use: creation date as lower bound.
- Full temporal realignment. Using crawl date, embedded date, and relationships, to rebuild the temporal history of a web archive.
- See also
- Add Welsh and other language or dialect detection?
- Extend license extraction support. Ensure target is specified, and support other forms of embedded metadata.
- See also this general metadata extraction process:
- Add error code as facet for large-scale bug analysis.
- Add rounded log(error count) or similar to track format problems.
- Switch to Nanite/Extended Tika to extract:
- Software and format versions, integrate DROID, etc.
- Published, Company, Keywords? Subject? Last Modified?
- Higher quality XMP metadata?
- Require >= 3-grams for ssdeep to reduce hits/false positives.
- Deadness (Active, Empty, Gone)
- Compression ratio/entropy or other info content measure? Actually very difficult as decompression happens inline and so re-compression would be needed!
- Fuzzy hashes of the text.
- Events integration with Solr.
- Image analysis, sizes, pixel thumb to spot rescaled versions, sift features plus fuzzy hash?
- Create reduced size image, and run clever algorithms on it...
- Interesting regions
- Faces, and missing faces, ones that used to re-appear and are now gone? Could record ratios of key points, or just the number of faces. Would be fun to play with. See which gives keypoints, or
- Also, look for emotional connections
- Index by histogram entropy?
- Similarly, audio fingerprints etc.
- Named entities or other NLP features, based on text from Tika.
- If that worked, one could train Eigenfaces (e.g. using proper nouns associated with images) and then use that for matching, perhaps?
- TEI aware indexing? Annotated text with grammatical details.
- Related:
- Integration with DBPedia Spotlight for concept identification:
- Hyphenation for syllable counting, e.g. sonnet spotting
- Detect text and even handwriting in images (
- By dominant colour (
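The deduplicating indexer idea in the list above (key on content hash, index once per hash, keep multiple crawl dates) can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the class and field names are hypothetical, the hash is treated as an opaque string, and a plain map stands in for Solr. Because all captures have to pass through the same map, it also mirrors the limitation noted above that every source must be processed in the same job.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: one entry per content hash, each holding every URL
// and crawl date at which that content was seen. Not the project's indexer.
class DedupIndexSketch {

    // Hypothetical document shape; these are not the project's Solr fields.
    static final class Doc {
        final String contentHash;
        final TreeSet<String> urls = new TreeSet<>();
        final TreeSet<String> crawlDates = new TreeSet<>();

        Doc(String contentHash) {
            this.contentHash = contentHash;
        }
    }

    private final Map<String, Doc> byHash = new LinkedHashMap<>();

    // Record one capture. Returns true if the hash has not been seen before,
    // i.e. the full document would need to be indexed; false means only the
    // URL and crawl-date sets were extended.
    boolean add(String contentHash, String url, String crawlDate) {
        Doc doc = byHash.get(contentHash);
        boolean isNew = (doc == null);
        if (isNew) {
            doc = new Doc(contentHash);
            byHash.put(contentHash, doc);
        }
        doc.urls.add(url);
        doc.crawlDates.add(crawlDate);
        return isNew;
    }

    Doc get(String contentHash) {
        return byHash.get(contentHash);
    }

    int size() {
        return byHash.size();
    }
}
```

Keying on the hash alone and cross-referencing URLs inside the entry is only one of the options raised above; keying on URL+content hash would instead use a composite key and produce one entry per distinct URL.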
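Face detection option 2:

    import java.util.List;
    import detection.Detector;

    // Load a Haar cascade definition and run face detection on an image file.
    String fileName = "yourfile.jpg";
    Detector detector = Detector.create("haarcascade_frontalface_default.xml");
    // The result is a list of detected face regions.
    List res = detector.getFaces(fileName, 1.2f, 1.1f, .05f, 2, true);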
- What techniques are there for detecting similar images at large scale?
Two approaches: N-Gram Matching and Fuzzy Search. Both seem to work rather well, but the overall goal is to see which performs better at scale.
NOTE that the N-Gram approach may also be useful for spotting similar binaries.
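As a rough illustration of the N-Gram Matching approach, the sketch below splits two signature strings into character n-grams and counts how many they share. It assumes the strings being compared are fuzzy-hash signatures such as the ssdeep hashes mentioned earlier; the class and method names are hypothetical, and a real deployment would more likely do the n-gram tokenisation inside the search index than in application code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of n-gram matching between two signature strings.
class NGramMatchSketch {

    // Collect the distinct character n-grams of a string.
    // Strings shorter than n yield an empty set.
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Number of distinct n-grams the two strings have in common.
    static int sharedNGrams(String a, String b, int n) {
        Set<String> shared = ngrams(a, n);
        shared.retainAll(ngrams(b, n));
        return shared.size();
    }
}
```

Calling `NGramMatchSketch.sharedNGrams(a, b, 3)` and treating any non-zero result as a candidate match corresponds to the "require >= 3-grams" idea above: shorter n-grams match more often by chance, so a larger `n` trades recall for fewer false positives.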
Stanford Named Entity Recognizer (NER) appears to be a sound option, although using it would mean relicensing this project as GPL. It has multiple classes of recogniser:
- 3 class: Location, Person, Organization
- 4 class: Location, Person, Organization, Misc
- 7 class: Time, Location, Organization, Person, Money, Percent, Date