Release milestones are now tracked as GitHub issues.
- Add the Wikipedia-Miner as an entity extractor?
- Apache Stanbol may be a useful base:
- Add the enhanced Tika features used in the format profiler (e.g. DOC generator app, PDF issues, no recursion, etc.)
- Explicitly handle crawl referrers somehow? (i.e. do we want SolrFields.REFERRER?)
- Canonicalize outlinks and hosts?
- Deduplicating Solr indexer: key on content hash, populate Solr once per hash, with multiple crawl dates (see the sketch below)? That requires URL+content hash. Or hash only, and cross-reference? Same as ?
- NOTE that this only works when all sources are processed in the same Hadoop job.
- It is therefore not really scalable, nor easily done in Solr, as grouping does not work across shards.
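A minimal SolrJ sketch of the per-hash document, with illustrative field names (id, url, crawl_date) rather than the project's actual schema:

```java
import java.util.List;

import org.apache.solr.common.SolrInputDocument;

// One Solr document per content hash, carrying every crawl date as a
// multi-valued field. Field names here are illustrative only.
public class DedupByHash {
    public static SolrInputDocument toDoc(String contentHash, String url,
                                          List<String> crawlDates) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", contentHash);      // key on the hash, not the URL
        doc.addField("url", url);
        for (String date : crawlDates) {
            doc.addField("crawl_date", date); // one entry per capture
        }
        return doc;
    }
}
```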
- Facets like log(size), or small/medium/large, to boost longer texts.
- Support a publication_date? Or an interval: published_after, published_before? (See the query sketch after this list.)
- BBC Use: (DONE)
- Other publisher-based examples may be found here: http://en.wikipedia.org/wiki/User:Rjwilmsi/CiteCompletion
- For PDFs, the creation date can be used as a lower bound.
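One way to model the interval is two date fields plus an overlap filter; a SolrJ sketch, with published_after/published_before as assumed field names:

```java
import org.apache.solr.client.solrj.SolrQuery;

// Find documents whose publication interval could contain a target date,
// i.e. published_after <= date <= published_before (assumed field names).
public class PublicationDateQuery {
    public static SolrQuery possiblyPublishedAt(String isoDate) {
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("published_after:[* TO " + isoDate + "]");
        q.addFilterQuery("published_before:[" + isoDate + " TO *]");
        return q;
    }
}
```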
- Full temporal realignment. Using crawl date, embedded date, and relationships, to rebuild the temporal history of a web archive.
- See also http://ws-dl.blogspot.co.uk/2013/04/2013-04-19-carbon-dating-web.html
- Add Welsh and other language or dialect detection?
- Extend license extraction support. Ensure target is specified, and support other forms of embedded metadata.
- See also this general metadata extraction process: http://webdatacommons.org/#toc4
- Add error code as facet for large-scale bug analysis.
- Add rounded log(error count) or similar to track format problems.
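Both this and the log(size) facet above come down to the same bucketing trick; a minimal sketch:

```java
// Bucket a size or error count into a coarse facet value by rounding its
// base-10 logarithm, so 0, 1-9, 10-99, 100-999, ... each form one bucket.
public class LogBucket {
    public static String bucket(long n) {
        if (n <= 0) return "0";
        int exponent = (int) Math.floor(Math.log10((double) n));
        return "1e" + exponent; // e.g. 5000 bytes -> "1e3"
    }
}
```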
- Switch to Nanite/Extended Tika to extract:
- Software and format versions, integrate DROID, etc.
- Published, Company, Keywords? Subject? Last Modified?
- Higher quality XMP metadata?
- Require >= 3-grams for ssdeep to reduce hits/false positives.
- Deadness (Active, Empty, Gone)
- Compression ratio/entropy or other info content measure? Actually very difficult as decompression happens inline and so re-compression would be needed!
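Entropy sidesteps that, though: a byte histogram can be accumulated inline as the decompressed payload streams past, with no re-compression pass. A sketch:

```java
// Shannon entropy in bits per byte, computed from a 256-slot byte histogram
// filled while the decompressed payload streams past.
public class ByteEntropy {
    public static double bitsPerByte(long[] histogram, long total) {
        double h = 0.0;
        for (long count : histogram) {
            if (count == 0) continue;
            double p = (double) count / total;
            h -= p * (Math.log(p) / Math.log(2.0));
        }
        return h; // 0.0 (constant bytes) up to 8.0 (uniform/random)
    }
}
```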
- Fuzzy hashes of the text.
- Events integration with Solr.
- Image analysis: sizes, pixel thumbnails to spot rescaled versions, SIFT features plus fuzzy hashing?
- Create a reduced-size image, and run clever algorithms on it...
- Interesting regions http://news.ycombinator.com/item?id=4968364
- Faces, and missing faces: ones that used to appear and are now gone? Could record ratios of key points, or just the number of faces. Would be fun to play with. See http://www.openimaj.org/tutorial/finding-faces.html which gives keypoints, or https://code.google.com/p/jviolajones/
- Also, look for emotional connections http://discontents.com.au/archives-of-emotion/
- Index by histogram entropy? http://labs.cooperhewitt.org/2013/default-sort-or-what-would-shannon-do/
- Similarly, audio fingerprints etc.
- Named entities or other NLP features, based on text from Tika.
- If that worked, one could train Eigenfaces (e.g. faint.sf.net) using proper nouns associated with images and then use that for matching, perhaps?
- TEI-aware indexing? Annotated text with grammatical details.
- Related: http://hassetukda.wordpress.com/2013/03/25/automatic-evaluation-recommendations-report/
- Integration with DBPedia Spotlight for concept identification:
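Spotlight exposes a REST annotate endpoint; a minimal client sketch, with the endpoint URL and confidence parameter taken from the public demo service (verify against the deployment actually used):

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Call the DBpedia Spotlight annotate endpoint and return the JSON response.
public class SpotlightClient {
    public static String annotate(String text) throws Exception {
        String url = "https://api.dbpedia-spotlight.org/en/annotate"
                + "?text=" + URLEncoder.encode(text, StandardCharsets.UTF_8)
                + "&confidence=0.5";
        HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .build();
        return HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString())
                .body();
    }
}
```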
- Hyphenation for syllable counting, e.g. sonnet spotting http://sourceforge.net/projects/texhyphj/
- Detect text and even handwriting in images (http://manuscripttranscription.blogspot.co.uk/2013/02/detecting-handwriting-in-ocr-text.html)
- By dominant colour (http://stephenslighthouse.com/2013/02/22/friday-fun-the-two-ronnies-the-confusing-library/)
Face detection option 2:
https://code.google.com/p/jviolajones/
import java.awt.Rectangle;
import java.util.List;
import detection.Detector;

// Load a Haar cascade and return face bounding boxes; the numeric arguments
// are the scale/step/grouping parameters from the project's own example.
String fileName = "yourfile.jpg";
Detector detector = Detector.create("haarcascade_frontalface_default.xml");
List<Rectangle> res = detector.getFaces(fileName, 1.2f, 1.1f, .05f, 2, true);
- [What techniques are there for detecting similar images at large scale?](http://qanda.digipres.org/58/what-techniques-there-detecting-similar-images-large-scale)
- LIRE:
- http://pastebin.com/Pj9d8jt5 ImagePHash.java
- http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
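The "Looks Like It" article above describes the average hash (aHash); a minimal Java sketch of that approach:

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// 64-bit average hash (aHash): shrink to 8x8 greyscale, then set one bit per
// pixel that is at or above the mean. Compare two hashes by Hamming distance,
// Long.bitCount(a ^ b); a small distance suggests visually similar images.
public class AverageHash {
    public static long hash(BufferedImage src) {
        BufferedImage small = new BufferedImage(8, 8, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        g.drawImage(src, 0, 0, 8, 8, null);
        g.dispose();

        int[] px = new int[64];
        long sum = 0;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                px[y * 8 + x] = small.getRaster().getSample(x, y, 0);
                sum += px[y * 8 + x];
            }
        }
        long mean = sum / 64, bits = 0;
        for (int i = 0; i < 64; i++) {
            if (px[i] >= mean) bits |= 1L << i;
        }
        return bits;
    }
}
```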
Two approaches: N-Gram Matching and Fuzzy Search. Both seem to work rather well, but the overall goal is to see which performs better at scale.
NOTE that the N-Gram approach may also be useful for spotting similar binaries.
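For the N-Gram side, the usual trick is to index overlapping character n-grams of the ssdeep block strings, so candidate matches share at least one gram; requiring 3-grams (as noted earlier) trims false positives. A sketch of the tokenization, with the blocksize prefix as an assumed refinement to keep different-scale hashes apart:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Tokenize an ssdeep hash ("blocksize:hash1:hash2") into overlapping 3-gram
// terms for an inverted index.
public class SsdeepNGrams {
    public static Set<String> toTerms(String ssdeepHash) {
        Set<String> terms = new LinkedHashSet<>();
        String[] parts = ssdeepHash.split(":");
        for (int p = 1; p < parts.length; p++) { // parts[0] is the blocksize
            String block = parts[p];
            for (int i = 0; i + 3 <= block.length(); i++) {
                terms.add(parts[0] + "/" + block.substring(i, i + 3));
            }
        }
        return terms;
    }
}
```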
Stanford Named Entity Recognizer (NER) appears to be a sound option, although using it would mean relicensing this project under the GPL. It has multiple classes of recogniser:
- 3-class: Location, Person, Organization
- 4-class: Location, Person, Organization, Misc
- 7-class: Time, Location, Organization, Person, Money, Percent, Date
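Basic usage is short once a model is loaded; a sketch using the 3-class model path from the standard Stanford NER distribution (the GPL caveat above applies):

```java
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

// Tag text with the 3-class model (Location/Person/Organization).
public class NerSketch {
    public static void main(String[] args) throws Exception {
        AbstractSequenceClassifier<CoreLabel> ner = CRFClassifier.getClassifier(
                "classifiers/english.all.3class.distsim.crf.ser.gz");
        // Prints the text with inline XML-style tags, e.g. <PERSON>...</PERSON>:
        System.out.println(ner.classifyWithInlineXML(
                "Tim Berners-Lee invented the Web at CERN in Geneva."));
    }
}
```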