The Extensible Web Retrieval Toolkit (eWRT) is a modular open-source Python API which
- offers a unified interface for retrieving social data from Web sources such as Delicious, Flickr, Yahoo! and Wikipedia,
- includes various helper classes for effective caching and data management,
- provides components for low-level natural language processing functionalities such as language detection, phonetic string similarity measures, and methods for string normalization.

Adjust eWRT/src/siteconfig.py-sample to your settings and save it to ~/.eWRT/siteconfig.py (user-specific settings) and/or /etc/eWRT/siteconfig.py (system-wide settings).
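
The copy step can also be scripted. The following is a minimal sketch, not part of eWRT: the helper name `install_siteconfig` is hypothetical, and it assumes you run it from the directory that contains the eWRT checkout. It leaves an existing siteconfig.py untouched so that local edits are not overwritten.

```python
import shutil
from pathlib import Path


def install_siteconfig(sample: Path, target_dir: Path) -> Path:
    """Copy the sample configuration to <target_dir>/siteconfig.py.

    Hypothetical helper for illustration; eWRT does not ship it.
    An existing siteconfig.py is kept as-is.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "siteconfig.py"
    if not target.exists():
        shutil.copyfile(sample, target)
    return target


# User-specific installation (uncomment to run against a real checkout):
# install_siteconfig(Path("eWRT/src/siteconfig.py-sample"), Path.home() / ".eWRT")
```

After copying, the settings in the new file still have to be edited by hand to match your environment.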

eWRT.access - file, Web and database access
- db - database access
- file - file access
- http - access web resources supporting authentication (basic, digest), compression, etc.
- javascript - control Firefox to extract AJAX pages

eWRT.input - input and cleanup modules
- clean - clean and normalize text phrases
- conv - convert doc, html and pdf files to text documents; convert XCL to rdf
- corpus - input readers for the Reuters and BBC corpus
- csv - read and analyze csv files
- stock - stock quotes

eWRT.ontology - tools for comparing, evaluating and visualizing ontologies
- compare - compare ontology nodes, relations, and relation types
- eval - determine the coherence of ontology nodes
- visualize - visualize ontologies

eWRT.stat - the eWRT statistics packages
- coherence - compute the coherence between terms (Dice, PMI)
- metrics - evaluation metrics (precision, recall, F1)
- language - simple language detection
- string - word (Levenshtein, Damerau-Levenshtein, Soundex, ...) and document (Vector Space Model) similarity metrics
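
To make some of the measures named for eWRT.stat concrete, here is a standalone sketch of the Dice coefficient, pointwise mutual information (PMI) and the Levenshtein distance. It uses the textbook definitions and does not call or reproduce eWRT's own functions; all function names and the count-based signatures are assumptions made for this illustration.

```python
import math


def dice(count_a: int, count_b: int, count_ab: int) -> float:
    """Dice coefficient from occurrence counts: 2 * |A and B| / (|A| + |B|)."""
    total = count_a + count_b
    return 2.0 * count_ab / total if total else 0.0


def pmi(count_a: int, count_b: int, count_ab: int, n: int) -> float:
    """Pointwise mutual information: log2(p(a, b) / (p(a) * p(b))).

    Probabilities are estimated as count / n, where n is the corpus size.
    """
    if min(count_a, count_b, count_ab, n) <= 0:
        raise ValueError("PMI is undefined for zero counts")
    return math.log2((count_ab * n) / (count_a * count_b))


def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character edits that turn s into t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(dice(10, 20, 5))
print(pmi(4, 4, 4, 16))                  # 2.0
print(levenshtein("kitten", "sitting"))  # 3
```

The count-based signatures keep the sketch independent of any data source; in practice the counts would come from a corpus or from Web search hit counts.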

eWRT.util - utility classes for transparent caching, logging, monitoring, etc.
- advLogging - log to SNMP handler
- assert - assertion based counters (decorators)
- async - asynchronous procedure calls (experimental)
- cache - transparent memory and disk caching of function calls (decorators)
- exception - SNMP exception handling
- loggerProfile - simplified logging
- module_path - compute relative paths
- monitoring - support for Nagios NSCA services
- pickleIterator - iterate over objects stored in pickle files
- profile - python profiling
- timing - time python methods (decorators)
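
Several eWRT.util modules are described as decorators. The sketch below shows the general pattern for a memory cache and a timing decorator using only the standard library. It is not eWRT's implementation: the names `memoize`, `timed` and `last_duration` are hypothetical, and the disk caching mentioned above is left out.

```python
import functools
import time


def memoize(func):
    """Cache results in memory, keyed by the (hashable) positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


def timed(func):
    """Record the wall-clock duration of the most recent call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            wrapper.last_duration = time.perf_counter() - start

    wrapper.last_duration = None
    return wrapper


calls = []  # records every real (uncached) invocation


@timed
@memoize
def slow_square(x):
    calls.append(x)
    return x * x


slow_square(4)
slow_square(4)  # served from the cache, so the function body does not run again
print(calls)    # [4]
print(slow_square.last_duration)
```

Stacking `timed` on top of `memoize` means the measured duration includes the cache lookup, which makes the effect of caching visible in the timings.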

eWRT.visualize - eWRT visualization library

eWRT.ws - Web service access (REST, Amazon, Flickr, Facebook, ...)
- amazon
- conceptnet
- delicious
- facebook
- flickr
- geonames
- google
- googlealerts
- googletrends
- linkedin
- opencalais
- rest - efficiently access/publish REST services
- rss
- technorati
- twitter
- wikidata
- wikipedia
- wordnet
- wot
- yahoo
- youtube

- python-libraries:
  - facebook api - http://code.google.com/p/pyfacebook/
  - google-trends api - http://github.com/suryasev/unofficial-google-trends-api/tree/master
  - oauth - http://oauth.googlecode.com/
  - simplejson - http://pypi.python.org/pypi/simplejson/
  - tango - http://tango.ryanmcgrath.org/
  - python-rdflib
  - python-nltk
  - python-feedparser (eWRT.ws.rss)
  - pywikibot (eWRT.ws.wikidata)
- text conversion (eWRT.input.conv):
  - lynx
  - pdftotext (poppler-utils)
  - antiword