Positional Index vs Inverted Index
- Traversing a directory of documents
- Reading each document and extracting and tokenizing all of its text
- Computing counts of documents and terms
- Building a dictionary of the unique terms that exist within the corpus
- Writing a sorted term dictionary out to a disk file
- Select a document,
- tokenize it,
- add terms to the dictionary,
- count occurrences,
- sort for searching.
- Number of documents processed
- Total number of terms parsed from all documents
- Total number of unique terms found and added to the index
(A minimal sketch of this pipeline follows.)
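A minimal sketch of the pipeline in Python, assuming plain-text files in a hypothetical corpus/ directory; it produces exactly the three statistics above.

import os
import re

def build_dictionary(corpus_dir):
    dictionary = {}                                  # term -> occurrence count
    num_docs = 0
    total_terms = 0
    for name in os.listdir(corpus_dir):              # traverse the directory
        path = os.path.join(corpus_dir, name)
        if not os.path.isfile(path):
            continue
        num_docs += 1
        with open(path, encoding="utf-8") as f:      # read and extract the text
            tokens = re.findall(r"[a-z0-9]+", f.read().lower())  # tokenize
        total_terms += len(tokens)
        for t in tokens:                             # add to dictionary and count
            dictionary[t] = dictionary.get(t, 0) + 1
    return num_docs, total_terms, dictionary

def write_sorted_dictionary(dictionary, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for term in sorted(dictionary):              # sort for searching
            out.write(f"{term}\t{dictionary[term]}\n")

num_docs, total_terms, dictionary = build_dictionary("corpus")
write_sorted_dictionary(dictionary, "dictionary.txt")
print(num_docs, total_terms, len(dictionary))        # the three statistics above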
Implementation of Hadoop search technology
A simple DB keeping track of the dictionary is:
- Both memory- and time-consuming at internet scale
- Potentially billions of documents
- A more efficient solution is needed, e.g., for the document (see bob run see spot throw)
- Count each term in order to reduce: combine and summarize
see 2
bob 1
run 1
spot 1
throw 1
- Key every term, then count
- Then merge the outputs together
- Then sort them
- Take all outputs, combine, and reduce them
Map(key=url, val=contents):
→ For each word w in contents, emit (w, "1")
Reduce(key=word, values=uniq_counts):
- Sum all the "1"s in the value list
- Emit the result (word, sum)
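A tiny in-process sketch of this word-count Map/Reduce in Python; map_fn and reduce_fn mirror the pseudocode, and the grouping step stands in for MapReduce's shuffle phase.

from collections import defaultdict

def map_fn(url, contents):
    for w in contents.split():              # for each word w in contents
        yield (w, 1)                        # emit (w, "1")

def reduce_fn(word, counts):
    return (word, sum(counts))              # sum all the "1"s in the value list

docs = {"doc1": "see bob run see spot throw"}
grouped = defaultdict(list)
for url, contents in docs.items():          # map phase
    for word, one in map_fn(url, contents):
        grouped[word].append(one)           # shuffle: group values by key
for word, counts in grouped.items():        # reduce phase
    print(reduce_fn(word, counts))          # ('see', 2), ('bob', 1), ...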
Automatic parallel execution in MapReduce
MapReduce in Hadoop
The dictionary data structure stores:
- the term vocabulary,
- the document frequency,
- a pointer to each postings list
Example: an array of structs as a naïve dictionary (a sketch follows)
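A sketch of such a naïve dictionary in Python; the fixed ~20-byte width per term follows the common textbook layout, and the field names are illustrative.

from dataclasses import dataclass

@dataclass
class DictEntry:
    term: str           # fixed width on disk, e.g. 20 bytes per term
    doc_freq: int       # number of documents containing the term
    postings_ptr: int   # offset of the term's postings list on disk

# The dictionary is a sorted array of entries, searched by binary search.
dictionary = [
    DictEntry("bob", 1, 0),
    DictEntry("see", 1, 64),
    DictEntry("spot", 1, 128),
]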
- Hashtables
Each vocabulary term is hashed to an integer
Pros: faster lookup than a tree
Cons: no easy way to find minor variants; no prefix search; rehashing is an expensive operation
- Trees
Simplest: Binary tree
More usual: B-trees
Pros: solves the prefix problem
Cons:
+ Slower: O(log M)
+ Rebalancing binary tree is expensive, but B-trees mitigate the rebalancing problem.
e.g., a binary tree in sorted order:
Searching always splits the remaining range in half.
Every internal node always has two children.
e.g., a B-tree:
Every internal node has a number of children in some fixed interval [a, b].
At any level a node may branch into two or more subtrees, giving multiple options per level (see the prefix-lookup sketch below).
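A sketch of why keeping terms in sorted order (as a binary tree or B-tree does) solves the prefix problem: all terms sharing a prefix form one contiguous range, bounded by two O(log M) lookups. A sorted Python list with bisect stands in for the tree here.

import bisect

terms = sorted(["home", "homework", "homily", "hound", "house", "how"])

def prefix_range(prefix):
    lo = bisect.bisect_left(terms, prefix)             # first term >= prefix
    hi = bisect.bisect_left(terms, prefix + "\uffff")  # just past the prefix
    return terms[lo:hi]

print(prefix_range("hom"))   # ['home', 'homework', 'homily']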
e.g., *mon: to find words ending in mon
- Find everything that matches a wildcard term, e.g., find words related to home: home*
- This results in the execution of many Boolean AND queries, e.g., home* AND house*
- Handling wildcards:
B-trees handle * at the end of a query
Permuterm index: handles * in the middle
e.g., hello is indexed under every rotation of hello$ (hello$, ello$h, llo$he, lo$hel, o$hell, $hello), each rotation supporting a different kind of search
Cons: increases the number of terms in the dictionary
- Rotate the query wildcard to the right
- Use a B-tree lookup (a permuterm sketch follows this list)
- Permuterm problem: the index uses a lot more space, empirically quadrupling the lexicon size
- k-gram index: finds terms based on a query consisting of k-grams
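A sketch of a permuterm index in Python, assuming $ marks the end of a term; a real system would answer the final prefix match with a B-tree rather than the linear scan used here.

def rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

permuterm = {}                       # rotation -> original terms
for term in ["hello", "help", "hollow"]:
    for rot in rotations(term):
        permuterm.setdefault(rot, set()).add(term)

def wildcard_lookup(query):
    # Rotate the query so the single * falls at the end ("hel*o" -> "o$hel*"),
    # then match rotations by that prefix.
    q = query + "$"
    star = q.index("*")
    prefix = q[star + 1:] + q[:star]
    return {t for rot, ts in permuterm.items()
            if rot.startswith(prefix) for t in ts}

print(wildcard_lookup("hel*"))    # {'hello', 'help'}
print(wildcard_lookup("h*low"))   # {'hollow'}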
- Dictionary compression
Aims to fit the dictionary, or at least a large portion of it, in main memory.
Dictionary-as-a-string: sort the vocabulary lexicographically and store it as one long string rather than as an array of fixed-width entries. Blocked storage groups the terms in the string into blocks of size k and keeps a term pointer only for the first term of each block (a sketch follows).
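A sketch of dictionary-as-a-string with blocked storage, block size k = 4: each term is stored with a leading length byte, and a term pointer is kept only for the first term of each block.

K = 4
terms = sorted(["automata", "automate", "automatic", "automation",
                "bold", "boldly", "border", "bore"])

dict_string = ""
block_ptrs = []                      # offset of the first term in each block
for i, t in enumerate(terms):
    if i % K == 0:
        block_ptrs.append(len(dict_string))
    dict_string += chr(len(t)) + t   # length byte followed by the term

print(block_ptrs)                    # [0, 39]: one pointer per block of 4 terms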
- Rule of 30
The 30 most common words account for about 30% of the tokens in written text.
Thus a lossy method can be used for compression without losing much effectiveness in encoding the data.
- Lossy Compression
Some amount of data is lost during the process
- Lossless Compression
No data is lost during compression
- Heaps' law
Estimates the number of unique terms M in a collection from the number of tokens T parsed from all documents:
M = kT^β
in which T is the number of tokens in the collection and k and β are parameters (a worked example follows).
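A worked example in Python. The parameter values k = 44 and β = 0.49 are the fit reported for the RCV1 collection in the IR book; for other collections they must be re-estimated.

def heaps(T, k=44, beta=0.49):
    return k * T ** beta             # M = k * T^beta

# Estimated vocabulary size after the first 1,000,020 tokens of RCV1:
print(round(heaps(1_000_020)))       # ~38,323 unique terms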
- Zipf's law
cf_i = c · i^k with k = −1, i.e., the collection frequency cf_i of the i-th most common term is proportional to 1/i
One of the types of power law
- Power law
- Front Coding
- Variable Byte Encoding
- Nibble
- Unary Code
A string of n 1s followed by a 0
- Encoding
Two types of methods: bytewise and bitwise.
Variable byte encoding uses an integral number of bytes to encode a gap, rather than the docID itself (a sketch of unary and variable byte encoding follows).
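A sketch of both encodings in Python. The variable byte scheme follows the IR book's convention: 7 payload bits per byte, with the high (continuation) bit set only on the final byte of a gap.

def unary(n):
    return "1" * n + "0"              # n as a string of n 1s followed by a 0

def vb_encode(gap):
    payload = []
    while True:
        payload.insert(0, gap % 128)  # peel off 7 bits at a time
        if gap < 128:
            break
        gap //= 128
    payload[-1] += 128                # set the high bit on the final byte
    return bytes(payload)

print(unary(3))                       # 1110
print(vb_encode(5).hex())             # 85
print(vb_encode(214577).hex())        # 0d0cb1 -> three bytes, not four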
- Entropy
- δ Codes
Asymptotically optimal as entropy H(P) → ∞ (a gamma-code sketch, the building block for δ codes, follows)
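A sketch of gamma coding in Python: the offset is n's binary representation without its leading 1, preceded by the offset's length in unary. A δ code encodes that length with a gamma code instead of unary, which yields the asymptotic optimality noted above.

def gamma(n):
    offset = bin(n)[3:]                    # binary of n without its leading 1
    return "1" * len(offset) + "0" + offset

print(gamma(1))    # 0
print(gamma(9))    # 1110001 (unary length 1110, offset 001)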
Purpose: to get the information that is available on the web.
Process, as described by Manning (2009, chapter 20):
- Begin with a seed set of URL(s)
- Pick a URL from this seed set, then fetch the web page at that URL
- Parse the fetched page to extract links and text
- Feed the extracted text to a text indexer
- Add the extracted links to the URL frontier
- The corresponding pages for those URLs are then fetched by the crawler
- The URL frontier initially contains the seed set
- A URL is deleted from the URL frontier once its page has been fetched
- The entire process can be viewed as traversing the web graph (a minimal crawl-loop sketch follows)
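A minimal single-threaded sketch of this loop using only the Python standard library; index_text is a hypothetical stand-in for a real indexer, and politeness delays, robots.txt handling, and content de-duplication are all omitted.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":                       # extract links from the page
            self.links += [v for k, v in attrs if k == "href" and v]

def index_text(url, html):                   # hypothetical stand-in indexer
    print(f"indexed {url}: {len(html)} characters")

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)              # frontier starts as the seed set
    seen = set(seed_urls)
    while frontier and max_pages > 0:
        url = frontier.popleft()             # pick a URL and fetch its page
        max_pages -= 1
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                         # unreachable or malformed URL
        index_text(url, html)                # feed the text to the indexer
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:         # add new links to the frontier
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

crawl(["https://example.com/"])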
Figure: the basic crawler architecture, reproduced from Figure 20.1 (Manning, 2009, chapter 20)
Static: the same prebuilt content is served each time the page is loaded.
Dynamic: the content changes and can be generated on the fly.
Boolean Retrieval vs Wildcard Queries vs Phrase Queries
Improving Score Computation and Ranking
Manning, C.D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online ed.). Cambridge, England: Cambridge University Press. Available at http://nlp.stanford.edu/IR-book/information-retrieval-book.html