Merge pull request #70 from CPSSD/notebooks
Fixed progression of LUCAS explanation and moved the data nb to the wiki.
Deniall authored Nov 17, 2018
2 parents 0e8fcf5 + 206f984 commit 8a43307
Showing 3 changed files with 28 additions and 146 deletions.
34 changes: 21 additions & 13 deletions LUCAS/notebooks/LUCAS.ipynb
@@ -69,7 +69,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Naive Bayes assumes features are independent"
"## Naive Bayes assumes features are independent"
]
},
{
@@ -97,7 +97,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multivariate Bernoulli Naive Bayes"
"## Multivariate Bernoulli Naive Bayes"
]
},
{
@@ -128,14 +128,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multinomial Naive Bayes"
"## Multinomial Naive Bayes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modelling and Classification\n",
"### Modelling and Classification\n",
"\n",
"Multinomial Naive Bayes models the distribution of words in a document as a multinomial. A document is treated as a sequence of words and it is assumed that each word position is generated independently of every other. For classification, we assume that there are a fixed number of classes, $ c {\\in} \\{1, 2, . . . , m\\} $, each with\n",
"a fixed set of multinomial parameters. The parameter\n",
@@ -180,7 +180,7 @@
"The weights for the decision boundary defined by the MNB classifier are the log parameter estimates,\n",
"(7) $$ \\hat w_{ci} = log \\hat {\\theta} _{ci}. $$\n",
"\n",
"## Complement Naive Bayes\n",
"### Complement Naive Bayes\n",
"\n",
"It has been shown (see fig. 1) that skewed training data (that is, more training samples in one class than another) will cause the classifier to have bias towards that class, and prefer it. To deal with the problem, we introduce the 'Complement' class of Naive Bayes, CNB. Instead of the regular MNB, which uses training data from a single class, $c$, to estimate training weights, CNB estimates using data from every class except $c$.\n",
"\n",
@@ -199,7 +199,7 @@
"\n",
"CNB is related to the one-versus-all-but-one technique that is frequently used in multi-label classification, where each example may have more than one label. Berger (1999) and Zhang and Oles (2001) have found that one-vs-all-but-one MNB works better than regular MNB. The combined classification rule is basically the above equation (8) except it has the complement weights subtracted from the normal weights. However, CNB performs better than OVA MNB and regular MNB because it eliminates the biased regular MNB weights.\n",
"\n",
"## Weight Magnitude Errors\n",
"### Weight Magnitude Errors\n",
"\n",
"Because of the independence assumption, Naive Bayes can be biased to give more weight to those classes that most violate the indepence assumption. A good example of this is:\n",
"\n",
@@ -218,7 +218,7 @@
"\n",
"We call this, combined with CNB, Weight-normalized Complement Naive Bayes (WCNB). Experiments indicate that WCNB is effective. \n",
"\n",
"## Transforming Term Frequency\n",
"### Transforming Term Frequency\n",
"\n",
"It was found that term distributions had heavier tails than predicted by the multinomial model, instead appearing like a power-law distribution. Using a simple transform, we can make these power-law-like term distributions look more multinomial.\n",
"\n",
@@ -232,7 +232,7 @@
"\n",
"<sup>(Although setting d = 1 does not match the data as well as an optimized d, it does produce a distribution that is much closer to the empirical distribution than the best fit multinomial.)<sup>\n",
" \n",
"## Transforming by Document Frequency\n",
"### Transforming by Document Frequency\n",
"\n",
"Another useful transform discounts terms that occur in many documents. Common words are unlikely to be related to the class of a document, but random variations can create apparent fictitious correlations.\n",
"This adds noise to the parameter estimates and hence the classification weights. Since common words appear often, they can hold sway over a classification decision even if their weight differences between classes is small. For this reason, it is advantageous to downweight these words.\n",
@@ -243,7 +243,7 @@
"\n",
"where ${\\delta}_{ij}$ is 1 if word i occurs in document j, 0 otherwise, and the sum is over all document indices (Salton & Buckley, 1988). Rare words are given increased term frequencies; common words are given less weight.\n",
"\n",
"## Transforming Based on Length\n",
"### Transforming Based on Length\n",
"\n",
"Documents have strong word inter-dependencies. After a word first appears in a document, it is more likely to appear again. Since MNB assumes occurrence independence, long documents can negatively effect parameter estimates.\n",
"We normalize word counts to avoid this problem. We again use a common transform that is not seen with Naive Bayes. We discount the influence of long documents by transforming the term frequencies according to\n",
@@ -254,7 +254,7 @@
"yielding a length 1 term frequency vector for each document. The transform keeps any single document\n",
"from dominating the parameter estimates.\n",
"\n",
"## Conclusion and derivation of steps to arrive at TWCNB\n",
"### Conclusion and derivation of steps to arrive at TWCNB\n",
"\n",
"Let \n",
"- $ \\vec{d} = (\\vec{d}_1, . . . , \\vec{d}_n) $ be a set of documents; $d_{ij}$ is the count of word $i$ in document $j$.\n",
@@ -320,8 +320,16 @@
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"# K-Nearest Neighbours"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Support Vector Machine"
]
}
],
"metadata": {
@@ -340,7 +348,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
"version": "3.7.1"
},
"toc": {
"base_numbering": 1,
126 changes: 0 additions & 126 deletions LUCAS/notebooks/The Data.ipynb

This file was deleted.
