Skip to content

Latest commit

 

History

History
 
 

A9

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 

Assignment 9

Written in Python 3.6.

Python Dependencies used:

  • feedparser

To install dependencies use the following command:

$ pip3 install -r requirements.txt

Assignment Details

Assignment #9

Due: 11:59pm May 1 2017

Support your answer: include all relevant discussion, assumptions, examples, etc.

  1. Choose a blog or a newsfeed (or something similar with an Atom or RSS feed). Every student should do a unique feed, so please "claim" the feed on the class email list (first come, first served). It should be on a topic or topics of which you are qualified to provide classification training data. Find something with at least 100 entries (or items if RSS).

Create between four and eight different categories for the entries in the feed:

examples:

work, class, family, news, deals

liberal, conservative, moderate, libertarian

sports, local, financial, national, international, entertainment

metal, electronic, ambient, folk, hip-hop, pop

Download and process the pages of the feed as per the week 12 class slides.

Be sure to upload the raw data (Atom or RSS) to your github account.

Create a table with 100 rows, like:

title			classification
-----			--------------
Ric Ocasek -		80s 
"Something To Grab 
For" (forgotten song)	

Weezer - "Pinkerton" 	alternative
(LP Review)

Schon & Hammer - 	80s
"No More Lies" 
(forgotten song)

etc. This is your "ground truth" (or "gold standard") data.

  1. Train the Fisher classifier on the first 50 entries (the "training set"), then use the classifier to guess the classification of the next 50 entries (the "test set").

Create a table with 50 rows, like

title			actual		predicted
-----			------		---------
Donnie Iris - 		80s		80s
"Ah! Leah!" 
(Forgotten Song)	

Black Sabbath - 	metal		metal
"Vol. 4" (LP Review)

Catherine Wheel - 	alternative	metal
"Ferment" (LP Review)

Assess the performance of your classifier in each of your categories by computing precision, recall, and F-measure. Use the "macro-averaged" label based method, as per:

http://stats.stackexchange.com/questions/21551/how-to-compute-precision-recall-for-multiclass-multilabel-classification

For example, if you have 5 categories (e.g., 80s, metal, alternative, electronic, cover), you will compute precision, recall, and F-measure for each category, and then compute the average across the 5 categories.

  1. Repeat question #2, but use the first 90 entries to train your classifier and the last 10 entries for testing.

The questions below is for 5 points extra credit

  1. Rerun question 3, but with "10-fold cross validation". What was the change, if any, in precision and recall (and thus F-Measure)?