
Exercise 1 - Titanic

Download the Titanic dataset from Kaggle, and complete the following exercises. The dataset consists of personal information for all the people aboard the Titanic, along with whether or not they survived the sinking. For now we'll stick to preprocessing and general handling of the data.

  1. Read the data in your favorite language. Hint: with Python you can use pandas. (A sketch covering steps 1-6 appears after the JSON example below.)

  2. Have a look at the data. We will build new representations of the dataset that are better suited for particular purposes. Some of the columns, e.g. "Name", simply identify a person and are not useful for prediction tasks - remove them.

  3. The column "Cabin" contains a letter and a number. A clever data scientist might conclude that the letter stands for a deck on the ship (which is indeed true) and that having just the deck information might improve the results of a classifier predicting an output variable. Add a new column to the dataset, which is simply the deck letter.

  4. You'll notice that some of the columns, such as the previously added deck letter, are categorical. Their representation as strings is not efficient for further computation. Transform them into numeric values so that a unique integer id corresponds to each distinct category. Hint: pandas can do this for you.

  5. Some of the rows in the data have missing values, e.g. when the cabin number of a person is not known. Most machine learning algorithms have trouble with missing values, so they need to be handled in preprocessing:

    a) For continuous values, replace the missing values with the average of the non-missing values of that column.

    b) For discrete and categorical values, replace the missing values with the median of the column.

    This is known as imputation.

  6. At this point, all data are numeric. Write the data, with the modifications, to a .csv file. Then, write another file, this time in the json format, with the following structure:

```json
[
    {
        "Deck": 0,
        "Age": 20,
        "Survived": 0,
        ...
    },
    {
        ...
    }
]
```
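
To make the steps concrete, here is a minimal sketch of steps 1-3 in Python, assuming the Kaggle training file is named train.csv and uses the standard Kaggle column names (PassengerId, Name, Ticket, Cabin, etc.):

```python
import pandas as pd

# Step 1: read the data (assuming the Kaggle file is named "train.csv").
df = pd.read_csv("train.csv")

# Step 2: drop columns that merely identify a person.
df = df.drop(columns=["PassengerId", "Name", "Ticket"])

# Step 3: the first character of "Cabin" is the deck letter; missing
# cabins simply stay missing (NaN) for now.
df["Deck"] = df["Cabin"].str[0]
df = df.drop(columns=["Cabin"])
```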
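Steps 4 and 5 might then look as follows; note that pandas' .cat.codes marks missing values as -1, so those codes are treated as missing during imputation:

```python
# Step 4: encode categorical columns as integer ids; pandas assigns a
# unique integer code per distinct category (missing values become -1).
for col in ["Sex", "Embarked", "Deck"]:
    df[col] = df[col].astype("category").cat.codes

# Step 5a: continuous columns - replace missing values with the mean.
for col in ["Age", "Fare"]:
    df[col] = df[col].fillna(df[col].mean())

# Step 5b: categorical columns - replace the -1 codes (missing values)
# with the median of the known codes, as the exercise asks.
for col in ["Sex", "Embarked", "Deck"]:
    known = df.loc[df[col] >= 0, col]
    df.loc[df[col] < 0, col] = int(known.median())
```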
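For step 6, pandas can write both formats directly; orient="records" produces the array-of-objects structure shown above (the output file names here are just placeholders):

```python
# Step 6: write the cleaned, all-numeric table to disk.
df.to_csv("titanic_clean.csv", index=False)
df.to_json("titanic_clean.json", orient="records", indent=4)
```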

Exercise 2 - Text data

Next we'll look at some text data. We'll be looking into Amazon reviews, and the steps needed to transform a raw dataset into one more suitable for prediction tasks.

  1. Download the automotive 5-core dataset from here. Extract it to find the data in json format. You can also download one of the bigger ones, if you are feeling ambitious.

  2. The reviewText field contains the unstructured review text written by the user. When dealing with natural language, it is important to notice that while, for example, the words "Copper" and "copper." are represented by two different strings, they have the same meaning. When applying statistical methods on this data, it is useful to ensure that words with the same meaning are represented by the same string.

    To do this, we usually normalize the data, for example by removing punctuation and capitalization differences. A related issue is that, while the words "swims" and "swim" are again distinct strings, they both refer to swimming. Stemming refers to the process of mapping words in inflected form to their base form: swims -> swim, etc.

    Finally, another popular approach is to remove so-called stop words: words that are very common and carry little information about the actual content. There are plenty of openly available stop-word lists for almost any (natural) language.

  3. Do the following (a sketch of the whole pipeline appears after this list):

    a) Open the json file in your favorite environment, e.g. Python

    b) Access the reviewText field, and downcase the contents

    c) Remove all punctuation, as well as the stop words. You can find a stop-word list for English e.g. here

    d) Apply a stemmer to the paragraphs, so that inflected forms are mapped to the base form. For example, for Python, the popular Natural Language Toolkit (nltk) has an easy-to-use stemmer.

    e) Filter the data by selecting reviews where the field overall is 4 or 5, and store the records in the file pos.txt. Similarly, select reviews with rating 1 or 2 and store them in the file neg.txt. (Ignore the reviews with overall rating 3.) Each line in the two files should contain exactly one preprocessed review.
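
As a sketch of steps a)-d), assuming the extracted file is named Automotive_5.json with one JSON object per line, and using nltk's Snowball stemmer and English stop-word list (run nltk.download("stopwords") once first):

```python
import json
import string

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def preprocess(text):
    # b) downcase, c) strip punctuation and stop words, d) stem.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stemmer.stem(w) for w in text.split()
                    if w not in stop_words)

# a) the extracted file has one JSON object per line.
with open("Automotive_5.json") as f:
    reviews = [json.loads(line) for line in f]
```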
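Step e) is then a single pass over the parsed reviews, writing one preprocessed review per line:

```python
# e) split by the "overall" rating; rating 3 is ignored.
with open("pos.txt", "w") as pos, open("neg.txt", "w") as neg:
    for r in reviews:
        line = preprocess(r.get("reviewText", "")) + "\n"
        if r["overall"] >= 4:
            pos.write(line)
        elif r["overall"] <= 2:
            neg.write(line)
```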
