-
Notifications
You must be signed in to change notification settings - Fork 2
/
KCompetition_TextData.R
90 lines (49 loc) · 4.38 KB
/
KCompetition_TextData.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
# KAGGLE COMPETITION - DEALING WITH THE TEXT DATA
# This script file is intended to help you deal with the text data provided in the competition data files
# If you haven't already, start by reading the data into R
# Make sure you have downloaded these files from the Kaggle website, and have navigated to the directory where you saved the files on your computer
# We are adding in the argument stringsAsFactors=FALSE, since we have some text fields
eBayTrain = read.csv("eBayiPadTrain.csv", stringsAsFactors=FALSE)
eBayTest = read.csv("eBayiPadTest.csv", stringsAsFactors=FALSE)
# Now, let's load the "tm" package.
library(tm)
# Then create a corpus from the description variable. You can use other variables in the dataset for text analytics, but we will just show you how to use this particular variable.
# Note that we are creating a corpus out of the training and testing data.
CorpusDescription = Corpus(VectorSource(c(eBayTrain$description, eBayTest$description)))
# You can go through all of the standard pre-processing steps like we did in Unit 5:
CorpusDescription = tm_map(CorpusDescription, content_transformer(tolower), lazy=TRUE)
# Remember this extra line is needed after running the tolower step:
CorpusDescription = tm_map(CorpusDescription, PlainTextDocument, lazy=TRUE)
CorpusDescription = tm_map(CorpusDescription, removePunctuation, lazy=TRUE)
CorpusDescription = tm_map(CorpusDescription, removeWords, stopwords("english"), lazy=TRUE)
CorpusDescription = tm_map(CorpusDescription, stemDocument, lazy=TRUE)
# Now we are ready to convert our corpus to a DocumentTermMatrix, remove sparse terms, and turn it into a data frame.
# We selected one particular threshold to remove sparse terms, but remember that you can try different numbers!
dtm = DocumentTermMatrix(CorpusDescription)
sparse = removeSparseTerms(dtm, 0.995)
DescriptionWords = as.data.frame(as.matrix(sparse))
# Let's make sure our variable names are okay for R:
colnames(DescriptionWords) = make.names(colnames(DescriptionWords))
# Now we need to split the observations back into the training set and testing set.
# To do this, we can use the head and tail functions in R.
# The head function takes the first "n" rows of DescriptionWords (the first argument to the head function), where "n" is specified by the second argument to the head function.
# So here we are taking the first nrow(eBayTrain) observations from DescriptionWords, and putting them in a new data frame called "DescriptionWordsTrain"
DescriptionWordsTrain = head(DescriptionWords, nrow(eBayTrain))
# The tail function takes the last "n" rows of DescriptionWords (the first argument to the tail function), where "n" is specified by the second argument to the tail function.
# So here we are taking the last nrow(eBayTest) observations from DescriptionWords, and putting them in a new data frame called "DescriptionWordsTest"
DescriptionWordsTest = tail(DescriptionWords, nrow(eBayTest))
# Note that this split of DescriptionWords works to properly put the observations back into the training and testing sets, because of how we combined them together when we first made our corpus.
# Before building models, we want to add back the original variables from our datasets. We'll add back the dependent variable to the training set, and the WordCount variable to both datasets. You might want to add back more variables to use in your model - we'll leave this up to you!
DescriptionWordsTrain$sold = eBayTrain$sold
DescriptionWordsTrain$WordCount = eBayTrain$WordCount
DescriptionWordsTest$WordCount = eBayTest$WordCount
# Remember that you can always look at the structure of these data frames to understand what we have created
# Now let's create a logistic regression model using all of the variables:
DescriptionWordsLog = glm(sold ~ ., data=DescriptionWordsTrain, family=binomial)
# And make predictions on our test set:
PredTest = predict(DescriptionWordsLog, newdata=DescriptionWordsTest, type="response")
# Now we can prepare our submission file for Kaggle:
MySubmission = data.frame(UniqueID = eBayTest$UniqueID, Probability1 = PredTest)
write.csv(MySubmission, "SubmissionDescriptionLog.csv", row.names=FALSE)
# You should upload the submission "SubmissionDescriptionLog.csv" on the Kaggle website to use this as a submission to the competition
# This script file was just designed to help you get started - to do well in the competition, you will need to build better models!