Gaston Sanchez
- Work with the package
"stringr"
- String manipulation
- More regular expressions
- A bit of data cleaning
- Write your descriptions, explanations, and code in an
Rmd
(R markdown) file. - Name this file as
lab11-first-last.Rmd
, wherefirst
andlast
are your first and last names (e.g.lab11-gaston-sanchez.Rmd
). - Knit your
Rmd
file as an html document (default option). - Submit your
Rmd
andhtml
files to bCourses, in the corresponding lab assignment. - Due date displayed in the syllabus (see github repo).
You’ll be working with the data file text-emotion.csv
available in the
course github
repository.
https://raw.githubusercontent.com/ucb-stat133/stat133-labs/master/data/text-emotion.csv
The original source is the data set “Emotion in Text” from the website Crowd Flower Data for Everyone https://www.crowdflower.com/data-for-everyone/
The file contains four columns:
tweet_id
: tweet identifiersentiment
: class or sentiment labelauthor
: username author of the tweetcontent
: content of the tweet
In your Rmd
file write R code to do computations in order to answer
each of the following questions
Also, while you work on this lab you may want to look at the cheat sheets for:
-
Count the number of characters in the tweet contents; create a vector for this purpose. It is possible that you find tweets containing more than 140 characters. This has to do with the so-called predefined XML entities such as
&
which represents an ampersand&
"
which represent quotes"
<
which represents less-than symbol<
>
which represents greater-than symbol>
-
Display the
summary()
of the vector obtained above. -
Likewise, graph a histogram of these counts. To plot the histogram, use a bin width of 5 units: 1-5, 6-10, 11-15, 16-20, etc. In other words: the first bin involves tweets between 1 and 5 characters (inclusive), the second bin involves tweets containing between 6 and 10 characters (inclusive), and so on.
-
Are there any tweets with 0 characters? (write a command that answers this question).
-
Are there any tweets with 1 character? If yes (write commands that answer these questions):
- how many?
- what is their content?
- what is their location (i.e. index or position)?
-
What is the tweet with the most characters (i.e. max length)? (write a command that answers these questions).
- the number of characters
- display its content
- what is its location (i.e. index or position)?
According to Twitter, usernames:
- cannot be longer than 15 characters
- can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores (i.e. cannot contain any symbols, dashes or spaces, except underscores)
- If you want to know more about twitter usernames, visit:
https://help.twitter.com/en/managing-your-account/twitter-username-rules
Confirm that the values in column author
follow each of the rules for
valid usernames:
-
No longer than 15 characters (if you find usernames longer than 15 characters, display them)
-
Contain alphanumeric characters and underscores. If you find usernames containing other symbols, display them. Hint: The non-word character class is
"\\W"
is your friend. -
What is the number of characters of the shortest usernames? And what are the names of these authors? (write commands to answer these questions)
-
How many tweets contain at least one caret symbol
"^"
(write a command to answer this question). -
How many tweets contain three or more consecutive dollar symbols
"$"
(write a command to answer this question). -
How many tweets do NOT contain the characters
"a"
or"A"
(write a command to answer this question). -
Display the first 10 elements of the tweets that do NOT contain the characters
"a"
or"A"
(write a command to answer this question). -
Number of exclamation symbols
"!"
: compute a vector with the number of exclamation symbols in each tweet, and display itssummary()
. -
What’s the tweet (content) with the largest number of exclamation symbols
!
? Display its content. (write a command to answer this question)
-
What are the different types of sentiments (i.e. categories)? (write a command that answers this question)
-
Compute the frequencies (i.e. counts) of each sentiment (and display these frequencies).
-
Graph the relative frequencies (i.e. proportions) with a horizontal barplot (bars horizontally oriented) in decreasing order, including names of sentiment types.
-
Sentiment and length of tweets: compute a table with the average length of characters per sentiment (i.e. average number of characters for
neutral
tweets, forhappy
tweets, etc.). Display this table.
People use the hashtag symbol #
before a relevant keyword or phrase in
their tweet to categorize those Tweets and help them show more easily in
Twitter search. According to Twitter, hashtags cannot contain spaces or
punctuation symbols, and cannot start with or use only numbers.
-
Count the number of (valid) hashtags in the tweet contents.
-
Display such frequencies, and make a barplot of these counts (i.e. number of tweets with 0 hashtags, with 1 hashtag, with 2 hashtags, etc).
-
What is the average length of the hashtags?
-
What is the most common length (i.e. the mode) of the hashtags?
-
How many tweets contain the individual strings
"omg"
or"OMG"
(write a command to answer this question). For example:omg I just saw them again
(this would be a match)OMG I just saw them again
(this would be a match)I just saw them again omg
(this would be a match)I just saw them again OMG
(this would be a match)I just saw them omg can't believe it
(this would be a match)I just saw them OMG can't believe it
(this would be a match)omg: I just saw them again
(this would NOT be a match)OMG,I just saw them again
(this would NOT be a match)I just saw them again omg!!!
(this would NOT be a match)I just saw them again omgomgomg
(this would NOT be a match)I just saw them again lol-omg!!!
(this would NOT be a match)