---
title: "Data Wrangling with R"
subtitle: "RLadies Colombo - April 2021 Meetup"
output:
  xaringan::moon_reader:
    css: [default, kunoichi,"index.css"]
    lib_dir: libs
    seal: false
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: center, middle, first-slide, inverse
<span style="font-family: 'Playfair Display', serif; font-weight: 500; font-size: 54pt">
Data Wrangling with R
</span>
<hr style="width:60%;margin: 0 auto;"/>
<span style="font-family: 'Lato', sans-serif;font-size: 24pt;font-weight:400;">
R Ladies Colombo <br/>
<span style="font-weight:300;font-size: 18pt;">April 2021 Meetup</span>
</span>
.center[<img src="images/bg4.png" style="position:fixed;bottom:0px;left:-0%"/>]
---
<h1 style="text-align:center;color:#363a59">Hi there</h1>
.pull-left[
.center[<img src="images/profile_picture.jpg" style="
height: auto;
border-radius: 50%;
border: 3px solid #414770;
width:45%;"/>] <hr/> <br/>
.center[
<h2 style="color:#5b85aa"> Janith Wanniarachchi </h2>
<br/>
.status-highlight[BSc. Statistics (Special) Undergraduate] at University of Sri Jayewardenepura, Sri Lanka,
.status-highlight[Data Science Intern] at Trabeya,
.status-highlight[Consultant Data Analyst] at Creative Hub.
]
]
.pull-right[
Co-author of two R packages maintained by Dr. Thiyanga Thalagala, [DSJobTracker](https://github.com/thiyangt/DSjobtracker) and [tsdataleaks](https://github.com/thiyangt/tsdataleaks).
.center[<img src="https://github.com/thiyangt/DSjobtracker/raw/master/hexsticker/figures/hexsticker.png" style="width: 30%"> <img src="https://github.com/thiyangt/ThiTal/raw/master/static/Slides/isf2020/hexsticker.png" style="width: 30%">]
Runner-up in two datathons and finalist in two more.
<br/>
<img src="images/2019_hacktoberfest.jpg" style="width:50%;transform: rotate(-40deg);" /><img src="images/2020_hacktoberfest.jpg" style="width:50%;transform: rotate(40deg);" />
]
---
# What is this data *wrangling*?
.center[<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/dplyr_wrangling.png" style="width:40%">]
.footnote[Image credit: stats illustrations by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)]
* Data wrangling means taking a raw dataset and transforming it into a format that suits the task at hand.
* Think of it as meal prep: we get the data ready for the main dish (in this case a model or a visualization).
* Several packages can help with data wrangling, but today we will only scratch the surface of the `dplyr` package from the tidyverse ecosystem.
---
<h1 style="text-align:center">What are we going to do today?</h1>
.center[<img src="https://i.imgflip.com/55djd3.jpg"/>]
???
* I won't be teaching you a list of functions and how to use them today. Instead, I will walk you through an actual scenario where these functions might be needed.
* The main outcome should be an idea of how to approach a problem and how to combine the different tools that you have under your belt.
---
```{r,echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
spotify <- read_csv("data/spotify-songs.csv")
```
## Here's the scene,
.left-column[
<br/><br/>
<img src="images/spotify_logo.png"/>
<br/><br/>
.center[<span style="font-size:64pt"> + </span>]
<br/>
<img src="images/rstudio_logo_cropped.png"/>
]
.right-column[* You are the new Data Science intern at Spotify, and you have been asked to work on a simple CSV dataset that resembles the actual one.
* You are given a snapshot of the dataset from 2020 to work on, while your supervisor builds the user interface of the recommender system.
* She told you to use the [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md) data on Spotify from 2020-01-21.
* At the end she will talk about extending this into an actual real-time application.
* You will write the R code locally using the dataset you were given, while she works on the functions that show the plots and tables to the user.
* Most of your time will usually be spent on data wrangling, so she is trusting you to get the job done.
]
---
### Glimpse at the data
```{r,echo=FALSE,comment=NA}
spotify %>% glimpse(width = 70)
```
---
class: middle, center
<img src="images/numbers_mason_meme.jpg" style="width: 150%;height: 150%"/>
---
# What are all these numbers?
* The full data dictionary is available [here](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md#spotify_songscsv), and you manage to jot it down as follows.
* We have the id, genre and subgenre of each playlist.
* The id, name and release date of each album.
* Finally the id, name, artist and popularity of each track.
* In addition we have the following features of songs.
    + danceability : is it suitable for dancing (range: [0,1])
    + energy : how intense and active is the song (range: [0,1])
    + loudness : the overall loudness of the track in dB (range: [-60,0])
    + speechiness : is there a lot of speech in it? (range: [0,1])
    + acousticness : is the song an acoustic song (range: [0,1])
    + instrumentalness : does the song have no vocal content (range: [0,1])
    + liveness : is there an audience in the song (range: [0,1])
    + valence : how positive is the song (range: [0,1])
    + key : the key of the song (CAUTION: a value of -1 means that no key was detected)
    + mode : 1 if the song is in a major key, 0 if it is in a minor key
    + tempo : BPM of the song
---
class: middle, center, title-only-slide
<h1>Task No. 1 <br/>(Spot the Cool Kids)</h1>
---
# Task No. 1 (Spot the Cool Kids)
.pull-left[
* As the first piece of the user interface for the recommendation system, your supervisor thought of starting off with the basics.
* She has decided to put up a table showing the top 5 popular tracks and their artists.
* Your supervisor has asked you to make the table that goes out on the front page of the user interface.
]
.pull-right[
* Let's break it down.
]
--
<div class="pull-right" style="clear:none">
<ul>
<li>First let's select the tracks and their artist names and how popular they are. </li>
<li>Then we can think of some way to get the top five from that table. </li>
<li>This should work right? right????</li>
</ul>
<span class="center">
<img src="https://i.pinimg.com/originals/6c/06/7f/6c067f95ccf0401be464131c4e4f6c32.jpg" style="width:60%"/>
</span>
</div>
---
## How do we *select* column names?
```{r,eval=FALSE}
spotify %>%
select(track_name,track_artist,track_popularity) #<<
```
* The `select` function helps us pick the columns that we want. We can give it individual columns, a range of columns, or even the columns that we don't want to keep.
* The `select` function works as follows.
    + The first argument is the tibble or data frame. In this case we have piped in the spotify dataset using the pipe operator.
    + The rest of the arguments are the different ways of selecting the variables.
    + For example, if we want to select column A from tibble t we would say `select(t,A)`. Likewise, if we want to select everything except column B from tibble t we would say `select(t,-B)`.
    + You can even use a set of helper functions (the tidyselect helpers) to select columns based on how the name starts or ends, for example `select(t,starts_with("LKR_"))`.

Right now you know how to select the top rows of a tibble with the `head` function, so if you can get the tracks arranged by their popularity then everything's all set!
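
Here is a small sketch of those ways of selecting, using a made-up tibble called `prices` (the tibble and its columns are invented for illustration and are not part of the Spotify data).

```r
library(dplyr)

# Hypothetical toy data, only used to show the different select() styles
prices <- tibble(
  item     = c("tea", "rice"),
  LKR_2020 = c(450, 120),
  LKR_2021 = c(520, 150),
  note     = c("per kg", "per kg")
)

by_name   <- prices %>% select(item, note)           # keep the named columns
by_range  <- prices %>% select(item:LKR_2021)        # keep a range of columns
dropped   <- prices %>% select(-note)                # keep everything except `note`
by_prefix <- prices %>% select(starts_with("LKR_"))  # keep columns by name prefix

names(by_prefix)
```

All four calls return a new tibble and leave `prices` itself untouched.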
---
# Order! in the dataset
We need a function that can sort the rows based on one or more columns.
--
This is where the `arrange` function comes into play. The `arrange` function takes a tibble as its first argument, followed by the columns to sort by, in increasing or decreasing order.
--
```{r,eval=FALSE}
spotify %>%
select(track_name,track_artist,track_popularity) %>%
arrange(track_popularity) %>% #<<
head(5)
```
```{r,echo=FALSE,eval=TRUE}
spotify %>%
select(track_name,track_artist,track_popularity) %>%
arrange(track_popularity) %>%
head(5) %>%
knitr::kable()
```
--
Wait, something's not right: this isn't in descending order! We have to tell the `arrange` function to sort them in descending order.
---
You can tell the `arrange` function to sort in descending order by putting a minus sign in front of the column name, somewhat similar to dropping a column in the `select` function.
--
```{r,eval=FALSE}
spotify %>%
select(track_name,track_artist,track_popularity) %>%
arrange(-track_popularity) %>% #<<
head(5)
```
```{r,echo=FALSE,eval=TRUE}
spotify %>%
select(track_name,track_artist,track_popularity) %>%
arrange(-track_popularity) %>%
head(5) %>%
knitr::kable()
```
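
The minus sign works here because `track_popularity` is numeric. dplyr also provides the `desc()` helper, which does the same job and works on text columns too. A minimal sketch on a made-up tibble (the `songs` data below is invented for illustration):

```r
library(dplyr)

# Hypothetical toy data
songs <- tibble(
  track      = c("a", "b", "c"),
  popularity = c(40, 95, 70)
)

with_minus <- songs %>% arrange(-popularity)       # minus sign on a numeric column
with_desc  <- songs %>% arrange(desc(popularity))  # same ordering, using desc()

# desc() also handles character columns, where a minus sign would raise an error
by_name_desc <- songs %>% arrange(desc(track))
```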
--
Hold up, why is it repeating like that?
One track can be in multiple playlists! That's why! Now what do we do?
Let's select only the distinct or unique rows.
--
For that we can use the `distinct` function. The `distinct` function takes a tibble as its first argument, followed by the columns that define what counts as a duplicate. If no columns are given, it uses all of them. You can see the pattern here, right?
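
To see how the choice of columns changes the result, here is a small sketch on a made-up tibble (the `plays` data is invented for illustration).

```r
library(dplyr)

# Hypothetical toy data: track "x" appears in two playlists
plays <- tibble(
  track      = c("x", "x", "y"),
  playlist   = c("p1", "p2", "p1"),
  popularity = c(90, 90, 80)
)

all_cols   <- plays %>% distinct()                         # every row differs in `playlist`
two_cols   <- plays %>% distinct(track, popularity)        # duplicates of "x" collapse
keep_first <- plays %>% distinct(track, .keep_all = TRUE)  # first row per track, all columns kept

nrow(all_cols)
nrow(two_cols)
```

This is why the slides `select` the three track columns first: once the playlist columns are gone, the repeated tracks become exact duplicates that `distinct()` can remove.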
---
## Task No. 1 is completed!
Finally by using the `distinct` function which returns only the unique rows in the dataframe we were able to get the top 5 track names and artists.
```{r,echo=TRUE,eval=FALSE}
spotify %>%
select(track_name,track_artist,track_popularity) %>%
arrange(-track_popularity) %>%
distinct() %>% #<<
head(5) %>%
knitr::kable()
```
```{r,echo=FALSE,eval=TRUE}
spotify %>%
select(track_name,track_artist,track_popularity) %>%
arrange(-track_popularity) %>%
distinct() %>%
head(5) %>%
knitr::kable()
```
Let's send this out to your supervisor now!
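
As a side note, recent versions of dplyr (1.0.0 and later) also offer `slice_max()`, which can replace the `arrange` plus `head` pair. This is only an alternative sketch on made-up data, not the approach used in the rest of these slides.

```r
library(dplyr)

# Hypothetical toy data with one duplicated track
plays <- tibble(
  track_name       = c("x", "x", "y", "z"),
  track_artist     = c("A", "A", "B", "C"),
  track_popularity = c(90, 90, 80, 95)
)

top2 <- plays %>%
  distinct() %>%
  # keep the 2 rows with the highest popularity; with_ties = FALSE caps the result at n rows
  slice_max(track_popularity, n = 2, with_ties = FALSE)
```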
---
## Task No. 1: final changes
.pull-left[Ding! Ding!
There's a new message from your supervisor on Slack.
She wants this to be more presentable on the user interface, with the columns renamed and the popularity column removed.
<img src="https://i.imgflip.com/2lhzmp.jpg" style="height:80%;width:80%;"/>
]
.pull-right[
Well, we already know how to drop columns that we don't need, but what about renaming columns?
]
--
<div class="pull-right" style="clear:none">
<p>Based on your gut feeling you search for "rename tidyverse function" and land on the rename function to find out the following.</p>
<p>The <code class="remark-inline-code">rename</code> function takes in pairs of arguments in the form of </p>
<p><em>new_name = old_name</em> </p>
<p>so we can rename the columns that we want with that function. </p>
<p>These pairs come after the first argument, which is the tibble whose columns we want to rename. </p>
</div>
---
Using the `rename` function we were able to rename the columns as well.
```{r,echo=TRUE,eval=FALSE}
spotify %>%
select(track_name,track_artist,track_popularity) %>%
arrange(-track_popularity) %>%
distinct() %>%
head(5) %>%
select(-track_popularity) %>%
rename(Song = track_name, Artist = track_artist) #<<
```
```{r,echo=FALSE,eval=TRUE}
spotify %>%
select(track_name,track_artist,track_popularity) %>%
arrange(-track_popularity) %>%
distinct() %>%
head(5) %>%
select(-track_popularity) %>%
rename(Song = track_name, Artist = track_artist) %>%
knitr::kable()
```
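
A shorter variant, if you prefer: `select` accepts the same `new_name = old_name` pairs, so it can pick and rename columns in one step. A minimal sketch on made-up data (the `songs` tibble is invented for illustration):

```r
library(dplyr)

# Hypothetical toy data
songs <- tibble(
  track_name       = c("x", "y"),
  track_artist     = c("A", "B"),
  track_popularity = c(90, 80)
)

# Keeps only the two listed columns and renames them at the same time
renamed <- songs %>% select(Song = track_name, Artist = track_artist)

names(renamed)
```

Keeping `rename` as a separate step, as in the slides, makes each step easier to read. Which one you use is a matter of taste.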
---
## Recap of Task No. 1
What did we learn from trying out Task No. 1?
* `select` function helps us select the columns that we want
* `arrange` function helps us arrange the table by columns
* `rename` function helps us rename columns
* `distinct` function helps us select only the unique rows in the table
Congratulations! We finished one task at our internship at Spotify!
.center[<img src="https://media.giphy.com/media/kBZBlLVlfECvOQAVno/giphy.gif" />]
---
class: center,middle, title-only-slide
<h1>Task No. 2 <br/> (Missing the Good old days)</h1>
---
# Task No. 2 (Missing the Good old days)
.pull-left[
* Your supervisor just pointed out a funny thing in the table that you sent her. It contains the most popular songs of all time!
* She is all right with it since we worked really hard on it, but she also wants a table of the top 5 most popular songs from the past year.
* The snapshot she gave us is from January 2020, but according to her our code will keep running in real time, even after 2021.
]
.pull-right[
* First we need to make a new column indicating the year that the album was released in.
```{r,comment=NA}
spotify %>%
select(track_album_release_date) %>%
head(2)
```
]
--
<div class="pull-right" style="clear:none"> Uh-oh, these are stored as character strings and you don't know how to get the year out of a string like this. Let's ask our supervisor how to solve this. </div>
---
## lubridate can fix your string of horrible dates?
Ding! Ding!
Your supervisor replied, telling you to use the `as.Date` function to convert the column into dates and then the `year` function from the `lubridate` package.
--
<h2>What is this <span class="remark-inline-code" style="font-size:32pt">lubridate</span>?</h2>
lubridate is a package that helps in analysing and extracting useful information from dates, like the year, month, day or even difference between dates in different units!
.center[<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/lubridate_ymd.png" style="width:50%"/>]
.footnote[Image credit: stats illustrations by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)]
---
You go through the documentation, search online and land on a StackOverflow page, where you find that this code gives you the years that you need.
--
```{r,eval=FALSE}
as.Date(spotify$track_album_release_date) %>% lubridate::year() #<<
```
--
Before you send this out to your supervisor you want to make sure that the code you wrote is correct, so you check each step against the documentation to get a rough idea of what happens.
1. You create a vector of Dates from a vector of characters
2. You get the year as a number from the `year` function in `lubridate`
Sounds good!
.center[<img src="https://media.tenor.com/images/356d20c65288742f2ba36e4514f156b3/tenor.gif" width="40%"/>]
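
One thing worth checking before you rely on this: what happens to strings that are not full dates. The sketch below uses three made-up strings and assumes the `%Y-%m-%d` format. Whether the real column contains partial dates like these is an assumption you should verify on the data yourself.

```r
# Hypothetical release dates: one full date, one bare year, one year-month
raw_dates <- c("2019-06-14", "2012", "2001-11")

# Strings that do not match the format become NA instead of raising an error
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")
years  <- lubridate::year(parsed)

# A possible fallback if only the year matters: take the first four characters
years_from_text <- as.integer(substr(raw_dates, 1, 4))
```

Missing years like these are the reason to pass `na.rm = TRUE` when you later take a maximum over the year column.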
---
## How do we modify the existing columns?
.pull-left[We have a way to get an entire vector of data based on an existing column in the dataset.
Now how do we add a new column to a dataset? We could simply do this, `dataset$new_column = vector_of_data`.
But then you remember how your supervisor tends to ask for changes out of the blue. So instead of breaking the pipe in the middle, you look for an alternative that fits into a `%>%` pipeline.
]
--
.pull-right[
We need to modify the dataset that comes out of the last step in the pipeline. We can use the `mutate` function for this. Its arguments take the form `new_column_name = modification`. The modification should return a vector with one value per row of the dataset (a single value also works and is repeated for every row).
.center[<img src="images/dplyr_mutate.png" style="width:95%"/>]
]
.footnote[Image credit: stats illustrations by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)]
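
To compare the two approaches side by side, here is a minimal sketch on a made-up tibble (the `songs` data and the `duration_min` column are invented for illustration).

```r
library(dplyr)

# Hypothetical toy data
songs <- tibble(
  track       = c("a", "b"),
  duration_ms = c(180000, 240000)
)

# Base R: assign with `$`, which needs a separate statement outside the pipeline
songs_base <- songs
songs_base$duration_min <- songs_base$duration_ms / 60000

# dplyr: mutate() does the same inside the pipeline and refers to the column by name
songs_piped <- songs %>%
  mutate(duration_min = duration_ms / 60000)
```

Both give the same column. The `mutate` version simply slots in between other pipeline steps.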
---
## Let's mutate the dataset!
We will first mutate the original dataset to check whether this works. Remember that you only have to give the name of the column you want to operate on. There is no need to prefix it with the dataset name and a `$`.
--
```{r,comment=NA}
spotify %>%
mutate(release_year = lubridate::year(as.Date(track_album_release_date))) %>% #<<
select(release_year) %>%
head(2)
```
Success! We managed to get the years out. Now we need to find the latest of these years.
--
```{r,eval=FALSE}
spotify_with_year <- spotify %>%
mutate(release_year = lubridate::year(as.Date(track_album_release_date)))
latest_year <- max(spotify_with_year$release_year,na.rm = TRUE)
```
```{r,eval=TRUE,echo=FALSE}
spotify_with_year <- spotify %>%
mutate(release_year = lubridate::year(as.Date(track_album_release_date)))
latest_year <- max(spotify_with_year$release_year,na.rm = TRUE)
print(latest_year)
```
---
## How can we filter the songs that we want?
The latest year is with us now! Since the snapshot is from January 2020, the past year is the one just before the latest year. All we need to do now is filter and keep only the tracks that were released in that year.
--
The `filter` function can be used to get the portion of a dataset that ticks the boxes of one or more logical conditions. Basically any condition that returns a logical vector with one value per row of the dataset can be used to keep the rows that you need.
It's kind of like an actual filter: rows with `TRUE` pass through and rows with `FALSE` are held back.
.center[<img src="images/dplyr_filter_sm.png" style="width:50%"/>]
.footnote[Image credit: stats illustrations by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)]
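
To make the "logical vector" idea concrete, here is a small sketch on made-up data (the `songs` tibble is invented for illustration). It also shows conditions combined with `&` (and), `|` (or) and `!` (not).

```r
library(dplyr)

# Hypothetical toy data
songs <- tibble(
  track        = c("a", "b", "c", "d"),
  release_year = c(2018, 2019, 2019, 2020),
  popularity   = c(50, 90, 30, 70)
)

# The condition is just a logical vector with one value per row
keep <- songs$release_year == 2019

only_2019   <- songs %>% filter(release_year == 2019)
popular_19  <- songs %>% filter(release_year == 2019 & popularity > 40)
either      <- songs %>% filter(release_year == 2019 | popularity > 60)
not_in_2019 <- songs %>% filter(!(release_year == 2019))
```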
---
Let's use the `filter` function to keep the rows that we need. We have to pass a logical condition as an argument to `filter`, and you can combine different conditions using logical operators like `&`, `|` and `!`.
--
```{r,comment=NA}
spotify_with_year %>%
filter(release_year == (latest_year-1)) %>% #<<
select(track_name,release_year) %>%
head(5)
```
Great! Now we need to plug these into our previous code and we are good to go!
---
## Task 2 is done and dusted
```{r,echo=TRUE,eval=FALSE}
spotify_with_year %>%
filter(release_year == (latest_year - 1)) %>%
select(track_name,track_artist,track_popularity) %>%
arrange(-track_popularity) %>%
distinct() %>%
head(5) %>%
select(-track_popularity) %>%
rename(Song = track_name, Artist = track_artist)
```
```{r,echo=FALSE,eval=TRUE}
spotify_with_year %>%
filter(release_year == (latest_year - 1)) %>%
select(track_name,track_artist,track_popularity) %>%
arrange(-track_popularity) %>%
distinct() %>%
head(5) %>%
select(-track_popularity) %>%
rename(Song = track_name, Artist = track_artist) %>%
knitr::kable()
```
---
## Recap of Task No. 2
What did we learn from trying out Task No. 2?
* `as.Date` function converts character columns to Date columns.
* `year` function from `lubridate` package gets the year from a Date column.
* `mutate` function adds new columns or modifies existing ones.
* `filter` function helps to filter the rows based on a condition.
That was a huge leap from what we learned in Task No. 1!
.center[<img src="https://en.meming.world/images/en/thumb/a/aa/Something_of_a_Scientist.jpg/600px-Something_of_a_Scientist.jpg" style="width:60%"/>]
---
class: center,middle, title-only-slide
<h1>Task No. 3 <br/> (Playlists for your moods)</h1>
---
# Task No. 3 (Playlists for your moods)
The supervisor has one final task for you.
You need to give the average scores for energy, danceability, instrumentalness and valence for each playlist genre so that the supervisor can use that dataset in her Plotly radar plot function.
This is going to be a tough one! You sit down and sketch out what the final dataset should look like, and this is what you got:
```{r,echo=FALSE}
tibble(
playlist_genre = c("some name 1","some name 2"),
avg_energy = round(runif(2),1),
avg_dance = round(runif(2),1),
avg_instu = round(runif(2),1),
avg_valen = round(runif(2),1),
) %>%
knitr::kable()
```
So now we know what sort of outcome we expect at the end. The question that remains is how to group these playlists together.
---
## Understanding group_by
We can use the `group_by` function in dplyr for this. But the `group_by` function can be tricky sometimes so you dig a bit deeper.
* `group_by` function literally groups rows with the same combination of values in the given columns together
* After that, any `mutate` call is applied within each group separately.
* But remember to ungroup the dataset afterwards. Otherwise it stays in the grouped state and every action that you perform on it will be performed group-wise.
.center[<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/group_by_ungroup.png" style="width:50%" />]
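
A small sketch of what "stays in the grouped state" means in practice, on made-up data (the `songs` tibble is invented for illustration). `n()` counts the rows in the current group, so it makes the difference easy to see.

```r
library(dplyr)

# Hypothetical toy data
songs <- tibble(
  genre  = c("pop", "pop", "rock"),
  energy = c(0.5, 0.75, 0.25)
)

grouped <- songs %>%
  group_by(genre) %>%
  mutate(mean_energy = mean(energy))  # computed within each genre

# Still grouped: n() counts rows per genre
still_grouped <- grouped %>% mutate(n_rows = n())

# After ungroup(): n() counts rows of the whole table
ungrouped <- grouped %>% ungroup() %>% mutate(n_rows = n())
```

Here `mean_energy` is 0.625 for both pop rows and 0.25 for the rock row, while `n_rows` is 2, 2, 1 in the grouped version and 3, 3, 3 after `ungroup()`.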
---
## Testing out the waters
```{r,echo=FALSE,eval=TRUE}
mean_pop_energy <- spotify[spotify$playlist_genre == "pop",]$energy %>% mean()
```
Let's try seeing how this group_by works by trying to get the average energy of a genre like pop. You type out the values by hand and get the mean value as `r mean_pop_energy`.
--
```{r}
spotify %>%
group_by(playlist_genre) %>%
mutate(mean_energy = mean(energy)) %>%
select(playlist_genre,mean_energy) %>%
head(3)
```
--
So it seems we have to call `distinct` for this as well. The tibble output also shows that the dataset is now grouped by `playlist_genre`. We should add an `ungroup` at the end to remove that effect, otherwise we might forget about it and run into issues later.
---
With that in your head you first write out the `group_by` command and decide to be adventurous and `mutate` a new column for each of the averages.
--
```{r,comment=NA,echo=TRUE,eval=FALSE}
spotify %>%
group_by(playlist_genre) %>% #<<
mutate(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE),
avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>%
rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance,
Instrumentals = avg_instu, Upbeat = avg_valen) %>%
select(Genre,Energetic,Dance,Instrumentals, Upbeat) %>%
distinct() %>%
ungroup() %>%
head(4)
```
```{r,comment=NA,echo=FALSE,eval=TRUE}
spotify %>%
group_by(playlist_genre) %>% #<<
mutate(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE),
avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>%
rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance,
Instrumentals = avg_instu, Upbeat = avg_valen) %>%
select(Genre,Energetic,Dance,Instrumentals, Upbeat) %>%
distinct() %>%
ungroup() %>%
head(4) %>% knitr::kable()
```
--
Do we really have to call distinct again like this? Or is there an easier way than having to `group_by` -> `mutate` -> `select` -> `distinct`?
---
## Task No. 3 is over!
Your supervisor noticed what you were typing on your laptop and suggested looking up the `summarize` function to make the code more readable and safer.
After digging into the `summarize` function, you realize that simply replacing `mutate` with `summarize` makes your code look like this. It also automatically drops the last grouping level after computing the summary statistics, so with a single grouping column the result comes back ungrouped.
--
```{r,echo=TRUE,eval=FALSE,warning=FALSE,message=FALSE}
spotify %>%
group_by(playlist_genre) %>%
summarize(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE), #<<
avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>% #<<
rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance,
Instrumentals = avg_instu, Upbeat = avg_valen) %>%
head(5)
```
```{r,echo=FALSE,warning=FALSE,message=FALSE}
spotify %>%
group_by(playlist_genre) %>%
summarize(avg_energy = mean(energy,na.rm=TRUE),avg_dance = mean(danceability,na.rm=TRUE),
avg_instu = mean(instrumentalness,na.rm=TRUE),avg_valen = mean(valence,na.rm=TRUE)) %>%
rename(Genre = playlist_genre,Energetic = avg_energy, Dance = avg_dance,
Instrumentals = avg_instu, Upbeat = avg_valen) %>%
head(5) %>%
knitr::kable()
```
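
To check what `summarize` does to the grouping, here is a minimal sketch on made-up data (the `songs` tibble is invented for illustration). `group_vars()` lists the columns a tibble is still grouped by.

```r
library(dplyr)

# Hypothetical toy data
songs <- tibble(
  genre    = c("pop", "pop", "rock", "rock"),
  subgenre = c("dance pop", "electropop", "hard rock", "hard rock"),
  energy   = c(0.5, 0.75, 1, 0.5)
)

# One grouping column: the result is no longer grouped
one_level <- songs %>%
  group_by(genre) %>%
  summarize(avg_energy = mean(energy))

# Two grouping columns: only the last one (subgenre) is dropped
two_levels <- songs %>%
  group_by(genre, subgenre) %>%
  summarize(avg_energy = mean(energy), .groups = "drop_last")

group_vars(one_level)
group_vars(two_levels)
```

Setting `.groups` explicitly states the intent and avoids the informational message that `summarize` prints when there is more than one grouping column.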
---
## Recap of Task No. 3
What did we learn from trying out Task No. 3?
* `group_by` helps to group the dataset into groups based on the values of a column
* `summarize` helps to make a summary of a grouped / ungrouped dataset to get summary statistics
Finally! We finished all three tasks! We are now well on our way to start analysing data that come our way.
.center[<img src="https://i.kym-cdn.com/entries/icons/original/000/028/021/work.jpg" style="width:70%"/>]
---
.pull-left[## What did we learn today?
* Data wrangling is kind of like meal prep, where you prepare your ingredients (the raw data) for the final dish (the model or visualization).
* Functions such as `filter`, `mutate`, `select`,
`arrange`, `group_by`, `summarize` and
`distinct` from the `dplyr` package, along with date functions from the `lubridate` package, help to wrangle data into the format we need.]
--
.pull-right[## Where to go from here?
* Get a Spotify developer account and access the APIs to create a dataset that is up to date and matches your style.
* Create an R Shiny dashboard that shows interactive graphs of all the artists and where they fall on different indexes, such as how positive their songs are.
* Analyse [#TidyTuesday](https://github.com/rfordatascience/tidytuesday/) datasets that come out weekly and follow the [R for Data Science Community](https://twitter.com/R4DScommunity) on Twitter for more inspiration.]
---
class: center, middle
<br/>
# Thank you!
<br/>
<h2 style="color:#5b85aa">Where can you find me?</h2>
Email: [[email protected]](#)
Twitter: [@janithcwanni](https://twitter.com/janithcwanni)
Github: [@janithwanni](https://github.com/janithwanni)
Linkedin: [Janith Wanniarachchi](https://www.linkedin.com/in/janith-wanniarachchi-462851117/)