Helped line formatting for readability + Added a section about the importance of this research. #1

Open: wants to merge 1 commit into base: master
135 changes: 113 additions & 22 deletions parkinson.Rmd
@@ -16,21 +16,87 @@ knitr::opts_chunk$set(echo = TRUE)

### Description and literature:

Parkinson's disease (PD) is a neurodegenerative disorder of the central nervous
system that causes partial or full loss of motor reflexes, speech, behavior,
mental processing, and other vital functions [1]. It is generally observed
in elderly people and causes disorders in speech and motor abilities
(writing, balance, etc.) in 90% of patients [2]. After Alzheimer's disease,
PD is the second most common neurological health problem among the elderly,
and it is estimated that nearly 10 million people worldwide and approximately
100,000 people in Turkey suffer from this disease [3], [4]. In particular,
PD is seen in roughly one out of every hundred people aged over 65. Currently,
there is no known cure for the disease [5], [6]. Although a significant
number of drug therapies exist to ease the difficulties caused by the
disorder, PD is usually diagnosed and treated using invasive methods [7].
This complicates the process of diagnosis and treatment for patients who are
suffering from the disease. Our main motivation for working with this dataset
is to find, by means of multivariate analysis, the variables that are
significant in identifying patients suffering from Parkinson's disease.

In this study, we analyze data from patients who have been diagnosed with the
disease. Using speech data from subjects is expected to help the development
of a noninvasive diagnostic. There are important examples of this kind of
Alzheimer's and PD study all around the world [8]. Studies of PD focus on
symptoms like slowness in movement, poor balance, trembling, or stiffness of
some body parts, but especially on voice problems. The main reason behind the
popularity of PD diagnosis from speech impairments is that tele-diagnosis and
tele-monitoring systems based on speech signals are low in cost and easy for
patients to use themselves [6], [8]. Such systems reduce the inconvenience
and cost of physical visits to the medical clinic for PD patients, enable
early diagnosis of the disease, and lessen the workload of medical
personnel [7], [8]. People with Parkinsonism (PWP) suffer from speech
impairments like dysphonia (defective use of the voice), hypophonia
(reduced volume), monotone (reduced pitch range), and dysarthria
(difficulty with articulation of sounds or syllables). Even though there
are many studies aimed at diagnosing and monitoring PD using these
impairments, these studies have their origin in the diagnosis of basic voice
disorders [8]. Therefore, our analysis in this project is based on the voice
parameters of affected patients. The following section gives a short
description of how the dataset was formed and how we plan to approach the
problem.

### The Importance of this Research

The most unfortunate aspects are the lack of progress and the financial impact
the disease has had on families, economies, and nations alike. For example,
the average American family is estimated to pay $10,000 annually simply to
cover basic Parkinson's treatment on a regular basis [13]. But that is a drop
in the bucket compared to the ~$52 billion the United States alone has spent
to cover Parkinson's treatment nationwide [14]. And with hundreds of millions
donated to Parkinson's research and development, why does it seem that only
private companies like Nike are making a difference? Where researchers seem
only to help improve existing treatment, Nike has produced actual products
that make tying shoes easier for Parkinson's patients whose motor disabilities
are most prominent in their hands. Hence, it seems to be up to private
companies and the public to push for greater awareness by creating our own
technology or performing our own research.

### Data:

The dataset was created by Athanasios Tsanas and Max Little of the
University of Oxford, in collaboration with 10 medical centers in the US
and Intel Corporation, which developed the tele-monitoring device used to
record the speech signals. The original study [9] used a range of linear and
nonlinear regression methods to predict the clinician's Parkinson's disease
symptom score on the UPDRS scale.

This dataset is composed of a range of biomedical voice measurements from
42 people with early-stage Parkinson's disease recruited to a six-month
trial of a tele-monitoring device for remote symptom progression monitoring.
The recordings were automatically captured in the patients' homes.

Columns in the dataset contain subject number, subject age, subject gender,
time interval from baseline recruitment date, motor UPDRS, total UPDRS, and
16 biomedical voice measures. Each row corresponds to one of 5,875 voice
recordings from these individuals.

The main aim of the data is to predict the motor and total UPDRS scores
('motor_UPDRS' and 'total_UPDRS') from the 16 voice measures.

The data is in ASCII CSV format. Each row of the CSV file contains an
instance corresponding to one voice recording. There are around 200
recordings per patient; the subject number of the patient is given in
the first column [10], [11].

### Attribute Information:

@@ -40,15 +106,18 @@ Age: Subject age

Sex: Subject gender '0' - male, '1' - female

Test_time: Time since recruitment into the trial.
The integer part is the number of days since recruitment

Motor_UPDRS: Clinician's motor UPDRS score, linearly interpolated

Total_UPDRS: Clinician's total UPDRS score, linearly interpolated

Jitter (%), Jitter(Abs), Jitter. RAP, Jitter. PPQ5, Jitter. DDP:
Several measures of variation in fundamental frequency (Frequency parameters)

Shimmer, Shimmer (dB), Shimmer. APQ3, Shimmer. APQ5, Shimmer. APQ11,
Shimmer. DDA: Several measures of variation in amplitude (Amplitude parameters)

NHR, HNR: Two measures of ratio of noise to tonal components in the voice

@@ -62,8 +131,10 @@ PPE: A nonlinear measure of fundamental frequency variation

```{r}
#Read the data file
# The URL is assembled with paste0() so that wrapping the source line
# does not put a line break inside the string.
parkinsons <- read.csv(
  paste0("http://archive.ics.uci.edu/ml/machine-learning-databases/",
         "parkinsons/telemonitoring/parkinsons_updrs.data"),
  header = TRUE)
#parkinsons <- read.csv("C:\\Users\\Praveen\\Documents\\MSDS\\ISQS6350-multivariateanalysis\\project\\parkinsons_updrs.csv", header = TRUE, stringsAsFactors = TRUE)
str(parkinsons)
```

@@ -85,11 +156,14 @@ plot(qchisq((1:nrow(x) - 1/2) / nrow(x), df = ncol(x)),
abline(a = 0, b = 1)
```

From the above figure, we can see that our multivariate data is not
perfectly normally distributed. There may be some outliers in our data set.
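
The chi-square Q-Q check behind the figure can be sketched as follows. This is a minimal illustration on synthetic data, not the project's own chunk (most of which is elided in this diff); the matrix `x` here is a made-up stand-in for the numeric columns of `parkinsons`.

```r
# Synthetic stand-in for the numeric voice measures (assumption: not the real data).
set.seed(1)
x <- matrix(rnorm(200 * 3), ncol = 3)

# Squared Mahalanobis distance of each row from the sample mean.
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Theoretical chi-square quantiles, with df equal to the number of variables.
n <- nrow(x)
theoretical <- qchisq((1:n - 1/2) / n, df = ncol(x))
observed <- sort(d2)

# Under multivariate normality the points fall close to the line y = x.
# To draw the plot: plot(theoretical, observed); abline(a = 0, b = 1)
qq_cor <- cor(theoretical, observed)
```

Points that rise well above the reference line at the upper end correspond to observations with unusually large distances, which is why the plot is also a first hint of outliers.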

# Data Cleaning and Outlier Removal:

Our first step is to go through the dataset and identify any missing values
or outliers so that we can take the necessary measures. This step is
essential to prepare the data for fruitful analysis.


#### Check null values
@@ -103,7 +177,9 @@ There are no missing values in our dataset.
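
The chunk that performs this check is elided in the diff. A null-value check of this kind might look like the sketch below; the small data frame `df` is hypothetical and deliberately contains one missing value so that the output is not trivial.

```r
# Hypothetical toy data frame with one missing value (not the real dataset).
df <- data.frame(age = c(72, 58, NA), PPE = c(0.16, 0.11, 0.21))

# Number of missing values in each column.
na_per_column <- colSums(is.na(df))

# TRUE if there is at least one missing value anywhere in the data frame.
any_missing <- anyNA(df)
```

Applied to `parkinsons`, a result of all zeros from `colSums(is.na(...))` would correspond to the statement that the dataset has no missing values.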
#### Check correlations between the variables
```{r}
library(corrplot)
corrplot(cor(parkinsons), type = "full", method = "color",
         title = "Parkinson correlation plot", mar = c(0, 0, 1, 0),
         tl.cex = 0.8, outline = TRUE, tl.col = "indianred4")
```
We can see that all the Jitter variables are highly correlated with the Shimmer variables.
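
To read such pairs off numerically rather than from the colors of the plot, the correlation matrix can be filtered directly. The sketch below uses synthetic columns named after the real measures, and the 0.8 cutoff is an arbitrary assumption, not a value taken from the report.

```r
# Synthetic data: Jitter and Shimmer share a common signal, DFA is independent.
set.seed(2)
base <- rnorm(100)
toy <- data.frame(
  Jitter  = base + rnorm(100, sd = 0.1),
  Shimmer = base + rnorm(100, sd = 0.1),
  DFA     = rnorm(100)
)

cm <- cor(toy)

# Keep each pair once (upper triangle) and only if |r| exceeds the cutoff.
idx <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
```

A table like `high_pairs` makes it easier to decide which groups of variables are candidates for dimensionality reduction later on.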

@@ -112,16 +188,23 @@ We can see that all the jitter variables highly correlate with Shimmer variables
summary(parkinsons[,-3])
```

The total_UPDRS (Unified Parkinson's Disease Rating Scale) score is the main
variable of interest; it captures the clinical impression of Parkinson's
disease (PD) severity. Thus, we plot total_UPDRS scores against the other
variables in our data set to look for outliers.

```{r}
#Scatter plots to look into the data distribution
plot(jitter(total_UPDRS)~., parkinsons)
```

In the scatter plot of total_UPDRS against Jitter, we can see some outlying
observations in our data. Similarly, the plots of total_UPDRS against Shimmer,
NHR, RPDE, DFA, and PPE each show some outlying observations.

We now use bivariate boxplots to look for outlying observations in our data.
```{r}
library(MVA)
#boxplots
@@ -137,7 +220,9 @@ bvbox(parkinsons[,c(6,21)], xlab = "total_UPDRS", ylab = "DFA")

bvbox(parkinsons[,c(6,22)], xlab = "total_UPDRS", ylab = "PPE")
```
The bivariate boxplots flag a lot of our observations as outliers. Since we
do not want to change the distribution of our data by removing the outliers,
we check these results with the convex hull method.
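
The project's convex hull chunk is mostly elided in this diff, so the following is only a sketch of the general idea under stated assumptions: synthetic data, a single pair of variables, and one round of trimming with base R's `chull()`.

```r
# Synthetic pair of variables named after two real columns (not the real data).
set.seed(3)
xy <- cbind(total_UPDRS = rnorm(50, mean = 30, sd = 10),
            PPE         = rnorm(50, mean = 0.2, sd = 0.05))

# Row indices of the points that lie on the convex hull of the scatter.
hull <- chull(xy)

# Drop the hull points and compare the correlation before and after trimming.
trimmed   <- xy[-hull, ]
r_all     <- cor(xy[, 1], xy[, 2])
r_trimmed <- cor(trimmed[, 1], trimmed[, 2])
```

If `r_all` and `r_trimmed` are close, the extreme points have little influence on the relationship, which supports keeping them rather than removing everything the bivariate boxplot flags.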

```{r}
#Convex hull method
@@ -179,11 +264,15 @@ dim(parkinsons)

# Dimensionality Reduction:

Our next step is dimensionality reduction. The dataset is large, with
22 variables, and some of the variables are highly correlated with each other.
We therefore expect that reducing the number of dimensions will make the
data easier to interpret.

#### Multi-dimensional scaling

First we try multidimensional scaling, which can help us visualize the
variable relationships in 2D graphs.
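
As a rough illustration of the technique, classical multidimensional scaling can be run with base R's `cmdscale()`. The sketch below uses a synthetic matrix and Euclidean distances on standardized columns; both are assumptions, and the project's own chunk (elided in this diff) may use different inputs.

```r
# Synthetic matrix: 30 observations of 5 variables (not the real data).
set.seed(4)
toy <- matrix(rnorm(30 * 5), ncol = 5)

# Euclidean distances between observations after standardizing each column.
d <- dist(scale(toy))

# Classical MDS into two dimensions, keeping the eigenvalues.
mds <- cmdscale(d, k = 2, eig = TRUE)
coords <- mds$points

# Share of the total (absolute) eigenvalue mass captured by the two dimensions.
fit <- sum(mds$eig[1:2]) / sum(abs(mds$eig))
```

The two columns of `coords` can be plotted against each other to get the 2D picture, and `fit` gives a rough sense of how much structure that picture preserves.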

```{r}
#Multi dimensional scaling
@@ -567,4 +656,6 @@ http://www. parkinsondernegi.org/Icerik.aspx?Page=parkinsonnedir&ID=5

[12] Max A. Little, Patrick E. McSharry, Eric J. Hunter, Lorraine O. Ramig (2009), 'Suitability of dysphonia measurements for telemonitoring of Parkinson's disease', IEEE Transactions on Biomedical Engineering, 56(4):1015-1022

[13] Cobb, D. (2021), Financial options for Parkinson's Care & Assistive Needs. Payment Options & Financial Assistance for Senior Care. Online link: https://www.payingforseniorcare.com/parkinsons/financial-assistance

[14] Yang, W., Hamilton, J. L., Kopil, C., Beck, J. C., Tanner, C. M., Albin, R. L., Ray Dorsey, E., Dahodwala, N., Cintina, I., Hogan, P., & Thompson, T. (2020). Current and projected future economic burden of Parkinson's disease in the U.S. npj Parkinson's Disease, 6, 15. Online link: https://www.nature.com/articles/s41531-020-0117-1