-
Notifications
You must be signed in to change notification settings - Fork 0
/
stock_analysis.Rmd
300 lines (232 loc) · 15.3 KB
/
stock_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
---
title: "STAT 131 Final Project"
author: "Mina Lee, Isha Vaish, Tom Zhang"
date: "2023-04-20"
output:
html_document:
toc: yes
toc_float: yes
toc_collapsed: yes
toc_depth: 3
number_sections: no
theme: united
pdf_document:
toc: yes
toc_depth: '3'
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
require(astsa)
require(knitr)
library(ggplot2)
library(zoo)
library(TSA)
library(tidyverse)
library(MuMIn)
library(tseries)
library(rugarch)
```
# Part 1: Dataset Exploration
The goal of our project is to predict Korean Stock Market Volatility (VKOSPI) which is equivalent to the VIX in the United States. We will do this using the Options in Korean Stock Market Kaggle Dataset that can be found <a href=https://www.kaggle.com/datasets/ninetyninenewton/vkospi>here</a>.
We begin by loading in the data set...
```{r}
# import the data
ds <- read.csv('options_KR.csv')
head(ds)
```
Next, let's take a look at the KOSPI 200 and VKSOPI trace plots...
```{r}
time <- as.Date(ds$Date)
kospi <- ts(ds$KOSPI200) #daily returns
vkospi <- ts(ds$VKOSPI) #volatility
plot(time, kospi, type='l', col='red', main="KOSPI Time Series")
plot(time, vkospi, type='l',col='blue', main = "VKOSPI Time Series")
```
From the above trace plots, we notice a trend in both the KOSPI (increasing trend) and VKOSPI (decreasing trend). Both timeseries also don't seem to be stationary. First, let's take the log of KOSPI and try de-trending it.
```{r}
# pre-processing KOSPI
log_kospi = log(kospi) #take log
detrended_kospi = residuals(lm(log_kospi ~ seq_along(log_kospi))) #de-trend
plot(time,detrended_kospi, type= 'l', col = 'Red', main = "De-trended log KOSPI", ylab = "residuals(log(KOSPI))")
```
While we seem to have removed the linearly increasing trend, there appears to be apparent seasonality. Let's try differencing with lag 1.
```{r}
plot(diff(detrended_kospi), type='l', col='red',main='Differenced Residual(log(KOSPI))', ylab = "Differenced Residual(log(KOSPI))")
```
This looks so much better! We seem to have removed the linear trend and seasonality! The series looks a lot more stationary now.
Let's check if we calculated the returns correctly (i.e. pre-processed KOSPI correctly). The below formula was found [INSERT SOURCE HERE] and is used to calculate the returns given the Korean blue chip index (KOSPI).
```{r}
# Double checking our pre-processing math and seeing if it matches the formula!
KOSPI200_yesterday <- lag(ds$KOSPI200, n = 1)
return_array = (kospi-KOSPI200_yesterday) / KOSPI200_yesterday
plot(return_array, type='l', main='Return of KOSPI200')
```
It matches! We will know refer to our differenced(residual(log(KOSPI))) series as our Return. Let's check the mean of our returns...
```{r}
kospi_returns = diff(detrended_kospi)
mean(kospi_returns)
```
Great, the mean of our returns doesn't seem to be significantly far from 0! Let's check the ACF and PACF plots and see if the returns are somewhat uncorrelated over time...
```{r}
acf(kospi_returns)
pacf(kospi_returns)
```
There doesn't appear to be any obvious pattern in the ACF or PACF plots. Moreover, the majority of the correlations are not significant. We can conclude that our returns are indeed somewhat uncorrelated over time! Now, let's see if we can detect some ARCH effects...
```{r}
McLeod.Li.test(y=kospi_returns, main = "McLeod Li for KOSPI Returns")
```
All of our p-values are significant! There are definitely ARCH effect present. Let's move on to some modeling!
# Part 2: Modeling
## Model 1 - GARCH
First, let's try fitting a GARCH model. Below we plot the ACF, PACF of the squared returns and the EACF to help us decide the order of our GARCH model...
```{r}
acf(kospi_returns^2, main = "ACF of Squared Returns")
pacf(kospi_returns^2, main = "PACF of Squared Returns")
eacf(kospi_returns)
```
Based on the EACF plot, let's try a GARCH(1,1) model...
```{r}
model1<- garch(x=kospi_returns, order = c(1,1))
summary(model1)
```
The estimated model is $x_t = \sigma_t\epsilon_t$ where $\sigma_t^2 = 1.450e^{-6} + 4.535e^{-2} X_{t-1}^2+9.386e^{-1}\sigma_{t-1}^2$. As we can see from the summary output above, all of our model coefficients are significant at the .05 significance level (good). Moreover, our model passes the Box-Ljung test which shows that our squared residuals are not significantly correlated (good). However, our model fails that Jarque Bera test meaning our residuals are not normally distributed (bad). We plot the QQPLot of the residuals below...
```{r}
qqnorm(model1$residuals)
qqline(model1$residuals)
```
As we can see from the above QQPlot, are residuals are definitely not normal. This means we should try another model that can better fit the data and try to capture some of the non-normality in the residuals. Before we move on, lets plot the estimated conditional variance (i.e. volatility) estimated by the model and compare to the VKOSPI values.
```{r}
plot(time[2:length(time)], (model1$fitted.values[,1])^2, type = 'l', xlab = "time", ylab = 'Estimated Conditional Variance', main = "GARCH(1,1) Estimated Volatility" ) #estimated volatility
```
```{r}
plot(time, vkospi, type = 'l', col = 'blue', ylab = 'VKOSPI')
```
We can see that there is some similarity in trend between the volatility estimated by our model and the VKOSPI. For example, there is a very noticeable peak near 2012 in the estimated volatility as there is in the VKOSPI series. Below we plot the CCF plot of the estimated values from our model against the VKOSPI series. As we can see from the CCF plot, the estimated volatilities from our model are highly correlated with the VKOSPI series. Moreover, the peak of the plot occurs near (but not exactly at) lag 0 showing that our model does pretty well in estimating a similar volatility trend as found in the VKOSPI series.
```{r}
estimated_values = (model1$fitted.values[,1])^2
ccf(estimated_values[2:length(estimated_values)], vkospi[2:length(estimated_values)], main = "CFF of Estimated Volatilities and VKOSPI")
```
## Model 2 - EGARCH
EGARCH (Exponential Generalized Autoregressive Conditional Heteroskedasticity) is an extension of the GARCH (Generalized Autoregressive Conditional Heteroskedasticity) model that allows for asymmetric volatility. The EGARCH model is specified in terms of the logarithm of the conditional variance, which allows for the conditional variance to be negative. The EGARCH model also includes an exponential term, which helps to capture the long-term effects of volatility shocks. The main difference between the EGARCH and GARCH models is that the EGARCH model allows for asymmetric volatility effects, while the GARCH model assumes that volatility responds symmetrically to positive and negative shocks.
Like we did in GARCH, we fitted an EGARCH(1, 1) model on the returns. The distribution of the standardized residuals is assumed to be normal.
```{r}
#Specify EGARCH models
egarch.spec <- ugarchspec(variance.model=list(model="eGARCH",garchOrder=c(1,1)),
mean.model=list(armaOrder=c(0,0)))
#fit EGARCH model
egarch_model11 <- ugarchfit(spec=egarch.spec, data = kospi_returns)
```
```{r}
# check the summary of egarch model
egarch_model11
```
Interpreting output:
The estimated model is $\epsilon_t=\sigma_tz_t$ where $z_t$ is standard Gaussian and $ln(\sigma_t^2) =-0.073 \frac{|-0.1699|+0.0796*(-0.1699)}{\sigma_{t-1}} + 0.98ln(\sigma_{t-1}^2)$. The alpha1 coefficient is statistically significant as its p-value is 0.00000, telling that past volatility has an impact on current volatility. Also, Weighted Ljung-Box Test on Standardized Residuals says no serial correlation meaning that the test found no significant evidence of autocorrelation in the standardized residuals up to a certain lag. In addition, the Adjusted Pearson Goodness-of-Fit Test returns a low number, meaning that the model does not fit the data well. Therefore, we will attempt on a new EGARCH model by alternating to t-distribution.
```{r}
#EGARCH with t distribution
#Specify EGARCH models:
spec = ugarchspec(variance.model=list(model="eGARCH",garchOrder=c(1,1)), mean.model=list(armaOrder=c(0,0)),distribution.model="std")
#fit EGARCH model
egarch_model11.t=ugarchfit(data=kospi_returns, spec=spec)
##summary of EGARCH fit
egarch_model11.t
```
For t distribution model, the estimated model is $\epsilon_t=\sigma_tz_t$ where $ln(\sigma_t^2) = 0.0002 + -0.087 \frac{|-0.172|+0.088*(-0.172)}{\sigma_{t-1}} + 0.98ln(\sigma_{t-1}^2)$. The alpha1 coefficient is also statistically significant as its p-value is 0.00000, telling that past volatility has an impact on current volatility. Weighted Ljung-Box Test on Standardized Residuals also says no serial correlation meaning that the test found no significant evidence of autocorrelation in the standardized residuals up to a certain lag. In addition, even though the Adjusted Pearson Goodness-of-Fit Test is higher than that of the normal distribution, the number is still low so we shoudl conclude that the model does not fit the data well.
For both of the EGARCH models, we may notice that leverage effect presents because alpha1 is smaller than 0.
```{r}
##QQ-plot for fitted GARCH Model with t distribution
options(repr.plot.width=20, repr.plot.height=8)
par(mfrow=c(1,2))
plot(egarch_model11, which=8)
plot(egarch_model11.t, which=8)
plot(egarch_model11, which=9)
plot(egarch_model11.t, which=9)
```
Comparing between two different distributions of EGARCH, the EGARCH-t distribution QQ-plot seems to be more straight. However, we will try to estimate with IGARCH to check if it returns better results.
## Model 3 - IGARCH
The IGARCH (Integrated Generalized Autoregressive Conditional Heteroskedasticity) model is a special case of the GARCH model, where the sum of the AR and MA coefficients is constrained to equal one (thereby exhibiting persistence). This restriction implies that the shocks to the volatility have a permanent effect, and that the volatility process is non-stationary.
As as result, IGARCH is specifically designed to capture non-stationary processes, with the persistence of the volatility shocks lasting indefinitely; therefore, it might provide better forecasts for non-stationary volatilities.
Another advantage to using IGARCH is that they have fewer parameters due to the constraint, which can lead to simpler models and potentially more stable parameter estimates.
Now, we fit an IGARCH(1, 1) model on the returns. The distribution of the standardized residuals is assumed to be normal.
```{r}
# model specifications
spec <- ugarchspec(
variance.model = list(model = "iGARCH", garchOrder = c(1, 1)),
mean.model = list(armaOrder = c(0, 0), include.mean = TRUE),
distribution.model = "norm"
)
# fit model
igarch_11 <- ugarchfit(spec, data = kospi_returns)
igarch_11
```
Interpreting output:
Notably, the alpha1 coefficient is statistically significant with a p-value of 0.00000, indicating that past volatility has a significant impact on current volatility.
The Ljung-Box test results for the standardized residuals and squared residuals indicate no significant serial correlation, suggesting that the model has adequately captured the autocorrelation structure in the returns and volatility.
The adjusted Pearson goodness-of-fit test assesses the fit of the model's assumed distribution to the data. The p-values are very small, indicating that the normal distribution may not be an appropriate assumption for the standardized residuals. This might suggest considering alternative distribution assumptions, such as the Student's t-distribution or the Generalized Error Distribution.
Next, we attempt fitting the same model but with a t-distribution assumption:
```{r}
# model specifications
spec <- ugarchspec(
variance.model = list(model = "iGARCH", garchOrder = c(1, 1)),
mean.model = list(armaOrder = c(0, 0), include.mean = TRUE),
distribution.model = "std"
)
# fit model
igarch_11.t <- ugarchfit(spec, data = kospi_returns)
igarch_11.t
```
Interpreting output:
We observe that parameters mu, alpha1, and shape are all statistically significant, suggesting that the t-distribution assumption improves the model fit.
The Ljung-Box test results for the standardized residuals and squared residuals indicate no significant serial correlation, suggesting that the model has adequately captured the autocorrelation structure in the returns and volatility.
The adjusted Pearson goodness-of-fit test p-values are smaller than those for the normal distribution assumption, but still significant, indicating that the t-distribution provides a better fit than the normal distribution, but there might still be room for improvement.
Finally, to visualize the comparison, we plot their QQ-plots:
```{r}
par(mfrow=c(1,2))
plot(igarch_11, which=9)
plot(igarch_11.t, which=9)
```
# Part 3: Model Comparison
Previously, we fit a few models on predicting volatility. We have GARCH and its variants EGARCH and IGARCH. We are now interested in determining which is the best. To accomplish this, we can use metrics such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), as well as the log-likelihood of each model. The model with the lowest AIC and BIC values and the highest log-likelihood is generally considered the best fitting model.
```{r}
# a helper function
extract_criteria <- function(fit) {
aic <- tryCatch({
infocriteria(fit)[1]
}, error = function(cond) {
log_likelihood <- as.numeric(logLik(fit))
n_params <- length(coef(fit))
-2 * log_likelihood + 2 * n_params
})
bic <- tryCatch({
infocriteria(fit)[2]
}, error = function(cond) {
log_likelihood <- as.numeric(logLik(fit))
n_params <- length(coef(fit))
-2 * log_likelihood + n_params * log(length(kospi_returns))
})
loglik <- tryCatch({
likelihood(fit)
}, error = function(cond) {
as.numeric(logLik(fit))
})
return(c(aic, bic, loglik))
}
# get criteria from each model
criteria <- lapply(list(model1, egarch_model11.t, igarch_11.t), extract_criteria)
# store results
model_comparison <- data.frame(Model = c("GARCH", "EGARCH", "IGARCH"),
AIC = sapply(criteria, "[", 1),
BIC = sapply(criteria, "[", 2),
LogLik = sapply(criteria, "[", 3))
model_comparison
# determine the best model
best_model_aic <- model_comparison[which.min(model_comparison$AIC), "Model"]
best_model_bic <- model_comparison[which.min(model_comparison$BIC), "Model"]
best_model_loglik <- model_comparison[which.max(model_comparison$LogLik), "Model"]
cat("Best model based on AIC:", best_model_aic, "\n")
cat("Best model based on BIC:", best_model_bic, "\n")
cat("Best model based on log-likelihood:", best_model_loglik, "\n")
```
Observations:
We note that the AIC and BIC of GARCH are much lower than those of EGARCH and IGARCH. This could possibly be explained by the model complexity. Notice that while EGARCH and IGARCH can capture certain features of the data that GARCH models cannot (such as asymmetric volatility and integrated behavior), they might introduce additional complexity that is not justified by the improvement in the likelihood. If the extra parameters in EGARCH and IGARCH models do not provide a substantial increase in the likelihood, the AIC and BIC values will penalize these models for the additional complexity, resulting in higher values compared to the GARCH(1,1) model.
Hence, even though EGARCH maximizes log-likelihood, GARCH still beats it on both AIC and BIC. Thus, we conclude that GARCH(1, 1) is our best model.