%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% knitr document Knöpfel, Krosigk
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
% Preamble
%%
\documentclass[10pt]{article}
\usepackage{times}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage[english]{varioref}
\usepackage[hidelinks,bookmarksnumbered]{hyperref}
\usepackage{latexsym}
\usepackage{stmaryrd}
\usepackage{soul}
\usepackage{booktabs}
\usepackage{longtable}
\usepackage{setspace}
\onehalfspacing
% Literature
\usepackage[backend=bibtex,style=authoryear-icomp, dashed=false]{biblatex}
\bibliography{Literature}
% Page Numbers
\newcounter{PagenrStorage}
%%
% Document
%%
\begin{document}
%%
% Titlepage
%%
\begin{center}
\includegraphics[scale=.03]{UniSG_Logo}\\[.4cm]
\begin{large}
\textbf{HS16-7,610,1.00 Software Engineering for Economists}\\[.4cm]
\end{large}
University of St. Gallen\\
\today \\[.4cm]
\rule{\textwidth}{1.5pt}\\
\vspace{.2cm}
\textbf{\large Forecasting excess returns of the S\&P 500 using ARIMA models}
\rule{\textwidth}{1.5pt}\\[9.0cm]
\end{center}
\vspace{0.4cm}
\rule{\textwidth}{1pt}\\
\textbf{Author:}
Manuel von Krosigk (16-620-452)\\
\textbf{Keywords:} ARMA, Capital Market Efficiency, Excess returns, Forecasting\\
\thispagestyle{empty}
\newpage
%%
% Contents
%%
\pagenumbering{Roman}
\tableofcontents
\listoffigures \addcontentsline{toc}{section}{List of Figures}
\listoftables \addcontentsline{toc}{section}{List of Tables}
\newpage
\setcounter{PagenrStorage}{\value{page}}
%%
% Body
%%
\pagenumbering{arabic}
% Source: https://gist.github.com/pschmied/9691642
% Source: http://kbroman.org/knitr_knutshell/pages/figs_tables.html
<<setup, include=FALSE, cache=FALSE>>=
# Run the following if you don't have these installed
# install.packages(c("knitr", "ggplot2", "GGally", "xtable", "ggthemes", "xkcd"))
library(knitr)
library(ggplot2) # For plots
# library(GGally) # For pairplots, disable if it causes ggplot errors
library(xtable) # For pretty tables
# library(ggthemes) # plot themes?
# library(xkcd) # How about xkcd themes?
# set global chunk options
opts_chunk$set(fig.path='figure/minimal-', fig.align='center',
fig.show='markup', warning=FALSE)
options(replace.assign=TRUE,width=90)
@
\section{Introduction}
Whether excess returns can be forecast is one of the key indicators of capital market efficiency. Particularly in times of automated high-frequency trading, it should not be possible to generate a consistently positive return using only historical information on a daily basis. Previous research has addressed this question extensively; \textcite{Campbell.2008}, for instance, test whether predictive models can outperform the historical average in terms of predictive power.
Since homogeneous expectations and further efficiency criteria are minimal requirements for the most popular financial models, such as the Capital Asset Pricing Model as used, for instance, in \textcite{Fama.1993}, we aspire to determine whether it is possible to outperform the market on a daily basis using only publicly available data and standard econometric tools, which we assume to be common knowledge among investors.
\section{Data}
As an approximation for the market, we use the S\&P500 from December 31, 1999 until September 30, 2016. The risk-free rate is taken from the Kenneth French data library.
%(http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html)
Since some values of the risk-free rate are not available for the respective S\&P500 dates, we approximate them by the most recent preceding value of the risk-free rate.
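A minimal sketch of this last-observation-carried-forward imputation (illustrated on a hypothetical vector \texttt{x}; the actual data preparation is done in the analysis code):
<<locf-sketch, eval=FALSE>>=
# carry the most recent observed value forward over NAs;
# x is a hypothetical numeric vector whose first element is observed
locf <- function(x){
  for(i in 2:length(x)){
    if(is.na(x[i])) x[i] <- x[i-1]
  }
  x
}
locf(c(0.010, NA, NA, 0.012)) # returns 0.010 0.010 0.010 0.012
@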
For the predictive model, we include different factors taken from the Federal Reserve Bank of St.\ Louis.
%(https://fred.stlouisfed.org/series/)
%
\begin{figure}[htbp]
\centering
<<ExcessReturns, echo=FALSE, fig.width=3, fig.height=3>>=
###
# Read in data and adjust for later analysis
###
# first read given data and adjust
data_given <- read.csv(file="data/SandP_data.csv", header=TRUE, sep=";")
Date <- as.Date(data_given[,1])
# check whether all NA values in 'Date' are in the end of the vector
# sum(1:length(which(!is.na(Date)))==which(!is.na(Date)))==length(which(!is.na(Date)))
# truncate vectors to all values that we actually observed
Date <- Date[1:length(which(!is.na(Date)))]
# calculate daily returns from S&P index in percent
# note: 1:length(Date)-1 would parse as (1:length(Date))-1, so parenthesize
SandP <- (data_given[2:length(Date),2]/data_given[1:(length(Date)-1),2]-1)*100
# calculate daily riskfree rate from annualized riskfree in percent, discard first observation
riskfree <- data_given[2:length(Date),3]/365
# sum(is.na(riskfree)) # some values in the risk free rate are NA
# replace them with the most recent preceding value
# (assumes the first observation is not NA)
for(i in 2:length(riskfree)){
  if(is.na(riskfree[i])){
    riskfree[i] <- riskfree[i-1]
  }
}
# calculate daily excess returns
E_ret <- data.frame(Date[2:length(Date)],SandP - riskfree)
colnames(E_ret) <- c("Date", "E_ret")
par(mar = c(4, 4, 0.1, 0.1), cex.lab = 0.95, cex.axis = 0.9,
mgp = c(2, 0.7, 0), tcl = -0.3)
plot(E_ret[,1], E_ret[,2], type='l',xlab="Date",ylab="Excess returns in %")
abline(h=mean(E_ret[,2]),col="cornflowerblue")
@
\caption{\label{fig:E_ret} Excess returns of the S\&P500.}
\end{figure}%
Figure~\ref{fig:E_ret} depicts the daily excess returns of the S\&P500 over the stated time period as well as their mean. Visual inspection suggests a mean of approximately zero (in fact it is \Sexpr{round(mean(E_ret[,2]),4)}), while the volatility varies considerably, especially during the financial crisis.
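Formally, the daily excess return plotted above is the simple percentage return of the index minus the annualized risk-free rate converted to a daily rate,
\begin{equation*}
r^{e}_t = \left(\frac{P_t}{P_{t-1}}-1\right)\cdot 100 - \frac{rf_t}{365},
\end{equation*}
where $P_t$ denotes the S\&P500 level and $rf_t$ the annualized risk-free rate in percent.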
In a next step, we set up a data frame of the exogenous factors that we want to use to predict the excess returns. Since the data source of the factors differs from that of the endogenous variable, we use the dates of the excess returns and keep only those factor values with a corresponding date. This discards all observations outside the relevant period and also adjusts for differences in trading days across exchanges. The first factor is the \textsc{Ted} spread, the second the relative change in the \textsc{Vix}. The \textsc{Ted} spread is a common gauge of perceived credit risk in the financial system and thus carries information about stress in the market environment; the volatility index \textsc{Vix} has predictive power as well, as it indicates uncertainty in the market. Table~\ref{tab:SumStatsFact} displays descriptive statistics of both factors. On top of that, we include dummy variables for Tuesday, Wednesday, Thursday and Friday; Monday serves as the default day and is hence captured by the intercept.
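For the \textsc{Vix}, the relative change on day $t$ enters in percent, i.e.
\begin{equation*}
\Delta\mathrm{VIX}_t = \frac{\mathrm{VIX}_t-\mathrm{VIX}_{t-1}}{\mathrm{VIX}_{t-1}}\cdot 100.
\end{equation*}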
%
<<Factors, echo=FALSE, warning=FALSE, results="asis">>=
# create data frame of factors
Factors <- data.frame(Date)
# read in first factor
#Factor_1 <- read.csv(file="data/NASDAQCOM.csv", header=TRUE, sep=",",
# na.strings =".", stringsAsFactors=FALSE)
Factor_1 <- read.csv(file="data/TEDRATE.csv", header=TRUE, sep=",",
na.strings =".", stringsAsFactors=FALSE)
Factor_1[,1] <- as.Date(Factor_1[,1])
# read in second factor
Factor_2 <- read.csv(file="data/VIXCLS.csv", header=TRUE, sep=",",
na.strings =".", stringsAsFactors=FALSE)
Factor_2[,1] <- as.Date(Factor_2[,1])
# add columns with the values of Factor_1 and Factor_2 (or an arbitrary
# number of factors), matched to the dates of the dependent variable
for(i in 1:nrow(Factors)){
Factors[i,2] <- Factor_1[which(Date[i]==Factor_1[,1]),2]
Factors[i,3] <- (Factor_2[which(Date[i]==Factor_2[,1]),2]-Factor_2[
which(Date[i]==Factor_2[,1])-1,2]
)/Factor_2[which(Date[i]==Factor_2[,1])-1,2]*100
# Factors[i,3] <- Factor_2[which(Date[i]==Factor_2[,1]),2]
}
#colnames(Factors)[2] <- "NASDAQ"
colnames(Factors)[2] <- "TED"
colnames(Factors)[3] <- "VIX"
NrFact <- 2 # enter how many factors you included in the end
# replace NAs with the most recent non-NA value in all factor columns
for(j in 2:ncol(Factors)){
  # if the first value is NA, take 0
  if(is.na(Factors[1,j])){
    Factors[1,j] <- 0
  }
  for(i in 2:nrow(Factors)){
    if(is.na(Factors[i,j])){
      Factors[i,j] <- Factors[i-1,j]
    }
  }
}
print(xtable(summary(Factors), caption="Summary statistics of the factors used for predicting excess returns.", label = "tab:SumStatsFact"), booktabs=TRUE, caption.placement="top")
# add day-of-the-week dummies for Tuesday-Friday (Monday being the intercept);
# as.POSIXlt()$wday is locale-independent: 0 = Sunday, ..., 6 = Saturday
wday <- as.POSIXlt(Factors[,1])$wday
DayDummies <- c("Tue","Wed","Thu","Fri")
for(d in 1:length(DayDummies)){
  Factors[,NrFact+1+d] <- as.numeric(wday==d+1) # Tue = 2, ..., Fri = 5
}
colnames(Factors)[(NrFact+2):(NrFact+1+length(DayDummies))] <- DayDummies
@
\section{Model}
Now that we have the dependent variable as well as the factors we want to use for prediction, we divide the sample into an in-sample for estimation and an out-of-sample for evaluating the forecasting performance. As our first observation of the S\&P500 is from \Sexpr{Date[1]}, the first observation that we can use for prediction is from \Sexpr{Date[2]}. We therefore use the sample from January 2000 until December 2015 as our in-sample and the remaining observations to test the forecasting performance of the model. Since we use the factors to predict one day ahead, we take their values starting from \Sexpr{Date[2]} and, as dependent variable, the one-day excess return of the S\&P500 one day later, i.e.\ from \Sexpr{Date[3]}.
%
<<Model, echo=FALSE, warning=FALSE, message=F, results="asis">>=
###
# Model
###
library(forecast, quietly = TRUE)
# are the series stationary? (adf.test is from the tseries package)
# adf.test(E_ret[,2]) # E_ret is stationary
# for(i in 2:ncol(Factors)){
#   adf.test(Factors[,i]) # the factors are as well
# }
# choose in-sample for model estimation
Date_In_End <- as.Date("2015-12-31") # last date of in-sample
Date_In_End_1dbefore <- as.Date("2015-12-30") # last date of explanatory var
fit <- Arima(
  # dependent variable from the 2nd observation to 2015-12-31
  E_ret[2:which(E_ret[,1]==Date_In_End),2],
  # factors and dummies as explanatory variables
  xreg = Factors[2:which(Factors[,1]==Date_In_End_1dbefore),2:ncol(Factors)],
  # order of the ARIMA(p,d,q) model to estimate
  order = c(2,0,2)
)
# summary(fit)
err_in <- resid(fit) # store the model residuals
@
We fit a model to the in-sample using the factors described above as well as autoregressive and moving average components, which yields an ARIMA(2,0,2) model following the selection procedure proposed in \textcite{Hyndman.2014} in combination with \textcite{Kleiber.2008}. Visually inspecting the errors of our model indicates that there is barely any scope for significantly improving its explanatory power without losing too many degrees of freedom, as can be seen in figure~\ref{fig:ErrorPlot}.
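Written out, this is a linear regression with ARMA(2,2) errors (a sketch of the specification, writing $r_t$ for the excess return and $x_{t-1}$ for the vector of factors and day dummies from the previous trading day):
\begin{equation*}
r_t = \mu + \beta' x_{t-1} + n_t, \qquad
n_t = \phi_1 n_{t-1} + \phi_2 n_{t-2} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2},
\end{equation*}
where $\varepsilon_t$ denotes white noise.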
%
\begin{figure}[h]
\begin{center}
<<ErrorPlot, echo=FALSE, warning=FALSE, message=F, tidy=FALSE, fig.width=5, fig.height=5, out.width='0.7\\linewidth'>>=
tsdisplay(arima.errors(fit)) # assess model errors; note that newer versions
# of the forecast package replace arima.errors() by
# residuals(fit, type="regression")
@
\caption{ARIMA(2,0,2) errors and their serial correlation.}
\label{fig:ErrorPlot}
\end{center}
\end{figure}
%
Utilizing a Ljung-Box test yields a p-value of \Sexpr{round(Box.test(residuals(fit),fitdf=3,lag=10,type="Ljung")$p.value,4)}, which indicates that there is no remaining serial correlation in the errors at the 5\% level. Another indication of the explanatory power of the model is the goodness-of-fit criterion $R^2$, which we calculate as
\begin{equation*}
R^2 = 1-\frac{\sum_i (y_i-\hat{y}_i)^2}{\sum_i (y_i-\bar{y})^2}= \Sexpr{round(
1-sum((residuals(fit))^2)/
sum((E_ret[2:which(E_ret[,1]==Date_In_End),2]-
mean(E_ret[2:which(E_ret[,1]==Date_In_End),2]))^2),
4)}
\end{equation*}
i.e.\ one minus the fraction of unexplained variation of the dependent variable over its total variation.
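For reference, the Ljung-Box statistic used above is
\begin{equation*}
Q = T(T+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^2}{T-k},
\end{equation*}
where $T$ is the number of residuals, $\hat{\rho}_k$ their lag-$k$ autocorrelation and $h=10$ the number of lags tested; under the null of no serial correlation, $Q$ is asymptotically $\chi^2$-distributed with $h$ minus the number of fitted parameters degrees of freedom.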
\section{Forecasting Performance}
Finally, we want to test the forecasting ability of our model. For this purpose, we imagine ourselves standing at the end of December 2015 and make an $h=\Sexpr{which(Factors[,1]==Date[length(Date)])-which(Factors[,1]==Date_In_End)}$-step-ahead forecast until the end of the third quarter of 2016, which is depicted in figure~\ref{fig:hstepFcast}. Note that for a regression with ARMA errors, the future values of the regressors have to be supplied to the forecast, so the forecast horizon is implied by the number of out-of-sample factor observations.
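A minimal sketch of the call (where \texttt{future\_X} is a hypothetical matrix holding the out-of-sample factor rows):
<<fcast-usage, eval=FALSE>>=
# future_X is a hypothetical matrix of out-of-sample regressor values;
# the forecast horizon is implied by nrow(future_X)
fcast <- forecast(fit, xreg = future_X)
@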
%
\begin{figure}[h]
\begin{center}
<<h-StepForecast, echo=FALSE, warning=FALSE, message=F, tidy=FALSE, fig.width=5, fig.height=5, out.width='0.7\\linewidth'>>=
# for a regression with ARMA errors the future values of the regressors
# must be supplied; the forecast horizon is implied by nrow(xreg)
fcast <- forecast(fit,
  xreg=Factors[(which(Factors[,1]==Date_In_End_1dbefore)+1):(nrow(Factors)-1),
               2:ncol(Factors)])
plot(fcast)
@
\caption{$h$-step-ahead forecasts from the regression with ARIMA(2,0,2) errors.}
\label{fig:hstepFcast}
\end{center}
\end{figure}
%
On top of that, we consider a rolling one-step-ahead forecast for a final evaluation of our model. That is, we take all information available on the first day of our out-of-sample and predict the excess return of the next day, then use the information available up to that day, and so on until we reach the end of the out-of-sample. At each point in time we evaluate the prediction by comparing it to the actual value on that day. The final evaluation of the model is done by calculating the out-of-sample $R^2_{OS}$, for which the benchmark is the mean of the series up to the day on which we are forecasting.
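Schematically, this rolling evaluation can be sketched as follows (on hypothetical objects: a series \texttt{y}, a regressor matrix \texttt{X} aligned one day ahead, and \texttt{T\_in} marking the last in-sample observation):
<<rolling-sketch, eval=FALSE>>=
# rolling one-step-ahead forecasts: re-estimate daily, predict the next day
roll_fcast <- function(y, X, T_in){
  pred <- numeric(length(y) - T_in)
  for(t in T_in:(length(y)-1)){
    fit_t <- Arima(y[1:t], xreg = X[1:t,], order = c(2,0,2))
    pred[t - T_in + 1] <- forecast(fit_t, xreg = X[t+1,,drop=FALSE])$mean
  }
  pred
}
@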
%
\begin{figure}[h]
\begin{center}
<<one-StepForecast, echo=FALSE, warning=FALSE, message=F, tidy=FALSE, fig.width=5, fig.height=5, out.width='0.7\\linewidth'>>=
# rolling one-step-ahead forecast
# number of out-of-sample observations
n_out <- which(Factors[,1]==Date[length(Date)]) -
  which(Factors[,1]==Date_In_End_1dbefore) - 1
pred      <- matrix(NA, nrow=n_out, ncol=1) # one-step-ahead forecasts
benchmark <- matrix(NA, nrow=n_out, ncol=1) # historical averages
actual    <- matrix(NA, nrow=n_out, ncol=1) # realized excess returns
err_out   <- matrix(NA, nrow=n_out, ncol=1) # forecast errors
# re-estimate the ARIMA(2,0,2) each day on all information available so far
for(i in 1:n_out){
  # model estimation as above
  reg_1 <- Arima(
    E_ret[2:(which(E_ret[,1]==Date_In_End)+i-1),2],
    xreg = Factors[2:(which(Factors[,1]==Date_In_End_1dbefore)+i-1),2:ncol(Factors)],
    order = c(2,0,2)
  )
  # one-step-ahead forecast with the next day's factor values
  fcast <- forecast(reg_1,
    xreg = Factors[which(Factors[,1]==Date_In_End_1dbefore)+i,2:ncol(Factors)],
    h=1)
  # fill matrices
  pred[i]      <- fcast$mean                                           # forecast
  benchmark[i] <- mean(E_ret[2:(which(E_ret[,1]==Date_In_End)+i-1),2]) # historical average
  actual[i]    <- E_ret[which(E_ret[,1]==Date_In_End)+i,2]             # actual value
  err_out[i]   <- actual[i]-pred[i]                                    # forecast error
}
# plot forecasts against actual values and the benchmark
plot(Date[(which(Date==Date_In_End)+1):length(Date)],actual,type="l",col="black",
xlab="Date",ylab="Predicted vs actual")
lines(Date[(which(Date==Date_In_End)+1):length(Date)],pred,col="cornflowerblue")
lines(Date[(which(Date==Date_In_End)+1):length(Date)],benchmark,col="red")
legend("bottomleft",c("Actual","Forecast","Benchmark"),
lty=c(1,1,1),col=c("black","cornflowerblue","red"))
@
\caption[1-step-ahead rolling forecasts from ARIMA model]{One-step-ahead rolling forecasts from regression with ARIMA(2,0,2) errors.}
\label{fig:1stepFcast}
\end{center}
\end{figure}
%
Figure~\ref{fig:1stepFcast} shows the actual development of the excess returns over the out-of-sample as well as our forecast and the benchmark. Visual inspection does not indicate which of the two gives the more accurate prediction; the out-of-sample forecast performance criterion, however, calculated as
\begin{equation*}
R^2_{OS}=1-\frac{\sum_i (y_i-\hat{y}_{i,\mathrm{model}})^2}{\sum_i (y_i-\hat{y}_{i,\mathrm{benchmark}})^2}=
\Sexpr{round(1-(sum((actual-pred)^2)/sum((actual-benchmark)^2)),4)}
\end{equation*}
shows that our model slightly outperforms the benchmark.
\section{Conclusion}
As we have shown, the historical average of excess returns can be outperformed in our sample by rather basic econometric models using publicly available data. Since the predictive power is still very low, however, we conclude that in this setting the market, as approximated by the S\&P500, is rather efficient on a daily basis, since daily excess returns cannot be predicted to a meaningful degree. Of course, there is still a multitude of possible factors that could be included in the model, as well as more sophisticated econometric tools, but these are beyond the scope of this analysis and are left for future research.
%%
% References
%%
\clearpage
\pagenumbering{Roman}
\setcounter{page}{\value{PagenrStorage}}
\addcontentsline{toc}{section}{Bibliography}
%\newpage
\printbibliography
\end{document}