% Created 2019-05-02 Thu 17:37
% Intended LaTeX compiler: pdflatex
\documentclass[a4paper, 12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{grffile}
\usepackage{longtable}
\usepackage{wrapfig}
\usepackage{rotating}
\usepackage[normalem]{ulem}
\usepackage{amsmath}
\usepackage{textcomp}
\usepackage{amssymb}
\usepackage{capt-of}
\usepackage{hyperref}
\usepackage[left=2.5cm, right=2.5cm, top=2.5cm, bottom=2.5cm, bindingoffset=1.5cm, head=15pt]{geometry}
\usepackage{setspace}
\usepackage{caption}
\onehalfspacing
\usepackage[official]{eurosym}
\usepackage{makecell}
\renewcommand\theadalign{cc}
\renewcommand\theadfont{\bfseries}
\renewcommand\theadgape{\Gape[2pt]}
\usepackage{xcolor}
\newcommand{\red}[1]{\textcolor{red}{[#1]}}
\usepackage{lmodern}
\usepackage[newfloat]{minted}
\usepackage{tcolorbox}
\tcbuselibrary{minted}
\usepackage{notation/rl}
\usepackage{notation/model}
\usepackage{template/list}
\let\itemize\hitemize
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhead{}
\fancyfoot{}
\fancyhead[LE,RO]{\textsl{\leftmark}}
\fancyhead[RE,LO]{Tobias Richter}
\fancyfoot[C]{\thepage}
\renewcommand{\headrulewidth}{0.4pt}
\renewcommand{\footrulewidth}{0pt}
\usepackage{apacite}
\let\cite\shortcite
\let\textcite\shortciteA
\usepackage[nohyperlinks]{acronym}
\usepackage[bottom]{footmisc}
\interfootnotelinepenalty=10000
\usepackage[notlof,notlot,nottoc]{tocbibind}
\newcommand{\studentID}{5583705}
\newcommand{\thesistype}{Master Thesis}
\newcommand{\supervisor}{Univ.-Prof. Dr. Wolfgang Ketter}
\newcommand{\cosupervisor}{Karsten Schroer}
\pagenumbering{Roman}
\author{Tobias Richter}
\date{\today}
\title{Reinforcement Learning Portfolio Optimization of Electric Vehicle Virtual Power Plants}
\hypersetup{
pdfauthor={Tobias Richter},
pdftitle={Reinforcement Learning Portfolio Optimization of Electric Vehicle Virtual Power Plants},
pdfkeywords={},
pdfsubject={},
pdfcreator={Emacs 26.1 (Org mode 9.2.1)},
pdflang={English}}
\begin{document}
\makeatletter
\begin{titlepage}
\begin{center}
\vspace*{1cm}
\Large
\textbf{\@title{}}
\vspace{1.5cm}
\thesistype{}
\vspace{1cm}
\begin{figure}[htbp]
\centering
\includegraphics[width=.5\linewidth]{./fig/UoC-logo.png}
\end{figure}
\vspace{1cm}
\large
\textbf{Author}: \@author{} (Student ID: \studentID{})\\
\large
\textbf{Supervisor}: \supervisor{}\\
\large
\textbf{Co-Supervisor}: \cosupervisor{}
\vspace{1cm}
\large
Department of Information Systems for Sustainable Society\\
Faculty of Management, Economics and Social Sciences\\
University of Cologne\\
\vspace{1cm}
\@date{}
\end{center}
\end{titlepage}
\makeatother
\clearpage
\thispagestyle{empty}
\section*{Eidesstattliche Versicherung}
\label{sec:SOOA}
\vspace{2.5cm}
% Statement of original authorship - Needs to be in German
% see also here: https://www.wiso.uni-koeln.de/sites/fakultaet/dokumente/PA/formulare/eidesstattliche_erklaerung.pdf
Hiermit versichere ich an Eides statt, dass ich die vorliegende Arbeit selbstständig und ohne die Benutzung anderer als der angegebenen Hilfsmittel angefertigt habe. Alle Stellen, die wörtlich oder sinngemäß aus veröffentlichten und nicht veröffentlichten Schriften entnommen wurden, sind als solche kenntlich gemacht. Die Arbeit ist in gleicher oder ähnlicher Form oder auszugsweise im Rahmen einer anderen Prüfung noch nicht vorgelegt worden. Ich versichere, dass die eingereichte elektronische Fassung der eingereichten Druckfassung vollständig entspricht.
\vspace{1cm}
\noindent
Die Strafbarkeit einer falschen eidesstattlichen Versicherung ist mir bekannt, namentlich die Strafandrohung gemäß § 156 StGB bis zu drei Jahren Freiheitsstrafe oder Geldstrafe bei vorsätzlicher Begehung der Tat bzw. gemäß § 161 Abs. 1 StGB bis zu einem Jahr Freiheitsstrafe oder Geldstrafe bei fahrlässiger Begehung.
\vspace{3cm}
\noindent
\textbf{Tobias Richter}
\vspace{0.5cm}
\noindent
Köln, den 01.05.2019
\clearpage
\thispagestyle{empty}
\topskip0pt
\vspace*{\fill}
\section*{Abstract} Integrating weather-dependent renewable energy sources
into the electricity system imposes challenges on the power grid. Balancing
services are needed which can be provided by virtual power plants (VPP) that
aggregate distributed energy resources (DER) to consume or produce electricity
on demand. Electric vehicle (EV) fleets can use the batteries of idle cars as
combined storage to offer balancing services on smart electricity markets.
However, there are risks associated with this business model extension. The
fleet faces severe imbalance penalties if it cannot charge the offered amount
of balancing energy due to unpredicted mobility demand of the vehicles. Ensuring
the fleet can fulfill all market commitments risks denying profitable customer
rentals. We study the design of a decision support system that estimates these
risks, dynamically adjusts the composition of a VPP portfolio, and profitably
places bids on multiple electricity markets simultaneously. Here we show that a
reinforcement learning agent can optimize the VPP portfolio by learning from
favorable market conditions and fleet demand uncertainties. In comparison to
previous research, in which the bidding risks were unknown and fleets could only
offer conservative amounts of balancing power to a single market, our proposed
approach increases the amount of offered balancing power by 48\%--82\% and
reduces the fleet's charging costs by 25\%. In experiments with
real-world carsharing data of 500 EVs, we found that the accuracy of mobility
demand forecasting algorithms is crucial for a successful bidding strategy.
Moreover, we show that recent advancements in deep reinforcement learning
decrease the convergence time and improve the robustness of the results. Our
results demonstrate how modern RL algorithms can be successfully used for fleet
management, VPP optimization, and demand response in the smart grid. We
anticipate that DER, such as EVs, will play an essential role in providing
reliable backup power for the grid and formulate market design recommendations
to allow easier access to these resources.
\vspace*{\fill}
\clearpage
\setcounter{page}{1}
\tableofcontents
\clearpage
\listoffigures
\clearpage
\listoftables
\clearpage
\section*{List of Abbreviations} \markboth{LIST OF ABBREVIATIONS}{}
\begin{acronym}[GCRM]
\acro{ANN}{Artificial Neural Network}
\acro{DDQN}{Double Deep Q-Networks}
\acro{DER}{Distributed Energy Resources}
\acro{DP}{Dynamic Programming}
\acro{EPEX}{European Power Exchange}
\acro{EV}{Electric Vehicle}
\acro{GCRM}{German Control Reserve Market}
\acro{MDP}{Markov Decision Process}
\acro{RES}{Renewable Energy Sources}
\acro{RL}{Reinforcement Learning}
\acro{TD}{Temporal-Difference}
\acro{TSO}{Transmission System Operator}
\acro{V2G}{Vehicle-to-Grid}
\acro{VPP}{Virtual Power Plant}
\end{acronym}
\clearpage
\section*{Summary of Notation} \markboth{SUMMARY OF NOTATION}{}
Capital letters are used for random variables, whereas lower case letters are used for
the values of random variables and for scalar functions. Quantities that are required to
be real-valued vectors are written in bold and in lower case.
\begin{tabbing}
\=~~~~~~~~~~~~~~~~~~ \= \kill
\>$\defeq$ \> equality relationship that is true by definition\\
\>$\approx$ \> approximately equal\\
\>$\E{X}$ \> expectation of a random variable $X$, i.e., $\E{X}\defeq\sum_x p(x)x$\\
\>$\Re$ \> set of real numbers\\
\>$\leftarrow$ \> assignment\\
\\
\>$\e$ \> probability of taking a random action in an \e-greedy policy\\
\>$\alpha$ \> step-size parameter\\
\>$\gamma$ \> discount-rate parameter\\
\>$\lambda$ \> decay-rate parameter for eligibility traces\\
\\
\>$s, s'$ \> states\\
\>$a$ \> an action\\
\>$r$ \> a reward\\
\>$\S$ \> set of all nonterminal states\\
\>$\A$ \> set of all available actions\\
\>$\R$ \> set of all possible rewards, a finite subset of $\Re$\\
\>$\subset$ \> subset of; e.g., $\R\subset\Re$\\
\>$\in$ \> is an element of; e.g., $s\in\S$, $r\in\R$\\
\\
\>$t$ \> discrete time step\\
\>$T, T(t)$ \> final time step of an episode, or of the episode including time step $t$\\
\>$A_t$ \> action at time $t$\\
\>$S_t$ \> state at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$\\
\>$R_t$ \> reward at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$\\
\>$\pi$ \> policy (decision-making rule)\\
\>$\pi(s)$ \> action taken in state $s$ under {\it deterministic\/} policy $\pi$\\
\>$\pi(a|s)$ \> probability of taking action $a$ in state $s$ under {\it stochastic\/} policy $\pi$\\
\>$G_t$ \> return following time $t$\\
\\
\>$\p(s',r|s,a)$ \> probability of transition to state $s'$ with reward $r$, from state $s$ and\\
\> \> action $a$\\
\>$\p(s'|s,a)$ \> probability of transition to state $s'$, from state $s$ taking action $a$\\
\>$\vpi(s)$ \> value of state $s$ under policy $\pi$ (expected return)\\
\>$\vstar(s)$ \> value of state $s$ under the optimal policy\\
\>$\qpi(s,a)$ \> value of taking action $a$ in state $s$ under policy $\pi$\\
\>$\qstar(s,a)$ \> value of taking action $a$ in state $s$ under the optimal policy\\
\>$V, V_t$ \> array estimates of state-value function $\vpi$ or $\vstar$\\
\>$Q, Q_t$ \> array estimates of action-value function $\qpi$ or $\qstar$\\
\\
\>$d$ \> dimensionality---the number of components of $\w$\\
\>$\w$ \> $d$-vector of weights underlying an approximate value function\\
\>$\hat v(s,\w)$ \> approximate value of state $s$ given weight vector $\w$\\
\>$\mu(s)$ \> on-policy distribution over states\\
\>$\MSVEm$ \> mean square value error\\
\end{tabbing}
\clearpage
\pagenumbering{arabic}
\section{Introduction}
\label{sec:orgf76b476}
Global climate change is one of the most substantial challenges of our time. It
necessitates reducing carbon emissions and shifting to sustainable energy
sources. However, the integration of renewable energy sources (RES) is a complex
matter. Sustainable electricity production depends on the weather conditions
and is difficult to integrate into the electrical power system.
Changing weather conditions cause under- and oversupplies of intermittent solar
and wind energy, which destabilize the power grid. This thesis will explore
a novel approach of integrating renewable energy into the power system by
providing balancing power with virtual power plants (VPPs) of electric vehicles
(EVs). The chapter is structured as follows: first, the research
motivation is described; next, the derived research questions are presented; and
finally, the relevance of this research is explained.
\subsection{Research Motivation}
\label{sec:orgcedba0d}
VPPs play an important role in stabilizing the grid by aggregating distributed
energy resources (DER) to consume and produce electricity when it is needed
\cite{pudjianto07_virtual_power_plant_system_integ}. At the same time, carsharing
companies operate large, centrally managed EV fleets in major cities around the
world. These EV fleets can be turned into VPPs by using their batteries as
combined electricity storage. In this way, fleet operators can offer balancing
services to the power grid or trade electricity on the open markets for
arbitrage purposes. Fleet operators can charge the fleet by buying electricity
and discharge the fleet by selling electricity when market conditions are
favorable.
However, renting out EVs to customers is considerably more lucrative than using
EV batteries for trading electricity
\cite{kahlen18_elect_vehic_virtual_power_plant_dilem}. By making EVs available to
be used as a VPP, fleet operators compromise customer mobility and potentially
the profitability of the fleet. Knowing how many EVs will be available for VPP
usage in the future is critical for a successful trading strategy. Accurate
rental demand forecasts help determine the amount of electricity that can be
traded on the market. EV fleet operators can also participate in multiple
electricity markets simultaneously. They can take advantage of distinctive
market properties, such as the auction mechanism or the lead time to delivery,
to optimize their bidding strategy and reduce risks. Still, the ultimate
risk remains that the fleet could make commitments to the markets that it cannot
fulfill due to unforeseen rental demand at the time of electricity delivery.
We state that participating in the balancing market and intraday market
simultaneously can mitigate risks and increase profits of the fleet. In this
research, we propose a portfolio optimizing strategy, in which the best
composition of the VPP portfolio is dynamically determined using a
\emph{Reinforcement Learning} (RL) approach. RL methods have been shown to perform
on par with or better than human-level decision making in a variety of domains
\cite{mnih15_human_level_contr_throug_deep_reinf_learn}. An RL agent is able to
adapt the bidding strategy to changing rental demands and market conditions. It
learns from historical data, the observed environment, and realized profits to
adjust the bidding quantities and dynamically determine bidding risks. The
following fleet control tasks need to be performed by the agent in real-time: 1)
Allocate available EVs to the VPP portfolio, 2) Determine the optimal VPP
portfolio composition, and 3) Place bids and asks on multiple electricity
markets with an integrated trading strategy.
\subsection{Research Questions}
\label{sec:org592f613}
Drawing upon the research motivation, this research aims to answer the following
research questions:
\begin{enumerate}
\item \emph{Can EV fleet operators create VPP portfolios to profitably trade electricity
on the balancing market and intraday market simultaneously?} \emph{What would an}
\emph{integrated bidding strategy look like for EV fleet operators that manage VPP
portfolios?}
\item \emph{Can a reinforcement learning agent optimize VPP portfolios by learning the
risks that are associated with bidding on the individual} \emph{electricity
markets?}
\end{enumerate}
\subsection{Relevance}
\label{sec:org2b3a64c}
From a scientific perspective, this thesis is relevant to the stream of
agent-based decision making in smart markets
\cite{bichler10_desig_smart_market,peters13_reinf_learn_approac_to_auton}. It
contributes to the body of Design Science in Information Systems
\cite{hevner04_desig_scien_infor_system_resear} and draws upon work grounded
within a multitude of research areas: VPPs in smart electricity markets
\cite{pudjianto07_virtual_power_plant_system_integ}, fleet management of
(electric) carsharing as a new way of sustainable mobility
\cite{brandt17_evaluat_busin_model_vehic_grid_integ,wagner16_in_free_float}, and
advanced RL techniques for the smart grid
\cite{vazquez-canteli19_reinf_learn_deman_respon}. We specifically build on
research that has been carried out by
\textcite{kahlen17_fleet,kahlen18_elect_vehic_virtual_power_plant_dilem}. In their
papers, the authors concentrate on trading electricity on one market at a time.
As proposed by the authors, we take this research further and use a VPP of EVs
to participate on multiple types of electricity markets simultaneously. In this
way we create a VPP portfolio that offers EV batteries as storage options to the
markets with an integrated bidding strategy.
From a business perspective, this thesis is relevant to carsharing companies
operating EV fleets, such as Car2Go or DriveNow. We show how these companies can
increase their profits, using idle EVs as a VPP to trade electricity on multiple
markets simultaneously. We propose the use of a decision support system (DSS),
which allocates idle EVs to be used as VPP or to be available for rent.
Furthermore, the DSS determines optimal capacity-price pairs to place bids on
the individual electricity markets. Using an event-based simulation platform, we
estimate the profitability of the proposed methods, using real-world data from
German electricity markets and trip data from a German carsharing provider.
This thesis also contributes to the overall welfare of society. First, VPPs of
EVs provide extra balancing services to the power grid. VPPs consume excess
electricity almost instantly and stabilize the power grid. By integrating more
intermittent renewable electricity sources into the grid, such balancing
services will become indispensable in the future. Second, a reduction of
electricity prices for the end-consumer is expected. Integrating VPPs into the
power grid increases the efficiency of the whole system and, hence, will lower
the prices. \textcite{kahlen18_elect_vehic_virtual_power_plant_dilem} show
results in which electricity prices decrease by up to 3.4\% on the wholesale market.
We anticipate similar results in our research. Third, VPPs can lead to a
decrease in CO\textsubscript{2} emissions. With an increasing share of renewable energy
production, the supply of sustainable electricity can exceed the total
electricity demand at times of good weather conditions. A VPP can consume this
electricity by charging the EV fleet and the sustainable energy production does
not need to be curtailed. EVs that are equipped with special vehicle-to-grid
(V2G) devices can feed the electricity back into the grid when there is more
demand than supply. This mechanism increases the utilization of renewable
electricity generation and reduces the total CO\textsubscript{2} emissions.
\clearpage
\section{Background}
\label{sec:org08b4100}
The following chapter describes the relevant research and the theoretical
background of this thesis. First, we introduce smart electricity markets, give
detailed explanations of two particular types of electricity markets, and
explain applications for EV fleets in the smart grid. Next, we discuss the
related RL literature in the context of controlled EV charging. Finally, the
methodological background is presented in the form of a comprehensive review of
the most important concepts and formulations in RL theory.
\subsection{Smart Electricity Markets}
\label{sec:org57aa97e}
On electricity markets, participants place asks (sale offers) and bids (purchase
orders) in auctions to match the supply of electricity generation and the demand
for electricity consumption. The electricity price is determined by an auction
mechanism, which can take different forms depending on the type of market.
Germany, like many other western countries, has a liberalized energy system in
which the generation and distribution of electricity are decoupled. Multiple
electricity markets exist in a liberalized energy system. They differ in the
auction design and in their reaction time between the order contract and the
delivery of electricity. Day-ahead markets and spot markets have a reaction time
between a day and several hours, whereas in operating reserve markets the
reaction time ranges from minutes to seconds. Most electricity markets work
according to the merit order principle in which resources are considered in
ascending order of the energy price until the capacity demand is met. The
clearing price is determined by the energy price, at the point where supply
meets demand. Payment models differ in the markets: In contrast to day-ahead
markets, where a uniform pricing schema is applied, in secondary reserve markets
and intraday markets, participants get compensated by the price they bid (pay-as-bid
principle).
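To make the merit order principle concrete, consider the following minimal
Python sketch (an illustrative example of our own; the offers, quantities, and
prices are invented and not taken from any real auction). It sorts supply
offers by ascending energy price and determines the uniform clearing price at
the point where cumulative supply covers demand.
\begin{minted}{python}
# Minimal merit-order clearing sketch (illustrative example, not real market data).
# Each offer is (energy price in EUR/MWh, offered capacity in MW).
offers = [(15.0, 200), (22.5, 150), (30.0, 300), (48.0, 250)]
demand_mw = 500

def merit_order_clearing(offers, demand_mw):
    """Return the uniform clearing price and the accepted (price, quantity) pairs."""
    accepted, covered = [], 0.0
    for price, capacity in sorted(offers):          # ascending energy price
        take = min(capacity, demand_mw - covered)   # only as much as still needed
        accepted.append((price, take))
        covered += take
        if covered >= demand_mw:
            return price, accepted                  # marginal offer sets the price
    raise ValueError("demand cannot be covered by the submitted offers")

price, accepted = merit_order_clearing(offers, demand_mw)
print(f"clearing price: {price} EUR/MWh, accepted offers: {accepted}")
\end{minted}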
In smart electricity markets, EV fleet operators can offer the capacity of their
EV batteries to different types of markets. On operating reserve markets, prices
are usually more volatile and consequently more attractive for VPPs
\cite{tomic07_using_fleet_elect_drive_vehic_grid_suppor}. But operating reserve
markets also bear a higher risk for the fleet: Commitments have to be made up to
one week in advance when customer demands are still uncertain. In order to avoid
penalties for unfulfilled commitments, only a conservative amount of capacity can
be offered to the market. On the other hand, spot markets allow participants to
continuously trade electricity products up to five minutes prior to delivery.
Just minutes before delivery it is possible to predict the available battery
capacity of the fleet with high accuracy. This certainty makes it possible
to aggressively trade the available battery capacity with low risk on the spot
market.
\subsection{Electricity Market Theory}
\label{sec:org42fc32a}
In the following chapter, we will explain the market design, market properties
and bidding procedures of the balancing market and spot market in more detail,
as they are the markets we include in our research.
\begin{figure}[htbp]
\centering
\includegraphics[width=\linewidth]{fig/electricity-markets.png}
\caption[Electricity Market Design]{Interaction between electricity markets in relation to capacity allocation. \label{fig-electricity-markets}}
\end{figure}
\subsubsection{Balancing Market \label{sec-balancing-market}}
\label{sec:org8aa4420}
The balancing market is a tool to balance frequency deviations in the power
grid. It offers auctions for primary control reserve, secondary control reserve
as well as tertiary control reserve (minute reserve), which primarily differ in
the required ramp-up times of the participants.
As depicted in Figure \ref{fig-electricity-markets}, the balancing market is
conceptually placed at the intersection between the open electricity markets and
the physical power system \cite{veen16_elect_balan_market}. It can be seen as the
last link in a chain of electricity markets, although the lead time of capacity
allocation is considerably longer than at the day-ahead or intraday markets.
In this study, we will look at the German control reserve market (GCRM), one of
the largest frequency regulation markets in the world. However, the presented
concepts can be easily transferred to other balancing markets in liberalized
energy systems, since the market design is similar
\cite{brandt17_evaluat_busin_model_vehic_grid_integ}. Transmission systems
operators (TSO) procure their required control reserve via tender auctions at
the GCRM. The market conducts daily auctions for the three types of control
reserve. This thesis focuses on the secondary operating reserve auction, in
which participants must be able to supply or absorb a minimum of 1 MW of power
over a 4-hour interval within a reaction time of 30 seconds.\footnote{See \url{https://regelleistung.net}, accessed February 15,
2019, for further information on the market design and historical data.} Since EV
batteries can absorb energy almost instantly when they are connected to a
charging station, they are suitable to provide such balancing services that
require fast ramp-up times. Operating reserve providers have to be qualified by
the TSO to participate in the market and prove that they are able to reliably
provide balancing power. Although EV fleets are currently not qualified by the
GCRM to be used as operating reserve, they could theoretically handle the
minimum capacity requirements if they are large enough. Around 220 EVs would
need to simultaneously charge at standard 4.6 kW charging stations to provide 1
MW of downward regulating capacity.
Up until 28\textsuperscript{th} July 2018, auctions were held weekly, with two different
segments each week (peak hours/non-peak hours). Afterwards, the auction
mechanism changed to daily auctions of six four-hour segments of positive and
negative control reserve.\footnote{\url{https://www.bundesnetzagentur.de/SharedDocs/Pressemitteilungen/DE/2017/28062017\_Regelenergie.html},
accessed February 18, 2019} Shorter auction cycles facilitate the
integration of renewable energy generators into the secondary control reserve
market, as they are dependent on accurate (short-term) capacity forecasts.
Positive control reserve is energy that is supplied to the grid, when the grid
frequency falls below 50 Hz. It can be provided by increasing the electricity
generation or by reducing the grid load (i.e., electricity consumption). On the
contrary, negative control reserve is required when the grid frequency rises
above 50 Hz and can be provided by adding grid load or reducing electricity
generation. Since we do not consider V2G in this thesis (see next chapter), the
EV fleets in our model are only able to provide negative control reserve, which
we will simply refer to as \emph{balancing power} for the remainder of the thesis.
Market participants submit bids in the form \((\Pb{}, \cp{}, \ep{})\) to the
market, where \(\Pb{}\) is the amount of electrical power that can be supplied on
demand in kW, \(\cp{}\) is the capacity price for keeping the power available in
\(\emw\) and \(\ep{}\) is the energy price for delivered energy in \(\emwh\). The TSOs
determine the target quantity of balancing power to acquire per market period
based on historical balancing data. To minimize risks, they usually acquire
considerably more regulation capacity than they expect to need and activate the
required amount of capacity on demand.
The TSOs accept the bids based on the capacity price in a merit order. Providers
whose bids were accepted instantly get compensated for the provided capacity
\(\Pb{}\). At the time regulation capacity is needed, usually a day to a week
later, the TSO activates the capacity according to a merit order of the
ascending energy prices \(\ep{}\). Hence, providers are also compensated according
to the actual energy \(\Eb{}\) they supplied or consumed. Since providers get paid
according to the submitted prices \(\cp{}\) and \(\ep{}\) (instead of a market
clearing price) this type of auction is called \emph{pay-as-bid} auction.
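As an illustration of this pay-as-bid compensation, the short sketch below (a
simplified example of our own; all prices and quantities are made up) computes
a provider's revenue from a single accepted bid \((\Pb{}, \cp{}, \ep{})\): a
capacity payment for keeping \(\Pb{}\) available over the tendered period and
an energy payment for the energy \(\Eb{}\) that was actually activated.
\begin{minted}{python}
# Simplified pay-as-bid settlement for one accepted balancing bid
# (illustrative figures only, not real market data).
P_kw = 1000.0          # offered balancing power P in kW (= 1 MW)
cp_eur_per_mw = 80.0   # submitted capacity price in EUR/MW for the tendered period
ep_eur_per_mwh = 45.0  # submitted energy price in EUR/MWh
activated_mwh = 2.5    # energy E actually activated by the TSO during the period

capacity_payment = (P_kw / 1000.0) * cp_eur_per_mw   # paid for availability
energy_payment = activated_mwh * ep_eur_per_mwh      # paid for consumed energy
revenue = capacity_payment + energy_payment
print(f"capacity payment: {capacity_payment:.2f} EUR, "
      f"energy payment: {energy_payment:.2f} EUR, total: {revenue:.2f} EUR")
\end{minted}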
\subsubsection{Spot Market \label{sec-spot-market}}
\label{sec:org2621c96}
As mentioned in the previous chapter, the equilibrium of electricity supply and
demand is ensured through a sequence of interdependent wholesale markets
\cite{pape16_are_fundam_enoug}. Next to the balancing market at the end of the
sequence, mainly two different types of spot markets exist, the day-ahead market
and the intraday market. In this research, we consider the European Power
Exchange (EPEX Spot) as it is the largest electricity market in Europe, with a
total trading volume of approximately 567 TWh in 2018\footnote{\url{https://www.epexspot.com/en/press-media/press/details/press/Traded\_volumes\_soar\_to\_an\_all-time\_high\_in\_2018},
accessed February 19, 2019\label{org641c9d2}}, but most
spot markets in western economies work with similar market mechanisms.
Germany's most important spot market is the day-ahead market with a trading
volume of over 234 TWh in 2018\textsuperscript{\ref{org641c9d2}}. Participants place asks and bids for
hourly contracts of the following day on the \emph{EPEX Spot Day-ahead Auction}
market until the market closes at 12pm on the day before delivery (see Figure
\ref{fig-electricity-markets}). The day-ahead market plays an important role in
integrating volatile RES into the power system \cite{pape16_are_fundam_enoug}.
Generators forecast the expected generation capacity for the next day and sell
those quantities on the market \cite{karanfil17_role_contin_intrad_elect_market}.
After the market closes, the participants have the opportunity to trade the
difference between the day-ahead forecast and the more precise intraday forecast
(forecasting error) on the intraday market
\cite{kiesel17_econom_analy_intrad_elect_prices}. In this way, RES generators can
cost effectively self-balance their portfolios, instead of relying on balancing
services provided by the TSO, which imposes high imbalance costs on participants
\cite{pape16_are_fundam_enoug}.
On the \emph{EPEX Spot Intraday Continuous} market, electricity products are traded
up until 5 minutes before physical delivery. Hourly contracts, as well as
15-minute and block contracts, can be traded. In contrast to the day-ahead
auction, the intraday market is a continuous order-driven market. Participants
can submit limit orders at any time during the trading window and equally change
or withdraw the order at any time before the order is accepted. Limit orders are
specified as price-quantity pairs \((\Pi{}, \up{})\), where \(\Pi{}\) is the traded
amount of electrical power in kW and \(\up{}\) is the price for the delivered
energy unit (hour/quarter/block) in \(\emwh\). When an order to buy (bid) matches
an order to sell (ask), the trade is executed immediately. The order book is
visible to all participants, so it is known which unmatched orders exist at any
given time. The intraday market has a trading volume of 82 TWh, which is
considerably smaller than the day-ahead market's volume. Despite that, the
intraday market plays a vital role in the stability of the grid. All executed
trades on the intraday market potentially reduce the needed balancing power of
the TSOs and thus can be seen as a type of balancing power
\cite{pape16_are_fundam_enoug}.
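The continuous matching mechanism described above can be illustrated with the
following minimal sketch (a simplification with invented numbers; real exchange
order books also involve time priority, additional order types, and the
contract granularity discussed above): an incoming buy limit order is matched
against the cheapest open asks whose price does not exceed the buy limit.
\begin{minted}{python}
# Minimal continuous-matching sketch for an incoming buy limit order
# (illustrative only; real intraday order books are more complex).
# Open asks: (price in EUR/MWh, quantity in MW).
asks = [(38.0, 5.0), (40.5, 10.0), (44.0, 8.0)]

def match_buy(asks, limit_price, quantity):
    """Match a buy order against open asks; trades execute at the ask price."""
    trades, remaining = [], quantity
    for price, qty in sorted(asks):                 # price priority
        if price > limit_price or remaining <= 0:
            break
        fill = min(qty, remaining)
        trades.append((price, fill))
        remaining -= fill
    return trades, remaining   # an unfilled remainder stays in the book as a bid

trades, rest = match_buy(asks, limit_price=41.0, quantity=12.0)
print(trades, rest)   # [(38.0, 5.0), (40.5, 7.0)] and 0.0 remaining
\end{minted}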
Purchasing electricity on the continuous intraday market is attractive for EV
fleets with uncertain mobility demand. The short time before delivery allows EV
fleet operators to rely on highly accurate forecasts of available battery
capacity, before submitting an order to buy. In this way, they can reliably
charge at a potentially lower price at the intraday market than the regular
industry tariff. In an integrated bidding strategy, EV fleet operators can
(similarly to RES generators) balance out forecast errors of available battery
capacity on the intraday market. For example, trades on the intraday market can
complement bids that were committed earlier to the balancing market.
\subsection{EV Fleet Control in the Smart Grid}
\label{sec:org7039683}
The increasing penetration of EVs has a substantial effect on electricity
consumption patterns. During charging periods, power flows and grid losses
increase considerably and challenge the grid. Operators have to reinforce the
grid to ensure that transformers and substations do not overload
\cite{sioshansi12_impac_elect_tarif_plug_in,lopes11_integ_elect_vehic_elect_power_system}.
Charging multiple EVs in the same neighborhood, or worse, whole EV fleets, can
even cause brown- or blackouts \cite{kim12_carbit}. Despite these challenges, it
is possible to support the physical reinforcement of the grid by adopting smart
charging strategies. In smart charging, EVs get charged when the grid is less
congested to ensure grid stability. Smart charging reduces peaks in electricity
demand (\emph{peak cutting}) and complements the grid in times of low demand
(\emph{valley filling}). Smart charging has been researched thoroughly in the IS
literature; in the following, we outline some of the most important contributions.
\textcite{valogianni14_effec_manag_elect_vehic_storag} found that using
intelligent agents to schedule EV charging substantially reshapes the energy
demand and reduces peak demand without violating individual household
preferences. Moreover, they showed that the proposed smart charging behavior
reduces average energy prices and thus benefit households economically. In
another study, \textcite{kara15_estim_benef_elect_vehic_smart} investigated the
effect of smart charging on public charging stations in California. Controlling
for arrival and departure times, the authors presented beneficial results for
the distribution system operator and the owners of the EVs. Their approach
resulted in a price reduction in energy bills and a peak load reduction of 37\%.
An extension of the smart charging concept is V2G. When equipped with V2G
devices, EVs can discharge their batteries back into the grid. Existing research
has focused on this technology with respect to grid stabilization effects and
arbitrage possibilities. For instance,
\textcite{schill11_elect_vehic_imper_elect_market} showed that the V2G usage of
EVs can decrease average consumer electricity prices. Excess EV battery capacity
can be used to charge in off-peak hours and discharge in peak hours, when the
prices are higher. These arbitrage possibilities reverse welfare effects of
generators and increase the overall welfare and consumer surplus.
\textcite{tomic07_using_fleet_elect_drive_vehic_grid_suppor} found that the
arbitrage opportunities are especially prominent when a high variability in
electricity prices on the target electricity market exists. The authors stated
that short intervals between the contract of sale and the physical delivery of
electricity increase arbitrage benefits. Consequently, ancillary service
markets, like frequency control and operating reserve markets, are attractive
for smart charging.
\textcite{peterson10_econom_using_plug_in_hybrid} investigated energy arbitrage
profitability with V2G in the light of battery depreciation effects in the US.
The results of their study indicate that the large-scale use of EV batteries for
grid storage does not yield enough monetary benefits to incentivize EV owners to
participate in V2G activities. Considering battery depreciation cost, the
authors arrived at an annual profit of only \$6--\$72 per EV.
\textcite{brandt17_evaluat_busin_model_vehic_grid_integ} evaluated a business
model for parking garage operators operating on the German frequency regulation
market. When taking infrastructure costs and battery depreciation costs into
account, they conclude that the proposed vehicle-grid integration is not
profitable. Even with idealized assumptions about EV adoption rates in Germany
and altered auction mechanisms, the authors arrived at negative profits.
\textcite{kahlen17_fleet} used EV fleets to offer balancing services to the grid.
Evaluating the impact of V2G in their model, the authors conclude that V2G would
only be profitable if reserve power prices were twice as high. Considering the
negative results from the studies mentioned above, we did not include the
V2G concept in our research.
In order to maximize profits, it is essential for market participants to develop
successful bidding strategies. Several authors have investigated bidding
strategies to jointly participate in multiple markets:
\textcite{mashhour11_biddin_strat_virtual_power_plant_2} used stationary battery
storage to participate in the spinning reserve market and the day-ahead market
at the same time. The authors developed a non-equilibrium model, which solves
the presented mixed-integer program with genetic programming. Contrarily to
their study, we investigate the use of a model-free RL agent that learns an
optimal policy (i.e., a trading strategy) from actions it takes in the
environment (i.e., bidding on electricity markets).
\textcite{he16_optim_biddin_strat_batter_storag} conducted similar research, in
which they additionally incorporated battery depreciation costs in a profit
maximization model, which proved to be a decisive factor. In contrast to their
study, we jointly participate in the secondary operating reserve and intraday
market with the \emph{non-stationary} storage of EV batteries. Also, we can exclude
battery depreciation from our model, since shared EVs have to satisfy mobility
demand and be charged in any case.
Previous studies often assume that car owners or households can directly trade
on electricity markets. In reality, this is not possible due to the minimum
capacity requirements of the markets, requirements that single EVs do not meet.
For example, the German Control Reserve Market (GCRM) has a minimum trading
capacity of 1 MW to 5 MW, while a typical EV has a charging power of 3.6 kW to
22 kW. \textcite{ketter13_power_tac} introduced the notion of electricity brokers,
aggregators that act on behalf of a group of individuals or households to
participate in electricity markets.
\textcite{brandt17_evaluat_busin_model_vehic_grid_integ} and
\textcite{kahlen14_balan_with_elect_vehic} showed that electricity brokers can
overcome the capacity issues by aggregating EV batteries. In addition to
electricity brokers, we apply the concept of VPPs. VPPs are flexible portfolios
of DER, which are presented with a single load profile to the system operator,
making them eligible for market participation and ancillary service provisioning
\cite{pudjianto07_virtual_power_plant_system_integ}. Hence, VPPs allow providing
balancing power to the markets without knowing which exact sources provide the
promised capacity until the delivery time. This concept is especially useful when
dealing with EV fleets: VPPs enable carsharing providers to issue bids and asks
based on an estimate of available fleet capacity, without knowing beforehand
which exact EVs will provide the capacity at the time of delivery. Based on the
battery level and the availability of EVs, an intelligent agent decides in
real-time which vehicles provide the capacity.
Centrally managed EV fleets make it possible for carsharing providers to use the
presented concepts as a viable business extension. Free float carsharing is a
popular mobility concept that allows cars to be picked up and parked everywhere,
and the customers are billed by the minute. This concept offers flexibility
to its users, saves resources, and reduces carbon emissions
\cite{firnkorn15_free_float_elect_carsh_fleet_smart_cities}. Most previous studies
concerned with the usage of EVs for electricity trading assumed that trips are
fixed and known in advance, e.g., in
\textcite{tomic07_using_fleet_elect_drive_vehic_grid_suppor}. The free float
concept adds uncertainty and nondeterministic behavior, which make predictions
about future rentals a complex issue. \textcite{kahlen17_fleet} showed that it is
possible to use free float carsharing fleets as VPPs to profitably offer
balancing services to the grid. In their study, the authors compared cases from
three different cities across Europe and the US. They used an event-based
simulation, bootstrapped with real-world carsharing and secondary operating
reserve market data from the respective cities. A central dilemma within their
research was to decide whether an EV should be committed to a VPP or free for
rent. Since rental profits are considerably higher than profits from electricity
trading, it is crucial to not allocate an EV to a VPP when it could have been
rented out otherwise. To deal with the asymmetric payoff,
\citeauthor{kahlen17_fleet} used stratified sampling in their classifier. This
method gives rental misclassifications higher weights, reducing the likelihood
of EVs participating in VPP activities. The authors used a random forest
regression model to predict the available balancing capacity on an aggregated
fleet level. Only at the delivery time, the agent decides which individual EVs
provide the regulation capacity. This heuristic is based on the likelihood that
the vehicle is rented out and on its expected rental benefits.
In a similar study, the authors showed that carsharing companies can participate
in day-ahead markets for arbitrage purposes
\cite{kahlen18_elect_vehic_virtual_power_plant_dilem}. In the paper, the authors
used a sinusoidal time-series model to predict the available trading capacity.
Another central problem for carsharing providers is that committed trades that
cannot be fulfilled result in substantial penalties from the system operator
or electricity exchange. In other words, fleet operators have to avoid buying
any amount of electricity that they cannot be sure to charge with their
available EVs at the delivery time. To address this issue, the authors developed
a mean asymmetric weighted objective function. They used it for their
time-series based prediction model to penalize committing an EV to a VPP when it
would have been rented out otherwise. Because of the two issues mentioned above,
\textcite{kahlen18_elect_vehic_virtual_power_plant_dilem} could only make very
conservative estimations and commitments of overall available trading capacity,
resulting in a high amount of missed profits. This effect is especially
prominent when participating in the secondary operating reserve market, since
commitments have to be made one week in advance when mobility demands are still
uncertain. \textcite{kahlen17_fleet} stated that in 42\% to 80\% of the cases, EVs
are not committed to a VPP when it would have been profitable to do so.
This thesis proposes a solution in which the EV fleet participates in the
balancing market and intraday market simultaneously. With this approach, we
align the potentially higher profits on the balancing markets, with more
accurate capacity predictions for intraday markets
\cite{tomic07_using_fleet_elect_drive_vehic_grid_suppor}. This research followed
\textcite{kahlen17_fleet}, who proposed to work on a combination of multiple
markets in the future.
\subsection{Reinforcement Learning Controlled EV Charging \label{sec-back-rl}}
\label{sec:org85a8820}
Previous research showed that intelligent agents equipped with Reinforcement
Learning (RL) methods can successfully take action in the smart grid. The
following chapter outlines different research approaches of RL in the domain of
smart grids. For a more thorough description of RL, including mathematical
formulations and common issues, refer to Chapter \ref{sec-reinforcement-learning}.
\textcite{reddy11_learn_behav_multip_auton_agent,reddy11_strat} used autonomous
broker agents to buy and sell electricity from DER on a \emph{Tariff Market}. The
agents are modelled with a Markov decision process (MDP) and use RL to determine
pricing strategies to profitably participate in the Tariff Market. To control
for a large number of possible states in the domain, the authors used
\emph{Q-Learning} with derived state space features. Based on descriptive statistics,
they defined derived price and market participant features. By engaging with its
environment, the agent learns an optional sequence of actions (policy) based on
the state of the agent.
\textcite{peters13_reinf_learn_approac_to_auton} built on that work and further
enhanced the method by using function approximation. Function approximation
makes it possible to efficiently learn strategies over large state spaces, by learning a
function representation of state values instead of learning single values of
discrete states. By using this technique, the agent can adapt to arbitrary
economic signals from its environment, resulting in better performance than
previous approaches. Moreover, the authors applied feature selection and
regularization methods to explore the agent's adaption to the environment. These
methods are particularly beneficial in smart markets because market design,
structures, and conditions might change in the future. Hence, intelligent agents
should be able to adapt to them \cite{peters13_reinf_learn_approac_to_auton}.
\textcite{vandael15_reinf_learn_heuris_ev_fleet} leveraged learned EV fleet
charging behavior to optimally purchase electricity on the day-ahead market.
Similarly to \textcite{kahlen18_elect_vehic_virtual_power_plant_dilem}, the
problem is framed from the viewpoint of an aggregator that tries to define a
cost-effective day-ahead charging plan in the absence of knowing EV charging
parameters, such as departure time. A crucial point of the study is weighting
low charging prices against costs that have to be paid when an excessive or
insufficient amount of electricity is bought from the market (imbalance costs).
Contrarily, \textcite{kahlen18_elect_vehic_virtual_power_plant_dilem} did not
consider imbalance cost in their model and avoid them by sacrificing customer
mobility in order to balance the market (i.e., not showing the EV as available
for rent when it is providing balancing capacity).
\textcite{vandael15_reinf_learn_heuris_ev_fleet} used a \emph{fitted Q Iteration} to
control for continuous variables in their state and action space. In order to
achieve fast convergence, they additionally optimized the \emph{temperature step}
parameter of the Boltzmann exploration probability.
\textcite{dusparic13_multi} proposed a multi-agent approach for residential demand
response. The authors investigated a setting in which nine EVs were connected to
the same transformer. The RL agents learned to charge at minimal costs, without
overloading the transformer. \textcite{dusparic13_multi} utilized \emph{W-Learning} to
simultaneously learn multiple policies (i.e., objectives such as ensuring
minimum battery charged or ensuring charging at low costs).
\textcite{taylor14_accel_learn_trans_learn} extended this research by employing
Transfer Learning and \emph{Distributed W-Learning} to achieve communication between
the learning processes of the agents in a multi-objective, multi-agent setting.
\textcite{dauer13_market_based_ev_charg_coord} proposed a market-based EV fleet
charging solution. The authors introduced a double-auction call market where
agents trade the available transformer capacity, complying with the minimum
required State of Charge (SoC). The participating EV agents autonomously learn
their bidding strategy with standard \emph{Q-Learning} and discrete state and action
spaces.
\textcite{di13_elect_vehic} presented a multi-agent solution to minimize charging
costs of EVs, a solution that requires neither prior knowledge of electricity
prices nor future price predictions. Similar to
\textcite{dauer13_market_based_ev_charg_coord}, the authors employed standard
\emph{Q-Learning} and the \(\epsilon\)-greedy approach for action selection.
\textcite{vaya14_optim} also proposed a multi-agent approach, in which the
individual EVs are agents that actively place bids in the spot market. Again,
the agents use \emph{Q-Learning}, with an \(\epsilon\)-greedy policy to learn their
optimal bidding strategy. The latter relies on the agent's willingness-to-pay,
which depends on the urgency to charge. State variables, such as SoC, time of
departure and price development on the market, determine the urgency to charge.
The authors compared this approach with a centralized aggregator-based approach
that they developed in another paper \cite{vaya15_optim_biddin_strat_plug_in}.
Compared to the centralized approach, in which the aggregator manages charging
and places bids for the whole fleet, the multi-agent approach causes slightly
higher costs but solves scalability and privacy problems.
\textcite{shi11_real} considered a V2G control problem, assuming real-time
pricing. The authors proposed an online learning algorithm which they modeled as
a discrete-time MDP and solved through \emph{Q-Learning}. The algorithm controls the
V2G actions of the EV and can react to real-time price signals of the market. In
this single-agent approach, the action space comprises only charging,
discharging, and regulation actions. The limited action space makes it
relatively easy to learn an optimal policy.
\textcite{chis16_reinf_learn_based_plug_in} looked at reducing the costs of
charging for a single EV using known day-ahead prices and predicted next-day
prices. A Bayesian ANN was employed for prediction and \emph{fitted Q-Learning} was
used to learn daily charging levels. In their research, the authors used
function approximation and batch reinforcement learning, an offline, model-free
learning method. \textcite{ko18_mobil_aware_vehic_to_grid} proposed a centralized
controller for managing V2G activities in multiple microgrids. The proposed
method considers mobility and electricity demands of autonomous microgrids, as
well as SoC of the EVs. The authors formulated an MDP with discrete state and
action spaces and use standard \emph{Q-Learning} with \(\epsilon\)-greedy policy to
derive an optimal charging policy.
It should be noted that RL methods are not the only solution for problems in the
smart grid; often, basic algorithms and heuristics provide satisfactory results
\cite{vazquez-canteli19_reinf_learn_deman_respon}. Another possibility is offered by
stochastic programming approaches, which rigorously model the problem as an
optimization problem under uncertainty that can be solved analytically; see, for
example, \textcite{pandzic13_offer_model_virtual_power_plant} and
\textcite{nguyen16_biddin}. Uncertainties (e.g., renewable energy generation or
electricity prices) are modeled by a set of scenarios that are based on
historical data. Despite that, we consider RL as an optimal fit for the design
of our proposed intelligent agent. Given the ability to learn user behavior
(e.g., mobility demand) and the flexibility to adapt to the environment (e.g.,
electricity prices), RL methods are a promising way of solving complex
challenges in the smart grid.
\subsection{Reinforcement Learning Theory \label{sec-reinforcement-learning}}
\label{sec:orga3e82d3}
The following chapter will give an overview of the most important RL concepts
and will introduce the corresponding mathematical formulations. If not noted
otherwise, the notation, equations, and explanations are adapted from
\textcite{sutton18_reinf}, the de-facto reference book of RL research.
RL is an agent-based machine learning approach in which the agent learns to
perform an optimal set of actions through interaction with its environment. The
agent's objective is to maximize the rewards it receives based on the actions it
takes. Immediate rewards have to be weighed against long-term cumulative
returns that depend on future actions of the agent. The RL problem is formalized
as Markov Decision Processes (MDPs) which will be introduced in Chapter
\ref{sec-mdp}. A critical task of RL agents is to continuously estimate the value
of the environment's state. State values indicate the long-term desirability of
a state, that is, the total amount of reward the agent can expect to accumulate
in the future, following a learned set of actions, called the policy. Policies
and values are covered in Chapter \ref{sec-policies}, whereas the core
mathematical foundations for evaluating policies and updating value functions
are introduced in Chapter \ref{sec-bellman}. When the model of the environment is
fully known, the learning problem is reduced to a planning problem (see Chapter
\ref{sec-dp}) in which optimal policies can be computed with iterative approaches.
Model-free RL approaches can be applied when rewards and state transitions are
unknown, and the agent's behavior has to be learned from experience (see Chapter
\ref{sec-td-learning}). The last two chapters cover novel methods that solve the RL
problem more efficiently, tackle new challenges and are widely used in practice
and research.
\subsubsection{Markov Decision Processes \label{sec-mdp}}
\label{sec:orgba52f07}
MDPs are a classical formulation of sequential decision making and an idealized
mathematical formulation of the RL problem. They make it possible to derive exact
theoretical statements about the learning problem and possible solutions. Figure
\ref{agent-environment-interaction} depicts the \emph{agent-environment interaction}.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.85\linewidth]{fig/mdp-interaction.png}
\caption[Markov Decision Process]{The agent-environment interaction in a Markov decision process \cite{sutton18_reinf}. \protect\footnotemark \label{agent-environment-interaction}}
\end{figure}
\footnotetext{\textbf{Figure 3.1} from \emph{Reinforcement Learning: An Introduction} by Richard S. Sutton and Andrew G. Barto is licensed under CC BY-NC-ND 2.0 (\url{https://creativecommons.org/licenses/by-nc-nd/2.0/})}
In RL the agent and the environment continuously interact with each other. The
agent takes actions that influence the environment, which in return presents
rewards to the agent. The agent's goal is to maximize rewards over time through
an optimal choice of actions. In each discrete timestep \(t\!=\!0,1,2,...,T\) the
RL agent interacts with the environment, which is perceived by the agent as a
representation, called \emph{state}, \(S_t \in \S\). Based on the state, the agent
selects an \emph{action}, \(A_t\in\A\), and receives a numerical \emph{reward} signal,
\(R_{t+1}\in\R\subset\Re\), in the next timestep. Actions influence immediate
rewards and successive states, and consequently also influence future rewards.
The agent has to continuously make a trade-off between immediate rewards and
delayed rewards to achieve its long-term goal.
The \emph{dynamics} of an MDP are defined by the probability that a state \(s'\in \S\)
and a reward \(r\in\R\) occurs, given the preceding state \(s\in\S\) and action
\(a\in\A\). In \emph{finite} MDPs, the random variables \(R_t\) and \(S_t\) have
well-defined probability density functions, which are solely dependent on the
previous state and the agent's action. Consequently, it is possible to define
(\(\defeq\)) the \emph{dynamics} of the MDP as follows:
\begin{equation} \label{eq-dynamics}
p(s',r|s,a) \defeq \Pr{S_t=s',R_t=r|S_{t-1}=s,A_{t-1}=a},
\end{equation}
for all \(s',s\!\in\!\S\), \(r\!\in\!\R\) and \(a\!\in\!\A\). Note that each possible
value of the state \(S_t\) and reward \(R_t\) depends only on the immediately
preceding state \(S_{t-1}\) and action \(A_{t-1}\). When a state includes all
information of all previous states, the
state possesses the so-called \emph{Markov property}. If not noted otherwise, the
Markov property is assumed throughout the whole chapter. The dynamics function
\eqref{eq-dynamics} allows computing the \emph{state-transition probabilities},
another important characteristic of an MDP:
\begin{equation}
p(s'|s,a) \defeq \Pr{S_t\!=\!s'|S_{t-1}\!=\!s,A_{t-1}\!=\!a} = \sum_{r\in\R}{p(s', r|s, a)},
\end{equation}
for \(s',s\!\in\!\S\), \(r\!\in\!\R\) and \(a\!\in\!\A\).
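For illustration, the following sketch (a toy example of our own; the states,
actions, rewards, and probabilities are invented) encodes the dynamics function
\(p(s',r|s,a)\) of a small finite MDP as a lookup table and derives the
state-transition probabilities \(p(s'|s,a)\) by summing over rewards, as in the
equation above.
\begin{minted}{python}
# Toy finite MDP: dynamics p(s', r | s, a) stored as a lookup table
# (invented numbers for illustration).
# Key: (state, action) -> list of (next_state, reward, probability).
dynamics = {
    ("idle", "charge"):   [("charging", 1.0, 0.9), ("idle", 0.0, 0.1)],
    ("idle", "wait"):     [("idle", 0.0, 1.0)],
    ("charging", "stop"): [("idle", 0.5, 1.0)],
}

def transition_prob(s_next, s, a):
    """p(s'|s,a) = sum over rewards r of p(s', r | s, a)."""
    return sum(p for (sp, r, p) in dynamics[(s, a)] if sp == s_next)

print(transition_prob("charging", "idle", "charge"))  # 0.9
print(transition_prob("idle", "idle", "charge"))      # 0.1
\end{minted}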
The use of a \emph{reward signal} \(R_t\) to formalize the agent's goal is a unique
characteristic of RL. In each timestep, the agent receives the reward as a scalar
value \(R_t\in\Re\). The sole purpose of the RL agent is to maximize the
long-term cumulative reward (as opposed to the immediate reward). The long-term
cumulative reward can also be expressed as the \emph{expected return} \(G_t\):
\begin{equation} \label{eq-expected-return}
\begin{split}
G_t &\defeq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \\
&= \sum_{k=0}^{\infty}{\gamma^k R_{t+k+1}} \\
&= R_{t+1} + \gamma G_{t+1},
\end{split}
\end{equation}
where \(\gamma\), \(0\leq\gamma\leq 1\), is the \emph{discount rate} parameter. The
discount rate determines how ``myopic'' the agent is. If \(\gamma\) approaches 0,
the agent is more concerned with maximizing immediate rewards. On the contrary,
when \(\gamma\!=\!1\), the agent takes future rewards strongly into account; the
agent is ``farsighted''.
\subsubsection{Policies and Value Functions \label{sec-policies}}
\label{sec:org746d36e}
An essential task of almost every RL agent is estimating \emph{value functions}.
These functions describe how ``good'' it is to be in a given state, or how ``good''
it is to perform an action in a given state. More formally, they take a state
\(s\) or a state-action pair \(s,a\) as input and give the expected return \(G_t\) as
output. The expected return is dependent on the actions the agent will take in
the future. Consequently, value functions are formulated with respect to a
\emph{policy} \(\pi\). A policy is a mapping of states to actions; it describes the
probability that an agent performs a certain action dependent on the current
state. More formally, the policy is defined as
\(\pi(a|s)\defeq\Pr{A_t\!=\!a|S_t\!=\!s}\), a probability distribution over all
\(a\!\in\!\A\) for each \(s\!\in\!\S\). The various RL approaches mainly differ in how the
policy is updated, based on the agent's interaction with the environment.
In RL, not only value functions of states but also value functions of
state-action pairs are used. The \emph{state-value function} of policy \(\pi\) is
denoted as \(\vpi(s)\) and is defined as the expected return when starting in \(s\)
and following policy \(\pi\):
\begin{equation}
\vpi(s) \defeq \EE{\pi}{G_t|S_t\!=\!s}, \text{ for all } s\in\S
\end{equation}
The \emph{action-value function} of policy \(\pi\) is denoted as \(\qpi(s,a)\) and is
defined as the expected return when starting in \(s\), taking action \(a\) and
following policy \(\pi\) afterwards:
\begin{equation}
\qpi(s,a) \defeq \EE{\pi}{G_t|S_t\!=\!s, A_t\!=\!a}, \text{ for all } a\in\A, s\in\S
\end{equation}
The \emph{optimal policy} \(\pi_*\) has an expected return greater than or equal
to that of every other policy. The optimal state-value function and optimal action-value
function are defined as follows:
\begin{equation}
\vstar(s) \defeq \max_{\pi} \vpi(s), \text{ for all } s\in\S
\end{equation}
\begin{equation}
\qstar(s,a) \defeq \max_{\pi} \qpi(s,a), \text{ for all } s\in\S, a\in\A
\end{equation}
The \emph{optimal} action-value function describes the expected return when taking
action \(a\) in state \(s\) and following the optimal policy \(\pi_*\) afterwards.
Estimating \(\qstar\) to obtain an optimal policy is a substantial part of RL and
is known as \emph{Q-learning} \cite{watkins92_q_learn}, which is described in
Chapter \ref{sec-td-learning}.
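To connect these definitions, the following sketch (with toy values of our own)
shows how a deterministic greedy policy is read off from a tabular estimate of
the action-value function: in each state, the policy picks the action with the
highest estimated value.
\begin{minted}{python}
# Extracting a greedy (deterministic) policy from a tabular action-value
# estimate Q(s, a) (toy values for illustration).
Q = {
    "idle":     {"charge": 1.2, "wait": 0.4},
    "charging": {"stop": 0.8, "continue": 1.5},
}

# For each state, choose the action with the largest estimated value.
policy = {s: max(actions, key=actions.get) for s, actions in Q.items()}
print(policy)   # {'idle': 'charge', 'charging': 'continue'}
\end{minted}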
\subsubsection{Bellman Equations \label{sec-bellman}}
\label{sec:org18b365c}
A central characteristic of value functions is their recursive relationship
between the state (or action) values. Similar to rewards in
(\ref{eq-expected-return}), current values are related to expected values of
successive states. This relationship is heavily used in RL and has been
formulated as \emph{Bellman equations} \cite{bellman57_dynam_progr}. The Bellman
equation for \(\vpi(s)\) is defined as follows:
\begin{equation} \label{eq-bellman}
\begin{split}
\vpi(s) &\defeq \EE{\pi}{G_t|S_t=s} \\
&= \EE{\pi}{R_{t+1}+\gamma G_{t+1}|S_t\!=\!s} \\
&= \sum_{a}{\pi(a|s)}\sum_{s',r}{p(s',r|s,a)}\bigg[r+\gamma\vpi(s')\bigg],
\end{split}
\end{equation}
where \(a\!\in\!\A\), \(s,s'\!\in\!\S\), \(r\!\in\!\R\). In other words, the value of
a state equals the immediate reward plus the expected value of all possible
successor states, weighted by their probability of occurring. The value function
\(\vpi(s)\) is the unique solution to its Bellman equation. The Bellman equation
of the optimal value function \(v_*\) is called the \emph{Bellman optimality equation}:
\begin{equation} \label{eq-bellman-optimality}
\begin{split}
\vstar(s) &\defeq \max_{a\in\A(s)}q_{\pi_*}(s,a) \\
&= \max_{a}\EE{\pi_*}{R_{t+1}+\gamma G_{t+1}|S_t\!=\!s, A_t\!=\!a} \\
&= \max_{a}\EE{\pi_*}{R_{t+1}+\gamma \vstar(S_{t+1})|S_t\!=\!s, A_t\!=\!a} \\
&= \max_{a}\sum_{s',r}{p(s',r|s,a)}\bigg[r+\gamma\vstar(s')\bigg]
\end{split}
\end{equation}
where \(a\!\in\!\A\), \(s,s'\!\in\!\S\), \(r\!\in\!\R\). In other words, the value of
a state under an optimal policy equals the expected return for the best action
from that state. Note that the Bellman optimality equation does not refer to a
specific policy, it has a unique solution independent from one. It can be seen
as an equation system, which can be solved when the dynamics of the environment
\eqref{eq-dynamics} are fully known. Similar Bellman equations as
\eqref{eq-bellman} and \eqref{eq-bellman-optimality} can also be formed for
\(\qpi(s,a)\) and \(\qstar(s,a)\). Bellman equations form the basis for computing
and approximating value functions and were an important milestone in RL history.
Most RL methods \emph{approximately} solve the Bellman optimality equation by
using experienced state transitions instead of expected transition
probabilities. The most common methods will be explored in the following