-
Notifications
You must be signed in to change notification settings - Fork 16
/
Copy pathdatabase_theory.theory.txt
348 lines (321 loc) · 25.5 KB
/
database_theory.theory.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
┏━━━━━━━━━━━━━━━━━━━━━┓
┃ DATABASE_THEORY ┃
┗━━━━━━━━━━━━━━━━━━━━━┛
┌──────────┐
│ DBMS │
└──────────┘
SQL ==> #There is an ANSI standard, but it is not freely available
#In practice, every DBMS extends it
DBMS ==> #Database Management Systems
#Types:
# - flat model: one single table
# - Relational (RDBMS): based on relations between tables, i.e. use primary keys and foreign keys.
# Usually based on SQL.
# - Object-oriented (OODBMS) where table can inherit other tables.
# Can also be relational (most modern RDBMS are also OODBMS -> ORDBMS (object-relational))
# - NoSQL: not based on SQL and usually not relational.
# Usually faster and more convenient, but often less ACID-compliant.
# Choices between them depend on structure of data:
# - document-based (if appartenance/contenance relations): based on a JSON object, XML, etc.
# - graph-based (if complex relations that are the main matter): each node has a varying number
# of relations between each other
# - key-value stores (if no relations): only VAR=VAL, but much faster.
# - NewSQL: based on SQL, but which try to be as fast and convenient as NoSQL.
#Architecture can be:
# - client-side
# - client-server
# - distributed (client with multiple servers, also called database virtualization):
# - shared-nothing (replication over nodes)
# - shared-data (division of data over nodes)
#Notable ones:
# - RDBMS:
# - client-side:
# - Sqlite: C library
# - client-server:
# - PostgreSQL: good performance, full of features
# - "Big 3":
# - Microsoft SQL server: uses proprietary T-SQL (similar to PL/PGSQL). Integrates with .NET
# Used with ERP/data warehouse too.
# - Oracle DB: full of features, but proprietary
# Used with ERP/data warehouse too.
# - MySQL: lacks of lot of features, but simple to use
# Mostly used in web development.
# - IBM DB2
# - distributed:
# - Postgres-XC: uses PostgreSQL
# - Hive: uses HiveQL, a SQL-like language. Integrates with Hadoop.
# - noSQL:
# - document-oriented:
# - mongoDB: retains some properties of SQL. Best if lot of write.
# - couchDB: best if lot of read or lot of versioning.
# - key-value:
# - Redis: rapidly changing data with small size.
# - graph:
# - Neo4j
# - column-oriented:
# - HBase: distributed. Integrates well with Hadoop. Best for read.
# - Cassandra: distributed. Best for write.
┌────────────┐
│ DESIGN │
└────────────┘
DATABASE DESIGN ==> #Also called modelling. Can be done at three view levels:
# - conceptual: general terms
# - logical: technical terms.
# Only tables and relationships.
# Where normalization happens.
# - physical: variable names terms.
# RDBMS-specific: includes indexes, constraints, types, triggers, functions, etc..
# Where denormalization happens.
#Entity-relationship (ER) diagrams:
# - done at the logical level
# - types:
# - oval: attribute (column)
# - key is underlined
# - rectangle: entity (table)
# - losange: relationship (between two tables, e.g. a foreign key or (if has attributes)
# relation table):
# - n to m: one X correspond to one or more Y, and vice versa
# - 1 to n (for X to Y): one X correspond to one or more Y, and one Y correspond to one X
# - 1 to 1: one X correspond to one Y, and vice versa
# - double-line mean all individuals of the set participate to the relation.
# Single-line mean contrary.
INTEGRITY ==> # - Entity integrity: no duplicate records for all columns (e.g. primary keys)
# - Domain integrity: enforcing types, constraints on columns
# - Referential integrity: can't delete records used by other records (e.g. foreign keys)
# - User-defined integrity: any other
ACID ==> #Transactions must be (usually done by most RDBMS):
# - Atomic:
# - no half-performed operations, i.e. if part of an operation fails, whole operation fails (including rollback)
# - e.g. transactions
# - implies possibility of more complex retries
# - Consistency:
# - can mean either:
# - data is same across all nodes
# - respects types, constraints, etc. (e.g. column types and constraints)
# - Isolation:
# - multithread-safety, i.e. isolate different threads from each other
# - i.e. requiring read|write locks while operation is ongoing
# - e.g. transactions with MVCC
# - tradeoff with speed: data synchronization
# - Durability:
# - once finished, transaction remains on physical data, even if data failure immediately after
# - i.e. persisting as fast as possible, e.g. flushing caches
# - tradeoff with speed: against caching
CAP THEOREM ==> #A DBMS can only have two of the three:
# - Consistency: data is same across all nodes
# - Availability: can always access data
# - Partition tolerance: if nodes fail, can use another route
#Usually choice is between consistency and availability, in event of a partition
BASE ==> #Alternative to ACID used by many noSQL databases
#Gives up on consistency and durability in order to increase availability, partition tolerance and speed
#Main principle is "eventual consistency":
# - instead of saying data will be ok right now, say that they will be "eventually"
# - i.e. for sure at some point, called "[replica] converging"
# - e.g. at some point, read will always return the same value
# - sometimes called optimistic replication
#Principles are:
# - Basic Availability: means "availability", without consistency guarantee
# - Soft State: means state might change over time even without input (because of "converging")
# - Eventual consistency
┌─────────────────┐
│ DISTRIBUTED │
└─────────────────┘
BACKUPS ==> #Can be either warm (can be done when server is running) or cold.
SEVERAL MACHINES ==> #Using several machines can have several goals:
# - load balancing (distributed database):
# - computing is attributed to several nodes to easily scale demand
# - high availability:
# - reduce server downtime by using several machines: if one fails, another handles requests
# - redundancy/replication:
# - making copies of one machine to other machines
# - can be at different levels:
# - disk-level
# - filesystem-level
# - OS-level
# - database-software-level:
# - machines can be:
# - master (or "read|write")
# - slave (or "standby"):
# - hot standby: real-time replication, can be up with no time and is exact copy
# - warm standby: regular intervals replication, can be up with little recovery time
# - cold standby: less frequent intervals replication, can be up with more time
# - configuration can be:
# - master-slave ("primary-backup", "passive"):
# - only one node can handle requests, and it propagates results on data to other
# nodes.
# - Can also cut into several subdatabases where each has a master-slave with a
# different master.
# - "cascading replication": a slave ("cascading standby", "upstream standby") can
# be a master of another slave ("downstream standby")
# - master-master ("multi-primary", "active"):
# - each node can handle requests, and propagation to other nodes can cause
# concurrency resolution challenges, especially in ACID.
# - can be based on:
# - logs (log shipping)
# - triggers sync (prevent concurrency problem) or async (fix it afterwards)
# - can be:
# - sync., which means statements wait for replication to finish to continue
# (best for high availability)
# - async., which means they don't wait (best for performance)
# - when a computer goes down and another takes in place, this is a failover (automatic) or a
# switchover (manual)
RAS ==> #Reliability/Availability/Serviceability, properties of distributed systems:
# - reliability: frequency of downtimes (not length), e.g. mean time between failures (MTBF) or
# failure rate (no of failures for a specific time)
# - high availability: low downtime (including maintenance) / total time.
# There should be low ETR (estimated time of repair) also known as RTO (recovery time objective)
# - serviceability: notifications of problems, monitoring, easiness to repair, etc.
┌───────────────────┐
│ NORMALIZATION │
└───────────────────┘
NORMALIZATION ==> #In short:
# - no duplicate information. Reason it to prevent:
# - update anomaly: need to update at several places (redundancy) for a single entity
# change -> risk of forget and inconsistency
# - One entity by table, i.e. one primary key by table, all COLs only dependent on it
# (no relations to other COLs). Reason is to prevent:
# - insert anomaly: insert only part of a row (rest is null) because have info only about part of
# the row. So introduce nulls -> harder to maintain.
# - deletion anomaly: delete row will erase part of it that we don't want to erase.
# So can't delete the row.
#Other goals:
# - Minimize database size
# - Simplify/normalize design
# - Scalability (tables are as independant as possible from each other, so easier to delete/add)
#To do so, follow those rules (each imply the previous ones):
# - First normal form (1NF): each record is unique:
# - must have primary key
# - if records are a combination of other tables, e.g. use several foreign keys to identify
# the row, then use them as a primary compound key instead of setting up a surrogate
# primary key
# - only atomic VAL, no multiple VAL... corresponding to the other columns for a specific row.
# One VAL should not be semantically VAL... (so several columns) by:
# - aggregating different types, e.g. a STR with first name + last name
# -> put in separate columns
# - aggregating same types with separators (spaces, slashes, etc.), e.g. a list of things
# separated by comma
# -> create several rows for each.
# - if duplicate values across a column (factor), put in separate table.
# - 2NF:
# - there should be a direct and unique relation between each COL and the primary key and a COL
# (called "functional dependency")
# - contrary is "transitive dependency": KEY => COL => COL2
# -> separate unrelated entities in different tables, and use foreign keys.
# - 3NF:
# - every dependance should be on a simple or unreductible compound primary key
# (called "full functional dependency")
# - 4NF: a row should not imply another row:
# - means a given non-primary COL value imply a full set of COL2 values
# - e.g. Each "Year" (COL) implies two products (COL2) rows:
# "ID,Year,P1" "ID,Year,P2" "ID,Year2,P1" "ID,Year2,P2"
# - it's like a dependency of COL2 on COL ("multivalued dependency")
# - 5NF: no table is (what could be) a join of simpler tables.
DENORMALIZATION ==> #Queries (or views which rewrite as queries) on joined tables can be slow.
#Solution: create tables/columns that represent usual queries:
# - materialized views (better)
# - new columns on existing tables, or new tables referencing existing tables and adding new columns.
#As a consequence, loses normalization, because induce redundancy:
# - consistency must be maintained using triggers, etc.
#Different types of views:
# - Non-materialized views are as normalized as the tables they rely on, since they will be rewritten
# as queries on them.
# - Tables and materialized views on the other hand can be since they are actual data.
#Can also be halfway: creating views on materialized views, or the inverse.
#Makes sense where there are more read than write, i.e. OLAP not for OLTP
DIMENSIONAL MODEL ==> #It is a denormalized model using "multidimensional tables":
# - types of tables:
# - Dimension table: with the actual data
# - Fact table: use several dimension tables (show them only as foreign keys ("dimensions"),
# that are used as a compound primary key) and add columns aggregating them ("measures").
# - the denormalization comes from the fact that any useful column can be put in a fact table, even
# if it is implied/depend on another one, and that is a materialized view or table, not a normal
# view.
# Reason: it is faster than to have to query it through other columns.
# - model subtypes:
# - star schema: each dimension is using one table
# - snowflake schema: the top table of all dimensions are all normalized, so might produce a set
# of tables for one dimension.
#How to choose the right dimensions/fact tables:
# - identify the business process to study, and the entities involved and their context ("grain"),
# e.g. "a line of product on shelves of a retail store"
# - extract possible measurement of each part of the "grain" as dimensions
# - decide which combination of dimensions are interesting as fact tables
┌────────────────────┐
│ DATA WAREHOUSE │
└────────────────────┘
OLAP ==> #Generalisation of multidimensional tables which does not necessary use RDBMS in the backend.
#OLAP cube: concept where data is represented as a n-dimensional cube where here cell is a value:
#so is like a "multidimensional table".
#Uses:
# - roll-up (aggregating, e.g. what fact tables do)
# - drill-down (disaggregating, e.g. looking up dimension tables from fact tables foreign keys)
# - slicing/dicing (filtering the fact table keeping same number of dimensions (dicing) or
# lowering it (slicing))
# - pivoting: when aggregating values according to two dimensions (i.e. viewing as a table),
# switching to other dimensions.
#Types:
# - ROLAP (relational) uses RDBMS in the backend. Usually more scalable and cleaner design.
# - MOLAP (multidimensional) not: they first process the data ("data load") in a specific format.
# Usually use special-purpose design which makes it faster.
# - HOLAP is hybrid.
DATA WAREHOUSE ==> #Database used for reporting and analysis, usually using a simple interface to get key indicators from
#complex data.
#When the interface is very simple with a focus on KPI, this is a "dashboard".
#Data stages:
# - OLTP (online transaction processing, mostly write, normalized):
# - taken from ODS (operational data stores), integrating multiple source to a unique format and
# removing redundancy in real time
# - then to OLTP/"operational systems" which store the data in real time, and upload it at regular
# times (offline data warehouse) or real time (online data warehouse) to the data warehouse.
# - OLAP (online analytical processing, mostly read, denormalized):
# - the data warehouse is the full database (which divides into dimensions and fact tables), and
# the access layer are the fact tables.
# They can be divided according to purposes/end users: those are "data marts".
#For each stage, data is populated with ETL (Extract/Transform/Load):
# - Extract:
# - measurement in real life
# - changing to conform to the format of the DBMS
# - Transform:
# - quality control (checking a set of values fits in the system)
# - changing to conform to the constraints/types of the specific database, e.g. using coded
# values, put things in different tables in right order, generate keys
# - Load: write in database
BUSINESS INTELLIGENCE #Using a data warehouse, and making analysis, data mining, etc. on it
(BI) ==> #ERP (Enterprise resource planning) are applications that rely on BI to serve a specific business
#function (accounting, HR training, supply chain, etc.)
┌──────────────────────┐
│ OBJECTS/COMMANDS │
└──────────────────────┘
KEYS ==> # - A candidate key is a unique ID for each record, usually first COL.
# - If there are several candidate key, the one chosen as reference is the primary key, the others
# are the alternate keys. Otherwise candidate key = primary key.
# - A superkey is like a candidate key, but can be spread across several fields (combination of
# those fields is unique), e.g. is a compound key (by opposition to a simple key).
# - A foreign key is a field in another table referring to a primary key, which become then
# parent|child tables.
# - If parent table = child table, this is a recursive foreign key.
# - A surrogate key is a candidate key which:
# - has no semantic meaning (e.g. UUID)
# - is not visible by users
# - is linked to real-life entity
INDEX ==> #Column copying one or several other columns, but in an efficient way so that it can be looked at to
#search for the other columns faster (e.g. by using a hash, or by reordering the table from a to z)
DDL|DML|DCL|TCL ==> #A statement can be :
# - DDL: changing DATABASE and SCHEMA: create, alter, drop, truncate, comment, etc.
# - DML: changing TABLE and other objects: select, insert, update, delete, lock, explain, etc.
# - DCL: access control: grant, revoke
# - TCL: transactions: commit, rollback, etc.
CRUD ==> #Fondamental write operations: Create, Read, Update, Delete
CDC ==> #"Change data capture". Listening for data changes.
#Can be:
# - push-based:
# - database trigger function to a:
# - queue
# - history table ("log trigger")
# - pull-based:
# - row-level column:
# - lastModified timestamp
# - version number
# - dirty boolean
# - version table, linking to other entities
# - scanning database-specific transaction logs