Data/State/Databases_general/database_theory.theory.txt


                                  ┏━━━━━━━━━━━━━━━━━━━━━┓
                                  ┃   DATABASE_THEORY   ┃
                                  ┗━━━━━━━━━━━━━━━━━━━━━┛

                                  ┌──────────┐
                                  │   DBMS   │
                                  └──────────┘

SQL ==>                 #There is an ANSI standard, but it is not freely available
                        #In practice, every DBMS extends it

DBMS ==>                #Database Management Systems
                        #Types:
                        #  - flat model: one single table
                        #  - Relational (RDBMS): based on relations between tables, i.e. use primary keys and foreign keys.
                        #    Usually based on SQL.
                        #  - Object-oriented (OODBMS) where table can inherit other tables.
                        #    Can also be relational (most modern RDBMS are also OODBMS -> ORDBMS (object-relational))
                        #  - NoSQL: not based on SQL and usually not relational.
                        #    Usually faster and more convenient, but often less ACID-compliant.
                        #    Choices between them depend on structure of data:
                        #      - document-based (if appartenance/contenance relations): based on a JSON object, XML, etc.
                        #      - graph-based (if complex relations that are the main matter): each node has a varying number
                        #        of relations between each other
                        #      - key-value stores (if no relations): only VAR=VAL, but much faster.
                        #  - NewSQL: based on SQL, but which try to be as fast and convenient as NoSQL.
                        #Architecture can be:
                        #  - client-side
                        #  - client-server
                        #  - distributed (client with multiple servers, also called database virtualization):
                        #    - shared-nothing (replication over nodes)
                        #    - shared-data (division of data over nodes)
                        #Notable ones:
                        #  - RDBMS:
                        #    - client-side:
                        #       - Sqlite: C library
                        #    - client-server:
                        #       - PostgreSQL: good performance, full of features
                        #       - "Big 3":
                        #          - Microsoft SQL server: uses proprietary T-SQL (similar to PL/PGSQL). Integrates with .NET
                        #            Used with ERP/data warehouse too.
                        #          - Oracle DB: full of features, but proprietary
                        #            Used with ERP/data warehouse too.
                        #          - MySQL: lacks of lot of features, but simple to use
                        #            Mostly used in web development.
                        #       - IBM DB2
                        #    - distributed:
                        #       - Postgres-XC: uses PostgreSQL
                        #       - Hive: uses HiveQL, a SQL-like language. Integrates with Hadoop.
                        #    - noSQL:
                        #       - document-oriented:
                        #          - mongoDB: retains some properties of SQL. Best if lot of write.
                        #          - couchDB: best if lot of read or lot of versioning.
                        #       - key-value:
                        #          - Redis: rapidly changing data with small size.
                        #       - graph:
                        #          - Neo4j
                        #       - column-oriented:
                        #          - HBase: distributed. Integrates well with Hadoop. Best for read.
                        #          - Cassandra: distributed. Best for write.

                                  ┌────────────┐
                                  │   DESIGN   │
                                  └────────────┘

DATABASE DESIGN ==>     #Also called modelling. Can be done at three view levels:
                        #  - conceptual: general terms
                        #  - logical: technical terms.
                        #    Only tables and relationships.
                        #    Where normalization happens.
                        #  - physical: variable names terms.
                        #    RDBMS-specific: includes indexes, constraints, types, triggers, functions, etc..
                        #    Where denormalization happens.
                        #Entity-relationship (ER) diagrams:
                        #  - done at the logical level
                        #  - types:
                        #     - oval: attribute (column)
                        #        - key is underlined
                        #     - rectangle: entity (table)
                        #     - losange: relationship (between two tables, e.g. a foreign key or (if has attributes)
                        #       relation table):
                        #        - n to m: one X correspond to one or more Y, and vice versa
                        #        - 1 to n (for X to Y): one X correspond to one or more Y, and one Y correspond to one X
                        #        - 1 to 1: one X correspond to one Y, and vice versa
                        #  - double-line mean all individuals of the set participate to the relation.
                        #    Single-line mean contrary.

INTEGRITY ==>           #  - Entity integrity: no duplicate records for all columns (e.g. primary keys)
                        #  - Domain integrity: enforcing types, constraints on columns
                        #  - Referential integrity: can't delete records used by other records (e.g. foreign keys)
                        #  - User-defined integrity: any other

ACID ==>                #Transactions must be (usually done by most RDBMS):
                        #  - Atomic:
                        #     - no half-performed operations, i.e. if part of an operation fails, whole operation fails (including rollback)
                        #     - e.g. transactions
                        #     - implies possibility of more complex retries
                        #  - Consistency:
                        #     - can mean either:
                        #        - data is same across all nodes
                        #        - respects types, constraints, etc. (e.g. column types and constraints)
                        #  - Isolation:
                        #     - multithread-safety, i.e. isolate different threads from each other
                        #     - i.e. requiring read|write locks while operation is ongoing
                        #     - e.g. transactions with MVCC
                        #     - tradeoff with speed: data synchronization
                        #  - Durability:
                        #     - once finished, transaction remains on physical data, even if data failure immediately after
                        #     - i.e. persisting as fast as possible, e.g. flushing caches
                        #     - tradeoff with speed: against caching

CAP THEOREM ==>         #A DBMS can only have two of the three:
                        #  - Consistency: data is same across all nodes
                        #  - Availability: can always access data
                        #  - Partition tolerance: if nodes fail, can use another route
                        #Usually choice is between consistency and availability, in event of a partition

BASE ==>                #Alternative to ACID used by many noSQL databases
                        #Gives up on consistency and durability in order to increase availability, partition tolerance and speed
                        #Main principle is "eventual consistency":
                        #  - instead of saying data will be ok right now, say that they will be "eventually"
                        #  - i.e. for sure at some point, called "[replica] converging"
                        #     - e.g. at some point, read will always return the same value
                        #  - sometimes called optimistic replication
                        #Principles are:
                        #  - Basic Availability: means "availability", without consistency guarantee
                        #  - Soft State: means state might change over time even without input (because of "converging")
                        #  - Eventual consistency

                                  ┌─────────────────┐
                                  │   DISTRIBUTED   │
                                  └─────────────────┘

BACKUPS ==>             #Can be either warm (can be done when server is running) or cold.

SEVERAL MACHINES ==>    #Using several machines can have several goals:
                        #  - load balancing (distributed database):
                        #     - computing is attributed to several nodes to easily scale demand
                        #  - high availability:
                        #     - reduce server downtime by using several machines: if one fails, another handles requests
                        #     - redundancy/replication:
                        #        - making copies of one machine to other machines
                        #        - can be at different levels:
                        #           - disk-level
                        #           - filesystem-level
                        #           - OS-level
                        #           - database-software-level:
                        #              - machines can be:
                        #                 - master (or "read|write")
                        #                 - slave (or "standby"):
                        #                    - hot standby: real-time replication, can be up with no time and is exact copy
                        #                    - warm standby: regular intervals replication, can be up with little recovery time
                        #                    - cold standby: less frequent intervals replication, can be up with more time
                        #              - configuration can be:
                        #                 - master-slave ("primary-backup", "passive"):
                        #                    - only one node can handle requests, and it propagates results on data to other
                        #                      nodes.
                        #                    - Can also cut into several subdatabases where each has a master-slave with a
                        #                      different master.
                        #                    - "cascading replication": a slave ("cascading standby", "upstream standby") can
                        #                      be a master of another slave ("downstream standby")
                        #                 - master-master ("multi-primary", "active"):
                        #                    - each node can handle requests, and propagation to other nodes can cause
                        #                      concurrency resolution challenges, especially in ACID.
                        #              - can be based on:
                        #                 - logs (log shipping)
                        #                 - triggers sync (prevent concurrency problem) or async (fix it afterwards)
                        #              - can be:
                        #                 - sync., which means statements wait for replication to finish to continue
                        #                   (best for high availability)
                        #                 - async., which means they don't wait (best for performance)
                        #           - when a computer goes down and another takes in place, this is a failover (automatic) or a
                        #             switchover (manual)

RAS ==>                 #Reliability/Availability/Serviceability, properties of distributed systems:
                        #  - reliability: frequency of downtimes (not length), e.g. mean time between failures (MTBF) or
                        #    failure rate (no of failures for a specific time)
                        #  - high availability: low downtime (including maintenance) / total time.
                        #    There should be low ETR (estimated time of repair) also known as RTO (recovery time objective)
                        #  - serviceability: notifications of problems, monitoring, easiness to repair, etc.

                                  ┌───────────────────┐
                                  │   NORMALIZATION   │
                                  └───────────────────┘

NORMALIZATION ==>       #In short:
                        #  - no duplicate information. Reason it to prevent:
                        #     - update anomaly: need to update at several places (redundancy) for a single entity
                        #       change -> risk of forget and inconsistency
                        #  - One entity by table, i.e. one primary key by table, all COLs only dependent on it
                        #    (no relations to other COLs). Reason is to prevent:
                        #     - insert anomaly: insert only part of a row (rest is null) because have info only about part of
                        #       the row. So introduce nulls -> harder to maintain.
                        #     - deletion anomaly: delete row will erase part of it that we don't want to erase.
                        #       So can't delete the row.
                        #Other goals:
                        #  - Minimize database size
                        #  - Simplify/normalize design
                        #  - Scalability (tables are as independant as possible from each other, so easier to delete/add)
                        #To do so, follow those rules (each imply the previous ones):
                        #  - First normal form (1NF): each record is unique:
                        #     - must have primary key
                        #        - if records are a combination of other tables, e.g. use several foreign keys to identify
                        #          the row, then use them as a primary compound key instead of setting up a surrogate
                        #          primary key
                        #     - only atomic VAL, no multiple VAL... corresponding to the other columns for a specific row.
                        #       One VAL should not be semantically VAL... (so several columns) by:
                        #        - aggregating different types, e.g. a STR with first name + last name
                        #          -> put in separate columns
                        #        - aggregating same types with separators (spaces, slashes, etc.), e.g. a list of things
                        #          separated by comma
                        #          -> create several rows for each.
                        #     - if duplicate values across a column (factor), put in separate table.
                        #  - 2NF:
                        #    - there should be a direct and unique relation between each COL and the primary key and a COL
                        #      (called "functional dependency")
                        #    - contrary is "transitive dependency": KEY => COL => COL2
                        #      -> separate unrelated entities in different tables, and use foreign keys.
                        #  - 3NF:
                        #    - every dependance should be on a simple or unreductible compound primary key
                        #      (called "full functional dependency")
                        #  - 4NF: a row should not imply another row:
                        #     - means a given non-primary COL value imply a full set of COL2 values
                        #        - e.g. Each "Year" (COL) implies two products (COL2) rows:
                        #          "ID,Year,P1" "ID,Year,P2" "ID,Year2,P1" "ID,Year2,P2"
                        #     - it's like a dependency of COL2 on COL ("multivalued dependency")
                        #  - 5NF: no table is (what could be) a join of simpler tables.

DENORMALIZATION ==>     #Queries (or views which rewrite as queries) on joined tables can be slow.
                        #Solution: create tables/columns that represent usual queries:
                        #  - materialized views (better)
                        #  - new columns on existing tables, or new tables referencing existing tables and adding new columns.
                        #As a consequence, loses normalization, because induce redundancy:
                        #  - consistency must be maintained using triggers, etc.
                        #Different types of views:
                        #  - Non-materialized views are as normalized as the tables they rely on, since they will be rewritten
                        #    as queries on them.
                        #  - Tables and materialized views on the other hand can be since they are actual data.
                        #Can also be halfway: creating views on materialized views, or the inverse.
                        #Makes sense where there are more read than write, i.e. OLAP not for OLTP

DIMENSIONAL MODEL ==>   #It is a denormalized model using "multidimensional tables":
                        #   - types of tables:
                        #      - Dimension table: with the actual data
                        #      - Fact table: use several dimension tables (show them only as foreign keys ("dimensions"),
                        #        that are used as a compound primary key) and add columns aggregating them ("measures").
                        #   - the denormalization comes from the fact that any useful column can be put in a fact table, even
                        #     if it is implied/depend on another one, and that is a materialized view or table, not a normal
                        #     view.
                        #     Reason: it is faster than to have to query it through other columns.
                        #   - model subtypes:
                        #      - star schema: each dimension is using one table
                        #      - snowflake schema: the top table of all dimensions are all normalized, so might produce a set
                        #        of tables for one dimension.
                        #How to choose the right dimensions/fact tables:
                        #   - identify the business process to study, and the entities involved and their context ("grain"),
                        #     e.g. "a line of product on shelves of a retail store"
                        #   - extract possible measurement of each part of the "grain" as dimensions
                        #   - decide which combination of dimensions are interesting as fact tables

                                  ┌────────────────────┐
                                  │   DATA WAREHOUSE   │
                                  └────────────────────┘

OLAP ==>                #Generalisation of multidimensional tables which does not necessary use RDBMS in the backend.
                        #OLAP cube: concept where data is represented as a n-dimensional cube where here cell is a value:
                        #so is like a "multidimensional table".
                        #Uses:
                        #  - roll-up (aggregating, e.g. what fact tables do)
                        #  - drill-down (disaggregating, e.g. looking up dimension tables from fact tables foreign keys)
                        #  - slicing/dicing (filtering the fact table keeping same number of dimensions (dicing) or
                        #    lowering it (slicing))
                        #  - pivoting: when aggregating values according to two dimensions (i.e. viewing as a table),
                        #    switching to other dimensions.
                        #Types:
                        #  - ROLAP (relational) uses RDBMS in the backend. Usually more scalable and cleaner design.
                        #  - MOLAP (multidimensional) not: they first process the data ("data load") in a specific format.
                        #    Usually use special-purpose design which makes it faster.
                        #  - HOLAP is hybrid.

DATA WAREHOUSE ==>      #Database used for reporting and analysis, usually using a simple interface to get key indicators from
                        #complex data.
                        #When the interface is very simple with a focus on KPI, this is a "dashboard".
                        #Data stages:
                        #  - OLTP (online transaction processing, mostly write, normalized):
                        #     - taken from ODS (operational data stores), integrating multiple source to a unique format and
                        #       removing redundancy in real time
                        #     - then to OLTP/"operational systems" which store the data in real time, and upload it at regular
                        #       times (offline data warehouse) or real time (online data warehouse) to the data warehouse.
                        #  - OLAP (online analytical processing, mostly read, denormalized):
                        #     - the data warehouse is the full database (which divides into dimensions and fact tables), and
                        #       the access layer are the fact tables.
                        #       They can be divided according to purposes/end users: those are "data marts".
                        #For each stage, data is populated with ETL (Extract/Transform/Load):
                        #  - Extract:
                        #     - measurement in real life
                        #     - changing to conform to the format of the DBMS
                        #  - Transform:
                        #     - quality control (checking a set of values fits in the system)
                        #     - changing to conform to the constraints/types of the specific database, e.g. using coded
                        #       values, put things in different tables in right order, generate keys
                        #  - Load: write in database

BUSINESS INTELLIGENCE   #Using a data warehouse, and making analysis, data mining, etc. on it
(BI) ==>                #ERP (Enterprise resource planning) are applications that rely on BI to serve a specific business
                        #function (accounting, HR training, supply chain, etc.)

                                  ┌──────────────────────┐
                                  │   OBJECTS/COMMANDS   │
                                  └──────────────────────┘

KEYS ==>                #  - A candidate key is a unique ID for each record, usually first COL.
                        #  - If there are several candidate key, the one chosen as reference is the primary key, the others
                        #    are the alternate keys. Otherwise candidate key = primary key.
                        #  - A superkey is like a candidate key, but can be spread across several fields (combination of
                        #    those fields is unique), e.g. is a compound key (by opposition to a simple key).
                        #  - A foreign key is a field in another table referring to a primary key, which become then
                        #    parent|child tables.
                        #  - If parent table = child table, this is a recursive foreign key.
                        #  - A surrogate key is a candidate key which:
                        #      - has no semantic meaning (e.g. UUID)
                        #      - is not visible by users
                        #      - is linked to real-life entity

INDEX ==>               #Column copying one or several other columns, but in an efficient way so that it can be looked at to
                        #search for the other columns faster (e.g. by using a hash, or by reordering the table from a to z)

DDL|DML|DCL|TCL ==>     #A statement can be :
                        #  - DDL: changing DATABASE and SCHEMA: create, alter, drop, truncate, comment, etc.
                        #  - DML: changing TABLE and other objects: select, insert, update, delete, lock, explain, etc.
                        #  - DCL: access control: grant, revoke
                        #  - TCL: transactions: commit, rollback, etc.
CRUD ==>                #Fondamental write operations: Create, Read, Update, Delete

CDC ==>                 #"Change data capture". Listening for data changes.
                        #Can be:
                        #  - push-based:
                        #     - database trigger function to a:
                        #        - queue
                        #        - history table ("log trigger")
                        #  - pull-based:
                        #     - row-level column:
                        #        - lastModified timestamp
                        #        - version number
                        #        - dirty boolean
                        #     - version table, linking to other entities
                        #     - scanning database-specific transaction logs