Commit a6b4511: add a section on schemas

dimitri-yatsenko committed Aug 20, 2024
1 parent 8563391 commit a6b4511

Showing 3 changed files with 187 additions and 47 deletions.
189 changes: 183 additions & 6 deletions book/02-concepts/00-models.md
@@ -47,7 +47,7 @@ Hierarchical file systems support a range of operations, including:

The hierarchical file system is one of the most familiar data models to scientists, who often think of data primarily in such terms. This model provides an organized way to store and retrieve data, making it easier to manage large collections of files across multiple directories.

## Example: Variables in Programming Languages

The Variable-Based Data Model is fundamental to how most programming languages like JavaScript, C++, R, Julia, and Python handle data. In this model, variables act as containers or references that store data values, allowing programmers to manipulate and interact with data easily:

@@ -125,17 +125,194 @@ DataFrames support a wide range of operations, making them a powerful tool for d…

DataFrames have become an essential tool in modern data analysis, providing a structured yet flexible way to handle and manipulate data. Their ability to work with heterogeneous data types, combined with a rich set of operations, makes them ideal for tasks ranging from simple data exploration to complex data transformations and machine learning preparation. Whether in Python, R, or Julia, DataFrames have become a cornerstone of data science workflows.
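
As a minimal sketch of the model in code (using pandas; the column names and values below are invented for illustration):

```python
import pandas as pd

# A small DataFrame: columns of different types, rows as observations.
trials = pd.DataFrame({
    "subject": ["A", "A", "B", "B"],
    "condition": ["ctrl", "drug", "ctrl", "drug"],
    "response_ms": [312.0, 280.5, 298.2, 305.1],
})

# Typical operations: row filtering and grouped aggregation.
fast = trials[trials["response_ms"] < 300]
mean_by_condition = trials.groupby("condition")["response_ms"].mean()
print(fast)
print(mean_by_condition)
```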

## Example: Document Data Model (JSON and BSON)

The Document Data Model, commonly exemplified by JSON (JavaScript Object Notation), organizes data as key-value pairs within structured documents. This flexible, text-based format is widely used for data interchange between systems, particularly in web applications and APIs.

### History
- **2001**: JSON was specified by Douglas Crockford as a lightweight data interchange format. The goal was to create a simple, human-readable format that could easily be parsed and generated by machines.

- **Mid-2000s**: JSON gained popularity as it was adopted in web applications, particularly with the rise of AJAX (Asynchronous JavaScript and XML), which allowed dynamic content updates in web pages without requiring a full page reload.

- **2006**: JSON was first standardized by the IETF as RFC 4627, which also registered the `application/json` media type.

- **2013**: JSON was standardized as ECMA-404 by Ecma International; the IETF followed with RFC 7159 in 2014 (superseded by RFC 8259 in 2017), solidifying its status as a reliable and widely accepted data format.

- **Present**: JSON is now the de facto standard for data exchange in web APIs, configuration files, and NoSQL databases, due to its simplicity, flexibility, and wide support across programming languages.

### Structure

- **Key-Value Pairs**: The fundamental building block of JSON is the key-value pair. Each key is a string, and it maps to a value, which can be a primitive type (such as a number or string) or a more complex type (such as an object or array).

- **Objects**: Objects in JSON are collections of key-value pairs, enclosed in curly braces `{}`. Each key within an object is unique, and the values can be of any valid JSON type.

- **Arrays**: Arrays are ordered lists of values, enclosed in square brackets `[]`. The values within an array can be of different types, including other arrays or objects, making JSON highly flexible for representing complex data structures.

- **Primitive Types**: JSON supports simple data types such as:
- **Numbers**: Represent both integers and floating-point numbers.
- **Strings**: Text data enclosed in double quotes.
- **Booleans**: `true` or `false`.
- **Null**: Represents an empty or non-existent value.
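
To make these constructs concrete, here is a short Python sketch (the field names are invented for illustration) that parses a document combining all of the building blocks above; it also previews the serialization and parsing operations described in the next subsection:

```python
import json

# A JSON document with key-value pairs, a nested object, an array,
# and every primitive type (number, string, boolean, null).
text = """
{
  "subject_id": "mouse_001",
  "session": 3,
  "duration_min": 42.5,
  "approved": true,
  "notes": null,
  "channels": [1, 2, 3],
  "rig": {"room": "B12", "camera": "top"}
}
"""

doc = json.loads(text)                 # deserialize: JSON text -> Python dict
assert doc["rig"]["camera"] == "top"   # navigate a nested object
print(json.dumps(doc, indent=2))       # serialize: Python dict -> JSON text
```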

### Supported Operations

The Document Data Model supports a variety of operations, including:

- **Data Serialization and Deserialization**: JSON data can be easily converted (serialized) into a string format for storage or transmission and then converted back (deserialized) into objects for use within a program.

- **Nested Structures**: JSON supports nested objects and arrays, allowing for the representation of hierarchical data structures.

- **Data Exchange**: JSON is widely used for transmitting data between a server and a web application, due to its lightweight and readable format.

- **Parsing and Querying**: JSON data can be parsed into native data structures in most programming languages, and tools like JSONPath can be used to query specific parts of a JSON document.

### Common Uses

The JSON data model is widely used in various scenarios, particularly in web development and data interchange:

- **APIs**: JSON is the de facto standard for data exchange in RESTful APIs, enabling communication between clients and servers.
- **Configuration Files**: JSON is often used for configuration files, storing settings in a structured, human-readable format.
- **NoSQL Databases**: Many NoSQL databases, like MongoDB, use a JSON-like format (BSON) to store documents, allowing for flexible schema design and dynamic data storage.

The Document Data Model, with JSON as its most common implementation, offers flexibility and simplicity for handling structured data, making it an ideal choice for many modern applications.

## Example: Key-Value Data Model

The Key-Value Data Model is a simple and efficient way of storing, retrieving, and managing data, where each piece of data is stored as a pair consisting of a unique key and its associated value. This model is particularly popular in scenarios where fast data access and scalability are critical.

### Historical Background

The Key-Value Data Model has its roots in early database systems but gained significant prominence with the rise of NoSQL databases in the late 2000s. As web applications grew in complexity and scale, traditional relational databases struggled to keep up with the demand for fast, distributed, and scalable data storage. This led to the development and adoption of key-value stores, which offered a more flexible and efficient approach to handling large-scale, distributed data.

### Structure

- **Keys**: Unique identifiers that are used to retrieve the associated values. Keys are typically simple data types like strings or integers.

- **Values**: The actual data associated with the key. Values can be of any type, from simple data types like strings and numbers to more complex types like JSON objects or binary data.

The simplicity of this model allows for extremely fast lookups, as the database can quickly find the value associated with a given key without the need for complex queries or joins.

### Supported Operations

The Key-Value Data Model supports a limited but powerful set of operations:

- **Put/Set**: Stores a value associated with a specific key. If the key already exists, its value is updated.

- **Get**: Retrieves the value associated with a specific key. If the key does not exist, the operation returns a null or error.

- **Delete**: Removes the key-value pair from the store, freeing up space and removing the associated data.

- **Existence Check**: Determines whether a specific key exists in the store.

These operations are typically executed in constant time, making key-value stores highly efficient for many applications.
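
The sketch below implements these four operations over a plain Python dict; it is a toy illustration of the interface only, not of the persistence, distribution, or expiration features that systems like Redis or DynamoDB layer on top:

```python
class KeyValueStore:
    """A toy in-memory key-value store illustrating the core operations."""

    def __init__(self):
        self._data = {}  # Python dicts are hash tables: O(1) average lookups

    def put(self, key, value):
        self._data[key] = value               # insert or overwrite

    def get(self, key, default=None):
        return self._data.get(key, default)  # default if the key is absent

    def delete(self, key):
        self._data.pop(key, None)            # remove the pair if present

    def exists(self, key):
        return key in self._data


store = KeyValueStore()
store.put("session:42", {"user": "alice", "expires": 1724112000})
assert store.exists("session:42")
print(store.get("session:42"))
store.delete("session:42")
assert store.get("session:42") is None
```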

### Prominent Implementations

The Key-Value Data Model has been implemented in several prominent systems, particularly in the realm of NoSQL databases:

- **Redis**: An in-memory key-value store known for its speed and support for complex data structures like lists, sets, and hashes. Redis is widely used for caching, real-time analytics, and message brokering.

- **Amazon DynamoDB**: A fully managed key-value and document database service provided by Amazon Web Services (AWS). It is designed for high availability and scalability, making it ideal for web-scale applications.

- **Riak**: A distributed key-value store that emphasizes availability and fault tolerance. Riak is often used in scenarios requiring distributed data storage with a strong emphasis on reliability.

- **Couchbase**: A NoSQL database that combines the simplicity of a key-value store with the power of a document store, supporting both key-value operations and complex queries.

### Common Uses

The Key-Value Data Model is particularly well-suited for:

- **Caching**: Storing frequently accessed data for rapid retrieval, reducing load on the primary database.

- **Session Management**: Maintaining user session information in web applications, where fast access and scalability are essential.

- **Configuration Management**: Storing configuration settings that need to be quickly retrieved by applications at runtime.

- **Real-Time Analytics**: Handling large volumes of data that require fast read and write operations, such as in monitoring and analytics applications.

The Key-Value Data Model’s simplicity, speed, and scalability make it a fundamental tool in modern computing, particularly for applications that require quick access to data and need to scale horizontally across distributed systems.

# Schema vs. Schemaless Data Models

Data models can generally be categorized into two broad types: **structured** and **self-describing** (often referred to as schemaless). These two approaches represent different philosophies in how data structure is defined, managed, and validated.

## Structured Data Models

In structured data models, the structure of the data is defined separately from the data itself. This predefined structure is known as a **schema**. A schema acts as a blueprint for the data, specifying the types of data that can be stored, the relationships between different data elements, and any constraints or rules that must be followed.

- **Schema**: A schema defines the organization of data within the model, including the fields, data types, and relationships between them. It provides a rigid framework that ensures consistency and integrity of the data. For example, in a relational database, the schema would define tables, columns, data types (e.g., integers, strings), and relationships (e.g., foreign keys) between tables.

- **Validation**: One of the key benefits of having a schema is the ability to validate data before it is stored. The schema serves as a gatekeeper, allowing only data that conforms to the predefined structure to be accepted. This ensures that the data remains consistent, predictable, and reliable. If data does not match the schema, it can be rejected or corrected before being saved (see the sketch following this list).

- **Example**: The quintessential example of a structured data model is the **relational database model**, where data is organized into tables with clearly defined columns and relationships. Each table has a schema that dictates what kind of data it can hold, ensuring that every entry conforms to the expected format.
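
As a self-contained sketch of schema-based gatekeeping (using the third-party `jsonschema` package; the schema and records are invented for illustration, and relational databases enforce their schemas in the same spirit):

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# The schema is defined separately from the data and acts as a gatekeeper.
schema = {
    "type": "object",
    "properties": {
        "subject_id": {"type": "string"},
        "age_weeks": {"type": "number", "minimum": 0},
    },
    "required": ["subject_id", "age_weeks"],
}

good = {"subject_id": "mouse_001", "age_weeks": 12}
bad = {"subject_id": "mouse_002"}  # missing a required field

validate(instance=good, schema=schema)   # passes silently
try:
    validate(instance=bad, schema=schema)
except ValidationError as err:
    print("rejected:", err.message)
```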

## Self-Describing (Schemaless) Data Models

In contrast, self-describing or schemaless data models do not require a predefined schema. Instead, the structure of the data is embedded within the data itself, allowing for greater flexibility and adaptability.

- **Self-Describing Structure**: In self-describing data models, each piece of data carries its own structure. This means that the structure of the data can vary from one entry to another, without the need for a strict, overarching schema. The structure is inferred from the data itself, making these models highly flexible and adaptable to changing data requirements.

- **Flexibility**: The primary advantage of self-describing models is their flexibility. Since there is no rigid schema, new types of data can be added or existing structures can be modified without needing to overhaul the entire database. This makes self-describing models particularly useful in environments where the data is diverse or evolving rapidly.

- **Example**: A common example of a self-describing data model is **JSON (JavaScript Object Notation)**. In JSON, data is stored as key-value pairs, where the structure is defined within each data entry. This allows for varying structures within the same dataset, enabling a more dynamic and flexible approach to data management.
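
A small sketch of this flexibility (with invented records): two documents in the same collection carry different fields, and each one describes its own structure:

```python
# No external schema constrains which fields a record may carry.
records = [
    {"subject_id": "mouse_001", "age_weeks": 12},
    {"subject_id": "mouse_002", "genotype": "wt", "implants": ["probe_A"]},
]

for rec in records:
    print(sorted(rec))  # each document reveals its own structure
```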

## Choosing Between Structured and Schemaless Models

The choice between using a structured or schemaless data model often depends on the specific needs of the application:

- **When to Use Structured Models**: If data consistency, integrity, and the ability to enforce rules and relationships are paramount, a structured model with a well-defined schema is ideal. This is common in financial systems, enterprise databases, and other scenarios where data must adhere to strict standards.

- **When to Use Schemaless Models**: If flexibility, rapid development, and the ability to handle diverse or changing data are more important, a self-describing or schemaless model may be better suited. This is typical in web applications, content management systems, and scenarios where the data structure is likely to evolve over time.

Both approaches have their strengths and are often used together in hybrid systems, where some data is managed with a strict schema and other data is stored more flexibly.

# Data Models in Science

While the business world gravitates toward structured data models to strictly enforce business rules, the sciences have been slower to adopt them.

Neuroscience, a field rife with intricate datasets, often sees researchers navigating vast amounts of data while collaborating within extensive, multidisciplinary teams.
Given this complexity, the logical assumption would be that cutting-edge tools for data organization, manipulation, analysis, and querying would be central to their operations.
However, this isn't the prevailing reality.
Despite technological advancements, a significant portion of the scientific community still refrains from employing proper databases for their studies.
The predominant practice is to share data as file repositories, systematically organized into folders under a uniform naming convention.
This leads to the pertinent question: Why this discernible hesitance towards databases?

```{card} Reasons for scientists' reluctance to use databases
Gray *et al.* in their 2005 technical report titled "Scientific Data Management in the Coming Decade" {cite:p}`gray_scientific_2005` delved deep to unearth the reasons underpinning this avoidance:
* Perceived lack of tangible benefits.
* Absence of efficient visualization/plotting tools.
* A belief in the sufficiency of their programming language for data management.
* Incompatibility with specific data types like arrays, spatial data, text, etc.
* Mismatch with their data access patterns.
* Concerns over speed and efficiency.
* Inability to manipulate database data using regular application programs.
* The cost implications of hiring database administrators.
```

These apprehensions are valid.
Traditional database systems were designed primarily with sectors like business, commerce, and web applications in mind, not scientific computing.
For scientists, there's a clear need for a system that offers more—more flexibility, support for unique scientific data types, and capabilities tailored for distributed computation and visualization.

## The Limitations of File-based Systems

The aforementioned concerns naturally lead one to ponder: What, if any, are the drawbacks of simply organizing data as a structured file repository?
When do file systems falter?

Files, in essence, are nothing but sequences of bytes tagged with specific names.
They inherently lack structure or any meta-information.
While they can be systematically arranged with discerning naming conventions into structured folders, the onus of adhering to any structural framework lies externally.
Numerous *data standards*, such as [BIDS](https://bids.neuroimaging.io/) for brain imaging, essentially define their guidelines based on specific file/folder structures.
But therein lies a challenge—the filesystem itself doesn't enforce this structure, necessitating the creation of separate data standards.
The filesystem essentially passes on the challenge of efficient operations to the end processes that engage with them.
To efficiently navigate data organized in files, there's a need for distinct efforts in crafting access patterns, generating indices for swift searches, and scripting common queries.
In scenarios of shared distributed projects, there's also the added logistical challenge of data transfers, ensuring data integrity during concurrent access and modifications, and optimizing data operations.
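
To make this concrete, here is a sketch of the ad-hoc code that a file-based convention forces on every consumer (the BIDS-like folder layout is invented for illustration):

```python
from pathlib import Path

# Hypothetical layout: data/sub-<id>/ses-<n>/recording.dat
root = Path("data")

# A "query": find all sessions for one subject. The structure lives only
# in the path convention; nothing enforces it, and nothing indexes it.
for f in sorted(root.glob("sub-001/ses-*/recording.dat")):
    subject, session = f.parts[1], f.parts[2]  # parsed from the path itself
    print(subject, session)
```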

# Exercises
As an exercise, describe other data models you are familiar with in terms of their basic constructs, operations, and data integrity rules.
For example, what data models are used by the following:

* HDF5 or .MAT files
* Graph databases
* Vector databases
* Document databases, e.g., MongoDB
4 changes: 2 additions & 2 deletions book/02-concepts/01-relational.md
@@ -34,7 +34,7 @@ The concept of relations has a rich history that dates back to the mid-19th century…

```{figure} ../images/demorgan.jpg
:name: Augustus De Morgan
:width: 300px
[Augustus De Morgan](https://en.wikipedia.org/wiki/Augustus_De_Morgan) developed the fundamental concepts of relational theory, including operations on relations.
```

@@ -46,7 +46,7 @@ Cantor's set theory introduced the idea that relations could be seen as subsets…

```{figure} ../images/georg_cantor.jpg
:name: Georg Cantor
:width: 300px
[Georg Cantor](https://en.wikipedia.org/wiki/Georg_Cantor) reframed relations in the context of Set Theory
```