add the get-started documents
This commit adds documents that cover
basic concepts in ScyllaDB and help
developers get started with ScyllaDB.
annastuchlik committed Jan 3, 2024
1 parent d601338 commit 6e070dd
Showing 20 changed files with 1,170 additions and 3 deletions.
100 changes: 100 additions & 0 deletions docs/get-started/data-modeling/best-practices.rst
@@ -0,0 +1,100 @@
====================================
Data Modeling Best Practices
====================================

These additional topics provide a broader perspective on data modeling, query
design, schema design, and best practices when working with ScyllaDB or similar
distributed NoSQL databases.

**Partition Key Selection**

Choose your partition keys to avoid imbalances in your clusters. Imbalanced
partitions can lead to performance bottlenecks, which impact overall cluster
performance. Balancing the distribution of data across partitions is crucial
to ensure all nodes are effectively utilized in your cluster.

Let's consider a scenario with poor partition key selection:

.. code::

   CREATE TABLE my_keyspace.messages_bad (
       message_id uuid PRIMARY KEY,
       user_id uuid,
       message_text text,
       created_at timestamp
   );

In this model, the partition key is ``message_id``, a globally unique
identifier for each message. This is a poor choice for a messaging workload:
each message lands in its own partition, so there is no way to retrieve all
messages for a given user from a single partition. A per-user query must
instead filter across the entire cluster, which is slow and wastes resources.

A better solution for partition key selection would look like:

.. code::

   CREATE TABLE my_keyspace.messages_good (
       user_id uuid,
       message_id uuid,
       message_text text,
       created_at timestamp,
       PRIMARY KEY (user_id, message_id)
   );

In this improved model, the partition key is ``user_id``, so all of a user's
messages are stored together in one partition and can be retrieved with
a single, efficient query. Partitions for different users are distributed
evenly across the cluster, so every node is utilized effectively and
per-user reads avoid cluster-wide filtering. Keep in mind that one extremely
active user still maps to a single partition; if a user can accumulate a very
large number of messages, consider adding a bucketing component (for example,
a time bucket) to the partition key.
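
For example, fetching a user's messages now reads from a single partition.
(A hypothetical query against the table above; bind the user's UUID to the
placeholder.)

.. code::

   SELECT message_id, message_text, created_at
   FROM my_keyspace.messages_good
   WHERE user_id = ?;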

**Tombstones and Delete Workloads**

If your workload involves frequent deletes, it’s crucial that you understand
the implications of tombstones on your read path. Tombstones are markers for
deleted data and can negatively affect query performance if not managed
effectively.

Let's consider a data model for storing user messages:

.. code::

   CREATE TABLE my_keyspace.user_messages (
       user_id uuid,
       message_id uuid,
       message_text text,
       is_deleted boolean,
       PRIMARY KEY (user_id, message_id)
   );

In this table, each user can have multiple messages, identified by
``user_id`` and ``message_id``. The ``is_deleted`` column marks messages as
deleted (``true``) or not deleted (``false``). When a message row is actually
removed with a CQL ``DELETE``, ScyllaDB writes a tombstone to mark the data
as deleted. Tombstones are necessary for consistency in a distributed
database, but they must be read and skipped on every query until compaction
purges them, so frequent delete operations can degrade read performance.
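
For example, deleting a single message with CQL writes a row tombstone
(a hypothetical statement against the table above):

.. code::

   DELETE FROM my_keyspace.user_messages
   WHERE user_id = ? AND message_id = ?;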

To optimize query performance in scenarios with heavy delete operations, you
can `adjust the compaction strategy and use TTL <https://opensource.docs.scylladb.com/stable/kb/ttl-facts.html>`_
(Time-to-Live) to handle tombstones more efficiently. ScyllaDB lets you
choose among several compaction strategies. For heavy delete workloads,
consider a strategy that handles tombstones efficiently, such as
``TimeWindowCompactionStrategy``.

.. code::

   ALTER TABLE my_keyspace.user_messages
   WITH default_time_to_live = 2592000
   AND compaction = {'class': 'TimeWindowCompactionStrategy',
                     'base_time_seconds': 86400,
                     'max_sstable_age_days': 14};

This setup, with a 30-day TTL (``default_time_to_live = 2592000``) and
a 14-day maximum SSTable age (``'max_sstable_age_days': 14``), is suited for
time-sensitive data scenarios where keeping data beyond a month is
unnecessary and the most relevant data is always from the last two weeks.
34 changes: 34 additions & 0 deletions docs/get-started/data-modeling/index.rst
@@ -0,0 +1,34 @@
===============
Data Modeling
===============

Data modeling is the process of defining the structure and relationships of
your data in ScyllaDB. It involves making important decisions about how data
will be organized, stored, and retrieved.

There are several types of data models, including conceptual, logical, and
physical models. Conceptual models focus on high-level business processes,
logical models detail the data structures, and physical models consider how
data is stored on the underlying infrastructure.

Data modeling in a NoSQL database such as ScyllaDB differs from data modeling
in traditional relational databases. To get the most out of ScyllaDB, you
need to emphasize denormalization, scaling, and optimal data access patterns.

A practical approach to data modeling for ScyllaDB is to adopt a query-first
data model, where you design your schema around the queries your application
needs to execute.


.. toctree::
:titlesonly:

query-design
schema-design
best-practices






52 changes: 52 additions & 0 deletions docs/get-started/data-modeling/query-design.rst
@@ -0,0 +1,52 @@
====================
Query Design
====================

Query efficiency heavily influences your data model. Effective partitioning,
clustering columns, and denormalization are key considerations for optimizing
data access patterns.

The way data is partitioned plays a pivotal role in how it’s accessed. An efficient partitioning strategy ensures that data is evenly distributed across the cluster, minimizing hotspots. For example:

.. code::

   CREATE TABLE my_keyspace.user_activities (
       user_id uuid,
       activity_date date,
       activity_details text,
       PRIMARY KEY (user_id, activity_date)
   );

In this table, ``user_id`` is the partition key, ensuring activities are grouped by user, and ``activity_date`` is the clustering column, ordering activities within each user's partition.

Clustering columns dictate the order of rows within a partition. They are crucial for range queries. For example:

.. code::

   CREATE TABLE my_keyspace.user_logs (
       user_id uuid,
       log_time timestamp,
       log_message text,
       PRIMARY KEY (user_id, log_time)
   );

Here, logs are ordered by ``log_time`` within each ``user_id`` partition, making it efficient to query logs over a time range for a specific user.
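
For example, the following hypothetical query reads one ordered slice of
a single user's partition:

.. code::

   SELECT log_time, log_message
   FROM my_keyspace.user_logs
   WHERE user_id = ?
     AND log_time >= '2024-01-01 00:00:00'
     AND log_time <  '2024-01-02 00:00:00';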

Your query design should also be optimized for efficient and effective queries
to retrieve and manipulate data. Query optimization aims to minimize resource
usage and latency while achieving maximum throughput.

Indexing is another important aspect of query design. We have already
introduced the basic concept of primary keys, which can be made up of two
parts: the partition key and optional clustering columns. ScyllaDB also
supports secondary indexes for non-primary key columns. `Secondary indexes <https://opensource.docs.scylladb.com/stable/using-scylla/secondary-indexes.html>`_ can
improve query flexibility, but it’s important to consider their impact on
performance. For example:

.. code::

   CREATE INDEX ON my_keyspace.user_activities (activity_date);

This index allows querying activities by date regardless of the user.
However, secondary indexes add write and storage overhead, so use them only
when the access pattern justifies it.
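
With the index in place, a hypothetical query by date alone becomes possible:

.. code::

   SELECT user_id, activity_details
   FROM my_keyspace.user_activities
   WHERE activity_date = '2024-01-01';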

As an alternative to secondary indexes, `materialized views <https://opensource.docs.scylladb.com/stable/cql/mv.html>`_ maintain a separate, indexed table derived from the base table's data. They can be more performant in certain scenarios.

ScyllaDB supports CQL for querying data. Learning and mastering CQL is crucial for designing queries. For more detailed instructions, please see our `documentation <https://opensource.docs.scylladb.com/stable/cql/>`_.
94 changes: 94 additions & 0 deletions docs/get-started/data-modeling/schema-design.rst
@@ -0,0 +1,94 @@
=======================
Schema Design
=======================

When adopting a query-first data model, the same constraints apply to your schema design. While a schema can evolve to meet changing application needs, certain choices must be made up front to get the most value out of ScyllaDB. This further reinforces the case for a query-first data model.

**Data Types**

Selecting the appropriate `data type <https://opensource.docs.scylladb.com/stable/cql/types.html>`_ for your columns is critical for both
physical storage and logical query performance in your data model. You will
need to consider factors such as data size, indexing, and sorting.

Let's say you're designing a table to store information about e-commerce
products, and one of the attributes you want to capture is the product's price.
The choice of data type for the "price" column is crucial for efficient storage
and query performance.

.. code::

   CREATE TABLE my_keyspace.products (
       product_id uuid PRIMARY KEY,
       product_name text,
       price decimal,
       description text
   );

In this example, we've chosen the ``decimal`` data type for the ``price``
column. This type is suitable for storing precise numerical values, such as
prices, because it preserves decimal precision. Choosing ``decimal`` over
other numeric types such as ``float`` or ``double`` is essential when dealing
with financial data, as it avoids rounding errors.

You can efficiently index and query prices using the decimal data type,
ensuring fast and precise searches for products within specific price ranges.
When you need to sort products by price, the decimal data type maintains the
correct order, even for values with different decimal precision.
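
For example, you could index ``price`` and look up products at an exact
price point. (A hypothetical index; note that range predicates on an indexed
non-key column typically also require ``ALLOW FILTERING``.)

.. code::

   CREATE INDEX ON my_keyspace.products (price);

   SELECT product_id, product_name
   FROM my_keyspace.products
   WHERE price = 19.99;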

**(De)Normalization**

The choice between normalization and denormalization will depend on your
specific use case. A good rule of thumb is that normalization reduces
redundancy but may require more complex queries, while denormalization
simplifies queries yet may increase storage requirements. It is important to
consider the tradeoff between approaches when designing your data model.

Let's consider a scenario where you are designing a data model to manage
information about a library system with two main entities: books and authors.
You have the flexibility to choose between normalized and denormalized approaches.

**Normalized Data Model**

In a normalized data model, you would have separate tables for books and
authors, reducing data redundancy:

.. code::

   CREATE TABLE my_keyspace.authors (
       author_id uuid PRIMARY KEY,
       author_name text
   );

   CREATE TABLE my_keyspace.books (
       book_id uuid PRIMARY KEY,
       title text,
       publication_year int,
       author_id uuid,
       ISBN text
   );

In this normalized model, the ``authors`` table stores information about
authors, and the ``books`` table stores information about books. The
``author_id`` column in the ``books`` table acts as a logical reference to
the ``authors`` table, reducing redundancy. Note that CQL does not enforce
foreign key constraints, so keeping this reference consistent is the
application's responsibility.
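
For example, resolving a book's author name takes two queries, the second
using the ``author_id`` returned by the first (hypothetical statements):

.. code::

   SELECT title, author_id FROM my_keyspace.books WHERE book_id = ?;
   SELECT author_name FROM my_keyspace.authors WHERE author_id = ?;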

**Denormalized Data Model**

In a denormalized data model, you would combine some data to simplify queries,
even though it may lead to redundancy:

.. code::

   CREATE TABLE my_keyspace.books_and_authors (
       book_id uuid PRIMARY KEY,
       title text,
       publication_year int,
       author_name text,
       ISBN text
   );

In this denormalized model, the ``books_and_authors`` table combines
information from both ``books`` and ``authors`` into a single table.
The ``author_name`` column directly stores the author's name, eliminating
the need for foreign key references.
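
For example, the same lookup against the denormalized table is a single
query (hypothetical statement):

.. code::

   SELECT title, author_name, publication_year
   FROM my_keyspace.books_and_authors
   WHERE book_id = ?;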

110 changes: 110 additions & 0 deletions docs/get-started/develop-with-scylladb/connect-apps.rst
@@ -0,0 +1,110 @@
=======================
Connect an Application
=======================

To connect your application to ScyllaDB, you need to:

#. :doc:`Install the relevant driver </get-started/develop-with-scylladb/install-drivers>`
for your application language.

This step involves setting up a driver that is compatible with ScyllaDB.
The driver acts as the link between your application and ScyllaDB, enabling
your application to communicate with the database.

#. Modify your application code to connect via the driver.

   The following boilerplate code in various languages will help familiarize
   you with connecting your application to ScyllaDB via a driver. For a
   detailed walkthrough of building a fictional media player application with
   code examples, please see our
   `Getting Started tutorial <https://cloud-getting-started.scylladb.com/stable/getting-started.html>`_.

.. tabs::

   .. group-tab:: Rust

      .. code-block:: rust

         use anyhow::Result;
         use scylla::{Session, SessionBuilder};
         use std::time::Duration;

         #[tokio::main]
         async fn main() -> Result<()> {
             // Build a session: contact node, connection timeout,
             // and credentials.
             let session: Session = SessionBuilder::new()
                 .known_nodes(&["localhost"])
                 .connection_timeout(Duration::from_secs(30))
                 .user("scylla", "your-awesome-password")
                 .build()
                 .await?;
             Ok(())
         }

   .. group-tab:: Go

      .. code-block:: go

         package main

         import (
             "github.com/gocql/gocql"
             "github.com/scylladb/gocqlx/v2"
         )

         func main() {
             // Configure the contact point and credentials, then open a session.
             cluster := gocql.NewCluster("localhost")
             cluster.Authenticator = gocql.PasswordAuthenticator{Username: "scylla", Password: "your-awesome-password"}
             session, err := gocqlx.WrapSession(cluster.CreateSession())
             if err != nil {
                 panic(err)
             }
             defer session.Close()
         }

   .. group-tab:: Java

      .. code-block:: java

         import com.datastax.driver.core.Cluster;
         import com.datastax.driver.core.PlainTextAuthProvider;
         import com.datastax.driver.core.Session;

         class Main {
             public static void main(String[] args) {
                 // Connect with a contact point and credentials.
                 Cluster cluster = Cluster.builder()
                         .addContactPoints("localhost")
                         .withAuthProvider(new PlainTextAuthProvider("scylla", "your-awesome-password"))
                         .build();
                 Session session = cluster.connect();
             }
         }

   .. group-tab:: Python

      .. code-block:: python

         from cassandra.cluster import Cluster
         from cassandra.auth import PlainTextAuthProvider

         # Configure contact points and credentials, then connect.
         cluster = Cluster(
             contact_points=["localhost"],
             auth_provider=PlainTextAuthProvider(username='scylla',
                                                 password='your-awesome-password'),
         )
         session = cluster.connect()

   .. group-tab:: JavaScript

      .. code-block:: javascript

         const cassandra = require('cassandra-driver');

         // Configure contact points, local data center, and credentials.
         const cluster = new cassandra.Client({
             contactPoints: ["localhost", ...],
             localDataCenter: 'your-data-center',
             credentials: {username: 'scylla', password: 'your-awesome-password'},
             // keyspace: 'your_keyspace' // optional
         })
