feat/getting started improvements (#10)
* ignore project files

* reference ScyllaDB Operator instead

* reference ScyllaDB Alternator as admonition

* not relevant

* not relevant

* word replace

* rephrase

* prefer restriction to condition, to describe limiting effect of WHERE clauses

* not relevant

* more accurate

* not accurate

* more accurate

* add a note about upsert behavior of inserts in ScyllaDB

* improve examples in query design

* rephrase - data types have little influence on physical

* improve data types example

* swap good and bad examples

* too advanced for getting started

* clean up new lines

* typos and formatting

* user not message

* add ref to LWT

* add note on update as well

* add note on delete allowing conditional IF
timkoopmans authored Jan 29, 2024
1 parent 976953e commit 699e19b
Showing 19 changed files with 149 additions and 174 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -9,4 +9,5 @@ env/
 venv/
 ENV/
 env.bak/
-venv.bak/
+venv.bak/
+.idea/
80 changes: 15 additions & 65 deletions docs/get-started/data-modeling/best-practices.rst
@@ -18,83 +18,33 @@ Let's consider a scenario with poor partition key selection:

 .. code::

    CREATE TABLE my_keyspace.messages_bad (
-       message_id uuid PRIMARY KEY,
-       user_id uuid,
-       message_text text,
-       created_at timestamp
+       user_id uuid,
+       message_id uuid,
+       message_text text,
+       created_at timestamp,
+       PRIMARY KEY (user_id, message_id)
    );

-In this model, the partition key is chosen as ``message_id``, which is a globally
-unique identifier for each message. This choice results in poor partition key
+In this model, the partition key is chosen as ``user_id``, which is a globally
+unique identifier for each user. This choice results in poor partition key
 selection because it doesn't distribute data evenly across partitions. As
-a result, messages from popular users with many posts will create hot
+a result, messages from popular users with many messages will create hot
 partitions, as all their messages will be concentrated in a single partition.

 A better solution for partition key selection would look like:

 .. code::

    CREATE TABLE my_keyspace.messages_good (
-       user_id uuid,
-       message_id uuid,
-       message_text text,
-       created_at timestamp,
-       PRIMARY KEY (user_id, message_id)
+       message_id uuid PRIMARY KEY,
+       user_id uuid,
+       message_text text,
+       created_at timestamp
    );

-In this improved model, the partition key is chosen as ``user_id``, which is
-the unique identifier for each user. This choice results in even data
+In this improved model, the partition key is chosen as ``message_id``, which is
+the unique identifier for each message. This choice results in even data
 distribution across partitions because each user's messages are distributed
-across multiple partitions based on their ``user_id``. Popular users with many
-posts won't create hot partitions, as their messages are distributed across
-the cluster. This approach ensures that all nodes in the cluster are
-effectively utilized, preventing performance bottlenecks.
-
-**Tombstones and Delete Workloads**
-
-If your workload involves frequent deletes, it’s crucial that you understand
-the implications of tombstones on your read path. Tombstones are markers for
-deleted data and can negatively affect query performance if not managed
-effectively.
-
-Let's consider a data model for storing user messages:
-
-.. code::
-
-   CREATE TABLE my_keyspace.user_messages (
-       user_id uuid,
-       message_id uuid,
-       message_text text,
-       is_deleted boolean,
-       PRIMARY KEY (user_id, message_id)
-   );
-
-In this table, each user can have multiple messages, identified by
-``user_id`` and ``message_id``.
-The ``is_deleted`` column is used to mark messages as deleted (true) or not
-deleted (false). When a user deletes a message, a tombstone is created to mark
-the message as deleted. Tombstones are necessary for data consistency, but can
-negatively affect query performance, especially when there are frequent delete
-operations.
-
-Adjust your compaction strategy to account for tombstones and optimize query
-performance in scenarios with heavy delete operations.
-
-To optimize query performance in scenarios with heavy delete operations, you
-can `adjust the compaction strategy and use TTL <https://opensource.docs.scylladb.com/stable/kb/ttl-facts.html>`_
-(Time-to-Live) to handle tombstones more efficiently. ScyllaDB allows you to
-choose different compaction strategies. In scenarios with heavy delete
-workloads, consider using a compaction strategy that efficiently handles
-tombstones, such as the ``TimeWindowCompactionStrategy``.
-
-.. code::
-
-   ALTER TABLE my_keyspace.user_messages
-   WITH default_time_to_live = 2592000
-   AND compaction = {'class': 'TimeWindowCompactionStrategy', 'base_time_seconds': 86400, 'max_sstable_age_days': 14};
-
-This setup, with a 30-day TTL (``default_time_to_live = 2592000``) and
-a 14-day maximum SSTable age ``('max_sstable_age_days': 14)``, is suited for
-time-sensitive data scenarios where keeping data beyond a month is
-unnecessary, and the most relevant data is always from the last two weeks.
+across multiple partitions. Popular users with many posts won't create hot partitions,
+as their messages are distributed across the cluster. This approach ensures that all
+nodes in the cluster are effectively utilized, preventing performance bottlenecks.
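
To make the trade-off concrete, here is a minimal sketch of the access pattern the good model favors (assuming the ``messages_good`` table from the hunk above; the UUID literals are placeholders):

.. code::

   -- uuid() and toTimestamp(now()) are built-in CQL functions.
   INSERT INTO my_keyspace.messages_good (message_id, user_id, message_text, created_at)
   VALUES (uuid(), 123e4567-e89b-12d3-a456-426655440000, 'hello', toTimestamp(now()));

   -- Each read addresses a single, small partition by message_id,
   -- so load spreads evenly across the cluster.
   SELECT message_text FROM my_keyspace.messages_good
   WHERE message_id = 789e0123-e89b-12d3-a456-426655440000;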
6 changes: 0 additions & 6 deletions docs/get-started/data-modeling/index.rst
@@ -20,9 +20,3 @@ to execute.
    query-design
    schema-design
    best-practices
-
-
-
-
-
-
52 changes: 36 additions & 16 deletions docs/get-started/data-modeling/query-design.rst
@@ -2,51 +2,71 @@
 Query Design
 ====================

-Your data model is heavily influenced by query efficiency. Effective partitioning, clustering columns and denormalization are key considerations for optimizing data access patterns.
+Your data model is heavily influenced by query efficiency. Effective partitioning,
+clustering columns, and denormalization are key considerations for optimizing data
+access patterns.

-The way data is partitioned plays a pivotal role in how it’s accessed. An efficient partitioning strategy ensures that data is evenly distributed across the cluster, minimizing hotspots. For example:
+The way data is partitioned plays a pivotal role in how it’s accessed. An efficient
+partitioning strategy ensures that data is evenly distributed across the cluster,
+minimizing hotspots. For example:

 .. code::

-   CREATE TABLE my_keyspace.user_activities (
+   CREATE TABLE my_keyspace.user_activities_bad (
        user_id uuid,
        activity_date date,
+       log_time timestamp,
        activity_details text,
-       PRIMARY KEY (user_id, activity_date)
+       PRIMARY KEY (user_id, activity_date, log_time)
    );

-In this table, ``user_id`` is the partition key, ensuring activities are grouped by user, and ``activity_date`` is the clustering column, ordering activities within each user's partition.
+In this table, ``user_id`` is the partition key, ensuring activities are
+grouped by user, and ``activity_date`` and ``log_time`` are the clustering columns,
+ordering activities within each user's partition. However, this schema is prone to
+large partition sizes over time, since a user with high activity will create an
+imbalanced cluster.

-Clustering columns dictate the order of rows within a partition. They are crucial for range queries. For example:
+Clustering columns dictate the order of rows within a partition. They are crucial for
+range queries. For example:

 .. code::

-   CREATE TABLE my_keyspace.user_logs (
+   CREATE TABLE my_keyspace.user_activities_good (
        user_id uuid,
+       activity_date date,
        log_time timestamp,
        log_message text,
-       PRIMARY KEY (user_id, log_time)
+       PRIMARY KEY ((user_id, activity_date), log_time)
    );

-Here, logs are ordered by ``log_time`` within each ``user_id`` partition, making it efficient to query logs over a time range for a specific user.
+In this table, the partition key is a combination of ``user_id``
+and ``activity_date``, using a technique called "bucketing". This ensures that there
+is no unbounded growth within a partition, since each partition is bucketed to a
+single date. In addition, logs are ordered by ``log_time`` within each
+``(user_id, activity_date)`` partition, making it efficient to query logs over
+a time range for a specific user.

 Your query design should also be optimized for efficient and effective queries
 to retrieve and manipulate data. Query optimization aims to minimize resource
 usage and latency while achieving maximum throughput.

 Indexing is another important aspect of query design. We have already
 introduced the basic concept of primary keys, which can be made up of two
-parts: the partition key and optional clustering columns. ScyllaDB also
-supports secondary indexes for non-primary key columns. `Secondary indexes <https://opensource.docs.scylladb.com/stable/using-scylla/secondary-indexes.html>_` can
-improve query flexibility, but it’s important to consider their impact on
-performance. For example:
+parts: the partition key and optional clustering columns.
+
+ScyllaDB also supports
+`secondary indexes <https://opensource.docs.scylladb.com/stable/using-scylla/secondary-indexes.html>`_
+for non-primary key columns. Secondary indexes can improve query flexibility, but it’s
+important to consider their impact on performance. For example:

 .. code::

    CREATE INDEX ON my_keyspace.user_activities (activity_date);

-This index allows querying activities by date regardless of the user. However, secondary indexes might lead to additional overhead and should be used when necessary.
+This index allows querying activities by date regardless of the user. However, secondary
+indexes might lead to additional overhead and should be used only when necessary.

-An alternative to secondary indexes, `materialized views <https://opensource.docs.scylladb.com/stable/cql/mv.html>`_ keep a separate, indexed table based on the base table's data. They can be more performant for reads.
+Secondary indexes are built on top of
+`materialized views <https://opensource.docs.scylladb.com/stable/cql/mv.html>`_, which
+keep a separate, indexed table based on the base table's data. They can be more
+performant for reads.

-ScyllaDB supports CQL for querying data. Learning and mastering CQL is crucial for designing queries. For more detailed instructions, please see our `documentation <https://opensource.docs.scylladb.com/stable/cql/>`_.
+ScyllaDB supports CQL for querying data. Learning and mastering CQL is crucial for
+designing queries. For more detailed instructions, please see our
+`documentation <https://opensource.docs.scylladb.com/stable/cql/>`_.
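
As a rough illustration of why bucketing pays off (assuming the ``user_activities_good`` table above; the date, time, and UUID literals are placeholders), a time-range query is confined to one bounded partition:

.. code::

   -- Hits exactly one (user_id, activity_date) partition; rows are
   -- already ordered by log_time, so the range scan is sequential.
   SELECT log_time, log_message
   FROM my_keyspace.user_activities_good
   WHERE user_id = 123e4567-e89b-12d3-a456-426655440000
     AND activity_date = '2024-01-29'
     AND log_time >= '2024-01-29 08:00:00'
     AND log_time < '2024-01-29 12:00:00';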
35 changes: 18 additions & 17 deletions docs/get-started/data-modeling/schema-design.rst
@@ -10,9 +10,9 @@ This further reinforces the concept of adopting a query-first data model.

 **Data Types**

-Selecting the appropriate `data type <https://opensource.docs.scylladb.com/stable/cql/types.html>`_ for your columns is critical for both
-physical storage and logical query performance in your data model. You will
-need to consider factors such as data size, indexing, and sorting.
+Selecting the appropriate `data type <https://opensource.docs.scylladb.com/stable/cql/types.html>`_
+for your columns is critical to your application semantics in your data model.
+You will need to consider factors such as data size, indexing, and sorting.

 Let's say you're designing a table to store information about e-commerce
 products, and one of the attributes you want to capture is the product's price.
@@ -22,20 +22,21 @@ and query performance.

 .. code::

    CREATE TABLE my_keyspace.products (
-       product_id uuid PRIMARY KEY,
-       product_name text,
-       price decimal,
-       description text
+       seller_id uuid,
+       product_id uuid,
+       product_name text,
+       price decimal,
+       description text,
+       PRIMARY KEY (seller_id, price, product_id)
    );

-In this example, for the``price`` column, we've chosen the decimal data type.
-This data type is suitable for storing precise numerical values, such as
-prices, as it preserves decimal precision.
-Choosing decimal over other numeric data types like float or double is
-essential when dealing with financial data to avoid issues with rounding errors.
-
-You can efficiently index and query prices using the decimal data type,
-ensuring fast and precise searches for products within specific price ranges.
-When you need to sort products by price, the decimal data type maintains the
-correct order, even for values with different decimal precision.
+In this example, for the ``price`` column, we've chosen the decimal data type.
+This data type is suitable for storing precise numerical values, such as prices,
+as it preserves decimal precision. Choosing decimal over other numeric data types
+like float or double is essential when dealing with financial data to avoid issues
+with rounding errors.
+
+You can efficiently index and query prices using the decimal data type, ensuring
+fast and precise searches for products within specific price ranges partitioned by
+``seller_id``. When you need to sort products by ``price``, the decimal data type
+maintains the correct order, even for values with different decimal precision.
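
For illustration, a sketch of the query shape this schema is designed for (assuming the ``products`` table above; the UUID literal is a placeholder). Because ``price`` is a clustering column, a range restriction stays inside a single seller's partition:

.. code::

   -- Range scan over the price clustering column within one
   -- seller_id partition; decimal comparisons preserve precision.
   SELECT product_name, price
   FROM my_keyspace.products
   WHERE seller_id = 123e4567-e89b-12d3-a456-426655440000
     AND price >= 9.99 AND price <= 49.99;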
3 changes: 0 additions & 3 deletions docs/get-started/develop-with-scylladb/connect-apps.rst
@@ -105,6 +105,3 @@ To connect your application to ScyllaDB, you need to:
     credentials: {username: 'scylla', password: 'your-awesome-password'},
     // keyspace: 'your_keyspace' // optional
   })
-
-
-
2 changes: 1 addition & 1 deletion docs/get-started/develop-with-scylladb/index.rst
@@ -18,4 +18,4 @@ integrating it with your application.
    run-scylladb
    install-drivers
    connect-apps
-   tutorials-example-projects
+   tutorials-example-projects
8 changes: 5 additions & 3 deletions docs/get-started/develop-with-scylladb/run-scylladb.rst
@@ -17,9 +17,11 @@ Run ScyllaDB in Docker
 Docker simplifies the deployment and management of ScyllaDB. By using Docker
 containers, you can easily create isolated ScyllaDB instances for development,
 testing, and production. Running ScyllaDB in Docker is the simplest way to
-experiment with ScyllaDB, and we highly recommend it. If you intend to run
-ScyllaDB in Docker in production, we recommend following our
-`best practices guide <https://opensource.docs.scylladb.com/stable/operating-scylla/procedures/tips/best-practices-scylla-on-docker.html>`_.
+experiment with ScyllaDB, and we highly recommend it.
+
+If you intend to run ScyllaDB in Docker in production, we recommend using
+`ScyllaDB Operator <https://operator.docs.scylladb.com/stable/>`_,
+which will help you manage ScyllaDB clusters within Kubernetes.

 Running a Single Node
 =======================
docs/get-started/develop-with-scylladb/tutorials-example-projects.rst
@@ -38,5 +38,5 @@ an IoT project connected to ScyllaDB Cloud.

 ML Feature Store
 -----------------------
-Our `Feature Store sample application and tutorial <https://feature-store.scylladb.com/>`_ help you build a real-time feature store with ScyllaDB in Python.
-
+Our `Feature Store sample application and tutorial <https://feature-store.scylladb.com/>`_
+help you build a real-time feature store with ScyllaDB in Python.
7 changes: 0 additions & 7 deletions docs/get-started/index.rst
@@ -13,10 +13,3 @@ and use it as the database for your application.
    query-data/index
    data-modeling/index
    learn-resources/index
-
-
-
-
-
-
-
1 change: 0 additions & 1 deletion docs/get-started/learn-resources/index.rst
@@ -43,4 +43,3 @@ ScyllaDB Blog
 Subscribe to the `ScyllaDB blog <https://www.scylladb.com/blog/>`_
 to be up to date with recent news about the ScyllaDB NoSQL database and
 related technologies.
-
1 change: 0 additions & 1 deletion docs/get-started/query-data/cql.rst
@@ -38,4 +38,3 @@ The output of this command will look something like this:
 See `CQLSh: the CQL shell <https://opensource.docs.scylladb.com/master/cql/cqlsh.html>`_
 for details.
-
25 changes: 15 additions & 10 deletions docs/get-started/query-data/delete-data.rst
@@ -2,13 +2,13 @@
 Deleting Data
 =======================

-Delete data with the ``DELETE`` statement. Be specific with your conditions to
+Delete data with the ``DELETE`` statement. Be specific with your restrictions to
 avoid accidental deletions. For example:

 .. code::

    DELETE FROM my_keyspace.users
-   WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;
+    WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;

 Let's break down the components of this ``DELETE`` statement:

@@ -20,19 +20,24 @@ want to delete data. In this example, you are deleting data from a table named

 **WHERE Clause**

 ``WHERE user_id = 123e4567-e89b-12d3-a456-426655440000``: This part of
-the statement specifies a condition for filtering the rows to be deleted.
+the statement specifies a restriction for filtering the rows to be deleted.

-Including the ``WHERE`` clause with a specific condition is essential to ensure
-that only the rows meeting the condition will be deleted. This is done to
-prevent accidental deletions of all data in the table.
+Including the ``WHERE`` clause with a specific restriction is essential to ensure
+that only the rows meeting the restriction will be deleted. This is done to
+prevent accidental deletions of the wrong data in the table.
+
+.. note::
+
+   Similar to ``INSERT`` and ``UPDATE`` statements, a ``DELETE`` operation can be conditional
+   using ScyllaDB's
+   `Lightweight Transaction <https://opensource.docs.scylladb.com/stable/using-scylla/lwt.html>`_
+   ``IF EXISTS`` clause.

 In summary, the ``DELETE`` statement in ScyllaDB is used to remove existing
-data from a table. Always use a ``WHERE`` clause with a suitable condition to
-target the specific rows you want to delete, and ensure that the condition is
+data from a table. Always use a ``WHERE`` clause with a suitable restriction to
+target the specific rows you want to delete, and ensure that the restriction is
 specific enough to avoid unintended data loss. This approach helps maintain
 data integrity in your ScyllaDB tables.

 See the details about the `DELETE statement <https://opensource.docs.scylladb.com/stable/cql/dml/delete.html>`_
 in the ScyllaDB documentation.
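
A short sketch of the conditional form the note above describes (same placeholder UUID as the earlier example). ``IF EXISTS`` turns the delete into a lightweight transaction: it costs an extra round trip, but reports whether a row was actually removed:

.. code::

   DELETE FROM my_keyspace.users
   WHERE user_id = 123e4567-e89b-12d3-a456-426655440000
   IF EXISTS;

   -- The result contains an [applied] column: true if the row
   -- existed and was deleted, false otherwise.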


3 changes: 2 additions & 1 deletion docs/get-started/query-data/index.rst
@@ -16,5 +16,6 @@ To perform these basic operations, you will use CQL.
    update-data
    delete-data

+.. note::
+
+   If you are looking to query data with a DynamoDB-compatible API, we recommend
+   using `ScyllaDB Alternator <https://opensource.docs.scylladb.com/stable/alternator/getting-started.html>`_.