feat/getting started improvements #10

Merged (24 commits) on Jan 29, 2024
3 changes: 2 additions & 1 deletion .gitignore
env/
venv/
ENV/
env.bak/
venv.bak/
.idea/
78 changes: 14 additions & 64 deletions docs/get-started/data-modeling/best-practices.rst
Let's consider a scenario with poor partition key selection:
.. code::

    CREATE TABLE my_keyspace.messages_bad (
        user_id uuid,
        message_id uuid,
        message_text text,
        created_at timestamp,
        PRIMARY KEY (user_id, message_id)
    );

In this model, the partition key is chosen as ``user_id``, the unique
identifier for each user. This choice results in poor partition key
selection because it doesn't distribute data evenly across partitions. As
a result, popular users with many messages will create hot partitions,
as all of their messages will be concentrated in a single partition.
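
For example (a sketch; the UUID is illustrative), every read and write for a
given user addresses the same partition, so one busy user concentrates load
on a single replica set:

.. code::

    -- All of this user's messages live in a single partition.
    SELECT message_text, created_at
    FROM my_keyspace.messages_bad
    WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;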

A better solution for partition key selection would look like:

.. code::

    CREATE TABLE my_keyspace.messages_good (
        message_id uuid PRIMARY KEY,
        user_id uuid,
        message_text text,
        created_at timestamp
    );

In this improved model, the partition key is chosen as ``message_id``, which is
the unique identifier for each message. This choice results in even data
distribution because each user's messages are spread across multiple
partitions. Popular users with many messages won't create hot partitions, as
their messages are distributed across the cluster. This approach ensures that
all nodes in the cluster are effectively utilized, preventing performance
bottlenecks.
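
With this layout, a read that addresses a single message goes straight to one
small partition. For example (a sketch; the UUID is illustrative):

.. code::

    -- Point lookup by the partition key of messages_good.
    SELECT user_id, message_text, created_at
    FROM my_keyspace.messages_good
    WHERE message_id = 789e0123-e89b-12d3-a456-426655440111;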

**Tombstones and Delete Workloads**

If your workload involves frequent deletes, it’s crucial that you understand
the implications of tombstones on your read path. Tombstones are markers for
deleted data and can negatively affect query performance if not managed
effectively.

Let's consider a data model for storing user messages:

.. code::

    CREATE TABLE my_keyspace.user_messages (
        user_id uuid,
        message_id uuid,
        message_text text,
        is_deleted boolean,
        PRIMARY KEY (user_id, message_id)
    );

In this table, each user can have multiple messages, identified by
``user_id`` and ``message_id``. The ``is_deleted`` column marks messages as
deleted (true) or not deleted (false). When a message is actually removed
with a ``DELETE`` statement, a tombstone is created to mark the data as
deleted. Tombstones are necessary for data consistency, but they can
negatively affect query performance, especially when there are frequent
delete operations.
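
For instance (a sketch; the UUIDs are illustrative), deleting a single
message writes a row tombstone that subsequent reads of the partition must
skip over until compaction purges it:

.. code::

    DELETE FROM my_keyspace.user_messages
    WHERE user_id = 123e4567-e89b-12d3-a456-426655440000
      AND message_id = 789e0123-e89b-12d3-a456-426655440111;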

To optimize query performance in scenarios with heavy delete operations, you
can `adjust the compaction strategy and use TTL <https://opensource.docs.scylladb.com/stable/kb/ttl-facts.html>`_
(Time-to-Live) to handle tombstones more efficiently. ScyllaDB allows you to
choose among different compaction strategies. For heavy delete workloads,
consider a strategy that handles tombstones efficiently, such as
``TimeWindowCompactionStrategy``.

.. code::

    ALTER TABLE my_keyspace.user_messages
    WITH default_time_to_live = 2592000
    AND compaction = {'class': 'TimeWindowCompactionStrategy',
                      'compaction_window_unit': 'DAYS',
                      'compaction_window_size': 1};

This setup, with a 30-day TTL (``default_time_to_live = 2592000``) and
one-day compaction windows, is suited for time-sensitive data scenarios where
keeping data beyond a month is unnecessary. Once all the data in a window has
expired, its SSTables can be dropped wholesale instead of being compacted row
by row.
6 changes: 0 additions & 6 deletions docs/get-started/data-modeling/index.rst
query-design
schema-design
best-practices






52 changes: 36 additions & 16 deletions docs/get-started/data-modeling/query-design.rst
Query Design
====================

Your data model is heavily influenced by query efficiency. Effective partitioning,
clustering columns, and denormalization are key considerations for optimizing data
access patterns.

The way data is partitioned plays a pivotal role in how it's accessed. An efficient
partitioning strategy ensures that data is evenly distributed across the cluster,
minimizing hotspots. For example:

.. code::

    CREATE TABLE my_keyspace.user_activities_bad (
        user_id uuid,
        activity_date date,
        log_time timestamp,
        activity_details text,
        PRIMARY KEY (user_id, activity_date, log_time)
    );

In this table, ``user_id`` is the partition key, ensuring activities are
grouped by user, and ``activity_date`` and ``log_time`` are clustering columns
that order activities within each user's partition. However, this schema is
prone to ever-growing partitions: a single user with high activity keeps
appending rows to the same partition, creating an imbalanced cluster over time.

Clustering columns dictate the order of rows within a partition. They are crucial for
range queries. For example:

.. code::

    CREATE TABLE my_keyspace.user_activities_good (
        user_id uuid,
        activity_date date,
        log_time timestamp,
        log_message text,
        PRIMARY KEY ((user_id, activity_date), log_time)
    );

In this table, the partition key is a combination of ``user_id`` and
``activity_date``, using a technique called "bucketing". This ensures that
there is no unbounded growth within a partition, since each partition is
bounded to a single date. In addition, logs are ordered by ``log_time`` within
each ``(user_id, activity_date)`` partition, making it efficient to query logs
over a time range for a specific user.
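
As a sketch (the UUID and timestamps are illustrative), a range query then
names the bucket explicitly and is served from a single partition; a query
spanning several days fans out to one partition per day:

.. code::

    SELECT log_time, log_message
    FROM my_keyspace.user_activities_good
    WHERE user_id = 123e4567-e89b-12d3-a456-426655440000
      AND activity_date = '2024-01-29'
      AND log_time >= '2024-01-29 00:00:00'
      AND log_time < '2024-01-29 12:00:00';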

Your queries should also be designed to retrieve and manipulate data
efficiently. Query optimization aims to minimize resource usage and latency
while achieving maximum throughput.

Indexing is another important aspect of query design. We have already
introduced the basic concept of primary keys, which can be made up of two
parts: the partition key and optional clustering columns.

ScyllaDB also supports
`secondary indexes <https://opensource.docs.scylladb.com/stable/using-scylla/secondary-indexes.html>`_
for non-primary key columns. Secondary indexes can improve query flexibility, but it's
important to consider their impact on performance. For example:

.. code::

    CREATE INDEX ON my_keyspace.user_activities_bad (activity_date);

This index allows querying activities by date regardless of the user. However,
secondary indexes introduce additional overhead and should be used only when
necessary.
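
As a sketch (the date is illustrative), the index makes the following query
possible without specifying the partition key, though it is answered through
the index rather than by a direct partition read:

.. code::

    SELECT user_id, log_time, activity_details
    FROM my_keyspace.user_activities_bad
    WHERE activity_date = '2024-01-29';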

Secondary indexes are built on top of
`materialized views <https://opensource.docs.scylladb.com/stable/cql/mv.html>`_, which
keep a separate, automatically maintained table based on the base table's data.
Querying a materialized view directly can be more performant for reads.
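
A minimal sketch of an explicit materialized view serving the same by-date
access pattern (the view name and key order are illustrative):

.. code::

    CREATE MATERIALIZED VIEW my_keyspace.activities_by_date AS
        SELECT user_id, activity_date, log_time, activity_details
        FROM my_keyspace.user_activities_bad
        WHERE activity_date IS NOT NULL
          AND user_id IS NOT NULL
          AND log_time IS NOT NULL
        PRIMARY KEY (activity_date, user_id, log_time);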

ScyllaDB supports CQL for querying data. Learning and mastering CQL is crucial for designing queries.
For more detailed instructions, please see our `documentation <https://opensource.docs.scylladb.com/stable/cql/>`_.
35 changes: 18 additions & 17 deletions docs/get-started/data-modeling/schema-design.rst
This further reinforces the concept of adopting a query-first data model.

**Data Types**

Selecting the appropriate `data type <https://opensource.docs.scylladb.com/stable/cql/types.html>`_
for your columns is critical to your application's semantics and to your data model.
You will need to consider factors such as data size, indexing, and sorting.

Let's say you're designing a table to store information about e-commerce
products, and one of the attributes you want to capture is the product's price.
The data type you choose can significantly impact both storage
and query performance.
.. code::

    CREATE TABLE my_keyspace.products (
        seller_id uuid,
        product_id uuid,
        product_name text,
        price decimal,
        description text,
        PRIMARY KEY (seller_id, price, product_id)
    );

In this example, we've chosen the ``decimal`` data type for the ``price`` column.
This data type is suitable for storing precise numerical values, such as prices,
as it preserves decimal precision. Choosing ``decimal`` over other numeric data
types like ``float`` or ``double`` is essential when dealing with financial data,
to avoid rounding errors.

You can efficiently index and query prices using the ``decimal`` data type, ensuring
fast and precise searches for products within specific price ranges partitioned by
``seller_id``. When you need to sort products by ``price``, the ``decimal`` data type
maintains the correct order, even for values with different decimal precision.
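
For example (a sketch; the UUID and price bounds are illustrative), because
``price`` is the first clustering column, one seller's products can be
range-scanned by price within a single partition:

.. code::

    SELECT product_name, price
    FROM my_keyspace.products
    WHERE seller_id = 123e4567-e89b-12d3-a456-426655440000
      AND price >= 10.00 AND price < 50.00;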
3 changes: 0 additions & 3 deletions docs/get-started/develop-with-scylladb/connect-apps.rst
To connect your application to ScyllaDB, you need to:

        credentials: {username: 'scylla', password: 'your-awesome-password'},
        // keyspace: 'your_keyspace' // optional
    })



2 changes: 1 addition & 1 deletion docs/get-started/develop-with-scylladb/index.rst
run-scylladb
install-drivers
connect-apps
tutorials-example-projects
8 changes: 5 additions & 3 deletions docs/get-started/develop-with-scylladb/run-scylladb.rst
Run ScyllaDB in Docker
======================
Docker simplifies the deployment and management of ScyllaDB. By using Docker
containers, you can easily create isolated ScyllaDB instances for development,
testing, and production. Running ScyllaDB in Docker is the simplest way to
experiment with ScyllaDB, and we highly recommend it.

If you intend to run ScyllaDB in Docker in production, we recommend using
the `ScyllaDB Operator <https://operator.docs.scylladb.com/stable/>`_,
which will help you manage ScyllaDB clusters within Kubernetes.

Running a Single Node
=======================
docs/get-started/develop-with-scylladb/tutorials-example-projects.rst

ML Feature Store
-----------------------

Our `Feature Store sample application and tutorial <https://feature-store.scylladb.com/>`_
help you build a real-time feature store with ScyllaDB in Python.
7 changes: 0 additions & 7 deletions docs/get-started/index.rst
query-data/index
data-modeling/index
learn-resources/index







1 change: 0 additions & 1 deletion docs/get-started/learn-resources/index.rst
ScyllaDB Blog
-------------
Subscribe to the `ScyllaDB blog <https://www.scylladb.com/blog/>`_
to be up to date with recent news about the ScyllaDB NoSQL database and
related technologies.

1 change: 0 additions & 1 deletion docs/get-started/query-data/cql.rst
The output of this command will look something like this:

See `CQLSh: the CQL shell <https://opensource.docs.scylladb.com/master/cql/cqlsh.html>`_
for details.

18 changes: 8 additions & 10 deletions docs/get-started/query-data/delete-data.rst
Deleting Data
=======================

Delete data with the ``DELETE`` statement. Be specific with your restrictions to
avoid accidental deletions. For example:

.. code::

    DELETE FROM my_keyspace.users
    WHERE user_id = 123e4567-e89b-12d3-a456-426655440000;

Let's break down the components of this ``DELETE`` statement:

want to delete data. In this example, you are deleting data from a table named
``users``.

**WHERE Clause**
``WHERE user_id = 123e4567-e89b-12d3-a456-426655440000``: This part of
the statement specifies a restriction for filtering the rows to be deleted.

Including the ``WHERE`` clause with a specific restriction is essential to ensure
that only the rows meeting the restriction will be deleted. This is done to
prevent accidental deletion of the wrong data in the table.

In summary, the ``DELETE`` statement in ScyllaDB is used to remove existing
data from a table. Always use a ``WHERE`` clause with a suitable restriction to
target the specific rows you want to delete, and ensure that the restriction is
specific enough to avoid unintended data loss. This approach helps maintain
data integrity in your ScyllaDB tables.
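
Two useful variations, sketched against the tables from earlier examples
(the UUIDs are illustrative): deleting a single column's value rather than
the whole row, and making the delete conditional with a lightweight
transaction:

.. code::

    -- Remove only the message text; the rest of the row remains.
    DELETE message_text FROM my_keyspace.user_messages
    WHERE user_id = 123e4567-e89b-12d3-a456-426655440000
      AND message_id = 789e0123-e89b-12d3-a456-426655440111;

    -- Apply the delete only if the row exists (a lightweight transaction).
    DELETE FROM my_keyspace.users
    WHERE user_id = 123e4567-e89b-12d3-a456-426655440000
    IF EXISTS;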

See the details about the `DELETE statement <https://opensource.docs.scylladb.com/stable/cql/dml/delete.html>`_
in the ScyllaDB documentation.


3 changes: 2 additions & 1 deletion docs/get-started/query-data/index.rst
To perform these basic operations, you will use CQL.
update-data
delete-data

.. note::

    If you are looking to query data with a DynamoDB-compatible API, we
    recommend using `ScyllaDB Alternator <https://opensource.docs.scylladb.com/stable/alternator/getting-started.html>`_.
10 changes: 8 additions & 2 deletions docs/get-started/query-data/insert-data.rst
the column values. For example:
.. code::

    INSERT INTO my_keyspace.users (user_id, first_name, last_name, age)
    VALUES (123e4567-e89b-12d3-a456-426655440000, 'Polly', 'Partition', 77);


Let's break down the components of this ``INSERT INTO`` statement:
type. ``'Polly'`` and ``'Partition'`` (enclosed in single quotes) are being inserted into
the ``first_name``, ``last_name`` columns. ``77`` is being inserted into
the ``age`` column (without quotes) as it is an ``int`` data type.

.. note::

    Unlike in SQL, ``INSERT INTO`` does not check for the prior existence of
    the row by default: the row is created if none existed before, and
    updated otherwise. This behavior can be changed by using the
    ``IF NOT EXISTS`` clause.
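
As a sketch (reusing the values above), a conditional insert creates the row
only when it is absent and returns whether it was applied:

.. code::

    INSERT INTO my_keyspace.users (user_id, first_name, last_name, age)
    VALUES (123e4567-e89b-12d3-a456-426655440000, 'Polly', 'Partition', 77)
    IF NOT EXISTS;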

In summary, the ``INSERT INTO`` statement in ScyllaDB is used to insert a new
row of data into a specific table within a keyspace. It requires you to specify
the keyspace, table, column names, and the corresponding values that you want
to insert into those columns. This allows you to add data to your tables in
ScyllaDB for subsequent retrieval and querying.

See the details about the `INSERT statement <https://opensource.docs.scylladb.com/stable/cql/dml/insert.html>`_
in the ScyllaDB documentation.