add the get-started documents
This commit adds documents that cover
basic concepts in ScyllaDB and help
developers get started with ScyllaDB.
annastuchlik committed Jan 3, 2024
1 parent d601338 commit 6e070dd
Showing 20 changed files with 1,170 additions and 3 deletions.
100 changes: 100 additions & 0 deletions docs/get-started/data-modeling/best-practices.rst
@@ -0,0 +1,100 @@
====================================
Data Modeling Best Practices
====================================

These additional topics provide a broader perspective on data modeling, query
design, schema design, and best practices when working with ScyllaDB or similar
distributed NoSQL databases.

**Partition Key Selection**

Choose your partition keys to avoid imbalances in your clusters. Imbalanced
partitions can lead to performance bottlenecks, which impact overall cluster
performance. Balancing the distribution of data across partitions is crucial
to ensure all nodes are effectively utilized in your cluster.

Let's consider a scenario with poor partition key selection:

.. code::

   CREATE TABLE my_keyspace.messages_bad (
       message_id uuid PRIMARY KEY,
       user_id uuid,
       message_text text,
       created_at timestamp
   );

In this model, the partition key is ``message_id``, a globally unique
identifier for each message. This is a poor choice for a messaging workload:
each message lands in its own partition, so there is no way to retrieve all
messages for a given user from a single partition. A per-user query must
instead filter across the entire cluster, which is slow and wastes resources.

A better solution for partition key selection would look like:

.. code::

   CREATE TABLE my_keyspace.messages_good (
       user_id uuid,
       message_id uuid,
       message_text text,
       created_at timestamp,
       PRIMARY KEY (user_id, message_id)
   );

In this improved model, the partition key is ``user_id``, so all of a user's
messages are stored together in one partition and can be retrieved with
a single, efficient query. Partitions for different users are distributed
evenly across the cluster, so every node is utilized effectively and
per-user reads avoid cluster-wide filtering. Keep in mind that one extremely
active user still maps to a single partition; if a user can accumulate a very
large number of messages, consider adding a bucketing component (for example,
a time bucket) to the partition key.
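
For example, fetching a user's messages now reads from a single partition.
(A hypothetical query against the table above; bind the user's UUID to the
placeholder.)

.. code::

   SELECT message_id, message_text, created_at
   FROM my_keyspace.messages_good
   WHERE user_id = ?;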

**Tombstones and Delete Workloads**

If your workload involves frequent deletes, it’s crucial that you understand
the implications of tombstones on your read path. Tombstones are markers for
deleted data and can negatively affect query performance if not managed
effectively.

Let's consider a data model for storing user messages:

.. code::

   CREATE TABLE my_keyspace.user_messages (
       user_id uuid,
       message_id uuid,
       message_text text,
       is_deleted boolean,
       PRIMARY KEY (user_id, message_id)
   );

In this table, each user can have multiple messages, identified by
``user_id`` and ``message_id``. The ``is_deleted`` column marks messages as
deleted (``true``) or not deleted (``false``). When a message row is actually
removed with a CQL ``DELETE``, ScyllaDB writes a tombstone to mark the data
as deleted. Tombstones are necessary for consistency in a distributed
database, but they must be read and skipped on every query until compaction
purges them, so frequent delete operations can degrade read performance.
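
For example, deleting a single message with CQL writes a row tombstone
(a hypothetical statement against the table above):

.. code::

   DELETE FROM my_keyspace.user_messages
   WHERE user_id = ? AND message_id = ?;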

To optimize query performance in scenarios with heavy delete operations, you
can `adjust the compaction strategy and use TTL <https://opensource.docs.scylladb.com/stable/kb/ttl-facts.html>`_
(Time-to-Live) to handle tombstones more efficiently. ScyllaDB lets you
choose among several compaction strategies. For heavy delete workloads,
consider a strategy that handles tombstones efficiently, such as
``TimeWindowCompactionStrategy``.

.. code::

   ALTER TABLE my_keyspace.user_messages
   WITH default_time_to_live = 2592000
   AND compaction = {'class': 'TimeWindowCompactionStrategy',
                     'base_time_seconds': 86400,
                     'max_sstable_age_days': 14};

This setup, with a 30-day TTL (``default_time_to_live = 2592000``) and
a 14-day maximum SSTable age (``'max_sstable_age_days': 14``), is suited for
time-sensitive data scenarios where keeping data beyond a month is
unnecessary and the most relevant data is always from the last two weeks.
34 changes: 34 additions & 0 deletions docs/get-started/data-modeling/index.rst
@@ -0,0 +1,34 @@
===============
Data Modeling
===============

Data modeling is the process of defining the structure and relationships of
your data in ScyllaDB. It involves making important decisions about how data
will be organized, stored, and retrieved.

There are several types of data models, including conceptual, logical, and
physical models. Conceptual models focus on high-level business processes,
logical models detail the data structures, and physical models consider how
data is stored on the underlying infrastructure.

Data modeling in a NoSQL database such as ScyllaDB differs from data modeling
in traditional relational databases. To get the most out of ScyllaDB, you
need to emphasize denormalization, scaling, and optimal data access patterns.

A practical approach to data modeling for ScyllaDB is to adopt a query-first
data model, where you design your schema around the queries your application
needs to execute.


.. toctree::
:titlesonly:

query-design
schema-design
best-practices






52 changes: 52 additions & 0 deletions docs/get-started/data-modeling/query-design.rst
@@ -0,0 +1,52 @@
====================
Query Design
====================

Query efficiency heavily influences your data model. Effective partitioning,
clustering columns, and denormalization are key considerations for optimizing
data access patterns.

The way data is partitioned plays a pivotal role in how it’s accessed. An efficient partitioning strategy ensures that data is evenly distributed across the cluster, minimizing hotspots. For example:

.. code::

   CREATE TABLE my_keyspace.user_activities (
       user_id uuid,
       activity_date date,
       activity_details text,
       PRIMARY KEY (user_id, activity_date)
   );

In this table, ``user_id`` is the partition key, ensuring activities are grouped by user, and ``activity_date`` is the clustering column, ordering activities within each user's partition.

Clustering columns dictate the order of rows within a partition. They are crucial for range queries. For example:

.. code::

   CREATE TABLE my_keyspace.user_logs (
       user_id uuid,
       log_time timestamp,
       log_message text,
       PRIMARY KEY (user_id, log_time)
   );

Here, logs are ordered by ``log_time`` within each ``user_id`` partition, making it efficient to query logs over a time range for a specific user.
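
For example, the following hypothetical query reads one ordered slice of
a single user's partition:

.. code::

   SELECT log_time, log_message
   FROM my_keyspace.user_logs
   WHERE user_id = ?
     AND log_time >= '2024-01-01 00:00:00'
     AND log_time <  '2024-01-02 00:00:00';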

Your query design should also be optimized for efficient and effective queries
to retrieve and manipulate data. Query optimization aims to minimize resource
usage and latency while achieving maximum throughput.

Indexing is another important aspect of query design. We have already
introduced the basic concept of primary keys, which can be made up of two
parts: the partition key and optional clustering columns. ScyllaDB also
supports secondary indexes for non-primary key columns. `Secondary indexes <https://opensource.docs.scylladb.com/stable/using-scylla/secondary-indexes.html>`_ can
improve query flexibility, but it’s important to consider their impact on
performance. For example:

.. code::

   CREATE INDEX ON my_keyspace.user_activities (activity_date);

This index allows querying activities by date regardless of the user.
However, secondary indexes add write and storage overhead, so use them only
when the access pattern justifies it.
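
With the index in place, a hypothetical query by date alone becomes possible:

.. code::

   SELECT user_id, activity_details
   FROM my_keyspace.user_activities
   WHERE activity_date = '2024-01-01';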

As an alternative to secondary indexes, `materialized views <https://opensource.docs.scylladb.com/stable/cql/mv.html>`_ maintain a separate, indexed table derived from the base table's data. They can be more performant in certain scenarios.

ScyllaDB supports CQL for querying data. Learning and mastering CQL is crucial for designing queries. For more detailed instructions, please see our `documentation <https://opensource.docs.scylladb.com/stable/cql/>`_.
94 changes: 94 additions & 0 deletions docs/get-started/data-modeling/schema-design.rst
@@ -0,0 +1,94 @@
=======================
Schema Design
=======================

When adopting a query-first data model, the same constraints apply to your schema design. While a schema can evolve to meet changing application needs, certain choices must be made up front to get the most value out of ScyllaDB. This further reinforces the case for a query-first data model.

**Data Types**

Selecting the appropriate `data type <https://opensource.docs.scylladb.com/stable/cql/types.html>`_ for your columns is critical for both
physical storage and logical query performance in your data model. You will
need to consider factors such as data size, indexing, and sorting.

Let's say you're designing a table to store information about e-commerce
products, and one of the attributes you want to capture is the product's price.
The choice of data type for the "price" column is crucial for efficient storage
and query performance.

.. code::

   CREATE TABLE my_keyspace.products (
       product_id uuid PRIMARY KEY,
       product_name text,
       price decimal,
       description text
   );

In this example, we've chosen the ``decimal`` data type for the ``price``
column. This type is suitable for storing precise numerical values, such as
prices, because it preserves decimal precision. Choosing ``decimal`` over
other numeric types such as ``float`` or ``double`` is essential when dealing
with financial data, as it avoids rounding errors.

You can efficiently index and query prices using the decimal data type,
ensuring fast and precise searches for products within specific price ranges.
When you need to sort products by price, the decimal data type maintains the
correct order, even for values with different decimal precision.
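
For example, you could index ``price`` and look up products at an exact
price point. (A hypothetical index; note that range predicates on an indexed
non-key column typically also require ``ALLOW FILTERING``.)

.. code::

   CREATE INDEX ON my_keyspace.products (price);

   SELECT product_id, product_name
   FROM my_keyspace.products
   WHERE price = 19.99;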

**(De)Normalization**

The choice between normalization and denormalization will depend on your
specific use case. A good rule of thumb is that normalization reduces
redundancy but may require more complex queries, while denormalization
simplifies queries yet may increase storage requirements. It is important to
consider the tradeoff between approaches when designing your data model.

Let's consider a scenario where you are designing a data model to manage
information about a library system with two main entities: books and authors.
You have the flexibility to choose between normalized and denormalized approaches.

**Normalized Data Model**

In a normalized data model, you would have separate tables for books and
authors, reducing data redundancy:

.. code::

   CREATE TABLE my_keyspace.authors (
       author_id uuid PRIMARY KEY,
       author_name text
   );

   CREATE TABLE my_keyspace.books (
       book_id uuid PRIMARY KEY,
       title text,
       publication_year int,
       author_id uuid,
       ISBN text
   );

In this normalized model, the ``authors`` table stores information about
authors, and the ``books`` table stores information about books. The
``author_id`` column in the ``books`` table acts as a logical reference to
the ``authors`` table, reducing redundancy. Note that CQL does not enforce
foreign key constraints, so keeping this reference consistent is the
application's responsibility.
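
For example, resolving a book's author name takes two queries, the second
using the ``author_id`` returned by the first (hypothetical statements):

.. code::

   SELECT title, author_id FROM my_keyspace.books WHERE book_id = ?;
   SELECT author_name FROM my_keyspace.authors WHERE author_id = ?;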

**Denormalized Data Model**

In a denormalized data model, you would combine some data to simplify queries,
even though it may lead to redundancy:

.. code::

   CREATE TABLE my_keyspace.books_and_authors (
       book_id uuid PRIMARY KEY,
       title text,
       publication_year int,
       author_name text,
       ISBN text
   );

In this denormalized model, the ``books_and_authors`` table combines
information from both ``books`` and ``authors`` into a single table.
The ``author_name`` column directly stores the author's name, eliminating
the need for foreign key references.
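
For example, the same lookup against the denormalized table is a single
query (hypothetical statement):

.. code::

   SELECT title, author_name, publication_year
   FROM my_keyspace.books_and_authors
   WHERE book_id = ?;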

110 changes: 110 additions & 0 deletions docs/get-started/develop-with-scylladb/connect-apps.rst
@@ -0,0 +1,110 @@
=======================
Connect an Application
=======================

To connect your application to ScyllaDB, you need to:

#. :doc:`Install the relevant driver </get-started/develop-with-scylladb/install-drivers>`
for your application language.

This step involves setting up a driver that is compatible with ScyllaDB.
The driver acts as the link between your application and ScyllaDB, enabling
your application to communicate with the database.

#. Modify your application code to connect via the driver.

   The following boilerplate code in various languages will help familiarize
   you with connecting your application to ScyllaDB via a driver. For a
   detailed walkthrough of building a fictional media player application with
   code examples, please see our
   `Getting Started tutorial <https://cloud-getting-started.scylladb.com/stable/getting-started.html>`_.

.. tabs::

   .. group-tab:: Rust

      .. code-block:: rust

         use anyhow::Result;
         use scylla::{Session, SessionBuilder};
         use std::time::Duration;

         #[tokio::main]
         async fn main() -> Result<()> {
             // Build a session: contact node, connection timeout,
             // and credentials.
             let session: Session = SessionBuilder::new()
                 .known_nodes(&["localhost"])
                 .connection_timeout(Duration::from_secs(30))
                 .user("scylla", "your-awesome-password")
                 .build()
                 .await?;
             Ok(())
         }

   .. group-tab:: Go

      .. code-block:: go

         package main

         import (
             "github.com/gocql/gocql"
             "github.com/scylladb/gocqlx/v2"
         )

         func main() {
             // Configure the contact point and credentials, then open a session.
             cluster := gocql.NewCluster("localhost")
             cluster.Authenticator = gocql.PasswordAuthenticator{Username: "scylla", Password: "your-awesome-password"}
             session, err := gocqlx.WrapSession(cluster.CreateSession())
             if err != nil {
                 panic(err)
             }
             defer session.Close()
         }

   .. group-tab:: Java

      .. code-block:: java

         import com.datastax.driver.core.Cluster;
         import com.datastax.driver.core.PlainTextAuthProvider;
         import com.datastax.driver.core.Session;

         class Main {
             public static void main(String[] args) {
                 // Connect with a contact point and credentials.
                 Cluster cluster = Cluster.builder()
                         .addContactPoints("localhost")
                         .withAuthProvider(new PlainTextAuthProvider("scylla", "your-awesome-password"))
                         .build();
                 Session session = cluster.connect();
             }
         }

   .. group-tab:: Python

      .. code-block:: python

         from cassandra.cluster import Cluster
         from cassandra.auth import PlainTextAuthProvider

         # Configure contact points and credentials, then connect.
         cluster = Cluster(
             contact_points=["localhost"],
             auth_provider=PlainTextAuthProvider(username='scylla',
                                                 password='your-awesome-password'),
         )
         session = cluster.connect()

   .. group-tab:: JavaScript

      .. code-block:: javascript

         const cassandra = require('cassandra-driver');

         // Configure contact points, local data center, and credentials.
         const cluster = new cassandra.Client({
             contactPoints: ["localhost", ...],
             localDataCenter: 'your-data-center',
             credentials: {username: 'scylla', password: 'your-awesome-password'},
             // keyspace: 'your_keyspace' // optional
         })
