Update docs (#7986)
GitOrigin-RevId: 05d090e6e88eead576b75b28f6a09b82c106778a
tryptofanik authored and Manul from Pathway committed Jan 13, 2025
1 parent 5ef6481 commit c3db74b
Showing 9 changed files with 144 additions and 49 deletions.
2 changes: 2 additions & 0 deletions docs/2.developers/4.user-guide/10.introduction/10.welcome.md
@@ -59,6 +59,8 @@ width: '500'
- [Examples](/developers/user-guide/introduction/first_realtime_app_with_pathway)
- [Core concepts](/developers/user-guide/introduction/concepts)
- [Why Pathway](/developers/user-guide/introduction/why-pathway)
- [Streaming and Static Modes](/developers/user-guide/introduction/streaming-and-static-modes)
- [Batch Processing](/developers/user-guide/introduction/batch-processing)
- [Deployment](/developers/user-guide/deployment/cloud-deployment)
- [LLM tooling](/developers/user-guide/llm-xpack/overview)

11 changes: 11 additions & 0 deletions docs/2.developers/4.user-guide/10.introduction/20.installation.md
@@ -16,6 +16,8 @@ To quickly get started with Pathway, you can install it via pip with the followi
pip install -U pathway
```

This will install all the basic dependencies needed to run your Pathway pipelines, including our powerful Rust engine.


<!-- https://www.canva.com/design/DAGGtZB_-kw/6gGXSnfMNL9LuOXTOSQbQQ/edit?utm_content=DAGGtZB_-kw&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton -->
::article-img
@@ -54,6 +56,15 @@ Pick one and start your hands-on experience with Pathway today!

## Optional packages

We have separated the dependencies into several groups, so that users have better control over what is installed.

- For standard usage of the Live Data Framework, run `pip install pathway`. No external LLM-based libraries will be installed.

- To run AI pipelines or build Live AI systems, consider running `pip install "pathway[xpack-llm]"`, which installs common LLM libraries such as OpenAI and Langchain.

For more information, please visit our [pyproject.toml](https://github.com/pathwaycom/pathway/blob/main/pyproject.toml) file, which describes the contents of each group.
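Extras can also be combined in a single installation command. For example, to get both the LLM tooling and the document-parsing utilities (the `xpack-llm-docs` group used by some of our templates), you could run:

```
pip install "pathway[xpack-llm,xpack-llm-docs]"
```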


| **Package** | **Installation Command** | **Description** | **Notes** |
|--------------|--------------------------|------------------|-----------|
| **Basic LLM Tooling** | `pip install "pathway[xpack-llm]"` | Install common LLM libraries (OpenAI, Langchain, LlamaIndex) | [Learn more](/developers/user-guide/llm-xpack/overview) / [Examples](/developers/templates?category=llm#llm) |
@@ -16,23 +16,40 @@ Then, you simply have to import Pathway as any other Python library:
import pathway as pw
```

## Connect to your data sources
In Pathway, you need to use a **connector** to create a table from a data source.
Connectors rely on [schemas](/developers/user-guide/connect/schema) to structure the data:
## Define your Data Schema

[Schemas](/developers/user-guide/connect/schema) in Pathway define the structure of your data tables. They describe the data types and names of the columns, ensuring that your data is well-organized and consistent.

For instance, when reading a data source, you specify a schema to map the incoming data:


```python
class InputSchema(pw.Schema):
    colA: int
    colB: float
    colC: str


input_table = pw.io.csv.read('./data/', schema=InputSchema)
```

Here, `InputSchema` specifies three columns: `colA` (an integer), `colB` (a float), and `colC` (a string).
Schemas define the structure of the data, ensuring type safety and optimizing runtime performance.

Pathway supports the following basic [data types](/developers/user-guide/connect/datatypes): `bool`, `str`, `bytes`, `int`, and `float`.
Pathway also supports more complex data types, such as the `Optional` data type or temporal data types (`datetime.datetime`).
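For instance, a schema can mix basic, `Optional`, and temporal columns. The sketch below uses illustrative names and assumes `pw.DateTimeNaive` as the annotation for a naive datetime column:

```python
from typing import Optional

import pathway as pw


class SensorSchema(pw.Schema):
    sensor_id: str                # basic type
    reading: Optional[float]      # Optional: the value may be missing
    created_at: pw.DateTimeNaive  # temporal type (naive datetime, assumed here)
```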

## Tables

[Tables](https://pathway.com/developers/api-docs/pathway-table) are the Pathway objects that hold your data. They are composed of columns, each of which keeps data of the same type, just like in relational databases.
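For quick, hands-on experimentation you can also build a small static table inline; this sketch uses `pw.debug.table_from_markdown`:

```python
import pathway as pw

# A tiny static table; each column keeps values of a single type,
# just like in a relational database.
t = pw.debug.table_from_markdown(
    """
    name  | age
    Alice | 30
    Bob   | 25
    """
)
pw.debug.compute_and_print(t)
```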


## Connectors

In Pathway, you need to use a **connector** to create a table from a data source. Connectors read and ingest data from your chosen data sources in real time.

Here's an example of a connector that uses `InputSchema` to read **CSV** files from the `./data/` directory and outputs a table:

```python
input_table = pw.io.csv.read('./data/', schema=InputSchema)
```

Here is a small sample of Pathway input connectors:

| Input Connectors | Example |
@@ -46,11 +63,17 @@ Pathway comes with many more connectors, including an [Airbyte connector](/devel
You can find the list of available connectors on our [connector page](/developers/user-guide/connect/pathway-connectors).
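Many file-based connectors can also read the data as a one-off batch, which is convenient when debugging a pipeline. As a sketch (reusing `InputSchema` from above and assuming the connector's `mode` parameter):

```python
static_table = pw.io.csv.read("./data/", schema=InputSchema, mode="static")
```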

## Transformations
Once your input data is defined, you can define your data pipeline using [Pathway transformations](/developers/user-guide/introduction/concepts#processing-the-data-with-transformations):
Once your input data is specified, you can define your data pipeline using Pathway [transformations](/developers/user-guide/introduction/concepts#processing-the-data-with-transformations). Under the hood, these are implemented in Rust, which makes them very efficient.

Here is an example of a simple transformation composed of filtering and summing by groups:

```python
filtered_table = input_table.filter(input_table.colA > 0)
result_table = filtered_table.groupby(filtered_table.colB).reduce(sum_val=pw.reducers.sum(pw.this.colC))
result_table = (
    filtered_table
    .groupby(filtered_table.colB)
    .reduce(sum_val=pw.reducers.sum(pw.this.colC))
)
```

Here is a small sample of the operations you can do in Pathway:
@@ -15,7 +15,7 @@ on a Python 3.10+ installation, and you are ready to roll!

## A simple sum example

Let's start with a simple sum over positive values stored in CSV files, and written to a JSON Lines file:
Let's start with a simple sum over positive values stored in CSV files, and written to a [JSON Lines](https://jsonlines.org/) file:

::article-img
---
@@ -38,10 +38,11 @@ The aim is to combine the data from those two data sources and find the live mea
This is how you can do the whole pipeline in Pathway:

```python
import pathway as pw # import Pathway
import pathway as pw

# Declare the Schema of your tables using pw.Schema.
# There are two input tables: measurements and threshold.
# There are two input tables: (1) measurements, which is a
# live stream, and (2) threshold, which is a CSV file that may be modified over time.
# Both have two columns: a name (str) and a float.
class MeasurementSchema(pw.Schema):
    name: str
@@ -80,21 +81,36 @@ thresholds_table = pw.io.csv(
)

# Joining tables on the column name
joined_table = measurements_table.join(  # The left table is measurements_table (pw.left)
    thresholds_table,  # The right table is thresholds_table (pw.right)
    pw.left.name == pw.right.name,  # The join is done on the column name of each table
).select(  # The columns of the joined table are chosen using select
    *pw.left,  # All the columns of measurements are kept
    pw.right.threshold,  # The threshold column of the threshold table is kept
)
joined_table = (
    # The left table is measurements_table (referred to as pw.left)
    measurements_table
    .join(
        # The right table is thresholds_table (referred to as pw.right)
        thresholds_table,
        # The join is done on the column name of each table
        pw.left.name == pw.right.name,
    )
    # The columns of the joined table are chosen using select
    .select(
        # All the columns of measurements are kept
        *pw.left,
        # The threshold column of the threshold table is kept
        pw.right.threshold,
    )
)

# Filtering values strictly higher than the threshold.
alerts_table = joined_table.filter(
    pw.this.value > pw.this.threshold
).select(pw.this.name, pw.this.value)  # Only name and value fields are kept
alerts_table = (
    joined_table
    # Filtering values strictly higher than the threshold.
    .filter(pw.this.value > pw.this.threshold)
    # Only name and value fields are kept
    .select(pw.this.name, pw.this.value)
)

# Sending the results to another Kafka topic, on the same Kafka instance
pw.io.kafka.write(alerts_table, rdkafka_settings, topic_name="alerts_topic", format="json")
pw.io.kafka.write(
    alerts_table, rdkafka_settings, topic_name="alerts_topic", format="json"
)

# Launching Pathway computation.
pw.run()
@@ -148,28 +164,32 @@ Then the output is:
{"name": "B", "value":10, "time":1, "diff":1}
```
The output contains two more fields: `time` and `diff`:
* `time` represents the time at which the update has happened. In practice, the time is a timestamp.
* `diff` represents whether the row represents an addition or a deletion. An update is represented by two rows: one to remove the old value, one to add the new values. Those two rows have the same time to ensure the atomicity of the operation.
* `time` represents the time at which the update happened. In practice, the time is a regular timestamp.
* `diff` indicates whether the row is an insertion (`diff = 1`) or a deletion (`diff = -1`). An update is represented by two rows: one to remove the old value, one to add the new value. Those two rows share the same time to ensure the atomicity of the operation.

In this case, we assume the first values were computed at time 1.
The value `diff` is equal to `1` as it is an insertion.

Suppose now that the thresholds are updated:
Suppose now that the thresholds have changed: the file was updated and now looks like this:
```
name, threshold
"A", 7
"B", 11
```

Then the output is:
The connector will automatically detect any new files or modifications within `./threshold-data/` and update the tables accordingly.

This triggers the re-execution of the join and the filter, giving the following output:
```
{"name": "B", "value":10, "time":1, "diff":1}
{"name": "B", "value":10, "time":2, "diff":-1}
{"name": "A", "value":8, "time":2, "diff":1}
```

There are two more lines: the old alert is removed (`diff=-1`) and a new one is inserted for the other event (`diff=1`).
Old values are still kept as the output is a log of insertions and deletions.
Old values are still kept, as the output is a log of insertions and deletions, giving exhaustive information about what happened to the data.

Keep in mind that some output connectors to external data storage systems might take these `-1` and `+1` rows, link them by time, and represent them as a single update operation (as in the case of a SQL database).
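For example, sending the alerts to PostgreSQL might look like this (a sketch: the connection settings and table name are illustrative, and we assume the `pw.io.postgres.write` output connector):

```python
postgres_settings = {
    "host": "localhost",
    "port": "5432",
    "dbname": "alerts_db",
    "user": "pathway",
    "password": "my_password",
}
# Each matching -1/+1 pair sharing the same time can be reflected
# as a single UPDATE on the SQL side.
pw.io.postgres.write(alerts_table, postgres_settings, "alerts_table")
```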

## Other examples
::container{.flex .gap-8 .items-center .w-full .justify-center}
@@ -149,7 +149,9 @@ Here is a way to do it with Pathway, assuming a correct input table called `inpu

```python
filtered_table = input_table.filter(input_table.age >= 0)
result_table = filtered_table.reduce(sum_age = pw.reducers.sum(filtered_table.age))
result_table = filtered_table.reduce(
    sum_age=pw.reducers.sum(filtered_table.age)
)
```
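To turn this snippet into the full pipeline described above, you would add an output connector and start the computation; a minimal sketch (the output path is illustrative):

```python
pw.io.jsonlines.write(result_table, "./output/sum_age.jsonl")
pw.run()
```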

It's okay if you don't understand everything for now.
Expand Down
@@ -175,7 +175,7 @@ def cc(vertices: pw.Table, edges: pw.Table) -> pw.Table:

# %% [markdown]

# In an iteration step, the `edges` table is joined with the `vertices` table to get the representatives of neighbors in the graph. Then `groupby` is performed on `edges_with_repr` to get a minimal representative for each vertex. A new ID is assigned based on column `a` - the vertex label. It is assigned in exactly the same way as above when creating the table. This lets you keep the same set of keys in the `vertices_updated` table as in the `vertices` table. However, Pathway is not clever enough to deduce that the keys are exactly the same in these two tables. That's why it has to be additionally told they are the same, by using [`with_universe_of`](/developers/api-docs/pathway-table#pathway.Table.with_universe_of).
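# In code, this boils down to a single call; a minimal sketch reusing the names above:
#
# ```python
# vertices_updated = vertices_updated.with_universe_of(vertices)
# ```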

# Preserving the set of keys is important in `iterate`. The iteration can only stop if there are no updates in any of the records. The records' correspondence between iterations is determined using their IDs. If a record with one ID disappears and a record with a new ID appears, Pathway decides that something is still changing and the computation has to continue (even if the contents of the two rows are the same). It is possible to change the set of keys used in `iterate`, but in the end the set of keys has to stop changing anyway. You can see that in the next example on computing shortest distances in a graph.

@@ -26,7 +26,7 @@
# %% [markdown]
# # Always Up-to-date Data Indexing pipeline
#
# This showcase shows how to use Pathway to deploy a live data indexing pipeline, which can be queried like a typical vector store. However, under the hood, Pathway updates the index on each data change, always giving up-to-date answers.
# This showcase demonstrates how to use Pathway to deploy a live data indexing pipeline that can be queried similarly to a typical vector store. Unlike traditional approaches, Pathway updates the index with every data change, ensuring consistently up-to-date answers.
# <!-- canva link: https://www.canva.com/design/DAF1cxQW5Vg/LcFdDrPApBrgwM5kyirY6w/edit -->
# ::article-img
# ---
@@ -61,7 +61,7 @@
# Then download sample data.

# %%
# _MD_SHOW_!pip install pathway litellm
# _MD_SHOW_!pip install "pathway[xpack-llm,xpack-llm-docs]"
# _MD_SHOW_!pip install unstructured[all-docs]
# _MD_SHOW_!mkdir -p sample_documents
# _MD_SHOW_![ -f sample_documents/repo_readme.md ] || wget 'https://gist.githubusercontent.com/janchorowski/dd22a293f3d99d1b726eedc7d46d2fc0/raw/pathway_readme.md' -O 'sample_documents/repo_readme.md'