Update docs (#7986)
GitOrigin-RevId: 05d090e6e88eead576b75b28f6a09b82c106778a
tryptofanik authored and Manul from Pathway committed Jan 13, 2025
1 parent 5ef6481 commit c3db74b
Showing 9 changed files with 144 additions and 49 deletions.
2 changes: 2 additions & 0 deletions docs/2.developers/4.user-guide/10.introduction/10.welcome.md
@@ -59,6 +59,8 @@ width: '500'
- [Examples](/developers/user-guide/introduction/first_realtime_app_with_pathway)
- [Core concepts](/developers/user-guide/introduction/concepts)
- [Why Pathway](/developers/user-guide/introduction/why-pathway)
- [Streaming and Static Modes](/developers/user-guide/introduction/streaming-and-static-modes)
- [Batch Processing](/developers/user-guide/introduction/batch-processing)
- [Deployment](/developers/user-guide/deployment/cloud-deployment)
- [LLM tooling](/developers/user-guide/llm-xpack/overview)

11 changes: 11 additions & 0 deletions docs/2.developers/4.user-guide/10.introduction/20.installation.md
@@ -16,6 +16,8 @@ To quickly get started with Pathway, you can install it via pip with the followi
pip install -U pathway
```

This will install all the basic dependencies needed to run your Pathway pipelines, including our powerful Rust engine.


<!-- https://www.canva.com/design/DAGGtZB_-kw/6gGXSnfMNL9LuOXTOSQbQQ/edit?utm_content=DAGGtZB_-kw&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton -->
::article-img
@@ -54,6 +56,15 @@ Pick one and start your hands-on experience with Pathway today!

## Optional packages

We have separated the dependencies into several groups, so that users have better control over what is installed.

- For standard usage of the Live Data Framework, run `pip install pathway`. No external LLM-based libraries will be installed.

- To run AI pipelines or build Live AI systems, consider running `pip install "pathway[xpack-llm]"`, which installs common LLM libraries such as OpenAI and Langchain.

For more information, please visit our [pyproject.toml](https://github.com/pathwaycom/pathway/blob/main/pyproject.toml) file, which describes the contents of each group.
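Extras can also be combined in a single installation command. For example, to get both the LLM tooling and the document-parsing utilities (the `xpack-llm-docs` group used by some of our templates), you could run:

```
pip install "pathway[xpack-llm,xpack-llm-docs]"
```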


| **Package** | **Installation Command** | **Description** | **Notes** |
|--------------|--------------------------|------------------|-----------|
| **Basic LLM Tooling** | `pip install "pathway[xpack-llm]"` | Install common LLM libraries (OpenAI, Langchain, LlamaIndex) | [Learn more](/developers/user-guide/llm-xpack/overview) / [Examples](/developers/templates?category=llm#llm) |
@@ -16,23 +16,40 @@ Then, you simply have to import Pathway as any other Python library:
import pathway as pw
```

## Connect to your data sources
In Pathway, you need to use a **connector** to create a table from a data source.
Connectors rely on [schemas](/developers/user-guide/connect/schema) to structure the data:
## Define your Data Schema

[Schemas](/developers/user-guide/connect/schema) in Pathway define the structure of your data tables. They describe the data types and names of the columns, ensuring that your data is well-organized and consistent.

For instance, when reading a data source, you specify a schema to map the incoming data:


```python
class InputSchema(pw.Schema):
    colA: int
    colB: float
    colC: str


input_table = pw.io.csv.read('./data/', schema=InputSchema)
```

Here, `InputSchema` specifies three columns: `colA` (an integer), `colB` (a float), and `colC` (a string).
Schemas define the structure of the data, ensuring type safety and optimizing runtime performance.

Pathway supports the following basic [data types](/developers/user-guide/connect/datatypes): `bool`, `str`, `bytes`, `int`, and `float`.
Pathway also supports more complex data types, such as the `Optional` data type or temporal data types (`datetime.datetime`).
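For instance, a schema can mix basic, `Optional`, and temporal columns. The sketch below uses illustrative names and assumes `pw.DateTimeNaive` as the annotation for a naive datetime column:

```python
from typing import Optional

import pathway as pw


class SensorSchema(pw.Schema):
    sensor_id: str                # basic type
    reading: Optional[float]      # Optional: the value may be missing
    created_at: pw.DateTimeNaive  # temporal type (naive datetime, assumed here)
```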

## Tables

[Tables](https://pathway.com/developers/api-docs/pathway-table) are the Pathway objects that hold your data. They are composed of columns, each of which keeps data of the same type, just like in relational databases.
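For quick, hands-on experimentation you can also build a small static table inline; this sketch uses `pw.debug.table_from_markdown`:

```python
import pathway as pw

# A tiny static table; each column keeps values of a single type,
# just like in a relational database.
t = pw.debug.table_from_markdown(
    """
    name  | age
    Alice | 30
    Bob   | 25
    """
)
pw.debug.compute_and_print(t)
```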


## Connectors

In Pathway, you need to use a **connector** to create a table from a data source. Connectors read and ingest data from your chosen data sources in real time.

Here's an example of a connector that uses `InputSchema` to read **CSV** files from the `./data/` directory and outputs a table:

```python
input_table = pw.io.csv.read('./data/', schema=InputSchema)
```

Here is a small sample of Pathway input connectors:

| Input Connectors | Example |
@@ -46,11 +63,17 @@ Pathway comes with many more connectors, including an [Airbyte connector](/devel
You can find the list of available connectors on our [connector page](/developers/user-guide/connect/pathway-connectors).
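Many file-based connectors can also read the data as a one-off batch, which is convenient when debugging a pipeline. As a sketch (reusing `InputSchema` from above and assuming the connector's `mode` parameter):

```python
static_table = pw.io.csv.read("./data/", schema=InputSchema, mode="static")
```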

## Transformations
Once your input data is defined, you can define your data pipeline using [Pathway transformations](/developers/user-guide/introduction/concepts#processing-the-data-with-transformations):
Once your input data is specified, you can define your data pipeline using Pathway [transformations](/developers/user-guide/introduction/concepts#processing-the-data-with-transformations). Under the hood, these are implemented in Rust, which makes them very efficient.

Here is an example of a simple transformation composed of filtering and summing by groups:

```python
filtered_table = input_table.filter(input_table.colA > 0)
result_table = filtered_table.groupby(filtered_table.colB).reduce(sum_val=pw.reducers.sum(pw.this.colC))
result_table = (
    filtered_table
    .groupby(filtered_table.colB)
    .reduce(sum_val=pw.reducers.sum(pw.this.colC))
)
```

Here is a small sample of the operations you can do in Pathway:
@@ -15,7 +15,7 @@ on a Python 3.10+ installation, and you are ready to roll!

## A simple sum example

Let's start with a simple sum over positive values stored in CSV files, and written to a JSON Lines file:
Let's start with a simple sum over positive values stored in CSV files, and written to a [JSON Lines](https://jsonlines.org/) file:

::article-img
---
@@ -38,10 +38,11 @@ The aim is to combine the data from those two data sources and find the live mea
This is how you can do the whole pipeline in Pathway:

```python
import pathway as pw # import Pathway
import pathway as pw

# Declare the Schema of your tables using pw.Schema.
# There are two input tables: measurements and threshold.
# There are two input tables: (1) measurements, which is a
# live stream, and (2) threshold, which is a CSV file that may be modified over time.
# Both have two columns: a name (str) and a float.
class MeasurementSchema(pw.Schema):
    name: str
@@ -80,21 +81,36 @@ thresholds_table = pw.io.csv(
)

# Joining tables on the column name
joined_table = measurements_table.join(  # The left table is measurements_table (pw.left)
    thresholds_table,  # The right table is thresholds_table (pw.right)
    pw.left.name == pw.right.name,  # The join is done on the column name of each table
).select(  # The columns of the joined table are chosen using select
    *pw.left,  # All the columns of measurements are kept
    pw.right.threshold,  # The threshold column of the threshold table is kept
)
joined_table = (
    # The left table is measurements_table (referred to as pw.left)
    measurements_table
    .join(
        # The right table is thresholds_table (referred to as pw.right)
        thresholds_table,
        # The join is done on the column name of each table
        pw.left.name == pw.right.name,
    )
    # The columns of the joined table are chosen using select
    .select(
        # All the columns of measurements are kept
        *pw.left,
        # The threshold column of the threshold table is kept
        pw.right.threshold,
    )
)

# Filtering values strictly higher than the threshold.
alerts_table = joined_table.filter(
    pw.this.value > pw.this.threshold
).select(pw.this.name, pw.this.value)  # Only name and value fields are kept
alerts_table = (
    joined_table
    # Filtering values strictly higher than the threshold.
    .filter(pw.this.value > pw.this.threshold)
    # Only name and value fields are kept
    .select(pw.this.name, pw.this.value)
)

# Sending the results to another Kafka topic, on the same Kafka instance
pw.io.kafka.write(alerts_table, rdkafka_settings, topic_name="alerts_topic", format="json")
pw.io.kafka.write(
    alerts_table, rdkafka_settings, topic_name="alerts_topic", format="json"
)

# Launching Pathway computation.
pw.run()
@@ -148,28 +164,32 @@ Then the output is:
{"name": "B", "value":10, "time":1, "diff":1}
```
The output contains two more fields: `time` and `diff`:
* `time` represents the time at which the update has happened. In practice, the time is a timestamp.
* `diff` represents whether the row represents an addition or a deletion. An update is represented by two rows: one to remove the old value, one to add the new values. Those two rows have the same time to ensure the atomicity of the operation.
* `time` represents the time at which the update happened. In practice, the time is a regular timestamp.
* `diff` indicates whether the row is an insertion (`diff = 1`) or a deletion (`diff = -1`). An update is represented by two rows: one to remove the old value, one to add the new value. Those two rows share the same time to ensure the atomicity of the operation.

In this case, we assume the first values were computed at time 1.
The value `diff` is equal to `1` as it is an insertion.

Suppose now that the thresholds are updated:
Suppose now that the thresholds have changed: the file was updated and now looks like this:
```
name, threshold
"A", 7
"B", 11
```

Then the output is:
The connector will automatically detect any new files or modifications within `./threshold-data/` and update the tables accordingly.

This triggers the re-execution of the join and the filter, giving the following output:
```
{"name": "B", "value":10, "time":1, "diff":1}
{"name": "B", "value":10, "time":2, "diff":-1}
{"name": "A", "value":8, "time":2, "diff":1}
```

There are two more lines: the old alert is removed (`diff=-1`) and a new one is inserted for the other event (`diff=1`).
Old values are still kept as the output is a log of insertions and deletions.
Old values are still kept, as the output is a log of insertions and deletions, giving exhaustive information about what happened to the data.

Keep in mind that some output connectors to external data storage systems might take these `-1` and `+1` rows, link them by time, and represent them as a single update operation (as in the case of a SQL database).
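For example, sending the alerts to PostgreSQL might look like this (a sketch: the connection settings and table name are illustrative, and we assume the `pw.io.postgres.write` output connector):

```python
postgres_settings = {
    "host": "localhost",
    "port": "5432",
    "dbname": "alerts_db",
    "user": "pathway",
    "password": "my_password",
}
# Each matching -1/+1 pair sharing the same time can be reflected
# as a single UPDATE on the SQL side.
pw.io.postgres.write(alerts_table, postgres_settings, "alerts_table")
```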

## Other examples
::container{.flex .gap-8 .items-center .w-full .justify-center}
@@ -149,7 +149,9 @@ Here is a way to do it with Pathway, assuming a correct input table called `inpu

```python
filtered_table = input_table.filter(input_table.age >= 0)
result_table = filtered_table.reduce(sum_age = pw.reducers.sum(filtered_table.age))
result_table = filtered_table.reduce(
    sum_age=pw.reducers.sum(filtered_table.age)
)
```
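To turn this snippet into the full pipeline described above, you would add an output connector and start the computation; a minimal sketch (the output path is illustrative):

```python
pw.io.jsonlines.write(result_table, "./output/sum_age.jsonl")
pw.run()
```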

It's okay if you don't understand everything for now.
Expand Down
@@ -175,7 +175,7 @@ def cc(vertices: pw.Table, edges: pw.Table) -> pw.Table:

# %% [markdown]

# In an iteration step, the `edges` table is joined with the `vertices` table to get the representatives of neighbors in the graph. Then `groupby` is performed on `edges_with_repr` to get a minimal representative for each vertex. A new ID is assigned based on column `a` - the vertex label. It is assigned in exactly the same way as above when creating the table. This lets you keep the same set of keys in the `vertices_updated` table as in the `vertices` table. However, Pathway is not clever enough to deduce that the keys are exactly the same in these two tables. That's why it has to be additionally told they are the same, by using [`with_universe_of`](/developers/api-docs/pathway-table#pathway.Table.with_universe_of).
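# In code, this boils down to a single call; a minimal sketch reusing the names above:
#
# ```python
# vertices_updated = vertices_updated.with_universe_of(vertices)
# ```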

# Preserving the set of keys is important in `iterate`. The iteration can only stop if there are no updates in any of the records. The records' correspondence between iterations is determined using their IDs. If a record with one ID disappears and a record with a new ID appears, Pathway decides that something is still changing and the computation has to continue (even if the contents of the two rows are the same). It is possible to change the set of keys used in `iterate`, but in the end the set of keys has to stop changing anyway. You can see that in the next example on computing shortest distances in a graph.

@@ -26,7 +26,7 @@
# %% [markdown]
# # Always Up-to-date Data Indexing pipeline
#
# This showcase shows how to use Pathway to deploy a live data indexing pipeline, which can be queried like a typical vector store. However, under the hood, Pathway updates the index on each data change, always giving up-to-date answers.
# This showcase demonstrates how to use Pathway to deploy a live data indexing pipeline that can be queried similarly to a typical vector store. Unlike traditional approaches, Pathway updates the index with every data change, ensuring consistently up-to-date answers.
# <!-- canva link: https://www.canva.com/design/DAF1cxQW5Vg/LcFdDrPApBrgwM5kyirY6w/edit -->
# ::article-img
# ---
@@ -61,7 +61,7 @@
# Then download sample data.

# %%
# _MD_SHOW_!pip install pathway litellm
# _MD_SHOW_!pip install "pathway[xpack-llm,xpack-llm-docs]"
# _MD_SHOW_!pip install unstructured[all-docs]
# _MD_SHOW_!mkdir -p sample_documents
# _MD_SHOW_![ -f sample_documents/repo_readme.md ] || wget 'https://gist.githubusercontent.com/janchorowski/dd22a293f3d99d1b726eedc7d46d2fc0/raw/pathway_readme.md' -O 'sample_documents/repo_readme.md'