
[docs] Update example CodeExample anchors (#27953)
## Summary & Motivation
Update anchors in `docs_project` and `examples` content to use comment
anchoring rather than line numbers.
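
For reference, comment anchoring works by placing sentinel comments in the example source file and pointing the docs `<CodeExample>` component at them with `startAfter`/`endBefore` instead of `lineStart`/`lineEnd`. A minimal sketch of the source-file side, using a hypothetical `start_example`/`end_example` anchor pair (the asset itself is a placeholder, not code from this repo):

```python
import dagster as dg

# The docs component would reference this block with
# startAfter="start_example" endBefore="end_example".

# start_example
@dg.asset
def my_example_asset() -> None:
    """Everything between the anchor comments is rendered in the docs."""
    ...
# end_example
```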

## How I Tested These Changes

## Changelog

> Insert changelog entry or delete this section.
dehume authored Feb 20, 2025
1 parent 3475388 commit 5b8fb36
Showing 42 changed files with 257 additions and 68 deletions.
8 changes: 4 additions & 4 deletions docs/docs/examples/bluesky/dashboard.md
@@ -12,20 +12,20 @@ For this example we will use [Power BI](https://www.microsoft.com/en-us/power-pl

First we will initialize the `PowerBIWorkspace` resource which allows Dagster to communicate with Power BI.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/dashboard/definitions.py" language="python" lineStart="9" lineEnd="17"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/dashboard/definitions.py" language="python" startAfter="start_powerbi" endBefore="end_powerbi"/>
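
For readers without the source file handy, here is a rough sketch of what the anchored `start_powerbi`/`end_powerbi` block can contain, assuming the `dagster-powerbi` integration's `PowerBIWorkspace` resource with service-principal credentials (the exact argument and environment variable names are assumptions, not the project's code):

```python
import dagster as dg
from dagster_powerbi import PowerBIServicePrincipal, PowerBIWorkspace

# Credentials are read from the environment at runtime.
power_bi_workspace = PowerBIWorkspace(
    credentials=PowerBIServicePrincipal(
        client_id=dg.EnvVar("AZURE_POWERBI_CLIENT_ID"),
        client_secret=dg.EnvVar("AZURE_POWERBI_CLIENT_SECRET"),
        tenant_id=dg.EnvVar("AZURE_POWERBI_TENANT_ID"),
    ),
    workspace_id=dg.EnvVar("AZURE_POWERBI_WORKSPACE_ID"),
)
```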

Then, like dbt, we will define a translator. This time since the Power BI assets live downstream of our dbt models, we will map the Power BI assets to those model assets.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/dashboard/definitions.py" language="python" lineStart="19" lineEnd="41"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/dashboard/definitions.py" language="python" startAfter="start_dbt" endBefore="end_dbt"/>

Finally, we create the definition containing our dashboard assets and the Power BI resource.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/dashboard/definitions.py" language="python" lineStart="43" lineEnd="49"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/dashboard/definitions.py" language="python" startAfter="start_def" endBefore="end_def"/>

## Definition merge

With the dashboard definition set, we have all three layers of the end-to-end project ready to go. We can now merge them into a single definition, which will be the one used in our code location.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/definitions.py" language="python" lineStart="2" lineEnd="9"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/definitions.py" language="python" startAfter="start_def" endBefore="end_def"/>
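
As a rough sketch (module and attribute names are assumptions), the merged definition can be as small as a single `Definitions.merge` call over the per-domain definitions:

```python
import dagster as dg

# Each submodule exposes its own `defs`; the import paths here are illustrative.
from project_atproto_dashboard import dashboard, ingestion, modeling

defs = dg.Definitions.merge(ingestion.defs, modeling.defs, dashboard.defs)
```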

You can see that organizing your project into domain-specific definitions leads to a clean top-level definition. We do this with our own [internal Dagster project](https://github.com/dagster-io/dagster-open-platform/blob/main/dagster_open_platform/definitions.py), which combines over a dozen domain-specific definitions for the various tools and services we use.
14 changes: 7 additions & 7 deletions docs/docs/examples/bluesky/ingestion.md
@@ -12,37 +12,37 @@ The data that serves the foundation for our project is [Bluesky](https://bsky.ap

Because there is not an out-of-the-box integration for Bluesky in Dagster, we will build our own custom <PyObject section="resources" module="dagster" object="ConfigurableResource"/>. Bluesky uses [atproto](https://docs.bsky.app/docs/advanced-guides/atproto) and provides an [SDK](https://docs.bsky.app/docs/get-started), which will serve as the backbone for our resource.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/resources.py" language="python" lineStart="6" lineEnd="29"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/resources.py" language="python" startAfter="start_resource" endBefore="end_resource"/>

The most important part of our resource is that it returns the atproto client which our Dagster assets will use. The `_login` method will initialize the client and cache it within the resource. We do this because Bluesky has rate limits and initializing the client counts against that limit. We intend to run these assets separately, so we want to be as efficient with our API calls as possible.
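
A minimal sketch of such a resource, assuming the atproto SDK's `Client.login` and a Pydantic private attribute for the cached client (the field names and caching details are illustrative, not the project's exact code):

```python
from typing import Optional

from atproto import Client
from dagster import ConfigurableResource
from pydantic import PrivateAttr


class ATProtoResource(ConfigurableResource):
    login: str
    password: str

    _client: Optional[Client] = PrivateAttr(default=None)

    def _login(self) -> Client:
        # Creating a session counts against Bluesky's rate limits,
        # so authenticate once and reuse the client.
        if self._client is None:
            self._client = Client()
            self._client.login(self.login, self.password)
        return self._client

    def get_client(self) -> Client:
        return self._login()
```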

With the client defined, we can move on to the strategy for pulling from Bluesky. There is a lot of data we could potentially pull, but we will focus on posts related to data. Bluesky has the idea of [starter packs](https://bsky.social/about/blog/06-26-2024-starter-packs), which are curated lists of users associated with specific domains. We will ingest the [Data People](https://blueskystarterpack.com/starter-packs/lc5jzrr425fyah724df3z5ik/3l7cddlz5ja24) pack. Using the atproto client, we can get all the members of that starter pack.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/utils/atproto.py" language="python" lineStart="42" lineEnd="59"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/utils/atproto.py" language="python" startAfter="start_starter_pack" endBefore="end_starter_pack"/>

The `get_all_feed_items` function similarly uses the atproto client, this time to get information about individual feeds. This retrieves much more data and is where we will be most concerned about rate limiting (which we will cover in the [next section](rate-limiting)). Now that we have everything we need to interact with Bluesky, we can create our assets.

## Extracting data

Our first asset (`starter_pack_snapshot`) is responsible for extracting the members of the Data People starter pack and loading the data into R2. Let's look at the asset decorator and parameters before getting into the logic of the function.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" lineStart="18" lineEnd="33"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" startAfter="start_starter_pack_dec" endBefore="end_starter_pack_dec"/>
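
A hedged sketch of what such a decorator can look like (the partition key, cron string, and resource parameter names are assumptions; `ATProtoResource` is the resource sketched earlier on this page):

```python
import dagster as dg
from dagster_aws.s3 import S3Resource


@dg.asset(
    partitions_def=dg.StaticPartitionsDefinition(
        ["at://did:plc:example/app.bsky.graph.starterpack/data-people"]  # hypothetical key
    ),
    automation_condition=dg.AutomationCondition.on_cron("0 0 * * *"),  # every midnight
    kinds={"python"},
    group_name="ingestion",
)
def starter_pack_snapshot(
    context: dg.AssetExecutionContext,
    atproto_resource: ATProtoResource,  # sketched in the resources section above
    s3_resource: S3Resource,
) -> dg.MaterializeResult:
    # Body elided; the anchored example above contains the real extraction logic.
    return dg.MaterializeResult()
```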

First, we set a static partition for the specific starter pack. Next, an `automation_condition` is added; this is a simple way to schedule this asset and ensure it runs every midnight. Finally, we add `kinds` of `Python` and a `group_name` of `ingestion`, which will help us organize our assets in the Dagster UI. For the parameters, we will use the `ATProtoResource` we created for Bluesky data and the Dagster-maintained `S3Resource`, which works with R2. Now we can walk through the logic of the function.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" lineStart="41" lineEnd="74"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" startAfter="start_starter_pack_func" endBefore="end_starter_pack_func"/>

Using the `ATProtoResource` we initialize the client and extract the members from the starter pack. That information is stored in R2 at a path defined by the current date. We also update a dynamic partition that is defined outside of this asset. This partition will be used by our next asset.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" lineStart="15" lineEnd="16"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" startAfter="start_dynamic_partition" endBefore="end_dynamic_partition"/>
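
Conceptually, the dynamic partition and the update inside the asset look something like this (the partition name and key expression are assumptions):

```python
import dagster as dg

atproto_did_dynamic_partition = dg.DynamicPartitionsDefinition(
    name="atproto_did_dynamic_partition"
)

# Inside starter_pack_snapshot, each starter-pack member becomes a partition key:
# context.instance.add_dynamic_partitions(
#     partitions_def_name="atproto_did_dynamic_partition",
#     partition_keys=[member.did for member in starter_pack_members],
# )
```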

Finally we set metadata about the file we saved in the Dagster asset catalog.

## Dynamic partitions

The next asset is `actor_feed_snapshot`, where the feed data is ingested. This asset uses the same resources as the `starter_pack_snapshot` asset and has a similar flow. The primary difference is that while `starter_pack_snapshot` had a single static partition, `actor_feed_snapshot` uses a dynamic partition.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" lineStart="75" lineEnd="117"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" startAfter="start_actor_feed_snapshot" endBefore="end_actor_feed_snapshot"/>

This asset will maintain a separate partition and execution for every member of the data starter pack and store a file in R2 at an object path specific to that user.

@@ -52,7 +52,7 @@ One other difference you may have noticed is the `automation_condition`. Because

This is everything we need for ingestion. At the bottom of the file we will set the <PyObject section="definitions" module="dagster" object="Definitions" />. This will contain all the assets and initialized resources.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" lineStart="119" lineEnd="138"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" startAfter="start_def" endBefore="end_def"/>
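
A rough sketch of that ingestion-layer definition, assuming the asset and resource names used in the sketches above (the environment variable names are placeholders; R2 is reached through its S3-compatible endpoint):

```python
import dagster as dg
from dagster_aws.s3 import S3Resource

# starter_pack_snapshot, actor_feed_snapshot, and ATProtoResource
# refer to the earlier sketches on this page.
defs = dg.Definitions(
    assets=[starter_pack_snapshot, actor_feed_snapshot],
    resources={
        "atproto_resource": ATProtoResource(
            login=dg.EnvVar("BSKY_LOGIN"),
            password=dg.EnvVar("BSKY_APP_PASSWORD"),
        ),
        "s3_resource": S3Resource(
            endpoint_url=dg.EnvVar("AWS_ENDPOINT_URL"),  # R2's S3-compatible endpoint
            aws_access_key_id=dg.EnvVar("AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=dg.EnvVar("AWS_SECRET_ACCESS_KEY"),
        ),
    },
)
```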

This definition is just one part of our overall project, but for more complicated projects it can be helpful to define separate definitions devoted to specific domains. You will see the same pattern for the modeling and dashboard layers. All of these definitions will be merged into a final definition at the very end.

12 changes: 6 additions & 6 deletions docs/docs/examples/bluesky/modeling.md
@@ -10,11 +10,11 @@ After running the ingestion assets, we will have all the data we need in R2 to s

The first thing to do is set up our dbt project. We will configure the connection details for the R2 bucket and the DuckDB database in the `profiles.yml` file. We will define two profiles, each with their own schema and path for our dev and production environments.

-<CodeExample path="docs_projects/project_atproto_dashboard/dbt_project/profiles.yml" language="yaml" lineStart="1" lineEnd="27"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/dbt_project/profiles.yml" language="yaml" startAfter="start_profile" endBefore="end_profile"/>

Next we can define the `sources.yml` which will be the foundation for our dbt models. We can use the DuckDB function [read_ndjson_objects](https://duckdb.org/docs/data/json/loading_json.html#functions-for-reading-json-objects) to retrieve all the data in our specific R2 object paths. Even though all the data exists within the same R2 bucket, it can still be mapped into individual tables in DuckDB.

-<CodeExample path="docs_projects/project_atproto_dashboard/dbt_project/models/sources.yml" language="yaml" lineStart="0" lineEnd="11"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/dbt_project/models/sources.yml" language="yaml" startAfter="start_sources" endBefore="end_sources"/>

| DuckDB Table | R2 Path |
| --- | --- |
@@ -25,21 +25,21 @@ Next we can define the `sources.yml` which will be the foundation for our dbt mo

With dbt configured to read our JSON data, we can start to build the models. We will follow dbt conventions and begin with staging models that map to the tables defined in the `sources.yml`. These will be models that extract all the information.

-<CodeExample path="docs_projects/project_atproto_dashboard/dbt_project/models/staging/stg_feed_snapshots.sql" language="sql" lineStart="0" lineEnd="5"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/dbt_project/models/staging/stg_feed_snapshots.sql" language="sql" startAfter="start_stg_feed_snapshots" endBefore="end_stg_feed_snapshots"/>

Within the dbt project the `analysis` directory builds out the rest of the models where more complex metrics such as top daily posts are calculated. For metrics such as latest feeds, we can also leverage how we partitioned the data within our R2 bucket during ingestion to ensure we are using the most up to date posts.

-<CodeExample path="docs_projects/project_atproto_dashboard/dbt_project/models/analysis/latest_feed.sql" language="sql" lineStart="0" lineEnd="17"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/dbt_project/models/analysis/latest_feed.sql" language="sql" startAfter="start_latest_feed_cte" endBefore="end_latest_feed_cte"/>

### dbt assets

Moving back into Dagster, there is not too much we need to do to turn the dbt models into assets. Dagster can parse a dbt project and generate all the assets by using a path to the project directory.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/modeling/definitions.py" language="python" lineStart="8" lineEnd="14"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/modeling/definitions.py" language="python" startAfter="start_dbt_project" endBefore="end_dbt_project"/>
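
A sketch of pointing Dagster at the dbt project; the relative path is illustrative:

```python
from pathlib import Path

from dagster_dbt import DbtProject

dbt_project = DbtProject(
    project_dir=Path(__file__).joinpath("..", "..", "..", "dbt_project").resolve(),
)
dbt_project.prepare_if_dev()  # generate the manifest when developing locally
```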

We will use the `DagsterDbtTranslator` to map our ingestion assets that bring in the Bluesky data to the tables we defined in the `sources.yml`. This will ensure that everything exists as part of the same DAG and lineage within Dagster. Next we will combine the translator and dbt project to generate our Dagster assets.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/modeling/definitions.py" language="python" lineStart="32" lineEnd="46"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/modeling/definitions.py" language="python" startAfter="start_dbt_assets" endBefore="end_dbt_assets"/>
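
A hedged sketch of the translator and the `@dbt_assets` definition; the key-mapping rule is an assumption about how the dbt sources line up with the ingestion assets:

```python
from dagster import AssetExecutionContext, AssetKey
from dagster_dbt import DagsterDbtTranslator, DbtCliResource, dbt_assets


class CustomDagsterDbtTranslator(DagsterDbtTranslator):
    def get_asset_key(self, dbt_resource_props) -> AssetKey:
        # Map dbt sources onto the upstream ingestion assets so the
        # whole pipeline shows up as one lineage graph in Dagster.
        if dbt_resource_props["resource_type"] == "source":
            return AssetKey(f"{dbt_resource_props['name']}_snapshot")  # assumed naming
        return super().get_asset_key(dbt_resource_props)


@dbt_assets(
    manifest=dbt_project.manifest_path,  # from the DbtProject sketch above
    dagster_dbt_translator=CustomDagsterDbtTranslator(),
)
def dbt_bluesky(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()
```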

Like the ingestion layer, we will create a definition specific to dbt and modeling which we will combine with the other layers of our project.

6 changes: 3 additions & 3 deletions docs/docs/examples/bluesky/rate-limiting.md
@@ -8,15 +8,15 @@ sidebar_position: 30

One of the hurdles in getting data from Bluesky is working within the rate limits. Let's go back and look at the `get_all_feed_items` function that extracts feed information. This function uses [tenacity](https://tenacity.readthedocs.io/en/latest/) to handle retries for the function `_get_feed_with_retries` and will back off requests if we begin to hit our limits.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/utils/atproto.py" language="python" lineStart="8" lineEnd="40"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/utils/atproto.py" language="python" startAfter="start_all_feed_items" endBefore="end_all_feed_items"/>
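
The retry wrapper can be sketched roughly as follows; the backoff parameters and the SDK call (`get_author_feed`) are assumptions rather than the project's exact values:

```python
from tenacity import retry, stop_after_attempt, wait_exponential


@retry(
    wait=wait_exponential(multiplier=2, min=4, max=60),  # back off as limits approach
    stop=stop_after_attempt(5),
    reraise=True,
)
def _get_feed_with_retries(client, actor: str, cursor=None):
    # `client` is the atproto client returned by the resource.
    return client.get_author_feed(actor=actor, cursor=cursor, limit=100)
```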

Then if we look at the `actor_feed_snapshot` asset that uses `get_all_feed_items`, you will see one additional parameter in the decorator.

-<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" lineStart="81" lineEnd="82"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/project_atproto_dashboard/ingestion/definitions.py" language="python" startAfter="start_concurrency" endBefore="end_concurrency"/>
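
One way to express this on the asset side is a concurrency-key op tag, sketched below; whether the project uses this exact tag and pool name is an assumption:

```python
import dagster as dg


@dg.asset(
    op_tags={"dagster/concurrency_key": "ingestion_api"},  # hypothetical pool name
)
def actor_feed_snapshot() -> None:
    # Body elided; the matching concurrency limit lives in dagster.yaml.
    ...
```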

This tells the asset to use the concurrency defined in `dagster.yaml`, which is a top-level configuration file for the Dagster instance.

-<CodeExample path="docs_projects/project_atproto_dashboard/dagster.yaml" language="yaml" lineStart="4" lineEnd="6"/>
+<CodeExample path="docs_projects/project_atproto_dashboard/dagster.yaml" language="yaml" startAfter="start_concurrency" endBefore="end_concurrency"/>

We already mentioned that the `actor_feed_snapshot` asset is dynamically partitioned by user feeds. This means that without concurrency controls, all of those partitions would execute in parallel. Given that Bluesky is the limiting factor, and that its client is a resource shared by all of the assets, we want to ensure that only one asset runs at a time. Applying the concurrency control ensures that Dagster does this without requiring additional code in our assets.

4 changes: 2 additions & 2 deletions docs/docs/examples/llm-fine-tuning/feature-engineering.md
@@ -68,13 +68,13 @@ CATEGORIES = [

Using these categories, we can construct a table of the most common genres and select the single best genre for each book (assuming it was shelved that way at least three times). We can then wrap that query in an asset and materialize it as a table alongside our other DuckDB tables:

-<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="64" lineEnd="105"/>
+<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" startAfter="start_book_category" endBefore="end_book_category"/>
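
A simplified sketch of wrapping a query in an asset with the `DuckDBResource`; the table name, columns, and SQL are stand-ins for the real `book_category` logic:

```python
import dagster as dg
from dagster_duckdb import DuckDBResource


@dg.asset(kinds={"duckdb"})
def book_category(duckdb_resource: DuckDBResource) -> None:
    query = """
        CREATE OR REPLACE TABLE book_category AS
        SELECT book_id, genre
        FROM genre_shelvings
        WHERE shelf_count >= 3  -- shelved under that genre at least three times
    """
    with duckdb_resource.get_connection() as conn:
        conn.execute(query)
```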

## Enrichment table

With our `book_category` asset created, we can combine it with the `author` and `graphic_novel` assets to create the final dataset we will use for modeling. Here we will both create the table within DuckDB and select its contents into a DataFrame, which we can pass to our next series of assets:

-<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="107" lineEnd="134"/>
+<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" startAfter="start_enriched_graphic_novels" endBefore="end_enriched_graphic_novels"/>

## Next steps

8 changes: 4 additions & 4 deletions docs/docs/examples/llm-fine-tuning/file-creation.md
@@ -8,11 +8,11 @@ sidebar_position: 40

Using the data we prepared in the [previous step](feature-engineering), we will create two files: a training file and a validation file. A training file provides the model with labeled data to learn patterns, while a validation file evaluates the model's performance on unseen data to prevent overfitting. These will be used in our OpenAI fine-tuning job to create our model. The columnar data from our DuckDB assets needs to be fit into messages that resemble the conversation a user would have with a chatbot. Here we can inject the values of those fields into conversations:

-<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="136" lineEnd="154"/>
+<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" startAfter="start_prompt_record" endBefore="end_prompt_record"/>
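
The shape of one such record follows OpenAI's chat fine-tuning format; the system prompt and column names below are assumptions about `enriched_graphic_novels`:

```python
def create_prompt_record(row: dict) -> dict:
    return {
        "messages": [
            {"role": "system", "content": "You classify graphic novels into a genre."},
            {
                "role": "user",
                "content": (
                    f"Title: {row['title']}\n"
                    f"Author: {row['author_name']}\n"
                    f"Description: {row['description']}"
                ),
            },
            {"role": "assistant", "content": row["category"]},
        ]
    }
```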

The fine-tuning process does not need all the data prepared from `enriched_graphic_novels`. We will simply take a sample of the DataFrame and write it to a `.jsonl` file. The assets to create the training and validation set are very similar (only the filename is different). They will take in the `enriched_graphic_novels` asset, generate the prompts, and write the outputs to a file stored locally:

-<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="156" lineEnd="172"/>
+<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" startAfter="start_training_file" endBefore="end_training_file"/>
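
A sketch of the file-writing step, assuming a pandas DataFrame, the `create_prompt_record` helper sketched above, and placeholder sample size and filename:

```python
import json

import pandas as pd


def write_jsonl(enriched_graphic_novels: pd.DataFrame, path: str = "training.jsonl") -> None:
    sample = enriched_graphic_novels.sample(n=150, random_state=0)  # placeholder sample size
    with open(path, "w") as f:
        for row in sample.to_dict(orient="records"):
            f.write(json.dumps(create_prompt_record(row)) + "\n")
```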

:::note

@@ -30,11 +30,11 @@ Looking at this notebook. This would make a great asset check. Asset checks help

Since we want an asset check for both the training and validation files, we will write a general function that contains the logic from the cookbook:

-<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="192" lineEnd="237"/>
+<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" startAfter="start_file_validation" endBefore="end_file_validation"/>

This looks like any other Python function, except it returns an `AssetCheckResult`, which is what Dagster uses to store the output of the asset check. Now we can use that function to create asset checks directly tied to our file assets. Again, they look similar to assets, except they use the `asset_check` decorator:

-<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="239" lineEnd="249"/>
+<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" startAfter="start_asset_check" endBefore="end_asset_check"/>
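
A sketch of one such check, assuming a hypothetical `validate_fine_tuning_file` helper that implements the cookbook logic and returns a list of problems:

```python
import dagster as dg


@dg.asset_check(asset="training_file")
def training_file_format_check() -> dg.AssetCheckResult:
    errors = validate_fine_tuning_file("training.jsonl")  # hypothetical helper
    return dg.AssetCheckResult(
        passed=not errors,
        metadata={"errors": dg.MetadataValue.json(errors)},
    )
```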

## Next steps

2 changes: 1 addition & 1 deletion docs/docs/examples/llm-fine-tuning/ingestion.md
@@ -12,7 +12,7 @@ Since the data is normalized across these two files, we will want to combine the

We will start by creating two Dagster assets to load the data. Each asset will load one of the files and create a DuckDB table (`graphic_novels` and `authors`). The assets will use the Dagster `DuckDBResource`, which gives us an easy way to interact with and run queries in DuckDB. Each asset creates its table from the respective JSON file:

-<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" lineStart="22" lineEnd="41"/>
+<CodeExample path="docs_projects/project_llm_fine_tune/project_llm_fine_tune/assets.py" language="python" startAfter="start_graphic_novel" endBefore="end_graphic_novel"/>

Now that the base tables are loaded, we can move on to working with the data.


1 comment on commit 5b8fb36

github-actions bot commented on 5b8fb36, Feb 20, 2025


Deploy preview for dagster-docs ready!

✅ Preview
https://dagster-docs-dwp6kca4l-elementl.vercel.app

Built with commit 5b8fb36.
This pull request is being automatically deployed with vercel-action
