-
Notifications
You must be signed in to change notification settings - Fork 16k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
25 changed files
with
3,829 additions
and
6 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,292 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "1f3a5ebf", | ||
"metadata": {}, | ||
"source": [ | ||
"# AirbyteLoader" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e", | ||
"metadata": {}, | ||
"source": [ | ||
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n", | ||
"\n", | ||
"This covers how to load any source from Airbyte into LangChain documents\n", | ||
"\n", | ||
"## Installation\n", | ||
"\n", | ||
"In order to use `AirbyteLoader` you need to install the `langchain-airbyte` integration package." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "180c8b74", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"% pip install -qU langchain-airbyte" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "3dd92c62", | ||
"metadata": {}, | ||
"source": [ | ||
"## Loading Documents\n", | ||
"\n", | ||
"By default, the `AirbyteLoader` will load any structured data from a stream and output yaml-formatted documents." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "721d9316", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"```yaml\n", | ||
"academic_degree: PhD\n", | ||
"address:\n", | ||
" city: Lauderdale Lakes\n", | ||
" country_code: FI\n", | ||
" postal_code: '75466'\n", | ||
" province: New Jersey\n", | ||
" state: Hawaii\n", | ||
" street_name: Stoneyford\n", | ||
" street_number: '1112'\n", | ||
"age: 44\n", | ||
"blood_type: \"O\\u2212\"\n", | ||
"created_at: '2004-04-02T13:05:27+00:00'\n", | ||
"email: [email protected]\n", | ||
"gender: Fluid\n", | ||
"height: '1.62'\n", | ||
"id: 1\n", | ||
"language: Belarusian\n", | ||
"name: Moses\n", | ||
"nationality: Dutch\n", | ||
"occupation: Track Worker\n", | ||
"telephone: 1-467-194-2318\n", | ||
"title: M.Sc.Tech.\n", | ||
"updated_at: '2024-02-27T16:41:01+00:00'\n", | ||
"weight: 6\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from langchain_airbyte import AirbyteLoader\n", | ||
"\n", | ||
"loader = AirbyteLoader(\n", | ||
" source=\"source-faker\",\n", | ||
" stream=\"users\",\n", | ||
" config={\"count\": 10},\n", | ||
")\n", | ||
"docs = loader.load()\n", | ||
"print(docs[0].page_content[:500])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "fca024cb", | ||
"metadata": { | ||
"scrolled": true | ||
}, | ||
"source": [ | ||
"You can also specify a custom prompt template for formatting documents:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "9fa002a5", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"My name is Verdie and I am 1.73 meters tall.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"from langchain_core.prompts import PromptTemplate\n", | ||
"\n", | ||
"loader_templated = AirbyteLoader(\n", | ||
" source=\"source-faker\",\n", | ||
" stream=\"users\",\n", | ||
" config={\"count\": 10},\n", | ||
" template=PromptTemplate.from_template(\n", | ||
" \"My name is {name} and I am {height} meters tall.\"\n", | ||
" ),\n", | ||
")\n", | ||
"docs_templated = loader_templated.load()\n", | ||
"print(docs_templated[0].page_content)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d3e6d887", | ||
"metadata": {}, | ||
"source": [ | ||
"## Lazy Loading Documents\n", | ||
"\n", | ||
"One of the powerful features of `AirbyteLoader` is its ability to load large documents from upstream sources. When working with large datasets, the default `.load()` behavior can be slow and memory-intensive. To avoid this, you can use the `.lazy_load()` method to load documents in a more memory-efficient manner." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 11, | ||
"id": "684b9187", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Just calling lazy load is quick! This took 0.0001 seconds\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import time\n", | ||
"\n", | ||
"loader = AirbyteLoader(\n", | ||
" source=\"source-faker\",\n", | ||
" stream=\"users\",\n", | ||
" config={\"count\": 3},\n", | ||
" template=PromptTemplate.from_template(\n", | ||
" \"My name is {name} and I am {height} meters tall.\"\n", | ||
" ),\n", | ||
")\n", | ||
"\n", | ||
"start_time = time.time()\n", | ||
"my_iterator = loader.lazy_load()\n", | ||
"print(\n", | ||
" f\"Just calling lazy load is quick! This took {time.time() - start_time:.4f} seconds\"\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "6b24a64b", | ||
"metadata": {}, | ||
"source": [ | ||
"And you can iterate over documents as they're yielded:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"id": "3e8355d0", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"My name is Andera and I am 1.91 meters tall.\n", | ||
"My name is Jody and I am 1.85 meters tall.\n", | ||
"My name is Zonia and I am 1.53 meters tall.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"for doc in my_iterator:\n", | ||
" print(doc.page_content)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d1040d81", | ||
"metadata": {}, | ||
"source": [ | ||
"You can also lazy load documents in an async manner with `.alazy_load()`:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 13, | ||
"id": "dc5d0911", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"My name is Carmelina and I am 1.74 meters tall.\n", | ||
"My name is Ali and I am 1.90 meters tall.\n", | ||
"My name is Rochell and I am 1.83 meters tall.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"loader = AirbyteLoader(\n", | ||
" source=\"source-faker\",\n", | ||
" stream=\"users\",\n", | ||
" config={\"count\": 3},\n", | ||
" template=PromptTemplate.from_template(\n", | ||
" \"My name is {name} and I am {height} meters tall.\"\n", | ||
" ),\n", | ||
")\n", | ||
"\n", | ||
"my_async_iterator = loader.alazy_load()\n", | ||
"\n", | ||
"async for doc in my_async_iterator:\n", | ||
" print(doc.page_content)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ba4ede33", | ||
"metadata": {}, | ||
"source": [ | ||
"## Configuration\n", | ||
"\n", | ||
"`AirbyteLoader` can be configured with the following options:\n", | ||
"\n", | ||
"- `source` (str, required): The name of the Airbyte source to load from.\n", | ||
"- `stream` (str, required): The name of the stream to load from (Airbyte sources can return multiple streams)\n", | ||
"- `config` (dict, required): The configuration for the Airbyte source\n", | ||
"- `template` (PromptTemplate, optional): A custom prompt template for formatting documents\n", | ||
"- `include_metadata` (bool, optional, default True): Whether to include all fields as metadata in the output documents\n", | ||
"\n", | ||
"The majority of the configuration will be in `config`, and you can find the specific configuration options in the \"Config field reference\" for each source in the [Airbyte documentation](https://docs.airbyte.com/integrations/)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2e2ed269", | ||
"metadata": {}, | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.4" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
__pycache__ |
Oops, something went wrong.