diff --git a/docs/docs/concepts.mdx b/docs/docs/concepts.mdx
deleted file mode 100644
index 136433240b364..0000000000000
--- a/docs/docs/concepts.mdx
+++ /dev/null
@@ -1,1391 +0,0 @@
-# Conceptual guide
-
-import ThemedImage from '@theme/ThemedImage';
-import useBaseUrl from '@docusaurus/useBaseUrl';
-
-This section contains introductions to key parts of LangChain.
-
-## Architecture
-
-LangChain as a framework consists of a number of packages.
-
-### `langchain-core`
-This package contains base abstractions of different components and ways to compose them together.
-The interfaces for core components like LLMs, vector stores, retrievers and more are defined here.
-No third party integrations are defined here.
-The dependencies are kept purposefully very lightweight.
-
-### `langchain`
-
-The main `langchain` package contains chains, agents, and retrieval strategies that make up an application's cognitive architecture.
-These are NOT third party integrations.
-All chains, agents, and retrieval strategies here are NOT specific to any one integration, but rather generic across all integrations.
-
-### `langchain-community`
-
-This package contains third party integrations that are maintained by the LangChain community.
-Key partner packages are separated out (see below).
-This contains all integrations for various components (LLMs, vector stores, retrievers).
-All dependencies in this package are optional to keep the package as lightweight as possible.
-
-### Partner packages
-
-While the long tail of integrations is in `langchain-community`, we split popular integrations into their own packages (e.g. `langchain-openai`, `langchain-anthropic`, etc).
-This was done in order to improve support for these important integrations.
-
-### [`langgraph`](https://langchain-ai.github.io/langgraph)
-
-`langgraph` is an extension of `langchain` aimed at
-building robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph.
-
-LangGraph exposes high level interfaces for creating common types of agents, as well as a low-level API for composing custom flows.
-
-### [`langserve`](/docs/langserve)
-
-A package to deploy LangChain chains as REST APIs. Makes it easy to get a production ready API up and running.
-
-### [LangSmith](https://docs.smith.langchain.com)
-
-A developer platform that lets you debug, test, evaluate, and monitor LLM applications.
-
-
-
-## LangChain Expression Language (LCEL)
-
-
-`LangChain Expression Language`, or `LCEL`, is a declarative way to chain LangChain components.
-LCEL was designed from day 1 to **support putting prototypes in production, with no code changes**, from the simplest “prompt + LLM” chain to the most complex chains (we’ve seen folks successfully run LCEL chains with 100s of steps in production). To highlight a few of the reasons you might want to use LCEL:
-
-- **First-class streaming support:**
-When you build your chains with LCEL you get the best possible time-to-first-token (time elapsed until the first chunk of output comes out). For some chains this means, e.g., that we stream tokens straight from an LLM to a streaming output parser, and you get back parsed, incremental chunks of output at the same rate as the LLM provider outputs the raw tokens.
-
-- **Async support:**
-Any chain built with LCEL can be called both with the synchronous API (e.g. in your Jupyter notebook while prototyping) and with the asynchronous API (e.g. in a [LangServe](/docs/langserve/) server). This enables using the same code for prototypes and in production, with great performance, and the ability to handle many concurrent requests in the same server.
-
-- **Optimized parallel execution:**
-Whenever your LCEL chains have steps that can be executed in parallel (e.g. if you fetch documents from multiple retrievers), we automatically do it, both in the sync and the async interfaces, for the smallest possible latency.
-
-- **Retries and fallbacks:**
-Configure retries and fallbacks for any part of your LCEL chain. This is a great way to make your chains more reliable at scale. We’re currently working on adding streaming support for retries/fallbacks, so you can get the added reliability without any latency cost.
-
-- **Access intermediate results:**
-For more complex chains it’s often very useful to access the results of intermediate steps even before the final output is produced. This can be used to let end-users know something is happening, or even just to debug your chain. You can stream intermediate results, and it’s available on every [LangServe](/docs/langserve) server.
-
-- **Input and output schemas:**
-Input and output schemas give every LCEL chain Pydantic and JSONSchema schemas inferred from the structure of your chain. This can be used for validation of inputs and outputs, and is an integral part of LangServe.
-
-- [**Seamless LangSmith tracing**](https://docs.smith.langchain.com)
-As your chains get more and more complex, it becomes increasingly important to understand what exactly is happening at every step.
-With LCEL, **all** steps are automatically logged to [LangSmith](https://docs.smith.langchain.com/) for maximum observability and debuggability.
-
-LCEL aims to provide consistency around behavior and customization over legacy subclassed chains such as `LLMChain` and
-`ConversationalRetrievalChain`. Many of these legacy chains hide important details like prompts, and as a wider variety
-of viable models emerge, customization has become more and more important.
-
-If you are currently using one of these legacy chains, please see [this guide on how to migrate](/docs/versions/migrating_chains).
-
-For guides on how to do specific tasks with LCEL, check out [the relevant how-to guides](/docs/how_to/#langchain-expression-language-lcel).
-
-### Runnable interface
-
-
-To make it as easy as possible to create custom chains, we've implemented a ["Runnable"](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable) protocol. Many LangChain components implement the `Runnable` protocol, including chat models, LLMs, output parsers, retrievers, prompt templates, and more. There are also several useful primitives for working with runnables, which you can read about below.
-
-This is a standard interface, which makes it easy to define custom chains as well as invoke them in a standard way.
-The standard interface includes:
-
-- `stream`: stream back chunks of the response
-- `invoke`: call the chain on an input
-- `batch`: call the chain on a list of inputs
-
-These also have corresponding async methods that should be used with [asyncio](https://docs.python.org/3/library/asyncio.html) `await` syntax for concurrency:
-
-- `astream`: stream back chunks of the response async
-- `ainvoke`: call the chain on an input async
-- `abatch`: call the chain on a list of inputs async
-- `astream_log`: stream back intermediate steps as they happen, in addition to the final response
-- `astream_events`: **beta** stream events as they happen in the chain (introduced in `langchain-core` 0.1.14)
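-
-To make the shape of this interface concrete, here is a minimal plain-Python sketch. `MiniRunnable` is a hypothetical stand-in written for illustration, not LangChain's `Runnable` class; it only mimics `invoke`, `batch`, `stream`, and composition with `|`, and its `stream` yields the whole result as a single chunk.
-
-```python
-from typing import Any, Callable, Iterator, List
-
-
-class MiniRunnable:
-    """Hypothetical stand-in for the Runnable protocol (not LangChain's class)."""
-
-    def __init__(self, func: Callable[[Any], Any]) -> None:
-        self.func = func
-
-    def invoke(self, value: Any) -> Any:
-        # Call the wrapped step on a single input.
-        return self.func(value)
-
-    def batch(self, values: List[Any]) -> List[Any]:
-        # Call the wrapped step on each input in a list.
-        return [self.invoke(value) for value in values]
-
-    def stream(self, value: Any) -> Iterator[Any]:
-        # Simplification: this sketch emits the full result as one chunk.
-        yield self.invoke(value)
-
-    def __or__(self, other: "MiniRunnable") -> "MiniRunnable":
-        # `a | b` builds a new step that feeds the output of a into b.
-        return MiniRunnable(lambda value: other.invoke(self.invoke(value)))
-
-
-prompt = MiniRunnable(lambda inputs: f"Tell me a joke about {inputs['topic']}")
-shout = MiniRunnable(str.upper)
-chain = prompt | shout
-
-single = chain.invoke({"topic": "cats"})
-many = chain.batch([{"topic": "cats"}, {"topic": "dogs"}])
-chunks = list(chain.stream({"topic": "cats"}))
-```
-
-Because every step exposes the same methods, the composed `chain` can be called exactly like its parts, which is the property the standard interface is meant to provide.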
-
-The **input type** and **output type** vary by component:
-
-| Component | Input Type | Output Type |
-|--------------|-------------------------------------------------------|-----------------------|
-| Prompt | Dictionary | PromptValue |
-| ChatModel | Single string, list of chat messages or a PromptValue | ChatMessage |
-| LLM | Single string, list of chat messages or a PromptValue | String |
-| OutputParser | The output of an LLM or ChatModel | Depends on the parser |
-| Retriever | Single string | List of Documents |
-| Tool | Single string or dictionary, depending on the tool | Depends on the tool |
-
-
-All runnables expose input and output **schemas** to inspect the inputs and outputs:
-- `input_schema`: an input Pydantic model auto-generated from the structure of the Runnable
-- `output_schema`: an output Pydantic model auto-generated from the structure of the Runnable
-
-## Components
-
-LangChain provides standard, extendable interfaces and external integrations for various components useful for building with LLMs.
-Some components LangChain implements, some components we rely on third-party integrations for, and others are a mix.
-
-### Chat models
-
-
-Language models that use a sequence of messages as inputs and return chat messages as outputs (as opposed to using plain text).
-These tend to be newer models (older models are generally `LLMs`, see below).
-Chat models support the assignment of distinct roles to conversation messages, helping to distinguish messages from the AI, users, and instructions such as system messages.
-
-Although the underlying models are messages in, message out, the LangChain wrappers also allow these models to take a string as input. This means you can easily use chat models in place of LLMs.
-
-When a string is passed in as input, it is converted to a `HumanMessage` and then passed to the underlying model.
-
-LangChain does not host any Chat Models, rather we rely on third party integrations.
-
-We have some standardized parameters when constructing ChatModels:
-- `model`: the name of the model
-- `temperature`: the sampling temperature
-- `timeout`: request timeout
-- `max_tokens`: max tokens to generate
-- `stop`: default stop sequences
-- `max_retries`: max number of times to retry requests
-- `api_key`: API key for the model provider
-- `base_url`: endpoint to send requests to
-
-Some important things to note:
-- Standard params only apply to model providers that expose parameters with the intended functionality. For example, some providers do not expose a configuration for maximum output tokens, so `max_tokens` can't be supported on these.
-- Standard params are currently only enforced on integrations that have their own integration packages (e.g. `langchain-openai`, `langchain-anthropic`, etc.); they're not enforced on models in `langchain-community`.
-
-ChatModels also accept other parameters that are specific to that integration. To find all the parameters supported by a ChatModel head to the API reference for that model.
-
-:::important
-Some chat models have been fine-tuned for **tool calling** and provide a dedicated API for it.
-Generally, such models are better at tool calling than non-fine-tuned models, and are recommended for use cases that require tool calling.
-Please see the [tool calling section](/docs/concepts/#functiontool-calling) for more information.
-:::
-
-For specifics on how to use chat models, see the [relevant how-to guides here](/docs/how_to/#chat-models).
-
-#### Multimodality
-
-Some chat models are multimodal, accepting images, audio and even video as inputs. These are still less common, meaning model providers haven't standardized on the "best" way to define the API. Multimodal **outputs** are even less common. As such, we've kept our multimodal abstractions fairly lightweight and plan to further solidify the multimodal APIs and interaction patterns as the field matures.
-
-In LangChain, most chat models that support multimodal inputs also accept those values in OpenAI's content blocks format. So far this is restricted to image inputs. For models like Gemini which support video and other bytes input, the APIs also support the native, model-specific representations.
-
-For specifics on how to use multimodal models, see the [relevant how-to guides here](/docs/how_to/#multimodal).
-
-For a full list of LangChain model providers with multimodal models, [check out this table](/docs/integrations/chat/#advanced-features).
-
-### LLMs
-
-
-:::caution
-Pure text-in/text-out LLMs tend to be older or lower-level. Many new popular models are best used as [chat completion models](/docs/concepts/#chat-models),
-even for non-chat use cases.
-
-You are probably looking for [the section above instead](/docs/concepts/#chat-models).
-:::
-
-Language models that take a string as input and return a string.
-These tend to be older models (newer models generally are [Chat Models](/docs/concepts/#chat-models), see above).
-
-Although the underlying models are string in, string out, the LangChain wrappers also allow these models to take messages as input.
-This gives them the same interface as [Chat Models](/docs/concepts/#chat-models).
-When messages are passed in as input, they will be formatted into a string under the hood before being passed to the underlying model.
-
-LangChain does not host any LLMs, rather we rely on third party integrations.
-
-For specifics on how to use LLMs, see the [how-to guides](/docs/how_to/#llms).
-
-### Messages
-
-Some language models take a list of messages as input and return a message.
-There are a few different types of messages.
-All messages have `role`, `content`, and `response_metadata` properties.
-
-The `role` describes WHO is saying the message. The standard roles are "user", "assistant", "system", and "tool".
-LangChain has different message classes for different roles.
-
-The `content` property describes the content of the message.
-This can be a few different things:
-
-- A string (most models deal with this type of content)
-- A list of dictionaries (this is used for multimodal input, where each dictionary contains information about the input type and the input location)
-
-Optionally, messages can have a `name` property which allows for differentiating between multiple speakers with the same role.
-For example, if there are two users in the chat history it can be useful to differentiate between them. Not all models support this.
-
-#### HumanMessage
-
-This represents a message with role "user".
-
-#### AIMessage
-
-This represents a message with role "assistant". In addition to the `content` property, these messages also have:
-
-**`response_metadata`**
-
-The `response_metadata` property contains additional metadata about the response. The data here is often specific to each model provider.
-This is where information like log-probs and token usage may be stored.
-
-**`tool_calls`**
-
-These represent a decision from a language model to call a tool. They are included as part of an `AIMessage` output.
-They can be accessed from there with the `.tool_calls` property.
-
-This property returns a list of `ToolCall`s. A `ToolCall` is a dictionary with the following keys:
-
-- `name`: The name of the tool that should be called.
-- `args`: The arguments to that tool.
-- `id`: The id of that tool call.
-
-#### SystemMessage
-
-This represents a message with role "system", which tells the model how to behave. Not every model provider supports this.
-
-#### ToolMessage
-
-This represents a message with role "tool", which contains the result of calling a tool. In addition to `role` and `content`, this message has:
-
-- a `tool_call_id` field which conveys the id of the call to the tool that was called to produce this result.
-- an `artifact` field which can be used to pass along arbitrary artifacts of the tool execution which are useful to track but which should not be sent to the model.
-
-With most chat models, a `ToolMessage` can only appear in the chat history after an `AIMessage` that has a populated `tool_calls` field.
-
-#### (Legacy) FunctionMessage
-
-This is a legacy message type, corresponding to OpenAI's legacy function-calling API. `ToolMessage` should be used instead to correspond to the updated tool-calling API.
-
-This represents the result of a function call. In addition to `role` and `content`, this message has a `name` parameter which conveys the name of the function that was called to produce this result.
-
-
-### Prompt templates
-
-
-Prompt templates help to translate user input and parameters into instructions for a language model.
-This can be used to guide a model's response, helping it understand the context and generate relevant and coherent language-based output.
-
-Prompt Templates take as input a dictionary, where each key represents a variable in the prompt template to fill in.
-
-Prompt Templates output a PromptValue. This PromptValue can be passed to an LLM or a ChatModel, and can also be cast to a string or a list of messages.
-The reason this PromptValue exists is to make it easy to switch between strings and messages.
-
-There are a few different types of prompt templates:
-
-#### String PromptTemplates
-
-These prompt templates are used to format a single string, and generally are used for simpler inputs.
-For example, a common way to construct and use a PromptTemplate is as follows:
-
-```python
-from langchain_core.prompts import PromptTemplate
-
-prompt_template = PromptTemplate.from_template("Tell me a joke about {topic}")
-
-prompt_template.invoke({"topic": "cats"})
-```
-
-#### ChatPromptTemplates
-
-These prompt templates are used to format a list of messages. These "templates" consist of a list of templates themselves.
-For example, a common way to construct and use a ChatPromptTemplate is as follows:
-
-```python
-from langchain_core.prompts import ChatPromptTemplate
-
-prompt_template = ChatPromptTemplate.from_messages([
- ("system", "You are a helpful assistant"),
- ("user", "Tell me a joke about {topic}")
-])
-
-prompt_template.invoke({"topic": "cats"})
-```
-
-In the above example, this ChatPromptTemplate will construct two messages when called.
-The first is a system message that has no variables to format.
-The second is a HumanMessage, and will be formatted by the `topic` variable the user passes in.
-
-#### MessagesPlaceholder
-
-
-This prompt template is responsible for adding a list of messages in a particular place.
-In the above ChatPromptTemplate, we saw how we could format two messages, each one a string.
-But what if we wanted the user to pass in a list of messages that we would slot into a particular spot?
-This is how you use MessagesPlaceholder.
-
-```python
-from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
-from langchain_core.messages import HumanMessage
-
-prompt_template = ChatPromptTemplate.from_messages([
- ("system", "You are a helpful assistant"),
- MessagesPlaceholder("msgs")
-])
-
-prompt_template.invoke({"msgs": [HumanMessage(content="hi!")]})
-```
-
-This will produce a list of two messages, the first one being a system message, and the second one being the HumanMessage we passed in.
-If we had passed in 5 messages, then it would have produced 6 messages in total (the system message plus the 5 passed in).
-This is useful for letting a list of messages be slotted into a particular spot.
-
-An alternative way to accomplish the same thing without using the `MessagesPlaceholder` class explicitly is:
-
-```python
-prompt_template = ChatPromptTemplate.from_messages([
- ("system", "You are a helpful assistant"),
- ("placeholder", "{msgs}") # <-- This is the changed part
-])
-```
-
-For specifics on how to use prompt templates, see the [relevant how-to guides here](/docs/how_to/#prompt-templates).
-
-### Example selectors
-One common prompting technique for achieving better performance is to include examples as part of the prompt.
-This is known as [few-shot prompting](/docs/concepts/#few-shot-prompting).
-This gives the language model concrete examples of how it should behave.
-Sometimes these examples are hardcoded into the prompt, but for more advanced situations it may be nice to dynamically select them.
-Example Selectors are classes responsible for selecting and then formatting examples into prompts.
-
-For specifics on how to use example selectors, see the [relevant how-to guides here](/docs/how_to/#example-selectors).
-
-### Output parsers
-
-
-:::note
-
-The information here refers to parsers that take a text output from a model and try to parse it into a more structured representation.
-More and more models are supporting function (or tool) calling, which handles this automatically.
-It is recommended to use function/tool calling rather than output parsing.
-See documentation for that [here](/docs/concepts/#functiontool-calling).
-
-:::
-
-An output parser is responsible for taking the output of a model and transforming it into a more suitable format for downstream tasks.
-Output parsers are useful when you are using LLMs to generate structured data, or to normalize output from chat models and LLMs.
-
-LangChain supports many different types of output parsers. The table below lists them, with the following information for each:
-
-- **Name**: The name of the output parser
-- **Supports Streaming**: Whether the output parser supports streaming.
-- **Has Format Instructions**: Whether the output parser has format instructions. This is generally available except when (a) the desired schema is not specified in the prompt but rather in other parameters (like OpenAI function calling), or (b) when the OutputParser wraps another OutputParser.
-- **Calls LLM**: Whether this output parser itself calls an LLM. This is usually only done by output parsers that attempt to correct misformatted output.
-- **Input Type**: Expected input type. Most output parsers work on both strings and messages, but some (like OpenAI Functions) need a message with specific kwargs.
-- **Output Type**: The output type of the object returned by the parser.
-- **Description**: Our commentary on this output parser and when to use it.
-
-| Name | Supports Streaming | Has Format Instructions | Calls LLM | Input Type | Output Type | Description |
-|-----------------|--------------------|-------------------------------|-----------|----------------------------------|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [JSON](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.json.JsonOutputParser.html#langchain_core.output_parsers.json.JsonOutputParser) | ✅ | ✅ | | `str` \| `Message` | JSON object | Returns a JSON object as specified. You can specify a Pydantic model and it will return JSON for that model. Probably the most reliable output parser for getting structured data that does NOT use function calling. |
-| [XML](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.xml.XMLOutputParser.html#langchain_core.output_parsers.xml.XMLOutputParser) | ✅ | ✅ | | `str` \| `Message` | `dict` | Returns a dictionary of tags. Use when XML output is needed. Use with models that are good at writing XML (like Anthropic's). |
-| [CSV](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.list.CommaSeparatedListOutputParser.html#langchain_core.output_parsers.list.CommaSeparatedListOutputParser) | ✅ | ✅ | | `str` \| `Message` | `List[str]` | Returns a list of comma separated values. |
-| [OutputFixing](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.fix.OutputFixingParser.html#langchain.output_parsers.fix.OutputFixingParser) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the error message and the bad output to an LLM and ask it to fix the output. |
-| [RetryWithError](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.retry.RetryWithErrorOutputParser.html#langchain.output_parsers.retry.RetryWithErrorOutputParser) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the original inputs, the bad output, and the error message to an LLM and ask it to fix it. Compared to OutputFixingParser, this one also sends the original instructions. |
-| [Pydantic](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.pydantic.PydanticOutputParser.html#langchain_core.output_parsers.pydantic.PydanticOutputParser) | | ✅ | | `str` \| `Message` | `pydantic.BaseModel` | Takes a user defined Pydantic model and returns data in that format. |
-| [YAML](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.yaml.YamlOutputParser.html#langchain.output_parsers.yaml.YamlOutputParser) | | ✅ | | `str` \| `Message` | `pydantic.BaseModel` | Takes a user defined Pydantic model and returns data in that format. Uses YAML to encode it. |
-| [PandasDataFrame](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.pandas_dataframe.PandasDataFrameOutputParser.html#langchain.output_parsers.pandas_dataframe.PandasDataFrameOutputParser) | | ✅ | | `str` \| `Message` | `dict` | Useful for doing operations with pandas DataFrames. |
-| [Enum](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.enum.EnumOutputParser.html#langchain.output_parsers.enum.EnumOutputParser) | | ✅ | | `str` \| `Message` | `Enum` | Parses response into one of the provided enum values. |
-| [Datetime](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.datetime.DatetimeOutputParser.html#langchain.output_parsers.datetime.DatetimeOutputParser) | | ✅ | | `str` \| `Message` | `datetime.datetime` | Parses response into a `datetime.datetime` object. |
-| [Structured](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.structured.StructuredOutputParser.html#langchain.output_parsers.structured.StructuredOutputParser) | | ✅ | | `str` \| `Message` | `Dict[str, str]` | An output parser that returns structured information. It is less powerful than other output parsers since it only allows for fields to be strings. This can be useful when you are working with smaller LLMs. |
-
-For specifics on how to use output parsers, see the [relevant how-to guides here](/docs/how_to/#output-parsers).
-
-### Chat history
-Most LLM applications have a conversational interface.
-An essential component of a conversation is being able to refer to information introduced earlier in the conversation.
-At bare minimum, a conversational system should be able to access some window of past messages directly.
-
-The concept of `ChatHistory` refers to a class in LangChain which can be used to wrap an arbitrary chain.
-This `ChatHistory` will keep track of inputs and outputs of the underlying chain, and append them as messages to a message database.
-Future interactions will then load those messages and pass them into the chain as part of the input.
-
-### Documents
-
-
-A Document object in LangChain contains information about some data. It has two attributes:
-
-- `page_content: str`: The content of this document. Currently this can only be a string.
-- `metadata: dict`: Arbitrary metadata associated with this document. Can track the document id, file name, etc.
-
-### Document loaders
-
-
-These classes load Document objects. LangChain has hundreds of integrations with various data sources to load data from: Slack, Notion, Google Drive, etc.
-
-Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the `.load` method.
-An example use case is as follows:
-
-```python
-from langchain_community.document_loaders.csv_loader import CSVLoader
-
-loader = CSVLoader(
- ... # <-- Integration specific parameters here
-)
-data = loader.load()
-```
-
-For specifics on how to use document loaders, see the [relevant how-to guides here](/docs/how_to/#document-loaders).
-
-### Text splitters
-
-Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window. LangChain has a number of built-in document transformers that make it easy to split, combine, filter, and otherwise manipulate documents.
-
-When you want to deal with long pieces of text, it is necessary to split up that text into chunks. As simple as this sounds, there is a lot of potential complexity here. Ideally, you want to keep the semantically related pieces of text together. What "semantically related" means can depend on the type of text, so there are several ways to split it.
-
-At a high level, text splitters work as follows:
-
-1. Split the text up into small, semantically meaningful chunks (often sentences).
-2. Start combining these small chunks into a larger chunk until you reach a certain size (as measured by some function).
-3. Once you reach that size, make that chunk its own piece of text and then start creating a new chunk of text with some overlap (to keep context between chunks).
-
-That means there are two different axes along which you can customize your text splitter:
-
-1. How the text is split
-2. How the chunk size is measured
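-
-The three steps above can be sketched in plain Python. This is an illustrative sketch, not LangChain's splitter implementation: the function name `split_text`, the regex-based sentence split, and measuring overlap in whole sentences are all assumptions made for the example. The two customization axes show up as the sentence-splitting line and the `length_fn` parameter.
-
-```python
-import re
-from typing import Callable, List
-
-
-def split_text(
-    text: str,
-    chunk_size: int = 60,
-    chunk_overlap: int = 1,
-    length_fn: Callable[[str], int] = len,
-) -> List[str]:
-    """Greedy sentence-based splitter (illustrative only)."""
-    # Step 1: split into small, semantically meaningful pieces (sentences).
-    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
-
-    chunks: List[str] = []
-    current: List[str] = []
-    for sentence in sentences:
-        candidate = " ".join(current + [sentence])
-        # Step 2: keep combining pieces until the size limit would be exceeded.
-        if current and length_fn(candidate) > chunk_size:
-            # Step 3: emit the chunk, then carry the last sentences over as overlap.
-            chunks.append(" ".join(current))
-            current = current[-chunk_overlap:] if chunk_overlap > 0 else []
-        current.append(sentence)
-    if current:
-        chunks.append(" ".join(current))
-    return chunks
-
-
-text = "Cats sleep a lot. Dogs like to play. Birds can fly. Fish live in water."
-chunks = split_text(text, chunk_size=40, chunk_overlap=1)
-```
-
-With these inputs each chunk shares one sentence with its neighbor, which is the overlap that keeps context between chunks. Passing a token-counting function as `length_fn` would measure chunk size in tokens instead of characters.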
-
-For specifics on how to use text splitters, see the [relevant how-to guides here](/docs/how_to/#text-splitters).
-
-### Embedding models
-
-
-Embedding models create a vector representation of a piece of text. You can think of a vector as an array of numbers that captures the semantic meaning of the text.
-By representing the text in this way, you can perform mathematical operations that allow you to do things like search for other pieces of text that are most similar in meaning.
-These natural language search capabilities underpin many types of [context retrieval](/docs/concepts/#retrieval),
-where we provide an LLM with the relevant data it needs to effectively respond to a query.
-
-![](/img/embeddings.png)
-
-The `Embeddings` class is a class designed for interfacing with text embedding models. There are many different embedding model providers (OpenAI, Cohere, Hugging Face, etc) and local models, and this class is designed to provide a standard interface for all of them.
-
-The base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).
-
-For specifics on how to use embedding models, see the [relevant how-to guides here](/docs/how_to/#embedding-models).
-
-### Vector stores
-
-
-One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors,
-and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query.
-A vector store takes care of storing embedded data and performing vector search for you.
-
-Most vector stores can also store metadata about embedded vectors and support filtering on that metadata before
-similarity search, allowing you more control over returned documents.
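-
-As a rough illustration of that flow, the sketch below stores vectors with metadata, filters on the metadata, and then ranks by cosine similarity. `ToyVectorStore` is a hypothetical class, the vectors are written by hand rather than produced by an embedding model, and cosine similarity is only one possible similarity measure; real vector stores typically use specialized indexes.
-
-```python
-import math
-from typing import Dict, List, Optional, Tuple
-
-Vector = List[float]
-
-
-def cosine_similarity(a: Vector, b: Vector) -> float:
-    # Assumes neither vector is all zeros.
-    dot = sum(x * y for x, y in zip(a, b))
-    norm_a = math.sqrt(sum(x * x for x in a))
-    norm_b = math.sqrt(sum(y * y for y in b))
-    return dot / (norm_a * norm_b)
-
-
-class ToyVectorStore:
-    """Hypothetical in-memory store, for illustration only."""
-
-    def __init__(self) -> None:
-        self._rows: List[Tuple[Vector, str, Dict[str, str]]] = []
-
-    def add(self, vector: Vector, text: str, metadata: Dict[str, str]) -> None:
-        self._rows.append((vector, text, metadata))
-
-    def search(
-        self, query: Vector, k: int = 1, where: Optional[Dict[str, str]] = None
-    ) -> List[str]:
-        # Filter on metadata first, then rank the survivors by similarity.
-        candidates = [
-            row
-            for row in self._rows
-            if where is None
-            or all(row[2].get(key) == value for key, value in where.items())
-        ]
-        ranked = sorted(
-            candidates, key=lambda row: cosine_similarity(query, row[0]), reverse=True
-        )
-        return [text for _, text, _ in ranked[:k]]
-
-
-store = ToyVectorStore()
-store.add([1.0, 0.0], "cats purr", {"source": "pets"})
-store.add([0.9, 0.1], "dogs bark", {"source": "pets"})
-store.add([0.0, 1.0], "stocks fell", {"source": "news"})
-store.add([0.95, 0.05], "lions roar", {"source": "news"})
-
-unfiltered = store.search([1.0, 0.0], k=1)
-filtered = store.search([1.0, 0.0], k=1, where={"source": "news"})
-```
-
-The same query returns a different top result once the metadata filter restricts the candidates, which is the extra control over returned documents described above.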
-
-Vector stores can be converted to the retriever interface by doing:
-
-```python
-vectorstore = MyVectorStore()
-retriever = vectorstore.as_retriever()
-```
-
-For specifics on how to use vector stores, see the [relevant how-to guides here](/docs/how_to/#vector-stores).
-
-### Retrievers
-
-
-A retriever is an interface that returns documents given an unstructured query.
-It is more general than a vector store.
-A retriever does not need to be able to store documents, only to return (or retrieve) them.
-Retrievers can be created from vector stores, but are also broad enough to include [Wikipedia search](/docs/integrations/retrievers/wikipedia/) and [Amazon Kendra](/docs/integrations/retrievers/amazon_kendra_retriever/).
-
-Retrievers accept a string query as input and return a list of `Document`s as output.
-
-For specifics on how to use retrievers, see the [relevant how-to guides here](/docs/how_to/#retrievers).
-
-### Key-value stores
-
-For some techniques, such as [indexing and retrieval with multiple vectors per document](/docs/how_to/multi_vector/) or
-[caching embeddings](/docs/how_to/caching_embeddings/), having a form of key-value (KV) storage is helpful.
-
-LangChain includes a [`BaseStore`](https://python.langchain.com/api_reference/core/stores/langchain_core.stores.BaseStore.html) interface,
-which allows for storage of arbitrary data. However, LangChain components that require KV-storage accept a
-more specific `BaseStore[str, bytes]` instance that stores binary data (referred to as a `ByteStore`), and internally take care of
-encoding and decoding data for their specific needs.
-
-This means that as a user, you only need to think about one type of store rather than different ones for different types of data.
-
-#### Interface
-
-All [`BaseStores`](https://python.langchain.com/api_reference/core/stores/langchain_core.stores.BaseStore.html) support the following interface. Note that the interface allows
-for modifying **multiple** key-value pairs at once:
-
-- `mget(keys: Sequence[str]) -> List[Optional[bytes]]`: get the contents of multiple keys, returning `None` for any key that does not exist
-- `mset(key_value_pairs: Sequence[Tuple[str, bytes]]) -> None`: set the contents of multiple keys
-- `mdelete(keys: Sequence[str]) -> None`: delete multiple keys
-- `yield_keys(prefix: Optional[str] = None) -> Iterator[str]`: yield all keys in the store, optionally filtering by a prefix
-
-For key-value store implementations, see [this section](/docs/integrations/stores/).
-
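The batch-oriented interface above can be sketched with a plain dictionary. This is an illustrative toy, not LangChain's implementation: the `ToyByteStore` class name is hypothetical, and it only mirrors the four methods listed above.

```python
from typing import Dict, Iterator, List, Optional, Sequence, Tuple


class ToyByteStore:
    """Hypothetical in-memory store mirroring the four methods described above."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def mget(self, keys: Sequence[str]) -> List[Optional[bytes]]:
        # Missing keys come back as None, in the same order as the request.
        return [self._data.get(key) for key in keys]

    def mset(self, key_value_pairs: Sequence[Tuple[str, bytes]]) -> None:
        for key, value in key_value_pairs:
            self._data[key] = value

    def mdelete(self, keys: Sequence[str]) -> None:
        for key in keys:
            self._data.pop(key, None)  # deleting a missing key is a no-op here

    def yield_keys(self, prefix: Optional[str] = None) -> Iterator[str]:
        for key in list(self._data):
            if prefix is None or key.startswith(prefix):
                yield key


store = ToyByteStore()
store.mset([("doc:1", b"hello"), ("doc:2", b"world"), ("cache:1", b"x")])
print(store.mget(["doc:1", "missing"]))
print(sorted(store.yield_keys(prefix="doc:")))
```

Callers that need to store richer objects would encode them to `bytes` before `mset` and decode after `mget`, which is the responsibility the text above assigns to the components that use the store.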
-### Tools
-
-
-Tools are utilities designed to be called by a model: their inputs are designed to be generated by models, and their outputs are designed to be passed back to models.
-Tools are needed whenever you want a model to control parts of your code or call out to external APIs.
-
-A tool consists of:
-
-1. The `name` of the tool.
-2. A `description` of what the tool does.
-3. A `JSON schema` defining the inputs to the tool.
-4. A `function` (and, optionally, an async variant of the function).
-
-When a tool is bound to a model, the name, description and JSON schema are provided as context to the model.
-Given a list of tools and a set of instructions, a model can request to call one or more tools with specific inputs.
-Typical usage may look like the following:
-
-```python
-tools = [...] # Define a list of tools
-llm_with_tools = llm.bind_tools(tools)
-ai_msg = llm_with_tools.invoke("do xyz...")
-# -> AIMessage(tool_calls=[ToolCall(...), ...], ...)
-```
-
-The `AIMessage` returned from the model MAY have `tool_calls` associated with it.
-Read [this guide](/docs/concepts/#aimessage) for more information on what the response type may look like.
-
-Once the chosen tools are invoked, the results can be passed back to the model so that it can complete whatever task
-it's performing.
-There are generally two different ways to invoke the tool and pass back the response:
-
-#### Invoke with just the arguments
-
-When you invoke a tool with just the arguments, you will get back the raw tool output (usually a string).
-This generally looks like:
-
-```python
-from langchain_core.messages import ToolMessage
-
-# Check beforehand that the model actually returned tool calls
-tool_call = ai_msg.tool_calls[0]
-# ToolCall(args={...}, id=..., ...)
-tool_output = tool.invoke(tool_call["args"])
-tool_message = ToolMessage(
-    content=tool_output,
-    tool_call_id=tool_call["id"],
-    name=tool_call["name"]
-)
-```
-
-Note that the `content` field will generally be passed back to the model.
-If you do not want the raw tool response to be passed to the model, but you still want to keep it around,
-you can transform the tool output but also pass it as an artifact (read more about [`ToolMessage.artifact` here](/docs/concepts/#toolmessage)):
-
-```python
-... # Same code as above
-# `transform` stands for your own post-processing function
-response_for_llm = transform(tool_output)
-tool_message = ToolMessage(
-    content=response_for_llm,
-    tool_call_id=tool_call["id"],
-    name=tool_call["name"],
-    artifact=tool_output
-)
-```
-
-#### Invoke with `ToolCall`
-
-The other way to invoke a tool is to call it with the full `ToolCall` that was generated by the model.
-When you do this, the tool will return a `ToolMessage`.
-The benefit is that you don't have to write the logic to transform the tool output into a `ToolMessage` yourself.
-This generally looks like:
-
-```python
-tool_call = ai_msg.tool_calls[0]
-# -> ToolCall(args={...}, id=..., ...)
-tool_message = tool.invoke(tool_call)
-# -> ToolMessage(
-#        content="tool result foobar...",
-#        tool_call_id=...,
-#        name="tool_name"
-#    )
-```
-
-If you are invoking the tool this way and want to include an [artifact](/docs/concepts/#toolmessage) for the ToolMessage, you will need to have the tool return two things.
-Read more about [defining tools that return artifacts here](/docs/how_to/tool_artifacts/).
-
-#### Best practices
-
-When designing tools to be used by a model, it is important to keep in mind that:
-
-- Chat models that have explicit [tool-calling APIs](/docs/concepts/#functiontool-calling) will be better at tool calling than non-fine-tuned models.
-- Models will perform better if the tools have well-chosen names, descriptions, and JSON schemas. This is another form of prompt engineering.
-- Simple, narrowly scoped tools are easier for models to use than complex tools.
-
-#### Related
-
-For specifics on how to use tools, see the [tools how-to guides](/docs/how_to/#tools).
-
-To use a pre-built tool, see the [tool integration docs](/docs/integrations/tools/).
-
-### Toolkits
-
-
-Toolkits are collections of tools that are designed to be used together for specific tasks. They have convenient loading methods.
-
-All Toolkits expose a `get_tools` method which returns a list of tools.
-You can therefore do:
-
-```python
-# Initialize a toolkit
-toolkit = ExampleToolkit(...)
-
-# Get list of tools
-tools = toolkit.get_tools()
-```
-
-### Agents
-
-By themselves, language models can't take actions - they just output text.
-A big use case for LangChain is creating **agents**.
-Agents are systems that use an LLM as a reasoning engine to determine which actions to take and what the inputs to those actions should be.
-The results of those actions can then be fed back into the agent, which determines whether more actions are needed or whether it is okay to finish.
-
-[LangGraph](https://github.com/langchain-ai/langgraph) is an extension of LangChain specifically aimed at creating highly controllable and customizable agents.
-Please check out that documentation for a more in depth overview of agent concepts.
-
-There is a legacy `agent` concept in LangChain that we are moving towards deprecating: `AgentExecutor`.
-`AgentExecutor` was essentially a runtime for agents.
-It was a great place to get started, but it was not flexible enough once you needed more customized agents.
-To solve that, we built LangGraph to be this flexible, highly controllable runtime.
-
-If you are still using AgentExecutor, do not fear: we still have a guide on [how to use AgentExecutor](/docs/how_to/agent_executor).
-It is recommended, however, that you start to transition to LangGraph.
-In order to assist in this, we have put together a [transition guide on how to do so](/docs/how_to/migrate_agent).
-
-#### ReAct agents
-
-
-One popular architecture for building agents is [**ReAct**](https://arxiv.org/abs/2210.03629).
-ReAct combines reasoning and acting in an iterative process - in fact the name "ReAct" stands for "Reason" and "Act".
-
-The general flow looks like this:
-
-- The model will "think" about what step to take in response to an input and any previous observations.
-- The model will then choose an action from available tools (or choose to respond to the user).
-- The model will generate arguments to that tool.
-- The agent runtime (executor) will parse out the chosen tool and call it with the generated arguments.
-- The executor will return the results of the tool call back to the model as an observation.
-- This process repeats until the agent chooses to respond.
-
-There are general prompting based implementations that do not require any model-specific features, but the most
-reliable implementations use features like [tool calling](/docs/how_to/tool_calling/) to reliably format outputs
-and reduce variance.
-
-Please see the [LangGraph documentation](https://langchain-ai.github.io/langgraph/) for more information,
-or [this how-to guide](/docs/how_to/migrate_agent/) for specific information on migrating to LangGraph.
-
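The loop described in the bullet list above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not LangGraph or `AgentExecutor` code: `scripted_model` is a hypothetical stand-in for an LLM that requests one tool call and then answers, and the dictionary shapes it returns are invented for illustration.

```python
def multiply(a, b):
    """Example tool: multiply two numbers."""
    return a * b


TOOLS = {"multiply": multiply}


def scripted_model(observations):
    """Stand-in for an LLM (assumption): one tool request, then a final answer."""
    if not observations:
        return {"tool": "multiply", "args": {"a": 6, "b": 7}}
    return {"answer": f"The result is {observations[-1]}"}


def run_agent(model, tools, max_steps=5):
    """Executor loop: ask the model, run the chosen tool, feed back the observation."""
    observations = []
    for _ in range(max_steps):
        decision = model(observations)
        if "answer" in decision:  # the model chose to respond to the user
            return decision["answer"]
        tool = tools[decision["tool"]]  # parse out the chosen tool
        observations.append(tool(**decision["args"]))  # call it, record the result
    raise RuntimeError("agent did not finish within max_steps")


print(run_agent(scripted_model, TOOLS))
```

The `max_steps` guard is a design choice of this sketch: it keeps a model that never decides to answer from looping forever.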
-### Callbacks
-
-LangChain provides a callbacks system that allows you to hook into the various stages of your LLM application. This is useful for logging, monitoring, streaming, and other tasks.
-
-You can subscribe to these events by using the `callbacks` argument available throughout the API. This argument is a list of handler objects, which are expected to implement one or more of the methods described below in more detail.
-
-#### Callback Events
-
-| Event | Event Trigger | Associated Method |
-|------------------|---------------------------------------------|-----------------------|
-| Chat model start | When a chat model starts | `on_chat_model_start` |
-| LLM start | When an LLM starts | `on_llm_start` |
-| LLM new token | When an LLM OR chat model emits a new token | `on_llm_new_token` |
-| LLM ends | When an LLM OR chat model ends | `on_llm_end` |
-| LLM errors | When an LLM OR chat model errors | `on_llm_error` |
-| Chain start | When a chain starts running | `on_chain_start` |
-| Chain end | When a chain ends | `on_chain_end` |
-| Chain error | When a chain errors | `on_chain_error` |
-| Tool start | When a tool starts running | `on_tool_start` |
-| Tool end | When a tool ends | `on_tool_end` |
-| Tool error | When a tool errors | `on_tool_error` |
-| Agent action | When an agent takes an action | `on_agent_action` |
-| Agent finish | When an agent ends | `on_agent_finish` |
-| Retriever start | When a retriever starts | `on_retriever_start` |
-| Retriever end | When a retriever ends | `on_retriever_end` |
-| Retriever error | When a retriever errors | `on_retriever_error` |
-| Text | When arbitrary text is run | `on_text` |
-| Retry | When a retry event is run | `on_retry` |
-
-#### Callback handlers
-
-Callback handlers can either be `sync` or `async`:
-
-* Sync callback handlers implement the [BaseCallbackHandler](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.base.BaseCallbackHandler.html) interface.
-* Async callback handlers implement the [AsyncCallbackHandler](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.base.AsyncCallbackHandler.html) interface.
-
-During run-time LangChain configures an appropriate callback manager (e.g., [CallbackManager](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.manager.CallbackManager.html) or [AsyncCallbackManager](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.manager.AsyncCallbackManager.html)) which will be responsible for calling the appropriate method on each "registered" callback handler when the event is triggered.
-
-#### Passing callbacks
-
-The `callbacks` property is available on most objects throughout the API (Models, Tools, Agents, etc.) in two different places:
-
-- **Request time callbacks**: Passed at the time of the request in addition to the input data.
- Available on all standard `Runnable` objects. These callbacks are INHERITED by all children
- of the object they are defined on. For example, `chain.invoke({"number": 25}, {"callbacks": [handler]})`.
-- **Constructor callbacks**: `chain = TheNameOfSomeChain(callbacks=[handler])`. These callbacks
- are passed as arguments to the constructor of the object. The callbacks are scoped
- only to the object they are defined on, and are **not** inherited by any children of the object.
-
-:::warning
-Constructor callbacks are scoped only to the object they are defined on. They are **not** inherited by children
-of the object.
-:::
-
-If you're creating a custom chain or runnable, you need to remember to propagate request time
-callbacks to any child objects.
-
-:::important Async in Python<=3.10
-
-Any `RunnableLambda`, `RunnableGenerator`, or `Tool` that invokes other runnables
-and is running `async` in Python<=3.10 will have to propagate callbacks to child
-objects manually. This is because LangChain cannot automatically propagate
-callbacks to child objects in this case.
-
-This is a common reason why you may fail to see events being emitted from custom
-runnables or tools.
-:::
-
-For specifics on how to use callbacks, see the [relevant how-to guides here](/docs/how_to/#callbacks).
-
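As a rough mental model of the handler mechanism described in this section, the following pure-Python sketch shows a component emitting events to a list of handler objects. It does not use LangChain: `CollectingHandler` and `fake_llm` are hypothetical, and the real `BaseCallbackHandler` methods take additional arguments (such as run identifiers and keyword arguments) that are omitted here.

```python
class CollectingHandler:
    """Hypothetical handler implementing two of the event methods from the table above."""

    def __init__(self):
        self.tokens = []
        self.finished = False

    def on_llm_new_token(self, token):
        self.tokens.append(token)

    def on_llm_end(self):
        self.finished = True


def fake_llm(prompt, callbacks=()):
    """Stand-in for a model call (assumption): emits fixed tokens and notifies handlers."""
    tokens = ["The", " sky", " is", " blue"]
    for token in tokens:
        for handler in callbacks:
            handler.on_llm_new_token(token)
    for handler in callbacks:
        handler.on_llm_end()
    return "".join(tokens)


handler = CollectingHandler()
result = fake_llm("what color is the sky?", callbacks=[handler])
print(handler.tokens, handler.finished, result)
```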
-## Techniques
-
-### Streaming
-
-
-Individual LLM calls often run for much longer than traditional resource requests.
-This compounds when you build more complex chains or agents that require multiple reasoning steps.
-
-Fortunately, LLMs generate output iteratively, which means it's possible to show sensible intermediate results
-before the final response is ready. Consuming output as soon as it becomes available has therefore become a vital part of the UX
-around building apps with LLMs to help alleviate latency issues, and LangChain aims to have first-class support for streaming.
-
-Below, we'll discuss some concepts and considerations around streaming in LangChain.
-
-#### `.stream()` and `.astream()`
-
-Most modules in LangChain include the `.stream()` method (and the equivalent `.astream()` method for [async](https://docs.python.org/3/library/asyncio.html) environments) as an ergonomic streaming interface.
-`.stream()` returns an iterator, which you can consume with a simple `for` loop. Here's an example with a chat model:
-
-```python
-from langchain_anthropic import ChatAnthropic
-
-model = ChatAnthropic(model="claude-3-sonnet-20240229")
-
-for chunk in model.stream("what color is the sky?"):
-    print(chunk.content, end="|", flush=True)
-```
-
-For models (or other components) that don't support streaming natively, this iterator would just yield a single chunk, but
-you could still use the same general pattern when calling them. Using `.stream()` will also automatically call the model in streaming mode
-without the need to provide additional config.
-
-The type of each outputted chunk depends on the type of component - for example, chat models yield [`AIMessageChunks`](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.ai.AIMessageChunk.html).
-Because this method is part of [LangChain Expression Language](/docs/concepts/#langchain-expression-language-lcel),
-you can handle formatting differences from different outputs using an [output parser](/docs/concepts/#output-parsers) to transform
-each yielded chunk.
-
-You can check out [this guide](/docs/how_to/streaming/#using-stream) for more detail on how to use `.stream()`.
-
-#### `.astream_events()`
-
-
-While the `.stream()` method is intuitive, it can only return the final generated value of your chain. This is fine for single LLM calls,
-but as you build more complex chains of several LLM calls together, you may want to use the intermediate values of
-the chain alongside the final output - for example, returning sources alongside the final generation when building a chat
-over documents app.
-
-There are ways to do this [using callbacks](/docs/concepts/#callbacks-1), or by constructing your chain in such a way that it passes intermediate
-values to the end with something like chained [`.assign()`](/docs/how_to/passthrough/) calls, but LangChain also includes an
-`.astream_events()` method that combines the flexibility of callbacks with the ergonomics of `.stream()`. When called, it returns an iterator
-which yields [various types of events](/docs/how_to/streaming/#event-reference) that you can filter and process according
-to the needs of your project.
-
-Here's one small example that prints just events containing streamed chat model output:
-
-```python
-from langchain_core.output_parsers import StrOutputParser
-from langchain_core.prompts import ChatPromptTemplate
-from langchain_anthropic import ChatAnthropic
-
-model = ChatAnthropic(model="claude-3-sonnet-20240229")
-
-prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")
-parser = StrOutputParser()
-chain = prompt | model | parser
-
-async for event in chain.astream_events({"topic": "parrot"}, version="v2"):
-    kind = event["event"]
-    if kind == "on_chat_model_stream":
-        print(event, end="|", flush=True)
-```
-
-You can roughly think of it as an iterator over callback events (though the format differs) - and you can use it on almost all LangChain components!
-
-See [this guide](/docs/how_to/streaming/#using-stream-events) for more detailed information on how to use `.astream_events()`,
-including a table listing available events.
-
-#### Callbacks
-
-The lowest level way to stream outputs from LLMs in LangChain is via the [callbacks](/docs/concepts/#callbacks) system. You can pass a
-callback handler that handles the [`on_llm_new_token`](https://python.langchain.com/api_reference/langchain/callbacks/langchain.callbacks.streaming_aiter.AsyncIteratorCallbackHandler.html#langchain.callbacks.streaming_aiter.AsyncIteratorCallbackHandler.on_llm_new_token) event into LangChain components. When that component is invoked, any
-[LLM](/docs/concepts/#llms) or [chat model](/docs/concepts/#chat-models) contained in the component calls
-the callback with the generated token. Within the callback, you could pipe the tokens into some other destination, e.g. an HTTP response.
-You can also handle the [`on_llm_end`](https://python.langchain.com/api_reference/langchain/callbacks/langchain.callbacks.streaming_aiter.AsyncIteratorCallbackHandler.html#langchain.callbacks.streaming_aiter.AsyncIteratorCallbackHandler.on_llm_end) event to perform any necessary cleanup.
-
-You can see [this how-to section](/docs/how_to/#callbacks) for more specifics on using callbacks.
-
-Callbacks were the first technique for streaming introduced in LangChain. While powerful and generalizable,
-they can be unwieldy for developers. For example:
-
-- You need to explicitly initialize and manage some aggregator or other stream to collect results.
-- The execution order isn't explicitly guaranteed, and you could theoretically have a callback run after the `.invoke()` method finishes.
-- Providers would often make you pass an additional parameter to stream outputs instead of returning them all at once.
-- You would often ignore the result of the actual model call in favor of callback results.
-
-#### Tokens
-
-Most model providers measure input and output in units called **tokens**.
-Tokens are the basic units that language models read and generate when processing or producing text.
-The exact definition of a token can vary depending on the specific way the model was trained -
-for instance, in English, a token could be a single word like "apple", or a part of a word like "app".
-
-When you send a model a prompt, the words and characters in the prompt are encoded into tokens using a **tokenizer**.
-The model then streams back generated output tokens, which the tokenizer decodes into human-readable text.
-The below example shows how OpenAI models tokenize `LangChain is cool!`:
-
-![](/img/tokenization.png)
-
-You can see that it gets split into 5 different tokens, and that the boundaries between tokens are not exactly the same as word boundaries.
-
-The reason language models use tokens rather than something more immediately intuitive like "characters"
-has to do with how they process and understand text. At a high-level, language models iteratively predict their next generated output based on
-the initial input and their previous generations. Training the model on tokens allows language models to handle linguistic
-units (like words or subwords) that carry meaning, rather than individual characters, which makes it easier for the model
-to learn and understand the structure of the language, including grammar and context.
-Furthermore, using tokens can also improve efficiency, since the model processes fewer units of text compared to character-level processing.
-
-### Function/tool calling
-
-:::info
-We use the term `tool calling` interchangeably with `function calling`. Although
-function calling is sometimes meant to refer to invocations of a single function,
-we treat all models as though they can return multiple tool or function calls in
-each message.
-:::
-
-Tool calling allows a [chat model](/docs/concepts/#chat-models) to respond to a given prompt by generating output that
-matches a user-defined schema.
-
-While the name implies that the model is performing
-some action, this is actually not the case! The model only generates the arguments to a tool, and actually running the tool (or not) is up to the user.
-One common example where you **wouldn't** want to call a function with the generated arguments
-is if you want to [extract structured output matching some schema](/docs/concepts/#structured-output)
-from unstructured text. You would give the model an "extraction" tool that takes
-parameters matching the desired schema, then treat the generated output as your final
-result.
-
-![Diagram of a tool call by a chat model](/img/tool_call.png)
-
-Tool calling is not universal, but is supported by many popular LLM providers, including [Anthropic](/docs/integrations/chat/anthropic/),
-[Cohere](/docs/integrations/chat/cohere/), [Google](/docs/integrations/chat/google_vertex_ai_palm/),
-[Mistral](/docs/integrations/chat/mistralai/), [OpenAI](/docs/integrations/chat/openai/), and even for locally-running models via [Ollama](/docs/integrations/chat/ollama/).
-
-LangChain provides a standardized interface for tool calling that is consistent across different models.
-
-The standard interface consists of:
-
-* `ChatModel.bind_tools()`: a method for specifying which tools are available for a model to call. This method accepts [LangChain tools](/docs/concepts/#tools) as well as [Pydantic](https://pydantic.dev/) objects.
-* `AIMessage.tool_calls`: an attribute on the `AIMessage` returned from the model for accessing the tool calls requested by the model.
-
-#### Tool usage
-
-After the model requests tool calls, you can invoke the tools with the generated arguments and pass the results back to the model.
-LangChain provides the [`Tool`](/docs/concepts/#tools) abstraction to help you handle this.
-
-The general flow is this:
-
-1. Generate tool calls with a chat model in response to a query.
-2. Invoke the appropriate tools using the generated tool call as arguments.
-3. Format the result of the tool invocations as [`ToolMessages`](/docs/concepts/#toolmessage).
-4. Pass the entire list of messages back to the model so that it can generate a final answer (or call more tools).
-
-![Diagram of a complete tool calling flow](/img/tool_calling_flow.png)
-
-This is how tool calling [agents](/docs/concepts/#agents) perform tasks and answer queries.
-
-Check out some more focused guides below:
-
-- [How to use chat models to call tools](/docs/how_to/tool_calling/)
-- [How to pass tool outputs to chat models](/docs/how_to/tool_results_pass_to_model/)
-- [Building an agent with LangGraph](https://langchain-ai.github.io/langgraph/tutorials/introduction/)
-
-### Structured output
-
-LLMs are capable of generating arbitrary text. This enables the model to respond appropriately to a wide
-range of inputs, but for some use-cases, it can be useful to constrain the LLM's output
-to a specific format or structure. This is referred to as **structured output**.
-
-For example, if the output is to be stored in a relational database,
-it is much easier if the model generates output that adheres to a defined schema or format.
-[Extracting specific information](/docs/tutorials/extraction/) from unstructured text is another
-case where this is particularly useful. Most commonly, the output format will be JSON,
-though other formats such as [YAML](/docs/how_to/output_parser_yaml/) can be useful too. Below, we'll discuss
-a few ways to get structured output from models in LangChain.
-
-#### `.with_structured_output()`
-
-For convenience, some LangChain chat models support a [`.with_structured_output()`](/docs/how_to/structured_output/#the-with_structured_output-method)
-method. This method only requires a schema as input, and returns a dict or Pydantic object.
-Generally, this method is only present on models that support one of the more advanced methods described below,
-and will use one of them under the hood. It takes care of importing a suitable output parser and
-formatting the schema in the right format for the model.
-
-Here's an example:
-
-```python
-from typing import Optional
-
-from pydantic import BaseModel, Field
-
-
-class Joke(BaseModel):
-    """Joke to tell user."""
-
-    setup: str = Field(description="The setup of the joke")
-    punchline: str = Field(description="The punchline to the joke")
-    rating: Optional[int] = Field(description="How funny the joke is, from 1 to 10")
-
-structured_llm = llm.with_structured_output(Joke)
-
-structured_llm.invoke("Tell me a joke about cats")
-```
-
-```
-Joke(setup='Why was the cat sitting on the computer?', punchline='To keep an eye on the mouse!', rating=None)
-```
-
-We recommend this method as a starting point when working with structured output:
-
-- It uses other model-specific features under the hood, without the need to import an output parser.
-- For the models that use tool calling, no special prompting is needed.
-- If multiple underlying techniques are supported, you can supply a `method` parameter to
-[toggle which one is used](/docs/how_to/structured_output/#advanced-specifying-the-method-for-structuring-outputs).
-
-You may want or need to use other techniques if:
-
-- The chat model you are using does not support tool calling.
-- You are working with very complex schemas and the model is having trouble generating outputs that conform.
-
-For more information, check out this [how-to guide](/docs/how_to/structured_output/#the-with_structured_output-method).
-
-You can also check out [this table](/docs/integrations/chat/#advanced-features) for a list of models that support
-`with_structured_output()`.
-
-#### Raw prompting
-
-The most intuitive way to get a model to structure output is to ask nicely.
-In addition to your query, you can give instructions describing what kind of output you'd like, then
-parse the output using an [output parser](/docs/concepts/#output-parsers) to convert the raw
-model message or string output into something more easily manipulated.
-
-The biggest benefit to raw prompting is its flexibility:
-
-- Raw prompting does not require any special model features, only sufficient reasoning capability to understand
-the passed schema.
-- You can prompt for any format you'd like, not just JSON. This can be useful if the model you
-are using is more heavily trained on a certain type of data, such as XML or YAML.
-
-However, there are some drawbacks too:
-
-- LLMs are non-deterministic, and prompting an LLM to consistently output data in exactly the correct format
-for smooth parsing can be surprisingly difficult and model-specific.
-- Individual models have quirks depending on the data they were trained on, and optimizing prompts can be quite difficult.
-Some may be better at interpreting [JSON schema](https://json-schema.org/), others may be best with TypeScript definitions,
-and still others may prefer XML.
-
-While features offered by model providers may increase reliability, prompting techniques remain important for tuning your
-results no matter which method you choose.
-
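To illustrate the parsing half of raw prompting, here is a minimal sketch that pulls a JSON object out of free-form model text using only the standard library. It is an assumption-laden toy, not one of LangChain's output parsers: the `parse_json_reply` helper is hypothetical, and it only handles a single top-level object, optionally wrapped in a Markdown code fence.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep this example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(text):
    """Extract and parse the first JSON object found in a model reply."""
    fenced = FENCED_BLOCK.search(text)
    candidate = fenced.group(1) if fenced else text
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed JSON.
    return json.loads(candidate[start : end + 1])


reply = "Sure! Here you go:\n" + FENCE + 'json\n{"answer": "mitochondrion"}\n' + FENCE
print(parse_json_reply(reply))
```

Even with a parser like this, the drawbacks listed above still apply: a model can emit text that looks like JSON but is not valid, so callers should be prepared to catch `ValueError` and retry or fall back.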
-#### JSON mode
-
-
-Some models, such as [Mistral](/docs/integrations/chat/mistralai/), [OpenAI](/docs/integrations/chat/openai/),
-[Together AI](/docs/integrations/chat/together/) and [Ollama](/docs/integrations/chat/ollama/),
-support a feature called **JSON mode**, usually enabled via config.
-
-When enabled, JSON mode will constrain the model's output to always be some sort of valid JSON.
-These models often require some custom prompting, but it's usually much less burdensome than completely raw prompting and
-more along the lines of, `"you must always return JSON"`. The [output is also generally easier to parse](/docs/how_to/output_parser_json/).
-
-It's also generally simpler to use directly and more commonly available than tool calling, and can give
-more flexibility around prompting and shaping results than tool calling.
-
-Here's an example:
-
-```python
-from langchain_core.prompts import ChatPromptTemplate
-from langchain_openai import ChatOpenAI
-from langchain.output_parsers.json import SimpleJsonOutputParser
-
-model = ChatOpenAI(
-    model="gpt-4o",
-    model_kwargs={ "response_format": { "type": "json_object" } },
-)
-
-# Adjacent string literals are concatenated, so each part ends with a space
-prompt = ChatPromptTemplate.from_template(
-    "Answer the user's question to the best of your ability. "
-    'You must always output a JSON object with an "answer" key and a "followup_question" key. '
-    "{question}"
-)
-
-chain = prompt | model | SimpleJsonOutputParser()
-
-chain.invoke({ "question": "What is the powerhouse of the cell?" })
-```
-
-```
-{'answer': 'The powerhouse of the cell is the mitochondrion. It is responsible for producing energy in the form of ATP through cellular respiration.',
- 'followup_question': 'Would you like to know more about how mitochondria produce energy?'}
-```
-
-For a full list of model providers that support JSON mode, see [this table](/docs/integrations/chat/#advanced-features).
-
-#### Tool calling {#structured-output-tool-calling}
-
-For models that support it, [tool calling](/docs/concepts/#functiontool-calling) can be very convenient for structured output. It removes the
-guesswork around how best to prompt schemas in favor of a built-in model feature.
-
-It works by first binding the desired schema either directly or via a [LangChain tool](/docs/concepts/#tools) to a
-[chat model](/docs/concepts/#chat-models) using the `.bind_tools()` method. The model will then generate an `AIMessage` containing
-a `tool_calls` field containing `args` that match the desired shape.
-
-There are several acceptable formats you can use to bind tools to a model in LangChain. Here's one example:
-
-```python
-from pydantic import BaseModel, Field
-from langchain_openai import ChatOpenAI
-
-class ResponseFormatter(BaseModel):
-    """Always use this tool to structure your response to the user."""
-
-    answer: str = Field(description="The answer to the user's question")
-    followup_question: str = Field(description="A followup question the user could ask")
-
-model = ChatOpenAI(
-    model="gpt-4o",
-    temperature=0,
-)
-
-model_with_tools = model.bind_tools([ResponseFormatter])
-
-ai_msg = model_with_tools.invoke("What is the powerhouse of the cell?")
-
-ai_msg.tool_calls[0]["args"]
-```
-
-```
-{'answer': "The powerhouse of the cell is the mitochondrion. It generates most of the cell's supply of adenosine triphosphate (ATP), which is used as a source of chemical energy.",
- 'followup_question': 'How do mitochondria generate ATP?'}
-```
-
-Tool calling is a generally consistent way to get a model to generate structured output, and is the default technique
-used for the [`.with_structured_output()`](/docs/concepts/#with_structured_output) method when a model supports it.
-
-The following how-to guides are good practical resources for using function/tool calling for structured output:
-
-- [How to return structured data from an LLM](/docs/how_to/structured_output/)
-- [How to use a model to call tools](/docs/how_to/tool_calling)
-
-For a full list of model providers that support tool calling, [see this table](/docs/integrations/chat/#advanced-features).
-
-### Few-shot prompting
-
-One of the most effective ways to improve model performance is to give a model examples of
-what you want it to do. The technique of adding example inputs and expected outputs
-to a model prompt is known as "few-shot prompting". The technique is based on the
-[Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) paper.
-There are a few things to think about when doing few-shot prompting:
-
-1. How are examples generated?
-2. How many examples are in each prompt?
-3. How are examples selected at runtime?
-4. How are examples formatted in the prompt?
-
-Here are the considerations for each.
-
-#### 1. Generating examples
-
-The first and most important step of few-shot prompting is coming up with a good dataset of examples. Good examples should be relevant at runtime, clear, informative, and provide information that was not already known to the model.
-
-At a high level, the basic ways to generate examples are:
-- Manual: a person/people generates examples they think are useful.
-- Better model: a better (presumably more expensive/slower) model's responses are used as examples for a worse (presumably cheaper/faster) model.
-- User feedback: users (or labelers) leave feedback on interactions with the application and examples are generated based on that feedback (for example, all interactions with positive feedback could be turned into examples).
-- LLM feedback: same as user feedback but the process is automated by having models evaluate themselves.
-
-Which approach is best depends on your task. For tasks where a small number of core principles need to be understood really well, it can be valuable to hand-craft a few really good examples.
-For tasks where the space of correct behaviors is broader and more nuanced, it can be useful to generate many examples in a more automated fashion so that there's a higher likelihood of there being some highly relevant examples for any runtime input.
-
-**Single-turn vs. multi-turn examples**
-
-Another dimension to think about when generating examples is what the example is actually showing.
-
-The simplest types of examples just have a user input and an expected model output. These are single-turn examples.
-
-A more complex type of example is an entire conversation, usually one in which a model initially responds incorrectly and a user then tells the model how to correct its answer.
-This is called a multi-turn example. Multi-turn examples can be useful for more nuanced tasks where it's useful to show common errors and spell out exactly why they're wrong and what should be done instead.
-
-#### 2. Number of examples
-
-Once we have a dataset of examples, we need to think about how many examples should be in each prompt.
-The key tradeoff is that more examples generally improve performance, but larger prompts increase costs and latency.
-And beyond some threshold having too many examples can start to confuse the model.
-Finding the right number of examples is highly dependent on the model, the task, the quality of the examples, and your cost and latency constraints.
-Anecdotally, the better the model is the fewer examples it needs to perform well and the more quickly you hit steeply diminishing returns on adding more examples.
-But, the best/only way to reliably answer this question is to run some experiments with different numbers of examples.
-
-#### 3. Selecting examples
-
-Assuming we are not adding our entire example dataset into each prompt, we need to have a way of selecting examples from our dataset based on a given input. We can do this:
-- Randomly
-- By (semantic or keyword-based) similarity of the inputs
-- Based on some other constraints, like token size
-
-LangChain has a number of [`ExampleSelectors`](/docs/concepts/#example-selectors) which make it easy to use any of these techniques.
-
-Generally, selecting by semantic similarity leads to the best model performance. How much this matters is again model- and task-specific, and is worth experimenting with.
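-
-As a rough illustration of similarity-based selection, the sketch below ranks a small example pool against an incoming query. It is a minimal sketch, not LangChain's `ExampleSelector` implementation: the `EXAMPLES` pool, the `input`/`output` field names, and the helper functions are all hypothetical, and word-count cosine similarity stands in for a real embedding model.
-
-```python
-import math
-from collections import Counter
-
-# Hypothetical example pool; the "input"/"output" field names are assumptions.
-EXAMPLES = [
-    {"input": "translate hello to french", "output": "bonjour"},
-    {"input": "what is 2 plus 2", "output": "4"},
-    {"input": "translate goodbye to french", "output": "au revoir"},
-]
-
-def bag_of_words(text):
-    """Turn text into a word-count vector (a crude stand-in for an embedding)."""
-    return Counter(text.lower().split())
-
-def cosine(a, b):
-    """Cosine similarity between two sparse word-count vectors."""
-    dot = sum(a[w] * b[w] for w in a if w in b)
-    norm_a = math.sqrt(sum(v * v for v in a.values()))
-    norm_b = math.sqrt(sum(v * v for v in b.values()))
-    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
-
-def select_examples(query, examples, k=2):
-    """Return the k examples whose inputs are most similar to the query."""
-    q = bag_of_words(query)
-    ranked = sorted(
-        examples,
-        key=lambda ex: cosine(q, bag_of_words(ex["input"])),
-        reverse=True,
-    )
-    return ranked[:k]
-
-selected = select_examples("translate thanks to french", EXAMPLES)
-```
-
-Here the two translation examples are selected and the arithmetic example is left out, since it shares no words with the query. Swapping `bag_of_words` for a real embedding function would turn this into semantic rather than keyword-based selection.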
-
-#### 4. Formatting examples
-
-Most state-of-the-art models these days are chat models, so we'll focus on formatting examples for those. Our basic options are to insert the examples:
-- In the system prompt as a string
-- As their own messages
-
-If we insert our examples into the system prompt as a string, we'll need to make sure it's clear to the model where each example begins and which parts are the input versus output. Different models respond better to different syntaxes, like [ChatML](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/chat-markup-language), XML, TypeScript, etc.
-
-If we insert our examples as messages, where each example is represented as a sequence of Human, AI messages, we might want to also assign [names](/docs/concepts/#messages) to our messages like `"example_user"` and `"example_assistant"` to make it clear that these messages correspond to different actors than the latest input message.
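-
-To make the message-based option concrete, here is a minimal sketch that expands examples into named message pairs. It uses plain dictionaries in a generic chat format rather than any specific LangChain message class, and the `examples` data and helper names are hypothetical.
-
-```python
-# Hypothetical examples; the dict-based message format is a generic
-# chat-style representation, not a specific LangChain API.
-examples = [
-    {"input": "2 + 2", "output": "4"},
-    {"input": "3 * 5", "output": "15"},
-]
-
-def examples_to_messages(examples):
-    """Expand each example into a named human/AI message pair."""
-    messages = []
-    for ex in examples:
-        messages.append(
-            {"role": "user", "name": "example_user", "content": ex["input"]}
-        )
-        messages.append(
-            {"role": "assistant", "name": "example_assistant", "content": ex["output"]}
-        )
-    return messages
-
-def build_prompt(system_text, examples, user_input):
-    """System message first, then the examples, then the real user input."""
-    return (
-        [{"role": "system", "content": system_text}]
-        + examples_to_messages(examples)
-        + [{"role": "user", "content": user_input}]
-    )
-
-prompt = build_prompt("You are a calculator.", examples, "7 - 4")
-```
-
-The final user message deliberately carries no `name`, so the model can tell the real input apart from the example turns.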
-
-**Formatting tool call examples**
-
-One area where formatting examples as messages can be tricky is when our example outputs have tool calls. This is because different models have different constraints on what types of message sequences are allowed when any tool calls are generated.
-- Some models require that any AIMessage with tool calls be immediately followed by ToolMessages for every tool call,
-- Some models additionally require that any ToolMessages be immediately followed by an AIMessage before the next HumanMessage,
-- Some models require that tools are passed in to the model if there are any tool calls / ToolMessages in the chat history.
-
-These requirements are model-specific and should be checked for the model you are using. If your model requires ToolMessages after tool calls and/or AIMessages after ToolMessages and your examples only include expected tool calls and not the actual tool outputs, you can try adding dummy ToolMessages / AIMessages to the end of each example with generic contents to satisfy the API constraints.
-In these cases it's especially worth experimenting with inserting your examples as strings versus messages, as having dummy messages can adversely affect certain models.
-
-You can see a case study of how Anthropic and OpenAI respond to different few-shot prompting techniques on two different tool calling benchmarks [here](https://blog.langchain.dev/few-shot-prompting-to-improve-tool-calling-performance/).
-
-### Retrieval
-
-LLMs are trained on a large but fixed dataset, limiting their ability to reason over private or recent information.
-Fine-tuning an LLM with specific facts is one way to mitigate this, but is often [poorly suited for factual recall](https://www.anyscale.com/blog/fine-tuning-is-for-form-not-facts) and [can be costly](https://www.glean.com/blog/how-to-build-an-ai-assistant-for-the-enterprise).
-`Retrieval` is the process of providing relevant information to an LLM to improve its response for a given input.
-`Retrieval augmented generation` (`RAG`, [paper](https://arxiv.org/abs/2005.11401)) is the process of grounding the LLM generation (output) using the retrieved information.
-
-:::tip
-
-* See our RAG from Scratch [code](https://github.com/langchain-ai/rag-from-scratch) and [video series](https://youtube.com/playlist?list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x&feature=shared).
-* For a high-level guide on retrieval, see this [tutorial on RAG](/docs/tutorials/rag/).
-
-:::
-
-RAG is only as good as the retrieved documents’ relevance and quality. Fortunately, an emerging set of techniques can be employed to design and improve RAG systems. We've focused on taxonomizing and summarizing many of these techniques (see the figure below) and will share some high-level strategic guidance in the following sections.
-You can and should experiment with using different pieces together. You might also find [this LangSmith guide](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application) useful for showing how to evaluate different iterations of your app.
-
-![](/img/rag_landscape.png)
-
-#### Query Translation
-
-First, consider the user input(s) to your RAG system. Ideally, a RAG system can handle a wide range of inputs, from poorly worded questions to complex multi-part queries.
-**Using an LLM to review and optionally modify the input is the central idea behind query translation.** This serves as a general buffer, optimizing raw user inputs for your retrieval system.
-For example, this can be as simple as extracting keywords or as complex as generating multiple sub-questions for a complex query.
-
-| Name | When to use | Description |
-|---------------|-------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [Multi-query](/docs/how_to/MultiQueryRetriever/) | When you need to cover multiple perspectives of a question. | Rewrite the user question from multiple perspectives, retrieve documents for each rewritten question, return the unique documents for all queries. |
-| [Decomposition](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | When a question can be broken down into smaller subproblems. | Decompose a question into a set of subproblems / questions, which can either be solved sequentially (use the answer from first + retrieval to answer the second) or in parallel (consolidate each answer into final answer). |
-| [Step-back](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | When a higher-level conceptual understanding is required. | First prompt the LLM to ask a generic step-back question about higher-level concepts or principles, and retrieve relevant facts about them. Use this grounding to help answer the user question. [Paper](https://arxiv.org/pdf/2310.06117). |
-| [HyDE](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | If you have challenges retrieving relevant documents using the raw user inputs. | Use an LLM to convert questions into hypothetical documents that answer the question. Use the embedded hypothetical documents to retrieve real documents with the premise that doc-doc similarity search can produce more relevant matches. [Paper](https://arxiv.org/abs/2212.10496). |
-
-:::tip
-
-See our RAG from Scratch videos for a few different specific approaches:
-- [Multi-query](https://youtu.be/JChPi0CRnDY?feature=shared)
-- [Decomposition](https://youtu.be/h0OPWlEOank?feature=shared)
-- [Step-back](https://youtu.be/xn1jEjRyJ2U?feature=shared)
-- [HyDE](https://youtu.be/SaDzIVkYqyY?feature=shared)
-
-:::
-
-#### Routing
-
-Second, consider the data sources available to your RAG system. You may want to query across more than one database or across structured and unstructured data sources. **Using an LLM to review the input and route it to the appropriate data source is a simple and effective approach for querying across sources.**
-
-| Name | When to use | Description |
-|------------------|--------------------------------------------|-------------|
-| [Logical routing](/docs/how_to/routing/) | When you can prompt an LLM with rules to decide where to route the input. | Logical routing can use an LLM to reason about the query and choose which datastore is most appropriate. |
-| [Semantic routing](/docs/how_to/routing/#routing-by-semantic-similarity) | When semantic similarity is an effective way to determine where to route the input. | Semantic routing embeds both the query and, typically, a set of prompts. It then chooses the appropriate prompt based upon similarity. |
-
-:::tip
-
-See our RAG from Scratch video on [routing](https://youtu.be/pfpIndq7Fi8?feature=shared).
-
-:::
-
-#### Query Construction
-
-Third, consider whether any of your data sources require specific query formats. Many structured databases use SQL. Vector stores often have specific syntax for applying keyword filters to document metadata. **Using an LLM to convert a natural language query into a query syntax is a popular and powerful approach.**
-In particular, [text-to-SQL](/docs/tutorials/sql_qa/), [text-to-Cypher](/docs/tutorials/graph/), and [query analysis for metadata filters](/docs/tutorials/query_analysis/#query-analysis) are useful ways to interact with structured, graph, and vector databases respectively.
-
-| Name | When to Use | Description |
-|---------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [Text to SQL](/docs/tutorials/sql_qa/) | If users are asking questions that require information housed in a relational database, accessible via SQL. | This uses an LLM to transform user input into a SQL query. |
-| [Text-to-Cypher](/docs/tutorials/graph/) | If users are asking questions that require information housed in a graph database, accessible via Cypher. | This uses an LLM to transform user input into a Cypher query. |
-| [Self Query](/docs/how_to/self_query/) | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). |
-
-:::tip
-
-See our [blog post overview](https://blog.langchain.dev/query-construction/) and RAG from Scratch video on [query construction](https://youtu.be/kl6NwWYxvbM?feature=shared), the process of text-to-DSL where DSL is a domain specific language required to interact with a given database. This converts user questions into structured queries.
-
-:::
-
-#### Indexing
-
-Fourth, consider the design of your document index. A simple and powerful idea is to **decouple the documents that you index for retrieval from the documents that you pass to the LLM for generation.** Indexing frequently uses embedding models with vector stores, which [compress the semantic information in documents to fixed-size vectors](/docs/concepts/#embedding-models).
-
-Many RAG approaches focus on splitting documents into chunks and retrieving some number based on similarity to an input question for the LLM. But chunk size and chunk number can be difficult to set and affect results if they do not provide full context for the LLM to answer a question. Furthermore, LLMs are increasingly capable of processing millions of tokens.
-
-Two approaches can address this tension: (1) The [Multi Vector](/docs/how_to/multi_vector/) retriever uses an LLM to translate documents into a form (often a summary) that is well-suited for indexing, but returns full documents to the LLM for generation. (2) The [ParentDocument](/docs/how_to/parent_document_retriever/) retriever embeds document chunks, but also returns full documents. The idea is to get the best of both worlds: use concise representations (summaries or chunks) for retrieval, but use the full documents for answer generation.
-
-| Name | Index Type | Uses an LLM | When to Use | Description |
-|---------------------------|------------------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [Vector store](/docs/how_to/vectorstore_retriever/) | Vector store | No | If you are just getting started and looking for something quick and easy. | This is the simplest method and the one that is easiest to get started with. It involves creating embeddings for each piece of text. |
-| [ParentDocument](/docs/how_to/parent_document_retriever/) | Vector store + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). |
-| [Multi Vector](/docs/how_to/multi_vector/) | Vector store + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions. |
-| [Time-Weighted Vector store](/docs/how_to/time_weighted_vectorstore/) | Vector store | No | If you have timestamps associated with your documents, and you want to retrieve the most recent ones | This fetches documents based on a combination of semantic similarity (as in normal vector retrieval) and recency (looking at timestamps of indexed documents) |
-
-:::tip
-
-- See our RAG from Scratch video on [indexing fundamentals](https://youtu.be/bjb_EMsTDKI?feature=shared)
-- See our RAG from Scratch video on [multi vector retriever](https://youtu.be/gTCU9I6QqCE?feature=shared)
-
-:::
-
-Fifth, consider ways to improve the quality of your similarity search itself. Embedding models compress text into fixed-length (vector) representations that capture the semantic content of the document. This compression is useful for search / retrieval, but puts a heavy burden on that single vector representation to capture the semantic nuance / detail of the document. In some cases, irrelevant or redundant content can dilute the semantic usefulness of the embedding.
-
-[ColBERT](https://docs.google.com/presentation/d/1IRhAdGjIevrrotdplHNcc4aXgIYyKamUKTWtB3m3aMU/edit?usp=sharing) is an interesting approach to address this with higher-granularity embeddings: (1) produce a contextually influenced embedding for each token in the document and query, (2) score similarity between each query token and all document tokens, (3) take the max, (4) do this for all query tokens, and (5) take the sum of the max scores (in step 3) for all query tokens to get a query-document similarity score; this token-wise scoring can yield strong results.
-
-![](/img/colbert.png)
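-
-The five scoring steps described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ColBERT implementation: the 2-dimensional token vectors are made up, a plain dot product is used as the similarity function, and the function names are hypothetical.
-
-```python
-def dot(u, v):
-    """Dot product of two equal-length vectors."""
-    return sum(a * b for a, b in zip(u, v))
-
-def maxsim_score(query_vecs, doc_vecs):
-    """For each query token take the max similarity to any document token,
-    then sum those maxima over all query tokens."""
-    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
-
-# Toy per-token embeddings, invented for illustration only.
-query = [[1.0, 0.0], [0.0, 1.0]]
-doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
-doc_b = [[0.5, 0.5], [0.5, 0.5]]
-
-score_a = maxsim_score(query, doc_a)
-score_b = maxsim_score(query, doc_b)
-```
-
-In this toy data `doc_a` contains a close match for each query token, so it scores higher than `doc_b`, whose tokens only partially match each query token.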
-
-There are some additional tricks to improve the quality of your retrieval. Embeddings excel at capturing semantic information, but may struggle with keyword-based queries. Many [vector stores](/docs/integrations/retrievers/pinecone_hybrid_search/) offer built-in [hybrid-search](https://docs.pinecone.io/guides/data/understanding-hybrid-search) to combine keyword and semantic similarity, which marries the benefits of both approaches. Furthermore, many vector stores have [maximal marginal relevance](https://python.langchain.com/v0.1/docs/modules/model_io/prompts/example_selectors/mmr/), which attempts to diversify the results of a search to avoid returning similar and redundant documents.
-
-| Name | When to use | Description |
-|-------------------|----------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [ColBERT](/docs/integrations/providers/ragatouille/#using-colbert-as-a-reranker) | When higher granularity embeddings are needed. | ColBERT uses contextually influenced embeddings for each token in the document and query to get a granular query-document similarity score. [Paper](https://arxiv.org/abs/2112.01488). |
-| [Hybrid search](/docs/integrations/retrievers/pinecone_hybrid_search/) | When combining keyword-based and semantic similarity. | Hybrid search combines keyword and semantic similarity, marrying the benefits of both approaches. [Paper](https://arxiv.org/abs/2210.11934). |
-| [Maximal Marginal Relevance (MMR)](/docs/integrations/vectorstores/pinecone/#maximal-marginal-relevance-searches) | When needing to diversify search results. | MMR attempts to diversify the results of a search to avoid returning similar and redundant documents. |
-
-:::tip
-
-See our RAG from Scratch video on [ColBERT](https://youtu.be/cN6S0Ehm7_8?feature=shared).
-
-:::
-
-#### Post-processing
-
-Sixth, consider ways to filter or rank retrieved documents. This is very useful if you are [combining documents returned from multiple sources](/docs/integrations/retrievers/cohere-reranker/#doing-reranking-with-coherererank), since it can down-rank less relevant documents and / or [compress similar documents](/docs/how_to/contextual_compression/#more-built-in-compressors-filters).
-
-| Name | Index Type | Uses an LLM | When to Use | Description |
-|---------------------------|------------------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [Contextual Compression](/docs/how_to/contextual_compression/) | Any | Sometimes | If you are finding that your retrieved documents contain too much irrelevant information and are distracting the LLM. | This puts a post-processing step on top of another retriever and extracts only the most relevant information from retrieved documents. This can be done with embeddings or an LLM. |
-| [Ensemble](/docs/how_to/ensemble_retriever/) | Any | No | If you have multiple retrieval methods and want to try combining them. | This fetches documents from multiple retrievers and then combines them. |
-| [Re-ranking](/docs/integrations/retrievers/cohere-reranker/) | Any | Yes | If you want to rank retrieved documents based upon relevance, especially if you want to combine results from multiple retrieval methods. | Given a query and a list of documents, Rerank indexes the documents from most to least semantically relevant to the query. |
-
-:::tip
-
-See our RAG from Scratch video on [RAG-Fusion](https://youtu.be/77qELPbNgxA?feature=shared) ([paper](https://arxiv.org/abs/2402.03367)), an approach for post-processing across multiple queries: Rewrite the user question from multiple perspectives, retrieve documents for each rewritten question, and combine the ranks of multiple search result lists to produce a single, unified ranking with [Reciprocal Rank Fusion (RRF)](https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1).
-
-:::
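-
-The Reciprocal Rank Fusion step mentioned in the tip above can be sketched as follows. This is a minimal sketch rather than the RAG-Fusion implementation: the document IDs and the `reciprocal_rank_fusion` helper are hypothetical, and the smoothing constant `k=60` is a commonly used default that should be treated as a tunable assumption.
-
-```python
-from collections import defaultdict
-
-def reciprocal_rank_fusion(rankings, k=60):
-    """Fuse ranked lists: each document scores the sum of 1 / (k + rank)
-    over every list it appears in (ranks start at 1)."""
-    scores = defaultdict(float)
-    for ranking in rankings:
-        for rank, doc_id in enumerate(ranking, start=1):
-            scores[doc_id] += 1.0 / (k + rank)
-    # Highest fused score first.
-    return sorted(scores, key=scores.get, reverse=True)
-
-# One ranked result list per rewritten query (hypothetical document IDs).
-rankings = [
-    ["doc_a", "doc_b", "doc_c"],
-    ["doc_b", "doc_a", "doc_d"],
-    ["doc_b", "doc_c", "doc_a"],
-]
-fused = reciprocal_rank_fusion(rankings)
-```
-
-Documents that rank well across several lists rise to the top, while a document that appears in only one list (here `doc_d`) ends up last.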
-
-#### Generation
-
-**Finally, consider ways to build self-correction into your RAG system.** RAG systems can suffer from low quality retrieval (e.g., if a user question is out of the domain for the index) and / or hallucinations in generation. A naive retrieve-generate pipeline has no ability to detect or self-correct from these kinds of errors. The concept of ["flow engineering"](https://x.com/karpathy/status/1748043513156272416) has been introduced [in the context of code generation](https://arxiv.org/abs/2401.08500): iteratively build an answer to a code question with unit tests to check and self-correct errors. Several works have applied this to RAG, such as Self-RAG and Corrective-RAG. In both cases, checks for document relevance, hallucinations, and / or answer quality are performed in the RAG answer generation flow.
-
-We've found that graphs are a great way to reliably express logical flows and have implemented ideas from several of these papers [using LangGraph](https://github.com/langchain-ai/langgraph/tree/main/examples/rag), as shown in the figure below (red - routing, blue - fallback, green - self-correction):
-- **Routing:** Adaptive RAG ([paper](https://arxiv.org/abs/2403.14403)). Route questions to different retrieval approaches, as discussed above
-- **Fallback:** Corrective RAG ([paper](https://arxiv.org/pdf/2401.15884.pdf)). Fall back to web search if docs are not relevant to the query
-- **Self-correction:** Self-RAG ([paper](https://arxiv.org/abs/2310.11511)). Fix answers that contain hallucinations or don’t address the question
-
-![](/img/langgraph_rag.png)
-
-| Name | When to use | Description |
-|-------------------|-----------------------------------------------------------|-------------|
-| Self-RAG | When needing to fix answers with hallucinations or irrelevant content. | Self-RAG performs checks for document relevance, hallucinations, and answer quality during the RAG answer generation flow, iteratively building an answer and self-correcting errors. |
-| Corrective-RAG | When needing a fallback mechanism for low relevance docs. | Corrective-RAG includes a fallback (e.g., to web search) if the retrieved documents are not relevant to the query, ensuring higher quality and more relevant retrieval. |
-
-:::tip
-
-See several videos and cookbooks showcasing RAG with LangGraph:
-- [LangGraph Corrective RAG](https://www.youtube.com/watch?v=E2shqsYwxck)
-- [LangGraph combining Adaptive, Self-RAG, and Corrective RAG](https://www.youtube.com/watch?v=-ROS6gfYIts)
-- [Cookbooks for RAG using LangGraph](https://github.com/langchain-ai/langgraph/tree/main/examples/rag)
-
-See our LangGraph RAG recipes with partners:
-- [Meta](https://github.com/meta-llama/llama-recipes/tree/main/recipes/3p_integrations/langchain)
-- [Mistral](https://github.com/mistralai/cookbook/tree/main/third_party/langchain)
-
-:::
-
-### Text splitting
-
-LangChain offers many different types of `text splitters`.
-These all live in the `langchain-text-splitters` package.
-
-Table columns:
-
-- **Name**: Name of the text splitter
-- **Classes**: Classes that implement this text splitter
-- **Splits On**: How this text splitter splits text
-- **Adds Metadata**: Whether or not this text splitter adds metadata about where each chunk came from.
-- **Description**: Description of the splitter, including recommendation on when to use it.
-
-
-| Name | Classes | Splits On | Adds Metadata | Description |
-|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Recursive | [RecursiveCharacterTextSplitter](/docs/how_to/recursive_text_splitter/), [RecursiveJsonSplitter](/docs/how_to/recursive_json_splitter/) | A list of user defined characters | | Recursively splits text. This splitting is trying to keep related pieces of text next to each other. This is the `recommended way` to start splitting text. |
-| HTML | [HTMLHeaderTextSplitter](/docs/how_to/HTML_header_metadata_splitter/), [HTMLSectionSplitter](/docs/how_to/HTML_section_aware_splitter/) | HTML specific characters | ✅ | Splits text based on HTML-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the HTML) |
-| Markdown | [MarkdownHeaderTextSplitter](/docs/how_to/markdown_header_metadata_splitter/) | Markdown specific characters | ✅ | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown) |
-| Code | [many languages](/docs/how_to/code_splitter/) | Code (Python, JS) specific characters | | Splits text based on characters specific to coding languages. 15 different languages are available to choose from. |
-| Token | [many classes](/docs/how_to/split_by_token/) | Tokens | | Splits text on tokens. There exist a few different ways to measure tokens. |
-| Character | [CharacterTextSplitter](/docs/how_to/character_text_splitter/) | A user defined character | | Splits text based on a user defined character. One of the simpler methods. |
-| Semantic Chunker (Experimental) | [SemanticChunker](/docs/how_to/semantic-chunker/) | Sentences | | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from [Greg Kamradt](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) |
-| Integration: AI21 Semantic | [AI21SemanticTextSplitter](/docs/integrations/document_transformers/ai21_semantic_text_splitter/) | | ✅ | Identifies distinct topics that form coherent pieces of text and splits along those. |
-
-### Evaluation
-
-
-Evaluation is the process of assessing the performance and effectiveness of your LLM-powered applications.
-It involves testing the model's responses against a set of predefined criteria or benchmarks to ensure it meets the desired quality standards and fulfills the intended purpose.
-This process is vital for building reliable applications.
-
-![](/img/langsmith_evaluate.png)
-
-[LangSmith](https://docs.smith.langchain.com/) helps with this process in a few ways:
-
-- It makes it easier to create and curate datasets via its tracing and annotation features
-- It provides an evaluation framework that helps you define metrics and run your app against your dataset
-- It allows you to track results over time and automatically run your evaluators on a schedule or as part of CI/Code
-
-To learn more, check out [this LangSmith guide](https://docs.smith.langchain.com/concepts/evaluation).
-
-### Tracing
-
-
-A trace is essentially a series of steps that your application takes to go from input to output.
-Traces contain individual steps called `runs`. These can be individual calls from a model, retriever,
-tool, or sub-chains.
-Tracing gives you observability inside your chains and agents, and is vital in diagnosing issues.
-
-For a deeper dive, check out [this LangSmith conceptual guide](https://docs.smith.langchain.com/concepts/tracing).
diff --git a/docs/docs/concepts/agents.mdx b/docs/docs/concepts/agents.mdx
new file mode 100644
index 0000000000000..960eb2a975d1e
--- /dev/null
+++ b/docs/docs/concepts/agents.mdx
@@ -0,0 +1,25 @@
+# Agents
+
+By themselves, language models can't take actions - they just output text. Agents are systems that take a high-level task and use an LLM as a reasoning engine to decide what actions to take and execute those actions.
+
+[LangGraph](/docs/concepts/architecture#langgraph) is an extension of LangChain specifically aimed at creating highly controllable and customizable agents. We recommend that you use LangGraph for building agents.
+
+Please see the following resources for more information:
+
+* LangGraph docs on [common agent architectures](https://langchain-ai.github.io/langgraph/concepts/agentic_concepts/)
+* [Pre-built agents in LangGraph](https://langchain-ai.github.io/langgraph/reference/prebuilt/#langgraph.prebuilt.chat_agent_executor.create_react_agent)
+
+## Legacy agent concept: AgentExecutor
+
+LangChain previously introduced the `AgentExecutor` as a runtime for agents.
+While it served as an excellent starting point, its limitations became apparent when dealing with more sophisticated and customized agents.
+As a result, we're gradually phasing out `AgentExecutor` in favor of more flexible solutions in LangGraph.
+
+### Transitioning from AgentExecutor to LangGraph
+
+If you're currently using `AgentExecutor`, don't worry! We've prepared resources to help you:
+
+1. For those who still need to use `AgentExecutor`, we offer a comprehensive guide on [how to use AgentExecutor](/docs/how_to/agent_executor).
+
+2. However, we strongly recommend transitioning to LangGraph for improved flexibility and control. To facilitate this transition, we've created a detailed [migration guide](/docs/how_to/migrate_agent) to help you move from `AgentExecutor` to LangGraph seamlessly.
+
diff --git a/docs/docs/concepts/architecture.mdx b/docs/docs/concepts/architecture.mdx
new file mode 100644
index 0000000000000..6a76b58fb297f
--- /dev/null
+++ b/docs/docs/concepts/architecture.mdx
@@ -0,0 +1,78 @@
+import ThemedImage from '@theme/ThemedImage';
+import useBaseUrl from '@docusaurus/useBaseUrl';
+
+# Architecture
+
+LangChain as a framework consists of a number of packages.
+
+
+
+
+## langchain-core
+
+This package contains base abstractions of different components and ways to compose them together.
+The interfaces for core components like LLMs, vector stores, retrievers and more are defined here.
+No third party integrations are defined here.
+The dependencies are kept purposefully very lightweight.
+
+## langchain
+
+The main `langchain` package contains chains, agents, and retrieval strategies that make up an application's cognitive architecture.
+These are NOT third party integrations.
+All chains, agents, and retrieval strategies here are NOT specific to any one integration, but rather generic across all integrations.
+
+## langchain-community
+
+This package contains third party integrations that are maintained by the LangChain community.
+Key partner packages are separated out (see below).
+This contains all integrations for various components (LLMs, vector stores, retrievers).
+All dependencies in this package are optional to keep the package as lightweight as possible.
+
+## Partner packages
+
+While the long tail of integrations is in `langchain-community`, we split popular integrations into their own packages (e.g. `langchain-openai`, `langchain-anthropic`, etc). This was done in order to improve support for these important integrations.
+
+For more information see:
+
+* A list of [LangChain integrations](/docs/integrations/providers/)
+* The [LangChain API Reference](https://python.langchain.com/api_reference/), where you can find detailed information about the API of each partner package.
+
+## LangGraph
+
+`langgraph` is an extension of `langchain` aimed at building robust and stateful multi-actor applications with LLMs by modeling steps as edges and nodes in a graph.
+
+LangGraph exposes high level interfaces for creating common types of agents, as well as a low-level API for composing custom flows.
+
+:::info[Further reading]
+
+* See our LangGraph overview [here](https://langchain-ai.github.io/langgraph/concepts/high_level/#core-principles).
+* See our LangGraph Academy Course [here](https://academy.langchain.com/courses/intro-to-langgraph).
+
+:::
+
+## LangServe
+
+A package to deploy LangChain chains as REST APIs. Makes it easy to get a production ready API up and running.
+
+:::important
+LangServe is designed to primarily deploy simple Runnables and work with well-known primitives in langchain-core.
+
+If you need a deployment option for LangGraph, you should instead be looking at LangGraph Cloud (beta) which will be better suited for deploying LangGraph applications.
+:::
+
+For more information, see the [LangServe documentation](/docs/langserve).
+
+
+## LangSmith
+
+A developer platform that lets you debug, test, evaluate, and monitor LLM applications.
+
+For more information, see the [LangSmith documentation](https://docs.smith.langchain.com).
diff --git a/docs/docs/concepts/async.mdx b/docs/docs/concepts/async.mdx
new file mode 100644
index 0000000000000..2a1d5acf57845
--- /dev/null
+++ b/docs/docs/concepts/async.mdx
@@ -0,0 +1,81 @@
+# Async programming with LangChain
+
+:::info Prerequisites
+* [Runnable interface](/docs/concepts/runnables)
+* [asyncio](https://docs.python.org/3/library/asyncio.html)
+:::
+
+LLM based applications often involve a lot of I/O-bound operations, such as making API calls to language models, databases, or other services. Asynchronous programming (or async programming) is a paradigm that allows a program to perform multiple tasks concurrently without blocking the execution of other tasks, improving efficiency and responsiveness, particularly in I/O-bound operations.
+
+:::note
+You are expected to be familiar with asynchronous programming in Python before reading this guide. If you are not, please find appropriate resources online to learn how to program asynchronously in Python.
+This guide specifically focuses on what you need to know to work with LangChain in an asynchronous context, assuming that you are already familiar with asynchronous programming.
+:::
+
+## LangChain asynchronous APIs
+
+Many LangChain APIs are designed to be asynchronous, allowing you to build efficient and responsive applications.
+
+Typically, any method that may perform I/O operations (e.g., making API calls, reading files) will have an asynchronous counterpart.
+
+In LangChain, async implementations are located in the same classes as their synchronous counterparts, with the asynchronous methods having an "a" prefix. For example, the synchronous `invoke` method has an asynchronous counterpart called `ainvoke`.
+
+Many components of LangChain implement the [Runnable Interface](/docs/concepts/runnables), which includes support for asynchronous execution. This means that you can run Runnables asynchronously using the `await` keyword in Python.
+
+```python
+await some_runnable.ainvoke(some_input)
+```
+
+Other components like [Embedding Models](/docs/concepts/embedding_models) and [VectorStore](/docs/concepts/vectorstores) that do not implement the [Runnable Interface](/docs/concepts/runnables) usually still follow the same rule and include the asynchronous version of each method in the same class with an "a" prefix.
+
+For example,
+
+```python
+await some_vectorstore.aadd_documents(documents)
+```
+
+Runnables created using the [LangChain Expression Language (LCEL)](/docs/concepts/lcel) can also be run asynchronously as they implement
+the full [Runnable Interface](/docs/concepts/runnables).
+
+For more information, please review the [API reference](https://python.langchain.com/api_reference/) for the specific component you are using.
+
+## Delegation to sync methods
+
+Most popular LangChain integrations implement asynchronous support of their APIs. For example, the `ainvoke` method of many ChatModel implementations uses the `httpx.AsyncClient` to make asynchronous HTTP requests to the model provider's API.
+
+When an asynchronous implementation is not available, LangChain tries to provide a default implementation, even if it incurs
+a **slight** overhead.
+
+By default, LangChain will delegate the execution of unimplemented asynchronous methods to their synchronous counterparts. LangChain almost always assumes that the synchronous method should be treated as a blocking operation and should be run in a separate thread.
+This is done using [asyncio.loop.run_in_executor](https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor) functionality provided by the `asyncio` library. LangChain uses the default executor provided by the `asyncio` library, which lazily initializes a thread pool executor with a default number of threads that is reused in the given event loop. While this strategy incurs a slight overhead due to context switching between threads, it guarantees that every asynchronous method has a default implementation that works out of the box.
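+
+The delegation described above can be sketched with plain `asyncio`. This is a minimal illustration of the pattern, not LangChain's actual implementation; `sync_invoke` and `default_ainvoke` are hypothetical names introduced only for this example.
+
+```python
+import asyncio
+import time
+
+
+def sync_invoke(text: str) -> str:
+    """Hypothetical blocking call standing in for a sync-only integration."""
+    time.sleep(0.01)  # simulate blocking I/O
+    return text.upper()
+
+
+async def default_ainvoke(text: str) -> str:
+    """Fallback async wrapper: run the sync method in the default executor."""
+    loop = asyncio.get_running_loop()
+    # Passing None selects the loop's default, lazily created thread pool.
+    return await loop.run_in_executor(None, sync_invoke, text)
+
+
+result = asyncio.run(default_ainvoke("hello"))
+```
+
+Because the blocking call runs in a worker thread, the event loop stays free to schedule other tasks while it waits.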
+
+## Performance
+
+Async code in LangChain should generally perform relatively well with minimal overhead out of the box, and is unlikely
+to be a bottleneck in most applications.
+
+The two main sources of overhead are:
+
+1. Cost of context switching between threads when [delegating to synchronous methods](#delegation-to-sync-methods). This can be addressed by providing a native asynchronous implementation.
+2. In [LCEL](/docs/concepts/lcel) any "cheap functions" that appear as part of the chain will be either scheduled as tasks on the event loop (if they are async) or run in a separate thread (if they are sync), rather than just be run inline.
+
+The latency overhead you should expect from these ranges from tens of microseconds to a few milliseconds.
+
+A more common source of performance issues arises from users accidentally blocking the event loop by calling synchronous code in an async context (e.g., calling `invoke` rather than `ainvoke`).
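+
+The effect of blocking the event loop can be seen in a small standard-library sketch. It uses `time.sleep` and `asyncio.sleep` as stand-ins for a synchronous and an asynchronous call (for example `invoke` versus `ainvoke`); the function names are illustrative assumptions, not LangChain APIs.
+
+```python
+import asyncio
+import time
+
+
+async def blocking_step() -> None:
+    time.sleep(0.1)  # sync call inside async code: blocks the whole event loop
+
+
+async def cooperative_step() -> None:
+    await asyncio.sleep(0.1)  # async call: yields control to other tasks
+
+
+async def timed(step) -> float:
+    """Run three copies of `step` concurrently and return the elapsed seconds."""
+    start = time.perf_counter()
+    await asyncio.gather(step(), step(), step())
+    return time.perf_counter() - start
+
+
+blocking_elapsed = asyncio.run(timed(blocking_step))
+cooperative_elapsed = asyncio.run(timed(cooperative_step))
+```
+
+The three blocking steps run one after another, while the three cooperative steps overlap, so `blocking_elapsed` comes out roughly three times larger than `cooperative_elapsed`.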
+
+## Compatibility
+
+LangChain is only compatible with the `asyncio` library, which is distributed as part of the Python standard library. It will not work with other async libraries like `trio` or `curio`.
+
+In Python 3.9 and 3.10, [asyncio's tasks](https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task) did not
+accept a `context` parameter. Due to this limitation, LangChain cannot automatically propagate the `RunnableConfig` down the call chain
+in certain scenarios.
+
+If you are experiencing issues with streaming, callbacks or tracing in async code and are using Python 3.9 or 3.10, this is a likely cause.
+
+See [Propagation RunnableConfig](/docs/concepts/runnables#propagation-RunnableConfig) to learn how to propagate the `RunnableConfig` down the call chain manually (or upgrade to Python 3.11, where this is no longer an issue).
+
+## How to use in IPython and Jupyter notebooks
+
+As of IPython 7.0, IPython supports asynchronous REPLs. This means that you can use the `await` keyword in the IPython REPL and Jupyter Notebooks without any additional setup. For more information, see the [IPython blog post](https://blog.jupyter.org/ipython-7-0-async-repl-a35ce050f7f7).
+
diff --git a/docs/docs/concepts/callbacks.mdx b/docs/docs/concepts/callbacks.mdx
new file mode 100644
index 0000000000000..6e3975271d8eb
--- /dev/null
+++ b/docs/docs/concepts/callbacks.mdx
@@ -0,0 +1,73 @@
+# Callbacks
+
+:::note Prerequisites
+- [Runnable interface](/docs/concepts/#runnable-interface)
+:::
+
+LangChain provides a callbacks system that allows you to hook into the various stages of your LLM application. This is useful for logging, monitoring, streaming, and other tasks.
+
+You can subscribe to these events by using the `callbacks` argument available throughout the API. This argument is a list of handler objects, which are expected to implement one or more of the methods described below in more detail.
+
+## Callback events
+
+| Event | Event Trigger | Associated Method |
+|------------------|---------------------------------------------|-----------------------|
+| Chat model start | When a chat model starts | `on_chat_model_start` |
+| LLM start | When an LLM starts | `on_llm_start` |
+| LLM new token | When an LLM OR chat model emits a new token | `on_llm_new_token` |
+| LLM ends | When an LLM OR chat model ends | `on_llm_end` |
+| LLM errors | When an LLM OR chat model errors | `on_llm_error` |
+| Chain start | When a chain starts running | `on_chain_start` |
+| Chain end | When a chain ends | `on_chain_end` |
+| Chain error | When a chain errors | `on_chain_error` |
+| Tool start | When a tool starts running | `on_tool_start` |
+| Tool end | When a tool ends | `on_tool_end` |
+| Tool error | When a tool errors | `on_tool_error` |
+| Agent action | When an agent takes an action | `on_agent_action` |
+| Agent finish | When an agent ends | `on_agent_finish` |
+| Retriever start | When a retriever starts | `on_retriever_start` |
+| Retriever end | When a retriever ends | `on_retriever_end` |
+| Retriever error | When a retriever errors | `on_retriever_error` |
+| Text | When arbitrary text is run | `on_text` |
+| Retry | When a retry event is run | `on_retry` |
+
+## Callback handlers
+
+Callback handlers can either be `sync` or `async`:
+
+* Sync callback handlers implement the [BaseCallbackHandler](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.base.BaseCallbackHandler.html) interface.
+* Async callback handlers implement the [AsyncCallbackHandler](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.base.AsyncCallbackHandler.html) interface.
+
+At runtime, LangChain configures an appropriate callback manager (e.g., [CallbackManager](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.manager.CallbackManager.html) or [AsyncCallbackManager](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.manager.AsyncCallbackManager.html)), which is responsible for calling the appropriate method on each "registered" callback handler when the event is triggered.
+
+## Passing callbacks
+
+The `callbacks` property is available on most objects throughout the API (Models, Tools, Agents, etc.) in two different places:
+
+- **Request time callbacks**: Passed at the time of the request in addition to the input data.
+Available on all standard `Runnable` objects. These callbacks are INHERITED by all children
+of the object they are defined on. For example, `chain.invoke({"number": 25}, {"callbacks": [handler]})`.
+- **Constructor callbacks**: `chain = TheNameOfSomeChain(callbacks=[handler])`. These callbacks
+are passed as arguments to the constructor of the object. The callbacks are scoped
+only to the object they are defined on, and are **not** inherited by any children of the object.
+
+:::warning
+Constructor callbacks are scoped only to the object they are defined on. They are **not** inherited by children
+of the object.
+:::
+
+If you're creating a custom chain or runnable, you need to remember to propagate request time
+callbacks to any child objects.
+
+:::important Async in Python<=3.10
+
+Any `RunnableLambda`, `RunnableGenerator`, or `Tool` that invokes other runnables
+and is running `async` in Python<=3.10 will have to propagate callbacks to child
+objects manually. This is because LangChain cannot automatically propagate
+callbacks to child objects in this case.
+
+This is a common reason why you may fail to see events being emitted from custom
+runnables or tools.
+:::
+
+For specifics on how to use callbacks, see the [relevant how-to guides here](/docs/how_to/#callbacks).
\ No newline at end of file
diff --git a/docs/docs/concepts/chat_history.mdx b/docs/docs/concepts/chat_history.mdx
new file mode 100644
index 0000000000000..967f93af968e0
--- /dev/null
+++ b/docs/docs/concepts/chat_history.mdx
@@ -0,0 +1,46 @@
+# Chat history
+
+:::info Prerequisites
+
+- [Messages](/docs/concepts/messages)
+- [Chat models](/docs/concepts/chat_models)
+- [Tool calling](/docs/concepts/tool_calling)
+:::
+
+Chat history is a record of the conversation between the user and the chat model. It is used to maintain context and state throughout the conversation. The chat history is a sequence of [messages](/docs/concepts/messages), each of which is associated with a specific [role](/docs/concepts/messages#role), such as "user", "assistant", "system", or "tool".
+
+## Conversation patterns
+
+![Conversation patterns](/img/conversation_patterns.png)
+
+Most conversations start with a **system message** that sets the context for the conversation. This is followed by a **user message** containing the user's input, and then an **assistant message** containing the model's response.
+
+The **assistant** may respond directly to the user or, if configured with tools, request that a [tool](/docs/concepts/tool_calling) be invoked to perform a specific task.
+
+So a full conversation often involves a combination of two patterns of alternating messages:
+
+1. The **user** and the **assistant** representing a back-and-forth conversation.
+2. The **assistant** and **tool messages** representing an ["agentic" workflow](/docs/concepts/agents) where the assistant is invoking tools to perform specific tasks.
+
+## Managing chat history
+
+Since chat models have a maximum limit on input size, it's important to manage chat history and trim it as needed to avoid exceeding the [context window](/docs/concepts/chat_models#context_window).
+
+While processing chat history, it's essential to preserve a correct conversation structure.
+
+Key guidelines for managing chat history:
+
+- The conversation should follow one of these structures:
+ - The first message is either a "user" message or a "system" message, followed by a "user" and then an "assistant" message.
+ - The last message should be either a "user" message or a "tool" message containing the result of a tool call.
+- When using [tool calling](/docs/concepts/tool_calling), a "tool" message should only follow an "assistant" message that requested the tool invocation.
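+
+The guidelines above can be sketched as a small structural check. This is an illustrative helper, not a LangChain API: `is_valid_history` is a hypothetical name, messages are assumed to be plain dicts with a `"role"` key, and a "tool" message is allowed to follow another "tool" message on the assumption that one assistant message requested several tool calls. Roles alone cannot confirm that the assistant actually requested the tool, so that part is not checked.
+
+```python
+from typing import Dict, List
+
+
+def is_valid_history(messages: List[Dict[str, str]]) -> bool:
+    """Check the structural guidelines on a list of {"role": ...} dicts."""
+    if not messages:
+        return False
+    roles = [m["role"] for m in messages]
+    # The first message is either a "system" or a "user" message.
+    if roles[0] not in ("system", "user"):
+        return False
+    # A "system" opener must be followed by a "user" message.
+    if roles[0] == "system" and len(roles) > 1 and roles[1] != "user":
+        return False
+    # The last message is either a "user" or a "tool" message.
+    if roles[-1] not in ("user", "tool"):
+        return False
+    # A "tool" message only follows an "assistant" message
+    # (or another "tool" message: assumption for multiple tool calls).
+    for prev, cur in zip(roles, roles[1:]):
+        if cur == "tool" and prev not in ("assistant", "tool"):
+            return False
+    return True
+
+
+history = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "What is the weather in Paris?"},
+    {"role": "assistant", "content": ""},  # would carry the tool call request
+    {"role": "tool", "content": "18 degrees and sunny"},
+]
+history_ok = is_valid_history(history)
+```
+
+A check like this is most useful right after trimming, since naive truncation can leave a "tool" message without the "assistant" message that requested it.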
+
+:::tip
+Understanding correct conversation structure is essential for being able to properly implement
+[memory](https://langchain-ai.github.io/langgraph/concepts/memory/) in chat models.
+:::
+
+## Related resources
+
+- [How to trim messages](https://python.langchain.com/docs/how_to/trim_messages/)
+- [Memory guide](https://langchain-ai.github.io/langgraph/concepts/memory/) for information on implementing short-term and long-term memory in chat models using [LangGraph](https://langchain-ai.github.io/langgraph/).
diff --git a/docs/docs/concepts/chat_models.mdx b/docs/docs/concepts/chat_models.mdx
new file mode 100644
index 0000000000000..e924168e2cd71
--- /dev/null
+++ b/docs/docs/concepts/chat_models.mdx
@@ -0,0 +1,168 @@
+# Chat models
+
+## Overview
+
+Large Language Models (LLMs) are advanced machine learning models that excel in a wide range of language-related tasks such as text generation, translation, summarization, question answering, and more, without needing task-specific tuning for every scenario.
+
+Modern LLMs are typically accessed through a chat model interface that takes a list of [messages](/docs/concepts/messages) as input and returns a [message](/docs/concepts/messages) as output.
+
+The newest generation of chat models offer additional capabilities:
+
+* [Tool calling](/docs/concepts#tool-calling): Many popular chat models offer a native [tool calling](/docs/concepts#tool-calling) API. This API allows developers to build rich applications that enable AI to interact with external services, APIs, and databases. Tool calling can also be used to extract structured information from unstructured data and perform various other tasks.
+* [Structured output](/docs/concepts/structured_outputs): A technique to make a chat model respond in a structured format, such as JSON that matches a given schema.
+* [Multimodality](/docs/concepts/multimodality): The ability to work with data other than text; for example, images, audio, and video.
+
+## Features
+
+LangChain provides a consistent interface for working with chat models from different providers while offering additional features for monitoring, debugging, and optimizing the performance of applications that use LLMs.
+
+* Integrations with many chat model providers (e.g., Anthropic, OpenAI, Ollama, Microsoft Azure, Google Vertex, Amazon Bedrock, Hugging Face, Cohere, Groq). Please see [chat model integrations](/docs/integrations/chat/) for an up-to-date list of supported models.
+* Use either LangChain's [messages](/docs/concepts/messages) format or OpenAI format.
+* Standard [tool calling API](/docs/concepts#tool-calling): standard interface for binding tools to models, accessing tool call requests made by models, and sending tool results back to the model.
+* Standard API for [structuring outputs](/docs/concepts/structured_outputs) via the `with_structured_output` method.
+* Provides support for [async programming](/docs/concepts/async), [efficient batching](/docs/concepts/runnables#batch), and [a rich streaming API](/docs/concepts/streaming).
+* Integration with [LangSmith](https://docs.smith.langchain.com) for monitoring and debugging production-grade applications based on LLMs.
+* Additional features like standardized [token usage](/docs/concepts/messages#token_usage), [rate limiting](#rate-limiting), [caching](#caching) and more.
+
+## Integrations
+
+LangChain has many chat model integrations that allow you to use a wide variety of models from different providers.
+
+These integrations are one of two types:
+
+1. **Official models**: These are models that are officially supported by LangChain and/or the model provider. You can find these models in the `langchain-` packages.
+2. **Community models**: These are models that are mostly contributed and supported by the community. You can find these models in the `langchain-community` package.
+
+LangChain chat models are named with a convention that prefixes "Chat" to their class names (e.g., `ChatOllama`, `ChatAnthropic`, `ChatOpenAI`, etc.).
+
+Please review the [chat model integrations](/docs/integrations/chat/) for a list of supported models.
+
+:::note
+Models that do **not** include the prefix "Chat" in their name or include "LLM" as a suffix in their name typically refer to older models that do not follow the chat model interface and instead use an interface that takes a string as input and returns a string as output.
+:::
+
+
+## Interface
+
+LangChain chat models implement the [BaseChatModel](https://python.langchain.com/api_reference/core/language_models/langchain_core.language_models.chat_models.BaseChatModel.html) interface. Because `BaseChatModel` also implements the [Runnable Interface](/docs/concepts/runnables), chat models support a [standard streaming interface](/docs/concepts/streaming), [async programming](/docs/concepts/async), optimized [batching](/docs/concepts/runnables#batch), and more. Please see the [Runnable Interface](/docs/concepts/runnables) for more details.
+
+Many of the key methods of chat models operate on [messages](/docs/concepts/messages) as input and return messages as output.
+
+Chat models offer a standard set of parameters that can be used to configure the model. These parameters are typically used to control the behavior of the model, such as the temperature of the output, the maximum number of tokens in the response, and the maximum time to wait for a response. Please see the [standard parameters](#standard-parameters) section for more details.
+
+:::note
+In documentation, we will often use the terms "LLM" and "Chat Model" interchangeably. This is because most modern LLMs are exposed to users via a chat model interface.
+
+However, LangChain also has implementations of older LLMs that do not follow the chat model interface and instead use an interface that takes a string as input and returns a string as output. These models are typically named without the "Chat" prefix (e.g., `Ollama`, `Anthropic`, `OpenAI`, etc.).
+These models implement the [BaseLLM](https://python.langchain.com/api_reference/core/language_models/langchain_core.language_models.llms.BaseLLM.html#langchain_core.language_models.llms.BaseLLM) interface and may be named with the "LLM" suffix (e.g., `OllamaLLM`, `AnthropicLLM`, `OpenAILLM`, etc.). Generally, users should not use these models.
+:::
+
+### Key methods
+
+The key methods of a chat model are:
+
+1. **invoke**: The primary method for interacting with a chat model. It takes a list of [messages](/docs/concepts/messages) as input and returns a message as output.
+2. **stream**: A method that allows you to stream the output of a chat model as it is generated.
+3. **batch**: A method that allows you to batch multiple requests to a chat model together for more efficient processing.
+4. **bind_tools**: A method that allows you to bind a tool to a chat model for use in the model's execution context.
+5. **with_structured_output**: A wrapper around the `invoke` method for models that natively support [structured output](/docs/concepts#structured_output).
+
+Other important methods can be found in the [BaseChatModel API Reference](https://python.langchain.com/api_reference/core/language_models/langchain_core.language_models.chat_models.BaseChatModel.html).
+
+### Inputs and outputs
+
+Modern LLMs are typically accessed through a chat model interface that takes [messages](/docs/concepts/messages) as input and returns [messages](/docs/concepts/messages) as output. Messages are typically associated with a role (e.g., "system", "human", "assistant") and one or more content blocks that contain text or potentially multimodal data (e.g., images, audio, video).
+
+LangChain supports two message formats to interact with chat models:
+
+1. **LangChain Message Format**: LangChain's own message format, which is used by default and is used internally by LangChain.
+2. **OpenAI's Message Format**: OpenAI's message format.
+
+### Standard parameters
+
+Many chat models have standardized parameters that can be used to configure the model:
+
+| Parameter | Description |
+|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `model` | The name or identifier of the specific AI model you want to use (e.g., `"gpt-3.5-turbo"` or `"gpt-4"`). |
+| `temperature` | Controls the randomness of the model's output. A higher value (e.g., 1.0) makes responses more creative, while a lower value (e.g., 0.1) makes them more deterministic and focused. |
+| `timeout` | The maximum time (in seconds) to wait for a response from the model before canceling the request. Ensures the request doesn’t hang indefinitely. |
+| `max_tokens` | Limits the total number of tokens (words and punctuation) in the response. This controls how long the output can be. |
+| `stop` | Specifies stop sequences that indicate when the model should stop generating tokens. For example, you might use specific strings to signal the end of a response. |
+| `max_retries` | The maximum number of attempts the system will make to resend a request if it fails due to issues like network timeouts or rate limits. |
+| `api_key` | The API key required for authenticating with the model provider. This is usually issued when you sign up for access to the model. |
+| `base_url` | The URL of the API endpoint where requests are sent. This is typically provided by the model's provider and is necessary for directing your requests. |
+| `rate_limiter` | An optional [BaseRateLimiter](https://python.langchain.com/api_reference/core/rate_limiters/langchain_core.rate_limiters.BaseRateLimiter.html#langchain_core.rate_limiters.BaseRateLimiter) to space out requests to avoid exceeding rate limits. See [rate-limiting](#rate-limiting) below for more details. |
+
+Some important things to note:
+
+- Standard parameters only apply to model providers that expose parameters with the intended functionality. For example, some providers do not expose a configuration for maximum output tokens, so `max_tokens` can't be supported on these.
+- Standard parameters are currently only enforced on integrations that have their own integration packages (e.g. `langchain-openai`, `langchain-anthropic`, etc.). They are not enforced on models in `langchain-community`.
+
+ChatModels also accept other parameters that are specific to that integration. To find all the parameters supported by a ChatModel, head to the [API reference](https://python.langchain.com/api_reference/) for that model.
+
+## Tool calling
+
+Chat models can call [tools](/docs/concepts/tools) to perform tasks such as fetching data from a database, making API requests, or running custom code. Please
+see the [tool calling](/docs/concepts#tool-calling) guide for more information.
+
+## Structured outputs
+
+Chat models can be requested to respond in a particular format (e.g., JSON or matching a particular schema). This feature is extremely
+useful for information extraction tasks. Please read more about
+the technique in the [structured outputs](/docs/concepts#structured_output) guide.
+
+## Multimodality
+
+Large Language Models (LLMs) are not limited to processing text. They can also be used to process other types of data, such as images, audio, and video. This is known as [multimodality](/docs/concepts/multimodality).
+
+Currently, only some LLMs support multimodal inputs, and almost none support multimodal outputs. Please consult the specific model documentation for details.
+
+## Context window
+
+A chat model's context window refers to the maximum size of the input sequence the model can process at one time. While the context windows of modern LLMs are quite large, they still present a limitation that developers must keep in mind when working with chat models.
+
+If the input exceeds the context window, the model may not be able to process the entire input and could raise an error. In conversational applications, this is especially important because the context window determines how much information the model can "remember" throughout a conversation. Developers often need to manage the input within the context window to maintain a coherent dialogue without exceeding the limit. For more details on handling memory in conversations, refer to the [memory guide](https://langchain-ai.github.io/langgraph/concepts/memory/).
+
+The size of the input is measured in [tokens](/docs/concepts/tokens) which are the unit of processing that the model uses.
+
+## Advanced topics
+
+### Rate-limiting
+
+Many chat model providers impose a limit on the number of requests that can be made in a given time period.
+
+If you hit a rate limit, you will typically receive a rate limit error response from the provider, and will need to wait before making more requests.
+
+You have a few options to deal with rate limits:
+
+1. Try to avoid hitting rate limits by spacing out requests: Chat models accept a `rate_limiter` parameter that can be provided during initialization. This parameter is used to control the rate at which requests are made to the model provider. Spacing out the requests to a given model is a particularly useful strategy when benchmarking models to evaluate their performance. Please see the [how to handle rate limits](https://python.langchain.com/docs/how_to/chat_model_rate_limiting/) guide for more information on how to use this feature.
+2. Try to recover from rate limit errors: If you receive a rate limit error, you can wait a certain amount of time before retrying the request. The amount of time to wait can be increased with each subsequent rate limit error. Chat models have a `max_retries` parameter that can be used to control the number of retries. See the [standard parameters](#standard-parameters) section for more information.
+3. Fall back to another chat model: If you hit a rate limit with one chat model, you can switch to another chat model that is not rate-limited.
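+
+The second option, waiting longer after each rate limit error, can be sketched as a generic retry loop with exponential backoff. This is an illustration of the idea only and does not show how `max_retries` is implemented in any integration; `RateLimitError`, `call_with_backoff`, and `flaky_model_call` are hypothetical names.
+
+```python
+import time
+from typing import Callable, TypeVar
+
+T = TypeVar("T")
+
+
+class RateLimitError(Exception):
+    """Hypothetical stand-in for a provider's rate limit error."""
+
+
+def call_with_backoff(
+    call: Callable[[], T], max_retries: int = 3, base_delay: float = 0.01
+) -> T:
+    """Retry `call` on rate limit errors, doubling the wait after each failure."""
+    for attempt in range(max_retries + 1):
+        try:
+            return call()
+        except RateLimitError:
+            if attempt == max_retries:
+                raise  # out of retries: surface the error to the caller
+            time.sleep(base_delay * 2**attempt)
+    raise AssertionError("unreachable")
+
+
+attempts = {"count": 0}
+
+
+def flaky_model_call() -> str:
+    """Fails twice with a rate limit error, then succeeds."""
+    attempts["count"] += 1
+    if attempts["count"] < 3:
+        raise RateLimitError("slow down")
+    return "ok"
+
+
+answer = call_with_backoff(flaky_model_call)
+```
+
+In a real application the base delay would be on the order of seconds, and a fallback model could be tried once the retries are exhausted.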
+
+### Caching
+
+Chat model APIs can be slow, so a natural question is whether to cache the results of previous conversations. Theoretically, caching can help improve performance by reducing the number of requests made to the model provider. In practice, caching chat model responses is a complex problem and should be approached with caution.
+
+The reason is that getting a cache hit is unlikely after the first or second interaction in a conversation if relying on caching the **exact** inputs into the model. For example, how likely is it that multiple conversations start with the exact same message? What about the exact same three messages?
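+
+To see why exact-match caching rarely hits, consider a minimal sketch that derives a cache key from the full message list. This is not LangChain's cache implementation; `cache_key` is a hypothetical helper and messages are assumed to be plain dicts.
+
+```python
+import hashlib
+import json
+from typing import Dict, List
+
+Message = Dict[str, str]
+
+
+def cache_key(messages: List[Message]) -> str:
+    """Hash the exact, serialized message list into a cache key."""
+    payload = json.dumps(messages, sort_keys=True)
+    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
+
+
+first = [{"role": "user", "content": "What is LangChain?"}]
+same = [{"role": "user", "content": "What is LangChain?"}]
+reworded = [{"role": "user", "content": "what is langchain"}]
+
+hit = cache_key(first) == cache_key(same)  # identical input: same key
+miss = cache_key(first) != cache_key(reworded)  # tiny rewording: different key
+```
+
+Any difference in wording, casing, or earlier turns produces a different key, so the chance of a hit shrinks quickly as a conversation grows.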
+
+An alternative approach is to use semantic caching, where you cache responses based on the meaning of the input rather than the exact input itself. This can be effective in some situations, but not in others.
+
+A semantic cache introduces a dependency on another model on the critical path of your application (e.g., the semantic cache may rely on an [embedding model](/docs/concepts/embedding_models) to convert text to a vector representation), and it's not guaranteed to capture the meaning of the input accurately.
+
+However, there might be situations where caching chat model responses is beneficial. For example, if you have a chat model that is used to answer frequently asked questions, caching responses can help reduce the load on the model provider and improve response times.
+
+Please see the [how to cache chat model responses](/docs/how_to/#chat-model-caching) guide for more details.
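+
+To make the exact-match limitation concrete, here is a minimal sketch of a cache keyed on the full message list. It is not LangChain's caching implementation; `cache_key`, `cached_call`, and the `(role, content)` message format are assumptions made for this example.
+
+```python
+import hashlib
+import json
+
+
+def cache_key(messages):
+    """Build an exact-match cache key from the full list of (role, content) messages."""
+    payload = json.dumps(messages, sort_keys=True)
+    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
+
+
+cache = {}
+
+
+def cached_call(messages, model_fn):
+    """Return a cached response only when the exact same messages were seen before."""
+    key = cache_key(messages)
+    if key not in cache:
+        cache[key] = model_fn(messages)
+    return cache[key]
+```
+
+A repeated single-message FAQ question produces the same key and hits the cache, while the same question asked after a different greeting produces a different key and misses.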
+
+## Related resources
+
+* How-to guides on using chat models: [how-to guides](/docs/how_to/#chat-models).
+* List of supported chat models: [chat model integrations](/docs/integrations/chat/).
+
+### Conceptual guides
+
+* [Messages](/docs/concepts/messages)
+* [Tool calling](/docs/concepts/tool_calling)
+* [Multimodality](/docs/concepts/multimodality)
+* [Structured outputs](/docs/concepts/structured_outputs)
+* [Tokens](/docs/concepts/tokens)
\ No newline at end of file
diff --git a/docs/docs/concepts/document_loaders.mdx b/docs/docs/concepts/document_loaders.mdx
new file mode 100644
index 0000000000000..a6a11ddfe7104
--- /dev/null
+++ b/docs/docs/concepts/document_loaders.mdx
@@ -0,0 +1,45 @@
+# Document loaders
+
+
+:::info[Prerequisites]
+
+* [Document loaders API reference](https://python.langchain.com/api_reference/core/document_loaders/langchain_core.document_loaders.base.BaseLoader.html)
+:::
+
+Document loaders are designed to load document objects. LangChain has hundreds of integrations with various data sources to load data from: Slack, Notion, Google Drive, etc.
+
+## Integrations
+
+You can find available integrations on the [Document Loaders Integrations page](https://python.langchain.com/docs/integrations/document_loaders/).
+
+## Interface
+
+Document loaders implement the [BaseLoader interface](https://python.langchain.com/api_reference/core/document_loaders/langchain_core.document_loaders.base.BaseLoader.html).
+
+Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the `.load` method or `.lazy_load`.
+
+Here's a simple example:
+
+```python
+from langchain_community.document_loaders.csv_loader import CSVLoader
+
+loader = CSVLoader(
+    ...  # <-- Integration-specific parameters here
+)
+data = loader.load()
+```
+
+or if working with large datasets, you can use the `.lazy_load` method:
+
+```python
+for document in loader.lazy_load():
+    print(document)
+```
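+
+The difference between the two methods can be illustrated with a small stand-alone sketch. `Doc` and `LineLoader` are hypothetical classes written for this example (they are not LangChain classes); the point is only that `lazy_load` yields documents one at a time while `load` collects them all into a list.
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class Doc:
+    """Hypothetical stand-in for a document: some text plus metadata."""
+
+    page_content: str
+    metadata: dict = field(default_factory=dict)
+
+
+class LineLoader:
+    """Hypothetical loader that turns each line of a text into one Doc."""
+
+    def __init__(self, text):
+        self.text = text
+
+    def lazy_load(self):
+        # A generator: each Doc is created only when the caller asks for it.
+        for i, line in enumerate(self.text.splitlines()):
+            yield Doc(page_content=line, metadata={"line": i})
+
+    def load(self):
+        # Eager variant: materialize everything into a list.
+        return list(self.lazy_load())
+```
+
+With a large source, iterating over `lazy_load()` keeps only one document in memory at a time, whereas `load()` holds the whole list.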
+
+## Related resources
+
+Please see the following resources for more information:
+
+* [How-to guides for document loaders](https://python.langchain.com/docs/how_to/#document-loaders)
+* [Document API reference](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html)
+* [Document loaders integrations](https://python.langchain.com/docs/integrations/document_loaders/)
diff --git a/docs/docs/concepts/embedding_models.mdx b/docs/docs/concepts/embedding_models.mdx
new file mode 100644
index 0000000000000..978188421c6fd
--- /dev/null
+++ b/docs/docs/concepts/embedding_models.mdx
@@ -0,0 +1,130 @@
+# Embedding models
+
+
+:::info[Prerequisites]
+
+* [Documents](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html)
+
+:::
+
+:::info[Note]
+This conceptual overview focuses on text-based embedding models.
+
+Embedding models can also be [multimodal](/docs/concepts/multimodality), though such models are not currently supported by LangChain.
+:::
+
+Imagine being able to capture the essence of any text - a tweet, document, or book - in a single, compact representation.
+This is the power of embedding models, which lie at the heart of many retrieval systems.
+Embedding models transform human language into a format that machines can understand and compare with speed and accuracy.
+These models take text as input and produce a fixed-length array of numbers, a numerical fingerprint of the text's semantic meaning.
+Embeddings allow search systems to find relevant documents based not just on keyword matches, but on semantic understanding.
+
+## Key concepts
+
+![Conceptual Overview](/img/embeddings_concept.png)
+
+(1) **Embed text as a vector**: Embeddings transform text into a numerical vector representation.
+
+(2) **Measure similarity**: Embedding vectors can be compared using simple mathematical operations.
+
+## Embedding
+
+### Historical context
+
+The landscape of embedding models has evolved significantly over the years.
+A pivotal moment came in 2018 when Google introduced [BERT (Bidirectional Encoder Representations from Transformers)](https://www.nvidia.com/en-us/glossary/bert/).
+BERT applied transformer models to embed text as a simple vector representation, which led to unprecedented performance across various NLP tasks.
+However, BERT wasn't optimized for generating sentence embeddings efficiently.
+This limitation spurred the creation of [SBERT (Sentence-BERT)](https://www.sbert.net/examples/training/sts/README.html), which adapted the BERT architecture to generate semantically rich sentence embeddings that are easily comparable via similarity metrics like cosine similarity, dramatically reducing the computational overhead for tasks like finding similar sentences.
+Today, the embedding model ecosystem is diverse, with numerous providers offering their own implementations.
+To navigate this variety, researchers and practitioners often turn to benchmarks like the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/blog/mteb) for objective comparisons.
+
+:::info[Further reading]
+
+* See the [seminal BERT paper](https://arxiv.org/abs/1810.04805).
+* See Cameron Wolfe's [excellent review](https://cameronrwolfe.substack.com/p/the-basics-of-ai-powered-vector-search?utm_source=profile&utm_medium=reader2) of embedding models.
+* See the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/blog/mteb) leaderboard for a comprehensive overview of embedding models.
+
+:::
+
+### Interface
+
+LangChain provides a universal interface for working with embedding models, providing standard methods for common operations.
+This common interface simplifies interaction with various embedding providers through two central methods:
+
+- `embed_documents`: For embedding multiple texts (documents)
+- `embed_query`: For embedding a single text (query)
+
+This distinction is important, as some providers employ different embedding strategies for documents (which are to be searched) versus queries (the search input itself).
+To illustrate, here's a practical example using LangChain's `.embed_documents` method to embed a list of strings:
+
+```python
+from langchain_openai import OpenAIEmbeddings
+embeddings_model = OpenAIEmbeddings()
+embeddings = embeddings_model.embed_documents(
+    [
+        "Hi there!",
+        "Oh, hello!",
+        "What's your name?",
+        "My friends call me World",
+        "Hello World!",
+    ]
+)
+len(embeddings), len(embeddings[0])
+# (5, 1536)
+```
+
+For convenience, you can also use the `embed_query` method to embed a single text:
+
+```python
+query_embedding = embeddings_model.embed_query("What is the meaning of life?")
+```
+
+:::info[Further reading]
+
+* See the full list of [LangChain embedding model integrations](/docs/integrations/text_embedding/).
+* See these [how-to guides](/docs/how_to/embed_text) for working with embedding models.
+
+:::
+
+### Integrations
+
+LangChain offers many embedding model integrations, which you can find on the [embedding models integrations page](/docs/integrations/text_embedding/).
+
+## Measure similarity
+
+Each embedding is essentially a set of coordinates, often in a high-dimensional space.
+In this space, the position of each point (embedding) reflects the meaning of its corresponding text.
+Just as similar words might be close to each other in a thesaurus, similar concepts end up close to each other in this embedding space.
+This allows for intuitive comparisons between different pieces of text.
+By reducing text to these numerical representations, we can use simple mathematical operations to quickly measure how alike two pieces of text are, regardless of their original length or structure.
+Some common similarity metrics include:
+
+- **Cosine Similarity**: Measures the cosine of the angle between two vectors.
+- **Euclidean Distance**: Measures the straight-line distance between two points.
+- **Dot Product**: Measures the projection of one vector onto another.
+
+The similarity metric should be chosen based on the model.
+As an example, [OpenAI suggests cosine similarity for their embeddings](https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use), which can be easily implemented:
+
+```python
+import numpy as np
+
+def cosine_similarity(vec1, vec2):
+    dot_product = np.dot(vec1, vec2)
+    norm_vec1 = np.linalg.norm(vec1)
+    norm_vec2 = np.linalg.norm(vec2)
+    return dot_product / (norm_vec1 * norm_vec2)
+
+similarity = cosine_similarity(query_embedding, embeddings[0])
+print("Cosine Similarity:", similarity)
+```
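+
+As a further illustration of how the three metrics listed above relate, the following sketch computes them in plain Python on two toy vectors. The vectors are made-up numbers rather than model output, and the helper names are introduced only for this example. It checks two standard identities: for unit-length vectors the dot product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`.
+
+```python
+import math
+
+
+def dot(a, b):
+    return sum(x * y for x, y in zip(a, b))
+
+
+def norm(a):
+    return math.sqrt(dot(a, a))
+
+
+def cosine_similarity(a, b):
+    return dot(a, b) / (norm(a) * norm(b))
+
+
+def euclidean_distance(a, b):
+    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
+
+
+def normalize(a):
+    n = norm(a)
+    return [x / n for x in a]
+
+
+# Toy 3-dimensional "embeddings" (made-up numbers, not model output).
+u = normalize([1.0, 2.0, 2.0])
+v = normalize([2.0, 1.0, 2.0])
+
+cos = cosine_similarity(u, v)
+# For unit-length vectors, the dot product equals the cosine similarity ...
+assert math.isclose(dot(u, v), cos)
+# ... and the squared Euclidean distance equals 2 - 2 * cosine similarity.
+assert math.isclose(euclidean_distance(u, v) ** 2, 2 - 2 * cos)
+```
+
+One consequence is that when embeddings are normalized to unit length, the three metrics rank neighbors in the same order, so the choice mostly matters for models whose vectors are not normalized.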
+
+:::info[Further reading]
+
+* See Simon Willison’s [nice blog post and video](https://simonwillison.net/2023/Oct/23/embeddings/) on embeddings and similarity metrics.
+* See [this documentation](https://developers.google.com/machine-learning/clustering/dnn-clustering/supervised-similarity) from Google on similarity metrics to consider with embeddings.
+* See Pinecone's [blog post](https://www.pinecone.io/learn/vector-similarity/) on similarity metrics.
+* See OpenAI's [FAQ](https://platform.openai.com/docs/guides/embeddings/faq) on what similarity metric to use with OpenAI embeddings.
+
+:::
diff --git a/docs/docs/concepts/evaluation.mdx b/docs/docs/concepts/evaluation.mdx
new file mode 100644
index 0000000000000..274ef98367cbd
--- /dev/null
+++ b/docs/docs/concepts/evaluation.mdx
@@ -0,0 +1,17 @@
+# Evaluation
+
+
+Evaluation is the process of assessing the performance and effectiveness of your LLM-powered applications.
+It involves testing the model's responses against a set of predefined criteria or benchmarks to ensure it meets the desired quality standards and fulfills the intended purpose.
+This process is vital for building reliable applications.
+
+![](/img/langsmith_evaluate.png)
+
+[LangSmith](https://docs.smith.langchain.com/) helps with this process in a few ways:
+
+- It makes it easier to create and curate datasets via its tracing and annotation features
+- It provides an evaluation framework that helps you define metrics and run your app against your dataset
+- It allows you to track results over time and automatically run your evaluators on a schedule or as part of CI/CD
+
+To learn more, check out [this LangSmith guide](https://docs.smith.langchain.com/concepts/evaluation).
+
diff --git a/docs/docs/concepts/example_selectors.mdx b/docs/docs/concepts/example_selectors.mdx
new file mode 100644
index 0000000000000..32dad8c5fa443
--- /dev/null
+++ b/docs/docs/concepts/example_selectors.mdx
@@ -0,0 +1,20 @@
+# Example selectors
+
+:::note Prerequisites
+
+- [Chat models](/docs/concepts/chat_models/)
+- [Few-shot prompting](/docs/concepts/few_shot_prompting/)
+:::
+
+## Overview
+
+One common prompting technique for achieving better performance is to include examples as part of the prompt. This is known as [few-shot prompting](/docs/concepts/few_shot_prompting).
+
+This gives the [language model](/docs/concepts/chat_models/) concrete examples of how it should behave.
+Sometimes these examples are hardcoded into the prompt, but for more advanced situations it may be nice to dynamically select them.
+
+**Example Selectors** are classes responsible for selecting and then formatting examples into prompts.
+
+## Related resources
+
+* [Example selector how-to guides](/docs/how_to/#example-selectors)
\ No newline at end of file
diff --git a/docs/docs/concepts/few_shot_prompting.mdx b/docs/docs/concepts/few_shot_prompting.mdx
new file mode 100644
index 0000000000000..b7147addea25c
--- /dev/null
+++ b/docs/docs/concepts/few_shot_prompting.mdx
@@ -0,0 +1,85 @@
+# Few-shot prompting
+
+:::note Prerequisites
+
+- [Chat models](/docs/concepts/chat_models/)
+:::
+
+## Overview
+
+One of the most effective ways to improve model performance is to give a model examples of
+what you want it to do. The technique of adding example inputs and expected outputs
+to a model prompt is known as "few-shot prompting". The technique is based on the
+[Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165) paper.
+There are a few things to think about when doing few-shot prompting:
+
+1. How are examples generated?
+2. How many examples are in each prompt?
+3. How are examples selected at runtime?
+4. How are examples formatted in the prompt?
+
+Here are the considerations for each.
+
+## 1. Generating examples
+
+The first and most important step of few-shot prompting is coming up with a good dataset of examples. Good examples should be relevant at runtime, clear, informative, and provide information that was not already known to the model.
+
+At a high level, the basic ways to generate examples are:
+- Manual: a person/people generates examples they think are useful.
+- Better model: a better (presumably more expensive/slower) model's responses are used as examples for a worse (presumably cheaper/faster) model.
+- User feedback: users (or labelers) leave feedback on interactions with the application and examples are generated based on that feedback (for example, all interactions with positive feedback could be turned into examples).
+- LLM feedback: same as user feedback but the process is automated by having models evaluate themselves.
+
+Which approach is best depends on your task. For tasks where a small number of core principles need to be understood really well, it can be valuable to hand-craft a few really good examples.
+For tasks where the space of correct behaviors is broader and more nuanced, it can be useful to generate many examples in a more automated fashion so that there's a higher likelihood of there being some highly relevant examples for any runtime input.
+
+**Single-turn vs. multi-turn examples**
+
+Another dimension to think about when generating examples is what the example is actually showing.
+
+The simplest types of examples just have a user input and an expected model output. These are single-turn examples.
+
+A more complex type of example is one where the example is an entire conversation, usually in which a model initially responds incorrectly and a user then tells the model how to correct its answer.
+This is called a multi-turn example. Multi-turn examples can be useful for more nuanced tasks where it's useful to show common errors and spell out exactly why they're wrong and what should be done instead.
+
+## 2. Number of examples
+
+Once we have a dataset of examples, we need to think about how many examples should be in each prompt.
+The key tradeoff is that more examples generally improve performance, but larger prompts increase costs and latency.
+Beyond some threshold, having too many examples can start to confuse the model.
+Finding the right number of examples is highly dependent on the model, the task, the quality of the examples, and your cost and latency constraints.
+Anecdotally, the better the model is, the fewer examples it needs to perform well and the more quickly you hit steeply diminishing returns on adding more examples.
+But the best (and really the only) way to reliably answer this question is to run some experiments with different numbers of examples.
+
+## 3. Selecting examples
+
+Assuming we are not adding our entire example dataset into each prompt, we need to have a way of selecting examples from our dataset based on a given input. We can do this:
+- Randomly
+- By (semantic or keyword-based) similarity of the inputs
+- Based on some other constraints, like token size
+
+LangChain has a number of [`ExampleSelectors`](/docs/concepts/example_selectors) which make it easy to use any of these techniques.
+
+Generally, selecting by semantic similarity leads to the best model performance. But how much this matters is again model- and task-specific, and is something worth experimenting with.
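+
+As a rough sketch of similarity-based selection, the snippet below ranks examples by keyword overlap with the incoming query and keeps the top `k`. It does not use LangChain's `ExampleSelectors`; `keyword_overlap` and `select_examples` are hypothetical helpers, and Jaccard overlap on whitespace-split words is just one simple choice of keyword similarity.
+
+```python
+def keyword_overlap(a, b):
+    """Jaccard similarity between the lowercase word sets of two strings."""
+    words_a, words_b = set(a.lower().split()), set(b.lower().split())
+    if not words_a or not words_b:
+        return 0.0
+    return len(words_a & words_b) / len(words_a | words_b)
+
+
+def select_examples(examples, query, k=2):
+    """Return the k examples whose inputs overlap most with the query."""
+    ranked = sorted(
+        examples,
+        key=lambda ex: keyword_overlap(ex["input"], query),
+        reverse=True,
+    )
+    return ranked[:k]
+
+
+examples = [
+    {"input": "convert 3 miles to kilometers", "output": "4.83 km"},
+    {"input": "what is the capital of France", "output": "Paris"},
+    {"input": "convert 10 kilograms to pounds", "output": "22.05 lb"},
+]
+selected = select_examples(examples, "convert 5 miles to feet", k=2)
+```
+
+Here the two unit-conversion examples are selected and the geography example is dropped. Swapping `keyword_overlap` for a similarity computed on embeddings would give the semantic variant.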
+
+## 4. Formatting examples
+
+Most state-of-the-art models these days are chat models, so we'll focus on formatting examples for those. Our basic options are to insert the examples:
+- In the system prompt as a string
+- As their own messages
+
+If we insert our examples into the system prompt as a string, we'll need to make sure it's clear to the model where each example begins and which parts are the input versus output. Different models respond better to different syntaxes, like [ChatML](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/chat-markup-language), XML, TypeScript, etc.
+
+If we insert our examples as messages, where each example is represented as a sequence of Human, AI messages, we might want to also assign [names](/docs/concepts/messages) to our messages like `"example_user"` and `"example_assistant"` to make it clear that these messages correspond to different actors than the latest input message.
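+
+The two formatting options can be sketched side by side. The messages below are plain dictionaries loosely modeled on an OpenAI-style chat format, which is an assumption of this example rather than LangChain's message classes; `examples_to_messages`, `examples_to_string`, and the XML-like tags are likewise illustrative.
+
+```python
+def examples_to_messages(examples):
+    """Turn input/output pairs into named chat messages (one pair per example)."""
+    messages = []
+    for ex in examples:
+        messages.append(
+            {"role": "user", "name": "example_user", "content": ex["input"]}
+        )
+        messages.append(
+            {"role": "assistant", "name": "example_assistant", "content": ex["output"]}
+        )
+    return messages
+
+
+def examples_to_string(examples):
+    """Alternative: render the examples as one delimited block for the system prompt."""
+    parts = [
+        f"<example>\n<input>{ex['input']}</input>\n<output>{ex['output']}</output>\n</example>"
+        for ex in examples
+    ]
+    return "\n".join(parts)
+
+
+examples = [{"input": "2 + 2", "output": "4"}]
+prompt = (
+    [{"role": "system", "content": "You are a careful calculator."}]
+    + examples_to_messages(examples)
+    + [{"role": "user", "content": "3 + 5"}]
+)
+```
+
+In the message form, the `name` fields separate the example turns from the real user input at the end; in the string form, the delimiters do that job.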
+
+**Formatting tool call examples**
+
+One area where formatting examples as messages can be tricky is when our example outputs have tool calls. This is because different models have different constraints on what types of message sequences are allowed when any tool calls are generated.
+- Some models require that any AIMessage with tool calls be immediately followed by ToolMessages for every tool call.
+- Some models additionally require that any ToolMessages be immediately followed by an AIMessage before the next HumanMessage.
+- Some models require that tools are passed in to the model if there are any tool calls / ToolMessages in the chat history.
+
+These requirements are model-specific and should be checked for the model you are using. If your model requires ToolMessages after tool calls and/or AIMessages after ToolMessages and your examples only include expected tool calls and not the actual tool outputs, you can try adding dummy ToolMessages / AIMessages to the end of each example with generic contents to satisfy the API constraints.
+In these cases it's especially worth experimenting with inserting your examples as strings versus messages, as having dummy messages can adversely affect certain models.
+
+You can see a case study of how Anthropic and OpenAI respond to different few-shot prompting techniques on two different tool calling benchmarks [here](https://blog.langchain.dev/few-shot-prompting-to-improve-tool-calling-performance/).
diff --git a/docs/docs/concepts/index.mdx b/docs/docs/concepts/index.mdx
new file mode 100644
index 0000000000000..689db4b06a394
--- /dev/null
+++ b/docs/docs/concepts/index.mdx
@@ -0,0 +1,89 @@
+# Conceptual guide
+
+This guide provides explanations of the key concepts behind the LangChain framework and AI applications more broadly.
+
+We recommend that you go through at least one of the [Tutorials](/docs/tutorials) before diving into the conceptual guide. This will provide practical context that will make it easier to understand the concepts discussed here.
+
+The conceptual guide does not cover step-by-step instructions or specific implementation examples — those are found in the [How-to guides](/docs/how_to/) and [Tutorials](/docs/tutorials). For detailed reference material, please see the [API reference](https://python.langchain.com/api_reference/).
+
+## High level
+
+- **[Why LangChain?](/docs/concepts/why_langchain)**: Overview of the value that LangChain provides.
+- **[Architecture](/docs/concepts/architecture)**: How packages are organized in the LangChain ecosystem.
+
+## Concepts
+
+- **[Chat models](/docs/concepts/chat_models)**: LLMs exposed via a chat API that process sequences of messages as input and output a message.
+- **[Messages](/docs/concepts/messages)**: The unit of communication in chat models, used to represent model input and output.
+- **[Chat history](/docs/concepts/chat_history)**: A conversation represented as a sequence of messages, alternating between user messages and model responses.
+- **[Tools](/docs/concepts/tools)**: A function with an associated schema defining the function's name, description, and the arguments it accepts.
+- **[Tool calling](/docs/concepts/tool_calling)**: A type of chat model API that accepts tool schemas, along with messages, as input and returns invocations of those tools as part of the output message.
+- **[Structured output](/docs/concepts/structured_outputs)**: A technique to make a chat model respond in a structured format, such as JSON that matches a given schema.
+- **[Memory](https://langchain-ai.github.io/langgraph/concepts/memory/)**: Information about a conversation that is persisted so that it can be used in future conversations.
+- **[Multimodality](/docs/concepts/multimodality)**: The ability to work with data that comes in different forms, such as text, audio, images, and video.
+- **[Runnable interface](/docs/concepts/runnables)**: The base abstraction that many LangChain components and the LangChain Expression Language are built on.
+- **[LangChain Expression Language (LCEL)](/docs/concepts/lcel)**: A syntax for orchestrating LangChain components. Most useful for simpler applications.
+- **[Document loaders](/docs/concepts/document_loaders)**: Load a source as a list of documents.
+- **[Retrieval](/docs/concepts/retrieval)**: Information retrieval systems can retrieve structured or unstructured data from a datasource in response to a query.
+- **[Text splitters](/docs/concepts/text_splitters)**: Split long text into smaller chunks that can be individually indexed to enable granular retrieval.
+- **[Embedding models](/docs/concepts/embedding_models)**: Models that represent data such as text or images in a vector space.
+- **[Vector stores](/docs/concepts/vectorstores)**: Storage of and efficient search over vectors and associated metadata.
+- **[Retriever](/docs/concepts/retrievers)**: A component that returns relevant documents from a knowledge base in response to a query.
+- **[Retrieval Augmented Generation (RAG)](/docs/concepts/rag)**: A technique that enhances language models by combining them with external knowledge bases.
+- **[Agents](/docs/concepts/agents)**: Use a [language model](/docs/concepts/chat_models) to choose a sequence of actions to take. Agents can interact with external resources via [tools](/docs/concepts/tools).
+- **[Prompt templates](/docs/concepts/prompt_templates)**: Component for factoring out the static parts of a model "prompt" (usually a sequence of messages). Useful for serializing, versioning, and reusing these static parts.
+- **[Output parsers](/docs/concepts/output_parsers)**: Responsible for taking the output of a model and transforming it into a more suitable format for downstream tasks. Output parsers were primarily useful prior to the general availability of [tool calling](/docs/concepts/tool_calling) and [structured outputs](/docs/concepts/structured_outputs).
+- **[Few-shot prompting](/docs/concepts/few_shot_prompting)**: A technique for improving model performance by providing a few examples of the task to perform in the prompt.
+- **[Example selectors](/docs/concepts/example_selectors)**: Used to select the most relevant examples from a dataset based on a given input. Example selectors are used in few-shot prompting to select examples for a prompt.
+- **[Async programming](/docs/concepts/async)**: The basics that one should know to use LangChain in an asynchronous context.
+- **[Callbacks](/docs/concepts/callbacks)**: Callbacks enable the execution of custom auxiliary code in built-in components. Callbacks are used to stream outputs from LLMs in LangChain, trace the intermediate steps of an application, and more.
+- **[Tracing](/docs/concepts/tracing)**: The process of recording the steps that an application takes to go from input to output. Tracing is essential for debugging and diagnosing issues in complex applications.
+- **[Evaluation](/docs/concepts/evaluation)**: The process of assessing the performance and effectiveness of AI applications. This involves testing the model's responses against a set of predefined criteria or benchmarks to ensure it meets the desired quality standards and fulfills the intended purpose. This process is vital for building reliable applications.
+
+## Glossary
+
+- **[AIMessageChunk](/docs/concepts/messages#aimessagechunk)**: A partial response from an AI message. Used when streaming responses from a chat model.
+- **[AIMessage](/docs/concepts/messages#aimessage)**: Represents a complete response from an AI model.
+- **[astream_events](/docs/concepts/chat_models#key-methods)**: Stream granular information from [LCEL](/docs/concepts/lcel) chains.
+- **[BaseTool](/docs/concepts/tools#basetool)**: The base class for all tools in LangChain.
+- **[batch](/docs/concepts/runnables)**: Use to execute a Runnable with a batch of inputs.
+- **[bind_tools](/docs/concepts/chat_models#bind-tools)**: Allows models to interact with tools.
+- **[Caching](/docs/concepts/chat_models#caching)**: Storing results to avoid redundant calls to a chat model.
+- **[Chat models](/docs/concepts/multimodality#chat-models)**: Chat models that handle multiple data modalities.
+- **[Configurable runnables](/docs/concepts/runnables#configurable-Runnables)**: Creating configurable Runnables.
+- **[Context window](/docs/concepts/chat_models#context-window)**: The maximum size of input a chat model can process.
+- **[Conversation patterns](/docs/concepts/chat_history#conversation-patterns)**: Common patterns in chat interactions.
+- **[Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html)**: LangChain's representation of a document.
+- **[Embedding models](/docs/concepts/multimodality#embedding-models)**: Models that generate vector embeddings for various data types.
+- **[HumanMessage](/docs/concepts/messages#humanmessage)**: Represents a message from a human user.
+- **[InjectedState](/docs/concepts/tools#injectedstate)**: A state injected into a tool function.
+- **[InjectedStore](/docs/concepts/tools#injectedstore)**: A store that can be injected into a tool for data persistence.
+- **[InjectedToolArg](/docs/concepts/tools#injectedtoolarg)**: Mechanism to inject arguments into tool functions.
+- **[input and output types](/docs/concepts/runnables#input-and-output-types)**: Types used for input and output in Runnables.
+- **[Integration packages](/docs/concepts/architecture#partner-packages)**: Third-party packages that integrate with LangChain.
+- **[invoke](/docs/concepts/runnables)**: A standard method to invoke a Runnable.
+- **[JSON mode](/docs/concepts/structured_outputs#json-mode)**: Returning responses in JSON format.
+- **[langchain-community](/docs/concepts/architecture#langchain-community)**: Community-driven components for LangChain.
+- **[langchain-core](/docs/concepts/architecture#langchain-core)**: Core langchain package. Includes base interfaces and in-memory implementations.
+- **[langchain](/docs/concepts/architecture#langchain)**: A package for higher level components (e.g., some pre-built chains).
+- **[langgraph](/docs/concepts/architecture#langgraph)**: Powerful orchestration layer for LangChain. Use to build complex pipelines and workflows.
+- **[langserve](/docs/concepts/architecture#langserve)**: Use to deploy LangChain Runnables as REST endpoints. Uses FastAPI. Works primarily for LangChain Runnables, does not currently integrate with LangGraph.
+- **[Managing chat history](/docs/concepts/chat_history#managing-chat-history)**: Techniques to maintain and manage the chat history.
+- **[OpenAI format](/docs/concepts/messages#openai-format)**: OpenAI's message format for chat models.
+- **[Propagation of RunnableConfig](/docs/concepts/runnables#propagation-RunnableConfig)**: Propagating configuration through Runnables. Read if working with Python 3.9 or 3.10 and async.
+- **[rate-limiting](/docs/concepts/chat_models#rate-limiting)**: Client side rate limiting for chat models.
+- **[RemoveMessage](/docs/concepts/messages#remove-message)**: An abstraction used to remove a message from chat history, used primarily in LangGraph.
+- **[role](/docs/concepts/messages#role)**: Represents the role (e.g., user, assistant) of a chat message.
+- **[RunnableConfig](/docs/concepts/runnables#RunnableConfig)**: Use to pass run time information to Runnables (e.g., `run_name`, `run_id`, `tags`, `metadata`, `max_concurrency`, `recursion_limit`, `configurable`).
+- **[Standard parameters for chat models](/docs/concepts/chat_models#standard-parameters)**: Parameters such as API key, `temperature`, and `max_tokens`.
+- **[stream](/docs/concepts/streaming)**: Use to stream output from a Runnable or a graph.
+- **[Tokenization](/docs/concepts/tokens)**: The process of converting data into tokens and vice versa.
+- **[Tokens](/docs/concepts/tokens)**: The basic unit that a language model reads, processes, and generates under the hood.
+- **[Tool artifacts](/docs/concepts/tools#tool-artifacts)**: Add artifacts to the output of a tool that will not be sent to the model, but will be available for downstream processing.
+- **[Tool binding](/docs/concepts/tool_calling#tool-binding)**: Binding tools to models.
+- **[@tool](/docs/concepts/tools#@tool)**: Decorator for creating tools in LangChain.
+- **[Toolkits](/docs/concepts/tools#toolkits)**: A collection of tools that can be used together.
+- **[ToolMessage](/docs/concepts/messages#toolmessage)**: Represents a message that contains the results of a tool execution.
+- **[Vector stores](/docs/concepts/vectorstores)**: Datastores specialized for storing and efficiently searching vector embeddings.
+- **[with_structured_output](/docs/concepts/chat_models#with-structured-output)**: A helper method for chat models that natively support [tool calling](/docs/concepts/tool_calling) to get structured output matching a given schema specified via Pydantic, JSON schema or a function.
+- **[with_types](/docs/concepts/runnables#with_types)**: Method to overwrite the input and output types of a runnable. Useful when working with complex LCEL chains and deploying with LangServe.
diff --git a/docs/docs/concepts/key_value_stores.mdx b/docs/docs/concepts/key_value_stores.mdx
new file mode 100644
index 0000000000000..d8503dbc09360
--- /dev/null
+++ b/docs/docs/concepts/key_value_stores.mdx
@@ -0,0 +1,38 @@
+# Key-value stores
+
+## Overview
+
+LangChain provides a key-value store interface for storing and retrieving data.
+
+LangChain includes a [`BaseStore`](https://python.langchain.com/api_reference/core/stores/langchain_core.stores.BaseStore.html) interface,
+which allows for storage of arbitrary data. However, LangChain components that require KV-storage accept a
+more specific `BaseStore[str, bytes]` instance that stores binary data (referred to as a `ByteStore`), and internally take care of
+encoding and decoding data for their specific needs.
+
+This means that as a user, you only need to think about one type of store rather than different ones for different types of data.
+
+## Usage
+
+The key-value store interface in LangChain is used primarily for:
+
+1. Caching [embeddings](/docs/concepts/embedding_models) via [CacheBackedEmbeddings](https://python.langchain.com/api_reference/langchain/embeddings/langchain.embeddings.cache.CacheBackedEmbeddings.html#langchain.embeddings.cache.CacheBackedEmbeddings) to avoid recomputing embeddings for repeated queries or when re-indexing content.
+
+2. As a simple [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) persistence layer in some retrievers.
+
+Please see these how-to guides for more information:
+
+* [How to cache embeddings guide](https://python.langchain.com/docs/how_to/caching_embeddings/).
+* [How to retrieve using multiple vectors per document](https://python.langchain.com/docs/how_to/custom_retriever/).
+
+## Interface
+
+All [`BaseStores`](https://python.langchain.com/api_reference/core/stores/langchain_core.stores.BaseStore.html) support the following interface. Note that the interface allows for modifying **multiple** key-value pairs at once:
+
+- `mget(keys: Sequence[str]) -> List[Optional[bytes]]`: get the contents of multiple keys, returning `None` for any key that does not exist
+- `mset(key_value_pairs: Sequence[Tuple[str, bytes]]) -> None`: set the contents of multiple keys
+- `mdelete(keys: Sequence[str]) -> None`: delete multiple keys
+- `yield_keys(prefix: Optional[str] = None) -> Iterator[str]`: yield all keys in the store, optionally filtering by a prefix
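+
+As a minimal sketch of this interface, the snippet below uses the in-memory `InMemoryByteStore` from `langchain_core.stores`. The keys and values are made up for illustration, and a persistent integration would presumably be swapped in for real use.
+
+```python
+from langchain_core.stores import InMemoryByteStore
+
+store = InMemoryByteStore()
+
+# Set several key-value pairs in one call; values are raw bytes.
+store.mset([("doc:1", b"first"), ("doc:2", b"second"), ("tmp:1", b"scratch")])
+
+# Missing keys come back as None rather than raising.
+values = store.mget(["doc:1", "missing"])
+
+# Delete several keys at once, then list what remains under a prefix.
+store.mdelete(["tmp:1"])
+doc_keys = sorted(store.yield_keys(prefix="doc:"))
+```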
+
+## Integrations
+
+Please reference the [stores integration page](/docs/integrations/stores/) for a list of available key-value store integrations.
diff --git a/docs/docs/concepts/lcel.mdx b/docs/docs/concepts/lcel.mdx
new file mode 100644
index 0000000000000..9378ec8e92852
--- /dev/null
+++ b/docs/docs/concepts/lcel.mdx
@@ -0,0 +1,221 @@
+# LangChain Expression Language (LCEL)
+
+:::info Prerequisites
+* [Runnable Interface](/docs/concepts/runnables)
+:::
+
+The **L**ang**C**hain **E**xpression **L**anguage (LCEL) takes a [declarative](https://en.wikipedia.org/wiki/Declarative_programming) approach to building new [Runnables](/docs/concepts/runnables) from existing Runnables.
+
+This means that you describe what you want to happen, rather than how you want it to happen, allowing LangChain to optimize the run-time execution of the chains.
+
+We often refer to a `Runnable` created using LCEL as a "chain". It's important to remember that a "chain" is a `Runnable` and it implements the full [Runnable Interface](/docs/concepts/runnables).
+
+:::note
+* The [LCEL cheatsheet](https://python.langchain.com/docs/how_to/lcel_cheatsheet/) shows common patterns that involve the Runnable interface and LCEL expressions.
+* Please see the following list of [how-to guides](/docs/how_to/#langchain-expression-language-lcel) that cover common tasks with LCEL.
+* A list of built-in `Runnables` can be found in the [LangChain Core API Reference](https://python.langchain.com/api_reference/core/runnables.html). Many of these Runnables are useful when composing custom "chains" in LangChain using LCEL.
+:::
+
+## Benefits of LCEL
+
+LangChain optimizes the run-time execution of chains built with LCEL in a number of ways:
+
+- **Optimize parallel execution**: Run Runnables in parallel using [RunnableParallel](#runnableparallel) or run multiple inputs through a given chain in parallel using the [Runnable Batch API](/docs/concepts/runnables#batch). Parallel execution can significantly reduce latency because work is done concurrently instead of sequentially.
+- **Guarantee Async support**: Any chain built with LCEL can be run asynchronously using the [Runnable Async API](/docs/concepts/runnables#async-api). This can be useful when running chains in a server environment where you want to handle a large number of requests concurrently.
+- **Simplify streaming**: LCEL chains can be streamed, allowing for incremental output as the chain is executed. LangChain can optimize the streaming of the output to minimize the time-to-first-token (the time elapsed until the first chunk of output from a [chat model](/docs/concepts/chat_models) or [llm](/docs/concepts/llms) comes out).
+
+Other benefits include:
+
+- [**Seamless LangSmith tracing**](https://docs.smith.langchain.com): As your chains get more and more complex, it becomes increasingly important to understand what exactly is happening at every step. With LCEL, **all** steps are automatically logged to [LangSmith](https://docs.smith.langchain.com/) for maximum observability and debuggability.
+- **Standard API**: Because all chains are built using the Runnable interface, they can be used in the same way as any other Runnable.
+- [**Deployable with LangServe**](/docs/concepts/architecture#langserve): Chains built with LCEL can be deployed using LangServe for production use.
+
+## Should I use LCEL?
+
+LCEL is an [orchestration solution](https://en.wikipedia.org/wiki/Orchestration_(computing)) -- it allows LangChain to handle run-time execution of chains in an optimized way.
+
+While we have seen users run chains with hundreds of steps in production, we generally recommend using LCEL for simpler orchestration tasks. When the application requires complex state management, branching, cycles or multiple agents, we recommend that users take advantage of [LangGraph](/docs/concepts/architecture#langgraph).
+
+In LangGraph, users define graphs that specify the flow of the application. This allows users to keep using LCEL within individual nodes when LCEL is needed, while making it easy to define complex orchestration logic that is more readable and maintainable.
+
+Here are some guidelines:
+
+* If you are making a single LLM call, you don't need LCEL; instead call the underlying [chat model](/docs/concepts/chat_models) directly.
+* If you have a simple chain (e.g., prompt + llm + parser, simple retrieval set up etc.), LCEL is a reasonable fit, if you're taking advantage of the LCEL benefits.
+* If you're building a complex chain (e.g., with branching, cycles, multiple agents, etc.) use [LangGraph](/docs/concepts/architecture#langgraph) instead. Remember that you can always use LCEL within individual nodes in LangGraph.
+
+## Composition Primitives
+
+`LCEL` chains are built by composing existing `Runnables` together. The two main composition primitives are [RunnableSequence](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.RunnableSequence.html#langchain_core.runnables.base.RunnableSequence) and [RunnableParallel](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.RunnableParallel.html#langchain_core.runnables.base.RunnableParallel).
+
+Many other composition primitives (e.g., [RunnableAssign](
+https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.passthrough.RunnableAssign.html#langchain_core.runnables.passthrough.RunnableAssign
+)) can be thought of as variations of these two primitives.
+
+:::note
+You can find a list of all composition primitives in the [LangChain Core API Reference](https://python.langchain.com/api_reference/core/runnables.html).
+:::
+
+### RunnableSequence
+
+`RunnableSequence` is a composition primitive that allows you to "chain" multiple runnables sequentially, with the output of one runnable serving as the input to the next.
+
+```python
+from langchain_core.runnables import RunnableSequence
+chain = RunnableSequence(runnable1, runnable2)
+```
+
+Invoking the `chain` with some input:
+
+```python
+final_output = chain.invoke(some_input)
+```
+
+corresponds to the following:
+
+```python
+output1 = runnable1.invoke(some_input)
+final_output = runnable2.invoke(output1)
+```
+
+:::note
+`runnable1` and `runnable2` are placeholders for any `Runnable` that you want to chain together.
+:::
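+
+As a concrete sketch, the placeholders can be filled with two small `RunnableLambda` objects. The names `add_one` and `double` are hypothetical, and the steps are passed through the `first` and `last` keyword arguments here.
+
+```python
+from langchain_core.runnables import RunnableLambda, RunnableSequence
+
+add_one = RunnableLambda(lambda x: x + 1)
+double = RunnableLambda(lambda x: x * 2)
+
+chain = RunnableSequence(first=add_one, last=double)
+
+# Equivalent to double.invoke(add_one.invoke(3))
+final_output = chain.invoke(3)
+
+# Like any Runnable, the chain also supports batching over several inputs.
+batch_output = chain.batch([1, 2, 3])
+```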
+
+### RunnableParallel
+
+`RunnableParallel` is a composition primitive that allows you to run multiple runnables concurrently, with the same input provided to each.
+
+```python
+from langchain_core.runnables import RunnableParallel
+chain = RunnableParallel({
+ "key1": runnable1,
+ "key2": runnable2,
+})
+```
+
+Invoking the `chain` with some input:
+
+```python
+final_output = chain.invoke(some_input)
+```
+
+This yields a `final_output` dictionary with the same keys as the dictionary passed to `RunnableParallel`, with each value replaced by the output of the corresponding runnable:
+
+```python
+{
+ "key1": runnable1.invoke(some_input),
+ "key2": runnable2.invoke(some_input),
+}
+```
+
+Recall that the runnables are executed in parallel, so while the result is the same as the
+dictionary literal shown above, the execution time can be much shorter.
+
+:::note
+`RunnableParallel` supports both synchronous and asynchronous execution (as all `Runnables` do).
+
+* For synchronous execution, `RunnableParallel` uses a [ThreadPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor) to run the runnables concurrently.
+* For asynchronous execution, `RunnableParallel` uses [asyncio.gather](https://docs.python.org/3/library/asyncio.html#asyncio.gather) to run the runnables concurrently.
+:::
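+
+A minimal sketch with two hypothetical `RunnableLambda` branches shows the shape of the result. Both branches receive the same input, and the same chain can be called synchronously or asynchronously.
+
+```python
+import asyncio
+
+from langchain_core.runnables import RunnableLambda, RunnableParallel
+
+chain = RunnableParallel({
+    "plus_one": RunnableLambda(lambda x: x + 1),
+    "times_two": RunnableLambda(lambda x: x * 2),
+})
+
+# Synchronous call: the branches are run on a thread pool.
+sync_output = chain.invoke(3)
+
+# Asynchronous call: the branches are awaited concurrently.
+async_output = asyncio.run(chain.ainvoke(3))
+```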
+
+## Composition Syntax
+
+The usage of `RunnableSequence` and `RunnableParallel` is so common that we created a shorthand syntax for using them. This helps
+to make the code more readable and concise.
+
+### The `|` operator
+
+We have [overloaded](https://docs.python.org/3/reference/datamodel.html#special-method-names) the `|` operator to create a `RunnableSequence` from two `Runnables`.
+
+```python
+chain = runnable1 | runnable2
+```
+
+is equivalent to:
+
+```python
+chain = RunnableSequence(runnable1, runnable2)
+```
+
+### The `.pipe` method
+
+If you have moral qualms with operator overloading, you can use the `.pipe` method instead. This is equivalent to the `|` operator.
+
+```python
+chain = runnable1.pipe(runnable2)
+```
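+
+The two spellings build the same kind of object, which can be checked with a small sketch (the `RunnableLambda` steps are illustrative):
+
+```python
+from langchain_core.runnables import RunnableLambda, RunnableSequence
+
+add_one = RunnableLambda(lambda x: x + 1)
+double = RunnableLambda(lambda x: x * 2)
+
+piped = add_one | double
+method = add_one.pipe(double)
+
+# Both forms produce a RunnableSequence with the same behavior.
+same_type = isinstance(piped, RunnableSequence) and isinstance(method, RunnableSequence)
+results = (piped.invoke(3), method.invoke(3))
+```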
+
+### Coercion
+
+LCEL applies automatic type coercion to make it easier to compose chains.
+
+If you do not understand the type coercion, you can always use the `RunnableSequence` and `RunnableParallel` classes directly.
+
+This will make the code more verbose, but it will also make it more explicit.
+
+#### Dictionary to RunnableParallel
+
+Inside an LCEL expression, a dictionary is automatically converted to a `RunnableParallel`.
+
+For example, the following code:
+
+```python
+mapping = {
+ "key1": runnable1,
+ "key2": runnable2,
+}
+
+chain = mapping | runnable3
+```
+
+It gets automatically converted to the following:
+
+```python
+chain = RunnableSequence(RunnableParallel(mapping), runnable3)
+```
+
+:::caution
+You have to be careful because the `mapping` dictionary is not a `RunnableParallel` object, it is just a dictionary. This means that the following code will raise an `AttributeError`:
+
+```python
+mapping.invoke(some_input)
+```
+:::
+
+#### Function to RunnableLambda
+
+Inside an LCEL expression, a function is automatically converted to a `RunnableLambda`.
+
+```python
+def some_func(x):
+ return x
+
+chain = some_func | runnable1
+```
+
+It gets automatically converted to the following:
+
+```python
+chain = RunnableSequence(RunnableLambda(some_func), runnable1)
+```
+
+:::caution
+You have to be careful because the lambda function is not a `RunnableLambda` object, it is just a function. This means that the following code will raise an `AttributeError`:
+
+```python
+(lambda x: x + 1).invoke(some_input)
+```
+:::
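+
+Putting both coercions together, a sketch with hypothetical helper functions might look like the following. Note that coercion needs at least one operand of `|` to already be a `Runnable`; the explicitly wrapped `RunnableLambda` at the end of the chain plays that role here.
+
+```python
+from langchain_core.runnables import RunnableLambda, RunnableParallel
+
+def word_count(text: str) -> int:
+    return len(text.split())
+
+def shout(text: str) -> str:
+    return text.upper()
+
+def summarize(parts: dict) -> str:
+    return f"{parts['shouted']} ({parts['words']} words)"
+
+# The dict is coerced to a RunnableParallel, and the plain functions inside it
+# are coerced to RunnableLambda objects.
+chain = {"words": word_count, "shouted": shout} | RunnableLambda(summarize)
+result = chain.invoke("hello lcel world")
+
+# To call .invoke on a dict or function directly, wrap it explicitly first.
+explicit = RunnableParallel({"words": word_count}).invoke("hello lcel world")
+```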
+
+## Legacy Chains
+
+LCEL aims to provide consistency around behavior and customization over legacy subclassed chains such as `LLMChain` and
+`ConversationalRetrievalChain`. Many of these legacy chains hide important details like prompts, and as a wider variety
+of viable models emerge, customization has become more and more important.
+
+If you are currently using one of these legacy chains, please see [this guide for guidance on how to migrate](/docs/versions/migrating_chains).
+
+For guides on how to do specific tasks with LCEL, check out [the relevant how-to guides](/docs/how_to/#langchain-expression-language-lcel).
diff --git a/docs/docs/concepts/llms.mdx b/docs/docs/concepts/llms.mdx
new file mode 100644
index 0000000000000..5e2f7d98c7256
--- /dev/null
+++ b/docs/docs/concepts/llms.mdx
@@ -0,0 +1,3 @@
+# Large language models (llms)
+
+Please see the [Chat Model Concept Guide](/docs/concepts/chat_models) page for more information.
\ No newline at end of file
diff --git a/docs/docs/concepts/messages.mdx b/docs/docs/concepts/messages.mdx
new file mode 100644
index 0000000000000..811396883af06
--- /dev/null
+++ b/docs/docs/concepts/messages.mdx
@@ -0,0 +1,244 @@
+# Messages
+
+:::info Prerequisites
+- [Chat Models](/docs/concepts/chat_models)
+:::
+
+## Overview
+
+Messages are the unit of communication in [chat models](/docs/concepts/chat_models). They are used to represent the input and output of a chat model, as well as any additional context or metadata that may be associated with a conversation.
+
+Each message has a **role** (e.g., "user", "assistant"), **content** (e.g., text, multimodal data), and additional metadata that can vary depending on the chat model provider.
+
+LangChain provides a unified message format that can be used across chat models, allowing users to work with different chat models without worrying about the specific details of the message format used by each model provider.
+
+## What is inside a message?
+
+A message typically consists of the following pieces of information:
+
+- **Role**: The role of the message (e.g., "user", "assistant").
+- **Content**: The content of the message (e.g., text, multimodal data).
+- Additional metadata: id, name, [token usage](/docs/concepts/tokens) and other model-specific metadata.
+
+### Role
+
+Roles are used to distinguish between different types of messages in a conversation and help the chat model understand how to respond to a given sequence of messages.
+
+| **Role** | **Description** |
+|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **system** | Used to tell the chat model how to behave and provide additional context. Not supported by all chat model providers. |
+| **user** | Represents input from a user interacting with the model, usually in the form of text or other interactive input. |
+| **assistant** | Represents a response from the model, which can include text or a request to invoke tools. |
+| **tool** | A message used to pass the results of a tool invocation back to the model after external data or processing has been retrieved. Used with chat models that support [tool calling](/docs/concepts/tool_calling). |
+| **function (legacy)** | This is a legacy role, corresponding to OpenAI's legacy function-calling API. **tool** role should be used instead. |
+
+### Content
+
+The content of a message is either text or a list of dictionaries representing [multimodal data](/docs/concepts/multimodality) (e.g., images, audio, video). The exact format of the content can vary between different chat model providers.
+
+Currently, most chat models support text as the primary content type, with some models also supporting multimodal data. However, support for multimodal data is still limited across most chat model providers.
+
+For more information see:
+* [HumanMessage](#humanmessage) -- for content in the input from the user.
+* [AIMessage](#aimessage) -- for content in the response from the model.
+* [Multimodality](/docs/concepts/multimodality) -- for more information on multimodal content.
+
+### Other Message Data
+
+Depending on the chat model provider, messages can include other data such as:
+
+- **ID**: An optional unique identifier for the message.
+- **Name**: An optional `name` property which allows differentiating between different entities/speakers with the same role. Not all models support this!
+- **Metadata**: Additional information about the message, such as timestamps, token usage, etc.
+- **Tool Calls**: A request made by the model to call one or more tools. See [tool calling](/docs/concepts/tool_calling) for more information.
+
+## Conversation Structure
+
+The sequence of messages into a chat model should follow a specific structure to ensure that the chat model can generate a valid response.
+
+For example, a typical conversation structure might look like this:
+
+1. **User Message**: "Hello, how are you?"
+2. **Assistant Message**: "I'm doing well, thank you for asking."
+3. **User Message**: "Can you tell me a joke?"
+4. **Assistant Message**: "Sure! Why did the scarecrow win an award? Because he was outstanding in his field!"
+
+Please read the [chat history](/docs/concepts/chat_history) guide for more information on managing chat history and ensuring that the conversation structure is correct.
+
+## LangChain Messages
+
+LangChain provides a unified message format that can be used across all chat models, allowing users to work with different chat models without worrying about the specific details of the message format used by each model provider.
+
+LangChain messages are Python objects that subclass from a [BaseMessage](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.base.BaseMessage.html).
+
+The five main message types are:
+
+- [SystemMessage](#systemmessage): corresponds to **system** role
+- [HumanMessage](#humanmessage): corresponds to **user** role
+- [AIMessage](#aimessage): corresponds to **assistant** role
+- [AIMessageChunk](#aimessagechunk): corresponds to **assistant** role, used for [streaming](/docs/concepts/streaming) responses
+- [ToolMessage](#toolmessage): corresponds to **tool** role
+
+Other important messages include:
+
+- [RemoveMessage](#removemessage) -- does not correspond to any role. This is an abstraction, mostly used in [LangGraph](/docs/concepts/architecture#langgraph) to manage chat history.
+- **Legacy** [FunctionMessage](#legacy-functionmessage): corresponds to the **function** role in OpenAI's **legacy** function-calling API.
+
+You can find more information about **messages** in the [API Reference](https://python.langchain.com/api_reference/core/messages.html).
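+
+To make the role mapping concrete, the sketch below builds a short conversation from these classes and inspects each message's `type` attribute. The message texts are placeholders.
+
+```python
+from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
+
+conversation = [
+    SystemMessage(content="You are a helpful assistant."),
+    HumanMessage(content="Hello, how are you?"),
+    AIMessage(content="I'm doing well, thank you for asking."),
+    HumanMessage(content="Can you tell me a joke?"),
+]
+
+# LangChain's own type names differ slightly from the provider-style roles:
+# "system" -> "system", "user" -> "human", "assistant" -> "ai".
+types = [message.type for message in conversation]
+```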
+
+### SystemMessage
+
+A `SystemMessage` is used to prime the behavior of the AI model and provide additional context, such as instructing the model to adopt a specific persona or setting the tone of the conversation (e.g., "This is a conversation about cooking").
+
+Different chat providers may support system messages in one of the following ways:
+
+* **Through a "system" message role**: In this case, a system message is included as part of the message sequence with the role explicitly set as "system."
+* **Through a separate API parameter for system instructions**: Instead of being included as a message, system instructions are passed via a dedicated API parameter.
+* **No support for system messages**: Some models do not support system messages at all.
+
+Most major chat model providers support system instructions via either a chat message or a separate API parameter. LangChain will automatically adapt based on the provider’s capabilities. If the provider supports a separate API parameter for system instructions, LangChain will extract the content of a system message and pass it through that parameter.
+
+If no system message is supported by the provider, in most cases LangChain will attempt to incorporate the system message's content into a HumanMessage or raise an exception if that is not possible. However, this behavior is not yet consistently enforced across all implementations, and if using a less popular implementation of a chat model (e.g., an implementation from the `langchain-community` package) it is recommended to check the specific documentation for that model.
+
+### HumanMessage
+
+The `HumanMessage` corresponds to the **"user"** role. A human message represents input from a user interacting with the model.
+
+#### Text Content
+
+Most chat models expect the user input to be in the form of text.
+
+```python
+from langchain_core.messages import HumanMessage
+
+model.invoke([HumanMessage(content="Hello, how are you?")])
+```
+
+:::tip
+When invoking a chat model with a string as input, LangChain will automatically convert the string into a `HumanMessage` object. This is mostly useful for quick testing.
+
+```python
+model.invoke("Hello, how are you?")
+```
+:::
+
+#### Multi-modal Content
+
+Some chat models accept multimodal inputs, such as images, audio, video, or files like PDFs.
+
+Please see the [multimodality](/docs/concepts/multimodality) guide for more information.
+
+### AIMessage
+
+`AIMessage` is used to represent a message with the role **"assistant"**. This is the response from the model, which can include text or a request to invoke tools. It could also include other media types like images, audio, or video -- though this is still uncommon at the moment.
+
+```python
+from langchain_core.messages import HumanMessage
+ai_message = model.invoke([HumanMessage("Tell me a joke")])
+ai_message # <-- AIMessage
+```
+
+An `AIMessage` has the following attributes. The attributes which are **standardized** are the ones that LangChain attempts to standardize across different chat model providers. **raw** fields are specific to the model provider and may vary.
+
+| Attribute | Standardized/Raw | Description |
+|----------------------|:-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `content` | Raw | Usually a string, but can be a list of content blocks. See [content](#content) for details. |
+| `tool_calls` | Standardized | Tool calls associated with the message. See [tool calling](/docs/concepts/tool_calling) for details. |
+| `invalid_tool_calls` | Standardized | Tool calls with parsing errors associated with the message. See [tool calling](/docs/concepts/tool_calling) for details. |
+| `usage_metadata` | Standardized | Usage metadata for a message, such as [token counts](/docs/concepts/tokens). See [Usage Metadata API Reference](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.ai.UsageMetadata.html) |
+| `id` | Standardized | An optional unique identifier for the message, ideally provided by the provider/model that created the message. |
+| `response_metadata` | Raw | Response metadata, e.g., response headers, logprobs, token counts. |
+
+#### content
+
+The **content** property of an `AIMessage` represents the response generated by the chat model.
+
+The content is either:
+
+- **text** -- the norm for virtually all chat models.
+- A **list of dictionaries** -- Each dictionary represents a content block and is associated with a `type`.
+ * Used by Anthropic for surfacing agent thought process when doing [tool calling](/docs/concepts/tool_calling).
+ * Used by OpenAI for audio outputs. Please see [multi-modal content](/docs/concepts/multimodality) for more information.
+
+:::important
+The **content** property is **not** standardized across different chat model providers, mostly because there are
+still few examples to generalize from.
+:::
+
+### AIMessageChunk
+
+It is common to [stream](/docs/concepts/streaming) responses from the chat model as they are being generated, so the user can see the response in real-time instead of waiting for the entire response to be generated before displaying it.
+
+An `AIMessageChunk` is the unit of such a streamed response. It is returned from the `stream`, `astream` and `astream_events` methods of the chat model.
+
+For example,
+
+```python
+for chunk in model.stream([HumanMessage("what color is the sky?")]):
+ print(chunk)
+```
+
+`AIMessageChunk` follows nearly the same structure as `AIMessage`, but uses a different [ToolCallChunk](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.tool.ToolCallChunk.html#langchain_core.messages.tool.ToolCallChunk)
+to be able to stream tool calling in a standardized manner.
+
+
+#### Aggregating
+
+`AIMessageChunks` support the `+` operator to merge them into a single `AIMessage`. This is useful when you want to display the final response to the user.
+
+```python
+ai_message = chunk1 + chunk2 + chunk3  # ...and so on for any further chunks
+```
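+
+For instance, with hand-built chunks standing in for a real stream (a sketch, not actual model output):
+
+```python
+from langchain_core.messages import AIMessageChunk
+
+chunks = [
+    AIMessageChunk(content="The sky "),
+    AIMessageChunk(content="is "),
+    AIMessageChunk(content="blue."),
+]
+
+# Fold the chunks together with `+`, as you would while consuming a stream.
+merged = chunks[0]
+for chunk in chunks[1:]:
+    merged = merged + chunk
+
+text = merged.content
+```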
+
+### ToolMessage
+
+This represents a message with role "tool", which contains the result of [calling a tool](/docs/concepts/tool_calling). In addition to `role` and `content`, this message has:
+
+- a `tool_call_id` field which conveys the id of the call to the tool that was called to produce this result.
+- an `artifact` field which can be used to pass along arbitrary artifacts of the tool execution which are useful to track but which should not be sent to the model.
+
+Please see [tool calling](/docs/concepts/tool_calling) for more information.
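+
+A hand-written sketch of the round trip follows. The tool name, arguments, and call id are invented for illustration; in practice the `AIMessage` would come from a model.
+
+```python
+from langchain_core.messages import AIMessage, ToolMessage
+
+# The model asks for a tool to be called...
+ai_message = AIMessage(
+    content="",
+    tool_calls=[{"name": "get_weather", "args": {"city": "Paris"}, "id": "call_1"}],
+)
+
+# ...and the result is sent back in a ToolMessage that references the same call id.
+# The artifact is kept alongside the message but is not meant for the model.
+tool_message = ToolMessage(
+    content="18 degrees and sunny",
+    tool_call_id=ai_message.tool_calls[0]["id"],
+    artifact={"raw_response": {"temp_c": 18, "condition": "sunny"}},
+)
+```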
+
+### RemoveMessage
+
+This is a special message type that does not correspond to any role. It is used
+for managing chat history in [LangGraph](/docs/concepts/architecture#langgraph).
+
+Please see the following for more information on how to use the `RemoveMessage`:
+
+* [Memory conceptual guide](https://langchain-ai.github.io/langgraph/concepts/memory/)
+* [How to delete messages](https://langchain-ai.github.io/langgraph/how-tos/memory/delete-messages/)
+
+### (Legacy) FunctionMessage
+
+This is a legacy message type, corresponding to OpenAI's legacy function-calling API. `ToolMessage` should be used instead to correspond to the updated tool-calling API.
+
+## OpenAI Format
+
+### Inputs
+
+Chat models also accept messages in OpenAI's format as **inputs**:
+
+```python
+chat_model.invoke([
+ {
+ "role": "user",
+ "content": "Hello, how are you?",
+ },
+ {
+ "role": "assistant",
+ "content": "I'm doing well, thank you for asking.",
+ },
+ {
+ "role": "user",
+ "content": "Can you tell me a joke?",
+ }
+])
+```
+
+### Outputs
+
+At the moment, a chat model always returns LangChain messages, so you will need to convert the output yourself if you
+also need it in the OpenAI format.
+
+The [convert_to_openai_messages](https://python.langchain.com/api_reference/core/messages/langchain_core.messages.utils.convert_to_openai_messages.html) utility function can be used to convert from LangChain messages to OpenAI format.
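+
+A short sketch of that conversion, using placeholder message texts:
+
+```python
+from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
+from langchain_core.messages.utils import convert_to_openai_messages
+
+messages = [
+    SystemMessage(content="You are a helpful assistant."),
+    HumanMessage(content="Hello, how are you?"),
+    AIMessage(content="I'm doing well, thank you for asking."),
+]
+
+# Each LangChain message becomes a dict with OpenAI-style "role" and "content" keys.
+openai_messages = convert_to_openai_messages(messages)
+roles = [message["role"] for message in openai_messages]
+```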
\ No newline at end of file
diff --git a/docs/docs/concepts/multimodality.mdx b/docs/docs/concepts/multimodality.mdx
new file mode 100644
index 0000000000000..3692e4e1ef1ef
--- /dev/null
+++ b/docs/docs/concepts/multimodality.mdx
@@ -0,0 +1,88 @@
+# Multimodality
+
+## Overview
+
+**Multimodality** refers to the ability to work with data that comes in different forms, such as text, audio, images, and video. Multimodality can appear in various components, allowing models and systems to handle and process a mix of these data types seamlessly.
+
+- **Chat Models**: These could, in theory, accept and generate multimodal inputs and outputs, handling a variety of data types like text, images, audio, and video.
+- **Embedding Models**: Embedding Models can represent multimodal content, embedding various forms of data—such as text, images, and audio—into vector spaces.
+- **Vector Stores**: Vector stores could search over embeddings that represent multimodal data, enabling retrieval across different types of information.
+
+## Multimodality in chat models
+
+:::info Prerequisites
+* [Chat models](/docs/concepts/chat_models)
+* [Messages](/docs/concepts/messages)
+:::
+
+Multimodal support is still relatively new and less common, and model providers have not yet standardized on the "best" way to define the API. As such, LangChain's multimodal abstractions are lightweight and flexible, designed to accommodate different model providers' APIs and interaction patterns, but are **not** standardized across models.
+
+### How to use multimodal models
+
+* Use the [chat model integration table](/docs/integrations/chat/) to identify which models support multimodality.
+* Reference the [relevant how-to guides](/docs/how_to/#multimodal) for specific examples of how to use multimodal models.
+
+### What kind of multimodality is supported?
+
+#### Inputs
+
+Some models can accept multimodal inputs, such as images, audio, video, or files. The types of multimodal inputs supported depend on the model provider. For instance, [Google's Gemini](https://python.langchain.com/docs/integrations/chat/google_generative_ai/) supports documents like PDFs as inputs.
+
+Most chat models that support **multimodal inputs** also accept those values in OpenAI's content blocks format. So far this is restricted to image inputs. For models like Gemini which support video and other bytes input, the APIs also support the native, model-specific representations.
+
+The gist of passing multimodal inputs to a chat model is to use content blocks that specify a type and corresponding data. For example, to pass an image to a chat model:
+
+```python
+from langchain_core.messages import HumanMessage
+
+message = HumanMessage(
+ content=[
+ {"type": "text", "text": "describe the weather in this image"},
+ {"type": "image_url", "image_url": {"url": image_url}},
+ ],
+)
+response = model.invoke([message])
+```
+
+:::caution
+The exact format of the content blocks may vary depending on the model provider. Please refer to the chat model's
+integration documentation for the correct format. Find the integration in the [chat model integration table](/docs/integrations/chat/).
+:::
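+
+When the image is only available as local bytes rather than at a URL, one common approach (assumed here, following OpenAI's content block format) is to inline it as a base64 data URL. The bytes below are a stand-in, not a real image.
+
+```python
+import base64
+
+from langchain_core.messages import HumanMessage
+
+image_bytes = b"placeholder bytes standing in for a PNG file"
+encoded = base64.b64encode(image_bytes).decode("utf-8")
+
+message = HumanMessage(
+    content=[
+        {"type": "text", "text": "describe the weather in this image"},
+        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}},
+    ],
+)
+```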
+
+#### Outputs
+
+Virtually no popular chat models support multimodal outputs at the time of writing (October 2024).
+
+The only exception is OpenAI's chat model ([gpt-4o-audio-preview](https://python.langchain.com/docs/integrations/chat/openai/)), which can generate audio outputs.
+
+Multimodal outputs will appear as part of the [AIMessage](/docs/concepts/messages/#aimessage) response object.
+
+Please see the [ChatOpenAI](/docs/integrations/chat/openai/) integration page for more information on how to use multimodal outputs.
+
+#### Tools
+
+Currently, no chat model is designed to work **directly** with multimodal data in a [tool call request](/docs/concepts/tool_calling) or [ToolMessage](/docs/concepts/tool_calling) result.
+
+However, a chat model can easily interact with multimodal data by invoking tools with references (e.g., a URL) to the multimodal data, rather than the data itself. For example, any model capable of [tool calling](/docs/concepts/tool_calling) can be equipped with tools to download and process images, audio, or video.
+
+## Multimodality in embedding models
+
+:::info Prerequisites
+* [Embedding Models](/docs/concepts/embedding_models)
+:::
+
+**Embeddings** are vector representations of data used for tasks like similarity search and retrieval.
+
+The current [embedding interface](https://python.langchain.com/api_reference/core/embeddings/langchain_core.embeddings.embeddings.Embeddings.html#langchain_core.embeddings.embeddings.Embeddings) used in LangChain is optimized entirely for text-based data, and will **not** work with multimodal data.
+
+As use cases involving multimodal search and retrieval tasks become more common, we expect to expand the embedding interface to accommodate other data types like images, audio, and video.
+
+## Multimodality in vector stores
+
+:::info Prerequisites
+* [Vectorstores](/docs/concepts/vectorstores)
+:::
+
+Vector stores are databases for storing and retrieving embeddings, which are typically used in search and retrieval tasks. Similar to embeddings, vector stores are currently optimized for text-based data.
+
+As use cases involving multimodal search and retrieval tasks become more common, we expect to expand the vector store interface to accommodate other data types like images, audio, and video.
diff --git a/docs/docs/concepts/output_parsers.mdx b/docs/docs/concepts/output_parsers.mdx
new file mode 100644
index 0000000000000..a03daea8737a5
--- /dev/null
+++ b/docs/docs/concepts/output_parsers.mdx
@@ -0,0 +1,41 @@
+# Output parsers
+
+
+
+:::note
+
+The information here refers to parsers that take a text output from a model and try to parse it into a more structured representation.
+More and more models are supporting function (or tool) calling, which handles this automatically.
+It is recommended to use function/tool calling rather than output parsing.
+See documentation for that [here](/docs/concepts/#function-tool-calling).
+
+:::
+
+An **output parser** is responsible for taking the output of a model and transforming it into a more suitable format for downstream tasks.
+Output parsers are useful when you are using LLMs to generate structured data, or to normalize output from chat models and LLMs.
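+
+As a small sketch, `JsonOutputParser` from `langchain_core.output_parsers` turns JSON text, whether a raw string or the content of a message, into a Python object. The model output below is hand-written for illustration.
+
+```python
+from langchain_core.messages import AIMessage
+from langchain_core.output_parsers import JsonOutputParser
+
+parser = JsonOutputParser()
+
+# Parse a raw string...
+from_text = parser.parse('{"setup": "Why did the scarecrow win an award?", "rating": 5}')
+
+# ...or invoke the parser on a message, as it would be used at the end of a chain.
+from_message = parser.invoke(AIMessage(content='{"answer": 42}'))
+```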
+
+LangChain supports many different types of output parsers. The table below lists them along with the following pieces of information:
+
+- **Name**: The name of the output parser
+- **Supports Streaming**: Whether the output parser supports streaming.
+- **Has Format Instructions**: Whether the output parser has format instructions. This is generally available except when (a) the desired schema is not specified in the prompt but rather in other parameters (like OpenAI function calling), or (b) when the OutputParser wraps another OutputParser.
+- **Calls LLM**: Whether this output parser itself calls an LLM. This is usually only done by output parsers that attempt to correct misformatted output.
+- **Input Type**: Expected input type. Most output parsers work on both strings and messages, but some (like OpenAI Functions) need a message with specific kwargs.
+- **Output Type**: The output type of the object returned by the parser.
+- **Description**: Our commentary on this output parser and when to use it.
+
+| Name | Supports Streaming | Has Format Instructions | Calls LLM | Input Type | Output Type | Description |
+|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------|-------------------------|-----------|--------------------|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [JSON](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.json.JSONOutputParser.html#langchain_core.output_parsers.json.JSONOutputParser) | ✅ | ✅ | | `str` \| `Message` | JSON object | Returns a JSON object as specified. You can specify a Pydantic model and it will return JSON for that model. Probably the most reliable output parser for getting structured data that does NOT use function calling. |
+| [XML](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.xml.XMLOutputParser.html#langchain_core.output_parsers.xml.XMLOutputParser) | ✅ | ✅ | | `str` \| `Message` | `dict` | Returns a dictionary of tags. Use when XML output is needed. Use with models that are good at writing XML (like Anthropic's). |
+| [CSV](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.list.CommaSeparatedListOutputParser.html#langchain_core.output_parsers.list.CommaSeparatedListOutputParser) | ✅ | ✅ | | `str` \| `Message` | `List[str]` | Returns a list of comma separated values. |
+| [OutputFixing](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.fix.OutputFixingParser.html#langchain.output_parsers.fix.OutputFixingParser) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the error message and the bad output to an LLM and ask it to fix the output. |
+| [RetryWithError](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.retry.RetryWithErrorOutputParser.html#langchain.output_parsers.retry.RetryWithErrorOutputParser) | | | ✅ | `str` \| `Message` | | Wraps another output parser. If that output parser errors, then this will pass the original inputs, the bad output, and the error message to an LLM and ask it to fix it. Compared to OutputFixingParser, this one also sends the original instructions. |
+| [Pydantic](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.pydantic.PydanticOutputParser.html#langchain_core.output_parsers.pydantic.PydanticOutputParser) | | ✅ | | `str` \| `Message` | `pydantic.BaseModel` | Takes a user defined Pydantic model and returns data in that format. |
+| [YAML](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.yaml.YamlOutputParser.html#langchain.output_parsers.yaml.YamlOutputParser) | | ✅ | | `str` \| `Message` | `pydantic.BaseModel` | Takes a user defined Pydantic model and returns data in that format. Uses YAML to encode it. |
+| [PandasDataFrame](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.pandas_dataframe.PandasDataFrameOutputParser.html#langchain.output_parsers.pandas_dataframe.PandasDataFrameOutputParser) | | ✅ | | `str` \| `Message` | `dict` | Useful for doing operations with pandas DataFrames. |
+| [Enum](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.enum.EnumOutputParser.html#langchain.output_parsers.enum.EnumOutputParser) | | ✅ | | `str` \| `Message` | `Enum` | Parses response into one of the provided enum values. |
+| [Datetime](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.datetime.DatetimeOutputParser.html#langchain.output_parsers.datetime.DatetimeOutputParser) | | ✅ | | `str` \| `Message` | `datetime.datetime` | Parses response into a `datetime.datetime` object. |
+| [Structured](https://python.langchain.com/api_reference/langchain/output_parsers/langchain.output_parsers.structured.StructuredOutputParser.html#langchain.output_parsers.structured.StructuredOutputParser) | | ✅ | | `str` \| `Message` | `Dict[str, str]` | An output parser that returns structured information. It is less powerful than other output parsers since it only allows for fields to be strings. This can be useful when you are working with smaller LLMs. |
+
+For specifics on how to use output parsers, see the [relevant how-to guides here](/docs/how_to/#output-parsers).
diff --git a/docs/docs/concepts/prompt_templates.mdx b/docs/docs/concepts/prompt_templates.mdx
new file mode 100644
index 0000000000000..b8bb74314db2d
--- /dev/null
+++ b/docs/docs/concepts/prompt_templates.mdx
@@ -0,0 +1,79 @@
+# Prompt Templates
+
+Prompt templates help to translate user input and parameters into instructions for a language model.
+This can be used to guide a model's response, helping it understand the context and generate relevant and coherent language-based output.
+
+Prompt Templates take as input a dictionary, where each key represents a variable in the prompt template to fill in.
+
+Prompt Templates output a PromptValue. This PromptValue can be passed to an LLM or a ChatModel, and can also be cast to a string or a list of messages.
+The reason this PromptValue exists is to make it easy to switch between strings and messages.
+
+There are a few different types of prompt templates:
+
+## String PromptTemplates
+
+These prompt templates are used to format a single string, and generally are used for simpler inputs.
+For example, a common way to construct and use a PromptTemplate is as follows:
+
+```python
+from langchain_core.prompts import PromptTemplate
+
+prompt_template = PromptTemplate.from_template("Tell me a joke about {topic}")
+
+prompt_template.invoke({"topic": "cats"})
+```
+
+## ChatPromptTemplates
+
+These prompt templates are used to format a list of messages. These "templates" consist of a list of templates themselves.
+For example, a common way to construct and use a ChatPromptTemplate is as follows:
+
+```python
+from langchain_core.prompts import ChatPromptTemplate
+
+prompt_template = ChatPromptTemplate([
+ ("system", "You are a helpful assistant"),
+ ("user", "Tell me a joke about {topic}")
+])
+
+prompt_template.invoke({"topic": "cats"})
+```
+
+In the above example, this ChatPromptTemplate will construct two messages when called.
+The first is a system message that has no variables to format.
+The second is a HumanMessage, and will be formatted by the `topic` variable the user passes in.
+
+## MessagesPlaceholder
+
+
+This prompt template is responsible for adding a list of messages in a particular place.
+In the above ChatPromptTemplate, we saw how we could format two messages, each one a string.
+But what if we wanted the user to pass in a list of messages that we would slot into a particular spot?
+This is how you use MessagesPlaceholder.
+
+```python
+from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
+from langchain_core.messages import HumanMessage
+
+prompt_template = ChatPromptTemplate([
+ ("system", "You are a helpful assistant"),
+ MessagesPlaceholder("msgs")
+])
+
+prompt_template.invoke({"msgs": [HumanMessage(content="hi!")]})
+```
+
+This will produce a list of two messages, the first one being a system message, and the second one being the HumanMessage we passed in.
+If we had passed in 5 messages, then it would have produced 6 messages in total (the system message plus the 5 passed in).
+This is useful for letting a list of messages be slotted into a particular spot.
+
+An alternative way to accomplish the same thing without using the `MessagesPlaceholder` class explicitly is:
+
+```python
+prompt_template = ChatPromptTemplate([
+ ("system", "You are a helpful assistant"),
+ ("placeholder", "{msgs}") # <-- This is the changed part
+])
+```
+
+For specifics on how to use prompt templates, see the [relevant how-to guides here](/docs/how_to/#prompt-templates).
diff --git a/docs/docs/concepts/rag.mdx b/docs/docs/concepts/rag.mdx
new file mode 100644
index 0000000000000..eb4752b6ffe2d
--- /dev/null
+++ b/docs/docs/concepts/rag.mdx
@@ -0,0 +1,98 @@
+# Retrieval augmented generation (RAG)
+
+:::info[Prerequisites]
+
+* [Retrieval](/docs/concepts/retrieval/)
+
+:::
+
+## Overview
+
+Retrieval Augmented Generation (RAG) is a powerful technique that enhances [language models](/docs/concepts/chat_models/) by combining them with external knowledge bases.
+RAG addresses [a key limitation of models](https://www.glean.com/blog/how-to-build-an-ai-assistant-for-the-enterprise): models rely on fixed training datasets, which can lead to outdated or incomplete information.
+When given a query, RAG systems first search a knowledge base for relevant information.
+The system then incorporates this retrieved information into the model's prompt.
+The model uses the provided context to generate a response to the query.
+By bridging the gap between vast language models and dynamic, targeted information retrieval, RAG is a powerful technique for building more capable and reliable AI systems.
+
+## Key concepts
+
+![Conceptual Overview](/img/rag_concepts.png)
+
+(1) **Retrieval system**: Retrieve relevant information from a knowledge base.
+
+(2) **Adding external knowledge**: Pass retrieved information to a model.
+
+## Retrieval system
+
+Models have internal knowledge that is often fixed, or at least not updated frequently due to the high cost of training.
+This limits their ability to answer questions about current events, or to provide specific domain knowledge.
+To address this, there are various knowledge injection techniques like [fine-tuning](https://hamel.dev/blog/posts/fine_tuning_valuable.html) or continued pre-training.
+Both are [costly](https://www.glean.com/blog/how-to-build-an-ai-assistant-for-the-enterprise) and often [poorly suited](https://www.anyscale.com/blog/fine-tuning-is-for-form-not-facts) for factual retrieval.
+Using a retrieval system offers several advantages:
+
+- **Up-to-date information**: RAG can access and utilize the latest data, keeping responses current.
+- **Domain-specific expertise**: With domain-specific knowledge bases, RAG can provide answers in specific domains.
+- **Reduced hallucination**: Grounding responses in retrieved facts helps minimize false or invented information.
+- **Cost-effective knowledge integration**: RAG offers a more efficient alternative to expensive model fine-tuning.
+
+:::info[Further reading]
+
+See our conceptual guide on [retrieval](/docs/concepts/retrieval/).
+
+:::
+
+## Adding external knowledge
+
+With a retrieval system in place, we need to pass knowledge from this system to the model.
+A RAG pipeline typically achieves this by following these steps:
+
+- Receive an input query.
+- Use the retrieval system to search for relevant information based on the query.
+- Incorporate the retrieved information into the prompt sent to the LLM.
+- Generate a response that leverages the retrieved context.
+
+As an example, here's a simple RAG workflow that passes information from a [retriever](/docs/concepts/retrievers/) to a [chat model](/docs/concepts/chat_models/):
+
+```python
+from langchain_openai import ChatOpenAI
+from langchain_core.messages import SystemMessage, HumanMessage
+
+# Define a system prompt that tells the model how to use the retrieved context
+system_prompt = """You are an assistant for question-answering tasks.
+Use the following pieces of retrieved context to answer the question.
+If you don't know the answer, just say that you don't know.
+Use three sentences maximum and keep the answer concise.
+Context: {context}:"""
+
+# Define a question
+question = """What are the main components of an LLM-powered autonomous agent system?"""
+
+# Retrieve relevant documents (assumes `retriever` was created earlier, e.g. from a vector store)
+docs = retriever.invoke(question)
+
+# Combine the documents into a single string, separated by blank lines
+docs_text = "\n\n".join(d.page_content for d in docs)
+
+# Populate the system prompt with the retrieved context
+system_prompt_fmt = system_prompt.format(context=docs_text)
+
+# Create a model
+model = ChatOpenAI(model="gpt-4o", temperature=0)
+
+# Generate a response
+response = model.invoke([SystemMessage(content=system_prompt_fmt),
+                         HumanMessage(content=question)])
+```
+
+:::info[Further reading]
+
+RAG is a deep area with many possible optimizations and design choices:
+
+* See [this excellent blog](https://cameronrwolfe.substack.com/p/a-practitioners-guide-to-retrieval?utm_source=profile&utm_medium=reader2) from Cameron Wolfe for a comprehensive overview and history of RAG.
+* See our [RAG how-to guides](/docs/how_to/#qa-with-rag).
+* See our RAG [tutorials](/docs/tutorials/#working-with-external-knowledge).
+* See our RAG from Scratch course, with [code](https://github.com/langchain-ai/rag-from-scratch) and [video playlist](https://www.youtube.com/playlist?list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x).
+* Also, see our RAG from Scratch course [on Freecodecamp](https://youtu.be/sVcwVQRHIc8?feature=shared).
+
+:::
diff --git a/docs/docs/concepts/retrieval.mdx b/docs/docs/concepts/retrieval.mdx
new file mode 100644
index 0000000000000..37bb1eb506d41
--- /dev/null
+++ b/docs/docs/concepts/retrieval.mdx
@@ -0,0 +1,240 @@
+# Retrieval
+
+:::info[Prerequisites]
+
+* [Retrievers](/docs/concepts/retrievers/)
+* [Vectorstores](/docs/concepts/vectorstores/)
+* [Embeddings](/docs/concepts/embedding_models/)
+* [Text splitters](/docs/concepts/text_splitters/)
+
+:::
+
+:::danger[Security]
+
+Some of the concepts reviewed here utilize models to generate queries (e.g., for SQL or graph databases).
+There are inherent risks in doing this.
+Make sure that your database connection permissions are scoped as narrowly as possible for your application's needs.
+This will mitigate, though not eliminate, the risks of building a model-driven system capable of querying databases.
+For more on general security best practices, see our [security guide](/docs/security/).
+
+:::
+
+## Overview
+
+Retrieval systems are fundamental to many AI applications, efficiently identifying relevant information from large datasets.
+These systems accommodate various data formats:
+
+- Unstructured text (e.g., documents) is often stored in vector stores or lexical search indexes.
+- Structured data is typically housed in relational or graph databases with defined schemas.
+
+Despite this diversity in data formats, modern AI applications increasingly aim to make all types of data accessible through natural language interfaces.
+Models play a crucial role in this process by translating natural language queries into formats compatible with the underlying search index or database.
+This translation enables more intuitive and flexible interactions with complex data structures.
+
+## Key concepts
+
+![Retrieval](/img/retrieval_concept.png)
+
+(1) **Query analysis**: A process where models transform or construct search queries to optimize retrieval.
+
+(2) **Information retrieval**: Search queries are used to fetch information from various retrieval systems.
+
+## Query analysis
+
+While users typically prefer to interact with retrieval systems using natural language, retrieval systems can require specific query syntax or benefit from particular keywords.
+Query analysis serves as a bridge between raw user input and optimized search queries. Some common applications of query analysis include:
+
+1. **Query Re-writing**: Queries can be re-written or expanded to improve semantic or lexical searches.
+2. **Query Construction**: Search indexes may require structured queries (e.g., SQL for databases).
+
+Query analysis employs models to transform or construct optimized search queries from raw user input.
+
+### Query re-writing
+
+Retrieval systems should ideally handle a wide spectrum of user inputs, from simple and poorly worded queries to complex, multi-faceted questions.
+To achieve this versatility, a popular approach is to use models to transform raw user queries into more effective search queries.
+This transformation can range from simple keyword extraction to sophisticated query expansion and reformulation.
+Here are some key benefits of using models for query analysis in unstructured data retrieval:
+
+1. **Query Clarification**: Models can rephrase ambiguous or poorly worded queries for clarity.
+2. **Semantic Understanding**: They can capture the intent behind a query, going beyond literal keyword matching.
+3. **Query Expansion**: Models can generate related terms or concepts to broaden the search scope.
+4. **Complex Query Handling**: They can break down multi-part questions into simpler sub-queries.
+
+Various techniques have been developed to leverage models for query re-writing, including:
+
+| Name | When to use | Description |
+|-----------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [Multi-query](/docs/how_to/MultiQueryRetriever/) | When you want to ensure high recall in retrieval by providing multiple phrasings of a question. | Rewrite the user question with multiple phrasings, retrieve documents for each rewritten question, return the unique documents for all queries. |
+| [Decomposition](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | When a question can be broken down into smaller subproblems. | Decompose a question into a set of subproblems / questions, which can either be solved sequentially (use the answer from first + retrieval to answer the second) or in parallel (consolidate each answer into final answer). |
+| [Step-back](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | When a higher-level conceptual understanding is required. | First prompt the LLM to ask a generic step-back question about higher-level concepts or principles, and retrieve relevant facts about them. Use this grounding to help answer the user question. [Paper](https://arxiv.org/pdf/2310.06117). |
+| [HyDE](https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb) | If you have challenges retrieving relevant documents using the raw user inputs. | Use an LLM to convert questions into hypothetical documents that answer the question. Use the embedded hypothetical documents to retrieve real documents with the premise that doc-doc similarity search can produce more relevant matches. [Paper](https://arxiv.org/abs/2212.10496). |
+
+As an example, query decomposition can simply be accomplished using prompting and a structured output that enforces a list of sub-questions.
+These can then be run sequentially or in parallel on a downstream retrieval system.
+
+```python
+from typing import List
+
+from pydantic import BaseModel, Field
+from langchain_openai import ChatOpenAI
+from langchain_core.messages import SystemMessage, HumanMessage
+
+# Define a pydantic model to enforce the output structure
+class Questions(BaseModel):
+ questions: List[str] = Field(
+ description="A list of sub-questions related to the input query."
+ )
+
+# Create an instance of the model and enforce the output structure
+model = ChatOpenAI(model="gpt-4o", temperature=0)
+structured_model = model.with_structured_output(Questions)
+
+# Define the system prompt
+system = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
+The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n"""
+
+# Pass the question to the model
+question = """What are the main components of an LLM-powered autonomous agent system?"""
+questions = structured_model.invoke([SystemMessage(content=system)]+[HumanMessage(content=question)])
+```
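
The multi-query technique from the table above ends with a merge step: retrieve for every phrasing, then keep each document once. The following is a minimal sketch of that step in plain Python. The corpus, the `keyword_retrieve` stand-in retriever, and the hard-coded phrasings are all hypothetical; in practice a model would generate the phrasings and a real retriever would fetch the documents.

```python
from typing import Callable, Dict, List

# Hypothetical corpus: document id -> text
CORPUS: Dict[str, str] = {
    "d1": "Agents use planning to break tasks into steps.",
    "d2": "Memory lets an agent recall earlier interactions.",
    "d3": "Tool use gives an agent access to external APIs.",
}


def keyword_retrieve(query: str) -> List[str]:
    """Stand-in retriever: return ids of documents sharing any word with the query."""
    words = set(query.lower().split())
    return [
        doc_id
        for doc_id, text in CORPUS.items()
        if words & set(text.lower().rstrip(".").split())
    ]


def multi_query_retrieve(
    phrasings: List[str], retrieve: Callable[[str], List[str]]
) -> List[str]:
    """Retrieve for every phrasing and return the unique ids in first-seen order."""
    seen: Dict[str, None] = {}
    for phrasing in phrasings:
        for doc_id in retrieve(phrasing):
            seen.setdefault(doc_id, None)
    return list(seen)


# Hard-coded here; a model would normally produce these rewrites.
phrasings = ["planning steps", "recall earlier interactions", "break tasks"]
unique_ids = multi_query_retrieve(phrasings, keyword_retrieve)
```

Document `d1` matches two of the phrasings but appears only once in `unique_ids`.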
+
+:::tip
+
+See our RAG from Scratch videos for a few different specific approaches:
+- [Multi-query](https://youtu.be/JChPi0CRnDY?feature=shared)
+- [Decomposition](https://youtu.be/h0OPWlEOank?feature=shared)
+- [Step-back](https://youtu.be/xn1jEjRyJ2U?feature=shared)
+- [HyDE](https://youtu.be/SaDzIVkYqyY?feature=shared)
+
+:::
+
+### Query construction
+
+Query analysis can also focus on translating natural language queries into specialized query languages or filters.
+This translation is crucial for effectively interacting with various types of databases that house structured or semi-structured data.
+
+1. **Structured Data examples**: For relational and graph databases, Domain-Specific Languages (DSLs) are used to query data.
+ - **Text-to-SQL**: [Converts natural language to SQL](https://paperswithcode.com/task/text-to-sql) for relational databases.
+ - **Text-to-Cypher**: [Converts natural language to Cypher](https://neo4j.com/labs/neodash/2.4/user-guide/extensions/natural-language-queries/) for graph databases.
+
+2. **Semi-structured Data examples**: For vectorstores, queries can combine semantic search with metadata filtering.
+ - **Natural Language to Metadata Filters**: Converts user queries into [appropriate metadata filters](https://docs.pinecone.io/guides/data/filter-with-metadata).
+
+These approaches leverage models to bridge the gap between user intent and the specific query requirements of different data storage systems. Here are some popular techniques:
+
+| Name | When to Use | Description |
+|------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [Self Query](/docs/how_to/self_query/) | If users are asking questions that are better answered by fetching documents based on metadata rather than similarity with the text. | This uses an LLM to transform user input into two things: (1) a string to look up semantically, (2) a metadata filter to go along with it. This is useful because oftentimes questions are about the METADATA of documents (not the content itself). |
+| [Text to SQL](/docs/tutorials/sql_qa/) | If users are asking questions that require information housed in a relational database, accessible via SQL. | This uses an LLM to transform user input into a SQL query. |
+| [Text-to-Cypher](/docs/tutorials/graph/) | If users are asking questions that require information housed in a graph database, accessible via Cypher. | This uses an LLM to transform user input into a Cypher query. |
+
+As an example, here is how to use the `SelfQueryRetriever` to convert natural language queries into metadata filters.
+
+```python
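+from langchain.retrievers.self_query.base import SelfQueryRetriever
+from langchain_openai import ChatOpenAI
+
+# `schema_for_metadata` (descriptions of the metadata fields) and `vectorstore` are assumed to be defined elsewhere
+metadata_field_info = schema_for_metadata
+document_content_description = "Brief summary of a movie"
+llm = ChatOpenAI(temperature=0)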
+retriever = SelfQueryRetriever.from_llm(
+ llm,
+ vectorstore,
+ document_content_description,
+ metadata_field_info,
+)
+```
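
To illustrate what a self-query step produces, the sketch below separates a request into a search string and a metadata filter, then applies the filter to a small in-memory collection. This is plain Python, not the `SelfQueryRetriever` implementation: the `StructuredQuery` shape, the `apply_filter` helper, and the movie data are hypothetical, and the filter supports exact matches only, whereas a real system typically supports richer comparisons.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StructuredQuery:
    """What a query-construction step might emit (hypothetical shape)."""

    query: str  # text to look up semantically
    filter: Dict[str, Any] = field(default_factory=dict)  # exact-match metadata constraints


# Hypothetical documents: content plus metadata
MOVIES: List[Dict[str, Any]] = [
    {"content": "A crew encounters a hostile alien on a cargo ship.",
     "metadata": {"genre": "sci-fi", "year": 1979}},
    {"content": "Toys come to life when humans are away.",
     "metadata": {"genre": "animated", "year": 1995}},
    {"content": "A hacker learns reality is a simulation.",
     "metadata": {"genre": "sci-fi", "year": 1999}},
]


def apply_filter(docs: List[Dict[str, Any]], structured: StructuredQuery) -> List[Dict[str, Any]]:
    """Keep only documents whose metadata matches every filter entry."""
    return [
        doc
        for doc in docs
        if all(doc["metadata"].get(key) == value for key, value in structured.filter.items())
    ]


# A model would derive this from a request such as "sci-fi space horror from 1979".
structured = StructuredQuery(query="space horror", filter={"genre": "sci-fi", "year": 1979})
candidates = apply_filter(MOVIES, structured)
```

A semantic search for `structured.query` would then run over `candidates` only, rather than over the whole collection.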
+
+:::info[Further reading]
+
+* See our tutorials on [text-to-SQL](/docs/tutorials/sql_qa/), [text-to-Cypher](/docs/tutorials/graph/), and [query analysis for metadata filters](/docs/tutorials/query_analysis/).
+* See our [blog post overview](https://blog.langchain.dev/query-construction/).
+* See our RAG from Scratch video on [query construction](https://youtu.be/kl6NwWYxvbM?feature=shared).
+
+:::
+
+## Information retrieval
+
+### Common retrieval systems
+
+#### Lexical search indexes
+
+Many search engines are based upon matching words in a query to the words in each document.
+This approach is called lexical retrieval, using search [algorithms that are typically based upon word frequencies](https://cameronrwolfe.substack.com/p/the-basics-of-ai-powered-vector-search?utm_source=profile&utm_medium=reader2).
+The intuition is simple: if a word appears frequently both in the user’s query and in a particular document, then this document might be a good match.
+
+The particular data structure used to implement this is often an [*inverted index*](https://www.geeksforgeeks.org/inverted-index/).
+This type of index contains a list of words and a mapping of each word to a list of locations at which it occurs in various documents.
+Using this data structure, it is possible to efficiently match the words in search queries to the documents in which they appear.
+[BM25](https://en.wikipedia.org/wiki/Okapi_BM25#:~:text=BM25%20is%20a%20bag%2Dof,slightly%20different%20components%20and%20parameters.) and [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) are [two popular lexical search algorithms](https://cameronrwolfe.substack.com/p/the-basics-of-ai-powered-vector-search?utm_source=profile&utm_medium=reader2).
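
The inverted index described above can be sketched in a few lines of plain Python. This is a simplified illustration: it maps each word to the set of documents containing it (not to positions within them), and it ranks by the number of distinct query words matched, which is neither BM25 nor TF-IDF. The documents and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical documents: id -> text
DOCS: Dict[int, str] = {
    0: "the cat sat on the mat",
    1: "the dog chased the cat",
    2: "dogs and cats are pets",
}


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each word to the set of document ids in which it occurs."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Rank documents by how many distinct query words they contain."""
    scores: Dict[int, int] = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


index = build_inverted_index(DOCS)
hits = search(index, "cat mat")
```

Only documents that share at least one word with the query are ever touched, which is what makes the lookup efficient on large collections.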
+
+:::info[Further reading]
+
+* See the [BM25](/docs/integrations/retrievers/bm25/) retriever integration.
+* See the [Elasticsearch](/docs/integrations/retrievers/elasticsearch_retriever/) retriever integration.
+
+:::
+
+#### Vector indexes
+
+Vector indexes are an alternative way to index and store unstructured data.
+See our conceptual guide on [vectorstores](/docs/concepts/vectorstores/) for a detailed overview.
+In short, rather than using word frequencies, vectorstores use an [embedding model](/docs/concepts/embedding_models/) to compress documents into high-dimensional vector representations.
+This allows for efficient similarity search over embedding vectors using simple mathematical operations like cosine similarity.
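
As a rough illustration of similarity search, the sketch below ranks a few documents by cosine similarity to a query vector. The three-dimensional vectors are made up for readability; real embedding models produce vectors with hundreds or thousands of dimensions, and vector stores generally use optimized index structures rather than this linear scan.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three documents
VECTORS: Dict[str, List[float]] = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.7, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}

query_vector = [1.0, 0.0, 0.0]

# Most similar document first
ranked = sorted(
    VECTORS,
    key=lambda name: cosine_similarity(query_vector, VECTORS[name]),
    reverse=True,
)
```

Because cosine similarity depends only on direction, two vectors pointing the same way score 1.0 regardless of their lengths.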
+
+:::info[Further reading]
+
+* See our [how-to guide](/docs/how_to/vectorstore_retriever/) for more details on working with vectorstores.
+* See our [list of vectorstore integrations](/docs/integrations/vectorstores/).
+* See Cameron Wolfe's [blog post](https://cameronrwolfe.substack.com/p/the-basics-of-ai-powered-vector-search?utm_source=profile&utm_medium=reader2) on the basics of vector search.
+
+:::
+
+#### Relational databases
+
+Relational databases are a fundamental type of structured data storage used in many applications.
+They organize data into tables with predefined schemas, where each table represents an entity or relationship.
+Data is stored in rows (records) and columns (attributes), allowing for efficient querying and manipulation through SQL (Structured Query Language).
+Relational databases excel at maintaining data integrity, supporting complex queries, and handling relationships between different data entities.
+
+:::info[Further reading]
+
+* See our [tutorial](/docs/tutorials/sql_qa/) for working with SQL databases.
+* See our [SQL database toolkit](/docs/integrations/tools/sql_database/).
+
+:::
+
+#### Graph databases
+
+Graph databases are a specialized type of database designed to store and manage highly interconnected data.
+Unlike traditional relational databases, graph databases use a flexible structure consisting of nodes (entities), edges (relationships), and properties.
+This structure allows for efficient representation and querying of complex, interconnected data.
+They are particularly useful for storing and querying complex relationships between data points, such as social networks, supply-chain management, fraud detection, and recommendation services.
+
+:::info[Further reading]
+
+* See our [tutorial](/docs/tutorials/graph/) for working with graph databases.
+* See our [list of graph database integrations](/docs/integrations/graphs/).
+* See Neo4j's [starter kit for LangChain](https://neo4j.com/developer-blog/langchain-neo4j-starter-kit/).
+
+:::
+
+### Retriever
+
+LangChain provides a unified interface for interacting with various retrieval systems through the [retriever](/docs/concepts/retrievers/) concept. The interface is straightforward:
+
+1. Input: A query (string)
+2. Output: A list of documents (standardized LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects)
+
+You can create a retriever using any of the retrieval systems mentioned earlier. The query analysis techniques we discussed are particularly useful here, as they enable natural language interfaces for databases that typically require structured query languages.
+For example, you can build a retriever for a SQL database using text-to-SQL conversion. This allows a natural language query (string) to be transformed into a SQL query behind the scenes.
+Regardless of the underlying retrieval system, all retrievers in LangChain share a common interface. You can use them with the simple `invoke` method:
+
+
+```python
+docs = retriever.invoke(query)
+```
+
+:::info[Further reading]
+
+* See our [conceptual guide on retrievers](/docs/concepts/retrievers/).
+* See our [how-to guide](/docs/how_to/#retrievers) on working with retrievers.
+
+:::
diff --git a/docs/docs/concepts/retrievers.mdx b/docs/docs/concepts/retrievers.mdx
new file mode 100644
index 0000000000000..5aaa893b7fde1
--- /dev/null
+++ b/docs/docs/concepts/retrievers.mdx
@@ -0,0 +1,145 @@
+# Retrievers
+
+
+
+:::info[Prerequisites]
+
+* [Vectorstores](/docs/concepts/vectorstores/)
+* [Embeddings](/docs/concepts/embedding_models/)
+* [Text splitters](/docs/concepts/text_splitters/)
+
+:::
+
+## Overview
+
+Many different types of retrieval systems exist, including vectorstores, graph databases, and relational databases.
+With the rise in popularity of large language models, retrieval systems have become an important component in AI applications (e.g., [RAG](/docs/concepts/rag/)).
+Because of their importance and variability, LangChain provides a uniform interface for interacting with different types of retrieval systems.
+The LangChain [retriever](/docs/concepts/retrievers/) interface is straightforward:
+
+1. Input: A query (string)
+2. Output: A list of documents (standardized LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects)
+
+## Key concept
+
+![Retriever](/img/retriever_concept.png)
+
+All retrievers implement a simple interface for retrieving documents using natural language queries.
+
+## Interface
+
+The only requirement for a retriever is the ability to accept a query and return documents.
+In particular, [LangChain's retriever class](https://api.python.langchain.com/en/latest/retrievers/langchain_core.retrievers.BaseRetriever.html) only requires that the `_get_relevant_documents` method is implemented, which takes a `query: str` and returns a list of [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects that are most relevant to the query.
+The underlying logic used to get relevant documents is specified by the retriever and can be whatever is most useful for the application.
+
+A LangChain retriever is a [runnable](/docs/how_to/lcel_cheatsheet/), which is the standard interface for LangChain components.
+This means that it has a few common methods, including `invoke`, that are used to interact with it. A retriever can be invoked with a query:
+
+```python
+docs = retriever.invoke(query)
+```
+
+Retrievers return a list of [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects, which have two attributes:
+
+* `page_content`: The content of this document. Currently a string.
+* `metadata`: Arbitrary metadata associated with this document (e.g., document id, file name, source, etc).
+
+:::info[Further reading]
+
+* See our [how-to guide](/docs/how_to/custom_retriever/) on building your own custom retriever.
+
+:::
+
+## Common types
+
+Despite the flexibility of the retriever interface, a few common types of retrieval systems are frequently used.
+
+### Search APIs
+
+It's important to note that retrievers don't need to actually *store* documents.
+For example, we can build retrievers on top of search APIs that simply return search results!
+See our retriever integrations with [Amazon Kendra](https://python.langchain.com/docs/integrations/retrievers/amazon_kendra_retriever/) or [Wikipedia Search](https://python.langchain.com/docs/integrations/retrievers/wikipedia/).
+
+### Relational or graph database
+
+Retrievers can be built on top of relational or graph databases.
+In these cases, [query analysis](/docs/concepts/retrieval/) techniques that construct a structured query from natural language are critical.
+For example, you can build a retriever for a SQL database using text-to-SQL conversion. This allows a natural language query (string) to be transformed into a SQL query behind the scenes.
+
+:::info[Further reading]
+
+* See our [tutorial](/docs/tutorials/sql_qa/) for context on how to build a retriever using a SQL database and text-to-SQL.
+* See our [tutorial](/docs/tutorials/graph/) for context on how to build a retriever using a graph database and text-to-Cypher.
+
+:::
+
+### Lexical search
+
+As discussed in our conceptual review of [retrieval](/docs/concepts/retrieval/), many search engines are based upon matching words in a query to the words in each document.
+[BM25](https://en.wikipedia.org/wiki/Okapi_BM25#:~:text=BM25%20is%20a%20bag%2Dof,slightly%20different%20components%20and%20parameters.) and [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) are [two popular lexical search algorithms](https://cameronrwolfe.substack.com/p/the-basics-of-ai-powered-vector-search?utm_source=profile&utm_medium=reader2).
+LangChain has retrievers for many popular lexical search algorithms / engines.
+
+:::info[Further reading]
+
+* See the [BM25](/docs/integrations/retrievers/bm25/) retriever integration.
+* See the [TF-IDF](/docs/integrations/retrievers/tf_idf/) retriever integration.
+* See the [Elasticsearch](/docs/integrations/retrievers/elasticsearch_retriever/) retriever integration.
+
+:::
+
+### Vectorstore
+
+[Vectorstores](/docs/concepts/vectorstores/) are a powerful and efficient way to index and retrieve unstructured data.
+A vectorstore can be used as a retriever by calling the `as_retriever()` method.
+
+```python
+vectorstore = MyVectorStore()
+retriever = vectorstore.as_retriever()
+```
+
+## Advanced retrieval patterns
+
+### Ensemble
+
+Because the retriever interface is so simple, returning a list of `Document` objects given a search query, it is possible to combine multiple retrievers using ensembling.
+This is particularly useful when you have multiple retrievers that are good at finding different types of relevant documents.
+It is easy to create an [ensemble retriever](/docs/how_to/ensemble_retriever/) that combines multiple retrievers with linear weighted scores:
+
+```python
+# Initialize the ensemble retriever
+ensemble_retriever = EnsembleRetriever(
+ retrievers=[bm25_retriever, vector_store_retriever], weights=[0.5, 0.5]
+)
+```
+
+When ensembling, how do we combine search results from many retrievers?
+This motivates the concept of re-ranking, which takes the output of multiple retrievers and combines them using a more sophisticated algorithm such as [Reciprocal Rank Fusion (RRF)](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf).
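+
+As a rough illustration of the idea, the sketch below fuses ranked result lists with a weighted form of RRF. It is a pure-Python toy, not LangChain's `EnsembleRetriever` implementation: the function name `reciprocal_rank_fusion`, the string document ids, and the smoothing constant `k=60` (the value suggested in the RRF paper) are assumptions made for the example.
+
+```python
+from collections import defaultdict
+
+
+def reciprocal_rank_fusion(ranked_lists, weights=None, k=60):
+    """Fuse ranked lists of document ids into one ranking (illustrative only)."""
+    if weights is None:
+        weights = [1.0] * len(ranked_lists)
+    scores = defaultdict(float)
+    for ranking, weight in zip(ranked_lists, weights):
+        for rank, doc_id in enumerate(ranking, start=1):
+            # Each list contributes weight / (k + rank) for a document it returned.
+            scores[doc_id] += weight / (k + rank)
+    # Highest fused score first; ties are broken by document id for determinism.
+    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
+
+
+# Hypothetical ranked outputs from two retrievers.
+bm25_results = ["doc_a", "doc_b", "doc_c"]
+vector_results = ["doc_b", "doc_d", "doc_a"]
+fused = reciprocal_rank_fusion([bm25_results, vector_results], weights=[0.5, 0.5])
+```
+
+Documents that appear near the top of several lists accumulate the largest scores, so agreement between retrievers is rewarded without comparing their raw, differently scaled relevance scores.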
+
+### Source document retention
+
+Many retrievers utilize some kind of index to make documents easily searchable.
+The process of indexing can include a transformation step (e.g., vectorstores often use document splitting).
+Whatever transformation is used, it can be very useful to retain a link between the *transformed document* and the original, giving the retriever the ability to return the *original* document.
+
+![Retrieval with full docs](/img/retriever_full_docs.png)
+
+This is particularly useful in AI applications, because it ensures no loss in document context for the model.
+For example, you may use a small chunk size for indexing documents in a vectorstore.
+If you return *only* the chunks as the retrieval result, then the model will have lost the original document context for the chunks.
+
+LangChain has two different retrievers that can be used to address this challenge.
+The [Multi-Vector](/docs/how_to/multi_vector/) retriever allows the user to use any document transformation (e.g., use an LLM to write a summary of the document) for indexing while retaining linkage to the source document.
+The [ParentDocument](/docs/how_to/parent_document_retriever/) retriever links document chunks from a text-splitter transformation for indexing while retaining linkage to the source document.
+
+| Name | Index Type | Uses an LLM | When to Use | Description |
+|-----------------------------------------------------------|-------------------------------|---------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [ParentDocument](/docs/how_to/parent_document_retriever/) | Vector store + Document Store | No | If your pages have lots of smaller pieces of distinct information that are best indexed by themselves, but best retrieved all together. | This involves indexing multiple chunks for each document. Then you find the chunks that are most similar in embedding space, but you retrieve the whole parent document and return that (rather than individual chunks). |
+| [Multi Vector](/docs/how_to/multi_vector/) | Vector store + Document Store | Sometimes during indexing | If you are able to extract information from documents that you think is more relevant to index than the text itself. | This involves creating multiple vectors for each document. Each vector could be created in a myriad of ways - examples include summaries of the text and hypothetical questions. |
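+
+The linkage between indexed chunks and their source document can be sketched in plain Python. This is a toy under stated assumptions, not the `ParentDocument` retriever itself: the `parents` dictionary, the sentence-level chunking, and the word-overlap score (standing in for embedding similarity) are all hypothetical.
+
+```python
+# Hypothetical source documents, keyed by an id.
+parents = {
+    "doc1": "Retrievers return documents. A vectorstore can act as a retriever.",
+    "doc2": "Runnables can be invoked, batched, and streamed.",
+}
+
+# Index small chunks (here: sentences), each remembering its parent id.
+chunk_index = [
+    (sentence.strip(), parent_id)
+    for parent_id, text in parents.items()
+    for sentence in text.split(".")
+    if sentence.strip()
+]
+
+
+def retrieve_parent(query):
+    """Match the query against chunks, then return the whole parent document."""
+    query_words = set(query.lower().split())
+    # Toy relevance score: number of words shared between the query and a chunk.
+    _chunk, parent_id = max(
+        chunk_index,
+        key=lambda item: len(query_words & set(item[0].lower().split())),
+    )
+    return parents[parent_id]
+
+
+result = retrieve_parent("vectorstore as a retriever")
+```
+
+The search runs over small chunks, but the caller receives the full parent text, so the model sees the surrounding context as well as the matching sentence.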
+
+:::info[Further reading]
+
+* See our [how-to guide](/docs/how_to/parent_document_retriever/) on using the ParentDocument retriever.
+* See our [how-to guide](/docs/how_to/multi_vector/) on using the MultiVector retriever.
+* See our RAG from Scratch video on the [multi vector retriever](https://youtu.be/gTCU9I6QqCE?feature=shared).
+
+:::
diff --git a/docs/docs/concepts/runnables.mdx b/docs/docs/concepts/runnables.mdx
new file mode 100644
index 0000000000000..678d38bddf7b6
--- /dev/null
+++ b/docs/docs/concepts/runnables.mdx
@@ -0,0 +1,352 @@
+# Runnable interface
+
+The Runnable interface is foundational for working with LangChain components, and it's implemented across many of them, such as [language models](/docs/concepts/chat_models), [output parsers](/docs/concepts/output_parsers), [retrievers](/docs/concepts/retrievers), [compiled LangGraph graphs](
+https://langchain-ai.github.io/langgraph/concepts/low_level/#compiling-your-graph) and more.
+
+This guide covers the main concepts and methods of the Runnable interface, which allows developers to interact with various LangChain components in a consistent and predictable manner.
+
+:::info Related Resources
+* The ["Runnable" Interface API Reference](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable) provides a detailed overview of the Runnable interface and its methods.
+* A list of built-in `Runnables` can be found in the [LangChain Core API Reference](https://python.langchain.com/api_reference/core/runnables.html). Many of these Runnables are useful when composing custom "chains" in LangChain using the [LangChain Expression Language (LCEL)](/docs/concepts/lcel).
+:::
+
+## Overview of runnable interface
+
+The Runnable interface defines a standard way for a Runnable component to be:
+
+* [Invoked](/docs/how_to/lcel_cheatsheet/#invoke-a-runnable): A single input is transformed into an output.
+* [Batched](/docs/how_to/lcel_cheatsheet/#batch-a-runnable): Multiple inputs are efficiently transformed into outputs.
+* [Streamed](/docs/how_to/lcel_cheatsheet/#stream-a-runnable): Outputs are streamed as they are produced.
+* Inspected: Schematic information about the Runnable's input, output, and configuration can be accessed.
+* Composed: Multiple Runnables can be composed to work together using [the LangChain Expression Language (LCEL)](/docs/concepts/lcel) to create complex pipelines.
+
+Please review the [LCEL Cheatsheet](/docs/how_to/lcel_cheatsheet) for some common patterns that involve the Runnable interface and LCEL expressions.
+
+
+### Optimized parallel execution (batch)
+
+
+LangChain Runnables offer built-in `batch` and `batch_as_completed` APIs that allow you to process multiple inputs in parallel.
+
+Using these methods can significantly improve performance when needing to process multiple independent inputs, as the
+processing can be done in parallel instead of sequentially.
+
+The two batching options are:
+
+* `batch`: Process multiple inputs in parallel, returning results in the same order as the inputs.
+* `batch_as_completed`: Process multiple inputs in parallel, returning results as they complete. Results may arrive out of order, but each includes the input index for matching.
+
+The default implementations of `batch` and `batch_as_completed` use a thread pool executor to run the `invoke` method in parallel. This allows for efficient parallel execution without the need for users to manage threads, and speeds up code that is I/O-bound (e.g., making API requests, reading files, etc.). This approach is less effective for CPU-bound operations, as the GIL (Global Interpreter Lock) in Python prevents true parallel execution.
+
+Some Runnables may provide their own implementations of `batch` and `batch_as_completed` that are optimized for their specific use case (e.g.,
+rely on a `batch` API provided by a model provider).
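+
+The thread-pool approach described above can be approximated with the standard library alone. The following is a minimal sketch, not LangChain's actual implementation: `invoke` is a hypothetical stand-in for an I/O-bound call, and `max_concurrency` is modeled simply as the size of the pool.
+
+```python
+import time
+from concurrent.futures import ThreadPoolExecutor
+
+
+def invoke(x):
+    """Stand-in for an I/O-bound call such as an API request."""
+    time.sleep(0.05)
+    return x * 2
+
+
+def batch(inputs, max_concurrency=None):
+    """Run `invoke` on every input in a thread pool, keeping input order."""
+    with ThreadPoolExecutor(max_workers=max_concurrency) as executor:
+        # executor.map returns results in the order of the inputs,
+        # regardless of which call finishes first.
+        return list(executor.map(invoke, inputs))
+
+
+outputs = batch([1, 2, 3, 4], max_concurrency=2)
+```
+
+Because the threads spend most of their time waiting, the total time is governed by the number of concurrent workers rather than by the number of inputs.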
+
+:::note
+The async versions, `abatch` and `abatch_as_completed`, rely on asyncio's [gather](https://docs.python.org/3/library/asyncio-task.html#asyncio.gather) and [as_completed](https://docs.python.org/3/library/asyncio-task.html#asyncio.as_completed) functions to run the `ainvoke` method in parallel.
+:::
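+
+The difference between the two asyncio functions can be seen in a small standard-library sketch. The function names mirror the Runnable methods only for readability; `ainvoke` here is a hypothetical coroutine, and pairing each result with its input index is an assumption about how out-of-order results can be matched back to inputs.
+
+```python
+import asyncio
+
+
+async def ainvoke(x):
+    """Stand-in for an async I/O-bound call; larger inputs sleep for less time."""
+    await asyncio.sleep(0.01 * (5 - x))
+    return x * 2
+
+
+async def abatch(inputs):
+    # gather returns results in the order of the inputs.
+    return await asyncio.gather(*(ainvoke(x) for x in inputs))
+
+
+async def abatch_as_completed(inputs):
+    async def run(index, x):
+        # Carry the input index along so results can be matched to inputs.
+        return index, await ainvoke(x)
+
+    results = []
+    # as_completed yields awaitables in the order in which they finish.
+    for finished in asyncio.as_completed([run(i, x) for i, x in enumerate(inputs)]):
+        results.append(await finished)
+    return results
+
+
+ordered = asyncio.run(abatch([1, 2, 3]))
+completed = asyncio.run(abatch_as_completed([1, 2, 3]))
+```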
+
+:::tip
+When processing a large number of inputs using `batch` or `batch_as_completed`, users may want to control the maximum number of parallel calls. This can be done by setting the `max_concurrency` attribute in the `RunnableConfig` dictionary. See the [RunnableConfig](/docs/concepts/runnables#runnableconfig) section for more information.
+
+Chat Models also have a built-in [rate limiter](/docs/concepts/chat_models#rate-limiting) that can be used to control the rate at which requests are made.
+:::
+
+### Asynchronous support
+
+
+Runnables expose an asynchronous API, allowing them to be called using the `await` syntax in Python. Asynchronous methods can be identified by the "a" prefix (e.g., `ainvoke`, `abatch`, `astream`, `abatch_as_completed`).
+
+Please refer to the [Async Programming with LangChain](/docs/concepts/async) guide for more details.
+
+## Streaming APIs
+
+
+Streaming is critical in making applications based on LLMs feel responsive to end-users.
+
+Runnables expose the following three streaming APIs:
+
+1. sync [stream](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.stream) and async [astream](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.astream): yield the output of a Runnable as it is generated.
+2. The async `astream_events`: a more advanced streaming API that allows streaming intermediate steps and final output
+3. The **legacy** async `astream_log`: a legacy streaming API that streams intermediate steps and final output
+
+Please refer to the [Streaming Conceptual Guide](/docs/concepts/streaming) for more details on how to stream in LangChain.
+
+## Input and output types
+
+Every `Runnable` is characterized by an input and output type. These input and output types can be any Python object, and are defined by the Runnable itself.
+
+Runnable methods that result in the execution of the Runnable (e.g., `invoke`, `batch`, `stream`, `astream_events`) work with these input and output types.
+
+* invoke: Accepts an input and returns an output.
+* batch: Accepts a list of inputs and returns a list of outputs.
+* stream: Accepts an input and returns a generator that yields outputs.
+
+The **input type** and **output type** vary by component:
+
+| Component | Input Type | Output Type |
+|--------------|--------------------------------------------------|-----------------------|
+| Prompt | dictionary | PromptValue |
+| ChatModel | a string, list of chat messages or a PromptValue | ChatMessage |
+| LLM | a string, list of chat messages or a PromptValue | String |
+| OutputParser | the output of an LLM or ChatModel | Depends on the parser |
+| Retriever | a string | List of Documents |
+| Tool | a string or dictionary, depending on the tool | Depends on the tool |
+
+Please refer to the individual component documentation for more information on the input and output types and how to use them.
+
+### Inspecting schemas
+
+:::note
+This is an advanced feature that is unnecessary for most users. You should probably
+skip this section unless you have a specific need to inspect the schema of a Runnable.
+:::
+
+In some advanced uses, you may want to programmatically **inspect** the Runnable and determine what input and output types the Runnable expects and produces.
+
+The Runnable interface provides methods to get the [JSON Schema](https://json-schema.org/) of the input and output types of a Runnable, as well as [Pydantic schemas](https://docs.pydantic.dev/latest/) for the input and output types.
+
+These APIs are mostly used internally for unit-testing and by [LangServe](/docs/concepts/architecture#langserve) which uses the APIs for input validation and generation of [OpenAPI documentation](https://www.openapis.org/).
+
+In addition to the input and output types, some Runnables have been set up with additional run time configuration options.
+There are corresponding APIs to get the Pydantic Schema and JSON Schema of the configuration options for the Runnable.
+Please see the [Configurable Runnables](#configurable-runnables) section for more information.
+
+| Method | Description |
+|-------------------------|------------------------------------------------------------------|
+| `get_input_schema` | Gives the Pydantic Schema of the input schema for the Runnable. |
+| `get_output_schema` | Gives the Pydantic Schema of the output schema for the Runnable. |
+| `config_schema` | Gives the Pydantic Schema of the config schema for the Runnable. |
+| `get_input_jsonschema` | Gives the JSONSchema of the input schema for the Runnable. |
+| `get_output_jsonschema` | Gives the JSONSchema of the output schema for the Runnable. |
+| `get_config_jsonschema` | Gives the JSONSchema of the config schema for the Runnable. |
+
+
+#### With_types
+
+LangChain will automatically try to infer the input and output types of a Runnable based on available information.
+
+Currently, this inference does not work well for more complex Runnables that are built using [LCEL](/docs/concepts/lcel) composition, and the inferred input and / or output types may be incorrect. In these cases, we recommend that users override the inferred input and output types using the `with_types` method ([API Reference](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.with_types)).
+
+## RunnableConfig
+
+Any of the methods that are used to execute the runnable (e.g., `invoke`, `batch`, `stream`, `astream_events`) accept a second argument called
+`RunnableConfig` ([API Reference](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.config.RunnableConfig.html#RunnableConfig)). This argument is a dictionary that contains configuration for the Runnable that will be used
+at run time during the execution of the runnable.
+
+A `RunnableConfig` can have any of the following properties defined:
+
+| Attribute | Description |
+|-----------------|--------------------------------------------------------------------------------------------|
+| run_name | Name used for the given Runnable (not inherited). |
+| run_id | Unique identifier for this call. sub-calls will get their own unique run ids. |
+| tags | Tags for this call and any sub-calls. |
+| metadata | Metadata for this call and any sub-calls. |
+| callbacks | Callbacks for this call and any sub-calls. |
+| max_concurrency | Maximum number of parallel calls to make (e.g., used by batch). |
+| recursion_limit | Maximum number of times a call can recurse (e.g., used by Runnables that return Runnables) |
+| configurable | Runtime values for configurable attributes of the Runnable. |
+
+Passing `config` to the `invoke` method is done like so:
+
+```python
+some_runnable.invoke(
+ some_input,
+ config={
+ 'run_name': 'my_run',
+ 'tags': ['tag1', 'tag2'],
+ 'metadata': {'key': 'value'}
+
+ }
+)
+```
+
+### Propagation of RunnableConfig
+
+Many `Runnables` are composed of other Runnables, and it is important that the `RunnableConfig` is propagated to all sub-calls made by the Runnable. This allows providing run time configuration values to the parent Runnable that are inherited by all sub-calls.
+
+If this were not the case, it would be impossible to set and propagate [callbacks](/docs/concepts/callbacks) or other configuration values like `tags` and `metadata` which
+are expected to be inherited by all sub-calls.
+
+There are two main patterns by which new `Runnables` are created:
+
+1. Declaratively using [LangChain Expression Language (LCEL)](/docs/concepts/lcel):
+
+    ```python
+    chain = prompt | chat_model | output_parser
+    ```
+
+2. Using a [custom Runnable](#custom-runnables) (e.g., `RunnableLambda`) or using the `@tool` decorator:
+
+    ```python
+    from langchain_core.runnables import RunnableLambda
+
+    def foo(input):
+        # Note that .invoke() is used directly here
+        return bar_runnable.invoke(input)
+
+    foo_runnable = RunnableLambda(foo)
+    ```
+
+LangChain will try to propagate `RunnableConfig` automatically for both of the patterns.
+
+For handling the second pattern, LangChain relies on Python's [contextvars](https://docs.python.org/3/library/contextvars.html).
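+
+For readers unfamiliar with `contextvars`, the mechanism can be sketched with the standard library alone. This is not LangChain's implementation: `current_config` and `invoke_with_config` are hypothetical names used only to show how a value set by an outer call becomes visible to nested calls without being passed as an argument.
+
+```python
+import contextvars
+
+# Hypothetical context variable holding the config of the active call.
+current_config = contextvars.ContextVar("current_config", default=None)
+
+
+def invoke_with_config(func, value, config):
+    """Make `config` visible to everything `func` calls, then restore the old value."""
+    token = current_config.set(config)
+    try:
+        return func(value)
+    finally:
+        current_config.reset(token)
+
+
+def bar(value):
+    # The nested call reads the config without receiving it as an argument.
+    return value, current_config.get()["tags"]
+
+
+def foo(value):
+    # Note that bar() is called directly, with no config argument.
+    return bar(value)
+
+
+result = invoke_with_config(foo, "hello", {"tags": ["tag1"]})
+```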
+
+In Python 3.11 and above, this works out of the box, and you do not need to do anything special to propagate the `RunnableConfig` to the sub-calls.
+
+In Python 3.9 and 3.10, if you are using **async code**, you need to manually pass the `RunnableConfig` through to the `Runnable` when invoking it.
+
+This is due to a limitation in [asyncio's tasks](https://docs.python.org/3/library/asyncio-task.html#asyncio.create_task) in Python 3.9 and 3.10, which did
+not accept a `context` argument.
+
+Propagating the `RunnableConfig` manually is done like so:
+
+```python
+async def foo(input, config): # <-- Note the config argument
+ return await bar_runnable.ainvoke(input, config=config)
+
+foo_runnable = RunnableLambda(foo)
+```
+
+:::caution
+When using Python 3.10 or lower and writing async code, `RunnableConfig` cannot be propagated
+automatically, and you will need to do it manually! This is a common pitfall when
+attempting to stream data using `astream_events` and `astream_log` as these methods
+rely on proper propagation of [callbacks](/docs/concepts/callbacks) defined inside of `RunnableConfig`.
+:::
+
+### Setting custom run name, tags, and metadata
+
+The `run_name`, `tags`, and `metadata` attributes of the `RunnableConfig` dictionary can be used to set custom values for the run name, tags, and metadata for a given Runnable.
+
+The `run_name` is a string that can be used to set a custom name for the run. This name will be used in logs and other places to identify the run. It is not inherited by sub-calls.
+
+The `tags` and `metadata` attributes are lists and dictionaries, respectively, that can be used to set custom tags and metadata for the run. These values are inherited by sub-calls.
+
+Using these attributes can be useful for tracking and debugging runs, as they will be surfaced in [LangSmith](https://docs.smith.langchain.com/) as trace attributes that you can
+filter and search on.
+
+The attributes will also be propagated to [callbacks](/docs/concepts/callbacks), and will appear in streaming APIs like [astream_events](/docs/concepts/streaming) as part of each event in the stream.
+
+:::note Related
+* [How-to trace with LangChain](https://docs.smith.langchain.com/how_to_guides/tracing/trace_with_langchain)
+:::
+
+### Setting run id
+
+:::note
+This is an advanced feature that is unnecessary for most users.
+:::
+
+You may need to set a custom `run_id` for a given run, in case you want
+to reference it later or correlate it with other systems.
+
+The `run_id` MUST be a valid UUID and **unique** for each run. It is used to identify
+the parent run; sub-calls will get their own unique run ids automatically.
+
+To set a custom `run_id`, you can pass it as a key-value pair in the `config` dictionary when invoking the Runnable:
+
+```python
+import uuid
+
+run_id = uuid.uuid4()
+
+some_runnable.invoke(
+ some_input,
+ config={
+ 'run_id': run_id
+ }
+)
+
+# Do something with the run_id
+```
+
+### Setting recursion limit
+
+:::note
+This is an advanced feature that is unnecessary for most users.
+:::
+
+Some Runnables may return other Runnables, which can lead to infinite recursion if not handled properly. To prevent this, you can set a `recursion_limit` in the `RunnableConfig` dictionary. This will limit the number of times a Runnable can recurse.
+
+### Setting max concurrency
+
+If using the `batch` or `batch_as_completed` methods, you can set the `max_concurrency` attribute in the `RunnableConfig` dictionary to control the maximum number of parallel calls to make. This can be useful when you want to limit the number of parallel calls to prevent overloading a server or API.
+
+
+:::tip
+If you're trying to rate limit the number of requests made by a **Chat Model**, you can use the built-in [rate limiter](/docs/concepts/chat_models#rate-limiting) instead of setting `max_concurrency`, which will be more effective.
+
+See the [How to handle rate limits](https://python.langchain.com/docs/how_to/chat_model_rate_limiting/) guide for more information.
+:::
+
+### Setting configurable
+
+The `configurable` field is used to pass runtime values for configurable attributes of the Runnable.
+
+It is used frequently in [LangGraph](/docs/concepts/architecture#langgraph) with
+[LangGraph Persistence](https://langchain-ai.github.io/langgraph/concepts/persistence/)
+and [memory](https://langchain-ai.github.io/langgraph/concepts/memory/).
+
+It is used for a similar purpose in [RunnableWithMessageHistory](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.history.RunnableWithMessageHistory.html#langchain_core.runnables.history.RunnableWithMessageHistory) to specify
+a `session_id` or `conversation_id` to keep track of conversation history.
+
+In addition, you can use it to specify any custom configuration options to pass to any [Configurable Runnable](#configurable-runnables) that you create.
+
+### Setting callbacks
+
+Use this option to configure [callbacks](/docs/concepts/callbacks) for the runnable at
+runtime. The callbacks will be passed to all sub-calls made by the runnable.
+
+```python
+some_runnable.invoke(
+ some_input,
+ {
+ "callbacks": [
+ SomeCallbackHandler(),
+ AnotherCallbackHandler(),
+ ]
+ }
+)
+```
+
+Please read the [Callbacks Conceptual Guide](/docs/concepts/callbacks) for more information on how to use callbacks in LangChain.
+
+:::important
+If you're using Python 3.9 or 3.10 in an async environment, you must propagate
+the `RunnableConfig` manually to sub-calls in some cases. Please see the
+[Propagating RunnableConfig](#propagation-of-runnableconfig) section for more information.
+:::
+
+## Creating a runnable from a function
+
+You may need to create a custom Runnable that runs arbitrary logic. This is especially
+useful if using [LangChain Expression Language (LCEL)](/docs/concepts/lcel) to compose
+multiple Runnables and you need to add custom processing logic in one of the steps.
+
+There are two ways to create a custom Runnable from a function:
+
+* `RunnableLambda`: Use this for simple transformations where streaming is not required.
+* `RunnableGenerator`: Use this for more complex transformations when streaming is needed.
+
+See the [How to run custom functions](/docs/how_to/functions) guide for more information on how to use `RunnableLambda` and `RunnableGenerator`.
+
+:::important
+Users should not try to subclass Runnables to create a new custom Runnable. It is
+much more complex and error-prone than simply using `RunnableLambda` or `RunnableGenerator`.
+:::
+
+## Configurable runnables
+
+:::note
+This is an advanced feature that is unnecessary for most users.
+
+It helps with configuration of large "chains" created using the [LangChain Expression Language (LCEL)](/docs/concepts/lcel)
+and is leveraged by [LangServe](/docs/concepts/architecture#langserve) for deployed Runnables.
+:::
+
+Sometimes you may want to experiment with, or even expose to the end user, multiple different ways of doing things with your Runnable. This could involve adjusting parameters like the temperature in a chat model or even switching between different chat models.
+
+To simplify this process, the Runnable interface provides two methods for creating configurable Runnables at runtime:
+
+* `configurable_fields`: This method allows you to configure specific **attributes** in a Runnable. For example, the `temperature` attribute of a chat model.
+* `configurable_alternatives`: This method enables you to specify **alternative** Runnables that can be run during run time. For example, you could specify a list of different chat models that can be used.
+
+See the [How to configure runtime chain internals](/docs/how_to/configure) guide for more information on how to configure runtime chain internals.
diff --git a/docs/docs/concepts/streaming.mdx b/docs/docs/concepts/streaming.mdx
new file mode 100644
index 0000000000000..7ab681b533ebb
--- /dev/null
+++ b/docs/docs/concepts/streaming.mdx
@@ -0,0 +1,191 @@
+# Streaming
+
+:::info Prerequisites
+* [Runnable Interface](/docs/concepts/runnables)
+* [Chat Models](/docs/concepts/chat_models)
+:::
+
+**Streaming** is crucial for enhancing the responsiveness of applications built on [LLMs](/docs/concepts/chat_models). By displaying output progressively, even before a complete response is ready, streaming significantly improves user experience (UX), particularly when dealing with the latency of LLMs.
+
+## Overview
+
+Generating full responses from [LLMs](/docs/concepts/chat_models) often incurs a delay of several seconds, which becomes more noticeable in complex applications with multiple model calls. Fortunately, LLMs generate responses iteratively, allowing for intermediate results to be displayed as they are produced. By streaming these intermediate outputs, LangChain enables smoother UX in LLM-powered apps and offers built-in support for streaming at the core of its design.
+
+In this guide, we'll discuss streaming in LLM applications and explore how LangChain's streaming APIs facilitate real-time output from various components in your application.
+
+## What to stream in LLM applications
+
+In applications involving LLMs, several types of data can be streamed to improve user experience by reducing perceived latency and increasing transparency. These include:
+
+### 1. Streaming LLM outputs
+
+The most common and critical data to stream is the output generated by the LLM itself. LLMs often take time to generate full responses, and by streaming the output in real-time, users can see partial results as they are produced. This provides immediate feedback and helps reduce the wait time for users.
+
+### 2. Streaming pipeline or workflow progress
+
+Beyond just streaming LLM output, it’s useful to stream progress through more complex workflows or pipelines, giving users a sense of how the application is progressing overall. This could include:
+
+- **In LangGraph Workflows:**
+With [LangGraph](/docs/concepts/architecture#langgraph), workflows are composed of nodes and edges that represent various steps. Streaming here involves tracking changes to the **graph state** as individual **nodes** request updates. This allows for more granular monitoring of which node in the workflow is currently active, giving real-time updates about the status of the workflow as it progresses through different stages.
+
+- **In LCEL Pipelines:**
+Streaming updates from an [LCEL](/docs/concepts/lcel) pipeline involves capturing progress from individual **sub-runnables**. For example, as different steps or components of the pipeline execute, you can stream which sub-runnable is currently running, providing real-time insight into the overall pipeline's progress.
+
+Streaming pipeline or workflow progress is essential in providing users with a clear picture of where the application is in the execution process.
+
+### 3. Streaming custom data
+
+In some cases, you may need to stream **custom data** that goes beyond the information provided by the pipeline or workflow structure. This custom information is injected within a specific step in the workflow, whether that step is a tool or a LangGraph node. For example, you could stream updates about what a tool is doing in real-time or the progress through a LangGraph node. This granular data, which is emitted directly from within the step, provides more detailed insights into the execution of the workflow and is especially useful in complex processes where more visibility is needed.
+
+## Streaming APIs
+
+LangChain has two main APIs for streaming output in real-time. These APIs are supported by any component that implements the [Runnable Interface](/docs/concepts/runnables), including [LLMs](/docs/concepts/chat_models), [compiled LangGraph graphs](https://langchain-ai.github.io/langgraph/concepts/low_level/), and any Runnable generated with [LCEL](/docs/concepts/lcel).
+
+1. sync [stream](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.stream) and async [astream](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.astream): Use to stream outputs from individual Runnables (e.g., a chat model) as they are generated or stream any workflow created with LangGraph.
+2. The async only [astream_events](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.astream_events): Use this API to get access to custom events and intermediate outputs from LLM applications built entirely with [LCEL](/docs/concepts/lcel). Note that this API is available, but not needed when working with LangGraph.
+
+:::note
+In addition, there is a **legacy** async [astream_log](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.Runnable.html#langchain_core.runnables.base.Runnable.astream_log) API. This API is not recommended for new projects because it is more complex and less feature-rich than the other streaming APIs.
+:::
+
+### `stream()` and `astream()`
+
+The `stream()` method returns an iterator that yields chunks of output synchronously as they are produced. You can use a `for` loop to process each chunk in real-time. For example, when using an LLM, this allows the output to be streamed incrementally as it is generated, reducing the wait time for users.
+
+The type of chunk yielded by the `stream()` and `astream()` methods depends on the component being streamed. For example, when streaming from an [LLM](/docs/concepts/chat_models), each chunk will be an [AIMessageChunk](/docs/concepts/messages#aimessagechunk); for other components, the chunk type may be different.
+
+The `stream()` method returns an iterator that yields these chunks as they are produced. For example,
+
+```python
+for chunk in component.stream(some_input):
+    # IMPORTANT: Keep the processing of each chunk as efficient as possible.
+    # While you're processing the current chunk, the upstream component is
+    # waiting to produce the next one. For example, if working with LangGraph,
+    # graph execution is paused while the current chunk is being processed.
+    # In extreme cases, this could even result in timeouts (e.g., when llm outputs are
+    # streamed from an API that has a timeout).
+    print(chunk)
+```
+
+The [asynchronous version](/docs/concepts/async), `astream()`, works similarly but is designed for non-blocking workflows. You can use it in asynchronous code to achieve the same real-time streaming behavior.
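+
+As a minimal sketch of the async consumption pattern, the snippet below uses a hypothetical `fake_astream` generator as a stand-in for a real component's `astream()` (an assumption made so that the example runs without a model or any LangChain import). The `async for` loop is the part that carries over to real Runnables.
+
+```python
+import asyncio
+from typing import AsyncIterator
+
+
+async def fake_astream(text: str) -> AsyncIterator[str]:
+    """Hypothetical stand-in for `component.astream(...)`: yields one word per chunk."""
+    for word in text.split():
+        await asyncio.sleep(0)  # hand control back to the event loop between chunks
+        yield word
+
+
+async def collect_chunks(text: str) -> list[str]:
+    chunks = []
+    # With a real Runnable, this loop would iterate over `component.astream(some_input)`.
+    async for chunk in fake_astream(text):
+        chunks.append(chunk)
+    return chunks
+
+
+chunks = asyncio.run(collect_chunks("streamed one chunk at a time"))
+print(chunks)
+```
+
+Because each chunk is awaited, other tasks on the same event loop can make progress while the next chunk is being produced.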
+
+#### Usage with chat models
+
+When using `stream()` or `astream()` with chat models, the output is streamed as [AIMessageChunks](/docs/concepts/messages#aimessagechunk) as it is generated by the LLM. This allows you to present or process the LLM's output incrementally as it's being produced, which is particularly useful in interactive applications or interfaces.
+
+#### Usage with LangGraph
+
+[LangGraph](/docs/concepts/architecture#langgraph) compiled graphs are [Runnables](/docs/concepts/runnables) and support the standard streaming APIs.
+
+When using the *stream* and *astream* methods with LangGraph, you can specify **one or more** [streaming modes](https://langchain-ai.github.io/langgraph/reference/types/#langgraph.types.StreamMode), which allow you to control the type of output that is streamed. The available streaming modes are:
+
+- **"values"**: Emit all values of the [state](https://langchain-ai.github.io/langgraph/concepts/low_level/) for each step.
+- **"updates"**: Emit only the node name(s) and updates that were returned by the node(s) after each step.
+- **"debug"**: Emit debug events for each step.
+- **"messages"**: Emit LLM [messages](/docs/concepts/messages) [token-by-token](/docs/concepts/tokens).
+- **"custom"**: Emit custom output written using [LangGraph's StreamWriter](https://langchain-ai.github.io/langgraph/reference/types/#langgraph.types.StreamWriter).
+
+For more information, please see:
+* [LangGraph streaming conceptual guide](https://langchain-ai.github.io/langgraph/concepts/streaming/) for more information on how to stream when working with LangGraph.
+* [LangGraph streaming how-to guides](https://langchain-ai.github.io/langgraph/how-tos/#streaming) for specific examples of streaming in LangGraph.
+
+#### Usage with LCEL
+
+If you compose multiple Runnables using [LangChain’s Expression Language (LCEL)](/docs/concepts/lcel), the `stream()` and `astream()` methods will, by convention, stream the output of the last step in the chain. This allows the final processed result to be streamed incrementally. **LCEL** tries to optimize streaming latency in pipelines such that the streaming results from the last step are available as soon as possible.
+
+
+
+### `astream_events`
+
+
+:::tip
+Use the `astream_events` API to access custom data and intermediate outputs from LLM applications built entirely with [LCEL](/docs/concepts/lcel).
+
+While this API is available for use with [LangGraph](/docs/concepts/architecture#langgraph) as well, it is usually not necessary when working with LangGraph, as the `stream` and `astream` methods provide comprehensive streaming capabilities for LangGraph graphs.
+:::
+
+For chains constructed using **LCEL**, the `.stream()` method only streams the output of the final step of the chain. This might be sufficient for some applications, but as you build more complex chains that combine several LLM calls, you may want to use the intermediate values of the chain alongside the final output. For example, you may want to return sources alongside the final generation when building a chat-over-documents app.
+
+There are ways to do this [using callbacks](/docs/concepts/#callbacks-1), or by constructing your chain in such a way that it passes intermediate
+values to the end with something like chained [`.assign()`](/docs/how_to/passthrough/) calls, but LangChain also includes an
+`.astream_events()` method that combines the flexibility of callbacks with the ergonomics of `.stream()`. When called, it returns an iterator
+which yields [various types of events](/docs/how_to/streaming/#event-reference) that you can filter and process according
+to the needs of your project.
+
+Here's one small example that prints just events containing streamed chat model output:
+
+```python
+from langchain_core.output_parsers import StrOutputParser
+from langchain_core.prompts import ChatPromptTemplate
+from langchain_anthropic import ChatAnthropic
+
+model = ChatAnthropic(model="claude-3-sonnet-20240229")
+
+prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")
+parser = StrOutputParser()
+chain = prompt | model | parser
+
+async for event in chain.astream_events({"topic": "parrot"}, version="v2"):
+    kind = event["event"]
+    if kind == "on_chat_model_stream":
+        print(event, end="|", flush=True)
+```
+
+You can roughly think of it as an iterator over callback events (though the format differs) - and you can use it on almost all LangChain components!
+
+See [this guide](/docs/how_to/streaming/#using-stream-events) for more detailed information on how to use `.astream_events()`, including a table listing available events.
+
+## Writing custom data to the stream
+
+To write custom data to the stream, you will need to choose one of the following methods based on the component you are working with:
+
+1. LangGraph's [StreamWriter](https://langchain-ai.github.io/langgraph/reference/types/#langgraph.types.StreamWriter) can be used to write custom data that will surface through the **stream** and **astream** APIs when working with LangGraph. **Important:** this is a LangGraph feature, so it is not available when working with pure LCEL. See [how to stream custom data](https://langchain-ai.github.io/langgraph/how-tos/streaming-content/) for more information.
+2. [dispatch_custom_event](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.manager.dispatch_custom_event.html#) / [adispatch_custom_event](https://python.langchain.com/api_reference/core/callbacks/langchain_core.callbacks.manager.adispatch_custom_event.html) can be used to write custom data that will be surfaced through the **astream_events** API. See [how to dispatch custom callback events](https://python.langchain.com/docs/how_to/callbacks_custom_events/#astream-events-api) for more information.
+
+## "Auto-Streaming" Chat Models
+
+LangChain simplifies streaming from [chat models](/docs/concepts/chat_models) by automatically enabling streaming mode in certain cases, even when you’re not explicitly calling the streaming methods. This is particularly useful when you use the non-streaming `invoke` method but still want to stream the entire application, including intermediate results from the chat model.
+
+### How It Works
+
+When you call the `invoke` (or `ainvoke`) method on a chat model, LangChain will automatically switch to streaming mode if it detects that you are trying to stream the overall application.
+
+Under the hood, it'll have `invoke` (or `ainvoke`) use the `stream` (or `astream`) method to generate its output. The result of the invocation will be the same as far as the code that was using `invoke` is concerned; however, while the chat model is being streamed, LangChain will take care of invoking `on_llm_new_token` events in LangChain's [callback system](/docs/concepts/callbacks). These callback events
+allow LangGraph `stream`/`astream` and `astream_events` to surface the chat model's output in real-time.
+
+Example:
+
+```python
+def node(state):
+    ...
+    # The code below uses the invoke method, but LangChain will
+    # automatically switch to streaming mode
+    # when it detects that the overall
+    # application is being streamed.
+    ai_message = model.invoke(state["messages"])
+    ...
+
+for chunk in compiled_graph.stream(..., stream_mode="messages"):
+    ...
+```
+
+## Async Programming
+
+LangChain offers both synchronous (sync) and asynchronous (async) versions of many of its methods. The async methods are typically prefixed with an "a" (e.g., `ainvoke`, `astream`). When writing async code, it's crucial to consistently use these asynchronous methods to ensure non-blocking behavior and optimal performance.
+
+If streaming data fails to appear in real-time, please ensure that you are using the correct async methods for your workflow.
+
+Please review the [async programming in LangChain guide](/docs/concepts/async) for more information on writing async code with LangChain.
+
+## Related Resources
+
+Please see the following how-to guides for specific examples of streaming in LangChain:
+* [LangGraph conceptual guide on streaming](https://langchain-ai.github.io/langgraph/concepts/streaming/)
+* [LangGraph streaming how-to guides](https://langchain-ai.github.io/langgraph/how-tos/#streaming)
+* [How to stream runnables](/docs/how_to/streaming/): This how-to guide goes over common streaming patterns with LangChain components (e.g., chat models) and with [LCEL](/docs/concepts/lcel).
+* [How to stream chat models](/docs/how_to/chat_streaming/)
+* [How to stream tool calls](/docs/how_to/tool_streaming/)
+
+For writing custom data to the stream, please see the following resources:
+
+* If using LangGraph, see [how to stream custom data](https://langchain-ai.github.io/langgraph/how-tos/streaming-content/).
+* If using LCEL, see [how to dispatch custom callback events](https://python.langchain.com/docs/how_to/callbacks_custom_events/#astream-events-api).
\ No newline at end of file
diff --git a/docs/docs/concepts/structured_outputs.mdx b/docs/docs/concepts/structured_outputs.mdx
new file mode 100644
index 0000000000000..f58150d5c609d
--- /dev/null
+++ b/docs/docs/concepts/structured_outputs.mdx
@@ -0,0 +1,148 @@
+# Structured outputs
+
+## Overview
+
+For many applications, such as chatbots, models need to respond to users directly in natural language.
+However, there are scenarios where we need models to output in a *structured format*.
+For example, we might want to store the model output in a database and ensure that the output conforms to the database schema.
+This need motivates the concept of structured output, where models can be instructed to respond with a particular output structure.
+
+![Structured output](/img/structured_output.png)
+
+## Key concepts
+
+**(1) Schema definition:** The output structure is represented as a schema, which can be defined in several ways.
+**(2) Returning structured output:** The model is given this schema, and is instructed to return output that conforms to it.
+
+## Recommended usage
+
+This pseudo-code illustrates the recommended workflow when using structured output.
+LangChain provides a method, [`with_structured_output()`](/docs/how_to/structured_output/#the-with_structured_output-method), that automates the process of binding the schema to the [model](/docs/concepts/chat_models/) and parsing the output.
+This helper function is available for all model providers that support structured output.
+
+```python
+# Define schema
+schema = {"foo": "bar"}
+# Bind schema to model
+model_with_structure = model.with_structured_output(schema)
+# Invoke the model to produce structured output that matches the schema
+structured_output = model_with_structure.invoke(user_input)
+```
+
+## Schema definition
+
+The central concept is that the output structure of model responses needs to be represented in some way.
+While the types of objects you can use depend on the model you're working with, there are common types of objects that are typically allowed or recommended for structured output in Python.
+
+The simplest and most common format for structured output is a JSON-like structure, which in Python can be represented as a dictionary (dict) or list (list).
+JSON objects (or dicts in Python) are often used directly when the tool requires raw, flexible, and minimal-overhead structured data.
+
+```json
+{
+    "answer": "The answer to the user's question",
+    "followup_question": "A followup question the user could ask"
+}
+```
+
+As a second example, [Pydantic](https://docs.pydantic.dev/latest/) is particularly useful for defining structured output schemas because it offers type hints and validation.
+Here's an example of a Pydantic schema:
+
+```python
+from pydantic import BaseModel, Field
+class ResponseFormatter(BaseModel):
+    """Always use this tool to structure your response to the user."""
+    answer: str = Field(description="The answer to the user's question")
+    followup_question: str = Field(description="A followup question the user could ask")
+
+```
+
+## Returning structured output
+
+With a schema defined, we need a way to instruct the model to use it.
+While one approach is to include this schema in the prompt and *ask nicely* for the model to use it, this is not recommended.
+Several more powerful methods that utilize native features in the model provider's API are available.
+
+### Using tool calling
+
+Many [model providers support](/docs/integrations/chat/) tool calling, a concept discussed in more detail in our [tool calling guide](/docs/concepts/tool_calling/).
+In short, tool calling involves binding a tool to a model and, when appropriate, the model can *decide* to call this tool and ensure its response conforms to the tool's schema.
+With this in mind, the central concept is straightforward: *simply bind our schema to a model as a tool!*
+Here is an example using the `ResponseFormatter` schema defined above:
+
+```python
+from langchain_openai import ChatOpenAI
+model = ChatOpenAI(model="gpt-4o", temperature=0)
+# Bind the ResponseFormatter schema as a tool to the model
+model_with_tools = model.bind_tools([ResponseFormatter])
+# Invoke the model
+ai_msg = model_with_tools.invoke("What is the powerhouse of the cell?")
+```
+
+The arguments of the tool call are already extracted as a dictionary.
+This dictionary can be optionally parsed into a Pydantic object, matching our original `ResponseFormatter` schema.
+
+```python
+# Get the tool call arguments
+ai_msg.tool_calls[0]["args"]
+{'answer': "The powerhouse of the cell is the mitochondrion. Mitochondria are organelles that generate most of the cell's supply of adenosine triphosphate (ATP), which is used as a source of chemical energy.",
+ 'followup_question': 'What is the function of ATP in the cell?'}
+# Parse the dictionary into a pydantic object
+pydantic_object = ResponseFormatter.model_validate(ai_msg.tool_calls[0]["args"])
+```
+
+### JSON mode
+
+In addition to tool calling, some model providers support a feature called `JSON mode`.
+This accepts a JSON schema definition as input and forces the model to produce conforming JSON output.
+You can find a table of model providers that support JSON mode [here](/docs/integrations/chat/).
+Here is an example of how to use JSON mode with OpenAI:
+
+```python
+from langchain_openai import ChatOpenAI
+model = ChatOpenAI(model="gpt-4o", model_kwargs={ "response_format": { "type": "json_object" } })
+ai_msg = model.invoke("Return a JSON object with key 'random_ints' and a value of 10 random ints in [0-99]")
+ai_msg.content
+'\n{\n "random_ints": [23, 47, 89, 15, 34, 76, 58, 3, 62, 91]\n}'
+```
+
+One important point to flag: the model *still* returns a string, which needs to be parsed into a JSON object.
+You can, of course, simply use the `json` library, or a JSON output parser if you need more advanced functionality.
+See this [how-to guide on the JSON output parser](/docs/how_to/output_parser_json) for more details.
+
+```python
+import json
+json_object = json.loads(ai_msg.content)
+{'random_ints': [23, 47, 89, 15, 34, 76, 58, 3, 62, 91]}
+```
+
+## Structured output method
+
+There are a few challenges when producing structured output with the above methods:
+
+(1) If using tool calling, tool call arguments need to be parsed from a dictionary back to the original schema.
+
+(2) In addition, the model needs to be instructed to *always* use the tool when we want to enforce structured output, which is a provider specific setting.
+
+(3) If using JSON mode, the output needs to be parsed into a JSON object.
+
+With these challenges in mind, LangChain provides a helper function (`with_structured_output()`) to streamline the process.
+
+![Diagram of with structured output](/img/with_structured_output.png)
+
+This both binds the schema to the model as a tool and parses the output to the specified output schema.
+
+```python
+# Bind the schema to the model
+model_with_structure = model.with_structured_output(ResponseFormatter)
+# Invoke the model
+structured_output = model_with_structure.invoke("What is the powerhouse of the cell?")
+# Get back the pydantic object
+structured_output
+ResponseFormatter(answer="The powerhouse of the cell is the mitochondrion. Mitochondria are organelles that generate most of the cell's supply of adenosine triphosphate (ATP), which is used as a source of chemical energy.", followup_question='What is the function of ATP in the cell?')
+```
+
+:::info[Further reading]
+
+For more details on usage, see our [how-to guide](/docs/how_to/structured_output/#the-with_structured_output-method).
+
+:::
\ No newline at end of file
diff --git a/docs/docs/concepts/text_splitters.mdx b/docs/docs/concepts/text_splitters.mdx
new file mode 100644
index 0000000000000..c5575a219f513
--- /dev/null
+++ b/docs/docs/concepts/text_splitters.mdx
@@ -0,0 +1,135 @@
+# Text splitters
+
+
+:::info[Prerequisites]
+
+* [Documents](/docs/concepts/retrievers/#interface)
+* [Tokenization](/docs/concepts/tokens)
+:::
+
+## Overview
+
+Document splitting is often a crucial preprocessing step for many applications.
+It involves breaking down large texts into smaller, manageable chunks.
+This process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems.
+There are several strategies for splitting documents, each with its own advantages.
+
+## Key concepts
+
+![Conceptual Overview](/img/text_splitters.png)
+
+Text splitters split documents into smaller chunks for use in downstream applications.
+
+## Why split documents?
+
+There are several reasons to split documents:
+
+- **Handling non-uniform document lengths**: Real-world document collections often contain texts of varying sizes. Splitting ensures consistent processing across all documents.
+- **Overcoming model limitations**: Many embedding models and language models have maximum input size constraints. Splitting allows us to process documents that would otherwise exceed these limits.
+- **Improving representation quality**: For longer documents, the quality of embeddings or other representations may degrade as they try to capture too much information. Splitting can lead to more focused and accurate representations of each section.
+- **Enhancing retrieval precision**: In information retrieval systems, splitting can improve the granularity of search results, allowing for more precise matching of queries to relevant document sections.
+- **Optimizing computational resources**: Working with smaller chunks of text can be more memory-efficient and allow for better parallelization of processing tasks.
+
+Now, the next question is *how* to split the documents into chunks! There are several strategies, each with its own advantages.
+
+:::info[Further reading]
+* See Greg Kamradt's [chunkviz](https://chunkviz.up.railway.app/) to visualize different splitting strategies discussed below.
+:::
+
+## Approaches
+
+### Length-based
+
+The most intuitive strategy is to split documents based on their length. This simple yet effective approach ensures that each chunk doesn't exceed a specified size limit.
+Key benefits of length-based splitting:
+- Straightforward implementation
+- Consistent chunk sizes
+- Easily adaptable to different model requirements
+
+Types of length-based splitting:
+- **Token-based**: Splits text based on the number of tokens, which is useful when working with language models.
+- **Character-based**: Splits text based on the number of characters, which can be more consistent across different types of text.
+
+Example implementation using LangChain's `CharacterTextSplitter` with token-based splitting:
+
+```python
+from langchain_text_splitters import CharacterTextSplitter
+text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
+    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
+)
+texts = text_splitter.split_text(document)
+```
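+
+For comparison, plain character-based splitting can be sketched without any library. This is a simplified illustration, not LangChain's implementation: the helper name `split_by_characters` and its parameters are hypothetical, and it cuts at fixed character offsets without looking for separators.
+
+```python
+def split_by_characters(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
+    """Cut `text` into windows of at most `chunk_size` characters.
+
+    Consecutive windows share `chunk_overlap` characters.
+    """
+    if chunk_size <= 0:
+        raise ValueError("chunk_size must be positive")
+    if not 0 <= chunk_overlap < chunk_size:
+        raise ValueError("chunk_overlap must be in [0, chunk_size)")
+    step = chunk_size - chunk_overlap
+    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
+
+
+chunks = split_by_characters("abcdefghij", chunk_size=4, chunk_overlap=1)
+print(chunks)
+```
+
+Every chunk respects the size limit, but the cut points ignore word and sentence boundaries, which is the main weakness the structure-aware strategies address.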
+
+:::info[Further reading]
+
+* See the how-to guide for [token-based](/docs/how_to/split_by_token/) splitting.
+* See the how-to guide for [character-based](/docs/how_to/character_text_splitter/) splitting.
+
+:::
+
+### Text-structured based
+
+Text is naturally organized into hierarchical units such as paragraphs, sentences, and words.
+We can leverage this inherent structure to inform our splitting strategy, creating splits that maintain natural language flow, preserve semantic coherence within each split, and adapt to varying levels of text granularity.
+LangChain's [`RecursiveCharacterTextSplitter`](/docs/how_to/recursive_text_splitter/) implements this concept:
+- The `RecursiveCharacterTextSplitter` attempts to keep larger units (e.g., paragraphs) intact.
+- If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
+- This process continues down to the word level if necessary.
+
+Here is example usage:
+
+```python
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=0)
+texts = text_splitter.split_text(document)
+```
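+
+The fallback from larger to smaller units can be sketched in plain Python. This is a deliberately simplified illustration under stated assumptions: the function name `recursive_split` and the separator list are hypothetical, separators are dropped from the output, and small neighbouring pieces are not merged back together, which a production splitter would normally do.
+
+```python
+def recursive_split(text: str, chunk_size: int, separators=("\n\n", ". ", " ")) -> list[str]:
+    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
+    if len(text) <= chunk_size:
+        return [text] if text else []
+    if not separators:
+        # No separators left: fall back to a hard cut at chunk_size.
+        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
+    head, *rest = separators
+    chunks = []
+    for piece in text.split(head):
+        chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
+    return chunks
+
+
+sample = "First paragraph here.\n\nSecond one is a bit longer. It has two sentences."
+pieces = recursive_split(sample, chunk_size=30)
+print(pieces)
+```
+
+The first paragraph fits within the limit and stays intact, while the second is too long and is therefore broken at the sentence level.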
+
+:::info[Further reading]
+
+* See the how-to guide for [recursive text splitting](/docs/how_to/recursive_text_splitter/).
+
+:::
+
+### Document-structured based
+
+Some documents have an inherent structure, such as HTML, Markdown, or JSON files.
+In these cases, it's beneficial to split the document based on its structure, as it often naturally groups semantically related text.
+Key benefits of structure-based splitting:
+- Preserves the logical organization of the document
+- Maintains context within each chunk
+- Can be more effective for downstream tasks like retrieval or summarization
+
+Examples of structure-based splitting:
+- **Markdown**: Split based on headers (e.g., #, ##, ###)
+- **HTML**: Split using tags
+- **JSON**: Split by object or array elements
+- **Code**: Split by functions, classes, or logical blocks
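+
+As a rough sketch of the Markdown case, the function below groups lines under the most recent header. The name `split_markdown_by_headers` is hypothetical, and the logic is a simplification: it treats any line starting with `#` as a header, so it would misread `#` comments inside fenced code blocks.
+
+```python
+def split_markdown_by_headers(markdown: str) -> list[dict]:
+    """Group lines under the most recent `#`-style header."""
+    sections = []
+    current = {"header": None, "lines": []}
+    for line in markdown.splitlines():
+        if line.startswith("#"):
+            # Close the previous section before starting a new one.
+            if current["header"] is not None or current["lines"]:
+                sections.append(current)
+            current = {"header": line.lstrip("#").strip(), "lines": []}
+        else:
+            current["lines"].append(line)
+    if current["header"] is not None or current["lines"]:
+        sections.append(current)
+    return [
+        {"header": s["header"], "content": "\n".join(s["lines"]).strip()}
+        for s in sections
+    ]
+
+
+doc = "# Intro\nHello.\n## Details\nMore text.\nEven more."
+sections = split_markdown_by_headers(doc)
+print(sections)
+```
+
+Keeping the header alongside each chunk preserves context that would be lost with purely length-based splitting.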
+
+:::info[Further reading]
+
+* See the how-to guide for [Markdown splitting](/docs/how_to/markdown_header_metadata_splitter/).
+* See the how-to guide for [Recursive JSON splitting](/docs/how_to/recursive_json_splitter/).
+* See the how-to guide for [Code splitting](/docs/how_to/code_splitter/).
+* See the how-to guide for [HTML splitting](/docs/how_to/HTML_header_metadata_splitter/).
+
+:::
+
+### Semantic meaning based
+
+Unlike the previous methods, semantic-based splitting actually considers the *content* of the text.
+While other approaches use document or text structure as proxies for semantic meaning, this method directly analyzes the text's semantics.
+There are several ways to implement this, but conceptually the approach is to split text when there are significant changes in text *meaning*.
+As an example, we can use a sliding window approach to generate embeddings, and compare the embeddings to find significant differences:
+
+- Start with the first few sentences and generate an embedding.
+- Move to the next group of sentences and generate another embedding (e.g., using a sliding window approach).
+- Compare the embeddings to find significant differences, which indicate potential "break points" between semantic sections.
+
+This technique helps create chunks that are more semantically coherent, potentially improving the quality of downstream tasks like retrieval or summarization.
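+
+The steps above can be sketched with a toy stand-in for the embedding model. Everything here is an assumption for illustration: `toy_embedding` is a bag-of-words counter rather than a real embedding, the window is a single sentence, and the `threshold` value is arbitrary.
+
+```python
+import math
+from collections import Counter
+
+
+def toy_embedding(sentence: str) -> Counter:
+    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
+    return Counter(word.strip(".,!?").lower() for word in sentence.split())
+
+
+def cosine_distance(a: Counter, b: Counter) -> float:
+    dot = sum(a[w] * b[w] for w in a)
+    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
+    return 1.0 - dot / norm if norm else 1.0
+
+
+def semantic_split(sentences: list[str], threshold: float = 0.8) -> list[list[str]]:
+    """Start a new chunk whenever adjacent sentences differ by more than `threshold`."""
+    if not sentences:
+        return []
+    chunks = [[sentences[0]]]
+    for prev, curr in zip(sentences, sentences[1:]):
+        if cosine_distance(toy_embedding(prev), toy_embedding(curr)) > threshold:
+            chunks.append([curr])  # break point: meaning changed significantly
+        else:
+            chunks[-1].append(curr)
+    return chunks
+
+
+sentences = ["Cats like fish.", "Cats like milk.", "Stocks fell sharply today."]
+groups = semantic_split(sentences)
+print(groups)
+```
+
+The two sentences about cats share most of their words and stay together, while the unrelated third sentence starts a new chunk.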
+
+:::info[Further reading]
+
+* See the how-to guide for [splitting text based on semantic meaning](/docs/how_to/semantic-chunker/).
+* See Greg Kamradt's [notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) showcasing semantic splitting.
+
+:::
diff --git a/docs/docs/concepts/tokens.mdx b/docs/docs/concepts/tokens.mdx
new file mode 100644
index 0000000000000..d42755e8d561a
--- /dev/null
+++ b/docs/docs/concepts/tokens.mdx
@@ -0,0 +1,58 @@
+# Tokens
+
+Modern large language models (LLMs) are typically based on a transformer architecture that processes a sequence of units known as tokens. Tokens are the fundamental elements that models use to break down input and generate output. In this section, we'll discuss what tokens are and how they are used by language models.
+
+## What is a token?
+
+A **token** is the basic unit that a language model reads, processes, and generates. These units can vary based on how the model provider defines them, but in general, they could represent:
+
+* A whole word (e.g., "apple"),
+* A part of a word (e.g., "app"),
+* Or other linguistic components such as punctuation or spaces.
+
+The way the model tokenizes the input depends on its **tokenizer algorithm**, which converts the input into tokens. Similarly, the model’s output comes as a stream of tokens, which is then decoded back into human-readable text.
+
+## How tokens work in language models
+
+The reason language models use tokens is tied to how they understand and predict language. Rather than processing characters or entire sentences directly, language models focus on **tokens**, which represent meaningful linguistic units. Here's how the process works:
+
+1. **Input Tokenization**: When you provide a model with a prompt (e.g., "LangChain is cool!"), the tokenizer algorithm splits the text into tokens. For example, the sentence could be tokenized into parts like `["Lang", "Chain", " is", " cool", "!"]`. Note that token boundaries don’t always align with word boundaries.
+ ![](/img/tokenization.png)
+
+2. **Processing**: The transformer architecture behind these models processes tokens sequentially to predict the next token in a sentence. It does this by analyzing the relationships between tokens, capturing context and meaning from the input.
+3. **Output Generation**: The model generates new tokens one by one. These output tokens are then decoded back into human-readable text.
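+
+To make the tokenize and decode steps concrete, here is a toy greedy longest-match tokenizer. It is only an illustration: the `toy_tokenize` function and its five-entry vocabulary are hypothetical, and real tokenizers use learned vocabularies and algorithms chosen by the model provider.
+
+```python
+def toy_tokenize(text: str, vocab: list[str]) -> list[str]:
+    """Greedy longest-match tokenization over a fixed toy vocabulary."""
+    ordered = sorted(vocab, key=len, reverse=True)
+    tokens, i = [], 0
+    while i < len(text):
+        # Fall back to a single character when no vocabulary entry matches.
+        match = next((t for t in ordered if text.startswith(t, i)), text[i])
+        tokens.append(match)
+        i += len(match)
+    return tokens
+
+
+vocab = ["Lang", "Chain", " is", " cool", "!"]
+tokens = toy_tokenize("LangChain is cool!", vocab)
+decoded = "".join(tokens)  # decoding simply concatenates the tokens again
+print(tokens, decoded)
+```
+
+Note how "LangChain" is split into two tokens, and how the leading space belongs to the tokens " is" and " cool".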
+
+Using tokens instead of raw characters allows the model to focus on linguistically meaningful units, which helps it capture grammar, structure, and context more effectively.
+
+## Tokens don’t have to be text
+
+Although tokens are most commonly used to represent text, they don’t have to be limited to textual data. Tokens can also serve as abstract representations of **multi-modal data**, such as:
+
+- **Images**,
+- **Audio**,
+- **Video**,
+- And other types of data.
+
+At the time of writing, virtually no models support **multi-modal output**, and only a few models can handle **multi-modal inputs** (e.g., text combined with images or audio). However, as advancements in AI continue, we expect **multi-modality** to become much more common. This would allow models to process and generate a broader range of media, significantly expanding the scope of what tokens can represent and how models can interact with diverse types of data.
+
+:::note
+In principle, **anything that can be represented as a sequence of tokens** could be modeled in a similar way. For example, **DNA sequences**—which are composed of a series of nucleotides (A, T, C, G)—can be tokenized and modeled to capture patterns, make predictions, or generate sequences. This flexibility allows transformer-based models to handle diverse types of sequential data, further broadening their potential applications across various domains, including bioinformatics, signal processing, and other fields that involve structured or unstructured sequences.
+:::
+
+Please see the [multimodality](/docs/concepts/multimodality) section for more information on multi-modal inputs and outputs.
+
+## Why not use characters?
+
+Using tokens instead of individual characters makes models both more efficient and better at understanding context and grammar. Tokens represent meaningful units, like whole words or parts of words, allowing models to capture language structure more effectively than by processing raw characters. Token-level processing also reduces the number of units the model has to handle, leading to faster computation.
+
+In contrast, character-level processing would require handling a much larger sequence of input, making it harder for the model to learn relationships and context. Tokens enable models to focus on linguistic meaning, making them more accurate and efficient in generating responses.
+
+## How tokens correspond to text
+
+Please see this post from [OpenAI](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them) for more details on how tokens are counted and how they correspond to text.
+
+According to the OpenAI post, the approximate token counts for English text are as follows:
+
+* 1 token ~= 4 chars in English
+* 1 token ~= ¾ words
+* 100 tokens ~= 75 words
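+
+These rules of thumb can be turned into a quick estimator. This is only a back-of-the-envelope sketch (the function names are hypothetical); for exact counts, use the tokenizer that matches your model.
+
+```python
+import math
+
+
+def estimate_tokens(text: str) -> int:
+    """Rough token estimate using the ~4 characters per token rule for English."""
+    return math.ceil(len(text) / 4)
+
+
+def estimate_words(num_tokens: int) -> int:
+    """Rough word estimate using the ~3/4 words per token rule for English."""
+    return round(num_tokens * 0.75)
+
+
+token_estimate = estimate_tokens("LangChain is cool!")
+word_estimate = estimate_words(100)
+print(token_estimate, word_estimate)
+```
+
+The estimate is approximate by design and can be noticeably off for code, non-English text, or unusual formatting.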
\ No newline at end of file
diff --git a/docs/docs/concepts/tool_calling.mdx b/docs/docs/concepts/tool_calling.mdx
new file mode 100644
index 0000000000000..e377688334640
--- /dev/null
+++ b/docs/docs/concepts/tool_calling.mdx
@@ -0,0 +1,149 @@
+# Tool calling
+
+:::info[Prerequisites]
+* [Tools](/docs/concepts/tools)
+* [Chat Models](/docs/concepts/chat_models)
+:::
+
+
+## Overview
+
+Many AI applications interact directly with humans. In these cases, it is appropriate for models to respond in natural language.
+But what about cases where we want a model to also interact *directly* with systems, such as databases or an API?
+These systems often have a particular input schema; for example, APIs frequently have a required payload structure.
+This need motivates the concept of *tool calling*. You can use [tool calling](https://platform.openai.com/docs/guides/function-calling/example-use-cases) to request model responses that match a particular schema.
+
+:::info
+You will sometimes hear the term `function calling`. We use this term interchangeably with `tool calling`.
+:::
+
+![Conceptual overview of tool calling](/img/tool_calling_concept.png)
+
+## Key concepts
+
+**(1) Tool Creation:** Use the [@tool](https://python.langchain.com/api_reference/core/tools/langchain_core.tools.convert.tool.html) decorator to create a [tool](/docs/concepts/tools). A tool is an association between a function and its schema.
+**(2) Tool Binding:** The tool needs to be connected to a model that supports tool calling. This gives the model awareness of the tool and the associated input schema required by the tool.
+**(3) Tool Calling:** When appropriate, the model can decide to call a tool and ensure its response conforms to the tool's input schema.
+**(4) Tool Execution:** The tool can be executed using the arguments provided by the model.
+
+![Conceptual parts of tool calling](/img/tool_calling_components.png)
+
+## Recommended usage
+
+This pseudo-code illustrates the recommended workflow for using tool calling.
+Created tools are passed to the `.bind_tools()` method as a list.
+The resulting model can be called as usual. If a tool call is made, the model's response will contain the tool call arguments.
+The tool call arguments can be passed directly to the tool.
+
+```python
+# Tool creation
+tools = [my_tool]
+# Tool binding
+model_with_tools = model.bind_tools(tools)
+# Tool calling
+response = model_with_tools.invoke(user_input)
+```
+
+## Tool creation
+
+The recommended way to create a tool is using the `@tool` decorator.
+
+```python
+from langchain_core.tools import tool
+
+@tool
+def multiply(a: int, b: int) -> int:
+    """Multiply a and b."""
+    return a * b
+```
+
+:::info[Further reading]
+
+* See our conceptual guide on [tools](/docs/concepts/tools/) for more details.
+* See our [model integrations](/docs/integrations/chat/) that support tool calling.
+* See our [how-to guide](/docs/how_to/tool_calling/) on tool calling.
+
+:::
+
+## Tool binding
+
+[Many](https://platform.openai.com/docs/guides/function-calling) [model providers](https://platform.openai.com/docs/guides/function-calling) support tool calling.
+
+:::tip
+See our [model integration page](/docs/integrations/chat/) for a list of providers that support tool calling.
+:::
+
+The central concept to understand is that LangChain provides a standardized interface for connecting tools to models.
+The `.bind_tools()` method can be used to specify which tools are available for a model to call.
+
+```python
+model_with_tools = model.bind_tools(tools_list)
+```
+
+As a specific example, let's take a function `multiply` and bind it as a tool to a model that supports tool calling.
+
+```python
+def multiply(a: int, b: int) -> int:
+    """Multiply a and b.
+
+    Args:
+        a: first int
+        b: second int
+    """
+    return a * b
+
+llm_with_tools = tool_calling_model.bind_tools([multiply])
+```
+
+## Tool calling
+
+![Diagram of a tool call by a model](/img/tool_call_example.png)
+
+A key principle of tool calling is that the model decides when to use a tool based on the input's relevance. The model doesn't always need to call a tool.
+For example, given an unrelated input, the model would not call the tool:
+
+```python
+result = llm_with_tools.invoke("Hello world!")
+```
+
+The result would be an `AIMessage` containing the model's response in natural language (e.g., "Hello!").
+However, if we pass an input *relevant to the tool*, the model should choose to call it:
+
+```python
+result = llm_with_tools.invoke("What is 2 multiplied by 3?")
+```
+
+As before, the output `result` will be an `AIMessage`.
+But, if the tool was called, `result.tool_calls` will be populated with a list of tool calls.
+Each entry includes everything needed to execute the tool, including the tool name and input arguments:
+
+```
+result.tool_calls
+[{'name': 'multiply', 'args': {'a': 2, 'b': 3}, 'id': 'xxx', 'type': 'tool_call'}]
+```
+
+For more details on usage, see our [how-to guides](/docs/how_to/#tools)!
+
+## Tool execution
+
+[Tools](/docs/concepts/tools/) implement the [Runnable](/docs/concepts/runnables/) interface, which means that they can be invoked (e.g., `tool.invoke(args)`) directly.
+
+[LangGraph](https://langchain-ai.github.io/langgraph/) offers pre-built components (e.g., [`ToolNode`](https://langchain-ai.github.io/langgraph/reference/prebuilt/#toolnode)) that will often invoke the tool on behalf of the user.
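+
+As a rough illustration of the execution step, the sketch below dispatches tool calls by name using plain Python functions. It is not how `ToolNode` is implemented; the `tools_by_name` registry and the `execute_tool_calls` helper are hypothetical names introduced here, and the tool call dictionary simply mirrors the shape of the `tool_calls` entry shown earlier.
+
+```python
+from typing import Any, Callable
+
+
+def multiply(a: int, b: int) -> int:
+    """Multiply a and b."""
+    return a * b
+
+
+# Hypothetical registry mapping tool names to plain Python callables.
+tools_by_name: dict[str, Callable[..., Any]] = {"multiply": multiply}
+
+
+def execute_tool_calls(tool_calls: list[dict]) -> list[dict]:
+    """Run each requested tool and pair its output with the call id."""
+    results = []
+    for call in tool_calls:
+        tool_fn = tools_by_name[call["name"]]  # look up the requested tool
+        output = tool_fn(**call["args"])  # pass the model-provided arguments
+        results.append({"tool_call_id": call["id"], "content": output})
+    return results
+
+
+tool_calls = [{"name": "multiply", "args": {"a": 2, "b": 3}, "id": "xxx", "type": "tool_call"}]
+outputs = execute_tool_calls(tool_calls)
+```
+
+Keying each result by the call `id` makes it possible to match outputs back to the requests that produced them.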
+
+:::info[Further reading]
+
+* See our [how-to guide](/docs/how_to/tool_calling/) on tool calling.
+* See the [LangGraph documentation on using ToolNode](https://langchain-ai.github.io/langgraph/how-tos/tool-calling/).
+
+:::
+
+## Best practices
+
+When designing [tools](/docs/concepts/tools/) to be used by a model, it is important to keep in mind that:
+
+* Models that have explicit [tool-calling APIs](/docs/concepts/#functiontool-calling) will be better at tool calling than non-fine-tuned models.
+* Models will perform better if the tools have well-chosen names and descriptions.
+* Simple, narrowly scoped tools are easier for models to use than complex tools.
+* Asking the model to select from a large list of tools poses challenges for the model.
+
+
diff --git a/docs/docs/concepts/tools.mdx b/docs/docs/concepts/tools.mdx
new file mode 100644
index 0000000000000..fe5910cdb2237
--- /dev/null
+++ b/docs/docs/concepts/tools.mdx
@@ -0,0 +1,211 @@
+# Tools
+
+:::info Prerequisites
+- [Chat models](/docs/concepts/chat_models/)
+:::
+
+## Overview
+
+The **tool** abstraction in LangChain associates a Python **function** with a **schema** that defines the function's **name**, **description** and **input**.
+
+**Tools** can be passed to [chat models](/docs/concepts/chat_models) that support [tool calling](/docs/concepts/tool_calling), allowing the model to request the execution of a specific function with specific inputs.
+
+## Key concepts
+
+- Tools encapsulate a function and its schema in a form that can be passed to a chat model.
+- Create tools using the [@tool](https://python.langchain.com/api_reference/core/tools/langchain_core.tools.convert.tool.html) decorator, which simplifies the process of tool creation and supports the following:
+    - Automatically inferring the tool's **name**, **description** and **inputs**, while also supporting customization.
+    - Defining tools that return **artifacts** (e.g. images, dataframes, etc.)
+    - Hiding input arguments from the schema (and hence from the model) using **injected tool arguments**.
+
+## Tool interface
+
+The tool interface is defined in the [BaseTool](https://python.langchain.com/api_reference/core/tools/langchain_core.tools.base.BaseTool.html#langchain_core.tools.base.BaseTool) class, which is a subclass of the [Runnable Interface](/docs/concepts/runnables).
+
+The key attributes that correspond to the tool's **schema**:
+
+- **name**: The name of the tool.
+- **description**: A description of what the tool does.
+- **args**: Property that returns the JSON schema for the tool's arguments.
+
+The key methods to execute the function associated with the **tool**:
+
+- **invoke**: Invokes the tool with the given arguments.
+- **ainvoke**: Invokes the tool with the given arguments, asynchronously. Used for [async programming with LangChain](/docs/concepts/async).
+
+## Create tools using the `@tool` decorator
+
+The recommended way to create tools is using the [@tool](https://python.langchain.com/api_reference/core/tools/langchain_core.tools.convert.tool.html) decorator. This decorator is designed to simplify the process of tool creation and should be used in most cases. After defining a function, you can decorate it with [@tool](https://python.langchain.com/api_reference/core/tools/langchain_core.tools.convert.tool.html) to create a tool that implements the [Tool Interface](#tool-interface).
+
+```python
+from langchain_core.tools import tool
+
+@tool
+def multiply(a: int, b: int) -> int:
+    """Multiply two numbers."""
+    return a * b
+```
+
+For more details on how to create tools, see the [how to create custom tools](/docs/how_to/custom_tools/) guide.
+
+:::note
+LangChain has a few other ways to create tools; e.g., by sub-classing the [BaseTool](https://python.langchain.com/api_reference/core/tools/langchain_core.tools.base.BaseTool.html#langchain_core.tools.base.BaseTool) class or by using `StructuredTool`. These methods are shown in the [how to create custom tools guide](/docs/how_to/custom_tools/), but
+we generally recommend using the `@tool` decorator for most cases.
+:::
+
+## Use the tool directly
+
+Once you have defined a tool, you can use it directly by invoking it with a dictionary of arguments. For example, to use the `multiply` tool defined above:
+
+```python
+multiply.invoke({"a": 2, "b": 3})
+```
+
+### Inspect
+
+You can also inspect the tool's schema and other properties:
+
+```python
+print(multiply.name) # multiply
+print(multiply.description) # Multiply two numbers.
+print(multiply.args)
+# {
+# 'type': 'object',
+# 'properties': {'a': {'type': 'integer'}, 'b': {'type': 'integer'}},
+# 'required': ['a', 'b']
+# }
+```
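+
+To give a feel for how a schema like this can be derived from a function, here is a deliberately simplified sketch that uses only the standard library. It is not LangChain's implementation: `infer_schema` and the `_JSON_TYPES` mapping are hypothetical, only a few primitive types are handled, and unknown types fall back to `"string"`.
+
+```python
+import inspect
+from typing import Any, Callable, get_type_hints
+
+# Simplified mapping from Python types to JSON schema type names.
+_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}
+
+
+def infer_schema(func: Callable[..., Any]) -> dict:
+    """Build a small tool-like schema from a function's signature and docstring."""
+    hints = get_type_hints(func)
+    properties = {}
+    required = []
+    for name, param in inspect.signature(func).parameters.items():
+        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
+        if param.default is inspect.Parameter.empty:
+            required.append(name)  # no default value means the argument is required
+    return {
+        "name": func.__name__,
+        "description": inspect.getdoc(func) or "",
+        "args": {"type": "object", "properties": properties, "required": required},
+    }
+
+
+def multiply(a: int, b: int) -> int:
+    """Multiply two numbers."""
+    return a * b
+
+
+schema = infer_schema(multiply)
+```
+
+The function name, docstring and type hints are the only inputs, which is why well-named, documented and type-hinted functions make better tools.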
+
+:::note
+If you're using pre-built LangChain or LangGraph components like [create_react_agent](https://langchain-ai.github.io/langgraph/reference/prebuilt/#langgraph.prebuilt.chat_agent_executor.create_react_agent), you might not need to interact with tools directly. However, understanding how to use them can be valuable for debugging and testing. Additionally, when building custom LangGraph workflows, you may find it necessary to work with tools directly.
+:::
+
+## Configuring the schema
+
+The `@tool` decorator offers additional options to configure the schema of the tool (e.g., modifying the name or description,
+or parsing the function's docstring to infer the schema).
+
+Please see the [API reference for @tool](https://python.langchain.com/api_reference/core/tools/langchain_core.tools.convert.tool.html) for more details and review the [how to create custom tools](/docs/how_to/custom_tools/) guide for examples.
+
+## Tool artifacts
+
+**Tools** are utilities that can be called by a model, and whose outputs are designed to be fed back to a model. Sometimes, however, there are artifacts of a tool's execution that we want to make accessible to downstream components in our chain or agent, but that we don't want to expose to the model itself. For example, if a tool returns a custom object, a dataframe or an image, we may want to pass some metadata about this output to the model without passing the actual output to the model. At the same time, we may want to be able to access this full output elsewhere, for example in downstream tools.
+
+```python
+from typing import Any, Tuple
+
+from langchain_core.tools import tool
+
+@tool(response_format="content_and_artifact")
+def some_tool(...) -> Tuple[str, Any]:
+    """Tool that does something."""
+    ...
+    return 'Message for chat model', some_artifact
+```
+
+See [how to return artifacts from tools](/docs/how_to/tool_artifacts/) for more details.
+
+## Special type annotations
+
+There are a number of special type annotations that can be used in the tool's function signature to configure the run time behavior of the tool.
+
+The following type annotations will end up **removing** the argument from the tool's schema. This can be useful for arguments that should not be exposed to the model and that the model should not be able to control.
+
+- **InjectedToolArg**: Value should be injected manually at runtime using `.invoke` or `.ainvoke`.
+- **RunnableConfig**: Pass in the RunnableConfig object to the tool.
+- **InjectedState**: Pass in the overall state of the LangGraph graph to the tool.
+- **InjectedStore**: Pass in the LangGraph store object to the tool.
+
+You can also use the `Annotated` type with a string literal to provide a **description** for the corresponding argument that **WILL** be exposed in the tool's schema.
+
+- **Annotated[..., "string literal"]** -- Adds a description to the argument that will be exposed in the tool's schema.
+
+### InjectedToolArg
+
+There are cases where certain arguments need to be passed to a tool at runtime but should not be generated by the model itself. For this, we use the `InjectedToolArg` annotation, which allows certain parameters to be hidden from the tool's schema.
+
+For example, if a tool requires a `user_id` to be injected dynamically at runtime, it can be structured in this way:
+
+```python
+from typing import Annotated
+
+from langchain_core.tools import tool, InjectedToolArg
+
+@tool
+def user_specific_tool(input_data: str, user_id: Annotated[str, InjectedToolArg]) -> str:
+    """Tool that processes input data."""
+    return f"User {user_id} processed {input_data}"
+```
+
+Annotating the `user_id` argument with `InjectedToolArg` tells LangChain that this argument should not be exposed as part of the
+tool's schema.
+
+See [how to pass run time values to tools](https://python.langchain.com/docs/how_to/tool_runtime/) for more details on how to use `InjectedToolArg`.
+
+
+### RunnableConfig
+
+You can use the `RunnableConfig` object to pass custom run time values to tools.
+
+If you need to access the [RunnableConfig](/docs/concepts/runnables/#RunnableConfig) object from within a tool, add a parameter annotated with `RunnableConfig` to the tool's function signature.
+
+```python
+from langchain_core.runnables import RunnableConfig
+from langchain_core.tools import tool
+
+@tool
+async def some_func(..., config: RunnableConfig) -> ...:
+    """Tool that does something."""
+    # do something with config
+    ...
+
+await some_func.ainvoke(..., config={"configurable": {"value": "some_value"}})
+```
+
+The `config` will not be part of the tool's schema and will be injected at runtime with appropriate values.
+
+:::note
+You may need to access the `config` object to manually propagate it to sub-calls. This happens if you're working with Python 3.9 / 3.10 in an [async](/docs/concepts/async) environment.
+
+Please read [Propagation RunnableConfig](/docs/concepts/runnables#propagation-RunnableConfig) to learn how to propagate the `RunnableConfig` down the call chain manually (or upgrade to Python 3.11, where this is no longer an issue).
+:::
+
+### InjectedState
+
+Please see the [InjectedState](https://langchain-ai.github.io/langgraph/reference/prebuilt/#langgraph.prebuilt.tool_node.InjectedState) documentation for more details.
+
+### InjectedStore
+
+Please see the [InjectedStore](https://langchain-ai.github.io/langgraph/reference/prebuilt/#langgraph.prebuilt.tool_node.InjectedStore) documentation for more details.
+
+## Best practices
+
+When designing tools to be used by models, keep the following in mind:
+
+- Tools that are well-named, correctly-documented and properly type-hinted are easier for models to use.
+- Design simple and narrowly scoped tools, as they are easier for models to use correctly.
+- Use chat models that support [tool-calling](/docs/concepts/tool_calling) APIs to take advantage of tools.
+
+
+## Toolkits
+
+
+LangChain has a concept of **toolkits**. This is a very thin abstraction that groups together tools
+that are designed to be used together for specific tasks.
+
+### Interface
+
+All Toolkits expose a `get_tools` method which returns a list of tools. You can therefore do:
+
+```python
+# Initialize a toolkit
+toolkit = ExampleToolkit(...)
+
+# Get list of tools
+tools = toolkit.get_tools()
+```
+
+## Related resources
+
+See the following resources for more information:
+
+- [API Reference for @tool](https://python.langchain.com/api_reference/core/tools/langchain_core.tools.convert.tool.html)
+- [How to create custom tools](https://python.langchain.com/docs/how_to/custom_tools/)
+- [How to pass run time values to tools](https://python.langchain.com/docs/how_to/tool_runtime/)
+- [All LangChain tool how-to guides](https://docs.langchain.com/docs/how_to/#tools)
+- [Additional how-to guides that show usage with LangGraph](https://langchain-ai.github.io/langgraph/how-tos/tool-calling/)
+- [Tool integration docs](https://docs.langchain.com/docs/integrations/tools/)
+
diff --git a/docs/docs/concepts/tracing.mdx b/docs/docs/concepts/tracing.mdx
new file mode 100644
index 0000000000000..659992eeb9573
--- /dev/null
+++ b/docs/docs/concepts/tracing.mdx
@@ -0,0 +1,10 @@
+# Tracing
+
+
+
+A trace is essentially a series of steps that your application takes to go from input to output.
+Traces contain individual steps called `runs`. These can be individual calls from a model, retriever,
+tool, or sub-chains.
+Tracing gives you observability inside your chains and agents, and is vital in diagnosing issues.
+
+For a deeper dive, check out [this LangSmith conceptual guide](https://docs.smith.langchain.com/concepts/tracing).
diff --git a/docs/docs/concepts/vectorstores.mdx b/docs/docs/concepts/vectorstores.mdx
new file mode 100644
index 0000000000000..44cefe54dee13
--- /dev/null
+++ b/docs/docs/concepts/vectorstores.mdx
@@ -0,0 +1,191 @@
+# Vector stores
+
+
+:::info[Prerequisites]
+
+* [Embeddings](/docs/concepts/embedding_models/)
+* [Text splitters](/docs/concepts/text_splitters/)
+
+:::
+:::info[Note]
+
+This conceptual overview focuses on text-based indexing and retrieval for simplicity.
+However, embedding models can be [multi-modal](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings)
+and vector stores can be used to store and retrieve a variety of data types beyond text.
+:::
+
+## Overview
+
+Vector stores are specialized data stores that enable indexing and retrieving information based on vector representations.
+
+These vectors, called [embeddings](/docs/concepts/embedding_models/), capture the semantic meaning of data that has been embedded.
+
+Vector stores are frequently used to search over unstructured data, such as text, images, and audio, to retrieve relevant information based on semantic similarity rather than exact keyword matches.
+
+![Vectorstores](/img/vectorstores.png)
+
+## Integrations
+
+LangChain has a large number of vectorstore integrations, allowing users to easily switch between different vectorstore implementations.
+
+Please see the [full list of LangChain vectorstore integrations](/docs/integrations/vectorstores/).
+
+## Interface
+
+LangChain provides a standard interface for working with vector stores, allowing users to easily switch between different vectorstore implementations.
+
+The interface consists of basic methods for writing, deleting and searching for documents in the vector store.
+
+The key methods are:
+
+- `add_documents`: Add a list of documents to the vector store.
+- `delete`: Delete a list of documents from the vector store.
+- `similarity_search`: Search for similar documents to a given query.
+
+
+## Initialization
+
+Most vector stores in LangChain accept an embedding model as an argument when initializing the vector store.
+
+We will use LangChain's [InMemoryVectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.in_memory.InMemoryVectorStore.html) implementation to illustrate the API.
+
+```python
+from langchain_core.vectorstores import InMemoryVectorStore
+# Initialize with an embedding model
+vector_store = InMemoryVectorStore(embedding=SomeEmbeddingModel())
+```
+
+## Adding documents
+
+To add documents, use the `add_documents` method.
+
+This API works with a list of [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects.
+`Document` objects all have `page_content` and `metadata` attributes, making them a universal way to store unstructured text and associated metadata.
+
+```python
+from langchain_core.documents import Document
+
+document_1 = Document(
+    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
+    metadata={"source": "tweet"},
+)
+
+document_2 = Document(
+    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
+    metadata={"source": "news"},
+)
+
+documents = [document_1, document_2]
+
+vector_store.add_documents(documents=documents)
+```
+
+You should usually provide IDs for the documents you add to the vector store, so
+that instead of adding the same document multiple times, you can update the existing document.
+
+```python
+vector_store.add_documents(documents=documents, ids=["doc1", "doc2"])
+```
+
+## Delete
+
+To delete documents, use the `delete` method, which takes a list of document IDs to delete.
+
+```python
+vector_store.delete(ids=["doc1"])
+```
+
+## Search
+
+Vectorstores embed and store the documents that are added.
+If we pass in a query, the vectorstore will embed the query, perform a similarity search over the embedded documents, and return the most similar ones.
+This captures two important concepts: first, there needs to be a way to measure the similarity between the query and *any* [embedded](/docs/concepts/embedding_models/) document.
+Second, there needs to be an algorithm to efficiently perform this similarity search across *all* embedded documents.
+
+### Similarity metrics
+
+A critical advantage of embedding vectors is that they can be compared using many simple mathematical operations:
+
+- **Cosine Similarity**: Measures the cosine of the angle between two vectors.
+- **Euclidean Distance**: Measures the straight-line distance between two points.
+- **Dot Product**: Measures the projection of one vector onto another.
+
+The choice of similarity metric can sometimes be selected when initializing the vectorstore. Please refer
+to the documentation of the specific vectorstore you are using to see what similarity metrics are supported.
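+
+The three metrics listed above can be written in a few lines of plain Python. This is only a sketch on toy two-dimensional vectors; the function names are illustrative, and real vector stores compute these over high-dimensional embeddings with optimized numeric libraries.
+
+```python
+import math
+from typing import Sequence
+
+
+def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
+    """Sum of element-wise products; sensitive to both direction and length."""
+    return sum(x * y for x, y in zip(a, b))
+
+
+def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
+    """Straight-line distance; smaller means more similar."""
+    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
+
+
+def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
+    """Cosine of the angle between the vectors, in the range [-1, 1]."""
+    norm_a = math.sqrt(dot_product(a, a))
+    norm_b = math.sqrt(dot_product(b, b))
+    return dot_product(a, b) / (norm_a * norm_b)
+
+
+query = [1.0, 0.0]
+doc_a = [2.0, 0.0]  # same direction as the query, different length
+doc_b = [0.0, 1.0]  # orthogonal to the query
+```
+
+Note that cosine similarity ignores vector length, while the dot product and Euclidean distance do not; the cosine function above also assumes neither vector is all zeros.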
+
+:::info[Further reading]
+
+* See [this documentation](https://developers.google.com/machine-learning/clustering/dnn-clustering/supervised-similarity) from Google on similarity metrics to consider with embeddings.
+* See Pinecone's [blog post](https://www.pinecone.io/learn/vector-similarity/) on similarity metrics.
+* See OpenAI's [FAQ](https://platform.openai.com/docs/guides/embeddings/faq) on what similarity metric to use with OpenAI embeddings.
+
+:::
+
+### Similarity search
+
+Given a similarity metric to measure the distance between the embedded query and any embedded document, we need an algorithm to efficiently search over *all* the embedded documents to find the most similar ones.
+There are various ways to do this. As an example, many vectorstores implement [HNSW (Hierarchical Navigable Small World)](https://www.pinecone.io/learn/series/faiss/hnsw/), a graph-based index structure that allows for efficient similarity search.
+Regardless of the search algorithm used under the hood, the LangChain vectorstore interface has a `similarity_search` method for all integrations.
+This will take the search query, create an embedding, find similar documents, and return them as a list of [Documents](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html).
+
+```python
+query = "my query"
+docs = vector_store.similarity_search(query)
+```
+
+Many vectorstores support search parameters to be passed with the `similarity_search` method. See the documentation for the specific vectorstore you are using to see what parameters are supported.
+Many vectorstores support [`k`](/docs/integrations/vectorstores/pinecone/#query-directly), which controls the number of Documents to return, and `filter`, which allows for filtering documents by metadata.
+As an example, [Pinecone](https://python.langchain.com/api_reference/pinecone/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html#langchain_pinecone.vectorstores.PineconeVectorStore.similarity_search) accepts several parameters that illustrate these general concepts:
+
+- `query (str) – Text to look up documents similar to.`
+- `k (int) – Number of Documents to return. Defaults to 4.`
+- `filter (dict | None) – Dictionary of argument(s) to filter on metadata`
+
+:::info[Further reading]
+
+* See the [how-to guide](/docs/how_to/vectorstores/) for more details on how to use the `similarity_search` method.
+* See the [integrations page](/docs/integrations/vectorstores/) for more details on arguments that can be passed in to the `similarity_search` method for specific vectorstores.
+
+:::
+
+### Metadata filtering
+
+While vectorstores implement a search algorithm to efficiently search over *all* the embedded documents to find the most similar ones, many also support filtering on metadata.
+This allows structured filters to reduce the size of the similarity search space. These two concepts work well together:
+
+1. **Semantic search**: Query the unstructured data directly, often via embedding or keyword similarity.
+2. **Metadata search**: Apply a structured query to the metadata, filtering for specific documents.
+
+Vectorstore support for metadata filtering is typically dependent on the underlying vector store implementation.
+
+Here is example usage with [Pinecone](/docs/integrations/vectorstores/pinecone/#query-directly), showing that we filter for all documents that have the metadata key `source` with value `tweet`.
+
+```python
+vectorstore.similarity_search(
+    "LangChain provides abstractions to make working with LLMs easy",
+    k=2,
+    filter={"source": "tweet"},
+)
+```
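+
+Conceptually, combining a metadata filter with a similarity search can be pictured as the plain-Python sketch below. The `filtered_search` helper, the `metadata_filter` parameter and the toy documents are all hypothetical, and ranking uses a simple dot product; real vector stores typically apply filters inside their index rather than scanning every document.
+
+```python
+from typing import Optional
+
+documents = [
+    {"id": "doc1", "embedding": [1.0, 0.0], "metadata": {"source": "tweet"}},
+    {"id": "doc2", "embedding": [0.9, 0.1], "metadata": {"source": "news"}},
+    {"id": "doc3", "embedding": [0.0, 1.0], "metadata": {"source": "tweet"}},
+]
+
+
+def dot_product(a, b):
+    return sum(x * y for x, y in zip(a, b))
+
+
+def filtered_search(query_vec, docs, k: int = 4, metadata_filter: Optional[dict] = None):
+    """Keep documents whose metadata matches, then rank them by similarity."""
+    metadata_filter = metadata_filter or {}
+    candidates = [
+        doc for doc in docs
+        if all(doc["metadata"].get(key) == value for key, value in metadata_filter.items())
+    ]
+    candidates.sort(key=lambda doc: dot_product(query_vec, doc["embedding"]), reverse=True)
+    return [doc["id"] for doc in candidates[:k]]
+
+
+tweets_only = filtered_search([1.0, 0.0], documents, k=2, metadata_filter={"source": "tweet"})
+unfiltered = filtered_search([1.0, 0.0], documents, k=2)
+```
+
+With the filter applied, the `news` document is excluded even though it is more similar to the query than one of the tweets.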
+
+:::info[Further reading]
+
+* See Pinecone's [documentation](https://docs.pinecone.io/guides/data/filter-with-metadata) on filtering with metadata.
+* See the [list of LangChain vectorstore integrations](/docs/integrations/retrievers/self_query/) that support metadata filtering.
+
+:::
+
+## Advanced search and retrieval techniques
+
+While algorithms like HNSW provide the foundation for efficient similarity search in many cases, additional techniques can be employed to improve search quality and diversity.
+For example, [maximal marginal relevance](https://python.langchain.com/v0.1/docs/modules/model_io/prompts/example_selectors/mmr/) is a re-ranking algorithm used to diversify search results, which is applied after the initial similarity search to ensure a more diverse set of results.
+As a second example, some [vector stores](/docs/integrations/retrievers/pinecone_hybrid_search/) offer built-in [hybrid-search](https://docs.pinecone.io/guides/data/understanding-hybrid-search) to combine keyword and semantic similarity search, which marries the benefits of both approaches.
+At the moment, there is no unified way to perform hybrid search using LangChain vectorstores, but it is generally exposed as a keyword argument that is passed in with `similarity_search`.
+See this [how-to guide on hybrid search](/docs/how_to/hybrid/) for more details.
+
+| Name | When to use | Description |
+|-------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
+| [Hybrid search](/docs/integrations/retrievers/pinecone_hybrid_search/) | When combining keyword-based and semantic similarity. | Hybrid search combines keyword and semantic similarity, marrying the benefits of both approaches. [Paper](https://arxiv.org/abs/2210.11934). |
+| [Maximal Marginal Relevance (MMR)](/docs/integrations/vectorstores/pinecone/#maximal-marginal-relevance-searches) | When needing to diversify search results. | MMR attempts to diversify the results of a search to avoid returning similar and redundant documents. |
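+
+To make the MMR idea concrete, the sketch below greedily selects documents by trading relevance to the query against similarity to documents already chosen. It is a generic illustration of the algorithm rather than any vector store's implementation; the `mmr` function, the `lambda_mult` weight and the toy vectors are assumptions made for this example.
+
+```python
+import math
+
+
+def cosine_similarity(a, b):
+    dot = sum(x * y for x, y in zip(a, b))
+    norm_a = math.sqrt(sum(x * x for x in a))
+    norm_b = math.sqrt(sum(y * y for y in b))
+    return dot / (norm_a * norm_b)
+
+
+def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
+    """Greedily pick k document indices, balancing relevance and diversity."""
+    selected = []
+    remaining = list(range(len(doc_vecs)))
+    while remaining and len(selected) < k:
+        def score(i):
+            relevance = cosine_similarity(query_vec, doc_vecs[i])
+            # Redundancy is the highest similarity to anything already selected.
+            redundancy = max(
+                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
+                default=0.0,
+            )
+            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
+
+        best = max(remaining, key=score)
+        selected.append(best)
+        remaining.remove(best)
+    return selected
+
+
+# Documents 0 and 1 are near-duplicates; document 2 points in a different direction.
+doc_vecs = [[0.9, 0.1], [0.9, 0.12], [0.6, -0.8]]
+diverse = mmr([1.0, 0.0], doc_vecs, k=2, lambda_mult=0.5)
+relevance_only = mmr([1.0, 0.0], doc_vecs, k=2, lambda_mult=1.0)
+```
+
+Setting `lambda_mult=1.0` reduces the procedure to plain similarity ranking, while lower values penalize near-duplicates more strongly.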
+
+
diff --git a/docs/docs/concepts/why_langchain.mdx b/docs/docs/concepts/why_langchain.mdx
new file mode 100644
index 0000000000000..1eae06eea3705
--- /dev/null
+++ b/docs/docs/concepts/why_langchain.mdx
@@ -0,0 +1,109 @@
+# Why langchain?
+
+The goal of `langchain` the Python package and LangChain the company is to make it as easy as possible for developers to build applications that reason.
+While LangChain originally started as a single open source package, it has evolved into a company and a whole ecosystem.
+This page will talk about the LangChain ecosystem as a whole.
+Most of the components within the LangChain ecosystem can be used by themselves - so if you feel particularly drawn to certain components but not others, that is totally fine! Pick and choose whichever components you like best.
+
+## Features
+
+There are several primary needs that LangChain aims to address:
+
+1. **Standardized component interfaces:** The growing number of [models](/docs/integrations/chat/) and [related components](/docs/integrations/vectorstores/) for AI applications has resulted in a wide variety of different APIs that developers need to learn and use.
+This diversity can make it challenging for developers to switch between providers or combine components when building applications.
+LangChain exposes a standard interface for key components, making it easy to switch between providers.
+
+2. **Orchestration:** As applications become more complex, combining multiple components and models, there's [a growing need to efficiently connect these elements into control flows](https://lilianweng.github.io/posts/2023-06-23-agent/) that can [accomplish diverse tasks](https://www.sequoiacap.com/article/generative-ais-act-o1/).
+[Orchestration](https://en.wikipedia.org/wiki/Orchestration_(computing)) is crucial for building such applications.
+
+3. **Observability and evaluation:** As applications become more complex, it becomes increasingly difficult to understand what is happening within them.
+Furthermore, the pace of development can become rate-limited by the [paradox of choice](https://en.wikipedia.org/wiki/Paradox_of_choice):
+for example, developers often wonder how to engineer their prompt or which LLM best balances accuracy, latency, and cost.
+[Observability](https://en.wikipedia.org/wiki/Observability) and evaluations can help developers monitor their applications and rapidly answer these types of questions with confidence.
+
+
+## Standardized component interfaces
+
+LangChain provides common interfaces for components that are central to many AI applications.
+As an example, all [chat models](/docs/concepts/chat_models/) implement the [BaseChatModel](https://python.langchain.com/api_reference/core/language_models/langchain_core.language_models.chat_models.BaseChatModel.html) interface.
+This provides a standard way to interact with chat models, supporting important but often provider-specific features like [tool calling](/docs/concepts/tool_calling/) and [structured outputs](/docs/concepts/structured_outputs/).
+
+
+### Example: chat models
+
+Many [model providers](/docs/concepts/chat_models/) support [tool calling](/docs/concepts/tool_calling/), a critical feature for many applications (e.g., [agents](https://langchain-ai.github.io/langgraph/concepts/agentic_concepts/)) that allows a developer to request model responses that match a particular schema.
+The APIs for each provider differ.
+LangChain's [chat model](/docs/concepts/chat_models/) interface provides a common way to bind [tools](/docs/concepts/tools) to a model in order to support [tool calling](/docs/concepts/tool_calling/):
+
+```python
+# Tool creation
+tools = [my_tool]
+# Tool binding
+model_with_tools = model.bind_tools(tools)
+```
+
+Similarly, getting models to produce [structured outputs](/docs/concepts/structured_outputs/) is an extremely common use case.
+Providers support different approaches for this, including [JSON mode or tool calling](https://platform.openai.com/docs/guides/structured-outputs), with different APIs.
+LangChain's [chat model](/docs/concepts/chat_models/) interface provides a common way to produce structured outputs using the `with_structured_output()` method:
+
+```python
+# Define schema
+schema = ...
+# Bind schema to model
+model_with_structure = model.with_structured_output(schema)
+```
+
+### Example: retrievers
+
+In the context of [RAG](/docs/concepts/rag/) and LLM application components, LangChain's [retriever](/docs/concepts/retrievers/) interface provides a standard way to connect to many different types of data services or databases (e.g., [vector stores](/docs/concepts/vectorstores) or databases).
+The underlying implementation of the retriever depends on the type of data store or database you are connecting to, but all retrievers implement the [runnable interface](/docs/concepts/runnables/), meaning they can be invoked in a common manner.
+
+```python
+documents = my_retriever.invoke("What is the meaning of life?")
+```
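The sketch below shows why a shared `invoke` method matters: two very different backends can be swapped without changing the calling code. The classes are hypothetical stand-ins rather than LangChain retrievers, and they return plain strings instead of document objects to keep the example small.

```python
from typing import Protocol


class SketchRetriever(Protocol):
    """Hypothetical common interface: anything with invoke(query)."""

    def invoke(self, query: str) -> list[str]: ...


class KeywordRetriever:
    """Returns documents that share at least one word with the query."""

    def __init__(self, docs):
        self.docs = docs

    def invoke(self, query):
        terms = set(query.lower().split())
        return [d for d in self.docs if terms & set(d.lower().split())]


class FaqRetriever:
    """Looks the query up in a question-to-answer table."""

    def __init__(self, faq):
        self.faq = faq

    def invoke(self, query):
        answer = self.faq.get(query.strip().lower())
        return [answer] if answer else []


def fetch_context(retriever: SketchRetriever, query: str) -> list[str]:
    # The caller never needs to know which backend it is talking to.
    return retriever.invoke(query)


keyword_hits = fetch_context(
    KeywordRetriever(["The meaning of life is 42", "Cats sleep a lot"]),
    "meaning of life",
)
faq_hits = fetch_context(FaqRetriever({"meaning of life": "42"}), "Meaning of life")
```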
+
+## Orchestration
+
+While standardization for individual components is useful, we've increasingly seen that developers want to *combine* components into more complex applications.
+This motivates the need for [orchestration](https://en.wikipedia.org/wiki/Orchestration_(computing)).
+There are several common characteristics of LLM applications that this orchestration layer should support:
+
+* **Complex control flow:** The application requires complex patterns such as cycles (e.g., a loop that repeats until a condition is met).
+* **[Persistence](https://langchain-ai.github.io/langgraph/concepts/persistence/):** The application needs to maintain [short-term and / or long-term memory](https://langchain-ai.github.io/langgraph/concepts/memory/).
+* **[Human-in-the-loop](https://langchain-ai.github.io/langgraph/concepts/human_in_the_loop/):** The application needs human interaction, e.g., pausing, reviewing, editing, or approving certain steps.
+
+The recommended way to do orchestration for these complex applications is [LangGraph](https://langchain-ai.github.io/langgraph/concepts/high_level/).
+LangGraph is a library that gives developers a high degree of control by expressing the flow of the application as a set of nodes and edges.
+LangGraph comes with built-in support for [persistence](https://langchain-ai.github.io/langgraph/concepts/persistence/), [human-in-the-loop](https://langchain-ai.github.io/langgraph/concepts/human_in_the_loop/), [memory](https://langchain-ai.github.io/langgraph/concepts/memory/), and other features.
+It's particularly well suited for building [agents](https://langchain-ai.github.io/langgraph/concepts/agentic_concepts/) or [multi-agent](https://langchain-ai.github.io/langgraph/concepts/multi_agent/) applications.
+Importantly, individual LangChain components can be used within LangGraph nodes, but you can also use LangGraph **without** using LangChain components.
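The nodes-and-edges idea, including a cycle that repeats until a condition is met, can be sketched in plain Python. This is not the LangGraph API: the `nodes` and `edges` dictionaries, the `run` loop, and the `"END"` marker are hypothetical names used only to show the control flow.

```python
def draft(state):
    """Node: produce a new draft and count the attempt."""
    state["attempts"] += 1
    state["text"] = "draft v" + str(state["attempts"])
    return state


def review(state):
    """Node: approve once three drafts have been written (toy condition)."""
    state["approved"] = state["attempts"] >= 3
    return state


def route_after_review(state):
    """Conditional edge: loop back to drafting until the review passes."""
    return "END" if state["approved"] else "draft"


nodes = {"draft": draft, "review": review}
edges = {"draft": lambda state: "review", "review": route_after_review}


def run(start, state, max_steps=20):
    """Walk the graph from `start` until an edge returns "END"."""
    current = start
    for _ in range(max_steps):
        if current == "END":
            return state
        state = nodes[current](state)
        current = edges[current](state)
    raise RuntimeError("step limit reached before END")


final_state = run("draft", {"attempts": 0, "text": "", "approved": False})
```

The step limit is a simple guard against a cycle whose exit condition is never met.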
+
+:::info[Further reading]
+
+Have a look at our free course, [Introduction to LangGraph](https://academy.langchain.com/courses/intro-to-langgraph), to learn more about how to use LangGraph to build complex applications.
+
+:::
+
+## Observability and evaluation
+
+The pace of AI application development is often limited by the lack of high-quality evaluations, because developers face a paradox of choice.
+Developers often wonder how to engineer their prompt or which LLM best balances accuracy, latency, and cost.
+High-quality tracing and evaluations can help you rapidly answer these types of questions with confidence.
+[LangSmith](https://docs.smith.langchain.com/) is our platform that supports observability and evaluation for AI applications.
+See our conceptual guides on [evaluations](https://docs.smith.langchain.com/concepts/evaluation) and [tracing](https://docs.smith.langchain.com/concepts/tracing) for more details.
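As a rough picture of what an evaluation does, the sketch below scores two candidates on the same small dataset and records accuracy and elapsed time. It is not the LangSmith API: the dataset, the two candidate functions, and `evaluate` are hypothetical stand-ins, with dictionary lookups playing the role of model calls.

```python
import time

# Hypothetical evaluation set of (question, expected answer) pairs.
dataset = [("2+2", "4"), ("3*3", "9"), ("10-7", "3")]


def candidate_a(question):
    """Stand-in for one prompt/model combination."""
    return {"2+2": "4", "3*3": "9", "10-7": "3"}.get(question, "")


def candidate_b(question):
    """Stand-in for a second combination that makes mistakes."""
    return {"2+2": "4", "3*3": "6"}.get(question, "")


def evaluate(candidate, examples):
    """Return exact-match accuracy and total wall-clock time for a candidate."""
    correct = 0
    start = time.perf_counter()
    for question, expected in examples:
        if candidate(question) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return {"accuracy": correct / len(examples), "seconds": elapsed}


report = {
    "candidate_a": evaluate(candidate_a, dataset),
    "candidate_b": evaluate(candidate_b, dataset),
}
```

Running both candidates against the same examples turns the choice between them into a side-by-side comparison of measured values.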
+
+:::info[Further reading]
+
+See our video playlist on [LangSmith tracing and evaluations](https://youtube.com/playlist?list=PLfaIDFEXuae0um8Fj0V4dHG37fGFU8Q5S&feature=shared) for more details.
+
+:::
+
+## Conclusion
+
+LangChain offers standard interfaces for components that are central to many AI applications, which provides a few specific advantages:
+- **Ease of swapping providers:** It allows you to swap out different component providers without having to change the underlying code.
+- **Advanced features:** It provides common methods for more advanced features, such as [streaming](/docs/concepts/runnables/#streaming) and [tool calling](/docs/concepts/tool_calling/).
+
+[LangGraph](https://langchain-ai.github.io/langgraph/concepts/high_level/) makes it possible to orchestrate complex applications (e.g., [agents](/docs/concepts/agents/)) and provides features such as [persistence](https://langchain-ai.github.io/langgraph/concepts/persistence/), [human-in-the-loop](https://langchain-ai.github.io/langgraph/concepts/human_in_the_loop/), and [memory](https://langchain-ai.github.io/langgraph/concepts/memory/).
+
+[LangSmith](https://docs.smith.langchain.com/) makes it possible to iterate with confidence on your applications by providing LLM-specific observability and a framework for testing and evaluating your application.
diff --git a/docs/docs/integrations/chat/groq.ipynb b/docs/docs/integrations/chat/groq.ipynb
index 59898319b5474..e4a4ad24aa4ac 100644
--- a/docs/docs/integrations/chat/groq.ipynb
+++ b/docs/docs/integrations/chat/groq.ipynb
@@ -17,7 +17,7 @@
"source": [
"# ChatGroq\n",
"\n",
- "This will help you getting started with Groq [chat models](../../concepts.mdx#chat-models). For detailed documentation of all ChatGroq features and configurations head to the [API reference](https://python.langchain.com/api_reference/groq/chat_models/langchain_groq.chat_models.ChatGroq.html). For a list of all Groq models, visit this [link](https://console.groq.com/docs/models).\n",
+ "This will help you get started with Groq [chat models](../../concepts/chat_models.mdx). For detailed documentation of all ChatGroq features and configurations head to the [API reference](https://python.langchain.com/api_reference/groq/chat_models/langchain_groq.chat_models.ChatGroq.html). For a list of all Groq models, visit this [link](https://console.groq.com/docs/models).\n",
"\n",
"## Overview\n",
"### Integration details\n",
diff --git a/docs/docs/integrations/chat/together.ipynb b/docs/docs/integrations/chat/together.ipynb
index 9cbdbfe47beff..cd47bc390f403 100644
--- a/docs/docs/integrations/chat/together.ipynb
+++ b/docs/docs/integrations/chat/together.ipynb
@@ -18,7 +18,7 @@
"# ChatTogether\n",
"\n",
"\n",
- "This page will help you get started with Together AI [chat models](../../concepts.mdx#chat-models). For detailed documentation of all ChatTogether features and configurations head to the [API reference](https://python.langchain.com/api_reference/together/chat_models/langchain_together.chat_models.ChatTogether.html).\n",
+ "This page will help you get started with Together AI [chat models](../../concepts/chat_models.mdx). For detailed documentation of all ChatTogether features and configurations head to the [API reference](https://python.langchain.com/api_reference/together/chat_models/langchain_together.chat_models.ChatTogether.html).\n",
"\n",
"[Together AI](https://www.together.ai/) offers an API to query [50+ leading open-source models](https://docs.together.ai/docs/chat-models)\n",
"\n",
diff --git a/docs/sidebars.js b/docs/sidebars.js
index d3539a2d8e90f..6f02bf20fdcfb 100644
--- a/docs/sidebars.js
+++ b/docs/sidebars.js
@@ -47,7 +47,17 @@ module.exports = {
className: 'hidden',
}],
},
- "concepts",
+ {
+ type: "category",
+ link: {type: 'doc', id: 'concepts/index'},
+ label: "Conceptual Guide",
+ collapsible: false,
+ items: [{
+ type: 'autogenerated',
+ dirName: 'concepts',
+ className: 'hidden',
+ }],
+ },
{
type: "category",
label: "Ecosystem",
diff --git a/docs/static/img/agent_types.png b/docs/static/img/agent_types.png
new file mode 100644
index 0000000000000..3cefe033438e9
Binary files /dev/null and b/docs/static/img/agent_types.png differ
diff --git a/docs/static/img/conversation_patterns.png b/docs/static/img/conversation_patterns.png
new file mode 100644
index 0000000000000..1cf45cc987d43
Binary files /dev/null and b/docs/static/img/conversation_patterns.png differ
diff --git a/docs/static/img/embeddings_concept.png b/docs/static/img/embeddings_concept.png
new file mode 100644
index 0000000000000..692ed1d4dc68b
Binary files /dev/null and b/docs/static/img/embeddings_concept.png differ
diff --git a/docs/static/img/rag_concepts.png b/docs/static/img/rag_concepts.png
new file mode 100644
index 0000000000000..3093f925f0589
Binary files /dev/null and b/docs/static/img/rag_concepts.png differ
diff --git a/docs/static/img/retrieval_concept.png b/docs/static/img/retrieval_concept.png
new file mode 100644
index 0000000000000..93e9db6f4b1e6
Binary files /dev/null and b/docs/static/img/retrieval_concept.png differ
diff --git a/docs/static/img/retrieval_high_level.png b/docs/static/img/retrieval_high_level.png
new file mode 100644
index 0000000000000..461fe773de08b
Binary files /dev/null and b/docs/static/img/retrieval_high_level.png differ
diff --git a/docs/static/img/retriever_concept.png b/docs/static/img/retriever_concept.png
new file mode 100644
index 0000000000000..4a288d3d49be2
Binary files /dev/null and b/docs/static/img/retriever_concept.png differ
diff --git a/docs/static/img/retriever_full_docs.png b/docs/static/img/retriever_full_docs.png
new file mode 100644
index 0000000000000..a50ef823f5fc0
Binary files /dev/null and b/docs/static/img/retriever_full_docs.png differ
diff --git a/docs/static/img/structured_output.png b/docs/static/img/structured_output.png
new file mode 100644
index 0000000000000..00511a2a11163
Binary files /dev/null and b/docs/static/img/structured_output.png differ
diff --git a/docs/static/img/text_splitters.png b/docs/static/img/text_splitters.png
new file mode 100644
index 0000000000000..6f5c06a217430
Binary files /dev/null and b/docs/static/img/text_splitters.png differ
diff --git a/docs/static/img/tool_call_example.png b/docs/static/img/tool_call_example.png
new file mode 100644
index 0000000000000..9e122f43f6e23
Binary files /dev/null and b/docs/static/img/tool_call_example.png differ
diff --git a/docs/static/img/tool_calling_agent.png b/docs/static/img/tool_calling_agent.png
new file mode 100644
index 0000000000000..12bd9a33701e3
Binary files /dev/null and b/docs/static/img/tool_calling_agent.png differ
diff --git a/docs/static/img/tool_calling_components.png b/docs/static/img/tool_calling_components.png
new file mode 100644
index 0000000000000..582fd7057c897
Binary files /dev/null and b/docs/static/img/tool_calling_components.png differ
diff --git a/docs/static/img/tool_calling_concept.png b/docs/static/img/tool_calling_concept.png
new file mode 100644
index 0000000000000..7abdee69226e2
Binary files /dev/null and b/docs/static/img/tool_calling_concept.png differ
diff --git a/docs/static/img/vectorstores.png b/docs/static/img/vectorstores.png
new file mode 100644
index 0000000000000..fb6604c1c8175
Binary files /dev/null and b/docs/static/img/vectorstores.png differ
diff --git a/docs/static/img/with_structured_output.png b/docs/static/img/with_structured_output.png
new file mode 100644
index 0000000000000..bf14853dc0634
Binary files /dev/null and b/docs/static/img/with_structured_output.png differ