Skip to content

Commit

Permalink
Add release notes for 2.6.0 (#374)
Browse files Browse the repository at this point in the history
* Add release notes for 2.6.0

* Update note about filter syntax

* Change the header level

---------

Co-authored-by: bilgeyucel <[email protected]>
  • Loading branch information
julian-risch and bilgeyucel authored Oct 3, 2024
1 parent 8f3a8f5 commit 2b97584
Showing 1 changed file with 112 additions and 0 deletions.
112 changes: 112 additions & 0 deletions content/release-notes/2.6.0.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
---
title: Haystack 2.6.0
description: Release notes for Haystack 2.6.0
toc: True
date: 2024-10-02
last_updated: 2024-10-02
tags: ["Release Notes"]
link: https://github.com/deepset-ai/haystack/releases/tag/v2.6.0
---

# Release Notes

## ⬆️ Upgrade Notes

- `gpt-3.5-turbo` was replaced by `gpt-4o-mini` as the default model for all components relying on OpenAI API
- Support for the legacy filter syntax and operators (e.g., "$and", "$or", "$eq", "$lt", etc.), which originated in Haystack v1, has been fully removed. Users must now use only the new filter syntax. See the [docs](https://docs.haystack.deepset.ai/docs/metadata-filtering) for more details.

## 🚀 New Features
- Added a new component `DocumentNDCGEvaluator`, which is similar to `DocumentMRREvaluator` and useful for retrieval evaluation. It calculates the normalized discounted cumulative gain, an evaluation metric useful when there are multiple ground truth relevant documents and the order in which they are retrieved is important.

- Add new `CSVToDocument` component. Loads the file as bytes object. Adds the loaded string as a new document that can be used for further processing by the Document Splitter.

- Adds support for zero shot document classification via new `TransformersZeroShotDocumentClassifier` component. This allows you to classify documents into user-defined classes (binary and multi-label classification) using pre-trained models from Hugging Face.

- Added the option to use a custom splitting function in `DocumentSplitter`. The function must accept a string as input and return a list of strings, representing the split units. To use the feature initialise `DocumentSplitter` with `split_by="function"` providing the custom splitting function as `splitting_function=custom_function`.

- Add new `JSONConverter` Component to convert JSON files to Document. Optionally it can use jq to filter the source JSON files and extract only specific parts.

```python
import json
from haystack.components.converters import JSONConverter
from haystack.dataclasses import ByteStream
data = {
"laureates": [
{
"firstname": "Enrico",
"surname": "Fermi",
"motivation": "for his demonstrations of the existence of new radioactive elements produced "
"by neutron irradiation, and for his related discovery of nuclear reactions brought about by slow neutrons",
},
{
"firstname": "Rita",
"surname": "Levi-Montalcini",
"motivation": "for their discoveries of growth factors",
},
],
}
source = ByteStream.from_string(json.dumps(data))
converter = JSONConverter(jq_schema=".laureates[]", content_key="motivation", extra_meta_fields=["firstname", "surname"])
results = converter.run(sources=[source])
documents = results["documents"] print(documents[0].content)
# 'for his demonstrations of the existence of new radioactive elements produced by
# neutron irradiation, and for his related discovery of nuclear reactions brought
# about by slow neutrons'
print(documents[0].meta)
# {'firstname': 'Enrico', 'surname': 'Fermi'}
print(documents[1].content)
# 'for their discoveries of growth factors' print(documents[1].meta) # {'firstname': 'Rita', 'surname': 'Levi-Montalcini'}
```

- Added a new `NLTKDocumentSplitter`, a component enhancing document preprocessing capabilities with NLTK. This feature allows for fine-grained control over the splitting of documents into smaller parts based on configurable criteria such as word count, sentence boundaries, and page breaks. It supports multiple languages and offers options for handling sentence boundaries and abbreviations, facilitating better handling of various document types for further processing tasks.

- Updates `SentenceTransformersDocumentEmbedder` and `SentenceTransformersTextEmbedder` so `model_max_length` passed through `tokenizer_kwargs` also updates the `max_seq_length` of the underlying SentenceTransformer model.

## ⚡️ Enhancement Notes

- Adapts how `ChatPromptBuilder` creates ChatMessages. Messages are deep copied to ensure all meta fields are copied correctly.

- Expose `default_headers` to pass custom headers to Azure API including APIM subscription key.

- Add optional `azure_kwargs` dictionary parameter to pass in parameters undefined in Haystack but supported by AzureOpenAI.

- Allow the ability to add the current date inside a template in `PromptBuilder` using the following syntax:

- `{% now 'UTC' %}`: Get the current date for the UTC timezone.
- `{% now 'America/Chicago' + 'hours=2' %}`: Add two hours to the current date in the Chicago timezone.
- `{% now 'Europe/Berlin' - 'weeks=2' %}`: Subtract two weeks from the current date in the Berlin timezone.
- `{% now 'Pacific/Fiji' + 'hours=2', '%H' %}`: Display only the number of hours after adding two hours to the Fiji timezone.
- `{% now 'Etc/GMT-4', '%I:%M %p' %}`: Change the date format to AM/PM for the GMT-4 timezone.

Note that if no date format is provided, the default will be `%Y-%m-%d %H:%M:%S`. Please refer to [list of tz database](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) for a list of timezones.

- Adds `usage` meta field with `prompt_tokens` and `completion_tokens` keys to `HuggingFaceAPIChatGenerator`.

- Add new `GreedyVariadic` input type. This has a similar behaviour to `Variadic` input type as it can be connected to multiple output sockets, though the Pipeline will run it as soon as it receives an input without waiting for others. This replaces the `is_greedy` argument in the `@component` decorator. If you had a Component with a `Variadic` input type and `@component(is_greedy=True)` you need to change the type to `GreedyVariadic` and remove `is_greedy=true` from `@component`.

- Add new Pipeline init argument `max_runs_per_component`, this has the same identical behaviour as the existing `max_loops_allowed` argument but is more descriptive of its actual effects.

- Add new `PipelineMaxLoops` to reflect new `max_runs_per_component` init argument

- We added batching during inference time to the `TransformerSimilarityRanker` to help prevent OOMs when ranking large amounts of Documents.

## ⚠️ Deprecation Notes

- The `DefaultConverter` class used by the `PyPDFToDocument` component has been deprecated. Its functionality will be merged into the component in 2.7.0.
- Pipeline init argument `debug_path` is deprecated and will be removed in version 2.7.0.
- `@component` decorator `is_greedy` argument is deprecated and will be removed in version 2.7.0. Use `GreedyVariadic` type instead.
- Deprecate connecting a Component to itself when calling `Pipeline.connect()`, it will raise an error from version 2.7.0 onwards
- Pipeline init argument `max_loops_allowed` is deprecated and will be removed in version 2.7.0. Use `max_runs_per_component` instead.
- `PipelineMaxLoops` exception is deprecated and will be removed in version 2.7.0. Use `PipelineMaxComponentRuns` instead.

## 🐛 Bug Fixes

- Fix the serialization of `PyPDFToDocument` component to prevent the default converter from being serialized unnecessarily.
- Add constraints to `component.set_input_type` and `component.set_input_types` to prevent undefined behaviour when the `run` method does not contain a variadic keyword argument.
- Prevent `set_output_types` from being called when the `output_types` decorator is used.
- Update the `CHAT_WITH_WEBSITE` Pipeline template to reflect the changes in the `HTMLToDocument` converter component.
- Fix a Pipeline visualization issue due to changes in the new release of Mermaid.
- Fixing the filters in the `SentenceWindowRetriever` allowing now support for 3 more DocumentStores: Astra, PGVector, Qdrant
- Fix Pipeline not running Components with Variadic input even if it received inputs only from a subset of its senders
- The `from_dict` method of `ConditionalRouter` now correctly handles the case where the `dict` passed to it contains the key `custom_filters` explicitly set to `None`. Previously this was causing an `AttributeError`
- Make the `from_dict` method of the `PyPDFToDocument` more robust to cases when the converter is not provided in the dictionary.

0 comments on commit 2b97584

Please sign in to comment.