docs[experimental]: Make docs clearer and add min_chunk_size #26398

Merged
merged 1 commit into from
Dec 15, 2024
12 changes: 7 additions & 5 deletions docs/docs/how_to/semantic-chunker.ipynb
@@ -125,9 +125,11 @@
  "\n",
  "There are a few ways to determine what that threshold is, which are controlled by the `breakpoint_threshold_type` kwarg.\n",
  "\n",
+ "Note: if the resulting chunks are too small or too large, use the additional kwargs `breakpoint_threshold_amount` and `min_chunk_size` to adjust them.\n",
+ "\n",
  "### Percentile\n",
  "\n",
- "The default way to split is based on percentile. In this method, all differences between sentences are calculated, and then any difference greater than the X percentile is split."
+ "The default way to split is based on percentile. In this method, all differences between sentences are calculated, and the text is split wherever a difference exceeds the Xth percentile. X defaults to 95.0 and can be adjusted with the keyword argument `breakpoint_threshold_amount`, which expects a number between 0.0 and 100.0."
  ]
  },
  {
@@ -186,7 +188,7 @@
  "source": [
  "### Standard Deviation\n",
  "\n",
- "In this method, any difference greater than X standard deviations is split."
+ "In this method, the text is split wherever a difference exceeds X standard deviations. X defaults to 3.0 and can be adjusted with the keyword argument `breakpoint_threshold_amount`."
  ]
  },
  {
@@ -245,7 +247,7 @@
  "source": [
  "### Interquartile\n",
  "\n",
- "In this method, the interquartile distance is used to split chunks."
+ "In this method, the interquartile distance is used to split chunks. The interquartile range can be scaled with the keyword argument `breakpoint_threshold_amount`; the default value is 1.5."
  ]
  },
  {
@@ -306,8 +308,8 @@
  "source": [
  "### Gradient\n",
  "\n",
- "In this method, the gradient of distance is used to split chunks along with the percentile method.\n",
- "This method is useful when chunks are highly correlated with each other or specific to a domain e.g. legal or medical. The idea is to apply anomaly detection on gradient array so that the distribution become wider and easy to identify boundaries in highly semantic data."
+ "In this method, the gradient of distance is used to split chunks along with the percentile method. This method is useful when chunks are highly correlated with each other or specific to a domain, e.g. legal or medical. The idea is to apply anomaly detection to the gradient array so that the distribution becomes wider and boundaries are easier to identify in highly semantic data.\n",
+ "As in the percentile method, the split can be adjusted with the keyword argument `breakpoint_threshold_amount`, which expects a number between 0.0 and 100.0; the default value is 95.0."
  ]
  },
  {
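For readers of this change, the four `breakpoint_threshold_type` options and `min_chunk_size` can be illustrated with a small standard-library sketch. This is not the `SemanticChunker` implementation: the helper names (`percentile`, `gradient`, `breakpoint_threshold`, `split_sentences`) are hypothetical, the defaults are the ones stated in the notebook text, and the exact formulas for the standard-deviation and interquartile thresholds, as well as the way `min_chunk_size` suppresses short chunks, are assumptions made for illustration.

```python
import statistics

# Defaults as stated in the notebook text changed by this PR.
DEFAULT_AMOUNTS = {
    "percentile": 95.0,
    "standard_deviation": 3.0,
    "interquartile": 1.5,
    "gradient": 95.0,
}


def percentile(values, q):
    """Return the q-th percentile (0.0 to 100.0) using linear interpolation."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def gradient(values):
    """Central differences in the interior, one-sided differences at the ends."""
    inner = [(values[i + 1] - values[i - 1]) / 2 for i in range(1, len(values) - 1)]
    return [values[1] - values[0], *inner, values[-1] - values[-2]]


def breakpoint_threshold(distances, kind="percentile", amount=None):
    """Return (threshold, scores); scores are the values compared to the threshold."""
    if amount is None:
        amount = DEFAULT_AMOUNTS[kind]
    if kind == "percentile":
        return percentile(distances, amount), distances
    if kind == "standard_deviation":
        # Assumption: "X standard deviations" is measured above the mean.
        mean = statistics.fmean(distances)
        return mean + amount * statistics.pstdev(distances), distances
    if kind == "interquartile":
        # Assumption: the scaled interquartile range is added to the mean.
        iqr = percentile(distances, 75.0) - percentile(distances, 25.0)
        return statistics.fmean(distances) + amount * iqr, distances
    if kind == "gradient":
        scores = gradient(distances)
        return percentile(scores, amount), scores
    raise ValueError(f"unknown threshold type: {kind}")


def split_sentences(sentences, distances, kind="percentile", amount=None,
                    min_chunk_size=None):
    """Join sentences into chunks, breaking after each score above the threshold.

    distances[i] is the distance between sentences[i] and sentences[i + 1].
    """
    threshold, scores = breakpoint_threshold(distances, kind, amount)
    chunks, start = [], 0
    for i, score in enumerate(scores):
        if score <= threshold:
            continue
        text = " ".join(sentences[start : i + 1])
        # Assumption: a candidate chunk shorter than min_chunk_size characters
        # is not emitted; it keeps growing until a later breakpoint.
        if min_chunk_size is not None and len(text) < min_chunk_size:
            continue
        chunks.append(text)
        start = i + 1
    if start < len(sentences):
        chunks.append(" ".join(sentences[start:]))
    return chunks
```

Under these assumptions, lowering `amount` for the percentile or gradient types produces more breakpoints and therefore smaller chunks, while raising `min_chunk_size` merges short chunks into their successors. For example, `split_sentences(["A.", "B.", "C.", "D."], [0.1, 0.9, 0.1])` breaks after the second sentence, and passing `min_chunk_size=10` keeps all four sentences in one chunk.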