Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support updating doc mapper through api #5253

Merged
merged 5 commits into from
Jul 26, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/configuration/index-config.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ In addition to the `index_id`, the index configuration lets you define five item
- The **search settings**: it defines the default search fields `default_search_fields`, a list of fields that Quickwit will search into if the user query does not explicitly target a field.
- The **retention policy**: it defines how long Quickwit should keep the indexed data. If not specified, the data is stored forever.

Configuration is generally set at index creation and cannot be modified, except for some specific attributes like the search settings and retention policy, which can be changed using the [update endpoint](../reference/rest-api.md) or the [CLI](../reference/cli.md).
Configuration is set at index creation and can be changed using the [update endpoint](../reference/rest-api.md) or the [CLI](../reference/cli.md).

## Config file format

Expand Down
7 changes: 6 additions & 1 deletion docs/reference/rest-api.md
Original file line number Diff line number Diff line change
Expand Up @@ -319,6 +319,11 @@ Updates the configurations of an index. This endpoint follows PUT semantics, whi
- The retention policy update is automatically picked up by the janitor service on its next state refresh.
- The search settings update is automatically picked up by searcher nodes when the next query is executed.
- The indexing settings update is not automatically picked up by the indexer nodes, they need to be manually restarted.
- The doc mapping update is not automatically picked up by the indexer nodes, they have to be manually restarted.

Updating the doc mapping doesn't reindex existing data. Queries and answers are mapped on a best effort basis when querying older splits.
It is also not possible to update the timestamp field, or to modify/remove existing non-default tokenizers (but it is possible to change
which tokenizer is used for a field).

#### PUT payload

Expand All @@ -327,7 +332,7 @@ Updates the configurations of an index. This endpoint follows PUT semantics, whi
| `version` | `String` | Config format version, use the same as your Quickwit version. | _required_ |
| `index_id` | `String` | Index ID, must be the same index as in the request URI. | _required_ |
| `index_uri` | `String` | Defines where the index files are stored. Cannot be updated. | `{current_index_uri}` |
| `doc_mapping` | `DocMapping` | Doc mapping object as specified in the [index config docs](../configuration/index-config.md#doc-mapping). Cannot be updated. | _required_ |
| `doc_mapping` | `DocMapping` | Doc mapping object as specified in the [index config docs](../configuration/index-config.md#doc-mapping). | _required_ |
| `indexing_settings` | `IndexingSettings` | Indexing settings object as specified in the [index config docs](../configuration/index-config.md#indexing-settings). | |
| `search_settings` | `SearchSettings` | Search settings object as specified in the [index config docs](../configuration/index-config.md#search-settings). | |
| `retention` | `Retention` | Retention policy object as specified in the [index config docs](../configuration/index-config.md#retention-policy). | |
Expand Down
175 changes: 165 additions & 10 deletions quickwit/quickwit-config/src/index_config/serialize.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,11 @@
// You should have received a copy of the GNU Affero General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.

use std::collections::HashSet;

use anyhow::{ensure, Context};
use quickwit_common::uri::Uri;
use quickwit_doc_mapper::DefaultDocMapperBuilder;
use quickwit_proto::types::{DocMappingUid, IndexId};
use serde::{Deserialize, Serialize};
use tracing::info;
Expand Down Expand Up @@ -81,7 +84,7 @@ pub fn load_index_config_update(
.index_uri
.parent()
.expect("index URI should have a parent");
let new_index_config = load_index_config_from_user_config(
let mut new_index_config = load_index_config_from_user_config(
config_format,
index_config_bytes,
current_index_parent_dir,
Expand All @@ -98,12 +101,63 @@ pub fn load_index_config_update(
current_index_config.index_uri,
new_index_config.index_uri
);
ensure!(
current_index_config
.doc_mapping
.eq_ignore_doc_mapping_uid(&new_index_config.doc_mapping),
"`doc_mapping` cannot be updated"
);

// verify the new mapping is coherent
let doc_mapper_builder = DefaultDocMapperBuilder {
doc_mapping: new_index_config.doc_mapping.clone(),
default_search_fields: new_index_config
.search_settings
.default_search_fields
.clone(),
};
doc_mapper_builder
.try_build()
.context("invalid mapping update")?;

{
let new_mapping_uid = new_index_config.doc_mapping.doc_mapping_uid;
// we verify whether they are equal ignoring the mapping uid as it is generated at random:
// we don't want to record a mapping change when nothing really happened.
new_index_config.doc_mapping.doc_mapping_uid =
current_index_config.doc_mapping.doc_mapping_uid;
if new_index_config.doc_mapping != current_index_config.doc_mapping {
new_index_config.doc_mapping.doc_mapping_uid = new_mapping_uid;
ensure!(
current_index_config.doc_mapping.doc_mapping_uid
!= new_index_config.doc_mapping.doc_mapping_uid,
"`doc_mapping_doc_mapping_uid` must change when the doc mapping is updated",
);
ensure!(
current_index_config.doc_mapping.timestamp_field
== new_index_config.doc_mapping.timestamp_field,
"`doc_mapping.timestamp_field` cannot be updated, current value {}, new expected \
value {}",
current_index_config
.doc_mapping
.timestamp_field
.as_deref()
.unwrap_or("<none>"),
new_index_config
.doc_mapping
.timestamp_field
.as_deref()
.unwrap_or("<none>"),
);
// TODO: i'm not sure this is necessary, we can relax this requirement once we know
// for sure
let current_tokenizers: HashSet<_> =
current_index_config.doc_mapping.tokenizers.iter().collect();
let new_tokenizers: HashSet<_> =
new_index_config.doc_mapping.tokenizers.iter().collect();
ensure!(
new_tokenizers.is_superset(&current_tokenizers),
"`.doc_mapping.tokenizers` must be a superset of previously available tokenizers"
);
} else {
// the docmapping is unchanged, keep the old uid
}
}

Ok(new_index_config)
}

Expand Down Expand Up @@ -440,12 +494,113 @@ mod test {
tokenizer: default
record: position
"#;
let load_error = load_index_config_update(
let updated_config = load_index_config_update(
ConfigFormat::Yaml,
updated_config_yaml.as_bytes(),
&original_config,
)
.unwrap();
assert_eq!(updated_config.doc_mapping.field_mappings.len(), 1);
}

#[test]
fn test_update_doc_mappings_failing_cases() {
let original_config_yaml = r#"
version: 0.8
index_id: hdfs-logs
doc_mapping:
mode: lenient
doc_mapping_uid: 00000000000000000000000000
timestamp_field: timestamp
field_mappings:
- name: timestamp
type: datetime
fast: true
"#;
let original_config: IndexConfig = load_index_config_from_user_config(
ConfigFormat::Yaml,
original_config_yaml.as_bytes(),
&Uri::for_test("s3://mybucket"),
)
.unwrap();

let updated_config_yaml = r#"
version: 0.8
index_id: hdfs-logs
doc_mapping:
mode: lenient
doc_mapping_uid: 00000000000000000000000000
timestamp_field: timestamp
field_mappings:
- name: timestamp
type: datetime
fast: true
- name: body
type: text
tokenizer: default
record: position
"#;
load_index_config_update(
ConfigFormat::Yaml,
updated_config_yaml.as_bytes(),
&original_config,
)
.expect_err("mapping changed but uid fixed should error");

let updated_config_yaml = r#"
version: 0.8
index_id: hdfs-logs
doc_mapping:
mode: lenient
field_mappings:
- name: timestamp
type: datetime
fast: true
"#;
load_index_config_update(
ConfigFormat::Yaml,
updated_config_yaml.as_bytes(),
&original_config,
)
.expect_err("timestamp field removed should error");

let updated_config_yaml = r#"
version: 0.8
index_id: hdfs-logs
doc_mapping:
mode: lenient
timestamp_field: timestamp
field_mappings:
- name: body
type: text
tokenizer: default
record: position
"#;
load_index_config_update(
ConfigFormat::Yaml,
updated_config_yaml.as_bytes(),
&original_config,
)
.expect_err("field required for timestamp is absent");

let updated_config_yaml = r#"
version: 0.8
index_id: hdfs-logs
doc_mapping:
mode: lenient
timestamp_field: timestamp
field_mappings:
- name: timestamp
type: datetime
fast: true
search_settings:
default_search_fields: ["i_dont_exist"]
"#;
load_index_config_update(
ConfigFormat::Yaml,
updated_config_yaml.as_bytes(),
&original_config,
)
.unwrap_err();
assert!(format!("{:?}", load_error).contains("`doc_mapping` cannot be updated"));
.expect_err("field required for default search is absent");
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ use tantivy::tokenizer::{
};

/// A `TokenizerEntry` defines a custom tokenizer with its name and configuration.
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, utoipa::ToSchema)]
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, Eq, Hash, utoipa::ToSchema)]
pub struct TokenizerEntry {
/// Tokenizer name.
pub name: String,
Expand All @@ -36,7 +36,7 @@ pub struct TokenizerEntry {
}

/// Tokenizer configuration.
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, utoipa::ToSchema)]
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, Eq, Hash, utoipa::ToSchema)]
pub struct TokenizerConfig {
#[serde(flatten)]
pub(crate) tokenizer_type: TokenizerType,
Expand Down Expand Up @@ -94,7 +94,7 @@ pub fn analyze_text(text: &str, tokenizer: &TokenizerConfig) -> anyhow::Result<V
Ok(tokens)
}

#[derive(Clone, Debug, PartialEq, Serialize, Deserialize, utoipa::ToSchema)]
#[derive(Clone, Debug, PartialEq, Eq, Hash, Serialize, Deserialize, utoipa::ToSchema)]
#[serde(rename_all = "snake_case")]
pub enum TokenFilterType {
RemoveLong,
Expand Down Expand Up @@ -122,7 +122,7 @@ impl TokenFilterType {
}
}

#[derive(Clone, Debug, PartialEq, Serialize, Deserialize, utoipa::ToSchema)]
#[derive(Clone, Debug, PartialEq, Eq, Hash, Serialize, Deserialize, utoipa::ToSchema)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum TokenizerType {
#[cfg(any(test, feature = "multilang"))]
Expand All @@ -133,7 +133,7 @@ pub enum TokenizerType {
SourceCode,
}

#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, utoipa::ToSchema)]
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, Eq, Hash, utoipa::ToSchema)]
#[serde(deny_unknown_fields)]
pub struct NgramTokenizerOption {
pub min_gram: usize,
Expand All @@ -142,7 +142,7 @@ pub struct NgramTokenizerOption {
pub prefix_only: bool,
}

#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, utoipa::ToSchema)]
#[derive(Clone, Serialize, Deserialize, Debug, PartialEq, Eq, Hash, utoipa::ToSchema)]
#[serde(deny_unknown_fields)]
pub struct RegexTokenizerOption {
pub pattern: String,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,8 @@ use std::ops::Bound;
use itertools::Itertools;
use quickwit_common::pretty::PrettySample;
use quickwit_config::{
IndexingSettings, RetentionPolicy, SearchSettings, SourceConfig, INGEST_V2_SOURCE_ID,
DocMapping, IndexingSettings, RetentionPolicy, SearchSettings, SourceConfig,
INGEST_V2_SOURCE_ID,
};
use quickwit_proto::metastore::{
AcquireShardsRequest, AcquireShardsResponse, DeleteQuery, DeleteShardsRequest,
Expand Down Expand Up @@ -231,6 +232,12 @@ impl FileBackedIndex {
self.metadata.set_indexing_settings(search_settings)
}

/// Replaces the doc mapping in the index config, returning whether a mutation occurred, or
/// possibly an error if this change is not allowed
trinity-1686a marked this conversation as resolved.
Show resolved Hide resolved
pub fn set_doc_mapping(&mut self, doc_mapping: DocMapping) -> bool {
self.metadata.set_doc_mapping(doc_mapping)
}

/// Stages a single split.
///
/// If a split already exists and is in the [SplitState::Staged] state,
Expand Down
2 changes: 2 additions & 0 deletions quickwit/quickwit-metastore/src/metastore/file_backed/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -505,13 +505,15 @@ impl MetastoreService for FileBackedMetastore {
let retention_policy_opt = request.deserialize_retention_policy()?;
let search_settings = request.deserialize_search_settings()?;
let indexing_settings = request.deserialize_indexing_settings()?;
let doc_mapping = request.deserialize_doc_mapping()?;
let index_uid = request.index_uid();

let index_metadata = self
.mutate(index_uid, |index| {
let mut mutation_occurred = index.set_retention_policy(retention_policy_opt);
mutation_occurred |= index.set_search_settings(search_settings);
mutation_occurred |= index.set_indexing_settings(indexing_settings);
mutation_occurred |= index.set_doc_mapping(doc_mapping);

let index_metadata = index.metadata().clone();

Expand Down
13 changes: 12 additions & 1 deletion quickwit/quickwit-metastore/src/metastore/index_metadata/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ use std::collections::HashMap;

use quickwit_common::uri::Uri;
use quickwit_config::{
IndexConfig, IndexingSettings, RetentionPolicy, SearchSettings, SourceConfig,
DocMapping, IndexConfig, IndexingSettings, RetentionPolicy, SearchSettings, SourceConfig,
};
use quickwit_proto::metastore::{EntityKind, MetastoreError, MetastoreResult};
use quickwit_proto::types::{IndexUid, SourceId};
Expand Down Expand Up @@ -130,6 +130,17 @@ impl IndexMetadata {
}
}

/// Replaces the doc mapping in the index config, returning whether a mutation occurred, or
/// possibly an error if this change is not allowed
trinity-1686a marked this conversation as resolved.
Show resolved Hide resolved
pub fn set_doc_mapping(&mut self, doc_mapping: DocMapping) -> bool {
if self.index_config.doc_mapping != doc_mapping {
self.index_config.doc_mapping = doc_mapping;
true
} else {
false
}
}

/// Adds a source to the index. Returns an error if the source already exists.
pub fn add_source(&mut self, source_config: SourceConfig) -> MetastoreResult<()> {
match self.sources.entry(source_config.source_id.clone()) {
Expand Down
Loading
Loading