From ba5ab948669b8afd0862532fe2a6999a03c9c27e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Krzysztof=20Jamr=C3=B3z?= <79092062+k-jamroz@users.noreply.github.com> Date: Mon, 25 Nov 2024 10:39:48 +0100 Subject: [PATCH] Add vector collection backup documentation, update after upgrade to JVector 3 [AI-155] [AI-230] (#1366) In 6.0: 1. fault tolerance is supported 2. memory leak is no longer present 3. new default `max-degree` Also fixed which config settings are optional. --------- Co-authored-by: Amanda Lindsay --- .../pages/distributed-data-structures.adoc | 1 + .../pages/vector-collections.adoc | 45 ++++++++++--------- .../pages/vector-search-overview.adoc | 14 ++++-- .../fault-tolerance/pages/backups.adoc | 1 + 4 files changed, 38 insertions(+), 23 deletions(-) diff --git a/docs/modules/data-structures/pages/distributed-data-structures.adoc b/docs/modules/data-structures/pages/distributed-data-structures.adoc index 7e08280a8..2697b3ac0 100644 --- a/docs/modules/data-structures/pages/distributed-data-structures.adoc +++ b/docs/modules/data-structures/pages/distributed-data-structures.adoc @@ -143,6 +143,7 @@ when performing concurrent activities. |<> |=== +[#aiml-data-structures] == AI/ML Data Structures [cols="20%a,40%a,20%a,20%a"] |=== diff --git a/docs/modules/data-structures/pages/vector-collections.adoc b/docs/modules/data-structures/pages/vector-collections.adoc index bb42e7520..ee0892e40 100644 --- a/docs/modules/data-structures/pages/vector-collections.adoc +++ b/docs/modules/data-structures/pages/vector-collections.adoc @@ -50,6 +50,17 @@ Can include letters, numbers, and the symbols `-`, `_`, `*`. |Information about indexes configuration |Required |`NULL` + +|backup-count +|Number of synchronous backups. See xref:data-structures:backing-up-maps.adoc#in-memory-backup-types[Backup Types] +|Optional +|`1` + +|async-backup-count +|Number of asynchronous backups. 
See xref:data-structures:backing-up-maps.adoc#in-memory-backup-types[Backup Types] +|Optional +|`0` + |=== .Index configuration options @@ -75,19 +86,19 @@ For further information on distance metrics, see the < + 1 + 0 6 @@ -150,6 +163,8 @@ YAML:: hazelcast: vector-collection: books: + backup-count: 1 + async-backup-count: 0 indexes: - name: word2vec-index dimension: 6 @@ -169,6 +184,8 @@ Java:: ---- Config config = new Config(); VectorCollectionConfig collectionConfig = new VectorCollectionConfig("books") + .setBackupCount(1) + .setAsyncBackupCount(0) .addVectorIndexConfig( new VectorIndexConfig() .setName("word2vec-index") @@ -191,7 +208,7 @@ Python:: -- [source,python] ---- -client.create_vector_collection_config("books", indexes=[ +client.create_vector_collection_config("books", backup_count=1, async_backup_count=0, indexes=[ IndexConfig(name="word2vec-index", metric=Metric.DOT, dimension=6), IndexConfig(name="glove-index", metric=Metric.DOT, dimension=10, max_degree=32, ef_construction=256, use_deduplication=False), @@ -692,17 +709,5 @@ As this is a beta version, Vector Collection has some limitations; the most sign 1. The API could change in future versions 2. The rolling-upgrade compatibility guarantees do not apply for vector collections. You might need to delete existing vector collections before migrating to a future version of Hazelcast -3. The lack of fault tolerance, as backups cannot yet be configured. However, data in collections is migrated to other cluster members on graceful shutdown and a new member joining the cluster, which means that normal cluster maintenance (such as a rolling restart) is possible without data loss. -4. Only on-heap storage of vector collections is available - - -== Known issue - -There is currently a known issue that has potential for causing a memory leak in Vector collections in some scenarios: - -1. Using `destroy` or `clear` on a non-empty vector collection. -2. 
Making a vector collection partition empty by removing the last entry using `deleteAsync` or `removeAsync`. -3. Updating the only entry in a vector collection partition using one of the `put`/`set` methods. -4. Repeating migrations back and forth when a member is not restarted. This should not significantly affect rolling restart, provided sufficient heap margin. The leak may manifest itself when a subset of members is not restarted and the rest of them are repeatedly shut down or restarted gracefully. +3. Only on-heap storage of vector collections is available -The workaround for scenarios 1 - 3 is to avoid those situations or restart the affected cluster. For scenario 4 the workaround is to restart the affected member or cluster. The restart can be graceful, which should not cause loss of data. diff --git a/docs/modules/data-structures/pages/vector-search-overview.adoc b/docs/modules/data-structures/pages/vector-search-overview.adoc index 113896fa2..947ab23f2 100644 --- a/docs/modules/data-structures/pages/vector-search-overview.adoc +++ b/docs/modules/data-structures/pages/vector-search-overview.adoc @@ -43,17 +43,25 @@ The index is based on the link:https://github.com/jbellis/jvector[JVector] libra Each collection is partitioned and replicated based on the system's general partitioning rules. Data partitioning is carried out using the collection key. -For further information on Hazelcast partitioning, see xref:architecture:data-partitioning.adoc[Data Partitioning and Replication]. +Vector collection is an xref:distributed-data-structures.adoc#aiml-data-structures[AP data structure] and implements standard xref:data-structures:backing-up-maps.adoc#in-memory-backup-types[in-memory backup types]. +Vector collection does not currently support xref:data-structures:backing-up-maps.adoc#enabling-in-memory-backup-reads-embedded-mode[reading from backup] or xref:data-structures:backing-up-maps.adoc#file-based-backups[file-based backups].
-NOTE: Version 5.5/beta supports partitioning and migration but does not include support for the backup process. +For further information on Hazelcast partitioning, see xref:architecture:data-partitioning.adoc[Data Partitioning and Replication]. === Data store + Hazelcast stores data in-memory (RAM) for faster access. Presently, the only available data storage option is the JVM heap store. === Fault Tolerance + Hazelcast distributes storage data across all cluster members. In the event of a graceful shutdown, the data is migrated to remaining active members. -In version 5.5, there is no automatic data restoration in the event of an unexpected member loss. +If a member crashes and backups are configured, the backups are used to restore the data. +If no backups are configured, data can be lost. + +NOTE: You can disable backups to speed up data ingestion into Vector Collection and reduce memory usage +in development and test environments that do not require fault tolerance. +With a single-member cluster (for example, an embedded dev cluster), you do not need to disable backups because backups are not used in that case. == Partitioned similarity search diff --git a/docs/modules/fault-tolerance/pages/backups.adoc b/docs/modules/fault-tolerance/pages/backups.adoc index e475afb06..1c0ae2059 100644 --- a/docs/modules/fault-tolerance/pages/backups.adoc +++ b/docs/modules/fault-tolerance/pages/backups.adoc @@ -45,6 +45,7 @@ for the data structures: * xref:data-structures:set.adoc#configuring-set[Sets], xref:data-structures:list.adoc#configuring-list[Lists] * xref:data-structures:ringbuffer.adoc#backing-up-ringbuffer[Ringbuffers] * xref:data-structures:cardinality-estimator-service.adoc[Cardinality Estimators] +* xref:data-structures:vector-collections.adoc#configuration[Vector Collections] See xref:fault-tolerance:fault-tolerance.adoc[this section] for backup information about the Hazelcast jobs.