From ba5ab948669b8afd0862532fe2a6999a03c9c27e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Krzysztof=20Jamr=C3=B3z?=
 <79092062+k-jamroz@users.noreply.github.com>
Date: Mon, 25 Nov 2024 10:39:48 +0100
Subject: [PATCH] Add vector collection backup documentation, update after
 upgrade to JVector 3 [AI-155] [AI-230] (#1366)

In 6.0:

1. fault tolerance is supported
2. memory leak is no longer present
3. new default `max-degree`

Also fixed which config settings are optional.

---------

Co-authored-by: Amanda Lindsay <v-amanda.lindsay@hazelcast.com>
---
 .../pages/distributed-data-structures.adoc    |  1 +
 .../pages/vector-collections.adoc             | 45 ++++++++++---------
 .../pages/vector-search-overview.adoc         | 14 ++++--
 .../fault-tolerance/pages/backups.adoc        |  1 +
 4 files changed, 38 insertions(+), 23 deletions(-)
diff --git a/docs/modules/data-structures/pages/distributed-data-structures.adoc b/docs/modules/data-structures/pages/distributed-data-structures.adoc
index 7e08280a8..2697b3ac0 100644
--- a/docs/modules/data-structures/pages/distributed-data-structures.adoc
+++ b/docs/modules/data-structures/pages/distributed-data-structures.adoc
@@ -143,6 +143,7 @@ when performing concurrent activities.
 |<<ap-data,Availability and partition tolerance>>
 |===
 
+[#aiml-data-structures]
 == AI/ML Data Structures
 [cols="20%a,40%a,20%a,20%a"]
 |===
diff --git a/docs/modules/data-structures/pages/vector-collections.adoc b/docs/modules/data-structures/pages/vector-collections.adoc
index bb42e7520..ee0892e40 100644
--- a/docs/modules/data-structures/pages/vector-collections.adoc
+++ b/docs/modules/data-structures/pages/vector-collections.adoc
@@ -50,6 +50,17 @@ Can include letters, numbers, and the symbols `-`, `_`, `*`.
 |Information about indexes configuration
 |Required
 |`NULL`
+
+|backup-count
+|Number of synchronous backups. See xref:data-structures:backing-up-maps.adoc#in-memory-backup-types[Backup Types].
+|Optional
+|`1`
+
+|async-backup-count
+|Number of asynchronous backups. See xref:data-structures:backing-up-maps.adoc#in-memory-backup-types[Backup Types].
+|Optional
+|`0`
+
 |===
 
 .Index configuration options
@@ -75,19 +86,19 @@ For further information on distance metrics, see the <<available-metrics, Availa
 |`N/A`
 
 |max-degree
-|Used to calculate the maximum number of neighbors per node. The calculation used is max-degree * 2
-|Required
-|`16`
+|Maximum number of neighbors per node. The meaning of this parameter differs from version 5.5, where the maximum was calculated as max-degree * 2.
+|Optional
+|`32`
 
 |ef-construction
 |The size of the search queue to use when finding nearest neighbors.
-|Required
+|Optional
 |`100`
 
 |use-deduplication
-|Whether or not to use vector deduplication.
+|Whether to use vector deduplication.
 When disabled, each added vector is treated as a distinct vector in the index, even if it is identical to an existing one. When enabled, the index consumes less space as duplicates share a vector, but the time required to add a vector increases.
-|Required
+|Optional
 |`TRUE`
 
 |===
@@ -125,6 +136,8 @@ XML::
 ----
 <hazelcast>
     <vector-collection name="books">
+        <backup-count>1</backup-count>
+        <async-backup-count>0</async-backup-count>
         <indexes>
             <index name="word2vec-index">
                 <dimension>6</dimension>
@@ -150,6 +163,8 @@ YAML::
 hazelcast:
   vector-collection:
     books:
+      backup-count: 1
+      async-backup-count: 0
       indexes:
         - name: word2vec-index
           dimension: 6
@@ -169,6 +184,8 @@ Java::
 ----
 Config config = new Config();
 VectorCollectionConfig collectionConfig = new VectorCollectionConfig("books")
+    .setBackupCount(1)
+    .setAsyncBackupCount(0)
     .addVectorIndexConfig(
             new VectorIndexConfig()
                 .setName("word2vec-index")
@@ -191,7 +208,7 @@ Python::
 --
 [source,python]
 ----
-client.create_vector_collection_config("books", indexes=[
+client.create_vector_collection_config("books", backup_count=1, async_backup_count=0, indexes=[
     IndexConfig(name="word2vec-index", metric=Metric.DOT, dimension=6),
     IndexConfig(name="glove-index", metric=Metric.DOT, dimension=10,
                 max_degree=32, ef_construction=256, use_deduplication=False),
@@ -692,17 +709,5 @@ As this is a beta version, Vector Collection has some limitations; the most sign
 
 1. The API could change in future versions
 2. The rolling-upgrade compatibility guarantees do not apply for vector collections. You might need to delete existing vector collections before migrating to a future version of Hazelcast
-3. The lack of fault tolerance, as backups cannot yet be configured. However, data in collections is migrated to other cluster members on graceful shutdown and a new member joining the cluster, which means that normal cluster maintenance (such as a rolling restart) is possible without data loss.
-4. Only on-heap storage of vector collections is available
-
-
-== Known issue
-
-There is currently a known issue that has potential for causing a memory leak in Vector collections in some scenarios:
-
-1. Using `destroy` or `clear` on a non-empty vector collection.
-2. Making a vector collection partition empty by removing the last entry using `deleteAsync` or `removeAsync`.
-3. Updating the only entry in a vector collection partition using one of the `put`/`set` methods.
-4. Repeating migrations back and forth when a member is not restarted. This should not significantly affect rolling restart, provided sufficient heap margin. The leak may manifest itself when a subset of members is not restarted and the rest of them are repeatedly shut down or restarted gracefully.
+3. Only on-heap storage of vector collections is available
 
-The workaround for scenarios 1 - 3 is to avoid those situations or restart the affected cluster. For scenario 4 the workaround is to restart the affected member or cluster. The restart can be graceful, which should not cause loss of data. 
diff --git a/docs/modules/data-structures/pages/vector-search-overview.adoc b/docs/modules/data-structures/pages/vector-search-overview.adoc
index 113896fa2..947ab23f2 100644
--- a/docs/modules/data-structures/pages/vector-search-overview.adoc
+++ b/docs/modules/data-structures/pages/vector-search-overview.adoc
@@ -43,17 +43,25 @@ The index is based on the link:https://github.com/jbellis/jvector[JVector] libra
 
 Each collection is partitioned and replicated based on the system's general partitioning rules. Data partitioning is carried out using the collection key.
 
-For further information on Hazelcast partitioning, see xref:architecture:data-partitioning.adoc[Data Partitioning and Replication].
+Vector collection is an xref:distributed-data-structures.adoc#aiml-data-structures[AP data structure] and implements standard xref:data-structures:backing-up-maps.adoc#in-memory-backup-types[in-memory backup types].
+Vector collection does not currently support xref:data-structures:backing-up-maps.adoc#enabling-in-memory-backup-reads-embedded-mode[reading from backup] or xref:data-structures:backing-up-maps.adoc#file-based-backups[file-based backups].
 
-NOTE: Version 5.5/beta supports partitioning and migration but does not include support for the backup process.
+For further information on Hazelcast partitioning, see xref:architecture:data-partitioning.adoc[Data Partitioning and Replication].
 
 === Data store
+
 Hazelcast stores data in-memory (RAM) for faster access. Presently, the only available data storage option is the JVM heap store.
 
 === Fault Tolerance
+
 Hazelcast distributes storage data across all cluster members.
 In the event of a graceful shutdown, the data is migrated to remaining active members.
-In version 5.5, there is no automatic data restoration in the event of an unexpected member loss.
+In the event of a member crash, any configured backups are used to restore the data.
+If no backups are configured, data can be lost.
+
+NOTE: You can disable backups to speed up data ingestion into a vector collection and reduce memory usage
+in development and test environments that do not require fault tolerance.
+In a single-member cluster (for example, an embedded dev cluster), you do not need to disable backups because they are not used.
 
 == Partitioned similarity search
 
diff --git a/docs/modules/fault-tolerance/pages/backups.adoc b/docs/modules/fault-tolerance/pages/backups.adoc
index e475afb06..1c0ae2059 100644
--- a/docs/modules/fault-tolerance/pages/backups.adoc
+++ b/docs/modules/fault-tolerance/pages/backups.adoc
@@ -45,6 +45,7 @@ for the data structures:
 * xref:data-structures:set.adoc#configuring-set[Sets], xref:data-structures:list.adoc#configuring-list[Lists]
 * xref:data-structures:ringbuffer.adoc#backing-up-ringbuffer[Ringbuffers]
 * xref:data-structures:cardinality-estimator-service.adoc[Cardinality Estimators]
+* xref:data-structures:vector-collections.adoc#configuration[Vector Collections]
 
 See xref:fault-tolerance:fault-tolerance.adoc[this section] for backup information about the Hazelcast jobs.