Add vector collection backup documentation, update after upgrade to JVector 3 [AI-155] [AI-230] (#1366)

In 6.0:

1. fault tolerance is supported
2. memory leak is no longer present
3. new default `max-degree`

Also fixed which config settings are optional.

---------

Co-authored-by: Amanda Lindsay <[email protected]>
k-jamroz and amandalindsay authored Nov 25, 2024
1 parent 8fabadb commit ba5ab94
Showing 4 changed files with 38 additions and 23 deletions.
@@ -143,6 +143,7 @@ when performing concurrent activities.
|<<ap-data,Availability and partition tolerance>>
|===

+[#aiml-data-structures]
== AI/ML Data Structures
[cols="20%a,40%a,20%a,20%a"]
|===
45 changes: 25 additions & 20 deletions docs/modules/data-structures/pages/vector-collections.adoc
@@ -50,6 +50,17 @@ Can include letters, numbers, and the symbols `-`, `_`, `*`.
|Information about indexes configuration
|Required
|`NULL`

+|backup-count
+|Number of synchronous backups. See xref:data-structures:backing-up-maps.adoc#in-memory-backup-types[Backup Types]
+|Optional
+|`1`
+
+|async-backup-count
+|Number of asynchronous backups. See xref:data-structures:backing-up-maps.adoc#in-memory-backup-types[Backup Types]
+|Optional
+|`0`

|===

.Index configuration options
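A back-of-the-envelope sketch of what the two backup settings above imply for the number of copies held in the cluster. It assumes that each backup is a full copy kept on a different member than the primary, so backups only exist when other members are available to hold them (the fault-tolerance hunk in this commit says as much for single-member clusters). `stored_copies` and its `members` parameter are hypothetical names for illustration, not part of the Hazelcast API.

```python
def stored_copies(backup_count: int = 1, async_backup_count: int = 0, members: int = 3) -> int:
    """Return how many copies of each entry exist: the primary plus effective backups.

    Assumption (not taken from the diff): every backup lives on a distinct
    member, so at most `members - 1` backups can actually be created.
    """
    if backup_count < 0 or async_backup_count < 0:
        raise ValueError("backup counts must be non-negative")
    if members < 1:
        raise ValueError("a cluster has at least one member")
    requested = backup_count + async_backup_count
    effective = min(requested, members - 1)
    return 1 + effective


if __name__ == "__main__":
    print(stored_copies())            # documented defaults on a 3-member cluster
    print(stored_copies(0, 0))        # backups disabled
    print(stored_copies(members=1))   # single member: backups are not used
```

Under these assumptions the documented defaults (`backup-count` 1, `async-backup-count` 0) keep two copies of each entry, and setting both to 0 keeps one.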
@@ -75,19 +86,19 @@ For further information on distance metrics, see the <<available-metrics, Availa
|`N/A`

|max-degree
-|Used to calculate the maximum number of neighbors per node. The calculation used is max-degree * 2
-|Required
-|`16`
+|Maximum number of neighbors per node. Note that the meaning of this parameter differs from that used in version 5.5.
+|Optional
+|`32`

|ef-construction
|The size of the search queue to use when finding nearest neighbors.
-|Required
+|Optional
|`100`

|use-deduplication
-|Whether or not to use vector deduplication.
+|Whether to use vector deduplication.
When disabled, each added vector is treated as a distinct vector in the index, even if it is identical to an existing one. When enabled, the index consumes less space as duplicates share a vector, but the time required to add a vector increases.
-|Required
+|Optional
|`TRUE`

|===
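The change in the meaning of `max-degree` can be checked with simple arithmetic. The sketch below uses only the two descriptions in this hunk (5.5: neighbors per node is `max-degree * 2`; 6.0: `max-degree` is the maximum itself). The function names are hypothetical helpers, not Hazelcast API.

```python
# Defaults as listed in the old and new rows of the table.
DEFAULT_MAX_DEGREE_5_5 = 16
DEFAULT_MAX_DEGREE_6_0 = 32


def max_neighbors_5_5(max_degree: int) -> int:
    """5.5 semantics: the configured value is doubled."""
    return max_degree * 2


def max_neighbors_6_0(max_degree: int) -> int:
    """6.0 semantics: the configured value is the maximum itself."""
    return max_degree


def equivalent_6_0_max_degree(max_degree_5_5: int) -> int:
    """6.0 setting that gives the same neighbor limit as a 5.5 setting."""
    return max_neighbors_5_5(max_degree_5_5)


if __name__ == "__main__":
    # Both defaults give the same per-node neighbor limit.
    print(max_neighbors_5_5(DEFAULT_MAX_DEGREE_5_5))
    print(max_neighbors_6_0(DEFAULT_MAX_DEGREE_6_0))
```

Read this way, the new default of `32` keeps the neighbor limit that the old default of `16` produced, while an explicitly configured 5.5 value would need to be doubled to keep the same limit.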
@@ -125,6 +136,8 @@ XML::
----
<hazelcast>
<vector-collection name="books">
+<backup-count>1</backup-count>
+<async-backup-count>0</async-backup-count>
<indexes>
<index name="word2vec-index">
<dimension>6</dimension>
@@ -150,6 +163,8 @@ YAML::
hazelcast:
vector-collection:
books:
+backup-count: 1
+async-backup-count: 0
indexes:
- name: word2vec-index
dimension: 6
@@ -169,6 +184,8 @@ Java::
----
Config config = new Config();
VectorCollectionConfig collectionConfig = new VectorCollectionConfig("books")
+.setBackupCount(1)
+.setAsyncBackupCount(0)
.addVectorIndexConfig(
new VectorIndexConfig()
.setName("word2vec-index")
@@ -191,7 +208,7 @@ Python::
--
[source,python]
----
-client.create_vector_collection_config("books", indexes=[
+client.create_vector_collection_config("books", backup_count=1, async_backup_count=0, indexes=[
IndexConfig(name="word2vec-index", metric=Metric.DOT, dimension=6),
IndexConfig(name="glove-index", metric=Metric.DOT, dimension=10,
max_degree=32, ef_construction=256, use_deduplication=False),
@@ -692,17 +709,5 @@ As this is a beta version, Vector Collection has some limitations; the most sign

1. The API could change in future versions
2. The rolling-upgrade compatibility guarantees do not apply for vector collections. You might need to delete existing vector collections before migrating to a future version of Hazelcast
-3. The lack of fault tolerance, as backups cannot yet be configured. However, data in collections is migrated to other cluster members on graceful shutdown and a new member joining the cluster, which means that normal cluster maintenance (such as a rolling restart) is possible without data loss.
-4. Only on-heap storage of vector collections is available
-
-
-== Known issue
-
-There is currently a known issue that has potential for causing a memory leak in Vector collections in some scenarios:
-
-1. Using `destroy` or `clear` on a non-empty vector collection.
-2. Making a vector collection partition empty by removing the last entry using `deleteAsync` or `removeAsync`.
-3. Updating the only entry in a vector collection partition using one of the `put`/`set` methods.
-4. Repeating migrations back and forth when a member is not restarted. This should not significantly affect rolling restart, provided sufficient heap margin. The leak may manifest itself when a subset of members is not restarted and the rest of them are repeatedly shut down or restarted gracefully.
+3. Only on-heap storage of vector collections is available

-The workaround for scenarios 1 - 3 is to avoid those situations or restart the affected cluster. For scenario 4 the workaround is to restart the affected member or cluster. The restart can be graceful, which should not cause loss of data.
14 changes: 11 additions & 3 deletions docs/modules/data-structures/pages/vector-search-overview.adoc
@@ -43,17 +43,25 @@ The index is based on the link:https://github.com/jbellis/jvector[JVector] libra

Each collection is partitioned and replicated based on the system's general partitioning rules. Data partitioning is carried out using the collection key.

-For further information on Hazelcast partitioning, see xref:architecture:data-partitioning.adoc[Data Partitioning and Replication].
+Vector collection is an xref:distributed-data-structures.adoc#aiml-data-structures[AP data structure] and implements standard xref:data-structures:backing-up-maps.adoc#in-memory-backup-types[in-memory backup types].
+Vector collection does not currently support xref:data-structures:backing-up-maps.adoc#enabling-in-memory-backup-reads-embedded-mode[reading from backup] or xref:data-structures:backing-up-maps.adoc#file-based-backups[file-based backups].

-NOTE: Version 5.5/beta supports partitioning and migration but does not include support for the backup process.
+For further information on Hazelcast partitioning, see xref:architecture:data-partitioning.adoc[Data Partitioning and Replication].

=== Data store

Hazelcast stores data in-memory (RAM) for faster access. Presently, the only available data storage option is the JVM heap store.

=== Fault Tolerance

Hazelcast distributes storage data across all cluster members.
In the event of a graceful shutdown, the data is migrated to remaining active members.
-In version 5.5, there is no automatic data restoration in the event of an unexpected member loss.
+In the event of a member crash, if backups were configured, they are used to restore the data.
+If no backups were configured, data can be lost.
+
+NOTE: You can disable backups to speed up data ingestion into Vector Collection and reduce memory usage
+in development and test environments which do not require fault tolerance.
+With a single-member cluster (for example, an embedded dev cluster), you do not need to disable backups because these are not used in this case.

== Partitioned similarity search

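The three cases described in the Fault Tolerance hunk above can be written out as a small decision function. This is only a sketch of the documented behavior, assuming that at least one other member remains in the cluster to receive migrated data or to hold the backups. `Outcome` and `outcome_when_member_leaves` are hypothetical names, not Hazelcast API.

```python
from enum import Enum


class Outcome(Enum):
    MIGRATED = "data is migrated to the remaining members"
    RESTORED = "data is restored from backups"
    AT_RISK = "data can be lost"


def outcome_when_member_leaves(graceful: bool,
                               backup_count: int = 1,
                               async_backup_count: int = 0) -> Outcome:
    """Map the way a member leaves, plus the backup settings, to the documented result."""
    if backup_count < 0 or async_backup_count < 0:
        raise ValueError("backup counts must be non-negative")
    if graceful:
        # Graceful shutdown: data is migrated regardless of the backup settings.
        return Outcome.MIGRATED
    if backup_count + async_backup_count > 0:
        # Crash with backups configured: backups are used to restore the data.
        return Outcome.RESTORED
    # Crash with no backups configured.
    return Outcome.AT_RISK


if __name__ == "__main__":
    print(outcome_when_member_leaves(graceful=False).value)
    print(outcome_when_member_leaves(graceful=False, backup_count=0).value)
```

With the default settings a crash falls into the restored case. Disabling both backup types, as the NOTE suggests for development and test environments, moves a crash into the at-risk case.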
1 change: 1 addition & 0 deletions docs/modules/fault-tolerance/pages/backups.adoc
@@ -45,6 +45,7 @@ for the data structures:
* xref:data-structures:set.adoc#configuring-set[Sets], xref:data-structures:list.adoc#configuring-list[Lists]
* xref:data-structures:ringbuffer.adoc#backing-up-ringbuffer[Ringbuffers]
* xref:data-structures:cardinality-estimator-service.adoc[Cardinality Estimators]
+* xref:data-structures:vector-collections.adoc#configuration[Vector Collections]
See xref:fault-tolerance:fault-tolerance.adoc[this section] for backup information about the Hazelcast jobs.
