Un-embed clusters as an alternative to summarizing #22

enjalot · 2024-02-21T17:55:20Z

With vec2text we should be able to get a reasonably useful sentence out of the average embedding of a cluster. This could serve as the cluster label, or perhaps as guidance for the summarization step that generates the label.
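
For reference, the "average embedding of a cluster" step could look roughly like the sketch below. The function name `cluster_centroids` and the optional L2 normalization are assumptions for illustration, not something already in the codebase; whether to normalize likely depends on what the inversion model expects as input.

```python
import numpy as np


def cluster_centroids(embeddings, labels, normalize=True):
    """Return {cluster_id: mean embedding} for every distinct cluster label.

    embeddings: array-like of shape (n_points, dim)
    labels:     array-like of shape (n_points,) with an integer cluster id per point
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)
    labels = np.asarray(labels)
    centroids = {}
    for cluster_id in np.unique(labels):
        # Average all embeddings assigned to this cluster.
        mean = embeddings[labels == cluster_id].mean(axis=0)
        if normalize:
            # Assumption: rescale to unit length, skipping all-zero vectors.
            norm = np.linalg.norm(mean)
            if norm > 0:
                mean = mean / norm
        centroids[int(cluster_id)] = mean
    return centroids
```

Each centroid would then be the vector handed to the inversion model.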

https://github.com/jxmorris12/vec2text/

There are pre-trained inversion models, for example for OpenAI's text-embedding-ada-002, and perhaps others. Part of this issue might be helping to train inversion models for the other supported embedding models in our list.

One could imagine a new API endpoint that takes in an embedding vector and outputs a sentence. We could also have an alternative summarize script that uses this instead of (or in conjunction with) summarizing. We currently have a description field per cluster that is not really being used; it could be populated with this, or we could add another field.
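
A minimal sketch of how populating the description field might work, assuming each cluster is a dict with a `cluster` id and a `description` key (a hypothetical schema, not necessarily what we store today). The `invert` argument is a placeholder for whatever turns vectors into sentences, for instance a wrapper around a vec2text model; it is passed in as a plain callable so the script does not depend on any particular inversion backend.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
# Hypothetical interface: takes a batch of vectors, returns one sentence per vector.
Inverter = Callable[[List[Vector]], List[str]]


def populate_descriptions(
    clusters: List[dict],
    centroids: Dict[int, Vector],
    invert: Inverter,
    field: str = "description",
) -> List[dict]:
    """Return copies of the cluster records with `field` set to the un-embedded sentence.

    Clusters without a centroid keep whatever value they already had.
    """
    ids = [c["cluster"] for c in clusters if c["cluster"] in centroids]
    # Invert all centroids in one batch, since model calls are usually the slow part.
    sentences = invert([centroids[i] for i in ids]) if ids else []
    by_id = dict(zip(ids, sentences))
    return [{**c, field: by_id.get(c["cluster"], c.get(field, ""))} for c in clusters]


if __name__ == "__main__":
    # Stub inverter for demonstration only; a real one would call an inversion model.
    fake_invert = lambda vecs: [f"sentence for a {len(v)}-d vector" for v in vecs]
    rows = [{"cluster": 0, "label": "a", "description": ""}]
    print(populate_descriptions(rows, {0: [0.1, 0.2, 0.3]}, fake_invert))
```

The same function could sit behind the endpoint idea: the endpoint would accept one vector, call `invert` on a single-item batch, and return the sentence. Passing `field="unembedded"` (or any other name) would cover the "add another field" option.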

dhruv-anand-aintech · 2024-03-03T19:17:40Z

Could BERTopic also be a viable alternative to using an LLM for summarization/topic name generation?

enjalot added the enhancement, help wanted, and python labels on Feb 21, 2024

enjalot added this to the 2.0 milestone on Sep 20, 2024
