pip install pinecone-datasets
Pinecone hosts a public datasets catalog. You can list the available datasets and load one by name using the list_datasets and load_dataset functions, which use the default catalog endpoint (currently GCS) to list and load datasets.
from pinecone_datasets import list_datasets, load_dataset
list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
dataset.head()
# Prints
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id ┆ values ┆ sparse_values ┆ metadata ┆ blob │
# │ ┆ ┆ ┆ ┆ │
# │ str ┆ list[f32] ┆ struct[2] ┆ struct[3] ┆ │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0 ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │ ┆ 0.0060... ┆ ┆ ┆ │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘
Each dataset has three main attributes: documents, queries, and metadata, which are lazily loaded the first time they are accessed. You may notice a delay on first access, as the underlying Parquet files are downloaded at that point.
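For example, the metadata attribute describes how the dataset was built. A minimal sketch (field names such as dense_model.dimension follow the dataset metadata schema used later in this guide; the exact fields can vary per dataset):
from pinecone_datasets import load_dataset
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
# Dataset-level metadata: embedding models, dimensions, etc.
print(dataset.metadata.name)
print(dataset.metadata.dense_model.dimension)  # used later when creating an index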
Pinecone Datasets is built on top of pandas: documents and queries are lazily loaded pandas DataFrames, so the full pandas API is available for working with the data. In addition, some helper functions are provided to access the data in a more convenient way.
The documents and queries DataFrames are accessed through the documents and queries properties. These properties are lazy and only load the data when first accessed.
import pandas as pd
from pinecone_datasets import load_dataset
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
document_df: pd.DataFrame = dataset.documents
query_df: pd.DataFrame = dataset.queries
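Because these are plain pandas DataFrames, the usual pandas operations apply. A minimal sketch (the column names match the schema shown in the table above):
import pandas as pd
from pinecone_datasets import load_dataset
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
document_df: pd.DataFrame = dataset.documents
# Regular pandas operations work as expected
print(len(document_df))              # number of document vectors
print(document_df.columns.tolist())  # ["id", "values", "sparse_values", "metadata", "blob"]
# Keep only the columns needed for a dense-only workflow
dense_only = document_df[["id", "values", "metadata"]]
print(dense_only.head())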
The Dataset class has helpers for iterating over your dataset. This is useful for upserting a dataset to an index, or for benchmarking.
# Batch iterator: yields lists of up to batch_size dicts, each with keys ("id", "values", "sparse_values", "metadata")
dataset.iter_documents(batch_size=100)
# Dict iterator: yields one dict per query with keys ("vector", "sparse_vector", "filter", "top_k")
dataset.iter_queries()
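For example, iter_queries can drive a simple latency benchmark against an existing index. A minimal sketch, assuming an index named "demo-index" already exists and is populated (the index name, API key placeholder, and timing logic are illustrative; sparse vectors are omitted for brevity):
import time
from pinecone import Pinecone
from pinecone_datasets import load_dataset
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("demo-index")
# Run every query in the dataset against the index and record latencies
latencies = []
for q in dataset.iter_queries():
    start = time.perf_counter()
    index.query(vector=q["vector"], top_k=q["top_k"], filter=q["filter"])
    latencies.append(time.perf_counter() - start)
print(f"mean query latency: {sum(latencies) / len(latencies):.4f}s")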
To upsert data to an index, first install the Pinecone SDK (pip install pinecone).
from pinecone import Pinecone, ServerlessSpec
from pinecone_datasets import load_dataset, list_datasets
# See what datasets are available
for ds in list_datasets():
    print(ds)
# Download embeddings data
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
# Instantiate a Pinecone client using API key from app.pinecone.io
pc = Pinecone(api_key="YOUR_API_KEY")
# Create a Pinecone index
index_config = pc.create_index(
    name="demo-index",
    dimension=dataset.metadata.dense_model.dimension,
    metric="dotproduct",  # required to upsert the dataset's sparse values alongside the dense ones
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
# Instantiate an index client
index = pc.Index(host=index_config.host)
# Upsert data from the dataset
index.upsert_from_dataframe(df=dataset.documents)
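After the upsert completes, you can sanity-check the result with standard SDK calls (note that stats for serverless indexes are eventually consistent, so the count may lag briefly):
# Verify the vector count against the dataset size
stats = index.describe_index_stats()
print(stats.total_vector_count, "vectors in index")
print(len(dataset.documents), "documents in dataset")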