Releases: pinecone-io/pinecone-datasets


28 Feb 20:42
[Bug] Infinite loop when listing the contents of an empty catalog (#49)

## Problem

Calling `Catalog.list_datasets` on an empty catalog enters an infinite
loop.

from pinecone_datasets import Catalog

catalog = Catalog(base_path='/path/to/empty/dir')
catalog.list_datasets() # infinite loop

## Solution

Add a test and remove the offending log line, which called
`list_datasets` while the datasets were still being loaded.

## Type of Change

- [x] Bug fix (non-breaking change which fixes an issue)


28 Feb 19:48
Package overhaul: update dependencies, improve load time, and more (#48)

## Problem

Many dependencies are far out of date, preventing this package from
being run in modern Jupyter notebooks and creating incompatibilities
with updated SDK libraries. Also, this package loads very slowly despite
not doing very much.

It wasn't my intention to change everything in one PR, but since the
tests weren't runnable I couldn't assess the impact of the dependency
changes without updating and adding tests. And once you start doing that
you naturally want to modify other parts of the code.

## Solution

- Upgrade all dependencies in pyproject.toml and actions in .github
- Tried running tests to assess what broke after upgrading deps, and
found myself struggling because the test coverage and organization aren't
great. For example, there were tests under "unit tests" which can't run
without S3 credentials for a private bucket (which I don't have).
Problems like these, and a lack of specific testing for the
`load_dataset` and `list_datasets` functions, meant I definitely needed
to expand testing.

Added tests:
- For the `load_dataset` and `list_datasets` functions, which are the main
way people use this package from the examples repository.
- For working with local catalogs, since that seemed like an
intended behavior of this package but wasn't clearly tested.
- For uploading datasets to a Google Cloud Storage bucket using
service account credentials, which seems important since that is how we
would need to update or add additional datasets to the public set.

Refactoring (non-breaking):
- I ended up refactoring and adding a lot of tests to scope down the
responsibility of the Dataset class by extracting groups of
functionality into smaller more focused classes: `DatasetFSWriter`,
`DatasetFSReader`. This seemed faster than trying to reason about
everything smashed together into one giant `Dataset` class.
- Use lazy-loading for heavy dependencies such as `pandas`, `gcsfs`, and
`s3fs`. With these changes, importing the `load_dataset` function is now
about 8x faster, with import time dropping from a measured ~1.849s down
to 0.230s.
- Incorporated Dave Rigby's suggestion to use the `fs.glob` function
when listing datasets in a Catalog; this significantly cuts down the
number of network calls needed to build the catalog list.
- When calling `load_dataset`, skip loading all metadata. In the past
the entire catalog metadata was loaded to check if a dataset exists
before trying to build the `Dataset` object, which is unnecessary and
just adds a ton of overhead. If you try to load a dataset that doesn't
exist, the error message is already pretty clear.
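
The lazy-loading idea from the list above can be sketched in a few lines. This is a minimal illustration, not the code from this PR: `LazyModule` is a hypothetical helper, and the standard-library `json` module stands in for a heavy dependency such as `pandas` so the sketch runs anywhere.

```python
import importlib
from types import ModuleType
from typing import Optional


class LazyModule:
    """Hypothetical proxy that imports a module on first attribute access."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._module: Optional[ModuleType] = None

    def _load(self) -> ModuleType:
        # Pay the import cost once, the first time the module is needed.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return self._module

    def __getattr__(self, attr: str):
        # Only called for attributes that are not set in __init__,
        # so lookups such as `heavy.dumps` are forwarded to the real module.
        return getattr(self._load(), attr)


# Assumption: `json` plays the role of a slow import like `pandas`.
heavy = LazyModule("json")
is_loaded_before = heavy._module is not None  # nothing imported yet
encoded = heavy.dumps({"rows": 3})            # first use triggers the import
is_loaded_after = heavy._module is not None
```

Creating the proxy is cheap, so code that never touches the heavy dependency never pays for importing it. A plain `import` statement inside the function that needs the dependency achieves the same effect with less machinery.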

Breaking changes:
- Removed `to_pinecone_index`. Having this in creates a coupling between
this package and the SDK package that is going to be a perpetual thorn
in our side when it comes to maintaining docs and examples. The SDK will
continue marching forward whereas most of the other logic in here
related to uploading and downloading from buckets into dataframes should
not change much. There's really no need for it in here, so I am removing it.
- `Dataset.to_catalog` and `Dataset.from_catalog` now raise an error telling
you to use `Catalog.save_dataset` and `Catalog.load_dataset`. I'm not
aware of anyone using these legacy methods, but it was a change that
felt right to do to make the Catalog class mostly responsible for
"where" things are saved to while the Dataset class is responsible for
"what" things are saved. Now you can easily download a dataset and save
it to local, for example, which wasn't something you could easily reason
about when the dataset itself was responsible for writing to a catalog.
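
The "where" versus "what" split described above might look roughly like the following. This is a simplified sketch under stated assumptions: `SketchCatalog` and `SketchDataset` are hypothetical stand-ins backed by an in-memory dict, and the exception type and messages are illustrative, not taken from the package.

```python
class SketchCatalog:
    """Hypothetical catalog: owns *where* datasets are stored."""

    def __init__(self) -> None:
        self._store = {}  # in-memory stand-in for a local dir or bucket

    def save_dataset(self, dataset: "SketchDataset") -> None:
        self._store[dataset.name] = dataset

    def load_dataset(self, name: str) -> "SketchDataset":
        return self._store[name]


class SketchDataset:
    """Hypothetical dataset: owns *what* is stored."""

    def __init__(self, name: str, documents: list) -> None:
        self.name = name
        self.documents = documents

    def to_catalog(self, *args, **kwargs):
        # Legacy entry point: fail loudly and point at the replacement.
        raise NotImplementedError(
            "Dataset.to_catalog was removed; use Catalog.save_dataset instead."
        )

    @classmethod
    def from_catalog(cls, *args, **kwargs):
        raise NotImplementedError(
            "Dataset.from_catalog was removed; use Catalog.load_dataset instead."
        )


catalog = SketchCatalog()
catalog.save_dataset(SketchDataset("demo", [{"id": "1"}]))
loaded = catalog.load_dataset("demo")

try:
    loaded.to_catalog()
    legacy_message = ""
except NotImplementedError as exc:
    legacy_message = str(exc)
```

Because the dataset no longer knows about storage, saving a downloaded dataset somewhere else only requires a second catalog pointed at the new location.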

Despite all these changes, little should change for most users of the
package aside from dependency updates. `load_dataset` and
`list_datasets` are still the same (although much more performant) and
they are by far the most used.

## Usage

Most people who just want to load a demo dataset will do something like

from pinecone import Pinecone
from pinecone_datasets import load_dataset

ds = load_dataset('dataset_name')

pc = Pinecone(api_key='key')
index = pc.Index(host='host')
index.upsert_from_dataframe(df=ds.documents, batch_size=100)

## Type of Change

- [x] Bug fix (non-breaking change which fixes an issue)
- [x] Breaking change (fix or feature that would cause existing
functionality to not work as expected)

## Testing load performance

To investigate load performance, I used Python's `-X importtime` option
and `tuna`, a package I added as a dev dependency for visualizing the
output.

poetry run python3 -X importtime -c "from pinecone_datasets import load_dataset" 2> load_times.log
poetry run tuna load_times.log
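
For a quick look without `tuna`, the same `-X importtime` output can be parsed directly. The helper below is a hypothetical sketch (`import_times` is not part of this package); it runs a statement in a fresh interpreter and returns the imports sorted by cumulative time. It uses `import json` so it runs without this package installed.

```python
import subprocess
import sys


def import_times(statement: str):
    """Run `statement` under -X importtime in a fresh interpreter.

    Returns (cumulative_us, self_us, module) tuples, slowest first.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", statement],
        capture_output=True,
        text=True,
        check=True,
    )
    rows = []
    # importtime writes lines like
    # "import time:   self [us] | cumulative | imported package" to stderr.
    for line in proc.stderr.splitlines():
        if not line.startswith("import time:"):
            continue
        parts = line[len("import time:"):].split("|")
        if len(parts) != 3:
            continue
        self_us, cumulative_us, module = (part.strip() for part in parts)
        if not (self_us.isdigit() and cumulative_us.isdigit()):
            continue  # skip the header row
        rows.append((int(cumulative_us), int(self_us), module))
    return sorted(rows, reverse=True)


# Assumption: `import json` is a stand-in for the statement being measured.
slowest = import_times("import json")[:5]
```

Passing `"from pinecone_datasets import load_dataset"` instead would list the imports that dominate this package's load time, assuming the package is installed in the interpreter being run.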

## Testing

Besides the tests you see added here, I did some manual testing in a
notebook setting using the `1.0.0.dev3` build that I built from this


28 Feb 19:08
Merge pull request #42 from pinecone-io/daver/version_fix


26 Feb 17:50
Merge pull request #42 from pinecone-io/daver/version_fix


25 Feb 18:01
Merge pull request #42 from pinecone-io/daver/version_fix


16 Jan 14:31
Merge pull request #39 from pinecone-io/fix_cd

Fix release CI

Version 0.5

02 Jul 12:05

This version introduces new functionality:

  • unified the dataframe implementation on pandas v2.0 with Arrow
  • better separation of path and catalog loading via from_path and from_catalog
    • from_path(path) loads a dataset from an arbitrary path (parquet files)
    • from_catalog(dataset_name) loads a dataset from the catalog
      • load_dataset uses this behind the scenes
  • added loading from a pandas dataframe with from_pandas
  • saving datasets now supports to_path and to_catalog
    • used internally to save more datasets to the Pinecone catalog and to support example notebooks
  • documents and queries dataframes are now loaded only when the property is accessed
  • added to_pinecone_index, which supports creating a new index from a dataset using client v2/v3
  • added more test coverage for all functionality
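
Loading `documents` and `queries` only on property access, as described in the list above, can be illustrated with `functools.cached_property`. This is a sketch of the general technique, not the package's implementation: `LazyDataset` and its `loader` callable are hypothetical, and plain lists stand in for dataframes.

```python
from functools import cached_property


class LazyDataset:
    """Hypothetical dataset whose tables load on first property access."""

    def __init__(self, loader) -> None:
        self._loader = loader   # callable: table name -> table contents
        self.load_calls = 0     # counts how often the loader actually runs

    @cached_property
    def documents(self):
        self.load_calls += 1
        return self._loader("documents")

    @cached_property
    def queries(self):
        self.load_calls += 1
        return self._loader("queries")


def fake_loader(table: str):
    # Assumption: stands in for reading parquet files into a dataframe.
    return [f"{table}-row-{i}" for i in range(3)]


ds = LazyDataset(fake_loader)
calls_before = ds.load_calls   # nothing is loaded at construction time
first = ds.documents           # triggers the load
second = ds.documents          # served from the cached value
calls_after = ds.load_calls
```

Constructing the dataset stays cheap, and a caller who only needs `documents` never pays for loading `queries`.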