Releases: pinecone-io/pinecone-datasets
1.0.1
[Bug] Infinite loop when listing the contents of an empty catalog (#49)

## Problem

When trying to call `Catalog.list_datasets` on an empty catalog, you go into an infinite loop.

```python
from pinecone_datasets import Catalog

catalog = Catalog(base_path='/path/to/empty/dir')
catalog.list_datasets()  # infinite loop
```

## Solution

Add a test and remove the offending log line, which attempted to call `list_datasets` while datasets were still being loaded in.

## Type of Change

- [x] Bug fix (non-breaking change which fixes an issue)
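For context, a minimal sketch of the behavior expected after this fix, assuming an empty catalog now simply produces an empty listing:

```python
from pinecone_datasets import Catalog

catalog = Catalog(base_path='/path/to/empty/dir')

# With 1.0.1, this returns promptly for an empty catalog (presumably with
# nothing listed) instead of looping forever.
print(catalog.list_datasets())
```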
1.0.0
Package overhaul: update dependencies, improve load time, and more (#48)

## Problem

A lot of dependencies are far out of date, preventing this package from being run on modern Jupyter notebooks and creating incompatibilities with updated SDK libraries. Also, this package loads very slowly despite not doing very much.

It wasn't my intention to change everything in one PR, but since the tests weren't runnable I couldn't assess the impact of the dependency changes without updating and adding tests. And once you start doing that, you naturally want to modify other parts of the code.

## Solution

- Upgrade all dependencies in pyproject.toml and actions in .github workflows.
- Tried running tests to assess what broke after upgrading deps, and found myself struggling because the test coverage and organization isn't great. For example, there were tests under "unit tests" which can't run without S3 credentials for a private bucket (which I don't have). Problems like these, and a lack of specific testing for the `load_dataset` and `list_datasets` functions, meant I definitely needed to expand testing.

Added tests:

- Tests for the `load_dataset` and `list_datasets` functions, which are the main way people use this package from the examples repository.
- Tests for working with local catalogs, since that seemed to be an intended behavior of this package but wasn't clearly tested.
- Tests for uploading datasets to a Google Storage bucket using service account credentials, which seems important since that is how we would need to update or add additional datasets to the public set.

Refactoring (non-breaking):

- I ended up refactoring and adding a lot of tests to scope down the responsibility of the Dataset class by extracting groups of functionality into smaller, more focused classes: `DatasetFSWriter` and `DatasetFSReader`. This seemed faster than trying to reason about everything smashed together into one giant `Dataset` class.
- Use lazy loading for heavy dependencies such as `pandas`, `gcsfs`, and `s3fs` (a sketch of the general pattern follows at the end of these notes). With these changes, importing the `load_dataset` function is now about 8x faster, with import time dropping from a measured ~1.849s down to ~0.230s.
- Incorporated Dave Rigby's suggestion to use the `fs.glob` function when listing datasets in a Catalog (also sketched below); this significantly cuts down the number of network calls needed to build the catalog list.
- When calling `load_dataset`, skip loading all metadata. In the past, the entire catalog metadata was loaded to check whether a dataset exists before building the `Dataset` object, which is unnecessary and just adds a ton of overhead. If you try to load a dataset that doesn't exist, the error message is already pretty clear.

Breaking changes:

- Removed `to_pinecone_index`. Having this in creates a coupling between this package and the SDK package that is going to be a perpetual thorn in our side when it comes to maintaining docs and examples. The SDK will continue marching forward, whereas most of the other logic in here, related to uploading and downloading from buckets into dataframes, should not change much. There's really no need for it in here, so I am removing it.
- `Dataset.to_catalog` and `Dataset.from_catalog` now error and tell you to use `Catalog.save_dataset` and `Catalog.load_dataset`. I'm not aware of anyone using these legacy methods, but it was a change that felt right to make the Catalog class mostly responsible for "where" things are saved, while the Dataset class is responsible for "what" is saved.
Now you can easily download a dataset and save it locally, for example, which wasn't something you could easily reason about when the dataset itself was responsible for writing to a catalog (a short migration sketch follows below). Despite all these changes, little should change for most users of the package aside from dependency updates. `load_dataset` and `list_datasets` are still the same (although much more performant), and they are by far the most used.

## Usage

Most people who just want to load a demo dataset will do something like this:

```python
from pinecone import Pinecone
from pinecone_datasets import load_dataset

ds = load_dataset('dataset_name')

pc = Pinecone(api_key='key')
index = pc.Index(host='host')
index.upsert_from_dataframe(df=ds.documents, batch_size=100)
```

## Type of Change

- [x] Bug fix (non-breaking change which fixes an issue)
- [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

## Testing load performance

To investigate load performance, I used Python's `-X importtime` feature and a cool package I added as a dev dependency called `tuna` for visualizing the output.

```
poetry run python3 -X importtime -c "from pinecone_datasets import load_dataset" 2> load_times.log
poetry run tuna load_times.log
```

## Testing

Besides the tests you see added here, I did some manual testing in a notebook setting using the `1.0.0.dev3` build that I built from this branch.
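As a rough illustration of the lazy-loading refactor mentioned above, here is a minimal sketch of the general pattern (illustrative only; the actual module layout and function names in this package may differ):

```python
# Sketch: defer a heavy import until it is actually needed, so that
# `import pinecone_datasets` stays cheap for callers who never use it.
def read_documents(path: str):
    import pandas as pd  # imported lazily, not at module import time

    return pd.read_parquet(path)
```

The `fs.glob` change can be pictured roughly like this (the bucket name and metadata file layout below are assumptions, not the package's actual paths):

```python
import gcsfs

fs = gcsfs.GCSFileSystem(token='anon')

# One glob call finds every dataset's metadata file, instead of listing
# directories and then issuing a separate request per dataset.
metadata_paths = fs.glob('example-bucket/*/metadata.json')
dataset_names = [p.split('/')[-2] for p in metadata_paths]
```

For anyone migrating off the legacy `Dataset.to_catalog` / `Dataset.from_catalog` methods, a hedged sketch of the replacement calls (the method names come from the notes above; their exact signatures are assumptions):

```python
from pinecone_datasets import Catalog

catalog = Catalog(base_path='/path/to/local/catalog')

# previously: Dataset.from_catalog('dataset_name')
ds = catalog.load_dataset('dataset_name')

# previously: ds.to_catalog(...)
catalog.save_dataset(ds)
```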
1.0.0.dev3
Merge pull request #42 from pinecone-io/daver/version_fix
1.0.0.dev2
Merge pull request #42 from pinecone-io/daver/version_fix
1.0.0.dev1
Merge pull request #42 from pinecone-io/daver/version_fix
0.7.0
Merge pull request #39 from pinecone-io/fix_cd

Fix release CI
Version 0.5
In this version we introduced new functionalities:
- unified the dataframe implementation with pandas v2.0 with arrow
- better separation of path/catalog loading with `from_path` and `from_catalog` functionalities (a usage sketch follows below)
  - `from_path(path)`: load a dataset from an arbitrary path (parquet files)
  - `from_catalog(dataset_name)`: load a dataset from the catalog
  - `load_dataset` uses this functionality behind the scenes
- added loading from a pandas dataframe with `from_pandas`
- saving datasets now supports `to_path` and `to_catalog`
  - used internally to save more datasets to the pinecone catalog and support example notebooks
- documents and queries dataframes are loaded only when the corresponding property is accessed
- added `to_pinecone_index` functionality, which supports creating a new index from a dataset, using client v2/v3
- added more test support for all functionalities
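As a rough illustration of the loading and saving paths introduced in this version, a minimal sketch (import locations and exact signatures here are assumptions based on the notes above):

```python
from pinecone_datasets import Dataset, load_dataset

# load a dataset from an arbitrary path containing parquet files
ds = Dataset.from_path('/path/to/parquet/dir')

# or load a named dataset from the catalog
# (load_dataset uses the same machinery behind the scenes)
ds = load_dataset('dataset_name')

# documents/queries dataframes are only read when the property is accessed
print(ds.documents.head())

# save back out to an arbitrary path, or to a catalog
ds.to_path('/path/to/output')
```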