Merge pull request #117 from int-brain-lab/v2.7.0
V2.7.0
k1o0 authored Mar 25, 2024
2 parents 22f2972 + 960e7a9 commit dc41a27
Showing 23 changed files with 638 additions and 189 deletions.
37 changes: 37 additions & 0 deletions .github/workflows/python-publish.yaml
@@ -0,0 +1,37 @@
# Reference for this action:
# https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries
name: Publish to PyPI

on:
push:
tags:
- 'v*'

permissions:
contents: read

jobs:
deploy:
name: Build and publish Python distributions to PyPI
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v4
with:
python-version: '3.x'

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install setuptools wheel
- name: Build package
run: python setup.py sdist bdist_wheel

- name: Publish package
# GitHub recommends pinning 3rd party actions to a commit SHA.
uses: pypa/gh-action-pypi-publish@37f50c210e3d2f9450da2cd423303d6a14a6e29f
with:
user: __token__
password: ${{ secrets.PYPI_API_TOKEN }}
28 changes: 25 additions & 3 deletions CHANGELOG.md
@@ -1,12 +1,34 @@
# Changelog
## [Latest](https://github.com/int-brain-lab/ONE/commits/main) [2.6.0]
## [Latest](https://github.com/int-brain-lab/ONE/commits/main) [2.7.0]
This version of ONE adds support for Alyx 2.0.0 and pandas 3.0.0, along with dataset QC filters. It no longer supports the 'data' search filter.

### Added

- support for Alyx v2.0.0
- support for pandas v3.0.0
- one.alf.spec.QC enumeration
- ONE_HTTP_DL_THREADS environment variable allows the user to specify the maximum number of threads to use (see the sketch below)
- GitHub workflow for releasing to PyPI
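
As a rough illustration of the QC enumeration and the new environment variable (a minimal sketch: the thread count is arbitrary, and it is assumed the variable is read from the environment when downloads are dispatched):

```python
import os
from one.alf.spec import QC  # new QC enumeration

# Cap the number of parallel HTTP download threads (value is illustrative).
os.environ['ONE_HTTP_DL_THREADS'] = '4'

# QC levels can be referenced by attribute or looked up by name when building filters.
print(QC.PASS, QC['WARNING'])
```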

### Modified

- support 'qc' category field in dataset cache table
- One.search supports the `dataset_qc_lte` filter
- One.list_datasets supports the `dataset_qc_lte` and `ignore_qc_not_set` filters
- one.alf.io.iter_sessions pattern arg added to improve performance

### Removed

- One.search no longer supports 'data' filter: kwarg must be 'dataset'

## [2.6.0]

### Modified
- `one.load_dataset`

- One.load_dataset
- add an option to skip computing hash for existing files when loading datasets `check_hash=False`
- check filesize before computing hash for performance


## [2.5.5]

### Modified
2 changes: 1 addition & 1 deletion README.md
@@ -2,7 +2,7 @@
[![Coverage Status](https://coveralls.io/repos/github/int-brain-lab/ONE/badge.svg?branch=main)](https://coveralls.io/github/int-brain-lab/ONE?branch=main)
![CI workflow](https://github.com/int-brain-lab/ONE/actions/workflows/main.yaml/badge.svg?branch=main)

The Open Neurophysiology Environment is a scheme for sharing neurophysiology data in a standardized manner. It is a Python API for searching and loading ONE-standardized data, stored either on a users local machine or on a remote server.
The Open Neurophysiology Environment is a scheme for sharing neurophysiology data in a standardized manner. It is a Python API for searching and loading ONE-standardized data, stored either on a user's local machine or on a remote server.

Please [Click here](https://int-brain-lab.github.io/ONE/) for the main documentation page. For a quick primer on the file naming convention we use, [click here](https://github.com/int-brain-lab/ONE/blob/main/docs/Open_Neurophysiology_Environment_Filename_Convention.pdf).

17 changes: 17 additions & 0 deletions docs/FAQ.md
@@ -194,3 +194,20 @@ or provided a different tag (see [this question](#how-do-i-download-the-datasets
Second, there are minor differences between the default/local modes and remote mode. Namely that in remote mode
queries are generally case-insensitive. See the 'gotcha' section of
'[Searching with ONE](notebooks/one_search/one_search.html#Gotchas)' for more information.

## How do I load datasets that pass quality control?
You can first filter sessions to those in which the supplied datasets have a QC level of WARNING or less:

```python
one = ONE()
# In local and auto mode
eids = one.search(dataset=['trials', 'spikes'], dataset_qc_lte='WARNING')
# In remote mode
eids = one.search(datasets=['trials.table.pqt', 'spikes.times.npy'], dataset_qc_lte='WARNING')
```

You can then load the datasets with `list_datasets` and `load_datasets`:
```python
dsets = one.list_datasets(eid, qc='WARNING', ignore_qc_not_set=True)
data, info = one.load_datasets(eid, dsets)
```
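
Under the hood these filters compare QC levels via the new `one.alf.spec.QC` enumeration. The sketch below assumes the enumeration is ordered and that NOT_SET sits at or below WARNING, which is what the `ignore_qc_not_set` flag guards against:

```python
from one.alf.spec import QC

# 'WARNING or less' means a dataset's QC value is at or below QC.WARNING,
# so PASS datasets qualify, and so do NOT_SET ones unless ignore_qc_not_set=True.
print(QC.PASS <= QC.WARNING)     # expected: True
print(QC.NOT_SET <= QC.WARNING)  # expected: True, hence the ignore_qc_not_set filter
print(QC['WARNING'])             # levels can also be looked up by name, as in the examples above
```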
18 changes: 18 additions & 0 deletions docs/contributing.md
@@ -23,3 +23,21 @@ python ./make-script.py -d

The HTML files are placed in `docs/_build/html/`.

# Contributing to code

Always branch off `main` before committing changes, then push to the remote and open a PR into `main`.
A developer will then approve the PR and release.

## Releasing (developers only)

Note that in order to trigger a PyPI release, the tag must begin with 'v', e.g. `v2.8.0`.

```shell
git checkout -b release/X.X.X origin/<branch>
git checkout main
git merge release/X.X.X
git tag vX.X.X
git push origin --tags
git push origin main
git branch -d release/X.X.X
```
15 changes: 11 additions & 4 deletions docs/notebooks/one_list/one_list.ipynb
@@ -244,7 +244,13 @@
"collections = one.list_collections(eid, filename='*spikes*')\n",
"\n",
"# All datasets with 'raw' in the name:\n",
"datasets = one.list_datasets(eid, '*raw*')\n"
"datasets = one.list_datasets(eid, '*raw*')\n",
"\n",
"# All datasets with a QC value less than or equal to 'WARNING' (i.e. includes 'PASS', 'NOT_SET' also):\n",
"datasets = one.list_datasets(eid, qc='WARNING')\n",
"\n",
"# All QC'd datasets with a value less than or equal to 'WARNING' (i.e. 'WARNING' or 'PASS'):\n",
"datasets = one.list_datasets(eid, qc='WARNING', ignore_qc_not_set=True)"
],
"metadata": {
"collapsed": false,
@@ -384,7 +390,8 @@
"source": [
"## Combining with load methods\n",
"The list methods are useful in combination with the load methods. For example, the output of\n",
"the `list_datasets` method can be a direct input of the `load_datasets` method:"
"the `list_datasets` method can be a direct input of the `load_datasets` method. Here we load all\n",
"spike and cluster datasets where the QC is either PASS or NOT_SET:"
],
"metadata": {
"collapsed": false
@@ -403,7 +410,7 @@
}
],
"source": [
"datasets = one.list_datasets(eid, ['*spikes*', '*clusters*'])\n",
"datasets = one.list_datasets(eid, ['*spikes*', '*clusters*'], qc='PASS', ignore_qc_not_set=False)\n",
"data, records = one.load_datasets(eid, datasets)"
],
"metadata": {
@@ -537,4 +544,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}
18 changes: 11 additions & 7 deletions docs/notebooks/one_search/one_search.ipynb
@@ -573,15 +573,19 @@
"As mentioned above, different search terms perform differently. Below are the search terms and their\n",
"approximate SQL equivalents:\n",
"\n",
"| Term | Lookup |\n",
"|--------------|-----------|\n",
"| dataset | LIKE AND |\n",
"| number | EXACT |\n",
"| date_range | BETWEEN |\n",
"| subject, etc.| LIKE OR |\n",
"| Term | Lookup |\n",
"|-----------------|----------|\n",
"| dataset | LIKE AND |\n",
"| dataset_qc_lte | <= |\n",
"| number | EXACT |\n",
"| date_range | BETWEEN |\n",
"| subject, etc. | LIKE OR |\n",
"\n",
"Combinations of terms form a logical AND, for example `one.search(subject=['foo', 'bar'], project='baz')`\n",
"searches for sessions where the subject name contains foo OR bar, AND the project contains baz.\n",
"NB: When `dataset_qc_lte` which is provided with `dataset(s)`, sessions are returned where ALL matching datasets\n",
"have a less than or equal QC value. When `dataset_qc_lte` is provided alone, sessions are returned where\n",
"ANY of the datasets have a less than or equal QC value.\n",
"\n",
"#### Difference between remote mode search terms\n",
"Many search terms perform differently between auto/local mode and [remote mode](../one_modes.html),\n",
@@ -591,7 +595,7 @@
"In remote mode there are three ways to search for datasets:\n",
"\n",
"* **dataset** - a partial, case-insensitive match of a single dataset (multiple datasets not supported).\n",
"* **datasets** - an exact, case-sensitive match of one or more datasets. All datasets must be present.\n",
"* **datasets** - an exact, case-sensitive match of one or more datasets. All datasets must be present. If `dataset_qc` provided, this criterion applies only to these datasets.\n",
"* **dataset_type** - an exact, case-sensitive match of one or more [dataset types](../datasets_and_types.html#Dataset-types). All dataset types must be present.\n",
"\n",
"#### Regex systems between modes\n",
Expand Down
2 changes: 1 addition & 1 deletion one/__init__.py
@@ -1,2 +1,2 @@
"""The Open Neurophysiology Environment (ONE) API."""
__version__ = '2.6.0'
__version__ = '2.7.0'
41 changes: 22 additions & 19 deletions one/alf/cache.py
@@ -30,6 +30,7 @@
from one.alf.io import iter_sessions, iter_datasets
from one.alf.files import session_path_parts, get_alf_path
from one.converters import session_record2path
from one.util import QC_TYPE

__all__ = ['make_parquet_db', 'remove_missing_datasets', 'DATASETS_COLUMNS', 'SESSIONS_COLUMNS']
_logger = logging.getLogger(__name__)
@@ -40,12 +41,12 @@

SESSIONS_COLUMNS = (
'id', # int64
'lab',
'subject',
'lab', # str
'subject', # str
'date', # datetime.date
'number', # int
'task_protocol',
'projects',
'task_protocol', # str
'projects', # str
)

DATASETS_COLUMNS = (
@@ -56,6 +57,7 @@
'file_size', # file size in bytes
'hash', # sha1/md5, computed in load function
'exists', # bool
'qc', # one.util.QC_TYPE
)


@@ -64,7 +66,7 @@
# -------------------------------------------------------------------------------------------------

def _ses_str_id(session_path):
"""Returns a str id from a session path in the form '(lab/)subject/date/number'"""
"""Returns a str id from a session path in the form '(lab/)subject/date/number'."""
return Path(*filter(None, session_path_parts(session_path, assert_valid=True))).as_posix()


@@ -91,7 +93,8 @@ def _get_dataset_info(full_ses_path, rel_dset_path, ses_eid=None, compute_hash=F
'rel_path': Path(rel_dset_path).as_posix(),
'file_size': file_size,
'hash': md5(full_dset_path) if compute_hash else None,
'exists': True
'exists': True,
'qc': 'NOT_SET'
}


@@ -140,7 +143,7 @@ def _metadata(origin):
Parameters
----------
origin : str, pathlib.Path
Path to full directory, or computer name / db name
Path to full directory, or computer name / db name.
"""
return {
'date_created': datetime.datetime.now().isoformat(sep=' ', timespec='minutes'),
@@ -150,17 +153,17 @@ def _metadata(origin):

def _make_sessions_df(root_dir) -> pd.DataFrame:
"""
Given a root directory, recursively finds all sessions and returns a sessions DataFrame
Given a root directory, recursively finds all sessions and returns a sessions DataFrame.
Parameters
----------
root_dir : str, pathlib.Path
The folder to look for sessions
The folder to look for sessions.
Returns
-------
pandas.DataFrame
A pandas DataFrame of session info
A pandas DataFrame of session info.
"""
rows = []
for full_path in iter_sessions(root_dir):
@@ -176,21 +179,21 @@ def _make_sessions_df(root_dir) -> pd.DataFrame:

def _make_datasets_df(root_dir, hash_files=False) -> pd.DataFrame:
"""
Given a root directory, recursively finds all datasets and returns a datasets DataFrame
Given a root directory, recursively finds all datasets and returns a datasets DataFrame.
Parameters
----------
root_dir : str, pathlib.Path
The folder to look for sessions
The folder to look for sessions.
hash_files : bool
If True, an MD5 is computed for each file and stored in the 'hash' column
If True, an MD5 is computed for each file and stored in the 'hash' column.
Returns
-------
pandas.DataFrame
A pandas DataFrame of dataset info
A pandas DataFrame of dataset info.
"""
df = pd.DataFrame([], columns=DATASETS_COLUMNS)
df = pd.DataFrame([], columns=DATASETS_COLUMNS).astype({'qc': QC_TYPE})
# Go through sessions and append datasets
for session_path in iter_sessions(root_dir):
rows = []
@@ -200,7 +203,7 @@ def _make_datasets_df(root_dir, hash_files=False) -> pd.DataFrame:
rows.append(file_info)
df = pd.concat((df, pd.DataFrame(rows, columns=DATASETS_COLUMNS)),
ignore_index=True, verify_integrity=True)
return df
return df.astype({'qc': QC_TYPE})


def make_parquet_db(root_dir, out_dir=None, hash_ids=True, hash_files=False, lab=None):
Expand All @@ -216,7 +219,7 @@ def make_parquet_db(root_dir, out_dir=None, hash_ids=True, hash_files=False, lab
root directory.
hash_ids : bool
If True, experiment and dataset IDs will be UUIDs generated from the system and relative
paths (required for use with ONE API)
paths (required for use with ONE API).
hash_files : bool
If True, an MD5 hash is computed for each dataset and stored in the datasets table.
This will substantially increase cache generation time.
@@ -227,9 +230,9 @@
Returns
-------
pathlib.Path
The full path of the saved sessions parquet table
The full path of the saved sessions parquet table.
pathlib.Path
The full path of the saved datasets parquet table
The full path of the saved datasets parquet table.
"""
root_dir = Path(root_dir).resolve()

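
As a sketch of how the new 'qc' column might be consumed downstream (the parquet filename is hypothetical, and this assumes `one.util.QC_TYPE` is an ordered pandas categorical so that comparisons against a level name work):

```python
import pandas as pd

# Load a datasets table produced by make_parquet_db (filename is illustrative).
datasets = pd.read_parquet('datasets.pqt')

# With an ordered categorical 'qc' column, rows at or below a given QC level
# can be selected with a plain comparison.
passing = datasets[datasets['qc'] <= 'WARNING']
print(passing[['rel_path', 'qc']].head())
```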