
test: Add integration tests for public datasets #591

Merged
merged 7 commits into main from tests/integration-tests on Oct 10, 2023

Conversation

@dkrako (Contributor) commented Oct 6, 2023

Description

A first version to try out downloading and processing public datasets.

This should fail until #517 is fixed.

We still have to find a solution so that integration tests are run only rarely (what about running them only when publishing new releases?).

For now I would just add --ignore=tests/integration under [tool.pytest.ini_options] in pyproject.toml (see the sketch below); we definitely do not want to preprocess all datasets on every commit, as that would take forever (even if they are cached).
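
For illustration, a minimal sketch of what that entry could look like; placing the flag under addopts is an assumption, not the merged configuration:

```toml
[tool.pytest.ini_options]
# skip the expensive integration tests on regular runs
# (assumed placement, not the project's actual merged config)
addopts = "--ignore=tests/integration"
```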

@dkrako (Contributor, Author) commented Oct 6, 2023

The GitHub workers are failing the integration tests. I see three potential reasons:

  1. Dataset.load() loads a complete dataset into memory; if the workers have too little working memory, they will fail.
  2. Downloading all datasets could exceed the available disk space.
  3. Running the tests could simply take too long.

Point 1 would be solved once we implement batched preprocessing (see the sketch below): instead of keeping all dataset files in memory, we would process N files at a time, so only N files need to be held in memory at once.
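
A minimal sketch of that batching idea; load_file and process are stand-in callables, not pymovements APIs:

```python
from itertools import islice


def batched(iterable, n):
    """Yield successive lists of at most n items from an iterable."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, n)):
        yield batch


def process_in_batches(filepaths, load_file, process, n=8):
    """Load and process at most n files at a time to bound memory usage."""
    for batch in batched(filepaths, n):
        frames = [load_file(path) for path in batch]  # only n files in memory
        for frame in frames:
            process(frame)


# usage sketch with stand-in callables:
process_in_batches(
    filepaths=['a.csv', 'b.csv', 'c.csv'],
    load_file=lambda path: path,   # stand-in for an actual file loader
    process=lambda frame: None,    # stand-in for pix2deg/pos2vel etc.
    n=2,
)
```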

If it's point 2 or 3, then we have a problem.

I'm running the integration tests locally on our DGX; let's see what output we get from that.

@dkrako (Contributor, Author) commented Oct 6, 2023

We get one expected failure on GazeBase and one unexpected failure on SB-Sat.

Output

=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.9.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/krakowczyk/workspace/pymovements, configfile: pyproject.toml
plugins: anyio-3.7.1, dash-2.11.1, lazy-fixture-0.6.3, hydra-core-1.3.2, cov-3.0.0
collected 8 items                                                                                                                                                                                                 

tests/integration/public_dataset_processing_test.py .F...F..                                                                                                                                                [100%]

==================================================================================================== FAILURES =====================================================================================================
____________________________________________________________________________________ test_public_dataset_processing[GazeBase] _____________________________________________________________________________________

dataset_name = 'GazeBase', tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1')

    @pytest.mark.parametrize(
        'dataset_name',
        list(pm.dataset.DatasetLibrary.definitions.keys()),
    )
    def test_public_dataset_processing(dataset_name, tmp_path):
        # Initialize dataset.
        dataset_path = tmp_path / dataset_name
        dataset = pm.Dataset(dataset_name, path=dataset_path)
    
        # Download and load in dataset.
        dataset.download()
        dataset.load()
    
        # Do some basic transformations.
        if 'pixel' in dataset.gaze[0].columns:
            dataset.pix2deg()
>       dataset.pos2vel()

dataset    = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>
dataset_name = 'GazeBase'
dataset_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1/GazeBase')
tmp_path   = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1')

tests/integration/public_dataset_processing_test.py:42: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>, method = 'fivepoint', verbose = True, kwargs = {}

    def pos2vel(
            self,
            method: str = 'fivepoint',
            *,
            verbose: bool = True,
            **kwargs: Any,
    ) -> Dataset:
        """Compute gaze velocites in dva/s from dva coordinates.
    
        This method requires a properly initialized :py:attr:`~.Dataset.experiment` attribute.
    
        After success, the gaze dataframe is extended by the resulting velocity columns.
    
        Parameters
        ----------
        method : str
            Computation method. See :func:`~transforms.pos2vel()` for details, default: 'fivepoint'.
        verbose : bool
            If True, show progress of computation.
        **kwargs
            Additional keyword arguments to be passed to the :func:`~transforms.pos2vel()` method.
    
        Raises
        ------
        AttributeError
            If `gaze` is None or there are no gaze dataframes present in the `gaze` attribute, or
            if experiment is None.
    
        Returns
        -------
        Dataset
            Returns self, useful for method cascading.
        """
>       return self.apply('pos2vel', method=method, verbose=verbose, **kwargs)

kwargs     = {}
method     = 'fivepoint'
self       = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>
verbose    = True

src/pymovements/dataset/dataset.py:393: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>, function = 'pos2vel', verbose = True, kwargs = {'method': 'fivepoint'}, disable_progressbar = False
gaze = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>

    def apply(
            self,
            function: str,
            *,
            verbose: bool = True,
            **kwargs: Any,
    ) -> Dataset:
        """Apply preprocessing method to all GazeDataFrames in Dataset.
    
        Parameters
        ----------
        function: str
            Name of the preprocessing function to apply.
        verbose : bool
            If True, show progress bar of computation.
        kwargs:
            kwargs that will be forwarded when calling the preprocessing method.
    
        Examples
        --------
        Let's load in our dataset first,
        >>> import pymovements as pm
        >>>
        >>> dataset = pm.Dataset("ToyDataset", path='toy_dataset')
        >>> dataset.download()# doctest:+ELLIPSIS
        Downloading ... to toy_dataset...downloads...
        Checking integrity of ...
        Extracting ... to toy_dataset...raw
        <pymovements.dataset.dataset.Dataset object at ...>
        >>> dataset.load()# doctest:+ELLIPSIS
        <pymovements.dataset.dataset.Dataset object at ...>
    
        Use apply for your gaze transformations:
        >>> dataset.apply('pix2deg')# doctest:+ELLIPSIS
        <pymovements.dataset.dataset.Dataset object at ...>
    
        >>> dataset.apply('pos2vel', method='neighbors')# doctest:+ELLIPSIS
        <pymovements.dataset.dataset.Dataset object at ...>
    
        Use apply for your event detection:
        >>> dataset.apply('ivt')# doctest:+ELLIPSIS
        <pymovements.dataset.dataset.Dataset object at ...>
    
        >>> dataset.apply('microsaccades', minimum_duration=8)# doctest:+ELLIPSIS
        <pymovements.dataset.dataset.Dataset object at ...>
        """
        self._check_gaze_dataframe()
    
        disable_progressbar = not verbose
        for gaze in tqdm(self.gaze, disable=disable_progressbar):
>           gaze.apply(function, **kwargs)

disable_progressbar = False
function   = 'pos2vel'
gaze       = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>
kwargs     = {'method': 'fivepoint'}
self       = <pymovements.dataset.dataset.Dataset object at 0x7f205c77dd30>
verbose    = True

src/pymovements/dataset/dataset.py:287: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>, function = 'pos2vel', kwargs = {'method': 'fivepoint'}

    def apply(
            self,
            function: str,
            **kwargs: Any,
    ) -> None:
        """Apply preprocessing method to GazeDataFrame.
    
        Parameters
        ----------
        function: str
            Name of the preprocessing function to apply.
        kwargs:
            kwargs that will be forwarded when calling the preprocessing method.
        """
        if transforms.TransformLibrary.__contains__(function):
>           self.transform(function, **kwargs)

function   = 'pos2vel'
kwargs     = {'method': 'fivepoint'}
self       = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>

src/pymovements/gaze/gaze_dataframe.py:252: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>, transform_method = <function pos2vel at 0x7f1fbeda29d0>
kwargs = {'method': 'fivepoint', 'n_components': 2, 'sampling_rate': 1000}, method_kwargs = ['sampling_rate', 'method', 'n_components', 'degree', 'window_length', 'padding', ...]

    def transform(
            self,
            transform_method: str | Callable[..., pl.Expr],
            **kwargs: Any,
    ) -> None:
        """Apply transformation method."""
        if isinstance(transform_method, str):
            transform_method = transforms.TransformLibrary.get(transform_method)
    
        if transform_method.__name__ == 'downsample':
            downsample_factor = kwargs.pop('factor')
            self.frame = self.frame.select(
                transforms.downsample(
                    factor=downsample_factor, **kwargs,
                ),
            )
    
        else:
            method_kwargs = inspect.getfullargspec(transform_method).kwonlyargs
            if 'origin' in method_kwargs and 'origin' not in kwargs:
                self._check_experiment()
                assert self.experiment is not None
                kwargs['origin'] = self.experiment.screen.origin
    
            if 'screen_resolution' in method_kwargs and 'screen_resolution' not in kwargs:
                self._check_experiment()
                assert self.experiment is not None
                kwargs['screen_resolution'] = (
                    self.experiment.screen.width_px, self.experiment.screen.height_px,
                )
    
            if 'screen_size' in method_kwargs and 'screen_size' not in kwargs:
                self._check_experiment()
                assert self.experiment is not None
                kwargs['screen_size'] = (
                    self.experiment.screen.width_cm, self.experiment.screen.height_cm,
                )
    
            if 'distance' in method_kwargs and 'distance' not in kwargs:
                self._check_experiment()
                assert self.experiment is not None
    
                if 'distance' in self.frame.columns:
                    kwargs['distance'] = 'distance'
    
                    if self.experiment.screen.distance_cm:
                        warnings.warn(
                            "Both a distance column and experiment's "
                            'eye-to-screen distance are specified. '
                            'Using eye-to-screen distances from column '
                            "'distance' in the dataframe.",
                        )
                elif self.experiment.screen.distance_cm:
                    kwargs['distance'] = self.experiment.screen.distance_cm
                else:
                    raise AttributeError(
                        'Neither eye-to-screen distance is in the columns of the dataframe '
                        'nor experiment eye-to-screen distance is specified.',
                    )
    
            if 'sampling_rate' in method_kwargs and 'sampling_rate' not in kwargs:
                self._check_experiment()
                assert self.experiment is not None
                kwargs['sampling_rate'] = self.experiment.sampling_rate
    
            if 'n_components' in method_kwargs and 'n_components' not in kwargs:
                self._check_n_components()
                kwargs['n_components'] = self.n_components
    
            if transform_method.__name__ in {'pos2vel', 'pos2acc'}:
                if 'position' not in self.frame.columns and 'position_column' not in kwargs:
                    if 'pixel' in self.frame.columns:
                        raise pl.exceptions.ColumnNotFoundError(
                            "Neither 'position' is in the columns of the dataframe: "
                            f'{self.frame.columns} nor is the position column specified. '
                            "Since the dataframe has a 'pixel' column, consider running "
                            f'pix2deg() before {transform_method.__name__}(). If you want '
                            'to calculate pixel transformations, you can do so by using '
                            f"{transform_method.__name__}(position_column='pixel'). "
                            f'Available dataframe columns are {self.frame.columns}',
                        )
                    raise pl.exceptions.ColumnNotFoundError(
                        "Neither 'position' is in the columns of the dataframe: "
                        f'{self.frame.columns} nor is the position column specified. '
                        f'Available dataframe columns are {self.frame.columns}',
                    )
            if transform_method.__name__ in {'pix2deg'}:
                if 'pixel' not in self.frame.columns and 'pixel_column' not in kwargs:
                    raise pl.exceptions.ColumnNotFoundError(
                        "Neither 'position' is in the columns of the dataframe: "
                        f'{self.frame.columns} nor is the pixel column specified. '
                        'You can specify the pixel column via: '
                        f'{transform_method.__name__}(pixel_column="name_of_your_pixel_column"). '
                        f'Available dataframe columns are {self.frame.columns}',
                    )
    
            if self.trial_columns is None:
                self.frame = self.frame.with_columns(transform_method(**kwargs))
            else:
                self.frame = pl.concat(
>                   [
                        df.with_columns(transform_method(**kwargs))
                        for group, df in self.frame.groupby(self.trial_columns, maintain_order=True)
                    ],
                )

kwargs     = {'method': 'fivepoint', 'n_components': 2, 'sampling_rate': 1000}
method_kwargs = ['sampling_rate', 'method', 'n_components', 'degree', 'window_length', 'padding', ...]
self       = <pymovements.gaze.gaze_dataframe.GazeDataFrame object at 0x7f205cbf8460>
transform_method = <function pos2vel at 0x7f1fbeda29d0>

src/pymovements/gaze/gaze_dataframe.py:358: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

.0 = <polars.dataframe.groupby.GroupBy object at 0x7f204f4ded00>

        [
>           df.with_columns(transform_method(**kwargs))
            for group, df in self.frame.groupby(self.trial_columns, maintain_order=True)
        ],
    )

.0         = <polars.dataframe.groupby.GroupBy object at 0x7f204f4ded00>
df         = shape: (15_076, 11)
┌──────────┬────────────┬────────────┬───────────┬───┬──────────────┬──────┬─────┬────────────────...948"]  │
└──────────┴────────────┴────────────┴───────────┴───┴──────────────┴──────┴─────┴───────────────────────────┘
group      = (1, 2, 2, 'FXS')
kwargs     = {'method': 'fivepoint', 'n_components': 2, 'sampling_rate': 1000}
transform_method = <function pos2vel at 0x7f1fbeda29d0>

src/pymovements/gaze/gaze_dataframe.py:359: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = shape: (15_076, 11)
┌──────────┬────────────┬────────────┬───────────┬───┬──────────────┬──────┬─────┬────────────────...948"]  │
└──────────┴────────────┴────────────┴───────────┴───┴──────────────┴──────┴─────┴───────────────────────────┘
exprs = (<polars.expr.expr.Expr object at 0x7f205cbf8910>,), named_exprs = {}

    def with_columns(
        self,
        *exprs: IntoExpr | Iterable[IntoExpr],
        **named_exprs: IntoExpr,
    ) -> DataFrame:
        """
        Add columns to this DataFrame.
    
        Added columns will replace existing columns with the same name.
    
        Parameters
        ----------
        *exprs
            Column(s) to add, specified as positional arguments.
            Accepts expression input. Strings are parsed as column names, other
            non-expression inputs are parsed as literals.
        **named_exprs
            Additional columns to add, specified as keyword arguments.
            The columns will be renamed to the keyword used.
    
        Returns
        -------
        DataFrame
            A new DataFrame with the columns added.
    
        Notes
        -----
        Creating a new DataFrame using this method does not create a new copy of
        existing data.
    
        Examples
        --------
        Pass an expression to add it as a new column.
    
        >>> df = pl.DataFrame(
        ...     {
        ...         "a": [1, 2, 3, 4],
        ...         "b": [0.5, 4, 10, 13],
        ...         "c": [True, True, False, True],
        ...     }
        ... )
        >>> df.with_columns((pl.col("a") ** 2).alias("a^2"))
        shape: (4, 4)
        ┌─────┬──────┬───────┬──────┐
        │ a   ┆ b    ┆ c     ┆ a^2  │
        │ --- ┆ ---  ┆ ---   ┆ ---  │
        │ i64 ┆ f64  ┆ bool  ┆ f64  │
        ╞═════╪══════╪═══════╪══════╡
        │ 1   ┆ 0.5  ┆ true  ┆ 1.0  │
        │ 2   ┆ 4.0  ┆ true  ┆ 4.0  │
        │ 3   ┆ 10.0 ┆ false ┆ 9.0  │
        │ 4   ┆ 13.0 ┆ true  ┆ 16.0 │
        └─────┴──────┴───────┴──────┘
    
        Added columns will replace existing columns with the same name.
    
        >>> df.with_columns(pl.col("a").cast(pl.Float64))
        shape: (4, 3)
        ┌─────┬──────┬───────┐
        │ a   ┆ b    ┆ c     │
        │ --- ┆ ---  ┆ ---   │
        │ f64 ┆ f64  ┆ bool  │
        ╞═════╪══════╪═══════╡
        │ 1.0 ┆ 0.5  ┆ true  │
        │ 2.0 ┆ 4.0  ┆ true  │
        │ 3.0 ┆ 10.0 ┆ false │
        │ 4.0 ┆ 13.0 ┆ true  │
        └─────┴──────┴───────┘
    
        Multiple columns can be added by passing a list of expressions.
    
        >>> df.with_columns(
        ...     [
        ...         (pl.col("a") ** 2).alias("a^2"),
        ...         (pl.col("b") / 2).alias("b/2"),
        ...         (pl.col("c").is_not()).alias("not c"),
        ...     ]
        ... )
        shape: (4, 6)
        ┌─────┬──────┬───────┬──────┬──────┬───────┐
        │ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
        │ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
        │ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
        ╞═════╪══════╪═══════╪══════╪══════╪═══════╡
        │ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
        │ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
        │ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
        │ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
        └─────┴──────┴───────┴──────┴──────┴───────┘
    
        Multiple columns also can be added using positional arguments instead of a list.
    
        >>> df.with_columns(
        ...     (pl.col("a") ** 2).alias("a^2"),
        ...     (pl.col("b") / 2).alias("b/2"),
        ...     (pl.col("c").is_not()).alias("not c"),
        ... )
        shape: (4, 6)
        ┌─────┬──────┬───────┬──────┬──────┬───────┐
        │ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
        │ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
        │ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
        ╞═════╪══════╪═══════╪══════╪══════╪═══════╡
        │ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
        │ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
        │ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
        │ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
        └─────┴──────┴───────┴──────┴──────┴───────┘
    
        Use keyword arguments to easily name your expression inputs.
    
        >>> df.with_columns(
        ...     ab=pl.col("a") * pl.col("b"),
        ...     not_c=pl.col("c").is_not(),
        ... )
        shape: (4, 5)
        ┌─────┬──────┬───────┬──────┬───────┐
        │ a   ┆ b    ┆ c     ┆ ab   ┆ not_c │
        │ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
        │ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
        ╞═════╪══════╪═══════╪══════╪═══════╡
        │ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
        │ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
        │ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
        │ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
        └─────┴──────┴───────┴──────┴───────┘
    
        Expressions with multiple outputs can be automatically instantiated as Structs
        by enabling the experimental setting ``Config.set_auto_structify(True)``:
    
        >>> with pl.Config(auto_structify=True):
        ...     df.drop("c").with_columns(
        ...         diffs=pl.col(["a", "b"]).diff().suffix("_diff"),
        ...     )
        ...
        shape: (4, 3)
        ┌─────┬──────┬─────────────┐
        │ a   ┆ b    ┆ diffs       │
        │ --- ┆ ---  ┆ ---         │
        │ i64 ┆ f64  ┆ struct[2]   │
        ╞═════╪══════╪═════════════╡
        │ 1   ┆ 0.5  ┆ {null,null} │
        │ 2   ┆ 4.0  ┆ {1,3.5}     │
        │ 3   ┆ 10.0 ┆ {1,6.0}     │
        │ 4   ┆ 13.0 ┆ {1,3.0}     │
        └─────┴──────┴─────────────┘
    
        """
        return (
>           self.lazy()
            .with_columns(*exprs, **named_exprs)
            .collect(no_optimization=True)
        )

exprs      = (<polars.expr.expr.Expr object at 0x7f205cbf8910>,)
named_exprs = {}
self       = shape: (15_076, 11)
┌──────────┬────────────┬────────────┬───────────┬───┬──────────────┬──────┬─────┬────────────────...948"]  │
└──────────┴────────────┴────────────┴───────────┴───┴──────────────┴──────┴─────┴───────────────────────────┘

/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/polars/dataframe/frame.py:7631: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (<LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>,), kwargs = {'no_optimization': True}

    @wraps(function)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
        _rename_keyword_argument(
            old_name, new_name, kwargs, function.__name__, version
        )
>       return function(*args, **kwargs)

args       = (<LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>,)
function   = <function LazyFrame.collect at 0x7f205fa84c10>
kwargs     = {'no_optimization': True}
new_name   = 'comm_subplan_elim'
old_name   = 'common_subplan_elimination'
version    = '0.18.9'

/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/polars/utils/deprecation.py:93: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>

    @deprecate_renamed_parameter(
        "common_subplan_elimination", "comm_subplan_elim", version="0.18.9"
    )
    def collect(
        self,
        *,
        type_coercion: bool = True,
        predicate_pushdown: bool = True,
        projection_pushdown: bool = True,
        simplify_expression: bool = True,
        no_optimization: bool = False,
        slice_pushdown: bool = True,
        comm_subplan_elim: bool = True,
        comm_subexpr_elim: bool = True,
        streaming: bool = False,
    ) -> DataFrame:
        """
        Collect into a DataFrame.
    
        Note: use :func:`fetch` if you want to run your query on the first `n` rows
        only. This can be a huge time saver in debugging queries.
    
        Parameters
        ----------
        type_coercion
            Do type coercion optimization.
        predicate_pushdown
            Do predicate pushdown optimization.
        projection_pushdown
            Do projection pushdown optimization.
        simplify_expression
            Run simplify expressions optimization.
        no_optimization
            Turn off (certain) optimizations.
        slice_pushdown
            Slice pushdown optimization.
        comm_subplan_elim
            Will try to cache branching subplans that occur on self-joins or unions.
        comm_subexpr_elim
            Common subexpressions will be cached and reused.
        streaming
            Run parts of the query in a streaming fashion (this is in an alpha state)
    
        Returns
        -------
        DataFrame
    
        Examples
        --------
        >>> lf = pl.LazyFrame(
        ...     {
        ...         "a": ["a", "b", "a", "b", "b", "c"],
        ...         "b": [1, 2, 3, 4, 5, 6],
        ...         "c": [6, 5, 4, 3, 2, 1],
        ...     }
        ... )
        >>> lf.groupby("a", maintain_order=True).agg(pl.all().sum()).collect()
        shape: (3, 3)
        ┌─────┬─────┬─────┐
        │ a   ┆ b   ┆ c   │
        │ --- ┆ --- ┆ --- │
        │ str ┆ i64 ┆ i64 │
        ╞═════╪═════╪═════╡
        │ a   ┆ 4   ┆ 10  │
        │ b   ┆ 11  ┆ 10  │
        │ c   ┆ 6   ┆ 1   │
        └─────┴─────┴─────┘
    
        """
        if no_optimization:
            predicate_pushdown = False
            projection_pushdown = False
            slice_pushdown = False
            comm_subplan_elim = False
            comm_subexpr_elim = False
    
        if streaming:
            comm_subplan_elim = False
    
        ldf = self._ldf.optimization_toggle(
            type_coercion,
            predicate_pushdown,
            projection_pushdown,
            simplify_expression,
            slice_pushdown,
            comm_subplan_elim,
            comm_subexpr_elim,
            streaming,
        )
>       return wrap_df(ldf.collect())
E       exceptions.ComputeError: arithmetic on string and numeric not allowed, try an explicit cast first

comm_subexpr_elim = False
comm_subplan_elim = False
ldf        = <builtins.PyLazyFrame object at 0x7f1f14821bb0>
no_optimization = True
predicate_pushdown = False
projection_pushdown = False
self       = <LazyFrame [12 cols, {"round_id": Int64 … "velocity": List(Utf8)}] at 0x7F1F14467A90>
simplify_expression = True
slice_pushdown = False
streaming  = False
type_coercion = True

/mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/polars/lazyframe/frame.py:1695: ComputeError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
Downloading https://figshare.com/ndownloader/files/27039812 to /tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1/GazeBase/downloads/GazeBase_v2_0.zip
Checking integrity of GazeBase_v2_0.zip
Extracting GazeBase_v2_0.zip to /tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing1/GazeBase/raw
---------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------
GazeBase_v2_0.zip: 100%|██████████| 6.25G/6.25G [03:27<00:00, 32.4MB/s]   
100%|██████████| 12334/12334 [08:12<00:00, 25.06it/s]
  0%|          | 22/12334 [00:01<10:37, 19.32it/s]
______________________________________________________________________________________ test_public_dataset_processing[SBSAT] ______________________________________________________________________________________

dataset_name = 'SBSAT', tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5')

    @pytest.mark.parametrize(
        'dataset_name',
        list(pm.dataset.DatasetLibrary.definitions.keys()),
    )
    def test_public_dataset_processing(dataset_name, tmp_path):
        # Initialize dataset.
        dataset_path = tmp_path / dataset_name
        dataset = pm.Dataset(dataset_name, path=dataset_path)
    
        # Download and load in dataset.
>       dataset.download()

dataset    = <pymovements.dataset.dataset.Dataset object at 0x7f205bdcfc40>
dataset_name = 'SBSAT'
dataset_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5/SBSAT')
tmp_path   = PosixPath('/tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5')

tests/integration/public_dataset_processing_test.py:36: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.dataset.dataset.Dataset object at 0x7f205bdcfc40>

    def download(
            self,
            *,
            extract: bool = True,
            remove_finished: bool = False,
            verbose: int = 1,
    ) -> Dataset:
        """Download dataset resources.
    
        This downloads all resources of the dataset. Per default this also extracts all archives
        into :py:meth:`Dataset.paths.raw`,
        To save space on your device you can remove the archive files after
        successful extraction with ``remove_finished=True``.
    
        If a corresponding file already exists in the local system, its checksum is calculated and
        checked against the expected checksum.
        Downloading will be evaded if the integrity of the existing file can be verified.
        If the existing file does not match the expected checksum it is overwritten with the
        downloaded new file.
    
        Parameters
        ----------
        extract : bool
            Extract dataset archive files.
        remove_finished : bool
            Remove archive files after extraction.
        verbose : int
            Verbosity levels: (1) Show download progress bar and print info messages on downloading
            and extracting archive files without printing messages for recursive archive extraction.
            (2) Print additional messages for each recursive archive extract.
    
        Raises
        ------
        AttributeError
            If number of mirrors or number of resources specified for dataset is zero.
        RuntimeError
            If downloading a resource failed for all given mirrors.
    
        Returns
        -------
        PublicDataset
            Returns self, useful for method cascading.
        """
>       dataset_download.download_dataset(
            definition=self.definition,
            paths=self.paths,
            extract=extract,
            remove_finished=remove_finished,
            verbose=bool(verbose),
        )

extract    = True
remove_finished = False
self       = <pymovements.dataset.dataset.Dataset object at 0x7f205bdcfc40>
verbose    = 1

src/pymovements/dataset/dataset.py:761: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
paths = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f205bdcfa90>, extract = True, remove_finished = False, verbose = True

    def download_dataset(
            definition: DatasetDefinition,
            paths: DatasetPaths,
            extract: bool = True,
            remove_finished: bool = False,
            verbose: bool = True,
    ) -> None:
        """Download dataset resources.
    
        This downloads all resources of the dataset. Per default this also extracts all archives
        into :py:meth:`Dataset.paths.raw`,
        To save space on your device you can remove the archive files after
        successful extraction with ``remove_finished=True``.
    
        If a corresponding file already exists in the local system, its checksum is calculated and
        checked against the expected checksum.
        Downloading will be evaded if the integrity of the existing file can be verified.
        If the existing file does not match the expected checksum it is overwritten with the
        downloaded new file.
    
        Parameters
        ----------
        definition:
            The dataset definition.
        paths:
            The dataset paths.
        extract : bool
            Extract dataset archive files.
        remove_finished : bool
            Remove archive files after extraction.
        verbose : bool
            If True, show progress of download and print status messages for integrity checking and
            file extraction.
    
        Raises
        ------
        AttributeError
            If number of mirrors or number of resources specified for dataset is zero.
        RuntimeError
            If downloading a resource failed for all given mirrors.
        """
        if len(definition.mirrors) == 0:
            raise AttributeError('number of mirrors must not be zero to download dataset')
    
        if len(definition.resources) == 0:
            raise AttributeError('number of resources must not be zero to download dataset')
    
        paths.raw.mkdir(parents=True, exist_ok=True)
    
        for resource in definition.resources:
            success = False
    
            for mirror_idx, mirror in enumerate(definition.mirrors):
    
                url = f'{mirror}{resource["resource"]}'
    
                try:
                    download_file(
                        url=url,
                        dirpath=paths.downloads,
                        filename=resource['filename'],
                        md5=resource['md5'],
                        verbose=verbose,
                    )
                    success = True
    
                # pylint: disable=overlapping-except
                except (URLError, OSError, RuntimeError) as error:
                    # Error downloading the resource, try next mirror
                    if mirror_idx < len(definition.mirrors) - 1:
                        print(f'Failed to download:\n{error}\nTrying next mirror.')
                    continue
    
                # downloading the resource was successful, we don't need to try another mirror
                break
    
            if not success:
>               raise RuntimeError(
                    f"downloading resource {resource['resource']} failed for all mirrors.",
                )
E               RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.

definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
extract    = True
mirror     = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/'
mirror_idx = 0
paths      = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f205bdcfa90>
remove_finished = False
resource   = {'filename': 'csvs.zip', 'md5': '3cf074c93266b723437cf887f948c993', 'resource': '64525979230ea6163c031267/?zip='}
success    = False
url        = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip='
verbose    = True

src/pymovements/dataset/dataset_download.py:108: RuntimeError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
Downloading https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip= to /tmp/pytest-of-krakowczyk/pytest-0/test_public_dataset_processing5/SBSAT/downloads/csvs.zip
Checking integrity of csvs.zip
---------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------
csvs.zip: 100%|██████████| 403M/403M [04:03<00:00, 1.74MB/s]
================================================================================================ warnings summary =================================================================================================
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
  /mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if LooseVersion(module.__version__) < minver:

../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
  /mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    other = LooseVersion(other)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================= short test summary info =============================================================================================
FAILED tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[GazeBase] - exceptions.ComputeError: arithmetic on string and numeric not allowed, try an explicit cast first
FAILED tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[SBSAT] - RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.
============================================================================== 2 failed, 6 passed, 10 warnings in 4148.91s (1:09:08) ==============================================================================

The error on GazeBase exactly reproduces #517. I will now merge #593 into this PR and see if that gets rid of the error.

The failure on SB-Sat is strange though. @prassepaul, do you know why that happened?
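
For reference, the ComputeError above is the kind polars raises when a column was parsed as strings; a minimal sketch of the explicit cast it asks for, using a hypothetical column, not the actual fix from #593:

```python
import polars as pl

# a list column whose inner values were parsed as strings instead of floats
df = pl.DataFrame({'position': [['0.1', '0.2'], ['0.3', '0.4']]})

# cast the inner values to Float64 before doing any arithmetic on the column
df = df.with_columns(pl.col('position').cast(pl.List(pl.Float64)))
print(df.dtypes)  # [List(Float64)]
```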

@codecov bot commented Oct 6, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (8836275) 100.00% compared to head (575ba37) 100.00%.

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #591   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           52        52           
  Lines         2337      2337           
  Branches       582       582           
=========================================
  Hits          2337      2337           


@dkrako (Contributor, Author) commented Oct 6, 2023

Merging #593 into this PR resolves #517:

=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.9.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/krakowczyk/workspace/pymovements, configfile: pyproject.toml
plugins: anyio-3.7.1, dash-2.11.1, lazy-fixture-0.6.3, hydra-core-1.3.2, cov-3.0.0
collected 8 items                                                                                                                                                                                                 

tests/integration/public_dataset_processing_test.py .....F..                                                                                                                                                [100%]

==================================================================================================== FAILURES =====================================================================================================
______________________________________________________________________________________ test_public_dataset_processing[SBSAT] ______________________________________________________________________________________

dataset_name = 'SBSAT', tmp_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5')

    @pytest.mark.parametrize(
        'dataset_name',
        list(pm.dataset.DatasetLibrary.definitions.keys()),
    )
    def test_public_dataset_processing(dataset_name, tmp_path):
        # Initialize dataset.
        dataset_path = tmp_path / dataset_name
        dataset = pm.Dataset(dataset_name, path=dataset_path)
    
        # Download and load in dataset.
>       dataset.download()

dataset    = <pymovements.dataset.dataset.Dataset object at 0x7f1f13b22460>
dataset_name = 'SBSAT'
dataset_path = PosixPath('/tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5/SBSAT')
tmp_path   = PosixPath('/tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5')

tests/integration/public_dataset_processing_test.py:36: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pymovements.dataset.dataset.Dataset object at 0x7f1f13b22460>

    def download(
            self,
            *,
            extract: bool = True,
            remove_finished: bool = False,
            verbose: int = 1,
    ) -> Dataset:
        """Download dataset resources.
    
        This downloads all resources of the dataset. Per default this also extracts all archives
        into :py:meth:`Dataset.paths.raw`,
        To save space on your device you can remove the archive files after
        successful extraction with ``remove_finished=True``.
    
        If a corresponding file already exists in the local system, its checksum is calculated and
        checked against the expected checksum.
        Downloading will be evaded if the integrity of the existing file can be verified.
        If the existing file does not match the expected checksum it is overwritten with the
        downloaded new file.
    
        Parameters
        ----------
        extract : bool
            Extract dataset archive files.
        remove_finished : bool
            Remove archive files after extraction.
        verbose : int
            Verbosity levels: (1) Show download progress bar and print info messages on downloading
            and extracting archive files without printing messages for recursive archive extraction.
            (2) Print additional messages for each recursive archive extract.
    
        Raises
        ------
        AttributeError
            If number of mirrors or number of resources specified for dataset is zero.
        RuntimeError
            If downloading a resource failed for all given mirrors.
    
        Returns
        -------
        PublicDataset
            Returns self, useful for method cascading.
        """
>       dataset_download.download_dataset(
            definition=self.definition,
            paths=self.paths,
            extract=extract,
            remove_finished=remove_finished,
            verbose=bool(verbose),
        )

extract    = True
remove_finished = False
self       = <pymovements.dataset.dataset.Dataset object at 0x7f1f13b22460>
verbose    = 1

src/pymovements/dataset/dataset.py:761: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
paths = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f1f13b22220>, extract = True, remove_finished = False, verbose = True

    def download_dataset(
            definition: DatasetDefinition,
            paths: DatasetPaths,
            extract: bool = True,
            remove_finished: bool = False,
            verbose: bool = True,
    ) -> None:
        """Download dataset resources.
    
        This downloads all resources of the dataset. Per default this also extracts all archives
        into :py:meth:`Dataset.paths.raw`,
        To save space on your device you can remove the archive files after
        successful extraction with ``remove_finished=True``.
    
        If a corresponding file already exists in the local system, its checksum is calculated and
        checked against the expected checksum.
        Downloading will be evaded if the integrity of the existing file can be verified.
        If the existing file does not match the expected checksum it is overwritten with the
        downloaded new file.
    
        Parameters
        ----------
        definition:
            The dataset definition.
        paths:
            The dataset paths.
        extract : bool
            Extract dataset archive files.
        remove_finished : bool
            Remove archive files after extraction.
        verbose : bool
            If True, show progress of download and print status messages for integrity checking and
            file extraction.
    
        Raises
        ------
        AttributeError
            If number of mirrors or number of resources specified for dataset is zero.
        RuntimeError
            If downloading a resource failed for all given mirrors.
        """
        if len(definition.mirrors) == 0:
            raise AttributeError('number of mirrors must not be zero to download dataset')
    
        if len(definition.resources) == 0:
            raise AttributeError('number of resources must not be zero to download dataset')
    
        paths.raw.mkdir(parents=True, exist_ok=True)
    
        for resource in definition.resources:
            success = False
    
            for mirror_idx, mirror in enumerate(definition.mirrors):
    
                url = f'{mirror}{resource["resource"]}'
    
                try:
                    download_file(
                        url=url,
                        dirpath=paths.downloads,
                        filename=resource['filename'],
                        md5=resource['md5'],
                        verbose=verbose,
                    )
                    success = True
    
                # pylint: disable=overlapping-except
                except (URLError, OSError, RuntimeError) as error:
                    # Error downloading the resource, try next mirror
                    if mirror_idx < len(definition.mirrors) - 1:
                        print(f'Failed to download:\n{error}\nTrying next mirror.')
                    continue
    
                # downloading the resource was successful, we don't need to try another mirror
                break
    
            if not success:
>               raise RuntimeError(
                    f"downloading resource {resource['resource']} failed for all mirrors.",
                )
E               RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.

definition = SBSAT(name='SBSAT', mirrors=('https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/',), resources=({'reso...ns=['x_left', 'y_left'], position_columns=None, velocity_columns=None, acceleration_columns=None, distance_column=None)
extract    = True
mirror     = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/'
mirror_idx = 0
paths      = <pymovements.dataset.dataset_paths.DatasetPaths object at 0x7f1f13b22220>
remove_finished = False
resource   = {'filename': 'csvs.zip', 'md5': '3cf074c93266b723437cf887f948c993', 'resource': '64525979230ea6163c031267/?zip='}
success    = False
url        = 'https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip='
verbose    = True

src/pymovements/dataset/dataset_download.py:108: RuntimeError
---------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------
Downloading https://files.de-1.osf.io/v1/resources/cdx69/providers/osfstorage/64525979230ea6163c031267/?zip= to /tmp/pytest-of-krakowczyk/pytest-1/test_public_dataset_processing5/SBSAT/downloads/csvs.zip
Checking integrity of csvs.zip
---------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------
csvs.zip: 100%|██████████| 403M/403M [03:58<00:00, 1.77MB/s]
================================================================================================ warnings summary =================================================================================================
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169
  /mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/matplotlib/__init__.py:169: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if LooseVersion(module.__version__) < minver:

../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
../../../../mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346
  /mnt/scratch/krakowczyk/venvs/cuda113/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    other = LooseVersion(other)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================= short test summary info =============================================================================================
FAILED tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[SBSAT] - RuntimeError: downloading resource 64525979230ea6163c031267/?zip= failed for all mirrors.
============================================================================== 1 failed, 7 passed, 10 warnings in 5418.02s (1:30:18) ==============================================================================

Also note the runtime of 5418.02s (1:30:18), on our DGX using all cores and 60+ GB of RAM.

@dkrako marked this pull request as ready for review October 6, 2023 15:09
@dkrako (Contributor, Author) commented Oct 6, 2023

The problem with SBSAT should be solved in a separate issue. This PR is now ready for review.

We will not include integration tests in our CI (yet); this single test run took 90 minutes (with one dataset failing right at the start of its download). I rather think integration testing should be limited to a single run before publishing each release.

Until we solve our very high memory usage, I can run these tests manually on our DGX via tox -e integration (see the sketch below).
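
For illustration, such an environment could be declared roughly like this in tox.ini; this is an assumed sketch, not the project's actual tox configuration:

```ini
; hypothetical tox.ini entry for running only the integration tests
[testenv:integration]
deps = pytest
commands = pytest tests/integration {posargs}
```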

@dkrako force-pushed the tests/integration-tests branch from 6560078 to 3aef774 on October 6, 2023 15:20
@dkrako enabled auto-merge (squash) October 10, 2023 07:14
@dkrako merged commit 1a603f8 into main Oct 10, 2023
15 checks passed
@dkrako deleted the tests/integration-tests branch October 10, 2023 12:02
@dkrako changed the title from "tests: Add integration tests for public datasets" to "test: Add integration tests for public datasets" Oct 10, 2023