
Extract Python and Dask Executor classes from Workflow #1609

Merged
merged 16 commits into main from refactor/decouple-dask on Aug 15, 2022

Conversation

karlhigley
Contributor

We'd like to reuse some of the mechanics of graph execution (both local and distributed) in other parts of Merlin, so this is a step toward disentangling graph execution from Workflow itself. It removes direct dependencies on Dask from Workflow and centralizes them in MerlinDaskExecutor, which Workflow can then use in conjunction with a Merlin operator DAG to run distributed computations.

In the future, we'd like to use these Executor classes in Merlin Systems too, so that we can run the full process of generating recommendations (also represented as a Merlin DAG) interchangeably, either in Triton (using MerlinPythonExecutor) or on Dask (using MerlinDaskExecutor).
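As a rough illustration of the split described above, here's a minimal sketch. The class names come from the PR description, but the node/operator interface, constructor arguments, and method signatures are assumptions made for illustration, not the exact API this change introduces.

class MerlinPythonExecutor:
    """Run a list of DAG nodes eagerly against a single DataFrame-like object."""

    def transform(self, transformable, nodes):
        for node in nodes:
            # Assumed node interface: each node pairs a column selector with an operator.
            transformable = node.op.transform(node.selector, transformable)
        return transformable


class MerlinDaskExecutor:
    """Run the same DAG nodes across the partitions of a Dask DataFrame."""

    def __init__(self, client=None):
        self.client = client  # optional distributed.Client for a running cluster

    def transform(self, ddf, nodes):
        # Delegate the per-partition work to the local executor.
        local = MerlinPythonExecutor()
        return ddf.map_partitions(local.transform, nodes)


class Workflow:
    """Holds an operator DAG and delegates execution to an executor."""

    def __init__(self, output_node, executor=None):
        self.output_node = output_node
        self.executor = executor or MerlinDaskExecutor()

    def transform(self, ddf):
        # No direct Dask calls here; the executor encapsulates them.
        return self.executor.transform(ddf, [self.output_node])

In a setup like this, swapping in MerlinPythonExecutor (e.g. inside a Triton model) would run the same graph without any Dask dependency, while MerlinDaskExecutor handles the distributed case.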

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1609 of commit 4f3e941e62750333eccd6899cccf6181575b9b1e, no merge conflicts.
Running as SYSTEM
Setting status of 4f3e941e62750333eccd6899cccf6181575b9b1e to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4573/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 4f3e941e62750333eccd6899cccf6181575b9b1e^{commit} # timeout=10
Checking out Revision 4f3e941e62750333eccd6899cccf6181575b9b1e (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 4f3e941e62750333eccd6899cccf6181575b9b1e # timeout=10
Commit message: "Clean up `MerlinDaskExecutor.fit()`"
 > git rev-list --no-walk 1be6d8849ce7ced685fb755e168766b150e37536 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins4082524424769022190.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1428 items

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
[ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py .. [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 62%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_s3.py: 2 warnings
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========== 1427 passed, 1 skipped, 619 warnings in 710.53s (0:11:50) ===========
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins16256471248351385987.sh

@github-actions

Documentation preview

https://nvidia-merlin.github.io/NVTabular/review/pr-1609

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1609 of commit 64914a5f8965c646133e4417b807717ebfde610f, no merge conflicts.
Running as SYSTEM
Setting status of 64914a5f8965c646133e4417b807717ebfde610f to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4583/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 64914a5f8965c646133e4417b807717ebfde610f^{commit} # timeout=10
Checking out Revision 64914a5f8965c646133e4417b807717ebfde610f (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 64914a5f8965c646133e4417b807717ebfde610f # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk d5d379101ec42f6ba7b7f31fc9f3237f29d1b5fb # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins1193956549961660074.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1428 items

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
[ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py .. [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 62%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_s3.py: 2 warnings
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========== 1427 passed, 1 skipped, 619 warnings in 699.43s (0:11:39) ===========
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins7725628539816512595.sh

@rjzamora
Collaborator

I suppose this would (partially) intersect with NVIDIA-Merlin/core#70

@karlhigley
Contributor Author

Yeah, good point @rjzamora. I would like to be able to do Dask computations across all the Merlin libraries, and also use Merlin graphs to run computations without Dask in some contexts (e.g. in Triton), so I ended up with a somewhat different design.

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1609 of commit 7ca7c0def80043f81602f0400142d8e866a5d562, no merge conflicts.
Running as SYSTEM
Setting status of 7ca7c0def80043f81602f0400142d8e866a5d562 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4600/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 7ca7c0def80043f81602f0400142d8e866a5d562^{commit} # timeout=10
Checking out Revision 7ca7c0def80043f81602f0400142d8e866a5d562 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7ca7c0def80043f81602f0400142d8e866a5d562 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 54c0038e16bfb8603e3f6ec7cbebb8ae5a4dc4a9 # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins2239481737267509604.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1428 items

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
[ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py .. [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 62%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_s3.py: 2 warnings
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========== 1427 passed, 1 skipped, 619 warnings in 697.38s (0:11:37) ===========
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins4461961831957264715.sh

@viswa-nvidia added this to the Merlin 22.08 milestone on Jul 29, 2022
@viswa-nvidia

Arbitration: which initiative is this under?

)
)

def fit(self, ddf, nodes):
Member

📄 Missing `nodes` in the Parameters docstring here.
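For example, the docstring's Parameters section might be extended along these lines (a hypothetical numpydoc-style sketch; the actual types and wording in the PR may differ):

def fit(self, ddf, nodes):
    """Calculate statistics for a set of graph nodes against a dataframe.

    Parameters
    ----------
    ddf : dask.dataframe.DataFrame
        The dataframe to collect statistics from.
    nodes : list of merlin.dag.Node
        The graph nodes whose stat operators should be fit on ``ddf``.
    """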


def __getstate__(self):
# dask client objects aren't picklable - exclude from saved representation
return {k: v for k, v in self.__dict__.items() if k != "client"}
Member

❓ I'm wondering where the `client` attribute is being set on the object (the attribute this code is trying to exclude). I don't see a `self.client` assignment in here; it could be something outside this module doing it, I suppose. Not suggesting we remove this now: it was here before, and keeping it reduces risk.

Contributor Author

I think this is left over from a version before I realized I should use `set_client_deprecated`, so it's likely safe to remove. From writing this, I know that NVT tests will fail when saving a Workflow if there's a non-serializable `client` attribute on this object, so if removing it turns out to be problematic, we'll find out quickly.
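For reference, the pattern being discussed looks roughly like this; ExecutorLike and its attributes are made up for illustration, and this is not the PR's actual code:

import pickle


class ExecutorLike:
    def __init__(self, client=None):
        self.client = client  # may hold a live distributed.Client, which isn't picklable
        self.stats = {}       # stand-in for picklable fitted state

    def __getstate__(self):
        # Exclude the client from the saved representation, as in the snippet above.
        return {k: v for k, v in self.__dict__.items() if k != "client"}

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.client = None    # a client can be re-attached after loading


restored = pickle.loads(pickle.dumps(ExecutorLike(client=object())))
assert restored.client is None and restored.stats == {}

If `client` is truly never set on this object, the `__getstate__` override is effectively a no-op, which is consistent with removing it as discussed above.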

@oliverholworthy
Member

😃 This PR is a great example of separating changes into well-defined commits, which makes a refactor like this easy to review. 🚀 It looks like a great step toward being able to run these transforms in different modes.

I imagine we'll identify further changes as we try to use this in Systems, but in the interest of keeping the changes relatively small, it looks to be in a mergeable state to me.

@karlhigley
Contributor Author

@viswa-nvidia This PR was opened on the premise that we'd be working on offline batch recommendation generation in 22.08, as we'd planned before session-based work bumped it out of the way. Since we still plan to work on offline batch (albeit later than we'd originally hoped), this PR is still relevant, but it isn't tied to one of the pieces of work slated for 22.08.

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1609 of commit 242fc3657c847d7ed026dc657dc5a331c73ca015, no merge conflicts.
Running as SYSTEM
Setting status of 242fc3657c847d7ed026dc657dc5a331c73ca015 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4612/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 242fc3657c847d7ed026dc657dc5a331c73ca015^{commit} # timeout=10
Checking out Revision 242fc3657c847d7ed026dc657dc5a331c73ca015 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 242fc3657c847d7ed026dc657dc5a331c73ca015 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 302f7c355a27bd485f293a4494785ea89d29949e # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins2058300991048675202.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1432 items

tests/unit/test_dask_nvt.py ..........................F..F....F......F.. [ 3%]
F.................................................................FFF... [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py FF [ 8%]
tests/unit/test_tf4rec.py . [ 9%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 21%]
........................................s.. [ 24%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 26%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py FFFFFF [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
____ test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr26')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:10:53,272 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-a117a5c3563047ab7c1e46c936b45b04', 1)
Function: subgraph_callable-eec4959e-4b83-466a-b446-9bb87151
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr26/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___ test_dask_workflow_api_dlrm[True-None-True-device-150-csv-no-header-0.1] ___

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr29')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
      df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:10:55,307 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-ac98640b3fa44ac29eff10c91786542c', 1)
Function: subgraph_callable-723280cd-5667-4086-b5a7-3509cc3a
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr29/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_________ test_dask_workflow_api_dlrm[True-None-True-None-150-csv-0.1] _________

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr34')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv', cat_cache = None
on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
        df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:10:58,216 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-41153e01c8fc6f5939c438d5c8bb0aed', 0)
Function: subgraph_callable-2e6d8883-6283-40d5-8469-02fc19d6
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr34/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

__ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1] ___

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr41')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
        df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:11:02,469 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-d4101e91f9873c58557cd7d56b525793', 1)
Function: subgraph_callable-04a503ca-8711-4a43-ad5f-9be72c3e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr41/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

____ test_dask_workflow_api_dlrm[True-None-False-None-0-csv-no-header-0.1] _____

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr44')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = None, on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
        df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:11:04,529 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-e697312ba72ed052bd71ceb256da36a4', 1)
Function: subgraph_callable-a9e4f333-d96d-439a-9565-49dec56a
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_workflow_api_dlrm_Tr44/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________ test_dask_preproc_cpu[True-None-parquet] ___________________

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non0')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
engine = 'parquet', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
    df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-02 14:11:46,515 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-38892a42e6efb5a7f77e9e32dd415ba5', 14)
Function: subgraph_callable-d3152863-1a2f-4a95-aeea-22c61b92
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-02 14:11:46,516 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-38892a42e6efb5a7f77e9e32dd415ba5', 15)
Function: subgraph_callable-d3152863-1a2f-4a95-aeea-22c61b92
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:46,519 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-38892a42e6efb5a7f77e9e32dd415ba5', 12)
Function: subgraph_callable-d3152863-1a2f-4a95-aeea-22c61b92
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

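The same pattern repeats for every row group of every `part_N.parquet` written by `to_parquet(..., out_files_per_proc=4)`, which points at incomplete files on disk rather than a reader-side problem. A quick way to validate each written part directly with pyarrow, sketched under the assumption that the output directory is local (`output_dir` below is illustrative):

```python
import glob

import pyarrow.parquet as pq

output_dir = "processed"  # illustrative; the real path is the pytest tmpdir shown above
for path in sorted(glob.glob(f"{output_dir}/part_*.parquet")):
    try:
        # Opening the file forces pyarrow to parse the footer metadata;
        # a truncated file raises the same ArrowInvalid seen in the log.
        md = pq.ParquetFile(path).metadata
        print(path, f"{md.num_rows} rows in {md.num_row_groups} row group(s)")
    except Exception as exc:
        print(path, "unreadable:", exc)
```
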
_____________________ test_dask_preproc_cpu[True-None-csv] _____________________

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
engine = 'csv', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
    df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:11:47,479 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 12)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,480 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 18)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,481 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 21)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,481 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 14)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,482 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 2)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,482 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 10)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,482 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 16)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,483 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 1)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,484 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 0)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,485 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 15)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,486 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 13)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,487 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 11)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,487 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 17)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,487 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 20)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,488 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 19)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,488 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 22)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

--------------------------- Captured stderr teardown ---------------------------
2022-08-02 14:11:47,495 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 8)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,498 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 6)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_1.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,499 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 5)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_1.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,500 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 3)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,507 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 4)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,511 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 7)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_1.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:47,514 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-8f01355415a57a595bd1a3d7090180cf', 9)
Function: subgraph_callable-0b60d4d2-9943-4cc3-9496-bc4b18e7
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

________________ test_dask_preproc_cpu[True-None-csv-no-header] ________________

client = <Client: 'tcp://127.0.0.1:37465' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non2')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
engine = 'csv-no-header', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
    df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:11:48,171 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-3b9570b799cadec73fd64f5f4d9b0c9e', 13)
Function: subgraph_callable-1b15a093-7e0c-45e7-9a1f-a74b059f
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:48,174 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-3b9570b799cadec73fd64f5f4d9b0c9e', 11)
Function: subgraph_callable-1b15a093-7e0c-45e7-9a1f-a74b059f
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non2/processed/part_2.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:11:48,176 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-3b9570b799cadec73fd64f5f4d9b0c9e', 15)
Function: subgraph_callable-1b15a093-7e0c-45e7-9a1f-a74b059f
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-12/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________________ test_s3_dataset[parquet] ___________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
        raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7fe918ad2b20>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918b9f3d0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe9d30bd220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe918b8e460>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/parquet', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918b9f3d0>
_stacktrace = <traceback object at 0x7fe9114049c0>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918b9f3d0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe9d30bd220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe918b8e460>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918b9f3d0>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
method = 'PUT', url = '/parquet'
timeout = <urllib3.util.timeout.Timeout object at 0x7fe918b8e460>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe91afa2d30>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: bb55e11d-7809-400d-99db-753fa4d71a84\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
str = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: bb55e11d-7809-400d-99db-753fa4d71a84\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>
data = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: bb55e11d-7809-400d-99db-753fa4d71a84\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91afa2fd0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-12/parquet0/dataset-0.parquet', '/tmp/pytest-of-jenkins/pytest-12/parquet0/dataset-1.parquet']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
engine = 'parquet'
df = name-cat name-string id label x y
0 Bob Frank 977 1039 0.430966 0.771394
...la 935 975 -0.258980 0.125659
4320 Alice Oliver 988 1060 -0.785203 0.746451

[4321 rows x 6 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7fe918ad2b20>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'bb55e11d-7809-400d-99db-753fa4d71a84', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/parquet"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
---------------------------- Captured stderr setup -----------------------------
Traceback (most recent call last):
File "/usr/local/bin/moto_server", line 5, in
from moto.server import main
File "/usr/local/lib/python3.8/dist-packages/moto/server.py", line 7, in
from moto.moto_server.werkzeug_app import (
File "/usr/local/lib/python3.8/dist-packages/moto/moto_server/werkzeug_app.py", line 6, in
from flask import Flask
File "/usr/local/lib/python3.8/dist-packages/flask/init.py", line 4, in
from . import json as json
File "/usr/local/lib/python3.8/dist-packages/flask/json/init.py", line 8, in
from ..globals import current_app
File "/usr/local/lib/python3.8/dist-packages/flask/globals.py", line 56, in
app_ctx: "AppContext" = LocalProxy( # type: ignore[assignment]
TypeError: init() got an unexpected keyword argument 'unbound_message'
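
The captured stderr above points at the root cause: `moto_server` never starts because Flask fails at import time, so nothing is listening on 127.0.0.1:5000 and every PUT to the mock S3 endpoint is refused. A likely explanation (an assumption on my part, not confirmed by the log) is a Flask/Werkzeug version mismatch: Flask 2.2 passes `unbound_message` to `werkzeug.local.LocalProxy`, a keyword Werkzeug only accepts from 2.2.0 onward. A minimal diagnostic sketch under that assumption:

```python
# Diagnostic sketch, assuming the moto_server crash comes from pairing
# Flask >= 2.2 with Werkzeug < 2.2 (the usual source of the
# "unexpected keyword argument 'unbound_message'" TypeError).
from importlib.metadata import version

from packaging.version import Version

flask_v = Version(version("flask"))
werkzeug_v = Version(version("werkzeug"))
print(f"flask={flask_v} werkzeug={werkzeug_v}")

if flask_v >= Version("2.2") and werkzeug_v < Version("2.2"):
    # Aligning the pair (e.g. `pip install "werkzeug>=2.2"`, or pinning both
    # to 2.1.x) should let moto_server start, which in turn lets the
    # test_s3_dataset fixtures create their buckets on 127.0.0.1:5000.
    print("Flask/Werkzeug mismatch -- likely cause of the moto startup failure")
```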
_____________________________ test_s3_dataset[csv] _____________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7fe9114914f0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918241be0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe9d30bd220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe91afa44c0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/csv', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918241be0>
_stacktrace = <traceback object at 0x7fe918831d40>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918241be0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7fe9d30bd220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe91afa44c0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:
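The urlopen docstring above describes how retries, release_conn, and chunked interact. A minimal sketch of a call that exercises those parameters (host, port, and the helper name are illustrative, not taken from this traceback; preload_content travels through **response_kw):

import urllib3
from urllib3.util.retry import Retry

def put_csv(host, port):
    # Illustrative values only; mirrors the parameters documented above.
    pool = urllib3.HTTPConnectionPool(host, port=port)
    return pool.urlopen(
        "PUT",
        "/csv",
        retries=Retry(total=3, redirect=2),
        release_conn=False,     # caller must call resp.release_conn() afterwards
        chunked=False,          # send with Content-Length rather than chunked encoding
        preload_content=False,  # forwarded via **response_kw
    )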


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe918241be0>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
method = 'PUT', url = '/csv'
timeout = <urllib3.util.timeout.Timeout object at 0x7fe91afa44c0>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe91af2e970>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d3fec743-d9f5-40fe-ada7-4db95610b271\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
str = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d3fec743-d9f5-40fe-ada7-4db95610b271\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>
data = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d3fec743-d9f5-40fe-ada7-4db95610b271\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fe91af2ea90>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-12/csv0/dataset-0.csv', '/tmp/pytest-of-jenkins/pytest-12/csv0/dataset-1.csv']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-12/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-12/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-12/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-12/parquet0')}
engine = 'csv'
df = name-string id label x y
0 Frank 977 1039 0.430966 0.771394
1 Bob ... Ursula 935 975 -0.258980 0.125659
2160 Oliver 988 1060 -0.785203 0.746451

[4321 rows x 5 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7fe9114914f0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'd3fec743-d9f5-40fe-ada7-4db95610b271', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/csv"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
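Both test_s3_dataset failures reduce to the same thing: nothing is accepting connections on the mock S3 endpoint at 127.0.0.1:5000 when create_bucket is called. A minimal reachability pre-check along those lines (a sketch; the host and port come from the s3so fixture above, and the helper name is hypothetical):

import socket

def s3_mock_reachable(host="127.0.0.1", port=5000, timeout=1.0):
    # True only if something is listening on the mock S3 endpoint.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

If this returns False in CI, the mock S3 server behind the s3_base fixture never came up, which would explain the Connection refused / EndpointConnectionError chain above.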
_____________________ test_cpu_workflow[True-True-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_pa0')
df = name-cat name-string id label x y
0 Bob Frank 977 1039 0.430966 0.771394
...la 935 975 -0.258980 0.125659
4320 Alice Oliver 988 1060 -0.785203 0.746451

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8905f5160>, cpu = True
engine = 'parquet', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_pa0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_pa0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
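The ArrowInvalid above says the written part_0.parquet lacks the Parquet magic bytes; a valid Parquet file begins and ends with the 4-byte marker PAR1. A quick sanity check of that footer (a sketch; the path argument is whichever file pyarrow rejected):

def has_parquet_magic(path):
    # Parquet files start and end with the 4-byte magic b"PAR1".
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # 4 bytes before the end of the file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"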
_______________________ test_cpu_workflow[True-True-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs0')
df = name-string id label x y
0 Frank 977 1039 0.430966 0.771394
1 Bob ... Ursula 935 975 -0.258980 0.125659
2160 Oliver 988 1060 -0.785203 0.746451

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8e83beeb0>, cpu = True
engine = 'csv', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_cpu_workflow[True-True-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs1')
df = name-string id label x y
0 Frank 977 1039 0.430966 0.771394
1 Bob ... Ursula 935 975 -0.258980 0.125659
2160 Oliver 988 1060 -0.785203 0.746451

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8c064d040>, cpu = True
engine = 'csv-no-header', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_True_cs1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
____________________ test_cpu_workflow[True-False-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_p0')
df = name-cat name-string id label x y
0 Bob Frank 977 1039 0.430966 0.771394
...la 935 975 -0.258980 0.125659
4320 Alice Oliver 988 1060 -0.785203 0.746451

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8c8512ee0>, cpu = True
engine = 'parquet', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_p0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_p0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
______________________ test_cpu_workflow[True-False-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c0')
df = name-string id label x y
0 Frank 977 1039 0.430966 0.771394
1 Bob ... Ursula 935 975 -0.258980 0.125659
2160 Oliver 988 1060 -0.785203 0.746451

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8c07c7220>, cpu = True
engine = 'csv', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_________________ test_cpu_workflow[True-False-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c1')
df = name-string id label x y
0 Frank 977 1039 0.430966 0.771394
1 Bob ... Ursula 935 975 -0.258980 0.125659
2160 Oliver 988 1060 -0.785203 0.746451

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7fe8c8621e80>, cpu = True
engine = 'csv-no-header', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-12/test_cpu_workflow_True_False_c1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-None-150-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-None-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-parquet]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv] - py...
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv-no-header]
FAILED tests/unit/test_s3.py::test_s3_dataset[parquet] - botocore.exceptions....
FAILED tests/unit/test_s3.py::test_s3_dataset[csv] - botocore.exceptions.Endp...
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv-no-header]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv-no-header]
===== 16 failed, 1415 passed, 1 skipped, 617 warnings in 736.74s (0:12:16) =====
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins4796112231564463528.sh

@karlhigley
Contributor Author

I'm not able to reproduce these test failures locally, even in the merlin_ci_runner image. Going to try a re-run 🤷🏻
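For reference, one way to rerun a single failing parameterization locally is to pass its node ID from the short test summary above straight to pytest; a sketch using pytest.main (the extra flags are optional):

import pytest

# Node ID copied from the short test summary above.
pytest.main([
    "tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]",
    "-x",
    "-q",
])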

@karlhigley
Contributor Author

rerun tests

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1609 of commit 242fc3657c847d7ed026dc657dc5a331c73ca015, no merge conflicts.
Running as SYSTEM
Setting status of 242fc3657c847d7ed026dc657dc5a331c73ca015 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4613/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 242fc3657c847d7ed026dc657dc5a331c73ca015^{commit} # timeout=10
Checking out Revision 242fc3657c847d7ed026dc657dc5a331c73ca015 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 242fc3657c847d7ed026dc657dc5a331c73ca015 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 242fc3657c847d7ed026dc657dc5a331c73ca015 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins7892443554037532412.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1432 items

tests/unit/test_dask_nvt.py ..........................F....F............ [ 3%]
...F...............................................................F.... [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_s3.py FF [ 8%]
tests/unit/test_tf4rec.py . [ 9%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 21%]
........................................s.. [ 24%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 26%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py FFFFFF [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
____ test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:42499' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr26')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = 'device', on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
>       df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:50:46,240 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-59cbff4bfa9b201755371def3a4a8ee0', 1)
Function: subgraph_callable-7e8dc1fb-908b-45ec-a6cb-e042825e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr26/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

__________ test_dask_workflow_api_dlrm[True-None-True-None-0-csv-0.1] __________

client = <Client: 'tcp://127.0.0.1:42499' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr31')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv', cat_cache = None
on_host = True, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
>       df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:50:49,413 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-4fa23bb4f606e99d8314e594eb4d3c5d', 0)
Function: subgraph_callable-432f28e8-49bd-45c5-869f-248b0670
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr31/processed/part_0.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___ test_dask_workflow_api_dlrm[True-None-False-None-150-csv-no-header-0.1] ____

client = <Client: 'tcp://127.0.0.1:42499' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr47')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header'
cat_cache = None, on_host = False, shuffle = None, cpu = True

@pytest.mark.parametrize("part_mem_fraction", [0.1])
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("freq_threshold", [0, 150])
@pytest.mark.parametrize("cat_cache", ["device", None])
@pytest.mark.parametrize("on_host", [True, False])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [True, False])
def test_dask_workflow_api_dlrm(
    client,
    tmpdir,
    datasets,
    freq_threshold,
    part_mem_fraction,
    engine,
    cat_cache,
    on_host,
    shuffle,
    cpu,
):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    paths = sorted(paths)
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)
    df0 = df0.to_pandas() if cpu else df0

    if engine == "parquet":
        cat_names = ["name-cat", "name-string"]
    else:
        cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    cats = cat_names >> ops.Categorify(
        freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host
    )

    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp()

    workflow = Workflow(cats + conts + label_name)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction)
    else:
        dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction)

    output_path = os.path.join(tmpdir, "processed")

    transformed = workflow.fit_transform(dataset)
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1)

    result = transformed.to_ddf().compute()
    assert len(df0) == len(result)
    assert result["x"].min() == 0.0
    assert result["x"].isna().sum() == 0
    assert result["y"].min() == 0.0
    assert result["y"].isna().sum() == 0

    # Check categories.  Need to sort first to make sure we are comparing
    # "apples to apples"
    expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index()
    dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]]
    dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg(
        {"name-string_x": "count", "name-string_y": "count"}
    )
    if freq_threshold:
        dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold]
    assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False)

    # Read back from disk
    if cpu:
>       df_disk = dd_read_parquet(output_path).compute()

tests/unit/test_dask_nvt.py:130:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-02 14:50:58,418 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-f624361c9960e8bfe9f17d1c64ec291a', 1)
Function: subgraph_callable-66c6180a-b49f-4f3d-9cea-eba316c3
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_workflow_api_dlrm_Tr47/processed/part_1.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

_____________________ test_dask_preproc_cpu[True-None-csv] _____________________

client = <Client: 'tcp://127.0.0.1:42499' processes=2 threads=16, memory=125.83 GiB>
tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1')
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
engine = 'csv', shuffle = None, cpu = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None])
@pytest.mark.parametrize("cpu", [None, True])
def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu):
    set_dask_client(client=client)
    paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0])
    if engine == "parquet":
        df1 = cudf.read_parquet(paths[0])[mycols_pq]
        df2 = cudf.read_parquet(paths[1])[mycols_pq]
    elif engine == "csv":
        df1 = cudf.read_csv(paths[0], header=0)[mycols_csv]
        df2 = cudf.read_csv(paths[1], header=0)[mycols_csv]
    else:
        df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv]
        df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv]
    df0 = cudf.concat([df1, df2], axis=0)

    if engine in ("parquet", "csv"):
        dataset = Dataset(paths, part_size="1MB", cpu=cpu)
    else:
        dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu)

    # Simple transform (normalize)
    cat_names = ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]
    conts = cont_names >> ops.FillMissing() >> ops.Normalize()
    workflow = Workflow(conts + cat_names + label_name)
    transformed = workflow.fit_transform(dataset)

    # Write out dataset
    output_path = os.path.join(tmpdir, "processed")
    transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4)

    # Check the final result
>   df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()

tests/unit/test_dask_nvt.py:277:


/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
(result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
???


???
E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-02 14:51:39,276 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 10)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,277 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 14)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-02 14:51:39,280 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 17)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,281 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 20)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,281 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 22)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,282 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 19)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,282 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 18)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

--------------------------- Captured stderr teardown ---------------------------
2022-08-02 14:51:39,312 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 21)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,315 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 16)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_4.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,315 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 23)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_5.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,316 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 25)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_6.parquet', [1], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,319 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 27)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_6.parquet', [3], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,323 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 26)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_6.parquet', [2], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,324 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 28)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_7.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

2022-08-02 14:51:39,326 - distributed.worker - WARNING - Compute Failed
Key: ('read-parquet-c93258fabc7094400b097695615335f6', 24)
Function: subgraph_callable-6c250d85-77cf-4cb2-aff8-71d6729e
args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-14/test_dask_preproc_cpu_True_Non1/processed/part_6.parquet', [0], [])})
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"

___________________________ test_s3_dataset[parquet] ___________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
>       conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
>       raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
>           sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7f1e9d7ec7c0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
>       urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e6d2947c0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f1f6c0eb220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e6c3bcee0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
>       retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/parquet', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e6d2947c0>
_stacktrace = <traceback object at 0x7f1e6bb84ac0>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
>       raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
>       raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e6d2947c0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f1f6c0eb220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e6c3bcee0>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
>       httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e6d2947c0>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
method = 'PUT', url = '/parquet'
timeout = <urllib3.util.timeout.Timeout object at 0x7f1e6c3bcee0>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e6c3bc280>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
>       conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
method = 'PUT', url = '/parquet', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: a1d8cf96-eda6-4e6a-b090-537d987ca6eb\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
str = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: a1d8cf96-eda6-4e6a-b090-537d987ca6eb\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>
data = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: a1d8cf96-eda6-4e6a-b090-537d987ca6eb\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e6c3bcfd0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-14/parquet0/dataset-0.parquet', '/tmp/pytest-of-jenkins/pytest-14/parquet0/dataset-1.parquet']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
engine = 'parquet'
df = name-cat name-string id label x y
0 Yvonne Xavier 991 986 0.157298 -0.169087
...ry 995 1027 0.992783 -0.835742
4320 Zelda Gary 996 973 0.665933 -0.646899

[4321 rows x 6 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7f1e9d7ec7c0>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'a1d8cf96-eda6-4e6a-b090-537d987ca6eb', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/parquet"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
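Note: the `EndpointConnectionError` above simply means nothing is listening on the mock S3 endpoint that the `s3_base` fixture points at. A minimal sketch (not part of the test suite) for confirming that from the CI node, assuming the same 127.0.0.1:5000 address shown in the fixture values:

```python
# Minimal reachability probe for the mock S3 endpoint used by the s3_base
# fixture (http://127.0.0.1:5000/). Not part of the test suite; just a quick
# way to confirm whether moto_server ever came up.
import socket

def endpoint_is_up(host="127.0.0.1", port=5000, timeout=1.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(endpoint_is_up())  # False here, matching the "Connection refused" above
```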
---------------------------- Captured stderr setup -----------------------------
Traceback (most recent call last):
File "/usr/local/bin/moto_server", line 5, in
from moto.server import main
File "/usr/local/lib/python3.8/dist-packages/moto/server.py", line 7, in
from moto.moto_server.werkzeug_app import (
File "/usr/local/lib/python3.8/dist-packages/moto/moto_server/werkzeug_app.py", line 6, in
from flask import Flask
File "/usr/local/lib/python3.8/dist-packages/flask/init.py", line 4, in
from . import json as json
File "/usr/local/lib/python3.8/dist-packages/flask/json/init.py", line 8, in
from ..globals import current_app
File "/usr/local/lib/python3.8/dist-packages/flask/globals.py", line 56, in
app_ctx: "AppContext" = LocalProxy( # type: ignore[assignment]
TypeError: init() got an unexpected keyword argument 'unbound_message'
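The captured stderr shows why the endpoint was never reachable: `moto_server` dies on import because Flask's `globals.py` passes `unbound_message` to Werkzeug's `LocalProxy`, a keyword the installed Werkzeug does not accept. This looks like a Flask/Werkzeug version mismatch in the CI image (an assumption based on the traceback, not verified against release notes); a quick way to confirm the installed pair:

```python
# Quick check of the installed Flask/Werkzeug pair in the CI image. The
# 'unbound_message' keyword is only accepted by newer Werkzeug releases, so a
# Flask that passes it combined with an older Werkzeug raises the TypeError
# shown above. (Assumption based on the traceback; exact version cutoffs are
# not verified here.)
from importlib.metadata import version

print("flask   ", version("flask"))
print("werkzeug", version("werkzeug"))
# If these disagree (newer Flask, older Werkzeug), pinning them to a matching
# pair in the test image should let `moto_server` start again.
```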
_____________________________ test_s3_dataset[csv] _____________________________

self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
      conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

/usr/lib/python3/dist-packages/urllib3/connection.py:159:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
            sock.connect(sa)
            return sock

        except socket.error as e:
            err = e
            if sock is not None:
                sock.close()
                sock = None

    if err is not None:
      raise err

/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:


address = ('127.0.0.1', 5000), timeout = 60, source_address = None
socket_options = [(6, 1, 1)]

def create_connection(
    address,
    timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
    source_address=None,
    socket_options=None,
):
    """Connect to *address* and return the socket object.

    Convenience function.  Connect to *address* (a 2-tuple ``(host,
    port)``) and return the socket object.  Passing the optional
    *timeout* parameter will set the timeout on the socket instance
    before attempting to connect.  If no *timeout* is supplied, the
    global default timeout setting returned by :func:`getdefaulttimeout`
    is used.  If *source_address* is set it must be a tuple of (host, port)
    for the socket to bind as a source address before making the connection.
    An host of '' or port 0 tells the OS to use the default.
    """

    host, port = address
    if host.startswith("["):
        host = host.strip("[]")
    err = None

    # Using the value from allowed_gai_family() in the context of getaddrinfo lets
    # us select whether to work with IPv4 DNS records, IPv6 records, or both.
    # The original create_connection function always returns all records.
    family = allowed_gai_family()

    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        sock = None
        try:
            sock = socket.socket(af, socktype, proto)

            # If provided, set socket level options before connecting.
            _set_socket_options(sock, socket_options)

            if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT:
                sock.settimeout(timeout)
            if source_address:
                sock.bind(source_address)
          sock.connect(sa)

E ConnectionRefusedError: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError

During handling of the above exception, another exception occurred:

self = <botocore.httpsession.URLLib3Session object at 0x7f1e69705610>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
      urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e697ffe80>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f1f6c0eb220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e9d29c670>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
        httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

        # If we're going to release the connection in ``finally:``, then
        # the response doesn't need to know about the connection. Otherwise
        # it will also try to release it and we'll have a double-release
        # mess.
        response_conn = conn if not release_conn else None

        # Pass method to Response for length checking
        response_kw["request_method"] = method

        # Import httplib's response into our own wrapper object
        response = self.ResponseCls.from_httplib(
            httplib_response,
            pool=self,
            connection=response_conn,
            retries=retries,
            **response_kw
        )

        # Everything went great!
        clean_exit = True

    except queue.Empty:
        # Timed out by queue.
        raise EmptyPoolError(self, "No pool connections are available.")

    except (
        TimeoutError,
        HTTPException,
        SocketError,
        ProtocolError,
        BaseSSLError,
        SSLError,
        CertificateError,
    ) as e:
        # Discard the connection for these exceptions. It will be
        # replaced during the next _get_conn() call.
        clean_exit = False
        if isinstance(e, (BaseSSLError, CertificateError)):
            e = SSLError(e)
        elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy:
            e = ProxyError("Cannot connect to proxy.", e)
        elif isinstance(e, (SocketError, HTTPException)):
            e = ProtocolError("Connection aborted.", e)
      retries = retries.increment(
            method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:
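As the `urlopen` docstring above notes, passing `retries=False` disables retries and re-raises the underlying error immediately, which is why botocore's `retries=Retry(False)` turns the refused socket into a `NewConnectionError` on the very first attempt. A small, self-contained illustration of that behaviour (hypothetical snippet, not from the test suite; nothing needs to be running on the port):

```python
# With retries disabled, urllib3 re-raises the connection error instead of
# retrying or wrapping it in MaxRetryError. The URL mirrors the mock S3
# address used by the fixtures above.
import urllib3

http = urllib3.PoolManager()
try:
    http.request("PUT", "http://127.0.0.1:5000/csv", retries=False)
except urllib3.exceptions.HTTPError as exc:
    # NewConnectionError propagates here unretried.
    print("raised on first attempt:", exc)
```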


self = Retry(total=False, connect=None, read=None, redirect=0, status=None)
method = 'PUT', url = '/csv', response = None
error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>: Failed to establish a new connection: [Errno 111] Connection refused')
_pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e697ffe80>
_stacktrace = <traceback object at 0x7f1e6bb99e80>

def increment(
    self,
    method=None,
    url=None,
    response=None,
    error=None,
    _pool=None,
    _stacktrace=None,
):
    """ Return a new Retry object with incremented retry counters.

    :param response: A response object, or None, if the server did not
        return a response.
    :type response: :class:`~urllib3.response.HTTPResponse`
    :param Exception error: An error encountered during the request, or
        None if the response was received successfully.

    :return: A new ``Retry`` object.
    """
    if self.total is False and error:
        # Disabled, indicate to re-raise the error.
      raise six.reraise(type(error), error, _stacktrace)

/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:


tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None

def reraise(tp, value, tb=None):
    try:
        if value is None:
            value = tp()
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
      raise value

../../../.local/lib/python3.8/site-packages/six.py:703:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e697ffe80>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
retries = Retry(total=False, connect=None, read=None, redirect=0, status=None)
redirect = True, assert_same_host = False
timeout = <object object at 0x7f1f6c0eb220>, pool_timeout = None
release_conn = False, chunked = False, body_pos = None
response_kw = {'decode_content': False, 'preload_content': False}, conn = None
release_this_conn = True, err = None, clean_exit = False
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e9d29c670>
is_new_proxy_conn = False

def urlopen(
    self,
    method,
    url,
    body=None,
    headers=None,
    retries=None,
    redirect=True,
    assert_same_host=True,
    timeout=_Default,
    pool_timeout=None,
    release_conn=None,
    chunked=False,
    body_pos=None,
    **response_kw
):
    """
    Get a connection from the pool and perform an HTTP request. This is the
    lowest level call for making a request, so you'll need to specify all
    the raw details.

    .. note::

       More commonly, it's appropriate to use a convenience method provided
       by :class:`.RequestMethods`, such as :meth:`request`.

    .. note::

       `release_conn` will only behave as expected if
       `preload_content=False` because we want to make
       `preload_content=False` the default behaviour someday soon without
       breaking backwards compatibility.

    :param method:
        HTTP request method (such as GET, POST, PUT, etc.)

    :param body:
        Data to send in the request body (useful for creating
        POST requests, see HTTPConnectionPool.post_url for
        more convenience).

    :param headers:
        Dictionary of custom headers to send, such as User-Agent,
        If-None-Match, etc. If None, pool headers are used. If provided,
        these headers completely replace any pool-specific headers.

    :param retries:
        Configure the number of retries to allow before raising a
        :class:`~urllib3.exceptions.MaxRetryError` exception.

        Pass ``None`` to retry until you receive a response. Pass a
        :class:`~urllib3.util.retry.Retry` object for fine-grained control
        over different types of retries.
        Pass an integer number to retry connection errors that many times,
        but no other types of errors. Pass zero to never retry.

        If ``False``, then retries are disabled and any exception is raised
        immediately. Also, instead of raising a MaxRetryError on redirects,
        the redirect response will be returned.

    :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int.

    :param redirect:
        If True, automatically handle redirects (status codes 301, 302,
        303, 307, 308). Each redirect counts as a retry. Disabling retries
        will disable redirect, too.

    :param assert_same_host:
        If ``True``, will make sure that the host of the pool requests is
        consistent else will raise HostChangedError. When False, you can
        use the pool on an HTTP proxy and request foreign hosts.

    :param timeout:
        If specified, overrides the default timeout for this one
        request. It may be a float (in seconds) or an instance of
        :class:`urllib3.util.Timeout`.

    :param pool_timeout:
        If set and the pool is set to block=True, then this method will
        block for ``pool_timeout`` seconds and raise EmptyPoolError if no
        connection is available within the time period.

    :param release_conn:
        If False, then the urlopen call will not release the connection
        back into the pool once a response is received (but will release if
        you read the entire contents of the response such as when
        `preload_content=True`). This is useful if you're not preloading
        the response's content immediately. You will need to call
        ``r.release_conn()`` on the response ``r`` to return the connection
        back into the pool. If None, it takes the value of
        ``response_kw.get('preload_content', True)``.

    :param chunked:
        If True, urllib3 will send the body using chunked transfer
        encoding. Otherwise, urllib3 will send the body using the standard
        content-length form. Defaults to False.

    :param int body_pos:
        Position to seek to in file-like body in the event of a retry or
        redirect. Typically this won't need to be set because urllib3 will
        auto-populate the value when needed.

    :param \\**response_kw:
        Additional parameters are passed to
        :meth:`urllib3.response.HTTPResponse.from_httplib`
    """
    if headers is None:
        headers = self.headers

    if not isinstance(retries, Retry):
        retries = Retry.from_int(retries, redirect=redirect, default=self.retries)

    if release_conn is None:
        release_conn = response_kw.get("preload_content", True)

    # Check host
    if assert_same_host and not self.is_same_host(url):
        raise HostChangedError(self, url, retries)

    # Ensure that the URL we're connecting to is properly encoded
    if url.startswith("/"):
        url = six.ensure_str(_encode_target(url))
    else:
        url = six.ensure_str(parse_url(url).url)

    conn = None

    # Track whether `conn` needs to be released before
    # returning/raising/recursing. Update this variable if necessary, and
    # leave `release_conn` constant throughout the function. That way, if
    # the function recurses, the original value of `release_conn` will be
    # passed down into the recursive call, and its value will be respected.
    #
    # See issue #651 [1] for details.
    #
    # [1] <https://github.com/urllib3/urllib3/issues/651>
    release_this_conn = release_conn

    # Merge the proxy headers. Only do this in HTTP. We have to copy the
    # headers dict so we can safely change it without those changes being
    # reflected in anyone else's copy.
    if self.scheme == "http":
        headers = headers.copy()
        headers.update(self.proxy_headers)

    # Must keep the exception bound to a separate variable or else Python 3
    # complains about UnboundLocalError.
    err = None

    # Keep track of whether we cleanly exited the except block. This
    # ensures we do proper cleanup in finally.
    clean_exit = False

    # Rewind body position, if needed. Record current position
    # for future rewinds in the event of a redirect/retry.
    body_pos = set_file_position(body, body_pos)

    try:
        # Request a connection from the queue.
        timeout_obj = self._get_timeout(timeout)
        conn = self._get_conn(timeout=pool_timeout)

        conn.timeout = timeout_obj.connect_timeout

        is_new_proxy_conn = self.proxy is not None and not getattr(
            conn, "sock", None
        )
        if is_new_proxy_conn:
            self._prepare_proxy(conn)

        # Make the request on the httplib connection object.
      httplib_response = self._make_request(
            conn,
            method,
            url,
            timeout=timeout_obj,
            body=body,
            headers=headers,
            chunked=chunked,
        )

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:


self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f1e697ffe80>
conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
method = 'PUT', url = '/csv'
timeout = <urllib3.util.timeout.Timeout object at 0x7f1e9d29c670>
chunked = False
httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}}
timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f1e9d488760>

def _make_request(
    self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw
):
    """
    Perform a request on a given urllib connection object taken from our
    pool.

    :param conn:
        a connection from one of our connection pools

    :param timeout:
        Socket timeout in seconds for the request. This can be a
        float or integer, which will set the same timeout value for
        the socket connect and the socket read, or an instance of
        :class:`urllib3.util.Timeout`, which gives you more fine-grained
        control over your timeouts.
    """
    self.num_requests += 1

    timeout_obj = self._get_timeout(timeout)
    timeout_obj.start_connect()
    conn.timeout = timeout_obj.connect_timeout

    # Trigger any extra validation we need to do.
    try:
        self._validate_conn(conn)
    except (SocketTimeout, BaseSSLError) as e:
        # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout.
        self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
        raise

    # conn.request() calls httplib.*.request, not the method in
    # urllib3.request. It also calls makefile (recv) on the socket.
    if chunked:
        conn.request_chunked(method, url, **httplib_request_kw)
    else:
      conn.request(method, url, **httplib_request_kw)

/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}

def request(self, method, url, body=None, headers={}, *,
            encode_chunked=False):
    """Send a complete request to the server."""
  self._send_request(method, url, body, headers, encode_chunked)

/usr/lib/python3.8/http/client.py:1256:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
args = (False,), kwargs = {}

def _send_request(self, method, url, body, headers, *args, **kwargs):
    self._response_received = False
    if headers.get('Expect', b'') == b'100-continue':
        self._expect_header_set = True
    else:
        self._expect_header_set = False
        self.response_class = self._original_response_cls
  rval = super()._send_request(
        method, url, body, headers, *args, **kwargs
    )

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
method = 'PUT', url = '/csv', body = None
headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
encode_chunked = False

def _send_request(self, method, url, body, headers, encode_chunked):
    # Honor explicitly requested Host: and Accept-Encoding: headers.
    header_names = frozenset(k.lower() for k in headers)
    skips = {}
    if 'host' in header_names:
        skips['skip_host'] = 1
    if 'accept-encoding' in header_names:
        skips['skip_accept_encoding'] = 1

    self.putrequest(method, url, **skips)

    # chunked encoding will happen if HTTP/1.1 is used and either
    # the caller passes encode_chunked=True or the following
    # conditions hold:
    # 1. content-length has not been explicitly set
    # 2. the body is a file or iterable, but not a str or bytes-like
    # 3. Transfer-Encoding has NOT been explicitly set by the caller

    if 'content-length' not in header_names:
        # only chunk body if not explicitly set for backwards
        # compatibility, assuming the client code is already handling the
        # chunking
        if 'transfer-encoding' not in header_names:
            # if content-length cannot be automatically determined, fall
            # back to chunked encoding
            encode_chunked = False
            content_length = self._get_content_length(body, method)
            if content_length is None:
                if body is not None:
                    if self.debuglevel > 0:
                        print('Unable to determine size of %r' % body)
                    encode_chunked = True
                    self.putheader('Transfer-Encoding', 'chunked')
            else:
                self.putheader('Content-Length', str(content_length))
    else:
        encode_chunked = False

    for hdr, value in headers.items():
        self.putheader(hdr, value)
    if isinstance(body, str):
        # RFC 2616 Section 3.7.1 says that text default has a
        # default charset of iso-8859-1.
        body = _encode(body, 'body')
  self.endheaders(body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1302:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
message_body = None

def endheaders(self, message_body=None, *, encode_chunked=False):
    """Indicate that the last header line has been sent to the server.

    This method sends the request to the server.  The optional message_body
    argument can be used to pass a message body associated with the
    request.
    """
    if self.__state == _CS_REQ_STARTED:
        self.__state = _CS_REQ_SENT
    else:
        raise CannotSendHeader()
  self._send_output(message_body, encode_chunked=encode_chunked)

/usr/lib/python3.8/http/client.py:1251:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
message_body = None, args = (), kwargs = {'encode_chunked': False}
msg = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: de9dcb34-2b34-4b02-8870-9d8cd162c54c\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def _send_output(self, message_body=None, *args, **kwargs):
    self._buffer.extend((b"", b""))
    msg = self._convert_to_bytes(self._buffer)
    del self._buffer[:]
    # If msg and message_body are sent in a single send() call,
    # it will avoid performance problems caused by the interaction
    # between delayed ack and the Nagle algorithm.
    if isinstance(message_body, bytes):
        msg += message_body
        message_body = None
  self.send(msg)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
str = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: de9dcb34-2b34-4b02-8870-9d8cd162c54c\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, str):
    if self._response_received:
        logger.debug(
            "send() called, but reseponse already received. "
            "Not sending data."
        )
        return
  return super().send(str)

/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>
data = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: de9dcb34-2b34-4b02-8870-9d8cd162c54c\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'

def send(self, data):
    """Send `data' to the server.
    ``data`` can be a string object, a bytes object, an array object, a
    file-like object that supports a .read() method, or an iterable object.
    """

    if self.sock is None:
        if self.auto_open:
          self.connect()

/usr/lib/python3.8/http/client.py:951:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>

def connect(self):
  conn = self._new_conn()

/usr/lib/python3/dist-packages/urllib3/connection.py:187:


self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>

def _new_conn(self):
    """ Establish a socket connection and set nodelay settings on it.

    :return: New socket connection.
    """
    extra_kw = {}
    if self.source_address:
        extra_kw["source_address"] = self.source_address

    if self.socket_options:
        extra_kw["socket_options"] = self.socket_options

    try:
        conn = connection.create_connection(
            (self._dns_host, self.port), self.timeout, **extra_kw
        )

    except SocketTimeout:
        raise ConnectTimeoutError(
            self,
            "Connection to %s timed out. (connect timeout=%s)"
            % (self.host, self.timeout),
        )

    except SocketError as e:
      raise NewConnectionError(
            self, "Failed to establish a new connection: %s" % e
        )

E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f1e9d4885b0>: Failed to establish a new connection: [Errno 111] Connection refused

/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError

During handling of the above exception, another exception occurred:

s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-14/csv0/dataset-0.csv', '/tmp/pytest-of-jenkins/pytest-14/csv0/dataset-1.csv']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-14/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-14/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-14/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-14/parquet0')}
engine = 'csv'
df = name-string id label x y
0 Xavier 991 986 0.157298 -0.169087
1 Jerry ... Jerry 995 1027 0.992783 -0.835742
2160 Gary 996 973 0.665933 -0.646899

[4321 rows x 5 columns]
patch_aiobotocore = None

@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)

    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
  with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:

tests/unit/test_s3.py:97:


/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
return self.http_session.send(request)


self = <botocore.httpsession.URLLib3Session object at 0x7f1e69705610>
request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'de9dcb34-2b34-4b02-8870-9d8cd162c54c', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>

def send(self, request):
    try:
        proxy_url = self._proxy_config.proxy_url_for(request.url)
        manager = self._get_connection_manager(request.url, proxy_url)
        conn = manager.connection_from_url(request.url)
        self._setup_ssl_cert(conn, request.url, self._verify)
        if ensure_boolean(
            os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '')
        ):
            # This is currently an "experimental" feature which provides
            # no guarantees of backwards compatibility. It may be subject
            # to change or removal in any patch version. Anyone opting in
            # to this feature should strictly pin botocore.
            host = urlparse(request.url).hostname
            conn.proxy_headers['host'] = host

        request_target = self._get_request_target(request.url, proxy_url)
        urllib_response = conn.urlopen(
            method=request.method,
            url=request_target,
            body=request.body,
            headers=request.headers,
            retries=Retry(False),
            assert_same_host=False,
            preload_content=False,
            decode_content=False,
            chunked=self._chunked(request.headers),
        )

        http_response = botocore.awsrequest.AWSResponse(
            request.url,
            urllib_response.status,
            urllib_response.headers,
            urllib_response,
        )

        if not request.stream_output:
            # Cause the raw stream to be exhausted immediately. We do it
            # this way instead of using preload_content because
            # preload_content will never buffer chunked responses
            http_response.content

        return http_response
    except URLLib3SSLError as e:
        raise SSLError(endpoint_url=request.url, error=e)
    except (NewConnectionError, socket.gaierror) as e:
      raise EndpointConnectionError(endpoint_url=request.url, error=e)

E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/csv"

/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
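Both `test_s3_dataset` failures share the same root cause: the `moto_server` process crashed at startup (see the captured stderr above), so every request to 127.0.0.1:5000 is refused before the tests ever touch NVTabular code. A hypothetical local reproduction, assuming the installed moto exposes `ThreadedMotoServer` (the moto 3.x layout seen in the stderr), surfaces the import error directly instead of the downstream connection error:

```python
# Hypothetical local repro: starting the moto mock server in-process (instead
# of via the `moto_server` CLI the fixture uses) hits the same Flask/Werkzeug
# import path, so in this environment the import itself raises the TypeError
# shown in the captured stderr rather than a later "Connection refused".
from moto.server import ThreadedMotoServer  # fails here in this environment

server = ThreadedMotoServer(ip_address="127.0.0.1", port=5000)
server.start()
# ... run the S3 tests against http://127.0.0.1:5000/ ...
server.stop()
```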
_____________________ test_cpu_workflow[True-True-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_pa0')
df = name-cat name-string id label x y
0 Yvonne Xavier 991 986 0.157298 -0.169087
...ry 995 1027 0.992783 -0.835742
4320 Zelda Gary 996 973 0.665933 -0.646899

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1dd8781f70>, cpu = True
engine = 'parquet', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_pa0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_pa0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
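The ArrowInvalid above ("Parquet magic bytes not found in footer") means the part files written by to_parquet are empty or otherwise not valid Parquet. A small sketch, standard library only with an illustrative path, that checks for the PAR1 magic at both ends of a file and helps distinguish a truncated or zero-byte write from a file that simply isn't Parquet:

from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # first and last 4 bytes of a valid Parquet file

def looks_like_parquet(path):
    data = Path(path).read_bytes()  # fine for small test outputs
    if len(data) < 8:
        return False  # too small to hold header plus footer magic
    return data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC

# Hypothetical path from the failing test:
# looks_like_parquet("/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_pa0/part_0.parquet")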
_______________________ test_cpu_workflow[True-True-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs0')
df = name-string id label x y
0 Xavier 991 986 0.157298 -0.169087
1 Jerry ... Jerry 995 1027 0.992783 -0.835742
2160 Gary 996 973 0.665933 -0.646899

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1e14793ee0>, cpu = True
engine = 'csv', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
__________________ test_cpu_workflow[True-True-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs1')
df = name-string id label x y
0 Xavier 991 986 0.157298 -0.169087
1 Jerry ... Jerry 995 1027 0.992783 -0.835742
2160 Gary 996 973 0.665933 -0.646899

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1d2cf76fd0>, cpu = True
engine = 'csv-no-header', dump = True

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_True_cs1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
____________________ test_cpu_workflow[True-False-parquet] _____________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_p0')
df = name-cat name-string id label x y
0 Yvonne Xavier 991 986 0.157298 -0.169087
...ry 995 1027 0.992783 -0.835742
4320 Zelda Gary 996 973 0.665933 -0.646899

[4321 rows x 6 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1d2cf1c760>, cpu = True
engine = 'parquet', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_p0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_p0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
______________________ test_cpu_workflow[True-False-csv] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c0')
df = name-string id label x y
0 Xavier 991 986 0.157298 -0.169087
1 Jerry ... Jerry 995 1027 0.992783 -0.835742
2160 Gary 996 973 0.665933 -0.646899

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1e147b62e0>, cpu = True
engine = 'csv', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
_________________ test_cpu_workflow[True-False-csv-no-header] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c1')
df = name-string id label x y
0 Xavier 991 986 0.157298 -0.169087
1 Jerry ... Jerry 995 1027 0.992783 -0.835742
2160 Gary 996 973 0.665933 -0.646899

[4321 rows x 5 columns]
dataset = <merlin.io.dataset.Dataset object at 0x7f1dd0f569a0>, cpu = True
engine = 'csv-no-header', dump = False

@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"])
@pytest.mark.parametrize("dump", [True, False])
@pytest.mark.parametrize("cpu", [True])
def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump):
    # Make sure we are in cpu formats
    if cudf and isinstance(df, cudf.DataFrame):
        df = df.to_pandas()

    if cpu:
        dataset.to_cpu()

    cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"]
    cont_names = ["x", "y", "id"]
    label_name = ["label"]

    norms = ops.Normalize()
    conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms
    cats = cat_names >> ops.Categorify()
    workflow = nvt.Workflow(conts + cats + label_name)

    workflow.fit(dataset)
    if dump:
        workflow_dir = os.path.join(tmpdir, "workflow")
        workflow.save(workflow_dir)
        workflow = None

        workflow = Workflow.load(workflow_dir)

    def get_norms(tar: pd.Series):
        df = tar.fillna(0)
        df = df * (df >= 0).astype("int")
        return df

    assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4)
    assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3)
    assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3)

    # Check that categories match
    if engine == "parquet":
        cats_expected0 = df["name-cat"].unique()
        cats0 = get_cats(workflow, "name-cat", cpu=True)
        # adding the None entry as a string because of move from gpu
        assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist())
        assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None])
    cats_expected1 = df["name-string"].unique()
    cats1 = get_cats(workflow, "name-string", cpu=True)
    # adding the None entry as a string because of move from gpu
    assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist())
    assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None])

    # Write to new "shuffled" and "processed" dataset
    workflow.transform(dataset).to_parquet(
        output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION
    )
  dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)

tests/unit/workflow/test_cpu_workflow.py:76:


/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in __init__
self.engine = ParquetDatasetEngine(
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in __init__
self._path0,
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0
return next(self._dataset.get_fragments()).path
/usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset
dataset = pa_ds.dataset(paths, filesystem=fs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset
return _filesystem_dataset(source, **kwargs)
/usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset
return factory.finish(schema)
pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish
???
pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status
???


???
E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-14/test_cpu_workflow_True_False_c1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?

pyarrow/error.pxi:99: ArrowInvalid
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-None-0-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-None-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv] - py...
FAILED tests/unit/test_s3.py::test_s3_dataset[parquet] - botocore.exceptions....
FAILED tests/unit/test_s3.py::test_s3_dataset[csv] - botocore.exceptions.Endp...
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv-no-header]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv-no-header]
===== 12 failed, 1419 passed, 1 skipped, 617 warnings in 709.70s (0:11:49) =====
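To iterate on these locally without re-running the whole suite, pytest accepts the node IDs from the summary above directly; a usage sketch (Python, so it can live in a scratch script) that re-runs two of the failing cases:

import pytest

# Re-run a couple of the failing node IDs listed in the short test summary
pytest.main([
    "tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]",
    "tests/unit/test_s3.py::test_s3_dataset[parquet]",
    "-x", "-vv",
])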
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins9203679199988082363.sh

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1609 of commit 9df466c566c9f80b1282693baecbd07c6a2d6bb6, no merge conflicts.
Running as SYSTEM
Setting status of 9df466c566c9f80b1282693baecbd07c6a2d6bb6 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4626/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 9df466c566c9f80b1282693baecbd07c6a2d6bb6^{commit} # timeout=10
Checking out Revision 9df466c566c9f80b1282693baecbd07c6a2d6bb6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 8bd1260ba233898308f1416f79cefbd75013f4ff # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins15266300073057526636.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...F.. [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
____________________________ test_movielens_example ____________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-8/test_movielens_example0')

def test_movielens_example(tmpdir):
    _get_random_movielens_data(tmpdir, 10000, dataset="movie")
    _get_random_movielens_data(tmpdir, 10000, dataset="ratings")
    _get_random_movielens_data(tmpdir, 5000, dataset="ratings", valid=True)

    triton_model_path = os.path.join(tmpdir, "models")
    os.environ["INPUT_DATA_DIR"] = str(tmpdir)
    os.environ["MODEL_PATH"] = triton_model_path

    notebook_path = os.path.join(
        dirname(TEST_PATH),
        "examples/getting-started-movielens/",
        "02-ETL-with-NVTabular.ipynb",
    )
    _run_notebook(tmpdir, notebook_path)

    def _modify_tf_nb(line):
        return line.replace(
            # don't require graphviz/pydot
            "tf.keras.utils.plot_model(model)",
            "# tf.keras.utils.plot_model(model)",
        )

    def _modify_tf_triton(line):
        # models are already preloaded
        line = line.replace("triton_client.load_model", "# triton_client.load_model")
        line = line.replace("triton_client.unload_model", "# triton_client.unload_model")
        return line

    notebooks = []
    try:
        import torch  # noqa

        notebooks.append("03-Training-with-PyTorch.ipynb")
    except Exception:
        pass
    try:
        import nvtabular.inference.triton  # noqa
        import nvtabular.loader.tensorflow  # noqa

        notebooks.append("03-Training-with-TF.ipynb")
        has_tf = True

    except Exception:
        has_tf = False

    for notebook in notebooks:
        notebook_path = os.path.join(
            dirname(TEST_PATH),
            "examples/getting-started-movielens/",
            notebook,
        )
        if notebook == "03-Training-with-TF.ipynb":
            _run_notebook(tmpdir, notebook_path, transform=_modify_tf_nb)
        else:
            _run_notebook(tmpdir, notebook_path)

    # test out the TF inference movielens notebook if appropriate
    if has_tf and TRITON_SERVER_PATH:
        notebook = "04-Triton-Inference-with-TF.ipynb"
        notebook_path = os.path.join(
            dirname(TEST_PATH),
            "examples/getting-started-movielens/",
            notebook,
        )
        with run_triton_server(triton_model_path):
          _run_notebook(tmpdir, notebook_path, transform=_modify_tf_triton)

tests/unit/test_notebooks.py:224:


tests/unit/test_notebooks.py:307: in _run_notebook
subprocess.check_output([sys.executable, script_path])
/usr/lib/python3.8/subprocess.py:415: in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,


input = None, capture_output = False, timeout = None, check = True
popenargs = (['/usr/bin/python3', '/tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/notebook.py'],)
kwargs = {'stdout': -1}, process = <subprocess.Popen object at 0x7f8489dd5160>
stdout = b"client created.\nGET /v2/health/live, headers None\n<HTTPSocketPoolResponse status=400 headers={'content-length': '0', 'content-type': 'text/plain'}>\nPOST /v2/repository/index, headers None\n\n"
stderr = None, retcode = 1

def run(*popenargs,
        input=None, capture_output=False, timeout=None, check=False, **kwargs):
    """Run command with arguments and return a CompletedProcess instance.

    The returned instance will have attributes args, returncode, stdout and
    stderr. By default, stdout and stderr are not captured, and those attributes
    will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them.

    If check is True and the exit code was non-zero, it raises a
    CalledProcessError. The CalledProcessError object will have the return code
    in the returncode attribute, and output & stderr attributes if those streams
    were captured.

    If timeout is given, and the process takes too long, a TimeoutExpired
    exception will be raised.

    There is an optional argument "input", allowing you to
    pass bytes or a string to the subprocess's stdin.  If you use this argument
    you may not also use the Popen constructor's "stdin" argument, as
    it will be used internally.

    By default, all communication is in bytes, and therefore any "input" should
    be bytes, and the stdout and stderr will be bytes. If in text mode, any
    "input" should be a string, and stdout and stderr will be strings decoded
    according to locale encoding, or by "encoding" if set. Text mode is
    triggered by setting any of text, encoding, errors or universal_newlines.

    The other arguments are the same as for the Popen constructor.
    """
    if input is not None:
        if kwargs.get('stdin') is not None:
            raise ValueError('stdin and input arguments may not both be used.')
        kwargs['stdin'] = PIPE

    if capture_output:
        if kwargs.get('stdout') is not None or kwargs.get('stderr') is not None:
            raise ValueError('stdout and stderr arguments may not be used '
                             'with capture_output.')
        kwargs['stdout'] = PIPE
        kwargs['stderr'] = PIPE

    with Popen(*popenargs, **kwargs) as process:
        try:
            stdout, stderr = process.communicate(input, timeout=timeout)
        except TimeoutExpired as exc:
            process.kill()
            if _mswindows:
                # Windows accumulates the output in a single blocking
                # read() call run on child threads, with the timeout
                # being done in a join() on those threads.  communicate()
                # _after_ kill() is required to collect that and add it
                # to the exception.
                exc.stdout, exc.stderr = process.communicate()
            else:
                # POSIX _communicate already populated the output so
                # far into the TimeoutExpired exception.
                process.wait()
            raise
        except:  # Including KeyboardInterrupt, communicate handled that.
            process.kill()
            # We don't call process.wait() as .__exit__ does that for us.
            raise
        retcode = process.poll()
        if check and retcode:
          raise CalledProcessError(retcode, process.args,
                                     output=stdout, stderr=stderr)

E subprocess.CalledProcessError: Command '['/usr/bin/python3', '/tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/notebook.py']' returned non-zero exit status 1.

/usr/lib/python3.8/subprocess.py:516: CalledProcessError
----------------------------- Captured stderr call -----------------------------
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
2022-08-15 13:55:34.039352: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-15 13:55:35.023527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-08-15 13:55:35.024322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14532 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.11) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
WARNING:absl:Function _wrapped_model contains input name(s) movieId, userId with unsupported characters which will be renamed to movieid, userid in the SavedModel.
WARNING:absl:<nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures object at 0x7fa291fd2be0> has the same name 'DenseFeatures' as a built-in Keras object. Consider renaming <class 'nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures'> to avoid naming conflicts when loading with tf.keras.models.load_model. If renaming is not possible, pass the object in the custom_objects parameter of the load function.
WARNING:absl:Function _wrapped_model contains input name(s) movieId, userId with unsupported characters which will be renamed to movieid, userid in the SavedModel.
WARNING:absl:<nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures object at 0x7fa291fd2be0> has the same name 'DenseFeatures' as a built-in Keras object. Consider renaming <class 'nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures'> to avoid naming conflicts when loading with tf.keras.models.load_model. If renaming is not possible, pass the object in the custom_objects parameter of the load function.
I0815 13:55:43.015590 13149 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f151e000000' with size 268435456
I0815 13:55:43.016394 13149 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0815 13:55:43.019651 13149 model_repository_manager.cc:1191] loading: movielens_tf:1
I0815 13:55:43.119889 13149 model_repository_manager.cc:1191] loading: movielens_nvt:1
I0815 13:55:43.402054 13149 tensorflow.cc:2204] TRITONBACKEND_Initialize: tensorflow
I0815 13:55:43.402090 13149 tensorflow.cc:2214] Triton TRITONBACKEND API version: 1.10
I0815 13:55:43.402097 13149 tensorflow.cc:2220] 'tensorflow' TRITONBACKEND API version: 1.10
I0815 13:55:43.402103 13149 tensorflow.cc:2244] backend configuration:
{"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","version":"2","default-max-batch-size":"4"}}
I0815 13:55:43.402139 13149 tensorflow.cc:2310] TRITONBACKEND_ModelInitialize: movielens_tf (version 1)
I0815 13:55:43.406327 13149 backend.cc:46] TRITONBACKEND_Initialize: nvtabular
I0815 13:55:43.406368 13149 backend.cc:53] Triton TRITONBACKEND API version: 1.10
I0815 13:55:43.406385 13149 backend.cc:56] 'nvtabular' TRITONBACKEND API version: 1.10
I0815 13:55:43.406630 13149 backend.cc:76] Loaded libpython successfully
I0815 13:55:43.619111 13149 backend.cc:89] Python interpreter is initialized
I0815 13:55:43.619191 13149 tensorflow.cc:2359] TRITONBACKEND_ModelInstanceInitialize: movielens_tf (GPU device 0)
2022-08-15 13:55:44.030716: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/models/movielens_tf/1/model.savedmodel
2022-08-15 13:55:44.033884: I tensorflow/cc/saved_model/reader.cc:81] Reading meta graph with tags { serve }
2022-08-15 13:55:44.036136: I tensorflow/cc/saved_model/reader.cc:122] Reading SavedModel debug info (if present) from: /tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/models/movielens_tf/1/model.savedmodel
2022-08-15 13:55:44.036262: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-15 13:55:44.075003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 11486 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-08-15 13:55:44.105698: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-08-15 13:55:44.107504: I tensorflow/cc/saved_model/loader.cc:230] Restoring SavedModel bundle.
2022-08-15 13:55:44.157853: I tensorflow/cc/saved_model/loader.cc:214] Running initialization op on SavedModel bundle at path: /tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/models/movielens_tf/1/model.savedmodel
2022-08-15 13:55:44.184293: I tensorflow/cc/saved_model/loader.cc:321] SavedModel load for tags { serve }; Status: success: OK. Took 153598 microseconds.
I0815 13:55:44.184515 13149 model_repository_manager.cc:1345] successfully loaded 'movielens_tf' version 1
I0815 13:55:44.185598 13149 model_inst_state.hpp:58] Loading TritonPythonModel from module 'nvtabular.inference.triton.workflow_model'
I0815 13:55:47.035470 13149 model_repository_manager.cc:1345] successfully loaded 'movielens_nvt' version 1
I0815 13:55:47.035885 13149 model_repository_manager.cc:1191] loading: movielens:1
I0815 13:55:47.136385 13149 model_repository_manager.cc:1345] successfully loaded 'movielens' version 1
I0815 13:55:47.136538 13149 server.cc:556]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0815 13:55:47.136652 13149 server.cc:583]
+------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorflow | /opt/tritonserver/backends/tensorflow2/libtriton_tensorflow2.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","version":"2","default-max-batch-size":"4"}} |
| nvtabular | /opt/tritonserver/backends/nvtabular/libtriton_nvtabular.so | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
+------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0815 13:55:47.136769 13149 server.cc:626]
+---------------+---------+--------+
| Model | Version | Status |
+---------------+---------+--------+
| movielens | 1 | READY |
| movielens_nvt | 1 | READY |
| movielens_tf | 1 | READY |
+---------------+---------+--------+

I0815 13:55:47.195489 13149 metrics.cc:650] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB
I0815 13:55:47.196424 13149 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.23.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/models |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

E0815 13:55:47.197036931 13149 server_chttp2.cc:40] {"created":"@1660571747.196983913","description":"No address added out of total 1 resolved","file":"/tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1660571747.196982030","description":"Failed to add any wildcard listeners","file":"/tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1660571747.196960677","description":"Address family not supported by protocol","errno":97,"file":"/tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":395,"os_error":"Address family not supported by protocol","syscall":"socket","target_address":"[::]:8001"},{"created":"@1660571747.196981667","description":"Unable to configure socket","fd":43,"file":"/tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":207,"referenced_errors":[{"created":"@1660571747.196978868","description":"Address already in use","errno":98,"file":"/tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]}]}]}
E0815 13:55:47.197111 13149 main.cc:825] failed to start GRPC service: Unavailable - Socket '0.0.0.0:8001' already in use
W0815 13:55:48.222600 13149 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
/usr/local/lib/python3.8/dist-packages/tritonhttpclient/__init__.py:31: DeprecationWarning: The package tritonhttpclient is deprecated and will be removed in a future version. Please use instead tritonclient.http
warnings.warn(
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 163, in get_socket
return self._socket_queue.get(block=False)
File "src/gevent/queue.py", line 335, in gevent._gevent_cqueue.Queue.get
File "src/gevent/queue.py", line 350, in gevent._gevent_cqueue.Queue.get
File "src/gevent/queue.py", line 319, in gevent._gevent_cqueue.Queue._Queue__get_or_peek

_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/tmp/pytest-of-jenkins/pytest-8/test_movielens_example0/notebook.py", line 43, in
triton_client.get_model_repository_index()
File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/init.py", line 619, in get_model_repository_index
response = self._post(request_uri=request_uri,
File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/init.py", line 313, in _post
response = self._client_stub.post(request_uri=request_uri,
File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 272, in post
return self.request(METHOD_POST, request_uri, body=body, headers=headers)
File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 226, in request
sock = self._connection_pool.get_socket()
File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 166, in get_socket
return self._create_socket()
File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 127, in _create_socket
raise first_error
File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 114, in _create_socket
sock = self._connect_socket(sock, sock_info[-1])
File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/connectionpool.py", line 136, in _connect_socket
sock.connect(address)
File "/usr/local/lib/python3.8/dist-packages/gevent/_socketcommon.py", line 607, in connect
raise _SocketError(err, strerror(err))
ConnectionRefusedError: [Errno 111] Connection refused
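
The failure above appears to chain directly from the startup error at the top of this log: GRPC port 8001 was already bound ("Socket '0.0.0.0:8001' already in use"), so tritonserver exited before the notebook reached get_model_repository_index(), and the HTTP client surfaced a bare ConnectionRefusedError. As an illustrative sketch only (not code from this PR or from the notebook), a bounded readiness poll would turn that into a clearer timeout; the wait_for_triton helper and its defaults are hypothetical, while InferenceServerClient, is_server_ready(), and get_model_repository_index() are the tritonclient.http calls already used above.

import time

import tritonclient.http as httpclient


def wait_for_triton(url="localhost:8000", retries=60, delay=1.0):
    # Poll Triton's HTTP readiness endpoint before issuing any real requests.
    client = httpclient.InferenceServerClient(url=url)
    for _ in range(retries):
        try:
            if client.is_server_ready():
                return client
        except Exception:
            # Server is not accepting connections yet, or it already exited.
            pass
        time.sleep(delay)
    raise RuntimeError(f"Triton at {url} did not become ready in time")


client = wait_for_triton()
print(client.get_model_repository_index())
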
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_notebooks.py::test_movielens_example - subprocess.Call...
===== 1 failed, 1428 passed, 2 skipped, 618 warnings in 764.65s (0:12:44) ======
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins14802151835763005993.sh

@karlhigley
Contributor Author

rerun tests

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1609 of commit 9df466c566c9f80b1282693baecbd07c6a2d6bb6, no merge conflicts.
Running as SYSTEM
Setting status of 9df466c566c9f80b1282693baecbd07c6a2d6bb6 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4627/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 9df466c566c9f80b1282693baecbd07c6a2d6bb6^{commit} # timeout=10
Checking out Revision 9df466c566c9f80b1282693baecbd07c6a2d6bb6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins5374540968505043348.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ..............................FF [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=================================== FAILURES ===================================
_________________________ test_groupby_model[pytorch] __________________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-13/test_groupby_model_pytorch_0')
output_model = 'pytorch'

@pytest.mark.skipif(TRITON_SERVER_PATH is None, reason="Requires tritonserver on the path")
@pytest.mark.parametrize("output_model", ["tensorflow", "pytorch"])
def test_groupby_model(tmpdir, output_model):
    size = 20
    df = make_df(
        {
            "id": np.random.choice([0, 1], size=size),
            "ts": np.linspace(0.0, 10.0, num=size),
            "x": np.arange(size),
            "y": np.linspace(0.0, 10.0, num=size),
        }
    )

    groupby_features = ColumnSelector(["id", "ts", "x", "y"]) >> ops.Groupby(
        groupby_cols=["id"],
        sort_cols=["ts"],
        aggs={
            "x": ["sum"],
            "y": ["first"],
        },
        name_sep="-",
    )
    workflow = nvt.Workflow(groupby_features)
  _verify_workflow_on_tritonserver(
        tmpdir, workflow, df, "groupby", output_model, cats=["id", "y-first"], conts=["x-sum"]
    )

tests/unit/test_triton_inference.py:379:


tests/unit/test_triton_inference.py:112: in _verify_workflow_on_tritonserver
response = client.infer(model_name, inputs, outputs=outputs)
/usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py:1322: in infer
raise_error_grpc(rpc_error)


rpc_error = <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_err....0.0.1:8001","file":"src/core/lib/surface/call.cc","file_line":1069,"grpc_message":"Socket closed","grpc_status":14}"

def raise_error_grpc(rpc_error):
  raise get_error_grpc(rpc_error) from None

E tritonclient.utils.InferenceServerException: [StatusCode.UNAVAILABLE] Socket closed

/usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py:62: InferenceServerException
----------------------------- Captured stderr call -----------------------------
I0815 14:14:56.462962 26696 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f3044000000' with size 268435456
I0815 14:14:56.463717 26696 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0815 14:14:56.466128 26696 model_repository_manager.cc:1191] loading: groupby:1
I0815 14:14:56.573468 26696 python_be.cc:1774] TRITONBACKEND_ModelInstanceInitialize: groupby (GPU device 0)
I0815 14:14:58.798412 26696 model_repository_manager.cc:1345] successfully loaded 'groupby' version 1
I0815 14:14:58.798619 26696 server.cc:556]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0815 14:14:58.798723 26696 server.cc:583]
+---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
+---------+-------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0815 14:14:58.798770 26696 server.cc:626]
+---------+---------+--------+
| Model | Version | Status |
+---------+---------+--------+
| groupby | 1 | READY |
+---------+---------+--------+

I0815 14:14:58.863077 26696 metrics.cc:650] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB
I0815 14:14:58.863960 26696 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.23.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-13/test_groupby_model_pytorch_0 |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0815 14:14:58.864892 26696 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
I0815 14:14:58.865437 26696 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
I0815 14:14:58.906788 26696 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
W0815 14:14:59.883731 26696 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
Signal (11) received.
0# 0x000055B88900E699 in /opt/tritonserver/bin/tritonserver
1# 0x00007F308B67C090 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# 0x00007F30811D68C2 in /opt/tritonserver/backends/python/libtriton_python.so
3# 0x00007F30811A2F10 in /opt/tritonserver/backends/python/libtriton_python.so
4# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/python/libtriton_python.so
5# 0x00007F308BF2C5CA in /opt/tritonserver/lib/libtritonserver.so
6# 0x00007F308BF2CCF7 in /opt/tritonserver/lib/libtritonserver.so
7# 0x00007F308BFECE11 in /opt/tritonserver/lib/libtritonserver.so
8# 0x00007F308BF26C47 in /opt/tritonserver/lib/libtritonserver.so
9# 0x00007F308BA6BDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
10# 0x00007F308CC7C609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
11# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

______________________ test_seq_etl_tf_model[tensorflow] _______________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-13/test_seq_etl_tf_model_tensorfl0')
output_model = 'tensorflow'

@pytest.mark.skipif(TRITON_SERVER_PATH is None, reason="Requires tritonserver on the path")
@pytest.mark.parametrize("output_model", ["tensorflow"])
def test_seq_etl_tf_model(tmpdir, output_model):
    size = 100
    max_length = 10
    df = make_df(
        {
            "id": np.random.choice([0, 1], size=size),
            "item_id": np.random.randint(1, 10, size),
            "ts": np.linspace(0.0, 10.0, num=size).astype(np.float32),
            "y": np.linspace(0.0, 10.0, num=size).astype(np.float32),
        }
    )

    groupby_features = ColumnSelector(["id", "item_id", "ts", "y"]) >> ops.Groupby(
        groupby_cols=["id"],
        sort_cols=["ts"],
        aggs={
            "item_id": ["list"],
            "y": ["list"],
        },
        name_sep="-",
    )
    feats_list = groupby_features["item_id-list", "y-list"]
    feats_trim = feats_list >> ops.ListSlice(0, max_length, pad=True)
    selected_features = groupby_features["id"] + feats_trim

    workflow = nvt.Workflow(selected_features)

    sparse_max = {"item_id-list": max_length, "y-list": max_length}
  _verify_workflow_on_tritonserver(
        tmpdir,
        workflow,
        df,
        "groupby",
        output_model,
        sparse_max,
        cats=["id", "item_id-list"],
        conts=["y-list"],
    )

tests/unit/test_triton_inference.py:415:


tests/unit/test_triton_inference.py:111: in _verify_workflow_on_tritonserver
with run_triton_server(tmpdir) as client:
/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)


modelpath = local('/tmp/pytest-of-jenkins/pytest-13/test_seq_etl_tf_model_tensorfl0')

@contextlib.contextmanager
def run_triton_server(modelpath):
    cmdline = [
        TRITON_SERVER_PATH,
        "--model-repository",
        modelpath,
        "--backend-config=tensorflow,version=2",
    ]
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "0"
    with subprocess.Popen(cmdline, env=env) as process:
        try:
            with grpcclient.InferenceServerClient("localhost:8001") as client:
                # wait until server is ready
                for _ in range(60):
                    if process.poll() is not None:
                        retcode = process.returncode
                        raise RuntimeError(f"Tritonserver failed to start (ret={retcode})")

                    try:
                        ready = client.is_server_ready()
                    except tritonclient.utils.InferenceServerException:
                        ready = False

                    if ready:
                        yield client
                        return

                    time.sleep(1)
              raise RuntimeError("Timed out waiting for tritonserver to become ready")

E RuntimeError: Timed out waiting for tritonserver to become ready

tests/unit/test_triton_inference.py:62: RuntimeError
----------------------------- Captured stderr call -----------------------------
0815 14:15:00.865426 26705 pb_stub.cc:1006] Non-graceful termination detected.
I0815 14:15:01.120920 26916 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f6136000000' with size 268435456
I0815 14:15:01.121618 26916 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0815 14:15:01.123866 26916 model_repository_manager.cc:1191] loading: groupby:1
I0815 14:15:01.231143 26916 python_be.cc:1774] TRITONBACKEND_ModelInstanceInitialize: groupby (GPU device 0)
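
The timeout above follows a non-graceful termination of the previous test's server (signal 11, "Non-graceful termination detected"), and an earlier log in this thread shows the related failure mode of a lingering server holding the fixed default port ("Socket '0.0.0.0:8001' already in use"). As a hedged sketch, not part of this PR or the test suite, a harness could start tritonserver on explicit ports so runs cannot collide; start_triton and the chosen port numbers are hypothetical, while --model-repository, --http-port, --grpc-port, --metrics-port, and --backend-config are standard tritonserver flags.

import os
import subprocess


def start_triton(model_repository, http_port=18000, grpc_port=18001, metrics_port=18002):
    # Launch tritonserver on explicit ports so a concurrent or leftover server
    # cannot hold 8000/8001/8002 out from under this run.
    cmdline = [
        "tritonserver",
        "--model-repository", str(model_repository),
        "--http-port", str(http_port),
        "--grpc-port", str(grpc_port),
        "--metrics-port", str(metrics_port),
        "--backend-config=tensorflow,version=2",
    ]
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "0"  # pin to one GPU, as the existing run_triton_server helper does
    return subprocess.Popen(cmdline, env=env)
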
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_triton_inference.py::test_groupby_model[pytorch] - tri...
FAILED tests/unit/test_triton_inference.py::test_seq_etl_tf_model[tensorflow]
===== 2 failed, 1427 passed, 2 skipped, 618 warnings in 803.76s (0:13:23) ======
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins9494905553368145523.sh

@karlhigley
Contributor Author

rerun tests

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1609 of commit 9df466c566c9f80b1282693baecbd07c6a2d6bb6, no merge conflicts.
Running as SYSTEM
Setting status of 9df466c566c9f80b1282693baecbd07c6a2d6bb6 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4628/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 9df466c566c9f80b1282693baecbd07c6a2d6bb6^{commit} # timeout=10
Checking out Revision 9df466c566c9f80b1282693baecbd07c6a2d6bb6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins16619067910149829981.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...F
Build was aborted
Aborted by admin
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins15746326830681704448.sh

@karlhigley
Contributor Author

rerun tests

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1609 of commit 9df466c566c9f80b1282693baecbd07c6a2d6bb6, no merge conflicts.
Running as SYSTEM
Setting status of 9df466c566c9f80b1282693baecbd07c6a2d6bb6 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4632/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 9df466c566c9f80b1282693baecbd07c6a2d6bb6^{commit} # timeout=10
Checking out Revision 9df466c566c9f80b1282693baecbd07c6a2d6bb6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10
First time build. Skipping changelog.
[nvtabular_tests] $ /bin/bash /tmp/jenkins3170339596225298332.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ......FFF....................
Build was aborted
Aborted by admin
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins13035443765677779958.sh

@karlhigley
Contributor Author

The tests for this keep hanging on the multi-GPU Jenkins machine. Not sure if it's an issue with this PR specifically, or NVTabular PRs in general...

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #1609 of commit 35f7c158c6023ef878644de0b65dbdfa3d28b609, no merge conflicts.
Running as SYSTEM
Setting status of 35f7c158c6023ef878644de0b65dbdfa3d28b609 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4633/ and message: 'Build started for merge commit.'
Using context: Jenkins Unit Test Run
Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests
using credential nvidia-merlin-bot
Cloning the remote Git repository
Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git
 > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
 > git --version # timeout=10
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git
using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1609/*:refs/remotes/origin/pr/1609/* # timeout=10
 > git rev-parse 35f7c158c6023ef878644de0b65dbdfa3d28b609^{commit} # timeout=10
Checking out Revision 35f7c158c6023ef878644de0b65dbdfa3d28b609 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 35f7c158c6023ef878644de0b65dbdfa3d28b609 # timeout=10
Commit message: "Merge branch 'main' into refactor/decouple-dask"
 > git rev-list --no-walk 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10
[nvtabular_tests] $ /bin/bash /tmp/jenkins5945207459896974934.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1430 items / 1 skipped

tests/unit/test_dask_nvt.py ............................................ [ 3%]
........................................................................ [ 8%]
.... [ 8%]
tests/unit/test_notebooks.py ...... [ 8%]
tests/unit/test_tf4rec.py . [ 8%]
tests/unit/test_tools.py ...................... [ 10%]
tests/unit/test_triton_inference.py ................................ [ 12%]
tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%]
tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%]
................................................... [ 18%]
tests/unit/framework_utils/test_torch_layers.py . [ 18%]
tests/unit/loader/test_dataloader_backend.py ...... [ 18%]
tests/unit/loader/test_tf_dataloader.py ................................ [ 20%]
........................................s.. [ 23%]
tests/unit/loader/test_torch_dataloader.py ............................. [ 25%]
...................................................... [ 29%]
tests/unit/ops/test_categorify.py ...................................... [ 32%]
........................................................................ [ 37%]
........................................... [ 40%]
tests/unit/ops/test_column_similarity.py ........................ [ 42%]
tests/unit/ops/test_drop_low_cardinality.py .. [ 42%]
tests/unit/ops/test_fill.py ............................................ [ 45%]
........ [ 45%]
tests/unit/ops/test_groupyby.py ..................... [ 47%]
tests/unit/ops/test_hash_bucket.py ......................... [ 49%]
tests/unit/ops/test_join.py ............................................ [ 52%]
........................................................................ [ 57%]
.................................. [ 59%]
tests/unit/ops/test_lambda.py .......... [ 60%]
tests/unit/ops/test_normalize.py ....................................... [ 63%]
.. [ 63%]
tests/unit/ops/test_ops.py ............................................. [ 66%]
.................... [ 67%]
tests/unit/ops/test_ops_schema.py ...................................... [ 70%]
........................................................................ [ 75%]
........................................................................ [ 80%]
........................................................................ [ 85%]
....................................... [ 88%]
tests/unit/ops/test_reduce_dtype_size.py .. [ 88%]
tests/unit/ops/test_target_encode.py ..................... [ 89%]
tests/unit/workflow/test_cpu_workflow.py ...... [ 90%]
tests/unit/workflow/test_workflow.py ................................... [ 92%]
.......................................................... [ 96%]
tests/unit/workflow/test_workflow_chaining.py ... [ 96%]
tests/unit/workflow/test_workflow_node.py ........... [ 97%]
tests/unit/workflow/test_workflow_ops.py ... [ 97%]
tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%]
... [100%]

=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

nvtabular/loader/__init__.py:19
/var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader.
warnings.warn(

tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1]
/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first
self.make_current()

tests/unit/test_dask_nvt.py: 1 warning
tests/unit/test_tf4rec.py: 1 warning
tests/unit/test_tools.py: 5 warnings
tests/unit/test_triton_inference.py: 8 warnings
tests/unit/loader/test_dataloader_backend.py: 6 warnings
tests/unit/loader/test_tf_dataloader.py: 66 warnings
tests/unit/loader/test_torch_dataloader.py: 67 warnings
tests/unit/ops/test_categorify.py: 69 warnings
tests/unit/ops/test_drop_low_cardinality.py: 2 warnings
tests/unit/ops/test_fill.py: 8 warnings
tests/unit/ops/test_hash_bucket.py: 4 warnings
tests/unit/ops/test_join.py: 88 warnings
tests/unit/ops/test_lambda.py: 1 warning
tests/unit/ops/test_normalize.py: 9 warnings
tests/unit/ops/test_ops.py: 11 warnings
tests/unit/ops/test_ops_schema.py: 17 warnings
tests/unit/workflow/test_workflow.py: 27 warnings
tests/unit/workflow/test_workflow_chaining.py: 1 warning
tests/unit/workflow/test_workflow_node.py: 1 warning
tests/unit/workflow/test_workflow_schemas.py: 1 warning
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/test_dask_nvt.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files.
warnings.warn(

tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers
/usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters.
warnings.warn(

tests/unit/test_notebooks.py: 1 warning
tests/unit/test_tools.py: 17 warnings
tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 54 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future
warnings.warn(

tests/unit/loader/test_tf_dataloader.py: 2 warnings
tests/unit/loader/test_torch_dataloader.py: 12 warnings
tests/unit/workflow/test_workflow.py: 9 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files.
warnings.warn(

tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet]
tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet]
tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True]
/usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)

tests/unit/workflow/test_cpu_workflow.py: 6 warnings
tests/unit/workflow/test_workflow.py: 12 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files.
warnings.warn(

tests/unit/workflow/test_workflow.py: 48 warnings
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files.
warnings.warn(

tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_parquet_output[True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION]
tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None]
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========== 1429 passed, 2 skipped, 618 warnings in 699.38s (0:11:39) ===========
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins3684335047674136531.sh

@karlhigley karlhigley merged commit aa1240e into NVIDIA-Merlin:main Aug 15, 2022