Remove support for Dask and Pyspark Dataframes (#1857)
* initial cleanup

* mass cleanup

* getting close

* add pyarrow to docs reqs

* update release notes

* fix pr number

* update test names

* remove more unused code

* fix release notes
thehomebrewnerd authored May 8, 2024
1 parent ae67c86 commit 762e08f
Showing 62 changed files with 620 additions and 3,569 deletions.
6 changes: 2 additions & 4 deletions .github/workflows/build_docs.yaml
@@ -7,15 +7,14 @@ on:
- main
env:
PYARROW_IGNORE_TIMEZONE: 1
JAVA_HOME: "/usr/lib/jvm/java-11-openjdk-amd64"
ALTERYX_OPEN_SRC_UPDATE_CHECKER: False
jobs:
build_docs:
name: 3.9 build docs
name: ${{ matrix.python_version }} Build Docs
runs-on: ubuntu-latest
strategy:
matrix:
python_version: ["3.9"]
python_version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- name: Checkout repository
uses: actions/checkout@v3
@@ -37,7 +36,6 @@ jobs:
run: |
sudo apt update
sudo apt install -y pandoc
sudo apt install -y openjdk-11-jre-headless
python -m pip install --upgrade pip
- name: Install woodwork with doc dependencies (not using cache)
if: steps.cache.outputs.cache-hit != 'true'
3 changes: 0 additions & 3 deletions .github/workflows/install_test.yaml
@@ -51,6 +51,3 @@ jobs:
- name: Check package conflicts
run: |
python -m pip check
- name: Verify extra_requires commands
run: |
python -m pip install "unpacked_sdist/[dask,spark]"
19 changes: 0 additions & 19 deletions .github/workflows/latest_dependency_checker.yaml
@@ -27,25 +27,6 @@ jobs:
python -m pip install .[test]
make checkdeps OUTPUT_FILEPATH=woodwork/tests/requirement_files/latest_core_dependencies.txt
cat woodwork/tests/requirement_files/latest_core_dependencies.txt
- name: Update latest spark dependencies
run: |
python -m virtualenv venv_spark
source venv_spark/bin/activate
python -m pip install --upgrade pip
python -m pip install .[spark,test]
make checkdeps OUTPUT_FILEPATH=woodwork/tests/requirement_files/latest_spark_dependencies.txt
cat woodwork/tests/requirement_files/latest_spark_dependencies.txt
- name: Update latest dask dependencies
run: |
python -m virtualenv venv_dask
source venv_dask/bin/activate
python -m pip install --upgrade pip
python -m pip install .[test,dask]
make checkdeps OUTPUT_FILEPATH=woodwork/tests/requirement_files/latest_dask_dependencies.txt
cat woodwork/tests/requirement_files/latest_dask_dependencies.txt
python -m pip install .[dev]
make lint-fix
pre-commit autoupdate
- name: Create Pull Request
uses: peter-evans/create-pull-request@v3
with:
16 changes: 0 additions & 16 deletions .github/workflows/minimum_dependency_checker.yaml
@@ -29,22 +29,6 @@ jobs:
paths: 'pyproject.toml'
options: 'dependencies'
output_filepath: 'woodwork/tests/requirement_files/minimum_core_requirements.txt'
- name: Run min dep generator - core + spark reqs
id: min_dep_gen_spark
uses: alteryx/minimum-dependency-generator@v3
with:
paths: 'pyproject.toml'
options: 'dependencies'
extras_require: 'spark'
output_filepath: 'woodwork/tests/requirement_files/minimum_spark_requirements.txt'
- name: Run min dep generator - core + dask
id: min_dep_gen_dask
uses: alteryx/minimum-dependency-generator@v3
with:
paths: 'pyproject.toml'
options: 'dependencies'
extras_require: 'dask'
output_filepath: 'woodwork/tests/requirement_files/minimum_dask_requirements.txt'
- name: Create Pull Request
uses: peter-evans/create-pull-request@v3
with:
61 changes: 8 additions & 53 deletions .github/workflows/tests_with_latest_deps.yaml
@@ -8,17 +8,16 @@ on:
workflow_dispatch:
env:
PYARROW_IGNORE_TIMEZONE: 1
JAVA_HOME: "/usr/lib/jvm/java-11-openjdk-amd64"
ALTERYX_OPEN_SRC_UPDATE_CHECKER: False
jobs:
unit_latest_tests:
name: ${{ matrix.python_version }} ${{ matrix.directories }} unit tests
name: ${{ matrix.python_version }} ${{ matrix.directories }} Unit Tests
runs-on: ubuntu-latest
strategy:
fail-fast: true
matrix:
python_version: ["3.9", "3.10", "3.11", "3.12"]
directories: ["Core", "Dask/Spark - All Other Tests", "Dask/Spark - Testing Table Accessor", "Dask/Spark - Testing to Disk with LatLong", "Dask/Spark - All other Serialization"]
directories: ["Core"]
steps:
- name: Set up python ${{ matrix.python_version }}
uses: actions/setup-python@v4
@@ -38,61 +37,17 @@ jobs:
- name: Install woodwork with test requirements
run: |
python -m pip install -e unpacked_sdist/[test]
- if: ${{ startsWith(matrix.directories, 'Dask/Spark') }}
name: Install Dask and Spark Requirements
- if: ${{ matrix.python_version == 3.9 && matrix.directories == 'Core' }}
name: Run Unit Tests with core requirements with code coverage
run: |
sudo apt update
sudo apt install -y openjdk-11-jre-headless
python -m pip install unpacked_sdist/[spark]
python -m pip install unpacked_sdist/[dask]
cd unpacked_sdist
coverage erase
- if: ${{ matrix.python_version != 3.9 && matrix.directories == 'Dask/Spark - Testing to Disk with LatLong' }}
name: Run testing to Disk with LatLong Unit Tests (no code coverage)
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_serialization.py::test_to_disk_with_latlong -n 2 --durations 0
- if: ${{ matrix.python_version != 3.9 && matrix.directories == 'Dask/Spark - All other Serialization' }}
name: Run all other Serialization Unit Tests (no code coverage)
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_serialization.py --ignore=woodwork/tests/accessor/test_serialization.py::test_to_disk_with_latlong -n 2 --durations 0
- if: ${{ matrix.python_version != 3.9 && matrix.directories == 'Dask/Spark - Testing Table Accessor' }}
name: Run Table Accessor Unit Tests (no code coverage)
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_table_accessor.py -n 2 --durations 0
- if: ${{ matrix.python_version != 3.9 && matrix.directories == 'Dask/Spark - All Other Tests' }}
name: Run all other Unit Tests (no code coverage)
run: |
cd unpacked_sdist
pytest woodwork/ -n 2 --ignore=woodwork/tests/accessor/test_serialization.py --ignore=woodwork/tests/accessor/test_table_accessor.py --durations 0
- if: ${{ matrix.python_version == 3.9 && matrix.directories == 'Dask/Spark - Testing to Disk with LatLong' }}
name: Run Testing to Disk with LatLong Unit Tests with code coverage
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_serialization.py::test_to_disk_with_latlong -n 2 --durations 0 --cov=woodwork --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml
- if: ${{ matrix.python_version == 3.9 && matrix.directories == 'Dask/Spark - All other Serialization' }}
name: Run all other Serialization Unit Tests with code coverage
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_serialization.py --ignore=woodwork/tests/accessor/test_serialization.py::test_to_disk_with_latlong -n 2 --durations 0 --cov=woodwork --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml
- if: ${{ matrix.python_version == 3.9 && matrix.directories == 'Dask/Spark - Testing Table Accessor' }}
name: Run Table Accessor Unit Tests with code coverage
run: |
cd unpacked_sdist
pytest woodwork/tests/accessor/test_table_accessor.py -n 2 --durations 0 --cov=woodwork --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml
- if: ${{ matrix.python_version == 3.9 && matrix.directories == 'Dask/Spark - All Other Tests' }}
name: Run all other Unit Tests with code coverage
run: |
cd unpacked_sdist
pytest woodwork/ -n 2 --ignore=woodwork/tests/accessor/test_serialization.py --ignore=woodwork/tests/accessor/test_table_accessor.py --durations 0 --cov=woodwork --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml
- if: ${{ matrix.directories == 'Core' }}
name: Run Unit Tests with core requirements only (no code coverage)
pytest woodwork/ -n 2 --durations 0 --cov=woodwork --cov-config=../pyproject.toml --cov-report=xml:../coverage.xml
- if: ${{ matrix.python_version != 3.9 && matrix.directories == 'Core' }}
name: Run Unit Tests with core requirements without code coverage
run: |
cd unpacked_sdist
pytest woodwork/ -n 2
- if: ${{ matrix.python_version == 3.9 && matrix.directories != 'Core' }}
- if: ${{ matrix.python_version == 3.9 && matrix.directories == 'Core' }}
name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
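For reference, the surviving "Core" job above can be approximated locally with a short Python driver. This is a sketch only: it assumes the `[test]` extra (providing pytest, pytest-xdist, and pytest-cov) is installed and that it runs from the repository root rather than from `unpacked_sdist`, so the coverage paths drop the `../` prefix used in the workflow.

```python
# Local approximation of the "Core" CI job (assumptions: [test] extra is
# installed; run from the repository root, not unpacked_sdist).
import pytest

exit_code = pytest.main(
    [
        "woodwork/",
        "-n", "2",                       # parallel workers via pytest-xdist
        "--durations", "0",              # report slowest test durations
        "--cov=woodwork",                # coverage via pytest-cov
        "--cov-config=pyproject.toml",
        "--cov-report=xml:coverage.xml",
    ]
)
raise SystemExit(exit_code)
```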
12 changes: 2 additions & 10 deletions .github/workflows/tests_with_minimum_deps.yaml
@@ -6,12 +6,12 @@ on:
branches:
- main
jobs:
py38_unit_tests_minimum_dependencies:
py39_unit_tests_minimum_dependencies:
name: Tests - 3.9 Minimum Dependencies
runs-on: ubuntu-latest
strategy:
matrix:
libraries: ["core", "dask", "spark", "min_min"]
libraries: ["core"]
steps:
- name: Checkout repository
uses: actions/checkout@v3
Expand All @@ -26,14 +26,6 @@ jobs:
run: |
python -m pip install -e . --no-dependencies
python -m pip install -r woodwork/tests/requirement_files/minimum_test_requirements.txt
- if: ${{ matrix.libraries == 'spark' }}
name: Install woodwork - minimum spark, core requirements
run: |
python -m pip install -r woodwork/tests/requirement_files/minimum_spark_requirements.txt
- if: ${{ matrix.libraries == 'dask' }}
name: Install woodwork - minimum dask, core requirements
run: |
python -m pip install -r woodwork/tests/requirement_files/minimum_dask_requirements.txt
- if: ${{ matrix.libraries == 'core' }}
name: Install woodwork - minimum core requirements
run: |
5 changes: 0 additions & 5 deletions .readthedocs.yaml
@@ -17,11 +17,6 @@ build:
os: "ubuntu-22.04"
tools:
python: "3.9"
apt_packages:
- openjdk-11-jre-headless
jobs:
post_build:
- export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"

python:
install:
2 changes: 1 addition & 1 deletion Makefile
@@ -41,7 +41,7 @@ installdeps-test: upgradepip

.PHONY: checkdeps
checkdeps:
$(eval allow_list='numpy|pandas|scikit|click|pyarrow|distributed|dask|pyspark')
$(eval allow_list='numpy|pandas|scikit|click|pyarrow')
pip freeze | grep -v "woodwork.git" | grep -E $(allow_list) > $(OUTPUT_FILEPATH)

.PHONY: upgradepip
30 changes: 4 additions & 26 deletions contributing.md
@@ -24,40 +24,18 @@ Whether you are a novice or experienced software developer, all contributions an
make installdeps-dev
git checkout -b issue####-branch_name
```
* You will need to install Spark, Scala, and Pandoc to run all unit tests & build docs:

> If you do not install Spark/Scala, you can still run the unit tests (the Spark tests will be skipped).
* You will need to install Pandoc to run all unit tests & build docs:

> Pandoc is only needed to build the documentation locally.

**macOS (Intel)** (use [Homebrew](https://brew.sh/)):
```console
brew tap AdoptOpenJDK/openjdk
brew install --cask adoptopenjdk11
brew install scala apache-spark pandoc
echo 'export JAVA_HOME=$(/usr/libexec/java_home)' >> ~/.zshrc
echo 'export PATH="/usr/local/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
```
**macOS (M1)** (use [Homebrew](https://brew.sh/)):
**macOS** (use [Homebrew](https://brew.sh/)):
```console
brew install openjdk@11 scala apache-spark pandoc
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zshrc
echo 'export CPPFLAGS="-I/opt/homebrew/opt/openjdk@11/include:$CPPFLAGS"' >> ~/.zprofile
sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
brew install pandoc
```

**Ubuntu**:
```console
sudo apt install openjdk-11-jre openjdk-11-jdk scala pandoc -y
echo "export SPARK_HOME=/opt/spark" >> ~/.profile
echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" >> ~/.profile
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile
```

**Amazon Linux**:
```console
sudo amazon-linux-extras install java-openjdk11 scala -y
amazon-linux-extras enable java-openjdk11
sudo apt install pandoc -y
```

#### 2. Implement your Pull Request
4 changes: 1 addition & 3 deletions docs/source/guides/custom_types_and_type_inference.ipynb
@@ -53,7 +53,6 @@
" \"\"\"Represents Logical Types that contain 12-digit UPC Codes.\"\"\"\n",
"\n",
" primary_dtype = \"category\"\n",
" pyspark_dtype = \"string\"\n",
" standard_tags = {\"category\", \"upc_code\"}"
]
},
Expand All @@ -64,7 +63,6 @@
"When defining the `UPCCode` LogicalType class, three class attributes were set. All three of these attributes are optional, and will default to the values defined on the `LogicalType` class if they are not set when defining the new type.\n",
"\n",
"- `primary_dtype`: This value specifies how the data will be stored. If the column of the dataframe is not already of this type, Woodwork will convert the data to this dtype. This should be specified as a string that represents a valid pandas dtype. If not specified, this will default to `'string'`.\n",
"- `pyspark_dtype`: This value specifies the dtype to use if pyspark does not support the dtype specified by `primary_dtype`. In our example, we set this to `'string'` since Spark does not currently support the `'category'` dtype.\n",
"- `standard_tags`: This is a set of semantic tags to apply to any column that is set with the specified LogicalType. If not specified, `standard_tags` will default to an empty set.\n",
"- docstring: Adding a docstring for the class is optional, but if specified, this text will be used for adding a description of the type in the list of available types returned by `ww.list_logical_types()`."
]
@@ -214,7 +212,7 @@
" try:\n",
" series.astype(\"int\")\n",
" return True\n",
" except:\n",
" except Exception:\n",
" return False\n",
" return False"
]
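Putting the two hunks above together, a minimal sketch of the guide's custom logical type after this change looks roughly as follows. The `infer_upc_code` helper body outside the `try`/`except` shown in the diff, the example usage, and the `ww.type_system.add_type` registration call are assumptions based on the rest of the guide, not verbatim from this commit.

```python
# Sketch of the guide's custom LogicalType after removing pyspark_dtype and
# replacing the bare except with "except Exception". The helper body outside
# the try/except and the registration call are assumptions, not verbatim.
import pandas as pd
import woodwork as ww
from woodwork.logical_types import LogicalType


class UPCCode(LogicalType):
    """Represents Logical Types that contain 12-digit UPC Codes."""

    primary_dtype = "category"
    standard_tags = {"category", "upc_code"}


def infer_upc_code(series):
    """Return True if every value looks like a 12-digit numeric code."""
    series = series.astype("string")
    if all(series.str.len() == 12):
        try:
            series.astype("int")
            return True
        except Exception:
            return False
    return False


ww.type_system.add_type(UPCCode, inference_function=infer_upc_code)

# Example usage: initialize Woodwork on a pandas DataFrame and inspect the
# inferred logical types.
df = pd.DataFrame({"code": ["012345678901", "123456789012"]})
df.ww.init()
print(df.ww.types)
```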
1 change: 0 additions & 1 deletion docs/source/guides/guides_index.rst
@@ -10,6 +10,5 @@ The guides below provide more detail on the functionality of Woodwork.
working_with_types_and_tags
setting_config_options
statistical_insights
using_woodwork_with_dask_and_spark
custom_types_and_type_inference
saving_and_loading_dataframes