[ci] [python-package] temporarily stop testing against scikit-learn nightlies, load lib_lightgbm earlier #6654

Merged
jameslamb merged 10 commits into master from ci/skip-sklearn-nightlies on Sep 24, 2024

@jameslamb jameslamb commented Sep 15, 2024

As described in #6653, lightgbm is currently failing scikit-learn compatibility tests against the latest scikit-learn nightlies (1.6.0dev0).

That's being worked on in #6651.

This PR proposes temporarily dropping scikit-learn from the list of projects whose nightlies lightgbm is tested against, to unblock CI here.

Update (Sep 22)

While working on this, CI for the Python package started failing in another way:

libgomp.so.1: cannot allocate memory in static TLS block

This proposes a permanent fix for that as well... trying to dlopen() lib_lightgbm as early as possible when running import lightgbm.

@jameslamb jameslamb marked this pull request as ready for review September 15, 2024 06:14
@jameslamb jameslamb changed the title from "WIP: [ci] [python-package] temporarily stop testing against scikit-learn nightlies" to "[ci] [python-package] temporarily stop testing against scikit-learn nightlies" on Sep 15, 2024
@jameslamb

😫 QEMU aarch64 job is failing with this error again:

E ImportError: /root/miniforge/envs/test-env/lib/python3.12/site-packages/sklearn/utils/../../../../libgomp.so.1: cannot allocate memory in static TLS block

(build link)

For lots of prior context on this: #6509

@jameslamb

Back in #6509 (comment), I found that much of the static TLS was being used by libraries for cloud providers (AWS / Azure / GCP).
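For anyone wanting to check how much thread-local storage a given shared library declares, here is a rough sketch that reads the `PT_TLS` program header straight out of an ELF file. This is not necessarily how the numbers in #6509 were gathered, and the function name `tls_segment_size` is made up for this example. It only handles 64-bit little-endian ELF files (which covers linux-aarch64 and linux-x86_64), and the declared segment size is only a proxy for how much static TLS a library ends up consuming at load time.

```python
import struct

PT_TLS = 7  # program header type for the thread-local storage template


def tls_segment_size(elf_bytes: bytes) -> int:
    """Return p_memsz of the PT_TLS segment of a 64-bit little-endian ELF, or 0."""
    if (
        len(elf_bytes) < 64
        or elf_bytes[:4] != b"\x7fELF"
        or elf_bytes[4] != 2  # EI_CLASS: 2 means 64-bit
        or elf_bytes[5] != 1  # EI_DATA: 1 means little-endian
    ):
        raise ValueError("expected a 64-bit little-endian ELF file")
    # ELF64 header: e_phoff at byte 32, e_phentsize and e_phnum at bytes 54 and 56
    (e_phoff,) = struct.unpack_from("<Q", elf_bytes, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf_bytes, 54)
    for i in range(e_phnum):
        # ELF64 program header: type, flags, offset, vaddr, paddr, filesz, memsz, align
        fields = struct.unpack_from("<IIQQQQQQ", elf_bytes, e_phoff + i * e_phentsize)
        p_type, p_memsz = fields[0], fields[6]
        if p_type == PT_TLS:
            return p_memsz
    return 0


# Usage on a real file (path is illustrative):
# with open("/root/miniconda3/envs/test-env/lib/libgomp.so.1", "rb") as f:
#     print(tls_segment_size(f.read()))
```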

I just pushed 0229097 switching the CI environments here from conda-forge's pyarrow to its pyarrow-core.

That pyarrow-core package is slimmer and doesn't pull in those cloud provider libraries.

@jameslamb

Switching to pyarrow-core did not resolve this. I still see the same static TLS issue. (build link)

I'm able to reproduce it locally (in Docker on my M2 mac, which importantly is also arm64):

Environment setup in Docker, similar to CI:
docker run \
    --rm \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it lightgbm/vsts-agent:manylinux2014_aarch64 \
    bash

curl \
    -sL \
    -o miniforge.sh \
    https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh

sh miniforge.sh -b -p "${HOME}/miniconda3"
export PATH="${HOME}/miniconda3/bin:${PATH}"

conda create \
    -y \
    -c conda-forge \
    -n test-env \
    --file ./.ci/conda-envs/ci-core.txt \
    "python=3.12"

source activate test-env

This fails with the same error seen in CI:

source activate test-env

sh build-python.sh bdist_wheel
pip install --no-deps \
    ./dist/lightgbm-4.5.0.99-py3-none-linux_aarch64.whl

python -c "import lightgbm; print(lightgbm.__version__)"
# OSError: /root/miniconda3/envs/test-env/bin/../lib/libgomp.so.1: cannot allocate memory in static TLS block

Uninstalling scikit-learn resolves the issue.

conda uninstall --yes scikit-learn
python -c "import lightgbm; print(lightgbm.__version__)"
# 4.5.0.99

Output of 'conda info':
     active environment : test-env
    active env location : /root/miniconda3/envs/test-env
            shell level : 1
       user config file : /root/.condarc
 populated config files : /root/miniconda3/.condarc
          conda version : 24.7.1
    conda-build version : not installed
         python version : 3.12.5.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=m1
                          __conda=24.7.1=0
                          __glibc=2.17=0
                          __linux=6.6.16=0
                          __unix=0=0
       base environment : /root/miniconda3  (writable)
      conda av data dir : /root/miniconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-aarch64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /root/miniconda3/pkgs
                          /root/.conda/pkgs
       envs directories : /root/miniconda3/envs
                          /root/.conda/envs
               platform : linux-aarch64
             user-agent : conda/24.7.1 requests/2.32.3 CPython/3.12.5 Linux/6.6.16-linuxkit centos/7.9.2009 glibc/2.17 solver/libmamba conda-libmamba-solver/24.7.0 libmambapy/1.5.9
                UID:GID : 0:0
             netrc file : None
           offline mode : False

Output of 'conda env export':
name: test-env
channels:
  - conda-forge
dependencies:
  - _openmp_mutex=4.5=2_gnu
  - atk-1.0=2.38.0=hedc4a1f_2
  - aws-c-auth=0.7.29=h240fa19_1
  - aws-c-cal=0.7.4=h51bfcdd_1
  - aws-c-common=0.9.28=h86ecc28_0
  - aws-c-compression=0.2.19=h57e602e_1
  - aws-c-event-stream=0.4.3=h947aafb_1
  - aws-c-http=0.8.8=hc7031c7_2
  - aws-c-io=0.14.18=h42e3277_9
  - aws-c-mqtt=0.10.4=h0d18003_19
  - aws-c-s3=0.6.5=hc60c6a8_2
  - aws-c-sdkutils=0.1.19=h57e602e_3
  - aws-checksums=0.1.18=h57e602e_11
  - aws-crt-cpp=0.28.2=h5862e02_4
  - aws-sdk-cpp=1.11.379=he9ffd98_9
  - azure-core-cpp=1.13.0=h60f91e5_0
  - azure-identity-cpp=1.8.0=hf0f394c_2
  - azure-storage-blobs-cpp=12.12.0=h17ca4bd_0
  - azure-storage-common-cpp=12.7.0=h68dbd84_1
  - azure-storage-files-datalake-cpp=12.11.0=h36e5eb4_1
  - bokeh=3.5.2=pyhd8ed1ab_0
  - brotli=1.1.0=h86ecc28_2
  - brotli-bin=1.1.0=h86ecc28_2
  - brotli-python=1.1.0=py312h6f74592_2
  - bzip2=1.0.8=h68df207_7
  - c-ares=1.32.3=h68df207_0
  - ca-certificates=2024.8.30=hcefe29a_0
  - cairo=1.18.0=hdb1a16f_3
  - certifi=2024.8.30=pyhd8ed1ab_0
  - cffi=1.17.1=py312hac81daf_0
  - click=8.1.7=unix_pyh707e725_0
  - cloudpickle=3.0.0=pyhd8ed1ab_0
  - colorama=0.4.6=pyhd8ed1ab_0
  - contourpy=1.3.0=py312h451a7dd_1
  - cycler=0.12.1=pyhd8ed1ab_0
  - cytoolz=0.12.3=py312h9ef2f89_0
  - dask=2024.9.0=pyhd8ed1ab_0
  - dask-core=2024.9.0=pyhd8ed1ab_0
  - dask-expr=1.1.14=pyhd8ed1ab_0
  - distributed=2024.9.0=pyhd8ed1ab_0
  - exceptiongroup=1.2.2=pyhd8ed1ab_0
  - expat=2.6.3=h5ad3122_0
  - font-ttf-dejavu-sans-mono=2.37=hab24e00_0
  - font-ttf-inconsolata=3.000=h77eed37_0
  - font-ttf-source-code-pro=2.038=h77eed37_0
  - font-ttf-ubuntu=0.83=h77eed37_2
  - fontconfig=2.14.2=ha9a116f_0
  - fonts-conda-ecosystem=1=0
  - fonts-conda-forge=1=0
  - fonttools=4.53.1=py312hb2c0f52_1
  - freetype=2.12.1=hf0a5ef3_2
  - fribidi=1.0.10=hb9de7d4_0
  - fsspec=2024.9.0=pyhff2d567_0
  - gdk-pixbuf=2.42.12=ha61d561_0
  - gflags=2.2.2=h54f1f3f_1004
  - glog=0.7.1=h468a4a4_0
  - graphite2=1.3.13=h2f0025b_1003
  - graphviz=12.0.0=h2a7c30b_0
  - gtk2=2.24.33=h4cb56f0_5
  - gts=0.7.6=he293c15_4
  - h2=4.1.0=pyhd8ed1ab_0
  - harfbuzz=9.0.0=hbf49d6b_1
  - hpack=4.0.0=pyh9f0ad1d_0
  - hyperframe=6.0.1=pyhd8ed1ab_0
  - icu=75.1=hf9b3779_0
  - importlib-metadata=8.5.0=pyha770c72_0
  - importlib_metadata=8.5.0=hd8ed1ab_0
  - iniconfig=2.0.0=pyhd8ed1ab_0
  - jinja2=3.1.4=pyhd8ed1ab_0
  - joblib=1.4.2=pyhd8ed1ab_0
  - keyutils=1.6.1=h4e544f5_0
  - kiwisolver=1.4.7=py312h88dc405_0
  - krb5=1.21.3=h50a48e9_0
  - lcms2=2.16=h922389a_0
  - ld_impl_linux-aarch64=2.40=h9fc2d93_7
  - lerc=4.0.0=h4de3ea5_0
  - libabseil=20240116.2=cxx17_h0a1ffab_1
  - libarrow=17.0.0=h8bb5f56_14_cpu
  - libarrow-acero=17.0.0=h5ad3122_14_cpu
  - libarrow-dataset=17.0.0=h5ad3122_14_cpu
  - libarrow-substrait=17.0.0=h08b7278_14_cpu
  - libblas=3.9.0=23_linuxaarch64_openblas
  - libbrotlicommon=1.1.0=h86ecc28_2
  - libbrotlidec=1.1.0=h86ecc28_2
  - libbrotlienc=1.1.0=h86ecc28_2
  - libcblas=3.9.0=23_linuxaarch64_openblas
  - libcrc32c=1.1.2=h01db608_0
  - libcurl=8.10.0=h3ec0cbf_0
  - libdeflate=1.21=h68df207_0
  - libedit=3.1.20191231=he28a2e2_2
  - libev=4.33=h31becfc_2
  - libevent=2.1.12=h4ba1bb4_1
  - libexpat=2.6.3=h5ad3122_0
  - libffi=3.4.2=h3557bc0_5
  - libgcc=14.1.0=he277a41_1
  - libgcc-ng=14.1.0=he9431aa_1
  - libgd=2.3.3=h6818b27_10
  - libgfortran=14.1.0=he9431aa_1
  - libgfortran-ng=14.1.0=he9431aa_1
  - libgfortran5=14.1.0=h9420597_1
  - libglib=2.80.3=haee52c6_2
  - libgomp=14.1.0=he277a41_1
  - libgoogle-cloud=2.29.0=hbb89541_0
  - libgoogle-cloud-storage=2.29.0=hb9b2b65_0
  - libgrpc=1.62.2=h98a9317_0
  - libiconv=1.17=h31becfc_2
  - libjpeg-turbo=3.0.0=h31becfc_1
  - liblapack=3.9.0=23_linuxaarch64_openblas
  - libnghttp2=1.58.0=hb0e430d_1
  - libnsl=2.0.1=h31becfc_0
  - libopenblas=0.3.27=pthreads_h076ed1e_1
  - libparquet=17.0.0=h501616e_14_cpu
  - libpng=1.6.44=hc4a20ef_0
  - libprotobuf=4.25.3=h648ac29_0
  - libre2-11=2023.09.01=h9d008c2_2
  - librsvg=2.58.4=h00090f3_0
  - libsqlite=3.46.1=hc4a20ef_0
  - libssh2=1.11.0=h492db2e_0
  - libstdcxx=14.1.0=h3f4de04_1
  - libstdcxx-ng=14.1.0=hf1166c9_1
  - libthrift=0.20.0=h154c74f_1
  - libtiff=4.6.0=h395e79b_4
  - libutf8proc=2.8.0=h4e544f5_0
  - libuuid=2.38.1=hb4cce97_0
  - libwebp-base=1.4.0=h31becfc_0
  - libxcb=1.16=h57736b2_1
  - libxcrypt=4.4.36=h31becfc_1
  - libxml2=2.12.7=h00a45b3_4
  - libzlib=1.3.1=h68df207_1
  - locket=1.0.0=pyhd8ed1ab_0
  - lz4=4.3.3=py312h8c5c2bf_1
  - lz4-c=1.9.4=hd600fc2_0
  - markupsafe=2.1.5=py312h52516f5_1
  - matplotlib-base=3.9.2=py312h965bf68_1
  - msgpack-python=1.1.0=py312h451a7dd_0
  - munkres=1.1.4=pyh9f0ad1d_0
  - ncurses=6.5=hcccb83c_1
  - numpy=2.1.1=py312h2eb110b_0
  - openjpeg=2.5.2=h0d9d63b_0
  - openssl=3.3.2=h86ecc28_0
  - orc=2.0.2=h383807c_0
  - packaging=24.1=pyhd8ed1ab_0
  - pandas=2.2.2=py312h14eacfc_1
  - pango=1.54.0=h7579590_1
  - partd=1.4.2=pyhd8ed1ab_0
  - pcre2=10.44=h070dd5b_2
  - pillow=10.4.0=py312h18c71c7_1
  - pip=24.2=pyh8b19718_1
  - pixman=0.43.4=h2f0025b_0
  - pluggy=1.5.0=pyhd8ed1ab_0
  - psutil=6.0.0=py312hb2c0f52_1
  - pthread-stubs=0.4=hb9de7d4_1001
  - pyarrow=17.0.0=py312h55cb1a1_1
  - pyarrow-core=17.0.0=py312h66f7834_1_cpu
  - pyarrow-hotfix=0.6=pyhd8ed1ab_0
  - pycparser=2.22=pyhd8ed1ab_0
  - pyparsing=3.1.4=pyhd8ed1ab_0
  - pysocks=1.7.1=pyha2e5f31_6
  - pytest=8.3.3=pyhd8ed1ab_0
  - python=3.12.5=hb188aa9_0_cpython
  - python-dateutil=2.9.0=pyhd8ed1ab_0
  - python-graphviz=0.20.3=pyh717bed2_0
  - python-tzdata=2024.1=pyhd8ed1ab_0
  - python_abi=3.12=5_cp312
  - pytz=2024.2=pyhd8ed1ab_0
  - pyyaml=6.0.2=py312hb2c0f52_1
  - qhull=2020.2=h70be974_5
  - re2=2023.09.01=h9caee61_2
  - readline=8.2=h8fc344f_1
  - s2n=1.5.2=hd08dc88_0
  - scipy=1.14.1=py312hca5e164_0
  - setuptools=73.0.1=pyhd8ed1ab_0
  - six=1.16.0=pyh6c4a22f_0
  - snappy=1.2.1=h1088aeb_0
  - sortedcontainers=2.4.0=pyhd8ed1ab_0
  - tblib=3.0.0=pyhd8ed1ab_0
  - tk=8.6.13=h194ca79_0
  - tomli=2.0.1=pyhd8ed1ab_0
  - toolz=0.12.1=pyhd8ed1ab_0
  - tornado=6.4.1=py312h52516f5_1
  - tzdata=2024a=h8827d51_1
  - urllib3=2.2.2=pyhd8ed1ab_1
  - wheel=0.44.0=pyhd8ed1ab_0
  - xorg-kbproto=1.0.7=h3557bc0_1002
  - xorg-libice=1.1.1=h7935292_0
  - xorg-libsm=1.2.4=h5a01bc2_0
  - xorg-libx11=1.8.9=h08be655_1
  - xorg-libxau=1.0.11=h31becfc_0
  - xorg-libxdmcp=1.1.3=h3557bc0_0
  - xorg-libxext=1.3.4=h2a766a3_2
  - xorg-libxrender=0.9.11=h7935292_0
  - xorg-renderproto=0.11.1=h3557bc0_1002
  - xorg-xextproto=7.3.0=h2a766a3_1003
  - xorg-xproto=7.0.31=h3557bc0_1007
  - xyzservices=2024.9.0=pyhd8ed1ab_0
  - xz=5.2.6=h9cdd2b7_0
  - yaml=0.2.5=hf897c2e_2
  - zict=3.0.0=pyhd8ed1ab_0
  - zipp=3.20.2=pyhd8ed1ab_0
  - zlib=1.3.1=h68df207_1
  - zstandard=0.23.0=py312hb698573_1
  - zstd=1.5.6=h02f22dd_0
  - pip:
      - build==1.2.2
      - lightgbm==4.5.0.99
      - pyproject-hooks==1.1.0
prefix: /root/miniconda3/envs/test-env

That doesn't mean scikit-learn is doing anything wrong... it might just be the only other OpenMP-using library loaded at runtime by lightgbm.
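One way to check which OpenMP runtimes have actually been loaded into a process on Linux is to look at `/proc/self/maps` after the imports. The sketch below is a hypothetical helper (the name `loaded_openmp_runtimes` and the filename pattern are assumptions, not something from this PR). It takes the maps text as a string so it can be tried on any platform.

```python
import re
from typing import List

# Matches basenames like libgomp.so.1, libomp.so, libiomp5.so, and
# wheel-vendored copies with a hash in the name.
_OPENMP_PATTERN = re.compile(r"lib(gomp|omp|iomp5)[^/]*\.so")


def loaded_openmp_runtimes(maps_text: str) -> List[str]:
    """Return unique paths of OpenMP runtime libraries listed in /proc/<pid>/maps text."""
    found: List[str] = []
    for line in maps_text.splitlines():
        # format: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        name = path.rsplit("/", 1)[-1]
        if _OPENMP_PATTERN.match(name) and path not in found:
            found.append(path)
    return found


# Linux-only usage, after importing the packages of interest:
# with open("/proc/self/maps") as f:
#     print(loaded_openmp_runtimes(f.read()))
```

If more than one path shows up (for example one from the conda environment and one vendored inside a wheel), that would be a hint that two copies of the runtime are competing for the same limited static TLS space.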

I suspected that maybe the issue is from mixing pip-installed lightgbm (compiled with the system toolchain) with a bunch of conda-installed dependencies (compiled with conda's compilers).

I'll try to investigate more tomorrow 😫


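import ctypes

try:
    # this issue seems specific to libgomp, so no need to attempt e.g. libomp or libiomp
    _ = ctypes.CDLL("libgomp.so.1", ctypes.RTLD_GLOBAL)
except OSError:
    # libgomp not found or not loadable (e.g. on macOS or Windows); nothing to preload
    pass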
@jameslamb

This did seem to fix the issues observed in CI!

But it would probably be simpler to just load lib_lightgbm.{dylib,dll,so} earlier instead. I will test that and push some changes in a few hours. Let's not merge this yet, please.
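The `{dylib,dll,so}` shorthand above refers to the platform-specific file name of the compiled library. As a small sketch, assuming the name is picked from `platform.system()` (the helper `lib_lightgbm_filename` is hypothetical and only illustrates the mapping):

```python
import platform


def lib_lightgbm_filename(system: str = "") -> str:
    """Return the expected shared-library file name for the given (or current) OS."""
    system = system or platform.system()
    if system in ("Windows", "Microsoft"):
        return "lib_lightgbm.dll"
    if system == "Darwin":
        return "lib_lightgbm.dylib"
    # Linux and other Unix-like systems
    return "lib_lightgbm.so"
```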

@jameslamb jameslamb changed the title from "[ci] [python-package] temporarily stop testing against scikit-learn nightlies" to "WIP: [ci] [python-package] temporarily stop testing against scikit-learn nightlies" on Sep 21, 2024
@jameslamb jameslamb changed the title from "WIP: [ci] [python-package] temporarily stop testing against scikit-learn nightlies" to "WIP: [ci] [python-package] temporarily stop testing against scikit-learn nightlies, load lib_lightgbm earlier" on Sep 22, 2024
@jameslamb jameslamb changed the title from "WIP: [ci] [python-package] temporarily stop testing against scikit-learn nightlies, load lib_lightgbm earlier" to "[ci] [python-package] temporarily stop testing against scikit-learn nightlies, load lib_lightgbm earlier" on Sep 22, 2024
@jameslamb

@StrikerRUS I've changed this significantly since your first review, in response to this issue: #6654 (comment)

Could you please review again whenever you have time?

@StrikerRUS StrikerRUS left a comment

Wow! Thanks a lot for fixing yet another problem from our old friend OpenMP!

@jameslamb

Haha yep, thanks very much! I appreciate you reviewing this, I know the evidence supporting this fix is very dense.

At least this one feels like it will make the experience permanently better and isn't just a workaround... and I think it's getting better with every release as we learn more.

@jameslamb jameslamb merged commit e057ae0 into master Sep 24, 2024
44 checks passed
@jameslamb jameslamb deleted the ci/skip-sklearn-nightlies branch September 24, 2024 21:51
@StrikerRUS

@jameslamb
Could you please remove the ci-skip-sklearn-nightlies version from RTD?

@jameslamb

ugh, sorry I keep forgetting to do that! Thank you for reminding me. I just removed that version from RTD.
