[ci] aarch64 python-package job failing: "cannot allocate memory in static TLS block" #6509

Closed
jameslamb opened this issue Jun 28, 2024 · 15 comments · Fixed by #6527

@jameslamb
Collaborator

jameslamb commented Jun 28, 2024

Description

For the last few days, I've observed the aarch64 CI job (which we run on an x86_64 box, using QEMU for emulation) failing with errors like the following during test collection:

___________ ERROR collecting tests/python_package_test/test_basic.py ___________
ImportError while importing test module '/LightGBM/tests/python_package_test/test_basic.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/root/miniforge/envs/test-env/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/python_package_test/test_basic.py:12: in <module>
    from sklearn.datasets import dump_svmlight_file, load_svmlight_file, make_blobs
/root/miniforge/envs/test-env/lib/python3.12/site-packages/sklearn/__init__.py:97: in <module>
    from .utils._show_versions import show_versions
/root/miniforge/envs/test-env/lib/python3.12/site-packages/sklearn/utils/_show_versions.py:15: in <module>
    from ._openmp_helpers import _openmp_parallelism_enabled
E   ImportError: /root/miniforge/envs/test-env/lib/python3.12/site-packages/sklearn/utils/../../../../libgomp.so.1: cannot allocate memory in static TLS block

Reproducible example

This is happening across several different PRs, with changesets that are very unlikely to be causing this, suggesting it's some other change in the environment. For example:

Environment info

N/A

Additional Comments

"TLS" in this error refers to "thread-local storage".

There is lots of prior discussion on similar issues:

All of those are about using libgomp on aarch64.

From https://bugzilla.redhat.com/show_bug.cgi?id=1722181:

The GNU2 TLS model which I'm afraid aarch64 uses unfortunately eats from the same TLS preallocated pool as libraries that require static TLS like libgomp, where it is performance critical to have it as static TLS.

On opencv/opencv#14884, there's some discussion about this specifically being caused by bundled libgomp in multiple Python packages, and there are suggestions that importing those libraries earlier (and therefore loading their libgomp earlier) can resolve this.
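
The "load it earlier" suggestion from those threads can be sketched in Python roughly as follows. This is only an illustration: the helper name `preload_shared_library` is hypothetical, and `ctypes.util.find_library` may locate a system copy of the library rather than the one inside a conda environment, so passing a full path to `ctypes.CDLL` may be needed in practice.

```python
import ctypes
import ctypes.util
from typing import Optional


def preload_shared_library(name: str) -> Optional[ctypes.CDLL]:
    """Try to load a shared library by short name (e.g. "gomp") early in the process.

    Returns the loaded handle, or None if the library cannot be found or loaded.
    """
    # find_library maps a short name such as "gomp" to a file name such as "libgomp.so.1"
    path = ctypes.util.find_library(name)
    if path is None:
        return None
    try:
        # RTLD_GLOBAL makes the library's symbols visible to libraries loaded later
        return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        return None


if __name__ == "__main__":
    # Hypothetical usage: run this before importing anything else that needs libgomp.
    handle = preload_shared_library("gomp")
    print("libgomp preloaded" if handle is not None else "libgomp not found")
```

Calling something like this at the very top of a test session would be one way to try the workaround without changing how the process is launched.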

These also have some helpful information:

@jameslamb
Collaborator Author

jameslamb commented Jun 28, 2024

I recall now that allocation of static TLS was one of the reasons we switched x86_64 wheels to manylinux_2_28 (#5584 (comment))... there were some fixes in newer GLIBC versions related to it.

Similarly, a comment from August 2020 on https://bugzilla.redhat.com/show_bug.cgi?id=1722181 says:

the fix in glibc is allowing the installer to run correctly without the LD_PRELOAD workaround

So it's possible that we'd be less likely to see this if we moved aarch64 builds to a newer GLIBC.

They're currently using manylinux2014: https://github.com/guolinke/lightgbm-ci-docker/blob/a0df140729a4006b4ff87b880e7aa53934ecc0e7/images/manylinux2014_aarch64/Dockerfile#L1
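
For anyone checking which GLIBC a given environment is actually running against, here is a minimal sketch. The function names are made up for illustration; `CS_GNU_LIBC_VERSION` is a glibc-specific `confstr` name, so the lookup returns `None` on other platforms. manylinux2014 images are based on GLIBC 2.17 and manylinux_2_28 images on GLIBC 2.28.

```python
import os
from typing import Optional, Tuple


def parse_glibc_version(text: str) -> Optional[Tuple[int, int]]:
    """Parse strings like "glibc 2.17" into (2, 17); return None for anything else."""
    parts = text.split()
    if len(parts) != 2 or parts[0] != "glibc":
        return None
    numbers = parts[1].split(".")
    if len(numbers) < 2:
        return None
    return int(numbers[0]), int(numbers[1])


def running_glibc_version() -> Optional[Tuple[int, int]]:
    """Return the glibc version of the current process, or None if unavailable."""
    try:
        text = os.confstr("CS_GNU_LIBC_VERSION")
    except (AttributeError, ValueError, OSError):
        # os.confstr is missing on some platforms, or the name is not recognized
        return None
    return parse_glibc_version(text) if text else None


if __name__ == "__main__":
    version = running_glibc_version()
    print("glibc version:", version)
    if version is not None and version < (2, 28):
        print("older than the manylinux_2_28 baseline")
```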

@jameslamb jameslamb pinned this issue Jul 2, 2024
@jameslamb
Collaborator Author

I just tried a rebuild, and this is still happening: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=16481&view=logs&j=7a417b3a-6502-5a0d-1db8-7ef6155c93de&t=380f8b13-0b2d-5f03-5de0-8353018c7351

Looking more closely at the logs, I noticed that it is happening whenever a libgomp.so is loaded... sometimes from loading scikit-learn's bundled one

E   ImportError: /root/miniforge/envs/test-env/lib/python3.12/site-packages/sklearn/utils/../../../../libgomp.so.1: cannot allocate memory in static TLS block

sometimes conda's

E   OSError: /root/miniforge/envs/test-env/bin/../lib/libgomp.so.1: cannot allocate memory in static TLS block

So I suspect part of the problem is "if you use scikit-learn, it's likely 2 libgomp.so.1s are going to be loaded"... maybe because of some change to the order of imports somewhere down the call stack. I'll keep testing changes.

@jmoralez
Collaborator

jmoralez commented Jul 3, 2024

Just spitballing here but since we're getting conda's libgomp, we could also try installing conda's c++ compiler in that job and that'd link against that libgomp, which is the one being loaded by numpy and scipy, and maybe that'd solve the issue. WDYT?

@jameslamb
Collaborator Author

It looks like cuml is running into similar issues (also only in its aarch64 builds): rapidsai/cuml#5949

It has lots of the same dependencies as lightgbm (like scikit-learn, numpy, scipy, dask, and distributed from conda-forge).


we could also try installing conda's c++ compiler in that job and that'd link against that libgomp, which is the one being loaded by numpy and scipy

Interesting idea! Thanks for catching that conda-forge's libgomp is getting pulled into the environment... that might point to a source of the problem.

I'm nervous about using conda's compilers to build a wheel that's intended to be distributed for use outside of conda.

conda's toolchain is configured for producing conda packages, and I know it does things like rewriting paths in binaries so that conda's libraries will always be found. (not sure how much of that happens in compilers/linkers vs. in conda-build though).

BUT... that's testable, and it's worth testing. I can try that to see if it'd help.

Another thing I'm going to test... I'm going to try reducing the tests to simply import lightgbm and see if that's sufficient to trigger this problem. If it is, that'd narrow it down significantly.

@jmoralez
Collaborator

jmoralez commented Jul 3, 2024

Oh sorry, I didn't realize that job was building wheels. We could also go the other way and drop conda from that job; I've used uv a lot lately and it's great.

@jameslamb
Collaborator Author

drop conda from that job

I like that idea too! Although it could be pretty involved... I suspect conda is doing a LOT of useful environment-resolution work for us that might be difficult to replicate.

As a start... would you support commenting out this CI job for now so that we can keep making progress in the repo while we investigate this? With the understanding that this would have to be fixed and working again before we do the next release.

I'm really happy with the recent pace of non-maintainer contributions we've been getting, and I'd like to get those peoples' work merged so we build momentum with them:

@mayer79 and @nicklamiller in particular have been more patient with us than they should have to be, waiting on various CI issues 😅

@jmoralez
Collaborator

jmoralez commented Jul 3, 2024

would you support commenting out this CI job for now so that we can keep making progress in the repo while we investigate this?

Sure! That shouldn't block our CI since it's just a packaging problem.

@jameslamb
Collaborator Author

put up #6517

Thankfully we now have that one macOS job still providing coverage of compiling LightGBM on aarch64.

@msarahan

msarahan commented Jul 3, 2024

A couple of thoughts here:

  1. Are you installing scikit-learn from PyPI (a wheel?) If so, is it an option to install that from conda? The conda package should not bundle libgomp, and so should not have this problem.
  2. I wonder if scikit-learn has thought about name-mangling their copy of libgomp to avoid these issues. That would come at the cost of multiple copies of libgomp being loaded in one process, which may explode resource contention.

Probably no great answer right now, but all the more incentive to work on https://discuss.python.org/t/implementation-variants-rehashing-and-refocusing/54884!

@jameslamb
Collaborator Author

Thanks @msarahan ! We're getting scikit-learn from conda-forge.

Looking at the log more closely, I don't think this has anything to do with scikit-learn bundling its own libgomp. Its conda package doesn't do that (as you said and as we'd expect):

wget https://anaconda.org/conda-forge/scikit-learn/1.5.1/download/linux-aarch64/scikit-learn-1.5.1-py311haece950_0.conda

mkdir -p ./tmp
cph extract \
   --dest ./tmp \
   ./scikit-learn-1.5.1-py311haece950_0.conda

find ./tmp -name '*libgomp*'
# (empty)
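
The same check can be done from Python on any extracted package directory. This is just a sketch equivalent to the `find` call above; the function name is hypothetical.

```python
from pathlib import Path
from typing import List


def find_bundled_libraries(root: str, pattern: str = "*libgomp*") -> List[str]:
    """Return sorted paths under `root` whose file name matches `pattern`."""
    # rglob walks the tree recursively, like `find <root> -name <pattern>`
    return sorted(str(p) for p in Path(root).rglob(pattern))


if __name__ == "__main__":
    # "./tmp" is the extraction directory used in the commands above
    print(find_bundled_libraries("./tmp"))
```

An empty list here corresponds to the empty `find` output.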

I think I was confused by this log line

/root/miniforge/envs/test-env/lib/python3.12/site-packages/sklearn/utils/../../../../libgomp.so.1

If you trace back all those relative paths, that's actually /root/miniforge/envs/test-env/lib/libgomp.so.1... the one coming from conda-forge.
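
A quick way to do that tracing without counting `..` segments by hand. Note that `normpath` only collapses the path lexically; if symlinks were involved, `os.path.realpath` on the actual machine would be the thing to check.

```python
import posixpath

logged = (
    "/root/miniforge/envs/test-env/lib/python3.12/site-packages/"
    "sklearn/utils/../../../../libgomp.so.1"
)

# normpath collapses ".." segments without touching the filesystem
resolved = posixpath.normpath(logged)
print(resolved)
```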

@jameslamb jameslamb removed the blocking label Jul 4, 2024
@jameslamb
Collaborator Author

Had an idea tonight 💡

Now that GitHub Actions offers a free aarch64 macOS runner (#6391), we could probably build aarch64 linux wheels on that runner + Docker! That should be much faster than the approach we were using (QEMU emulation of aarch64 on an x86_64 system), because it wouldn't require emulation. 😀

@jameslamb
Collaborator Author

jameslamb commented Jul 7, 2024

I was able to reproduce this tonight without involving lightgbm's unit tests. At least that narrows it down a bit!

Ran the following on my M2 mac (which has an aarch64 CPU, so no emulation required). Note it's using exactly the same Docker image used in the CI job being discussed here.

docker run \
    --rm \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it lightgbm/vsts-agent:manylinux2014_aarch64 \
    bash

curl \
    -sL \
    -o miniforge.sh \
    https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh

sh miniforge.sh -b -p "${HOME}/miniconda3"
export PATH="${HOME}/miniconda3/bin:${PATH}"

conda create \
    -y \
    -c conda-forge \
    -n test-env \
    --file ./.ci/conda-envs/ci-core.txt \
    "python=3.11"

source activate test-env
pip install --no-deps 'lightgbm==4.4.0'
python -c "import lightgbm"
error traceback:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/root/miniconda3/envs/test-env/lib/python3.11/site-packages/lightgbm/__init__.py", line 9, in <module>
    from .basic import Booster, Dataset, Sequence, register_logger
  File "/root/miniconda3/envs/test-env/lib/python3.11/site-packages/lightgbm/basic.py", line 279, in <module>
    _LIB = _load_lib()
           ^^^^^^^^^^^
  File "/root/miniconda3/envs/test-env/lib/python3.11/site-packages/lightgbm/basic.py", line 263, in _load_lib
    lib = ctypes.cdll.LoadLibrary(lib_path[0])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/test-env/lib/python3.11/ctypes/__init__.py", line 454, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/test-env/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /root/miniconda3/envs/test-env/bin/../lib/libgomp.so.1: cannot allocate memory in static TLS block

It definitely looks like scikit-learn and lightgbm are loading different copies of libgomp.so, and together those are eating up too much of the static TLS space. Uninstalling scikit-learn resolves the error.

conda uninstall --yes scikit-learn

# this succeeds
python -c "import lightgbm"
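
One way to check whether more than one copy of libgomp really ends up in a process is to look at its memory mappings. A rough sketch, assuming Linux (`/proc/self/maps`); the function name is hypothetical:

```python
from typing import List, Optional


def loaded_libraries(substring: str, maps_text: Optional[str] = None) -> List[str]:
    """Return the distinct mapped file paths containing `substring`.

    Reads /proc/self/maps (Linux only) unless `maps_text` is supplied.
    """
    if maps_text is None:
        with open("/proc/self/maps") as f:
            maps_text = f.read()
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) == 6 and substring in fields[5]:
            paths.add(fields[5])
    return sorted(paths)
```

Calling `loaded_libraries("libgomp")` after a successful `import lightgbm` would list every libgomp file mapped into the interpreter; more than one distinct path would support the "different copies" theory, a single path would not.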
output of 'conda info':
     active environment : test-env
    active env location : /root/miniconda3/envs/test-env
            shell level : 1
       user config file : /root/.condarc
 populated config files : /root/miniconda3/.condarc
          conda version : 24.3.0
    conda-build version : not installed
         python version : 3.10.14.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=m1
                          __conda=24.3.0=0
                          __glibc=2.17=0
                          __linux=6.6.16=0
                          __unix=0=0
       base environment : /root/miniconda3  (writable)
      conda av data dir : /root/miniconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/conda-forge/linux-aarch64
                          https://conda.anaconda.org/conda-forge/noarch
          package cache : /root/miniconda3/pkgs
                          /root/.conda/pkgs
       envs directories : /root/miniconda3/envs
                          /root/.conda/envs
               platform : linux-aarch64
             user-agent : conda/24.3.0 requests/2.31.0 CPython/3.10.14 Linux/6.6.16-linuxkit centos/7.9.2009 glibc/2.17 solver/libmamba conda-libmamba-solver/24.1.0 libmambapy/1.5.8
                UID:GID : 0:0
             netrc file : None
           offline mode : False
output of 'conda env export':
name: test-env
channels:
  - conda-forge
dependencies:
  - _openmp_mutex=4.5=2_gnu
  - atk-1.0=2.38.0=hedc4a1f_2
  - aws-c-auth=0.7.22=h742e1ef_7
  - aws-c-cal=0.7.0=h1194e0d_0
  - aws-c-common=0.9.23=h68df207_0
  - aws-c-compression=0.2.18=h3ff8e8a_7
  - aws-c-event-stream=0.4.2=hf436a7f_14
  - aws-c-http=0.8.2=h7e628a3_4
  - aws-c-io=0.14.9=h694ca1a_5
  - aws-c-mqtt=0.10.4=ha07a1b7_7
  - aws-c-s3=0.6.0=he19ceea_0
  - aws-c-sdkutils=0.1.16=h3ff8e8a_3
  - aws-checksums=0.1.18=h3ff8e8a_7
  - aws-crt-cpp=0.27.1=h1828411_1
  - aws-sdk-cpp=1.11.329=h38de1dc_7
  - azure-core-cpp=1.12.0=h60f91e5_0
  - azure-identity-cpp=1.8.0=he93b4e3_1
  - azure-storage-blobs-cpp=12.11.0=h41c6e6e_1
  - azure-storage-common-cpp=12.6.0=hc9a6983_1
  - azure-storage-files-datalake-cpp=12.10.0=h6eec737_1
  - bokeh=3.4.2=pyhd8ed1ab_0
  - brotli=1.1.0=h31becfc_1
  - brotli-bin=1.1.0=h31becfc_1
  - brotli-python=1.1.0=py311h8715677_1
  - bzip2=1.0.8=h31becfc_5
  - c-ares=1.28.1=h31becfc_0
  - ca-certificates=2024.7.4=hcefe29a_0
  - cairo=1.18.0=h5c54ea9_2
  - certifi=2024.6.2=pyhd8ed1ab_0
  - cffi=1.16.0=py311h7963103_0
  - click=8.1.7=unix_pyh707e725_0
  - cloudpickle=3.0.0=pyhd8ed1ab_0
  - colorama=0.4.6=pyhd8ed1ab_0
  - contourpy=1.2.1=py311h098ece5_0
  - cycler=0.12.1=pyhd8ed1ab_0
  - cytoolz=0.12.3=py311hc8f2f60_0
  - dask=2024.7.0=pyhd8ed1ab_0
  - dask-core=2024.7.0=pyhd8ed1ab_0
  - dask-expr=1.1.7=pyhd8ed1ab_0
  - distributed=2024.7.0=pyhd8ed1ab_0
  - exceptiongroup=1.2.0=pyhd8ed1ab_2
  - expat=2.6.2=h2f0025b_0
  - font-ttf-dejavu-sans-mono=2.37=hab24e00_0
  - font-ttf-inconsolata=3.000=h77eed37_0
  - font-ttf-source-code-pro=2.038=h77eed37_0
  - font-ttf-ubuntu=0.83=h77eed37_2
  - fontconfig=2.14.2=ha9a116f_0
  - fonts-conda-ecosystem=1=0
  - fonts-conda-forge=1=0
  - fonttools=4.53.0=py311hf4892ed_0
  - freetype=2.12.1=hf0a5ef3_2
  - fribidi=1.0.10=hb9de7d4_0
  - fsspec=2024.6.1=pyhff2d567_0
  - gdk-pixbuf=2.42.12=ha61d561_0
  - gflags=2.2.2=h54f1f3f_1004
  - giflib=5.2.2=h31becfc_0
  - glog=0.7.1=h468a4a4_0
  - graphite2=1.3.13=h2f0025b_1003
  - graphviz=11.0.0=h8cf0465_0
  - gtk2=2.24.33=h0d7db29_4
  - gts=0.7.6=he293c15_4
  - h2=4.1.0=pyhd8ed1ab_0
  - harfbuzz=8.5.0=h9812418_0
  - hpack=4.0.0=pyh9f0ad1d_0
  - hyperframe=6.0.1=pyhd8ed1ab_0
  - icu=73.2=h787c7f5_0
  - importlib-metadata=8.0.0=pyha770c72_0
  - importlib_metadata=8.0.0=hd8ed1ab_0
  - iniconfig=2.0.0=pyhd8ed1ab_0
  - jinja2=3.1.4=pyhd8ed1ab_0
  - joblib=1.4.2=pyhd8ed1ab_0
  - keyutils=1.6.1=h4e544f5_0
  - kiwisolver=1.4.5=py311h0d5d7b0_1
  - krb5=1.21.3=h50a48e9_0
  - lcms2=2.16=h922389a_0
  - ld_impl_linux-aarch64=2.40=h9fc2d93_7
  - lerc=4.0.0=h4de3ea5_0
  - libabseil=20240116.2=cxx17_h2f0025b_0
  - libarrow=16.1.0=h8503109_12_cpu
  - libarrow-acero=16.1.0=h5ad3122_12_cpu
  - libarrow-dataset=16.1.0=h5ad3122_12_cpu
  - libarrow-substrait=16.1.0=h08b7278_12_cpu
  - libblas=3.9.0=22_linuxaarch64_openblas
  - libbrotlicommon=1.1.0=h31becfc_1
  - libbrotlidec=1.1.0=h31becfc_1
  - libbrotlienc=1.1.0=h31becfc_1
  - libcblas=3.9.0=22_linuxaarch64_openblas
  - libcrc32c=1.1.2=h01db608_0
  - libcurl=8.8.0=h4e8248e_1
  - libdeflate=1.20=h31becfc_0
  - libedit=3.1.20191231=he28a2e2_2
  - libev=4.33=h31becfc_2
  - libevent=2.1.12=h4ba1bb4_1
  - libexpat=2.6.2=h2f0025b_0
  - libffi=3.4.2=h3557bc0_5
  - libgcc-ng=14.1.0=he277a41_0
  - libgd=2.3.3=hcd22fd5_9
  - libgfortran-ng=14.1.0=he9431aa_0
  - libgfortran5=14.1.0=h9420597_0
  - libglib=2.80.2=haee52c6_1
  - libgomp=14.1.0=he277a41_0
  - libgoogle-cloud=2.26.0=hc02380a_0
  - libgoogle-cloud-storage=2.26.0=hd572f31_0
  - libgrpc=1.62.2=h98a9317_0
  - libiconv=1.17=h31becfc_2
  - libjpeg-turbo=3.0.0=h31becfc_1
  - liblapack=3.9.0=22_linuxaarch64_openblas
  - libnghttp2=1.58.0=hb0e430d_1
  - libnsl=2.0.1=h31becfc_0
  - libopenblas=0.3.27=pthreads_h5a5ec62_0
  - libparquet=16.1.0=h6fe2c6f_12_cpu
  - libpng=1.6.43=h194ca79_0
  - libprotobuf=4.25.3=h648ac29_0
  - libre2-11=2023.09.01=h9d008c2_2
  - librsvg=2.58.1=h010368b_0
  - libsqlite=3.46.0=hf51ef55_0
  - libssh2=1.11.0=h492db2e_0
  - libstdcxx-ng=14.1.0=h3f4de04_0
  - libthrift=0.19.0=h043aeee_1
  - libtiff=4.6.0=hf980d43_3
  - libutf8proc=2.8.0=h4e544f5_0
  - libuuid=2.38.1=hb4cce97_0
  - libwebp=1.4.0=h8b4e01b_0
  - libwebp-base=1.4.0=h31becfc_0
  - libxcb=1.16=h7935292_0
  - libxcrypt=4.4.36=h31becfc_1
  - libxml2=2.12.7=h49dc7a2_1
  - libzlib=1.3.1=h68df207_1
  - locket=1.0.0=pyhd8ed1ab_0
  - lz4=4.3.3=py311h6a4b261_0
  - lz4-c=1.9.4=hd600fc2_0
  - markupsafe=2.1.5=py311hc8f2f60_0
  - matplotlib-base=3.8.4=py311h55059f0_2
  - msgpack-python=1.0.8=py311hdc7ef93_0
  - munkres=1.1.4=pyh9f0ad1d_0
  - ncurses=6.5=h0425590_0
  - numpy=2.0.0=py311hacb946d_0
  - openjpeg=2.5.2=h0d9d63b_0
  - openssl=3.3.1=h68df207_1
  - orc=2.0.1=hd7aaf90_1
  - packaging=24.1=pyhd8ed1ab_0
  - pandas=2.2.2=py311hb80374c_1
  - pango=1.54.0=h399c48b_0
  - partd=1.4.2=pyhd8ed1ab_0
  - pcre2=10.44=h070dd5b_0
  - pillow=10.4.0=py311h54289d1_0
  - pip=24.0=pyhd8ed1ab_0
  - pixman=0.43.4=h2f0025b_0
  - pluggy=1.5.0=pyhd8ed1ab_0
  - psutil=6.0.0=py311hf4892ed_0
  - pthread-stubs=0.4=hb9de7d4_1001
  - pyarrow=16.1.0=py311h58b41f2_4
  - pyarrow-core=16.1.0=py311h34cd749_4_cpu
  - pyarrow-hotfix=0.6=pyhd8ed1ab_0
  - pycparser=2.22=pyhd8ed1ab_0
  - pyparsing=3.1.2=pyhd8ed1ab_0
  - pysocks=1.7.1=pyha2e5f31_6
  - pytest=8.2.2=pyhd8ed1ab_0
  - python=3.11.9=hddfb980_0_cpython
  - python-dateutil=2.9.0=pyhd8ed1ab_0
  - python-graphviz=0.20.3=pyh717bed2_0
  - python-tzdata=2024.1=pyhd8ed1ab_0
  - python_abi=3.11=4_cp311
  - pytz=2024.1=pyhd8ed1ab_0
  - pyyaml=6.0.1=py311hcd402e7_1
  - re2=2023.09.01=h9caee61_2
  - readline=8.2=h8fc344f_1
  - s2n=1.4.17=h52a6840_0
  - scikit-learn=1.5.1=py311haece950_0
  - scipy=1.14.0=py311hbd9a39d_1
  - setuptools=70.1.1=pyhd8ed1ab_0
  - six=1.16.0=pyh6c4a22f_0
  - snappy=1.2.1=h1088aeb_0
  - sortedcontainers=2.4.0=pyhd8ed1ab_0
  - tblib=3.0.0=pyhd8ed1ab_0
  - threadpoolctl=3.5.0=pyhc1e730c_0
  - tk=8.6.13=h194ca79_0
  - tomli=2.0.1=pyhd8ed1ab_0
  - toolz=0.12.1=pyhd8ed1ab_0
  - tornado=6.4.1=py311h323e239_0
  - tzdata=2024a=h0c530f3_0
  - urllib3=2.2.2=pyhd8ed1ab_1
  - wheel=0.43.0=pyhd8ed1ab_1
  - xorg-kbproto=1.0.7=h3557bc0_1002
  - xorg-libice=1.1.1=h7935292_0
  - xorg-libsm=1.2.4=h5a01bc2_0
  - xorg-libx11=1.8.9=h08be655_1
  - xorg-libxau=1.0.11=h31becfc_0
  - xorg-libxdmcp=1.1.3=h3557bc0_0
  - xorg-libxext=1.3.4=h2a766a3_2
  - xorg-libxrender=0.9.11=h7935292_0
  - xorg-renderproto=0.11.1=h3557bc0_1002
  - xorg-xextproto=7.3.0=h2a766a3_1003
  - xorg-xproto=7.0.31=h3557bc0_1007
  - xyzservices=2024.6.0=pyhd8ed1ab_0
  - xz=5.2.6=h9cdd2b7_0
  - yaml=0.2.5=hf897c2e_2
  - zict=3.0.0=pyhd8ed1ab_0
  - zipp=3.19.2=pyhd8ed1ab_0
  - zlib=1.3.1=h68df207_1
  - zstandard=0.22.0=py311h8938cd4_1
  - zstd=1.5.6=h02f22dd_0
  - pip:
      - lightgbm==4.4.0
prefix: /root/miniconda3/envs/test-env

@jameslamb
Collaborator Author

Pre-loading libgomp.so.1 does work, as suggested in many of the threads linked above.

LD_PRELOAD="/root/miniconda3/envs/test-env/lib/libgomp.so.1" \
python -c "import lightgbm; print(lightgbm.__version__)"
# 4.4.0
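
If the pre-load has to be applied from a Python test harness rather than a shell, the environment for the child process could be built like this. It is a sketch with a hypothetical helper name; it relies on the dynamic loader accepting a colon- or space-separated list in `LD_PRELOAD`.

```python
import os
from typing import Dict, Mapping, Optional


def env_with_preload(
    lib_path: str, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment with `lib_path` first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep any entries that were already set, but put ours first
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env
```

The result can then be passed as `env=` to `subprocess.run([sys.executable, "-c", "import lightgbm"], ...)`, using the libgomp path from the command above.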

I think the reason for that is best summarized in this 10+ year old thread about MATLAB (https://stackoverflow.com/a/19468365/3986677):

MATLAB dynamically (with dlopen) loads several libraries that need tls initialization. All those libs need a slot in the dtv (dynamic thread vector). Because MATLAB loads several of these libs dynamically at runtime at compile/link time the linker (at mathworks) had no chance to count the slots needed (that's the important part). Now it's the task of the dynamic lib loader to handle such a case at runtime. But this is not easy. To cite dl-open.c:

"For static TLS we have to allocate the memory here and now. This includes allocating memory in the DTV. But we cannot change any DTV other than our own. So, if we cannot guarantee that there is room in the DTV we don't even try it and fail the load."

@jameslamb
Collaborator Author

I followed the advice in https://bugzilla.redhat.com/show_bug.cgi?id=1722181 to check how much static TLS is being allocated for each library loaded to satisfy import lightgbm.

script:

Get a list of all the shared objects that need to be loaded to load lightgbm.

mkdir -p /opt/LightGBM/ld-logs

LD_DEBUG=libs LD_DEBUG_OUTPUT=/opt/LightGBM/ld-logs/out.txt \
python -c "import lightgbm"

cat /opt/LightGBM/ld-logs/* > /tmp/ld-logs-full.txt
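
For reference, the "which files did the loader try" step can also be done in Python. This sketch assumes the `LD_DEBUG=libs` log contains lines with a `trying file=<path>` marker, which is what the `grep trying` step below relies on; the function name is made up.

```python
from typing import List


def tried_files(ld_debug_text: str) -> List[str]:
    """Extract the paths from "trying file=..." lines, in first-seen order."""
    marker = "trying file="
    seen: List[str] = []
    for line in ld_debug_text.splitlines():
        # partition returns an empty `found` when the marker is absent
        _, found, path = line.partition(marker)
        path = path.strip()
        if found and path and path not in seen:
            seen.append(path)
    return seen
```

Feeding it the contents of `/tmp/ld-logs-full.txt` gives a de-duplicated list of candidate paths; as in the shell version, paths that do not exist on disk still need to be filtered out afterwards.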

Used readelf to check which ones want to use some memory for thread-local storage, and how much.

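# For each file the loader tried, record the size of its TLS segment (0 if it has none).
for l in $(grep trying /tmp/ld-logs-full.txt | cut -d '=' -f 2); do
    if test -f "$l"; then
        # Field 6 of the TLS program header line is MemSiz, in hex.
        printf "%d bytes ($(realpath "$l"))\n" $(
            readelf -Wl "$l" \
            | grep TLS \
            | awk -F ' ' '{ print $6 }'
        ) >> tls-usage.txt
    fi
done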

cat tls-usage.txt \
| grep -v '^0' \
| sort -r -n -u

For context on that $6 in awk... here's what the readelf output looks like.

readelf -Wl /root/miniconda3/envs/test-env/lib/libprotobuf.so.25.3.0
Elf file type is DYN (Shared object file)
Entry point 0x0
There are 7 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x2c08ba 0x2c08ba R E 0x10000
  LOAD           0x2c5200 0x00000000002d5200 0x00000000002d5200 0x00ef39 0x0103b8 RW  0x10000
  DYNAMIC        0x2cb878 0x00000000002db878 0x00000000002db878 0x000380 0x000380 RW  0x8
  TLS            0x2c5200 0x00000000002d5200 0x00000000002d5200 0x000020 0x000024 R   0x20
  GNU_EH_FRAME   0x2708c4 0x00000000002708c4 0x00000000002708c4 0x0092e4 0x0092e4 R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x2c5200 0x00000000002d5200 0x00000000002d5200 0x00ae00 0x00ae00 R   0x1

Hexadecimal representations like that are understood by printf.

printf "%d\n" "0x000024"
# 36
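
The same extraction in Python, as a sketch: it pulls the MemSiz column out of the TLS program header in `readelf -Wl` output and converts it from hex. The function name is hypothetical.

```python
from typing import Optional


def tls_memsiz(readelf_output: str) -> Optional[int]:
    """Return MemSiz (in bytes) of the TLS program header, or None if there is none."""
    for line in readelf_output.splitlines():
        fields = line.split()
        # Columns: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
        if len(fields) >= 6 and fields[0] == "TLS":
            # int() with base 16 accepts the "0x" prefix
            return int(fields[5], 16)
    return None
```

Applied to the libprotobuf output above, this picks up the `0x000024` value that `printf` converts.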

OpenMP is not even close to the most expensive in this regard... I wonder if it's just showing up in the error message because it's loaded so late in the process?

61440 bytes (/root/miniconda3/envs/test-env/lib/libopenblasp-r0.3.27.so)
30456 bytes (/root/miniconda3/envs/test-env/lib/libglog.so.0.7.1)
2744 bytes (/root/miniconda3/envs/test-env/lib/libarrow.so.1601.0.0)
2536 bytes (/root/miniconda3/envs/test-env/lib/libaws-cpp-sdk-core.so)
2520 bytes (/root/miniconda3/envs/test-env/lib/libazure-core.so)
728 bytes (/root/miniconda3/envs/test-env/lib/libxml2.so.2.12.7)
144 bytes (/root/miniconda3/envs/test-env/lib/libs2n.so.1.0.0)
136 bytes (/root/miniconda3/envs/test-env/lib/libgomp.so.1.0.0)
120 bytes (/usr/lib64/libc-2.17.so)
72 bytes (/root/miniconda3/envs/test-env/lib/libgoogle_cloud_cpp_common.so.2.26.0)
70 bytes (/root/miniconda3/envs/test-env/lib/libuuid.so.1.3.0)
56 bytes (/root/miniconda3/envs/test-env/lib/libaws-c-common.so.1.0.0)
36 bytes (/root/miniconda3/envs/test-env/lib/libprotobuf.so.25.3.0)
32 bytes (/root/miniconda3/envs/test-env/lib/libstdc++.so.6.0.33)
12 bytes (/root/miniconda3/envs/test-env/lib/libabsl_base.so.2401.0.0)
8 bytes (/root/miniconda3/envs/test-env/lib/libgcc_s.so.1)
6 bytes (/usr/lib64/libsmartcols.so.1.1.0)
1 bytes (/root/miniconda3/envs/test-env/lib/libabsl_log_internal_log_sink_set.so.2401.0.0)
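
To put libgomp's share in perspective, the listing can be parsed and summed. This is a rough sketch with hypothetical function names; the summed segment sizes are only a proxy, since alignment padding is ignored and not every library necessarily draws from the static TLS pool.

```python
import re
from typing import List, Tuple

# Matches lines such as "136 bytes (/path/to/libgomp.so.1.0.0)"
LINE = re.compile(r"^(\d+) bytes \((.+)\)$")


def parse_tls_usage(text: str) -> List[Tuple[int, str]]:
    """Parse "N bytes (path)" lines into (size, path) pairs, largest first."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rows.append((int(match.group(1)), match.group(2)))
    return sorted(rows, reverse=True)


def share_of_total(rows: List[Tuple[int, str]], needle: str) -> float:
    """Fraction of the summed TLS sizes attributable to paths containing `needle`."""
    total = sum(size for size, _ in rows)
    matched = sum(size for size, path in rows if needle in path)
    return matched / total if total else 0.0
```

Running `share_of_total(parse_tls_usage(open("tls-usage.txt").read()), "libgomp")` on the file produced above would give libgomp's fraction of the total.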

@jameslamb
Collaborator Author

I noticed that most of those are pyarrow and its dependencies, which helped me to narrow this down even further from #6509 (comment).

Importing just pyarrow followed by scikit-learn (which causes libgomp to be loaded) is sufficient to reproduce the error (in the environment described in #6509 (comment)):

# no error
python -c "import pyarrow"

# no error
python -c "import sklearn"

# error: cannot allocate memory in static TLS block
python -c "import pyarrow; import sklearn"

Just switching the order is sufficient to avoid the error, at least in this case.

# no error
python -c "import sklearn; import pyarrow"
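
These order experiments are easy to script. A minimal sketch (the helper name is hypothetical) that tries a given import order in a fresh interpreter, so that libraries loaded by one attempt cannot affect the next:

```python
import subprocess
import sys
from typing import Sequence


def import_order_works(modules: Sequence[str]) -> bool:
    """Import `modules` in the given order in a fresh interpreter; True if all succeed."""
    code = "; ".join(f"import {name}" for name in modules)
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # The two orders compared above
    for order in (["pyarrow", "sklearn"], ["sklearn", "pyarrow"]):
        print(order, "ok" if import_order_works(order) else "failed")
```

A failure here only says the import raised for some reason; checking stderr for "static TLS" would be needed to confirm it is this particular error.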

That matches findings from many of the threads linked in the description of this issue. For example, from pytorch/pytorch#2575.

glibc has a table called the DTV. There is a slot for every dlopen'd library with TLS. Its use is not important for this discussion.

The DTV is resizable. However, in older versions of glibc, adding a library with static TLS will not resize the DTV, but do a conservative check that amounts to "have I loaded more than 14 libraries with TLS"...

... changing import order can fix things, because if you change it in a way that loads all your "static TLS" libraries first, then future "dynamic TLS" libraries will resize the DTV like normal

It seems this issue was fixed by a glibc patch in 2014, which eliminates this check and lazily updates the DTV.

It's late in my timezone so I'm going to stop here for tonight. I'll do some more testing soon. In short, I think that to get around this we should:

@jameslamb jameslamb unpinned this issue Jul 12, 2024
ccotter added a commit to ccotter/llvm-project that referenced this issue Aug 1, 2024
Workaround issue where older glibc cannot allocate large TLS
blocks after the program has started running ("cannot allocate memory
in static TLS block").

ref:
microsoft/LightGBM#6509
https://bugzilla.redhat.com/show_bug.cgi?id=1722181