[ci] aarch64 python-package job failing: "cannot allocate memory in static TLS block" #6509
Comments
I recall now that allocation of static TLS was one of the reasons we switched the x86_64 wheels to […]. Similarly, a comment from August 2020 on https://bugzilla.redhat.com/show_bug.cgi?id=1722181 says:

So it's possible that we'd be less likely to see this if we moved […]. They're currently using […].
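For anyone else digging into this: a quick way to check whether a particular shared library asks for static TLS is to look for the STATIC_TLS flag and the size of its TLS segment with `readelf`. A sketch (the libgomp path here is just an example; it will differ per system):

```shell
# does the library request static TLS? (look for a FLAGS: STATIC_TLS line)
readelf -d /usr/lib/aarch64-linux-gnu/libgomp.so.1 | grep -i static_tls

# how much TLS does it reserve? (MemSiz column of the TLS program header)
readelf -Wl /usr/lib/aarch64-linux-gnu/libgomp.so.1 | grep TLS
```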
I just tried a rebuild, and this is still happening: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=16481&view=logs&j=7a417b3a-6502-5a0d-1db8-7ef6155c93de&t=380f8b13-0b2d-5f03-5de0-8353018c7351

Looking more closely at the logs, I noticed that it is happening whenever a […], sometimes […].

So I suspect part of the problem is "if you use […]".
Just spitballing here, but since we're getting conda's libgomp, we could also try installing conda's C++ compiler in that job, and that'd link against that libgomp, which is the one being loaded by numpy and scipy, and maybe that'd solve the issue. WDYT?
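A rough sketch of what that could look like (untested; `cxx-compiler` is conda-forge's compiler metapackage, whose activation scripts export `CC`/`CXX`):

```shell
# install conda-forge's C++ toolchain into the build env
conda install -y -c conda-forge cxx-compiler

# ...rebuild lib_lightgbm.so with that toolchain, then confirm it
# resolves to the env's libgomp rather than a bundled/system copy
ldd lib_lightgbm.so | grep gomp
```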
It looks like […]. It has lots of the same dependencies as […].
Interesting idea! Thanks for catching that conda-forge's […]. I'm nervous about using […].
BUT... that's testable, and it's worth testing. I can try that to see if it'd help.

Another thing I'm going to test: I'm going to try reducing the tests to simply […].
Oh sorry, I didn't realize that job was building wheels. We could also go the other way and drop conda from that job; I've used uv a lot lately and it's great.
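For the sake of discussion, a conda-free setup with uv might look roughly like this (untested; the package list is abbreviated, and the real one would come from the CI env files):

```shell
# create and activate a virtual environment with a pinned Python
pip install uv
uv venv .venv --python 3.11
source .venv/bin/activate

# install test dependencies via uv's pip-compatible interface
uv pip install numpy pandas scikit-learn scipy
```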
I like that idea too! Although it could be pretty involved... I suspect […].

As a start... would you support commenting out this CI job for now, so that we can keep making progress in the repo while we investigate this? With the understanding that this would have to be fixed and working again before we do the next release. I'm really happy with the recent pace of non-maintainer contributions we've been getting, and I'd like to get those people's work merged so we build momentum with them:

@mayer79 and @nicklamiller in particular have been more patient with us than they should have to be, waiting on various CI issues 😅
Sure! That shouldn't block our CI since it's just a packaging problem.
Put up #6517. Thankfully we now have that one macOS job still providing coverage of compiling LightGBM on aarch64.
A couple of thoughts here: […]
Probably no great answer right now, but all the more incentive to work on https://discuss.python.org/t/implementation-variants-rehashing-and-refocusing/54884! |
Thanks @msarahan! We're getting […].

Looking at the log more closely, I don't think this has anything to do with […]:

```shell
# download the exact scikit-learn package installed in the failing env
wget https://anaconda.org/conda-forge/scikit-learn/1.5.1/download/linux-aarch64/scikit-learn-1.5.1-py311haece950_0.conda

# extract it and look for any vendored libgomp
mkdir -p ./tmp
cph extract \
    --dest ./tmp \
    ./scikit-learn-1.5.1-py311haece950_0.conda
find ./tmp -name '*libgomp*'
# (empty)
```

I think I was confused by this log line […]. If you trace back all those relative paths, that's actually […].
Had an idea tonight 💡 Now that GitHub Actions offers a free […].
I was able to reproduce this tonight without involving […].

Ran the following on my M2 mac (which has an aarch64 CPU, so no emulation required). Note it's using exactly the same Docker image used in the CI job being discussed here.

```shell
docker run \
    --rm \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it lightgbm/vsts-agent:manylinux2014_aarch64 \
    bash
```

Then, inside that container:

```shell
# install miniforge and put conda on PATH
curl \
    -sL \
    -o miniforge.sh \
    https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh
sh miniforge.sh -b -p "${HOME}/miniconda3"
export PATH="${HOME}/miniconda3/bin:${PATH}"

# recreate the CI test environment
conda create \
    -y \
    -c conda-forge \
    -n test-env \
    --file ./.ci/conda-envs/ci-core.txt \
    "python=3.11"
source activate test-env

# install the latest release of lightgbm and try importing it
pip install --no-deps 'lightgbm==4.4.0'
python -c "import lightgbm"
```

(error traceback collapsed in the original comment)
It definitely looks like […]:

```shell
conda uninstall --yes scikit-learn

# this succeeds
python -c "import lightgbm"
```

(output of `conda info` collapsed in the original comment)

(output of `conda env export` collapsed in the original comment)
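One more diagnostic that might help here (a sketch, assuming the activated conda env from the repro above, with Python 3.11): look for every copy of `libgomp` in the environment, and check which of scikit-learn's compiled extensions actually link against it.

```shell
# any vendored or env-level copies of libgomp?
find "${CONDA_PREFIX}" -name 'libgomp*'

# which sklearn extension modules link against libgomp?
for so in $(find "${CONDA_PREFIX}/lib/python3.11/site-packages/sklearn" -name '*.so'); do
    ldd "$so" | grep -q gomp && echo "$so"
done
```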
Pre-loading libgomp avoids the error:

```shell
LD_PRELOAD="/root/miniconda3/envs/test-env/lib/libgomp.so.1" \
python -c "import lightgbm; print(lightgbm.__version__)"
# 4.4.0
```

I think the reason for that is best summarized in this 10+ year old thread about MATLAB (https://stackoverflow.com/a/19468365/3986677). The gist of that answer, paraphrasing: glibc hands out static TLS from a fixed-size block at program startup, keeping only a small surplus for libraries that are dlopen()'d later, so pre-loading libgomp puts it in the startup set instead of making it compete for that surplus.
I followed the advice in https://bugzilla.redhat.com/show_bug.cgi?id=1722181 to check how much static TLS is being allocated for each library loaded to satisfy `import lightgbm`.

Get a list of all the shared objects that need to be loaded to load `lightgbm`:

```shell
# log the dynamic loader's library-lookup activity during the import
mkdir -p /opt/LightGBM/ld-logs
LD_DEBUG=libs LD_DEBUG_OUTPUT=/opt/LightGBM/ld-logs/out.txt \
python -c "import lightgbm"
cat /opt/LightGBM/ld-logs/* > /tmp/ld-logs-full.txt
```

Used `readelf` to record the TLS segment size for each of those:

```shell
# for every library path the loader tried, record its TLS segment size (MemSiz)
for l in `grep trying /tmp/ld-logs-full.txt | cut -d '=' -f 2`; do
    if test -f $l; then
        printf "%d bytes ($(realpath $l))\n" $(
            readelf -Wl $l \
            | grep TLS \
            | awk -F ' ' '{ print $6 }'
        ) >> tls-usage.txt
    fi
done

# show the non-zero entries, largest first
cat tls-usage.txt \
    | grep -v '^0' \
    | sort -r -n -u
```

For context on that […]:

```shell
readelf -Wl /root/miniconda3/envs/test-env/lib/libprotobuf.so.25.3.0
```

Hexadecimal representations like that are understood by `printf`:

```shell
printf "%d\n" "0x000024"
# 36
```

OpenMP is not even close to the most expensive in this regard... I wonder if it's just showing up in the error message because it's loaded so late in the process?
I noticed that most of those are pyarrow and its dependencies, which helped me to narrow this down even further from #6509 (comment). Importing either one alone is fine; importing `pyarrow` before `sklearn` fails:

```shell
# no error
python -c "import pyarrow"

# no error
python -c "import sklearn"

# error: cannot allocate memory in static TLS block
python -c "import pyarrow; import sklearn"
```

Just switching the order is sufficient to avoid the error, at least in this case.

```shell
# no error
python -c "import sklearn; import pyarrow"
```

That matches findings from many of the threads linked in the description of this issue. For example, from pytorch/pytorch#2575.
It's late in my timezone so I'm going to stop here for tonight. I'll do some more testing soon. In short, I think that to get around this we should: […]
Workaround issue where older glibc cannot allocate large TLS blocks after the program has started running ("cannot allocate memory in static TLS block"). ref: microsoft/LightGBM#6509 https://bugzilla.redhat.com/show_bug.cgi?id=1722181
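For reference, applying that workaround in a CI script could look something like the following (a sketch based on the `LD_PRELOAD` experiment above; the library path assumes the conda env used in this thread):

```shell
# force libgomp into the static TLS set at process startup,
# before any dlopen()'d extension module needs it
export LD_PRELOAD="${CONDA_PREFIX}/lib/libgomp.so.1"
python -c "import lightgbm; print(lightgbm.__version__)"
```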
Description
For the last few days, I've observed the aarch64 CI job (which we run on an x86_64 box, using QEMU for emulation) failing with errors like the following during test collection:
Reproducible example
This is happening across several different PRs, with changesets that are very unlikely to be causing this, suggesting it's some other change in the environment. For example:
Environment info
N/A
Additional Comments
"TLS" in this error refers to "thread-local storage".
There is a lot of prior discussion on similar issues:

- `libgomp` on Linux: conda-forge/scikit-learn-feedstock#220
- `linux-aarch64` nightlies: dask-contrib/dask-sql#1144

All of those are about using `libgomp` on `aarch64`.

From https://bugzilla.redhat.com/show_bug.cgi?id=1722181: […]

On opencv/opencv#14884, there's some discussion about this specifically being caused by bundled `libgomp` in multiple Python packages, and there are suggestions that importing those libraries earlier (and therefore loading their `libgomp` earlier) can resolve this.

These also have some helpful information: […]