Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update base image to python:3.9 #660

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Update base image to python:3.9 #660

wants to merge 1 commit into from

Conversation

stchris
Copy link
Contributor

@stchris stchris commented Nov 21, 2024

  • This updates the base image for ingest-file from ubuntu:20.04 to python:3.9, which is Debian based.
  • I had to revert to installing tesserocr from source because otherwise we are not able to parse some important image file formats.
  • In the process I enabled running tests in Python dev mode as well as turned on tracemalloc, so we can catch more things during testing, including deprecation warnings etc.

fix failing test (random ordering of dates)

fix and update tesserocr installation
@tillprochaska
Copy link
Contributor

This updates the base image for ingest-file from ubuntu:20.04 to python:3.9, which is Debian based.

Not against it, just curious: Was there a particular reason for this change?

I had to revert to installing tesserocr from source because otherwise we are not able to parse some important image file formats.

Looking at the Dockerfile and history, wasn’t that the case before, too? (Or am I misunderstanding and you researched whether we might be able to switch to prebuilt binaries and the conclusion was that we still cannot?)

@stchris
Copy link
Contributor Author

stchris commented Nov 25, 2024

This updates the base image for ingest-file from ubuntu:20.04 to python:3.9, which is Debian based.

Not against it, just curious: Was there a particular reason for this change?

My thinking here was that the official images will make it easier to upgrade down the line. Also because not all python versions are packaged on all distros, so we might have had to resort to PPAs and such.

I had to revert to installing tesserocr from source because otherwise we are not able to parse some important image file formats.

Looking at the Dockerfile and history, wasn’t that the case before, too? (Or am I misunderstanding and you researched whether we might be able to switch to prebuilt binaries and the conclusion was that we still cannot?)

Sorry I didn't fully explain it, but it's the latter. I was hoping that we can skip this exception of building tesserocr ourselves, but it looks like we can't and that's fine, I guess.

&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
&& localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8
RUN echo "deb http://http.us.debian.org/debian stable non-free" >/etc/apt/sources.list.d/nonfree.list \
&& apt-get -qq -y update \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with the respective PR in the Aleph repo, here’s a list of packages that should already be installed in the base image:

  • ca-certificates
  • git
  • pkg-config
  • libxslt1-dev
  • libpq-dev
  • zlib1g-dev
  • libicu-dev
  • libxml2-dev
  • imagemagick
  • imagemagick-common (requires only imagemagick-6-common which is installed by default)
  • libtiff5-dev (requires only libtiff-dev which is installed by default)
  • libjpeg-dev
  • libfreetype6-dev
  • libwebp-dev
  • libopenjp2-7-dev
  • libpng-dev
  • fonts-dejavu-core

@@ -12,7 +12,7 @@ def test_audio(self):
self.assertEqual(entity.first("generator"), "com.apple.VoiceMemos (iOS 11.4)")
self.assertEqual(
entity.first("authoredAt"),
datetime.datetime(2018, 6, 20, 12, 9, 42).isoformat(),
datetime.datetime(2018, 6, 20, 12, 9, 28).isoformat(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit surprising, what’s the backstory here?

spacy==3.6.1
numpy<2.0.0 # pinned because otherwise spacy requires an incompatible numpy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference because I was curious what this is about: explosion/spaCy#13528

Seems like the issue is fixed in the latest release, so as soon as we upgrade spacy to 3.8.2 or later, this isn’t needed anymore. (Was able to confirm that the build succeeds with that version.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try to update spacy here and now, thanks!

@tillprochaska
Copy link
Contributor

My thinking here was that the official images will make it easier to upgrade down the line. Also because not all python versions are packaged on all distros, so we might have had to resort to PPAs and such.

Makes sense, thanks for clarifying! One thing we might want to consider is to pin the Debian version, e.g. 3.9-bookworm. (My understanding is that 3.9 is just an alias, and in the future might point to an image that is based on a different Debian version.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants