diff --git a/content/dependencies.rst b/content/dependencies.rst index a5279fda..d848f99e 100644 --- a/content/dependencies.rst +++ b/content/dependencies.rst @@ -7,10 +7,8 @@ Dependency management - Do you expect your code to work in one year? Five? What if it uses ``numpy`` or ``tensorflow`` or ``random-github-package`` ? - - How can my collaborators get the same results as me? What about - future me? - - How can my collaborators easily install my codes with all the necessary dependencies? - - How can I make it easy for my colleagues to reproduce my results? + - How can my collaborators easily install my code with all the necessary dependencies? + - How can I make it easy for others (and future me) to reproduce my results? - How can I work on two (or more) projects with different and conflicting dependencies? .. objectives:: @@ -29,8 +27,8 @@ How do you track dependencies of your project? -Exercises 1 ------------ +Exercise 1 +---------- .. challenge:: Dependencies-1: Discuss dependency management (5 min) @@ -48,14 +46,14 @@ Exercises 1 .. _pypi: -PyPI (The Python Package Index) and (Ana)conda ----------------------------------------------- +PyPI (The Python Package Index) and conda ecosystem +--------------------------------------------------- -- PyPI (The Python Package Index) and Conda are popular packaging/dependency - management tools. +PyPI (The Python Package Index) and conda are popular packaging/dependency +management tools: - When you run ``pip install`` you typically install from `PyPI - `__ but you can also ``pip install`` from a GitHub + `__, but you can also ``pip install`` from a GitHub repository and similar. - When you run ``conda install`` you typically install from `Anaconda Cloud @@ -102,90 +100,173 @@ Why are there two ecosystems? - **Cons:** - Package creation is harder -.. admonition:: Anaconda vs. miniconda vs. conda vs. mamba vs. Anaconda Cloud vs. conda-forge vs. 
miniforge - :class: dropdown - Package sources: +Conda ecosystem explained +------------------------- + +.. warning:: + + Anaconda has recently changed its licensing terms, which affects its + use in a professional setting. This caused an uproar in academia, + and Anaconda modified their position in + `this article `__. + + The main points of the article are: + + - conda (installation tool) and community channels (e.g. conda-forge) + are free to use. + - Anaconda repository and **Anaconda's channels in the community repository** + are free for universities and companies with fewer than 200 employees. + Non-university research institutions and national laboratories need + licenses. + - Miniconda is free as long as it does not download packages from Anaconda's channels. + - Miniforge is not related to Anaconda, so it is free. + + For ease of sharing environment files, we recommend using + Miniforge to create the environments and using conda-forge as the main + channel that provides software. + +- Package repositories: + + - `Anaconda Community Repository (anaconda.org) `__, + also known as Anaconda Cloud, is a package cloud maintained by Anaconda Inc. + It is a repository that houses mirrors of Anaconda's channels and + community-maintained channels. + - `Anaconda Repository (repo.anaconda.com) `__ + houses Anaconda's own proprietary software channels. + +- Major package channels: + + - Anaconda's proprietary channels: ``main``, ``r``, ``msys2`` and ``anaconda``. + These are sometimes called ``defaults``. + - `conda-forge `__ is the largest open source + community channel. It has over 27,000 packages that include open-source + versions of packages in Anaconda's channels. + +- Package distributions and installers: + + - `Anaconda `__ is a distribution of conda packages + made by Anaconda Inc. When using Anaconda, remember to check that your + situation complies with their licensing terms. + - `Miniconda `__ is a minimal installer + maintained by Anaconda Inc. 
that has conda and uses Anaconda's channels + by default. Check the licensing terms when using these packages. + - `Miniforge `__ is an open-source + Miniconda replacement that uses conda-forge as the default channel. + Contains mamba as well. + - `micromamba `__ + is a tiny stand-alone version of the mamba package manager written in C++. + It can be used to create and manage environments without installing + a base environment and Python. It is very useful if you want to automate + environment creation or want a more lightweight tool. + +- Package managers: + + - `conda `__ is a package and environment management system + used by Anaconda. It is an open source project maintained by Anaconda Inc. + - `mamba `__ is a drop-in + replacement for conda. It used to be much faster than conda due to a better + dependency solver, but nowadays conda + `also uses the same solver `__. + It still has some UI improvements. + +Exercise 2 +---------- - - `Anaconda Cloud `__ - a package cloud maintained by - Anaconda Inc. It is a free repository that houses conda package channels. - - `Conda-forge `__ - the largest open source - community channel. +.. challenge:: Dependencies-2: Package language detective (2 min) - Package managers: + Think about the following sentences: - - `conda `__ - a package and environment management system - used by Anaconda. It is an open source project maintained by Anaconda Inc.. - - `mamba `__ - a drop in - replacement for conda that does installations faster. + 1. Yes, you can install my package with pip from GitHub. + 2. I forgot to specify my channels, so my packages came from the defaults. + 3. I have a Miniforge installation and I use mamba to create my environments. + + What hidden information is given in these sentences? - Package manager deployments: + .. solution:: - - `Anaconda `__ - a distribution of conda packages - made by Anaconda Inc.. It is free for academic and non-commercial use. 
- - `Miniconda `__ - a minimal installer that - has conda and uses - `default channels `__ - by default. - - `Miniforge `__ - Miniconda replacement - that uses conda-forge as the default channel. Contains mamba as well. + 1. The package is provided as a pip package. However, it is most likely + not uploaded to PyPI as it needs to be installed from a repository. + 2. In this case the person saying the sentence is most likely using + Anaconda or Miniconda because these tools use the ``defaults``-channel + as the default channel. They probably meant to install software from + conda-forge, but forgot to specify the channel. + 3. Miniforge uses conda-forge as the default channel. So unless some + other channel has been specified, packages installed with these + tools come from conda-forge as well. +Python environments +------------------- -In the packaging episode we will meet PyPI and Anaconda again and practice how -to share Python packages. +An **environment** is basically a folder that contains a Python +interpreter and other Python packages in a folder structure similar +to the operating system's folder structure. +These environments can be created by the +`venv-module `__ in base +Python, by a pip package called +`virtualenv `_ +or by conda/mamba. -Creating isolated environments ------------------------------- +Using these environments is highly recommended because they solve the +following problems: -An **isolated environment** allows installing packages without -affecting the rest of your operating system or any other projects. -Isolated environments solve a couple of problems: +- Installing packages into environments won't modify system packages. - You can install specific versions of packages into them. -- You can create one environment for each project and you won't encounter any - problems if the two projects require different versions of packages. 
+- You can create an environment for each project and you won't encounter any + problems if different projects require different versions of packages. - If you make some mistake and install something you did not want or need, you can remove the environment and create a new one. -- You can export a list of packages in an environment and share it with your - code. This makes replicating your results easier. +- Others can replicate your environment by reusing the same specification + that you used to create the environment. -Exercises 2 ------------ +Creating Python environments +---------------------------- -.. challenge:: Dependencies-2: Create a conda environment (15 min) +.. tabs:: - .. highlight:: console + .. group-tab:: Creating conda environment from environment.yml - Chloe just joined your team and will be working on her Master Thesis. She is - quite familiar with Python, still finishing some Python assignments (due in a - few weeks) and you give her a Python code for analyzing and plotting your - favorite data. The thing is that your Python code has been developed by - another Master Student (from last year) and requires a older version of - Numpy (1.24.3) and Matplotlib (3.7.2) (otherwise the code fails). The code - could probably work with a recent version of Python but has been validated with - Python 3.10 only. Having no idea what the code does, she decides that the best - approach is to **create an isolated environment** with the same dependencies - that were used previously. This will give her a baseline for future upgrade and - developments. + Record channels and packages you need to a file called + ``environment.yml``: + + .. 
code-block:: yaml + + name: my-environment + channels: + - conda-forge + dependencies: + - python + - numpy + - matplotlib + - pandas + + The ``name`` describes the name of the environment, + ``channels``-list tells which channels should be searched for packages + (channel priority goes from top to bottom) and ``dependencies``-list + contains all packages that are needed. + + Using this file you can now create an environment with: + + .. code-block:: console - For this first exercise, we will be using conda for creating an isolated environment. + $ conda env create --file environment.yml - 1. Create a conda environment:: + .. admonition:: You can also use mamba - $ conda create --name python310-env python=3.10 numpy=1.24.3 matplotlib=3.7.2 + If you have mamba installed, you can replace conda + with mamba in each command. - Conda environments can also be managed (create, update, delete) from the - **anaconda-navigator**. Check out the corresponding documentation `here - `_. + You can then activate the environment with: - 2. Activate the environment:: + .. code-block:: console - $ conda activate python310-env + $ conda activate my-environment .. callout:: conda activate versus source activate @@ -199,204 +280,387 @@ Exercises 2 not having ``pip`` installed in a conda environment which results ``pip`` from main environment to be used instead. - You can always try:: + You can then check e.g. installed versions of Python and ``numpy``: + + .. code-block:: console + + $ python -c 'import sys; import numpy; print(f"Python version: {sys.version}\nNumPy version: {numpy.__version__}")' + Python version: 3.13.0 | packaged by conda-forge | (main, Oct 8 2024, 20:04:32) [GCC 13.3.0] + NumPy version: 2.1.2 + + To deactivate the environment, you can run: + + .. code-block:: console + + $ conda deactivate - $ source activate python310-env - 3. 
Open a Python console and check that you have effectively the - right version for each package: + Record packages you need to a file called + ``requirements.txt``: - .. code-block:: python + .. code-block:: text - import numpy - import matplotlib + numpy + matplotlib + pandas - print('Numpy version: ', numpy.__version__) - print('Matplotlib version: ', matplotlib.__version__) + This is simply a text file that lists all of the packages that + you need. It is usually called ``requirements.txt``. - Or use the one-liner if you have access to a terminal like bash: + Now you can create a virtual environment with: .. code-block:: console - $ python -c 'import numpy; print(numpy.__version__)' - $ python -c 'import matplotlib;print(matplotlib.__version__)' + $ python -m venv my-environment - 4. Deactivate the environment:: + You can then activate the environment by sourcing a file called + ``activate``. - $ conda deactivate + - **Linux/Mac OSX**: + .. code-block:: console - 5. Check Numpy and Matplotlib versions in the default environment to make - sure they are different from **python310-env**. + $ source my-environment/bin/activate - There is no need to specify the conda environment when using deactivate. It - deactivates the current environment. + - **Windows**: most likely you can find it in the Scripts folder. + Now the environment should be active. You can then install packages + listed in ``requirements.txt`` with -Exercises 3 ------------ + .. code-block:: console -.. challenge:: Dependencies-3: Create a virtualenv (15 min, optional) + $ python -m pip install -r requirements.txt - This is the same exercise as before but we use virtualenv rather than conda. + You can then check e.g. installed versions of Python and ``numpy``: + .. code-block:: console - 1. 
Create a venv:: + + $ python -c 'import sys; import numpy; print(f"Python version: {sys.version}\nNumPy version: {numpy.__version__}")' + Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] + NumPy version: 2.1.2 - $ python3 -m venv scicomp + To deactivate the environment, you can run: - Here ``scicomp`` is the name of the virtual environment. It creates a new - folder called ``scicomp``. + .. code-block:: console - 2. Activate it. To activate your newly created virtual environment locate the - script called ``activate`` and *source* it. + $ deactivate - - **Linux/Mac-OSX**: look at ``bin`` folder in the ``scicomp`` folder:: - $ source scicomp/bin/activate +.. admonition:: Creating environments without environment.yml/requirements.txt - - **Windows**: most likely you can find it in the ``Scripts`` folder. + It is possible to create environments with manual commands, but this + is highly discouraged for continuous use. - 3. Install Numpy 1.24.3 and Matplotlib 3.7.2 into the virtual environment:: + Firstly, replicating the environment becomes much harder. - $ pip install numpy==1.24.3 - $ pip install matplotlib==3.7.2 + Secondly, running package installation commands manually in an + environment can result in unexpected behaviour such as: - 4. Deactivate it:: + - The package manager might remove or update already installed packages. + - The package manager might not find a package that works with the already + installed packages. - $ deactivate + The reason for this behavior is that package managers do not know what + commands you ran in the past. They only know the state of your environment + and what you're currently telling them to install. -Problems that might happen with manual installation --------------------------------------------------- + These kinds of problems can be mitigated by recording dependencies in an + ``environment.yml`` or ``requirements.txt`` and using the relevant + package manager to update / recreate the environment. 
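Since an activated environment is, at bottom, just a different interpreter prefix, you can verify from Python itself whether one is active. The following is an illustrative, standard-library-only sketch (not part of the lesson's exercises); it relies only on the documented behaviour that ``sys.prefix`` differs from ``sys.base_prefix`` inside a venv/virtualenv:

```python
import sys

def environment_info():
    """Summarize which Python environment the current interpreter runs in."""
    # Inside an activated venv/virtualenv, sys.prefix points at the
    # environment folder, while sys.base_prefix still points at the
    # base Python installation; outside an environment they are equal.
    return {
        "prefix": sys.prefix,
        "base_prefix": sys.base_prefix,
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }

info = environment_info()
print(f"Interpreter prefix: {info['prefix']}")
print(f"Virtual environment active: {info['in_virtualenv']}")
```

Running this before and after activating an environment is a quick sanity check that the shell really switched interpreters. Note that this detects venv/virtualenv specifically: a conda environment ships its own full interpreter (so the two prefixes match there), and you would instead check that ``sys.prefix`` points into the expected environment folder.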
+ + +Exercise 3 +---------- + +.. challenge:: Dependencies-3: Create a Python environment (15 min) + + Use conda or venv to create the environment presented in the + example. + + +Adding more packages to existing environments +--------------------------------------------- + +Quite often when you're creating a new environment you might forget +to add all relevant packages to ``environment.yml`` or +``requirements.txt``. + +In these cases the best practice is to add the missing packages to +``environment.yml`` or ``requirements.txt`` and to update the environment. + +.. tabs:: + + .. group-tab:: Adding new packages to a conda environment + + Add new packages that you want to install to + ``dependencies`` in + ``environment.yml``. + + Afterwards, run + + .. code-block:: console + + $ conda env update --file environment.yml + + to update the environment. + + .. group-tab:: Adding new packages to a virtual environment + + Add new packages that you want to install to + ``requirements.txt``. + + Afterwards, activate the environment and re-run + + .. code-block:: console + + $ pip install -r requirements.txt + + to update the environment. + +Sometimes the new packages are incompatible with the ones already +in the environment. Maybe they have dependencies that are +not satisfied by the current versions, or maybe the package you're installing +is incompatible with the ones already installed. In these cases the safest approach +is to re-create the environment. This lets the dependency solver +start from a clean slate, with a full picture of which packages +need to be installed. + + +Pinning package versions +------------------------ + +Sometimes your code will only work with a certain range of dependencies. +Maybe you use a function or a class that was introduced in a later version +or a newer version has modified its API. 
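The meaning of a compatible-release pin such as ``pandas~=2.1`` can be made concrete with a tiny illustrative helper. This function is hypothetical (it is not part of the lesson or of any packaging library) and only sketches the rule from the version-specifier specification, where ``~=X.Y`` is shorthand for ``>=X.Y, ==X.*``:

```python
def expand_compatible_release(spec):
    """Expand a compatible-release pin such as '~=2.1' into explicit bounds.

    Illustration only: '~=X.Y' means '>=X.Y, ==X.*' -- the final listed
    segment may grow, everything before it stays fixed.
    """
    version = spec.removeprefix("~=")    # e.g. "2.1" or "2.1.3"
    parts = version.split(".")
    upper = parts[:-1]                   # drop the last segment ...
    upper[-1] = str(int(upper[-1]) + 1)  # ... and bump the one before it
    return f">={version},<{'.'.join(upper)}"

print(expand_compatible_release("~=2.1"))    # same reach as 'pandas~=2.1'
print(expand_compatible_release("~=2.1.3"))  # stricter: only 2.1.x, from 2.1.3 up
```

So ``~=2.1`` allows any 2.x release from 2.1 onwards, while ``~=2.1.3`` only allows patch releases of the 2.1 series. Choosing between the two is exactly the pin-granularity trade-off this section discusses.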
-- The installer might remove an already installed packages or update them. -- The installer might not find a package that works with already installed packages. +In these situations, you'll want to **pin the package versions**. -The reason for this is that the installer does not know what commands -you ran in the past. It only knows the state of your environment and what -you're currently telling it to install. +For example, there is usually a delay between doing research and that +research being published. During this time packages used in the research +might update and reviewers or interested researchers might not be able +to replicate your results or run your code if new versions are not +compatible. -These kinds of problems can be mitigated by recording dependencies in an -``environment.yml`` or ``requirements.txt``. +.. tabs:: -Recording dependencies ----------------------- + .. group-tab:: environment.yml with pinned versions -There are two standard ways to record dependencies for Python projects: -``requirements.txt`` and ``environment.yml``. + When pinning versions in ``environment.yml`` one can use a + variety of comparison operators: -``requirements.txt`` (used by virtual environment) is a simple -text file which looks like this: + .. code-block:: yaml -.. code-block:: none + name: my-environment + channels: + - conda-forge + dependencies: + # Use python 3.11 + - python=3.11 + # numpy that is bigger or equal than version 1, but less than version 2 + - numpy>=1,<2 + # matplotlib greater than 3.8.2 + - matplotlib>3.8.2 + # pandas that is compatible with 2.1 + - pandas~=2.1 - numpy - matplotlib - pandas - scipy + .. group-tab:: requirements.txt with pinned versions -``environment.yml`` (for conda) is a yaml-file which looks like this: + When pinning versions in ``requirements.txt`` one can use a + variety of comparison operators: -.. code-block:: yaml + .. 
code-block:: text - name: my-environment - channels: - - defaults - dependencies: - - numpy - - matplotlib - - pandas - - scipy + # numpy greater than or equal to version 1, but less than version 2 + numpy>=1,<2 + # matplotlib greater than 3.8.2 + matplotlib>3.8.2 + # pandas that is compatible with 2.1 + pandas~=2.1 -If you need to recreate the exact same environment later on, it can be very -useful to **pin dependencies** to certain versions. For example, there -is usually a delay between doing research and that research being published. -During this time the dependencies might update and reviewers or interested -researchers might not be able to replicate your results or run your code. +For more information on all possible specifications, see +`this page `__ +from Python's packaging guide. -.. callout:: Conda channels +See also: https://coderefinery.github.io/reproducible-research/dependencies/ - - Sometimes the package version you would need does not seem to be - available. You may have to select another `conda channel - `__. +.. admonition:: To pin or not to pin? That is the question. - Most popular channels are - `defaults `__, - which is managed by - Anaconda Inc. and `conda-forge `__, - which is managed by the open source community. These two channels are - mutually incompatible. + Pinning versions means that you pin the environment to + **that instance in time** when these specific versions of the + dependencies were being used. - Channel priority goes from top to bottom. + + This can be good for single-use applications, like replicating a research + paper, but it is usually bad for the long-term maintainability of the software. + + Pinning to major versions or to compatible versions is usually the best + practice as that allows your software to co-exist with other packages even + when they are updated. 
+ + Remember that at some point in time you **will** face a situation where + newer versions of the dependencies are no longer compatible with your + software. At this point you'll have to update your software to use the newer + versions or to lock it into a place in time. + + +Exporting package versions from an existing environment +------------------------------------------------------- + +Sometimes you want to create a file that contains the exact versions +of packages in the environment. This is often called *exporting* or +*freezing* an environment. + +Doing this will create a file that describes the installed +packages, but it won't tell which packages are **the most important +ones** and which ones are just dependencies for those packages. + +Manually created ``environment.yml`` or ``requirements.txt`` files +are in most cases better than automatically created ones because they +show which packages are the important ones needed by the software. + +.. tabs:: + + .. group-tab:: Exporting environment.yml from a conda environment + + Once you have activated the environment, you can run + + .. code-block:: console + + $ conda env export > environment.yml + + If package build versions are not relevant for the use case, + one can also run + + .. code-block:: console + + $ conda env export --no-builds > environment.yml + + which leaves out the package build versions. + + Alternatively one can also run + + .. code-block:: console + + $ conda env export --from-history > environment.yml + + which creates the ``environment.yml``-file based on + what packages were asked to be installed. + + .. admonition:: conda-lock + + For even more reproducibility, you should try out + `conda-lock `__. + It turns your ``environment.yml`` into a ``conda.lock`` + that has all information needed to **exactly** create + the same environment. You can use ``conda.lock``-files + in the same way as ``environment.yml`` when you create + an environment: + + .. 
code-block:: console + + $ conda env create --file conda.lock + + .. group-tab:: Exporting requirements.txt from a virtual environment + + Once you have activated the environment, you can run + + .. code-block:: console + + $ pip freeze > requirements.txt + + + +Exercise 4 +---------- + +.. challenge:: Dependencies-4: Export an environment (15 min) + + Export the environment you previously created. + + +Additional tips and tricks +-------------------------- + +.. tabs:: + + .. group-tab:: Creating a conda environment from requirements.txt + + conda supports installing an environment from ``requirements.txt``. + + .. code-block:: console + + $ conda env create --name my-environment --channel conda-forge --file requirements.txt + + To create an ``environment.yml`` from this environment that mimics + the ``requirements.txt``, activate it and run + + .. code-block:: console + $ conda env export --from-history > environment.yml -Here are the two files again, but this time with versions pinned: + .. group-tab:: Adding pip packages into conda environments -``requirements.txt`` with versions: + conda supports installing pip packages in an ``environment.yml``. -.. code-block:: none + Usually this is done to add those packages that are missing + from conda channels. - numpy==1.24.3 - matplotlib==3.7.2 - pandas==2.0.3 - scipy==1.10.1 + To do this you'll want to install ``pip`` into the environment + and then add pip-installed packages to a list called ``pip``. -``environment.yml`` with versions: + See this example ``environment.yml``: -.. code-block:: yaml + .. code-block:: yaml - name: my-environment - channels: - - defaults - dependencies: - - python=3.10 - - numpy=1.24.3 - - matplotlib=3.7.2 - - pandas=2.0.3 - - scipy=1.10.1 + name: my-environment + channels: + - conda-forge + dependencies: + - python + - pip + - pip: + - numpy + - matplotlib + - pandas -- Conda can also read and write ``requirements.txt``. -- ``requirements.txt`` can also refer to packages on Github. 
-``environment.yml`` can also contain a ``pip`` section. - See also: https://coderefinery.github.io/reproducible-research/dependencies/ . +One can even add a full ``requirements.txt`` to the environment: .. admonition:: Putting too strict requirements can be counter-productive + .. code-block:: yaml - Putting exact version numbers can be good for single-use applications, - like replicating a research paper, but it is usually bad for long-term + name: my-environment - maintenance because the program won't update at the same time as it's + channels: - requirements do. + - conda-forge + dependencies: - If you're creating a library, adding strict dependencies can also create + - python - a situation where the library cannot coexist with another library. + - pip + - pip: + - "-r requirements.txt" -Dependencies 4 -------------- + Do note that in both methods the pip packages come from PyPI -.. challenge:: Dependencies-4: Freeze an environment (15 min) + and not from conda channels. The installation of these packages + is done after the conda environment is created and this can also - - Create the file ``environment.yml`` or ``requirements.txt`` + remove or update conda packages installed previously. - - Create an environment based on these dependencies: - - Conda: ``$ conda env create --file environment.yml`` - - Virtual environment: First create and activate, then ``$ pip install -r requirements.txt`` + .. group-tab:: Installing pip packages from GitHub - - Freeze the environment: - - Conda: ``$ conda env export > environment.yml`` + Packages available in GitHub or other repositories - - Virtual environment: ``$ pip freeze > requirements.txt`` + can be given as a URL in ``requirements.txt``. - - Have a look at the generated ("frozen") file. + For example, to install a development version of the + `black code formatter `__, one can + write the following ``requirements.txt``. .. 
admonition:: Hint: Updating packages from dependency files + .. code-block:: text - Instead of installing packages with ``$ pip install somepackage``, + git+https://github.com/psf/black - you can add ``somepackage`` to ``requirements.txt`` and re-run - ``$ pip install -r requirements.txt``. + or - With conda, you can add the package to ``environment.yml`` and + .. code-block:: text - run ``$ conda env update --file environment.yml`` + https://github.com/psf/black/archive/master.zip + The first one uses git to clone the repository; the second downloads the + zip archive of the repository. How to communicate the dependencies as part of a report/thesis/publication @@ -417,7 +681,7 @@ Version pinning for package creators ------------------------------------ We will talk about packaging in a different session but when you create a library and package -projects, you express dependencies either in ``setup.py`` or ``pyproject.toml`` +projects, you express dependencies either in ``pyproject.toml`` (or ``setup.py``) (PyPI) or ``meta.yaml`` (conda). These dependencies will then be used by either other libraries (who in turn @@ -488,6 +752,7 @@ Other tools for dependency management: - `micromamba `__: tiny version of Mamba as a static C++ executable. Does not need base environment or Python for installing an environment. +- `pixi `__: a package management tool which builds upon the foundation of the conda ecosystem. Other resources: @@ -496,6 +761,7 @@ Other resources: .. keypoints:: + - If somebody asks you what dependencies your code has, you should be able to answer this question **with a file**. - Install dependencies by first recording them in ``requirements.txt`` or ``environment.yml`` and install using these files, then you have a trace. - Use isolated environments and avoid installing packages system-wide.