
Python

Page maintainer: Patrick Bos @egpbos

Python is the "dynamic language of choice" of the Netherlands eScience Center. We use it for data analysis and data science projects with the SciPy stack and Jupyter notebooks, and for many other types of projects: workflow management, visualization, NLP, web-based tools and much more. It is a good default choice for many kinds of projects thanks to its generic nature, its large and broad ecosystem of third-party modules, and its compact syntax, which allows for rapid prototyping. It is not the language of maximum performance, although in many cases performance-critical components can easily be replaced by modules written in faster, compiled languages like C(++) or Cython.

The philosophy of Python is summarized in the Zen of Python. In Python, this text can be retrieved with the import this command.

Project setup

When starting a new Python project, consider using our Python template. This template provides a basic project structure, so you can spend less time setting up and configuring your new Python packages, and comply with the software guide right from the start.

Use Python 3, avoid 2

Python 2 and Python 3 co-existed for a long time, but as of January 2020 development of Python 2 is officially abandoned, meaning Python 2 will no longer be improved, even in case of security issues. If you are creating a new package, use Python 3. It is possible to write Python that is compatible with both Python 2 and Python 3 (e.g. using Six), but only do this when you are absolutely sure that your package won't be used otherwise. If you need Python 2 because of old, incompatible Python 2 libraries, strongly consider upgrading those libraries to Python 3 or replacing them altogether. Building and/or using Python 2 is arguably discouraged even more than, say, using Fortran 77, since at least Fortran 77 compilers are still being maintained.

Learning Python

  • A popular way to learn Python is by doing it the hard way at http://learnpythonthehardway.org/
  • Using pylint and yapf while learning Python is an easy way to get familiar with best practices and commonly used coding styles

Dependencies and package management

To install Python packages use pip or conda (or both, see also what is the difference between pip and conda?).

If you are planning on distributing your code at a later stage, be aware that your choice of package management may affect your packaging process. See Building and packaging for more info.

Use virtual environments

We strongly recommend creating isolated "virtual environments" for each Python project. These can be created with venv or with conda. Advantages over installing packages system-wide or in a single user folder:

  • Lets you install Python modules without root privileges.
  • Contains all Python dependencies, so the environment keeps working after a system upgrade.
  • Keeps the environment for each project clean, so you don't get more than you need (and can easily reproduce that minimal working setup).
  • Lets you select the Python version per environment, so you can test code compatibility between Python versions.

Pip + a virtual environment

If you don't want to use conda, create isolated Python environments with the standard library venv module. If you are still using Python 2, virtualenv and virtualenvwrapper can be used instead.

With venv and virtualenv, pip is used to install all dependencies. An increasing number of packages are using wheel, so pip downloads and installs them as binaries. This means they have no build dependencies and are much faster to install.
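For illustration, a typical workflow with venv and pip might look as follows; the environment name .venv and the numpy dependency are just examples:

```shell
python3 -m venv .venv         # create the environment in ./.venv
source .venv/bin/activate     # on Windows: .venv\Scripts\activate
pip install numpy             # installs into the environment only
```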

If the installation of a package fails because of its non-Python extensions or system library dependencies and you are not root, you could switch to conda (see below).

Conda

Conda can be used instead of venv and pip, since it is both an environment manager and a package manager. It easily installs binary dependencies, like Python itself or system libraries. Installation of packages that do not use wheel but contain a lot of non-Python code is much faster with Conda than with pip, because Conda does not compile the package; it only downloads precompiled packages. The disadvantage of Conda is that the package needs to have a Conda build recipe. Many Conda build recipes already exist, but they are less common than the setuptools configuration that practically all Python packages have.

There are two main distributions of Conda: Anaconda and Miniconda. Anaconda is large and contains a lot of common packages, like numpy and matplotlib, whereas Miniconda is very lightweight and only contains Python. If you need more, the conda command acts as a package manager for Python packages. If installation with the conda command is too slow for your purposes, it is recommended that you use mamba instead.
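For illustration, a typical Conda workflow might look as follows; the environment name and versions are just examples:

```shell
conda create --name myproject python=3.11   # environment with its own Python
conda activate myproject
conda install numpy                         # or: mamba install numpy
```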

For environments where you do not have admin rights (e.g. DAS-6), either Anaconda or Miniconda is highly recommended, since the installation is very straightforward and the installation of packages through Conda is very robust.

A possible downside of Anaconda is that it is offered by a commercial supplier, but we don't foresee any vendor lock-in issues, because all packages are open source and can still be obtained elsewhere. Do note that since 2020, Anaconda has started to charge large institutes for downloading packages from their main channel (called the default channel) through conda. This does not apply to universities and most research institutes, but it could apply to some government institutes that also perform research, and it definitely applies to large for-profit companies. Be aware of this when choosing the distribution channel for your package. An alternative installer that avoids this problem altogether, because it only installs packages from conda-forge by default, is miniforge. There is also a mambaforge variant that uses the faster mamba by default.

Building and packaging code

Making an installable package

To create an installable Python package, use the setuptools module. This involves creating two files: setup.cfg and pyproject.toml. Our Python template already does this for you.

setup.cfg is the primary location where you should list your dependencies; use the install_requires argument to list them. Keep version constraints to a minimum; use, in order of descending preference: no constraints, lower bounds, lower + upper bounds, exact versions. Use of requirements.txt is discouraged, unless necessary for something specific, see the discussion here. It is possible to list the currently installed packages with pip freeze or conda list, but note that this is not ideal for filling in setup.cfg, because it also lists the dependencies of your dependencies. It is better to keep track of your project's direct dependencies from the start. Another quick way to find all direct dependencies is to run your code (for instance your test suite) in a clean environment and install the missing dependencies one by one, as reported by the ensuing errors.

Most other configuration should also be in setup.cfg. pyproject.toml can be used to specify the build system, i.e. setuptools itself.
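For illustration, here is a minimal sketch of the two files; the package name, version and dependencies are placeholders, not recommendations:

```ini
# setup.cfg (a minimal sketch)
[metadata]
name = my_package
version = 0.1.0

[options]
packages = find:
install_requires =
    numpy
    requests>=2.0
```

```toml
# pyproject.toml: declare the build system, i.e. setuptools itself
[build-system]
requires = ["setuptools>=46.4", "wheel"]
build-backend = "setuptools.build_meta"
```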

It's possible that in the future all configuration will move from setup.cfg to pyproject.toml, but as yet this is not common practice. Many tools, like pytest and mypy, already support configuration in pyproject.toml. The Python build system is still very much in flux, though, so be sure to look up current practices in authoritative blogs like this one. One important thing to note is that direct use of setup.py has been officially deprecated, so migrate away from it.

Once setup.cfg is written, your package can be installed with

```shell
pip install -e .
```

The -e flag will install your package in editable mode, i.e. it will create a symlink to your package in the installation location instead of copying the package. This is convenient when developing, because any changes you make to the source code will immediately be available for use in the installed version.

Set up continuous integration to test your installation setup. Use pyroma (which can be run as part of prospector) as a linter for your installation configuration.

Packaging and distributing your package

For packaging your code, you can either use pip or conda. Neither is better than the other -- they are different; use the one that is more suitable for your project. pip may be more suitable for distributing pure Python packages, and it provides some support for binary dependencies using wheels. conda may be more suitable when you have external dependencies which cannot be packaged in a wheel.

  • Build and upload your package to the Python Package Index (PyPI) so it can be installed with pip.

    • Either do this manually by using twine (tutorial; see the sketch after this list),
    • Or configure GitHub Actions to do it automatically for each release: see this example workflow in DIANNA.
    • Additional guidelines:
      • Packages should be uploaded to PyPI using your own account
      • For packages developed in a team or organization, it is recommended that you create a team or organizational account on PyPI and add that account as a collaborator with the owner role. This will allow your team or organization to maintain the package even if individual contributors at some point move on to do other things. At the Netherlands eScience Center, we are a fairly small organization, so we use a single backup account (nlesc).
      • When distributing code through PyPI, non-Python files (such as requirements.txt) will not be packaged automatically; you need to add them to a MANIFEST.in file.
      • To test whether your distribution will work correctly before uploading to PyPI, you can run python -m build in the root of your repository. Then try installing your package with pip install dist/<your_package>.tar.gz.
      • python -m build will also build Python wheels, the current standard for distributing Python packages. This will work out of the box for pure Python code, without C extensions. If C extensions are used, each OS needs to have its own wheel. The manylinux Docker images can be used for building wheels compatible with multiple Linux distributions. Wheel building can be automated using GitHub Actions or another CI solution, where you can build on all three major platforms using a build matrix.
  • Build using conda

    • Make use of conda-forge whenever possible, since it provides many automated build services that save you tons of work, compared to using your own conda repository. It also has a very active community for when you need help.
    • Use BioConda or custom channels (hosted on GitHub) as alternatives if need be.
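As a sketch of the manual PyPI route mentioned in the list above (assuming the build and twine packages are installed):

```shell
python -m build        # creates a source distribution and a wheel in dist/
twine check dist/*     # validate the package metadata
twine upload dist/*    # upload to PyPI (asks for your credentials)
```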

Editors and IDEs

Every major text editor supports Python, either natively or through plugins. At the Netherlands eScience Center, some popular editors or IDEs are:

  • vscode holds the middle ground between a lightweight text editor and a full-fledged language-dedicated IDE.
  • vim or emacs (don't forget to install plugins to get the most out of these two), two versatile classic powertools that can also be used through a remote SSH connection when needed.
  • JetBrains PyCharm is the Python-specific IDE of choice. PyCharm Community Edition is free and open source; the source code is available in the python folder of the IntelliJ repository.

Coding style conventions

The style guide for Python code is PEP8 and for docstrings it is PEP257. We highly recommend following these conventions, as they are widely agreed upon to improve readability. To make following them significantly easier, we recommend using a linter.

Many linters exist for Python. prospector is a tool for running a suite of linters; it supports, among others, pylint, pycodestyle, pydocstyle, pyflakes and mccabe.

Make sure to set strictness to veryhigh for best results. prospector has its own configuration file, like the .prospector.yml default in the Python template, but also supports configuration files for any of the linters that it runs. Most of the above tools can be integrated in text editors and IDEs for convenience.

Autoformatting tools like yapf and black can automatically format code for optimal readability. yapf is configurable to suit your (team's) preferences, whereas black enforces the style chosen by the black authors. The isort package automatically formats and groups all imports in a standard, readable way.
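For illustration, typical invocations might look as follows; the src/ target directory is just an example, and you would normally pick either black or yapf, not both:

```shell
black src/                          # reformat in place, black's fixed style
yapf --in-place --recursive src/    # reformat in place, configurable style
isort src/                          # sort and group imports
```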

Testing

Use pytest as the basis for your testing setup. This is preferred over the unittest standard library, because it has a much more concise syntax and supports many useful features.
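For illustration, a minimal pytest test module might look as follows; the add function is a made-up example:

```python
# test_example.py: pytest discovers test_* functions automatically
import pytest

def add(a, b):
    return a + b

def test_add():
    assert add(2, 3) == 5

def test_add_rejects_none():
    with pytest.raises(TypeError):  # int + None raises TypeError
        add(2, None)
```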

It has many plugins. For linting, we have found pytest-pycodestyle, pytest-pydocstyle, pytest-mypy and pytest-flake8 to be useful. Other plugins we had good experience with are pytest-cov, pytest-html, pytest-xdist and pytest-nbmake.

Creating mocks can also be done within the pytest framework by using the mocker fixture provided by the pytest-mock plugin, or by using MagicMock and patch from the standard library's unittest.mock module. For a general explanation of mocking, see the standard library docs on mocking.
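For illustration, a sketch using the mocker fixture; mypackage, get_weather and daily_summary are hypothetical names:

```python
def test_daily_summary_uses_weather_service(mocker):
    # replace the real (hypothetical) network call with a canned response
    fake = mocker.patch("mypackage.client.get_weather",
                        return_value={"temp_c": 21})
    from mypackage import report
    assert "21" in report.daily_summary()  # calls get_weather internally
    fake.assert_called_once()
```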

To run your test suite, it can be convenient to use tox. Testing with tox allows for keeping the testing environment separate from your development environment. The development environment will typically accumulate (old) packages during development that interfere with testing; this problem is avoided by testing with tox.
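A minimal tox.ini sketch, assuming a pytest-based test suite; the listed Python versions are just examples:

```ini
[tox]
envlist = py310,py311,py312

[testenv]
deps = pytest
commands = pytest {posargs}
```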

Code coverage

When you have tests, it is also good to see which source code is exercised by the test suite. Code coverage can be measured with the coverage Python package, which can also generate HTML reports that show which lines were covered. Most test runners have the coverage package integrated.
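For illustration, a typical invocation of the coverage package might look like this:

```shell
coverage run -m pytest    # run the test suite under coverage measurement
coverage report           # print a per-file summary to the terminal
coverage html             # write an HTML report to htmlcov/
```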

Code coverage reports can be published online using a code quality service or a code coverage service. Preferably, use one of the code quality services that also handles code coverage (see below). If that is not possible or does not fit your project, use one of the generic code coverage services listed in the software guide.

Code quality analysis tools and services

Code quality services are explained in The Turing Way. There are multiple code quality services available for Python, all of which have their pros and cons; see The Turing Way for links to lists of possible services. We currently set up Sonarcloud by default in our Python template. To reproduce the Sonarcloud pipeline locally, you can use SonarLint in your IDE. If you use another editor, it may be more convenient to pick another service like Codacy or Codecov.

Debugging and profiling

Debugging

Profiling

There are a number of profiling tools available, suitable for different situations.

  • cProfile measures the number of function calls and how much CPU time they take. The output can be further analyzed using the pstats module (see the sketch after this list).
  • For more fine-grained, line-by-line CPU time profiling, two modules can be used:
    • line_profiler provides a function decorator that measures the time spent on each line inside the function.
    • pprofile is less intrusive; it simply times entire Python scripts line-by-line. It can give output in callgrind format, which allows you to study the statistics and call tree in kcachegrind (often used for analyzing C(++) profiles from valgrind).
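As a sketch of the cProfile + pstats workflow referred to above; main() here is a placeholder workload:

```python
import cProfile
import pstats

def main():
    # placeholder workload; replace with the code you want to profile
    sum(i * i for i in range(10**6))

cProfile.run("main()", "profile.out")           # write raw stats to a file
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)  # ten most expensive calls
```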

More realistic profiling information can usually be obtained with statistical or sampling profilers, which typically also produce convenient flame graphs.

Logging

Writing Documentation

Python uses docstrings for function-level documentation. You can read a detailed description of docstring usage in PEP 257. The default location to put HTML documentation is Read the Docs. You can connect your Read the Docs account to your GitHub account and let the HTML be generated automatically using Sphinx.

Autogenerating the documentation

There are several tools that automatically generate documentation from docstrings. At the eScience Center, we mostly use Sphinx, which uses reStructuredText as its markup language, but can be extended to use Markdown as well.

We recommend using the Google documentation style. Use sphinx-build to build your documentation.
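For illustration, a function documented in the Google style might look as follows (the function itself is a made-up example):

```python
def scale(values, factor=2.0):
    """Multiply each value by a constant factor.

    Args:
        values (list of float): The numbers to scale.
        factor (float): The multiplication factor. Defaults to 2.0.

    Returns:
        list of float: The scaled values.
    """
    return [v * factor for v in values]
```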

You can also integrate entire Jupyter notebooks into your HTML Sphinx output with nbsphinx. This way, your demo notebooks, for instance, can double as documentation. Of course, the notebooks will not be interactive in the compiled HTML, but they will include all code and output cells.

Recommended additional packages and libraries

General scientific

  • NumPy
  • SciPy
  • Pandas: data analysis toolkit
  • scikit-learn: machine learning in Python
  • Cython: speed up Python code by using C types and calling C functions
  • dask: larger-than-memory arrays and parallel execution

IPython and Jupyter notebooks (aka IPython notebooks)

IPython is an interactive Python interpreter -- very much the same as the standard Python interactive interpreter, but with some extra features (tab completion, shell commands, in-line help, etc).

Jupyter notebooks (formerly known as IPython notebooks) are browser-based interactive Python environments. They incorporate the same features as the IPython console, plus some extras like in-line plotting. Look at some examples to find out more. Within a notebook you can alternate code with Markdown comments (and even LaTeX), which is great for reproducible research. Notebook extensions add extra functionality to notebooks. JupyterLab is a web-based environment with many improvements and integrated tools.

Jupyter notebooks contain output and metadata that make it hard to nicely keep track of code changes using version control. If you are using git, you can add filters that automatically remove output cells and unneeded metadata from your notebooks. If you do choose to keep output cells in the notebooks (which can be useful to showcase your code's capabilities statically on GitHub), use ReviewNB to automatically create nice visual diffs in your GitHub pull request threads. It is good practice to restart the kernel and run the notebook from start to finish in one go before saving and committing, so you are sure that everything works as expected.

Visualization

  • Matplotlib has long been the standard in scientific visualization. It supports quick-and-dirty plotting through the pyplot submodule. Its object-oriented interface can be somewhat arcane, but it is highly customizable and runs natively on many platforms, making it compatible with all major OSes and environments. It supports most sources of data, including native Python objects, NumPy and Pandas.
    • Seaborn is a Python visualisation library based on Matplotlib and aimed at statistical analysis. It supports numpy, pandas, scipy and statsmodels.
  • Web-based:
    • Bokeh is Interactive Web Plotting for Python.
    • Plotly is another platform for interactive plotting through a web browser, including in Jupyter notebooks.
    • altair is a declarative statistical visualization library in the grammar-of-graphics style. It does not render visualizations itself, but instead outputs Vega-Lite JSON data, which can lead to a simplified workflow.
    • ggplot is a plotting library imported from R.

Database Interface

Parallelisation

CPython (the official and mainstream Python implementation) is not built for parallel processing due to the global interpreter lock (GIL). Note that the GIL only applies to actual Python code, so compiled modules like numpy do not suffer from it.

Having said that, there are many ways to run Python code in parallel, for example with the standard library multiprocessing and concurrent.futures modules, or with dask (mentioned above).
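As a minimal sketch, assuming a CPU-bound workload, the standard library's concurrent.futures can sidestep the GIL by using multiple processes:

```python
from concurrent.futures import ProcessPoolExecutor

def slow_square_sum(n):
    # CPU-bound placeholder workload
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:  # one worker process per CPU core
        results = list(pool.map(slow_square_sum, [10**6] * 8))
    print(results)
```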

Web Frameworks

There are several convenient Python web frameworks available; we recommend flask.

NLP/text mining

  • nltk Natural Language Toolkit
  • Pattern: web/text mining module
  • gensim: Topic modeling

Creating programs with command line arguments

  • For run-time configuration via command-line options, the built-in argparse module usually suffices (see the sketch after this list).
  • A more complete solution is ConfigArgParse. This (almost) drop-in replacement for argparse allows you to not only specify configuration options via command-line options, but also via (ini or yaml) configuration files and via environment variables.
  • Other popular libraries are click and fire.
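For illustration, a minimal argparse sketch; the program and its options are made up:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="Greet someone.")
    parser.add_argument("name", help="who to greet")
    parser.add_argument("--shout", action="store_true",
                        help="print the greeting in upper case")
    args = parser.parse_args()
    greeting = f"Hello, {args.name}!"
    print(greeting.upper() if args.shout else greeting)

if __name__ == "__main__":
    main()
```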