Page maintainer: Patrick Bos @egpbos
Python is the "dynamic language of choice" of the Netherlands eScience Center. We use it for data analysis and data science projects using the SciPy stack and Jupyter notebooks, and for many other types of projects: workflow management, visualization, NLP, web-based tools and much more. It is a good default choice for many kinds of projects due to its generic nature, its large and broad ecosystem of third-party modules, and its compact syntax which allows for rapid prototyping. It is not the language of maximum performance, although in many cases performance-critical components can easily be replaced by modules written in faster, compiled languages like C(++) or Cython.
The philosophy of Python is summarized in the Zen of Python. In Python, this text can be retrieved with the `import this` command.
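Running it in an interactive interpreter prints the full text:

```python
>>> import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
...
```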
When starting a new Python project, consider using our Python template. This template provides a basic project structure, so you can spend less time setting up and configuring your new Python packages, and comply with the software guide right from the start.
Python 2 and Python 3 co-existed for a long time, but since 2020 development of Python 2 has officially ended, meaning Python 2 will no longer be improved, even in case of security issues. If you are creating a new package, use Python 3. It is possible to write Python that is both Python 2 and Python 3 compatible (e.g. using Six), but only do this when you are 100% sure that your package won't be used otherwise. If you need Python 2 because of old, incompatible Python 2 libraries, strongly consider upgrading those libraries to Python 3 or replacing them altogether. Building and/or using Python 2 is arguably discouraged even more than, say, using Fortran 77, since at least Fortran 77 compilers are still being maintained.
- Things you’re probably not using in Python 3 – but should
- Six: Python 2 and 3 Compatibility Library
- 2to3: Automated Python 2 to 3 code translation
- python-modernize: wrapper around 2to3
- A popular way to learn Python is by doing it the hard way at http://learnpythonthehardway.org/
- Using `pylint` and `yapf` while learning Python is an easy way to get familiar with best practices and commonly used coding styles
To install Python packages, use `pip` or `conda` (or both; see also what is the difference between pip and conda?).
If you are planning on distributing your code at a later stage, be aware that your choice of package management may affect your packaging process. See Building and packaging for more info.
We strongly recommend creating isolated "virtual environments" for each Python project.
These can be created with `venv` or with `conda`.
Advantages over installing packages system-wide or in a single user folder:
- Lets you install Python packages without root rights.
- Contains all Python dependencies, so the environment keeps working after a system upgrade.
- Keeps environments clean for each project, so you don't get more than you need (and can easily reproduce that minimal working situation).
- Lets you select the Python version per environment, so you can test code compatibility between Python versions.
If you don't want to use conda, create isolated Python environments with the standard library `venv` module.
If you are still using Python 2, `virtualenv` and `virtualenvwrapper` can be used instead.
With `venv` and `virtualenv`, `pip` is used to install all dependencies. An increasing number of packages provide wheels, so `pip` downloads and installs them as binaries. This means they have no build dependencies and are much faster to install.
If the installation of a package fails because of its non-Python extensions or system library dependencies and you are not root, you could switch to conda (see below).
Conda can be used instead of venv and pip, since it is both an environment manager and a package manager. It easily installs binary dependencies, like Python itself or system libraries.
Installation of packages that do not provide wheels but contain a lot of non-Python code is much faster with Conda than with `pip`, because Conda does not compile the package; it only downloads precompiled packages.
The disadvantage of Conda is that the package needs to have a Conda build recipe. Many Conda build recipes already exist, but they are less common than the `setuptools` configuration that practically all Python packages have.
There are two main distributions of Conda: Anaconda and Miniconda. Anaconda is large and contains a lot of common packages, like numpy and matplotlib, whereas Miniconda is very lightweight and only contains Python. If you need more, the `conda` command acts as a package manager for Python packages.
If installation with the `conda` command is too slow for your purposes, it is recommended that you use `mamba` instead.
For environments where you do not have admin rights (e.g. DAS-6), either Anaconda or Miniconda is highly recommended, since the installation is very straightforward. The installation of packages through Conda is very robust.
A possible downside of Anaconda is that it is offered by a commercial supplier, but we don't foresee any vendor lock-in issues, because all packages are open source and can still be obtained elsewhere.
Do note that since 2020, Anaconda has started charging large institutes for downloading packages from their main channel (called the `default` channel) through `conda`. This does not apply to universities and most research institutes, but could apply to some government institutes that also perform research, and definitely applies to large for-profit companies. Be aware of this when choosing the distribution channel for your package.
An alternative installer that avoids this problem altogether, because it only installs packages from `conda-forge` by default, is miniforge. There is also a mambaforge variant that uses the faster `mamba` by default.
To create an installable Python package, use the `setuptools` module. This involves creating two files: `setup.cfg` and `pyproject.toml`. Our Python template already does this for you.
`setup.cfg` is the primary location where you should list your dependencies; use the `install_requires` field to list them. Keep version constraints to a minimum; use, in order of descending preference: no constraints, lower bounds, lower + upper bounds, exact versions.
Use of `requirements.txt` is discouraged unless necessary for something specific; see the discussion here.
It is possible to find the currently installed packages with `pip freeze` or `conda list`, but note that this is not ideal for listing dependencies in `setup.cfg`, because it also lists the dependencies of your dependencies. It is better to keep track of your project's direct dependencies from the start. Another quick way to find all direct dependencies is to run your code in a clean environment (for instance by running your test suite) and install the missing dependencies one by one, as reported by the resulting errors.
Most other configuration should also be in `setup.cfg`. `pyproject.toml` can be used to specify the build system, i.e. `setuptools` itself. It's possible that in the future all configuration will move from `setup.cfg` to `pyproject.toml`, but as of yet this is not common practice. Most tools, like `pytest`, `mypy` and others, do support using `pyproject.toml` already.
The Python build system is still very much in flux, though, so be sure to look up some current practices in authoritative blogs like this one.
One important thing to note is that use of `setup.py` has been officially deprecated and we should migrate away from it.
When the `setup.cfg` is written, your package can be installed with `pip install -e .`. The `-e` flag will install your package in editable mode, i.e. it will create a symlink to your package in the installation location instead of copying the package. This is convenient when developing, because any changes you make to the source code will immediately be available for use in the installed version.
Set up continuous integration to test your installation setup. Use `pyroma` (which can be run as part of `prospector`) as a linter for your installation configuration.
For packaging your code, you can use either `pip` or `conda`. Neither of them is better than the other -- they are different; use the one which is more suitable for your project. `pip` may be more suitable for distributing pure Python packages, and it provides some support for binary dependencies using wheels. `conda` may be more suitable when you have external dependencies which cannot be packaged in a wheel.
- Build and upload your package to the Python Package Index (PyPI) so it can be installed with pip.
- Either do this manually by using twine (tutorial),
- Or configure GitHub Actions to do it automatically for each release: see this example workflow in DIANNA.
- Additional guidelines:
- Packages should be uploaded to PyPI using your own account
- For packages developed in a team or organization, it is recommended that you create a team or organizational account on PyPI and add that as a collaborator with the owner role. This will allow your team or organization to maintain the package even if individual contributors at some point move on to do other things. At the Netherlands eScience Center, we are a fairly small organization, so we use a single backup account (`nlesc`).
- When distributing code through PyPI, non-Python files (such as `requirements.txt`) will not be packaged automatically; you need to add them to a `MANIFEST.in` file.
- To test whether your distribution will work correctly before uploading to PyPI, you can run `python -m build` in the root of your repository. Then try installing your package with `pip install dist/<your_package>.tar.gz`.
`python -m build` will also build Python wheels, the current standard for distributing Python packages. This will work out of the box for pure Python code, without C extensions. If C extensions are used, each OS needs to have its own wheel. The manylinux Docker images can be used for building wheels compatible with multiple Linux distributions. Wheel building can be automated using GitHub Actions or another CI solution, where you can build on all three major platforms using a build matrix.
- Make use of conda-forge whenever possible, since it provides many automated build services that save you tons of work, compared to using your own conda repository. It also has a very active community for when you need help.
- Use BioConda or custom channels (hosted on GitHub) as alternatives if need be.
Every major text editor supports Python, either natively or through plugins. At the Netherlands eScience Center, some popular editors or IDEs are:
- VS Code holds the middle ground between a lightweight text editor and a full-fledged language-dedicated IDE.
- vim or emacs (don't forget to install plugins to get the most out of these two): two versatile, classic power tools that can also be used over a remote SSH connection when needed.
- JetBrains PyCharm is the Python-specific IDE of choice. PyCharm Community Edition is free and open source; the source code is available in the python folder of the IntelliJ repository.
The style guide for Python code is PEP8 and for docstrings it is PEP257. We highly recommend following these conventions, as they are widely agreed upon to improve readability. To make following them significantly easier, we recommend using a linter.
Many linters exist for Python; `prospector` is a tool for running a suite of linters. It supports, among others:
Make sure to set strictness to `veryhigh` for best results. `prospector` has its own configuration file, like the .prospector.yml default in the Python template, but also supports configuration files for any of the linters that it runs. Most of the above tools can be integrated in text editors and IDEs for convenience.
Autoformatting tools like `yapf` and `black` can automatically format code for optimal readability. `yapf` is configurable to suit your (team's) preferences, whereas `black` enforces the style chosen by the `black` authors. The `isort` package automatically formats and groups all imports in a standard, readable way.
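As a rough illustration (the exact output depends on the formatter and its configuration), an autoformatter rewrites cramped code into a consistent style:

```python
# Before autoformatting: inconsistent spacing and line breaks.
def add(a,b,
    c):
        return a+b+c

# After running an autoformatter such as black:
def add(a, b, c):
    return a + b + c
```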
Use pytest as the basis for your testing setup.
This is preferred over the `unittest` standard library, because it has a much more concise syntax and supports many useful features (see the example below).
It has many plugins.
For linting, we have found `pytest-pycodestyle`, `pytest-pydocstyle`, `pytest-mypy` and `pytest-flake8` to be useful.
Other plugins we have had good experience with are `pytest-cov`, `pytest-html`, `pytest-xdist` and `pytest-nbmake`.
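A minimal sketch of pytest's concise style (the file and function names are illustrative; pytest collects files named `test_*.py`):

```python
# test_example.py -- run with: pytest
import pytest

def inc(x):
    """Toy function under test."""
    return x + 1

def test_inc():
    # Plain assert statements suffice; no TestCase boilerplate needed.
    assert inc(3) == 4

def test_inc_rejects_strings():
    # pytest.raises asserts that the expected exception is raised.
    with pytest.raises(TypeError):
        inc("three")
```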
Creating mocks can also be done within the pytest framework by using the `mocker` fixture provided by the `pytest-mock` plugin, or by using `MagicMock` and `patch` from `unittest.mock`.
For a general explanation about mocking, see the standard library docs on mocking.
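As a minimal sketch (the function under test and the use of `requests` are assumptions for illustration), patching with `unittest.mock` looks like this:

```python
from unittest.mock import MagicMock, patch

import requests  # assumed third-party dependency, for illustration only

def get_status(url):
    """Hypothetical function under test: returns the HTTP status of a URL."""
    return requests.get(url).status_code

def test_get_status():
    # Replace requests.get with a mock so the test never touches the network.
    fake_response = MagicMock(status_code=200)
    with patch("requests.get", return_value=fake_response) as mock_get:
        assert get_status("https://example.org") == 200
        mock_get.assert_called_once_with("https://example.org")
```

With `pytest-mock`, the same patch can be written as `mocker.patch("requests.get", ...)` inside a test function that takes the `mocker` fixture.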
To run your test suite, it can be convenient to use `tox`. Testing with `tox` allows for keeping the testing environment separate from your development environment. The development environment will typically accumulate (old) packages during development that interfere with testing; this problem is avoided by testing with `tox`.
When you have tests, it is also good to see which source code is exercised by the test suite. Code coverage can be measured with the coverage Python package. The coverage package can also generate HTML reports which show which lines were covered. Most test runners have the coverage package integrated.
Code coverage reports can be published online in a code quality service or a code coverage service. It is preferred to use one of the code quality services listed below that also handle code coverage. If this is not possible or does not fit your needs, use one of the generic code coverage services listed in the software guide.
Code quality services are explained in The Turing Way. There are multiple code quality services available for Python, all of which have their pros and cons. See The Turing Way for links to lists of possible services. We currently set up Sonarcloud by default in our Python template. To reproduce the Sonarcloud pipeline locally, you can use SonarLint in your IDE. If you use another editor, perhaps it is more convenient to pick another service like Codacy or Codecov.
- Python has its own debugger called pdb. It is part of the Python distribution.
- pudb is a console-based Python debugger which can easily be installed using pip.
- If you are looking for IDEs with debugging capabilities, see the Editors and IDEs section.
- If you are using Windows, Python Tools for Visual Studio adds Python support for Visual Studio.
- If you would like to integrate pdb with vim, you can use Pyclewn.
- A list of other available software can be found on the Python wiki page on debugging tools.
- If you are looking for some tutorials to get started:
There are a number of available profiling tools that are suitable for different situations.
- cProfile measures the number of function calls and how much CPU time they take. The output can be further analyzed using the `pstats` module; see the sketch after this list.
- For more fine-grained, line-by-line CPU time profiling, two modules can be used:
  - line_profiler provides a function decorator that measures the time spent on each line inside the function.
  - pprofile is less intrusive; it simply times entire Python scripts line-by-line. It can give output in callgrind format, which allows you to study the statistics and call tree in `kcachegrind` (often used for analyzing C(++) profiles from `valgrind`).
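A minimal sketch of cProfile plus pstats (the workload function is made up for illustration):

```python
import cProfile
import pstats

def work():
    """Toy workload to profile."""
    return sum(i * i for i in range(100_000))

# Save profiling data to a file, then print the 5 most expensive functions.
cProfile.run("work()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(5)

# Whole scripts can also be profiled from the command line:
#   python -m cProfile -o profile.out myscript.py
```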
More realistic profiling information can usually be obtained by using statistical or sampling profilers. The profilers listed below all create nice flame graphs.
- The logging module is the most commonly used tool to track events in Python code (a minimal usage sketch follows this list).
- Tutorials:
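A minimal sketch of the common logging pattern:

```python
import logging

# One logger per module, named after the module.
logger = logging.getLogger(__name__)

def connect(host):
    logger.info("Connecting to %s", host)

if __name__ == "__main__":
    # Configure logging once, in the application entry point.
    logging.basicConfig(level=logging.INFO)
    connect("localhost")
```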
Python uses docstrings for function-level documentation. You can read a detailed description of docstring usage in PEP 257. The default location to put HTML documentation is Read the Docs. You can connect your Read the Docs account to your GitHub account and let the HTML be generated automatically using Sphinx.
There are several tools that automatically generate documentation from docstrings. At the eScience Center, we mostly use Sphinx, which uses reStructuredText as its markup language, but can be extended to use Markdown as well.
- Sphinx quickstart
- reStructuredText Primer
- Instead of using reST, Sphinx can also generate documentation from the more readable NumPy style or Google style docstrings. The Napoleon extension needs to be enabled.
We recommend using the Google documentation style.
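For example, a Google-style docstring that Napoleon can parse looks like this (a sketch with a made-up function):

```python
def scale(values, factor=2.0):
    """Multiply each value by a constant factor.

    Args:
        values (list of float): The numbers to scale.
        factor (float): Multiplier applied to each value. Defaults to 2.0.

    Returns:
        list of float: The scaled values.
    """
    return [v * factor for v in values]
```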
Use `sphinx-build` to build your documentation.
You can also integrate entire Jupyter notebooks into your HTML Sphinx output with nbsphinx. This way, your demo notebooks, for instance, can double as documentation. Of course, the notebooks will not be interactive in the compiled HTML, but they will include all code and output cells.
- NumPy
- SciPy
- Pandas data analysis toolkit
- scikit-learn: machine learning in Python
- Cython: speed up Python code by using C types and calling C functions
- dask: larger-than-memory arrays and parallel execution
IPython is an interactive Python interpreter -- very much the same as the standard Python interactive interpreter, but with some extra features (tab completion, shell commands, in-line help, etc).
Jupyter notebooks (formerly known as IPython notebooks) are browser-based interactive Python environments. They incorporate the same features as the IPython console, plus some extras like in-line plotting. Look at some examples to find out more. Within a notebook you can alternate code with Markdown comments (and even LaTeX), which is great for reproducible research. Notebook extensions add extra functionality to notebooks. JupyterLab is a web-based environment with a lot of improvements and integrated tools.
Jupyter notebooks contain data that makes it hard to nicely keep track of code changes using version control. If you are using git, you can add filters that automatically remove output cells and unneeded metadata from your notebooks. If you do choose to keep output cells in the notebooks (which can be useful to showcase your code's capabilities statically from GitHub) use ReviewNB to automatically create nice visual diffs in your GitHub pull request threads. It is good practice to restart the kernel and run the notebook from start to finish in one go before saving and committing, so you are sure that everything works as expected.
- Matplotlib has been the standard in scientific visualization (see the sketch after this list). It supports quick-and-dirty plotting through the `pyplot` submodule. Its object-oriented interface can be somewhat arcane, but it is highly customizable and runs natively on many platforms, making it compatible with all major OSes and environments. It supports most sources of data, including native Python objects, Numpy and Pandas.
- Seaborn is a Python visualisation library based on Matplotlib and aimed towards statistical analysis. It supports numpy, pandas, scipy and statsmodels.
- Web-based:
- Bokeh is Interactive Web Plotting for Python.
- Plotly is another platform for interactive plotting through a web browser, including in Jupyter notebooks.
- altair is a grammar of graphics style declarative statistical visualization library. It does not render visualizations itself, but rather outputs Vega-Lite JSON data. This can lead to a simplified workflow.
- ggplot is a plotting library imported from R.
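A minimal `pyplot` sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

# Quick-and-dirty plotting through the pyplot interface.
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.legend()
plt.savefig("sine.png")  # or plt.show() for an interactive window
```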
- psycopg is a PostgreSQL adapter
- cx_Oracle enables access to Oracle databases
- monetdb.sql is the MonetDB Python client
- pymongo and motor allow working with MongoDB databases
- py-leveldb provides thread-safe Python bindings for LevelDB
CPython (the official and mainstream Python implementation) is not built for parallel processing due to the global interpreter lock (GIL). Note that the GIL only applies to actual Python code, so compiled modules like e.g. `numpy` do not suffer from it.
Having said that, there are many ways to run Python code in parallel:
- The multiprocessing module is the standard way to do parallel execution on one or multiple machines; it circumvents the GIL by creating multiple Python processes.
- A much simpler alternative in Python 3 is the `concurrent.futures` module; see the sketch after this list.
- IPython / Jupyter notebooks have built-in parallel and distributed computing capabilities.
- Many modules have parallel capabilities or can be compiled to have them.
- At the eScience Center, we have developed the Noodles package for creating computational workflows and automatically parallelizing them by dispatching independent subtasks to parallel and/or distributed systems.
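As an example of the `concurrent.futures` route, a minimal sketch that circumvents the GIL for a CPU-bound toy task by using a process pool:

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    """Toy CPU-bound task."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":  # guard required when spawning processes
    with ProcessPoolExecutor() as pool:
        # map() distributes the calls over multiple Python processes.
        results = list(pool.map(cpu_heavy, [10_000, 20_000, 30_000]))
    print(results)
```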
There are convenient Python web frameworks available:
- flask
- cherrypy
- Django
- bottle (similar to flask, but a bit more light-weight for a JSON-REST service)
We recommend `flask`.
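A minimal `flask` application looks like this (a sketch; the route and message are illustrative):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    # Returning a dict makes flask serialize it to JSON.
    return {"message": "Hello, world!"}

if __name__ == "__main__":
    app.run(debug=True)  # development server only, not for production
```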
- For run-time configuration via command-line options, the built-in `argparse` module usually suffices; see the sketch after this list.
- A more complete solution is `ConfigArgParse`. This (almost) drop-in replacement for `argparse` allows you to specify configuration options not only via command-line options, but also via (ini or yaml) configuration files and via environment variables.
- Other popular libraries are `click` and `fire`.
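A minimal `argparse` sketch (the option names are illustrative):

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="Example command-line tool.")
    parser.add_argument("input_file", help="path to the input file")
    parser.add_argument("-n", "--repeat", type=int, default=1,
                        help="number of repetitions (default: 1)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable verbose output")
    args = parser.parse_args()
    if args.verbose:
        print(f"Processing {args.input_file} {args.repeat} time(s)")

if __name__ == "__main__":
    main()
```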