A Poetic Apology

Or Why Should You Use Poetry to Manage Python Dependencies
Posted by Pedro Ferrari on August 21, 2020 · 12 mins read

If you have ever spent some time trying to write a Python application, you have probably experienced Python's Infamous Dependency Hell at some point. You have probably also gathered by now that it has become a folk tradition to start any piece on the subject with the following notorious xkcd comic:

[xkcd comic depicting a hopelessly entangled Python environment]

Luckily for you (and all of us), at the time of writing there are some good solutions to the entanglement of pains brilliantly depicted in the image above. In fact, you most likely already know that if you want to develop against multiple Python versions you can readily use pyenv to start unraveling your twisted environment. You might have also learned that if you happen to be working on multiple projects with conflicting dependencies at the same time, you can employ virtual environments to isolate clashing libraries. In this post, we'll introduce yet another tool, Poetry, and make a case for why you should probably add it to your own toolkit.

The Problem

Imagine one lonely night you decide to start a simple dummy Python project accurately named foo with the following structure

├── foo
│   ├── foo
│   │   ├── bar
│   │   │   └── data.py
│   │   └── constants.py
│   └── README.md

Since this is not your first Python project and you want to avoid spending more endless nights fixing incompatibilities between your system and project modules, you diligently initiate a virtual environment from your shell with

$> python -m venv ~/Desktop/venv/foo-venv

and activate it within the newly created project with

$> source ~/Desktop/venv/foo-venv/bin/activate

Equipped with an isolated environment, you triumphantly proceed to install the ubiquitous Pandas data library. To do so, you use Python's de-facto package manager, pip, and carefully pin the library version to ensure replicability

$> pip install pandas==0.25.3

Since you are a bit lazy when it comes to exploratory data analysis, you also install the nifty pandas-profiling module to help you with that tedious work

$> pip install pandas-profiling==2.5.0

After all this flirting, you finally start coding (assuming that adding the following lines to the data.py file can, in fact, be called that)

import pandas as pd
from pandas_profiling import ProfileReport

# Build a tiny DataFrame with a categorical column and a missing value
df = pd.DataFrame([['a', 1], ['b', None]], columns=['category', 'value'])
df['category'] = df['category'].astype('category')

if __name__ == '__main__':
    ProfileReport(df).to_file('foo.html')

Given that your days of abusing print statements for debugging purposes are long gone, you install the beautiful and handy pdbpp library to check that these lines work as intended

$> pip install pdbpp==0.10.2

and run your code in post mortem debugging mode with python -m pdb -cc data.py.

Happy with the clean run, you now realize that in order to ship your voluptuous application without falling into the "works on my machine" trap you need a way to collect all your dependencies. A quick Google search will show you that pip's freeze subcommand allows you to record the current environment's packages into a requirements.txt file by means of the following incantation

$> pip freeze > requirements.txt

which lets anyone use your project by simply installing the needed dependencies with

$> pip install -r requirements.txt
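For reference, a file frozen this way pins every package installed in the environment, transitive dependencies included. At this stage it would look roughly like this (abridged; exact versions will vary):

fancycompleter==0.9.1
numpy==1.19.1
pandas==0.25.3
pandas-profiling==2.5.0
pdbpp==0.10.2
python-dateutil==2.8.1
pytz==2020.1
six==1.15.0
...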

Just as you are about to reveal your masterpiece project to the world, you become aware that the improved debugging module is actually used only by you while developing. With the idea of splitting the frozen requirements into separate production and development files, you take a peek into the generated file only to discover that each and every single sub-dependency of your application's dependencies is listed therein and locked to a specific version. Foreseeing the nightmare of maintaining this immense list, you uninstall the pdbpp library to ensure a clean requirements file again by means of

$> pip uninstall -y pdbpp && pip freeze > requirements.txt

A quick glance at the modified requirements file, however, shows that things didn't quite turn out as expected: pdbpp was indeed removed but its dependencies, such as fancycompleter, are still installed. Since this seems like a dead end, you choose to start from scratch by manually creating a requirements.txt file with only production dependencies

pandas==0.25.3
pandas-profiling==2.5.0

and an equivalent development file, requirements_dev.txt, solely containing

pdbpp==0.10.2

Impressed by the cleverness that has seemingly helped you dodge the dreaded Python dependency hell by keeping a record of isolated top-level packages, you decide to call it a day and give your application a final spin the next day.

When you wake up in the morning, the news is all over the place: Pandas v1 is finally out (after only twelve years!). A couple of hours procrastinating over the incredibly long changelog leads you to conclude that your complex foo-project will surely gain notable improvements by updating to the brand new version. Now, since you've locked Pandas to an exact version, you cannot simply run

$> pip install -U -r requirements.txt

Instead you must execute

$> pip install pandas==1.0.0

which leads to a particularly bizarre and confusing situation: an error pops up in your terminal

ERROR: pandas-profiling 2.5.0 has requirement pandas==0.25.3, but you'll have pandas 1.0.0 which is incompatible.

but the installation of pandas 1.0.0 nonetheless takes place. Assuming this to be a warning that pip mistakes for an error, you update your requirements.txt file accordingly and joyfully proceed to run your data.py module one last time, only to discover that it throws an enigmatic TypeError. Feeling betrayed by pip's apparent inability to resolve dependencies, you roll back your changes and stick with Pandas' (now) outdated version.
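Incidentally, pip does ship a subcommand that surfaces this kind of breakage after the fact. Running it in the broken environment would have reported the mismatch along these lines:

$> pip check
pandas-profiling 2.5.0 has requirement pandas==0.25.3, but you have pandas 1.0.0.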

At this point, you seem to have a working project but i) you are unsure whether reverting the Pandas version might have broken the desired replicability of your application, ii) the code could definitely look better and iii) after a good night's sleep you acknowledge that the overall functionality of your application is not as complex and rich as you thought it was the night before. To address the first two issues, you first add the black formatter to your requirements_dev.txt

black==19.10b0

and then within your project directory, you recreate your virtual environment with

$> rm -rf ~/Desktop/venv/foo-venv
$> python -m venv ~/Desktop/venv/foo-venv
$> source ~/Desktop/venv/foo-venv/bin/activate
$> pip install -r requirements_dev.txt
$> pip install -r requirements.txt

Now you run black in your project root (with black .) and are mostly satisfied with the prettifying job it did. However, to abide by Mutt Data's format style (which is coincidentally consistent with your dislike of turning every single quote into a double quote), you add a pyproject.toml telling black to skip its appalling string-normalization default setting

[tool.black]
skip-string-normalization = true

The code looks great now, and a new post mortem debugging run shows that things seem to work fine in the new (replicable) environment. The only thing left to be done before deploying the code on the server or sharing it with the world is to avoid having constants, such as the report name, hardcoded all around the code. You thus decide to add the following line to your empty constants.py module

REPORT_FILE = 'foo.html'

and modify data.py to import this constant from the parent directory with

from ..constants import REPORT_FILE

A new data.py run, however, now fails with the following error

ImportError: attempted relative import with no known parent package

which, according to the omniscient SO, makes sense since Python relative imports only work within a package; therefore, if you want to import from a parent directory, you should either create such a package or hack the sys.path. As a true purist Pythonista, you pick the former path and create a setup.py with the following contents

from setuptools import setup

with open('requirements.txt') as f:
    install_requires = f.read().splitlines()
with open('requirements_dev.txt') as f:
    extras_dev_requires = f.read().splitlines()

setup(
    name='foo',
    version='0.0.1',
    author='Mutt',
    author_email='info@muttdata.ai',
    install_requires=install_requires,
    extras_require={'dev': extras_dev_requires},
    packages=['foo'],
)

Now in a brand new virtual env you install your package in editable mode with pip install -e .[dev], change the import line in data.py to account for the package structure

from foo.constants import REPORT_FILE

and cross your fingers hoping everything finally works...

Everything does indeed (brittlely) work, but somehow all the hoop-jumping to make it function makes you uneasy. A brief introspection reveals several reasons for the wave of apprehension:

  1. Since you plan to work on multiple Python projects at the same time, isolation is a fundamental piece of your workflow. Virtual environments do solve this problem, but the activate/deactivate process is cumbersome and easy to forget.
  2. Having dependencies isolated between projects doesn't address dependency clashes within a project. Proper dependency resolution is the top required feature of any package manager worthy of respect, yet pip will only get such a feature by October 2020. Manually guaranteeing dependency consistency in complex projects is a dead end.
  3. If you want to install your application/project as a package, you must go through the overhead of adding a setup.py on top of your multiple requirements files. However, you've read PEPs 517-518 and want to try out the simpler and safer build mechanisms mentioned therein.
  4. You thought about trying your application on a different machine, but realized it ran Python 3.7 while your local box runs 3.8. To use pyenv with your isolated virtual envs, you need an extra plugin, pyenv-virtualenv, which makes managing venvs even more burdensome.
  5. You briefly played with Pipenv, which promised to bring to Python the envied features of other languages' more mature package managers (such as JavaScript's yarn/npm or Rust's Cargo), only to become quickly disappointed. Not only did Pipenv misleadingly claim to be Python's official recommended packaging tool (when it was truly designed for writing applications and not packages) but it also didn't do a release for more than a year and still hangs endlessly when creating the lock file that ensures repeatable/deterministic builds.

In a state of hopeless despair, you frantically start searching online to see if a solution addressing all these problems already exists. Amid a plethora of partial/incomplete candidates, you at last encounter one that incredibly cracks them all: it's called Poetry.

The Solution

Installation (with Pipx)

Poetry is a CLI app written in Python so you can simply install it with pip install --user poetry. However, you probably already installed or will install other Python CLI apps (for instance the fancy PostgreSQL client pgcli or youtube-dl to download YouTube videos). If you install these with your system's package manager (say for instance apt, yay or brew) they will be installed at a global level and their dependencies could potentially clash. You could create an individual venv for each instead, but in order to use them you would have to go through the hassle of activating the environment first...

To circumvent this annoying scenario you can use pipx, which will install each package in an isolated virtual environment while making it readily available in your shell (i.e., adding the executable to your $PATH). On top of exposing CLI apps for global access, it also makes it easy to list, upgrade and uninstall these apps. To install Poetry with pipx, you first install pipx itself with

$> python -m pip install --user pipx
$> python -m pipx ensurepath

and then directly do

$> pipx install poetry
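From then on, managing Poetry (or any other CLI app installed this way) is trivial:

$> pipx list
$> pipx upgrade poetry
$> pipx uninstall poetry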

If you prefer living on the edge (like I do) you can alternatively install a pre-release version with pipx install --pip-args='--pre' poetry.

Usage

Now you are all set to try the wonders promised by Poetry. To that effect, you create a new folder/project called foo-poetry with your .py files above and then run poetry init. An interactive prompt will start asking you to provide basic information about your package (name, author, etc.) that will be used to create a pyproject.toml file. This is essentially the same metadata you previously added to the setup.py, with some minimal variations

This command will guide you through creating your pyproject.toml config.

Package name [foo-poetry]: foo
Version [0.1.0]: 0.0.1
Description []:
Author [petobens <petobens@yahoo.com>, n to skip]: Mutt <info@muttdata.ai>
License []:
Compatible Python versions [^3.8]: ^3.7

Would you like to define your main dependencies interactively? (yes/no) [yes] no
Would you like to define your development dependencies interactively? (yes/no) [yes] no

Generated file

[tool.poetry]
name = "foo"
version = "0.0.1"
description = ""
authors = ["Mutt <info@muttdata.ai>"]

[tool.poetry.dependencies]
python = "^3.7"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry-core>=1.0.0a5"]
build-backend = "poetry.core.masonry.api"

Do you confirm generation? (yes/no) [yes] yes

The two relevant settings to highlight are the build-system and the Python version specification. The only thing you need to know for the time being about the first one is that it uses the standards in PEPs 517-518 to define an alternative way to build a project from source code without setuptools (and hence removes the need for a setup.py file). Regarding the second setting, to understand the syntax that specifies the Python version constraints you should read Poetry's versions docs, where you will find out that the caret (^) requirement means that only minor and patch updates are allowed (i.e., that our application will work with Python 3.7 and 3.8 but not with 4.0).
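To get a quick feel for the constraint syntax, here are a few illustrative specifiers (the pins below are made up for the example):

[tool.poetry.dependencies]
python = "^3.7"     # caret: >=3.7.0, <4.0.0 (minor and patch updates allowed)
pandas = "~0.25.3"  # tilde: >=0.25.3, <0.26.0 (only patch updates allowed)
numpy = "1.19.1"    # bare version: exactly 1.19.1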

So far, you only have a TOML file (which you can also use to centralize your black configuration). How do you specify dependencies? Simply run

$> poetry add pandas==0.25.3

which results in

Creating virtualenv foo-KLaC03aC-py3.8 in /home/pedro/.cache/pypoetry/virtualenvs

Updating dependencies
Resolving dependencies... (0.6s)

Writing lock file

Package operations: 5 installs, 0 updates, 0 removals

  - Installing six (1.15.0)
  - Installing numpy (1.19.1)
  - Installing python-dateutil (2.8.1)
  - Installing pytz (2020.1)
  - Installing pandas (0.25.3)

In other words, an initial add command will i) create a virtual environment, ii) install the requested package and its subdependencies, iii) write the exact version of each downloaded dependency to the poetry.lock file (which you should commit to your VCS so as to enforce replicability) and iv) append a line with the newly added package to the tool.poetry.dependencies section of the pyproject.toml file. The last item also signals that if you want to install a new dependency you can either reuse the add command or directly add such a line to your pyproject.toml file. For instance, if you now want to add the pandas-profiling library, you can modify the pyproject so as to have

pandas-profiling = "2.5.0"
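After this manual edit, the dependencies section of your pyproject.toml should look roughly like this:

[tool.poetry.dependencies]
python = "^3.7"
pandas = "0.25.3"
pandas-profiling = "2.5.0"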

Since at this stage a poetry.lock file already exists, if you now run poetry install then Poetry will resolve and install dependencies using the versions specified in that lock file (to ensure version consistency). However, because you added the new dependency manually to the pyproject.toml file, the install command will fail. Therefore, in this case, you need to run poetry update, which is essentially equivalent to deleting the lock file and running poetry install again.
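In short, the distinction at this point boils down to:

$> poetry install  # fails here: the lock file is out of sync with pyproject.toml
$> poetry update   # re-resolves dependencies and rewrites the lock file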

Adding a development dependency works in a similar fashion with the only caveat that you need to use the --dev flag when executing the add command

$> poetry add pdbpp==0.10.2 --dev
$> poetry add black==19.10b0 --dev

and the resulting packages will be appended to the tool.poetry.dev-dependencies section.
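After both commands, that section should read roughly like:

[tool.poetry.dev-dependencies]
pdbpp = "0.10.2"
black = "19.10b0"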

Now that dependencies are set, you can run your data.py file by executing

$> poetry run python data.py

which will execute the command within the project's virtualenv. Alternatively, you can spawn a shell within the active venv simply by running

$> poetry shell

Now imagine that you want to update the Pandas version as you did before when checking pip's inability to enforce dependency resolution. To do that, you update the constraint with

$> poetry add pandas==1.0.0

which this time correctly fails with the following error

Updating dependencies
Resolving dependencies... (0.0s)

[SolverProblemError]
Because pandas-profiling (2.5.0) depends on pandas (0.25.3)
 and foo depends on pandas (1.0.0), pandas-profiling is forbidden.
So, because foo depends on pandas-profiling (2.5.0), version solving failed.

By now, you notice that Poetry seems to address the initial two requests you listed in the previous section (namely easy project isolation and proper automatic dependency resolution). Before getting your hopes up, you proceed to verify whether it can straightforwardly package your code (particularly without a setup.py). Notably, this simply boils down to the inclusion of the following line in the tool.poetry section of the pyproject.toml file

packages = [{include = "foo"}]

followed by the execution of a new poetry install which will by default install the project in editable mode.
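If you wanted to double-check the packaging itself, Poetry also ships a build command that produces a source distribution and a wheel under dist/, with no setup.py involved:

$> poetry build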

Excited by Poetry's simplicity and ease of use, you start to wonder if Poetry is the ultimate tool you've been looking for. Can it check all the boxes? To conclusively answer that question you want to see if it is easy to switch between different Python versions. Given that your local machine uses Python 3.8 by default, you install 3.7.7 with pyenv install 3.7.7 (installing an older release would not have worked down the road since you set 3.7 to be the lower bound in your application's pyproject.toml). To make this version locally available, you add a .python-version file to the root of your project containing a single line with 3.7.7 and then tell Poetry to create and use a virtualenv with that version with

$> poetry env use 3.7

Once you check that it's correctly activated with poetry env list, you install all dependencies with poetry install and ultimately run your code, which (unsurprisingly) finishes without issues.
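Condensed, the whole version-switching dance amounts to:

$> pyenv install 3.7.7
$> echo '3.7.7' > .python-version
$> poetry env use 3.7
$> poetry env list
$> poetry install
$> poetry run python data.py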

Marveled by its intuitive plainness, you conclude that Poetry is exactly what you needed. In fact, you don't yet know this, but you got way more than you bargained for, since you've only scratched the surface of its features. You still have to discover that it installs packages in parallel, throws beautiful colored exceptions when all hell breaks loose, integrates with your IDE/editor of choice (if that's vim you can try your humble servant's shameless take on the matter), has a command to directly publish a package and, among other countless delights, is scheduled to have a plugin system for further extensibility.

One thing is crystal clear: Poetry is tomorrow's Python package manager. You might as well start using it today.

Note: this article belongs to our internal onboarding docs and has been slightly adjusted for general readership.