A Modern Python Packaging Guide

Packaging is a tedious and dirty job for python developers. The python packaging ecosystem is a mess, full of historical burdens. You might be confused by the different tools and command-line interfaces, and suffer from decision paralysis when choosing project structures, configuration files, and packaging tools. In this guide, I will give you a brief walkthrough, from creating a repo on github to publishing it to pypi and conda, in a deliberately opinionated manner.

Up to now, many python scaffolds have emerged. They provide very convenient ways to create a python project and free developers from tiresome configuration. However, under the hood they merely assist you in using underlying tools such as setuptools; once you want to customize your project, you will find that you have to learn those underlying tools anyway. So, in this guide, an example project will be built up from scratch, relying on as few dependencies as possible.

To make the guide more readable, I will use a simple project as an example.

Before we start, we should think about which structure to choose for our project.

About project structure

Two popular project layouts are used in python projects, each with its own set of pros and cons [1]. The first one is called flat-layout, also known as adhoc. The package folders are placed directly under the project root like this:

project_root_directory
├── pyproject.toml  # AND/OR setup.cfg, setup.py
├── ...
├── mypkg/
│   ├── __init__.py
│   ├── ...
│   ├── module.py
│   ├── subpkg1/
│   │   ├── __init__.py
│   │   ├── ...
│   │   └── module1.py
│   └── subpkg2/
│       ├── __init__.py
│       ├── ...
│       └── module2.py
└── tests/
    ├── __init__.py
    ├── ...
    └── test_mypkg.py

This layout is very common and convenient for working in the REPL, but in some situations it is more error-prone (e.g. during tests, or if you have a bunch of folders or Python files hanging around your project root). The structure brings up a problem: you lose import parity. The current directory (e.g. the project root) is implicitly included in sys.path, but that is not the case when installing and importing from site-packages, and your users will never have the same current working directory as you do. For example, when a test does import mypkg, it imports the copy in the current directory, not the one installed in site-packages.
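A quick way to see which copy you are importing (a minimal sketch, run from the project root of a flat-layout checkout that is also installed in site-packages):

import mypkg

# with the flat layout this prints a path inside the current directory
# (e.g. ./mypkg/__init__.py), shadowing the copy in site-packages
print(mypkg.__file__)

So more and more people prefer the second layout, the src-layout: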

project_root_directory
├── pyproject.toml  # AND/OR setup.cfg, setup.py
├── ...
└── src/  # NOTE: no __init__.py
    └── mypkg/  
        ├── __init__.py
        ├── ...
        ├── module.py
        ├── subpkg1/
        │   ├── __init__.py
        │   ├── ...
        │   └── module1.py
        └── subpkg2/
            ├── __init__.py
            ├── ...
            └── module2.py

In this layout, the package mypkg lives under the src directory, and src itself is not a package (no __init__.py). This constraint has beneficial implications in both testing and packaging:

  1. You will be forced to install the package and test the installed code (e.g.: by installing in a virtualenv). This will ensure that the deployed code works (it's packaged correctly) - otherwise your tests will fail.

  2. It prevents you from readily importing your code in the setup.py script. This is a bad practice because it will always blow up if importing the main package or module triggers additional imports for dependencies (which may not be available).

  3. Simpler packaging code and manifest. Without src you get messy editable installs (setup.py develop or pip install -e). Having no separation (no src dir) will force setuptools to put your project's root on sys.path - with all the junk in it.

  4. Less chance for tools to mix up code with non-code.

Also, notice that we do not include tests inside src or the package, because tests are concerned with development, not usage: they require additional dependencies to run, and module discovery tools will trip over your test modules.

module, package and namespace

Package, module, and namespace are three different concepts in Python:

  1. Module: A module is a file containing Python code. It can define functions, classes, and variables, which can be used in other modules or scripts. Modules can be imported into other modules or scripts using the import statement.

  2. Package: A package is a collection of modules. It is a way to organize related modules together. A package is represented by a directory containing an __init__.py file and can contain other modules or sub-packages.

  3. Namespace: A namespace is a mapping between names and objects. It is used to avoid naming conflicts between different modules or packages. Namespaces are created implicitly: every module, function call, and class definition gets its own namespace.

In summary, a module is a file containing Python code, a package is a collection of related modules, and a namespace is a mapping between names and objects that helps avoid naming conflicts. We will come back to namespaces in the following sections.
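A tiny self-contained illustration of the three concepts:

import math            # math is a module; importing binds the name "math"
                       # in the current (global) namespace

def f():
    x = 1              # x lives in f's local namespace
    return x

class C:
    y = 2              # y lives in C's namespace (see C.__dict__)

pi = "pie"             # no conflict: this pi and math.pi live in
print(pi, math.pi)     # different namespaces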

About setuptools

Note

I have rewritten the project as a template so it can be used as a scaffold for future projects. The file and function names may differ from this tutorial, but the structure stays the same.

After understanding the structure, we can create our example project. We will first package the pure python code, then install and test it, which ensures that our configuration is correct. Next, we will introduce an extra build system to hybridize with C++ or other languages. Our project will be named mypkg, and the structure looks like this:

├── pyproject.toml  # and/or setup.cfg
├── setup.py
├── README.md  # ...and other optional declarations
├── src
│   └── mypkg
│       ├── __init__.py
│       ├── ...
│       └── module.py
└── tests
    ├── __init__.py
    ├── ...
    └── test_mypkg.py

Two new files are included in our project: pyproject.toml and setup.py. pyproject.toml is a static configuration file, and the distribution can then be generated with any packaging tool.

distribution and wheel

A source distribution (aka sdist) provides metadata and the essential source files needed for installation by a tool like pip, or for generating a built distribution. It requires a build step before it can be installed.

A built distribution (aka wheel) only needs to be moved to the correct location on the target system to be installed. This format does not imply that Python files have to be precompiled (wheel intentionally does not include compiled Python files).
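Both artifacts can be produced with the PyPA build frontend (a minimal sketch, run from the project root):

pip install build
python -m build            # builds both the sdist and the wheel into dist/
python -m build --sdist    # source distribution only
python -m build --wheel    # wheel only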

First we should declare the metadata of our project in pyproject.toml:

[project]
name = "my_proj"
authors = [
    {name = "Roy Kid", email = "lijichen365@gmail.com"},
]
description = "My package description"
readme = "README.rst"
requires-python = ">=3.7"
keywords = ["one", "two"]
license = {text = "BSD-3-Clause"}
classifiers = [
    "Programming Language :: Python :: 3",
]
dependencies = [
    'importlib-metadata; python_version<"3.8"',
]
dynamic = ["version"]  # provided by setuptools-scm, see below

Next, we need to designate build-system requirements and the build backend:

[build-system]
requires = ["setuptools", "setuptools-scm[toml]"]
build-backend = "setuptools.build_meta"

Note that setuptools-scm is a plugin for setuptools that enables setuptools to get metadata and a list of files from SCM (Source Code Management, e.g. git or hg). In this way we avoid writing a MANIFEST.in to determine which files to include in the source distribution. Independently of that, we can configure which packages setuptools should discover in pyproject.toml:

[tool.setuptools.packages]
find = {}  # Scan the project directory with the default parameters

# OR
[tool.setuptools.packages.find]
# All the following settings are optional:
where = ["src"]  # ["."] by default
# include = ["mypackage*"]  # ["*"] by default
# exclude = ["mypackage.tests*"]  # empty by default
namespaces = false  # true by default
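Since we marked version as dynamic above, we also let setuptools-scm derive the version from git tags; an empty tool table is enough to enable it:

[tool.setuptools_scm]
# the version is taken from the most recent git tag,
# e.g. tag v0.1.0 -> version 0.1.0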

non-py files may not be included

As mentioned above, the source distribution only needs to include the essential source files needed for installing. So if we want to include non-py files, we need to specify them in MANIFEST.in.
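A minimal MANIFEST.in sketch (the data directory is hypothetical):

include README.md
recursive-include src/mypkg/data *.json
recursive-include cpp *.h *.cpp CMakeLists.txt

Note that MANIFEST.in only affects the sdist; to ship data files inside the wheel as well, configure [tool.setuptools.package-data].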

The setuptools documentation recommends exposing as much configuration as possible in a declarative way via pyproject.toml or setup.cfg, and keeping setup.py minimal with only the dynamic parts (or even omitting it completely if applicable).

different dependency styles

Setuptools offers three styles of dependency declaration: 1. build system requirements, 2. required dependencies and 3. optional dependencies. A build system requirement is what the project relies on when building and packaging, such as setuptools and pybind11. A required dependency is what the project relies on when running, such as numpy. An optional dependency is one the project can use when running but does not strictly need, such as matplotlib. The three map onto pyproject.toml as shown below.
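For example (a sketch; the plot extra name is illustrative):

[build-system]                     # 1. build system requirements
requires = ["setuptools", "setuptools-scm[toml]"]

[project]
dependencies = ["numpy"]           # 2. required dependencies

[project.optional-dependencies]    # 3. optional dependencies
plot = ["matplotlib"]              # installed via: pip install "mypkg[plot]"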

In setup.py, we only need to write the dynamic parts. Before we introduce the cpp extension, a pure python project needs only the following two lines to run and test, provided pyproject.toml is configured correctly:

# setup.py
from setuptools import setup

setup()

It is simple, right? It will become more complex when we introduce the cpp extension. We will start to write, test and package our code in the next section.

Write and Test

├── pyproject.toml  # and/or setup.cfg
├── setup.py
├── README.md  # ...and other optional declarations
├── src
│   └── mypkg
│       ├── __init__.py
│       ├── ...
│       └── sum.py
└── tests
    ├── __init__.py
    ├── ...
    └── test_sum.py

We can write a simple function py_add in src/mypkg/sum.py:

def py_add(a, b):
    return a + b

and a corresponding test case in tests/test_sum.py:

from mypkg.sum import py_add

def test_py_add():
    assert py_add(1, 2) == 3

We should import mypkg the way our users do. That means we should NOT import mypkg via a relative path from our current working directory (and please don't use sys.path.append), and the src-layout prevents you from doing so. (Your editor may fail to resolve the package at first; this goes away once the project is installed.) To run and test our code, we install the package in editable mode:

pip install -e .
# OR
python -m pip install --editable .

In this way, we mimic the package being installed on a user's system and import it the way our users do. Also, as we make changes to python files, we do not need to re-install the package. For the limitations of editable mode, please refer to here.

Once you have installed the package, you can run the test cases by typing pytest in your terminal. Sometimes you want to test the code in different environments, such as different python versions, different OSes and so on. Those tasks are usually handled by GitHub Actions, but we strongly recommend tox to test your code across environments locally. You can configure the test environments in tox.ini:

[tox]
requires =
    tox>=4
env_list = py{39,310}

[testenv]
description = run the tests with pytest
package = wheel
wheel_build_env = .pkg
deps =
    pytest>=7
    pytest-sugar
commands =
    pytest {tty:--color=yes} {posargs}

[testenv:lint]
description = run linters
skip_install = true
deps =
    black==22.12
commands = black {posargs:.}

[testenv:type]
description = run type checks
deps =
    mypy>=0.991
commands =
    mypy {posargs:src tests}

As you can see, tox can execute different tasks in different environments. For example, testenv runs the test cases on python 3.9 and 3.10, lint runs black to check the code style, and type runs mypy to check the types. You can run everything by typing tox in your terminal; to run a single task, type e.g. tox -e lint. For more information, please refer to the tox doc.

Hybrid python and cpp

As a modern python package, we often use a cpp extension to accelerate our code. Here we will introduce how to use a cpp extension in our package, using pybind11.

How to choose a wrapper

If your project is python-based, you can consider using Cython, ctypes or cffi; if your project is cpp-based, you should choose pybind11. For more info, please refer to here.

When using pybind11, we should separate building the C++ project from binding it to Python, namely:

Render unto Caesar the things that are Caesar's, and unto God the things that are God's

This is the first principle when using pybind11 to construct our project. We should write, test and build our cpp code without any distractions, write the binding code in a separate file, and at last bind them together.

We make a directory cpp. You can put cpp either under src or in the root directory; here we put it in the root. Our structure is like this:

├── pyproject.toml  # and/or setup.cfg
├── setup.py
├── README.md  # ...and other optional declarations
├── cpp
│   ├── external  # c++ deps
│   ├── include  # headers
│   ├── src  # cpp source code
│   └── CMakeLists.txt
├── src
│   └── mypkg
│       ├── __init__.py
│       ├── ...
│       └── sum.py
└── tests
    ├── __init__.py
    ├── ...
    └── test_sum.py

We add a simple function, declaring it in cpp/include/example.h and defining it in cpp/src/example.cpp:

// example.h
int int_add(int, int);

// example.cpp
#include "example.h"
int int_add(int i, int j) {
    return i + j;
}

Now we can treat this as a complete c++ project and test it with a framework such as GoogleTest.
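For instance, a minimal GoogleTest case might look like this (a sketch; the test file path is hypothetical, and gtest must be made available, e.g. under cpp/external):

// cpp/tests/test_example.cpp
#include <gtest/gtest.h>
#include "example.h"

TEST(ExampleTest, IntAdd) {
    EXPECT_EQ(int_add(1, 2), 3);
    EXPECT_EQ(int_add(-1, 1), 0);
}

Once the c++ part is flawless, we can write the corresponding binding code in bind_example.cpp under the cpp directory: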

#include <pybind11/pybind11.h>
#include "example.h"

namespace py = pybind11;

PYBIND11_MODULE(cpp_kernel, m) {

    m.doc() = "pybind11 example plugin"; // optional module docstring

    m.def("int_add", &int_add, "A function which adds two numbers",
        py::arg("i"), py::arg("j"));

}

We now need to write the CMakeLists.txt files. The first one is in the root directory, where we write down some project-level configuration:

cmake_minimum_required(VERSION 3.18)
project(modern_python_template)

# Needed for pybind11_add_module and Python3_SITELIB used below
find_package(Python3 COMPONENTS Interpreter Development REQUIRED)
find_package(pybind11 CONFIG REQUIRED)

message(STATUS "Start to bind cpp code for modern_python_template")
add_subdirectory(cpp)

In cpp/CMakeLists.txt, we configure how to compile the cpp code as a shared/static lib and link it with the python module. At last, we specify where to install the lib.

# --- compile cpp code as a shared lib and (optional)test it ---

# Use -fPIC even if statically compiled
set(CMAKE_POSITION_INDEPENDENT_CODE ON)

# If relying on an external cpp submodule:
# add_subdirectory()
# ...or compile it manually
add_library(
    cpp_core
    src/example.cpp
)

# Optional: set_target_properties
# Include directory of headers
target_include_directories(
    cpp_core
    PUBLIC
    ${CMAKE_CURRENT_SOURCE_DIR}/include
)

# --- compile python module ---
pybind11_add_module(cpp_kernel bind_example.cpp)
target_link_libraries(cpp_kernel PUBLIC cpp_core)

# --- Install libs so python module can import it ---

# Handle where to install the resulting Python package
if(CALL_FROM_SETUP_PY)
    # The CMakeExtension will set CMAKE_INSTALL_PREFIX to the root
    # of the resulting wheel archive
    set(KERNEL_INSTALL_PREFIX ${CMAKE_INSTALL_PREFIX})
else()
    # The Python package is installed directly in the folder of the
    # detected interpreter (system, user, or virtualenv)
    set(KERNEL_INSTALL_PREFIX ${Python3_SITELIB})
endif()

# Install the pybind11 library
install(
    TARGETS cpp_kernel
    # COMPONENT bindings
    LIBRARY DESTINATION ${KERNEL_INSTALL_PREFIX}
    ARCHIVE DESTINATION ${KERNEL_INSTALL_PREFIX}
    RUNTIME DESTINATION ${KERNEL_INSTALL_PREFIX}
)

In this project, we compile our cpp code as a static lib and link it into the python module, so in the end we only need to ship a single extension module whose file name is cpp_kernel.*.so.
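To make pip drive this CMake build, setup.py has to grow beyond the two-line version. One possible approach (a sketch using the third-party cmake-build-extension package; the options shown are illustrative, not a definitive recipe) is:

# setup.py
from pathlib import Path

from cmake_build_extension import BuildExtension, CMakeExtension
from setuptools import setup

setup(
    ext_modules=[
        CMakeExtension(
            name="cpp_kernel",
            # project root, where the top-level CMakeLists.txt lives
            source_dir=str(Path(__file__).parent.absolute()),
            cmake_configure_options=[
                # flips the install-prefix branch in the CMake code above
                "-DCALL_FROM_SETUP_PY:BOOL=ON",
            ],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)

If you go this route, remember to add cmake-build-extension (plus pybind11 and cmake) to the [build-system] requires list so they are available at build time.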

One advantage of Cython is that it asks the developer to wrap the c++ code, which can then be called in a more flexible way. Similarly, we can wrap ours with a normal python function. We add a module (a module is just a normal .py file) under src/mypkg:

# cppfunc.py
from mypkg.cpp_kernel import int_add

def cpp_add(a, b):
    if isinstance(a, int) and isinstance(b, int):
        return int_add(a, b)
    # if a, b are not int, dispatch to another function
    # or convert them to int explicitly here
    raise TypeError("cpp_add expects two ints")
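After re-installing the package (so the extension is built), the wrapper can be tested just like the pure python code (a sketch; the test file name is hypothetical):

# tests/test_cpp.py
from mypkg.cppfunc import cpp_add

def test_cpp_add():
    assert cpp_add(1, 2) == 3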

References


  1. https://blog.ionelmc.ro/2014/05/25/python-packaging/#the-structure