Packaging Your Python Project

I recently had to deploy some of my Python packages to production. To do this, I needed to package them. Since packaging Python code was new to me, I took notes on every step I took and every reference I used. Eventually these notes became this tutorial. In addition to describing one approach to packaging your code, I will point out some best practices as well as pitfalls I ran into.

Code and Prerequisites

Prerequisites: Familiarity with Python and bash

Code: The code for the sample project I wrote for this post can be found in my GitHub repo.

Reasons For Packaging Your Project

Packaging your Python projects is not something you need to worry about when you first start programming. In fact, if the only person using your projects is you, you can get a lot done without ever having to worry about packaging. However, if you need your Python project to run in production, or if you are working with a team and need to make sure everybody is using the same version, you will need to make the project distributable. Apart from allowing your code to be distributed, taking the steps to make it packageable has additional benefits:

If nothing else, structuring your project so that it can be packaged encourages good software engineering practices like writing unit tests and adhering to a consistent project structure.
Your package can be installed locally. This means it will be installed into your site-packages directory (dist-packages on Debian and Ubuntu), and you don't have to worry about configuring your PYTHONPATH to include the source.
If you use virtual environments in your other projects, it's easy to incorporate one project in another by installing it into the corresponding virtual environment.

Distutils, Pip And Setuptools, Oh My

One of the more confusing aspects of packaging Python modules is the number of tools available. Guidance on which tools to use is not straightforward, and I highly recommend reading this answer on Stack Overflow. As of this writing (September 2015), the recommendation is to use setuptools and wheels.

Structuring Python Projects

Structuring your projects is important. When you first create a Python project it's worth adhering to a structure that will allow git, setup.py and virtualenv to play together nicely. To achieve this, I currently use the following project scaffold:

    projectname
         ├── MANIFEST.in
         ├── setup.py
         ├── README
         ├── .gitignore
         ├── .git
         ├── projectname_env
         └── projectname
             ├── __init__.py
             ├── subpackageone
             │   ├── __init__.py
             │   ├── second_module.py
             │   ├── tests
             │   │   └── test_second_module.py
             │   └── models
             │       └── model1
             ├── first_module.py
             └── tests
                 └── test_first_module.py

I'll cover the details of setup.py and MANIFEST.in shortly. For now, notice that there are two projectname directories. The lower-level 'projectname' directory contains the actual code you want to package. Alongside it are all the tools that help manage (git) and package (setup.py, MANIFEST.in) it. Note that there is no __init__.py file directly underneath the top-level 'projectname' directory, as the files at this level are not part of a Python package. Within each Python package you will find a tests directory containing the unit tests for each module in that package, as well as an optional 'models' directory containing binary files used by the modules in that package. The tests directories do not contain __init__.py files, as they are not packages. For the naming conventions for modules as well as packages, it's good to try to stick to PEP 8.
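If you create this scaffold often, it can be scripted. The following is only a sketch of one way to do it: the function name create_scaffold is hypothetical, the files it creates are empty placeholders, and it leaves out .git and projectname_env, which are created by git and virtualenv themselves.

```python
from pathlib import Path


def create_scaffold(root, name):
    """Create the directory layout described above under root/name.

    Hypothetical helper for illustration; all files are empty placeholders.
    """
    top = Path(root) / name          # outer directory: packaging and tooling files
    pkg = top / name                 # inner directory: the importable package
    sub = pkg / "subpackageone"

    # tests and models are plain directories, so they get no __init__.py.
    for directory in (pkg / "tests", sub / "tests", sub / "models"):
        directory.mkdir(parents=True, exist_ok=True)

    placeholders = [
        top / "MANIFEST.in",
        top / "setup.py",
        top / "README",
        top / ".gitignore",
        pkg / "__init__.py",
        pkg / "first_module.py",
        sub / "__init__.py",
        sub / "second_module.py",
    ]
    for path in placeholders:
        path.touch()
    return top
```

Calling create_scaffold(".", "projectname") would create the tree in the current directory. Note that only the inner projectname directory and subpackageone receive an __init__.py.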

Importing Modules and Referencing Files

Imports

I recommend always using absolute imports. That is, if first_module needs to import from second_module then it should import it as:

from projectname.subpackageone import second_module

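Relative imports are messy, and I find they lead to confusing bugs. When executing my Python code, I like to run it from the top-level projectname directory as a module, e.g.: python -m projectname.first_module. Running it this way puts the top-level directory on the import path, so the absolute imports resolve. Invoking the file directly (python projectname/first_module.py) puts the inner projectname directory on the path instead, and the import of projectname fails unless the package is installed.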

Referencing Files

If some of your modules load binary files, it is important not to hard code in absolute paths to these files, as the path will not be valid once the project has been installed. One method to get around this, which I like, is to define the following variable in any module that will be performing i/o to a binary file:

__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))

__location__ will then be the absolute path to the directory where the Python module resides. Getting the path to the binary file is simply a matter of joining __location__ with the correct relative path. For example, if my module 'first_module.py' had to access model1, which lives in subpackageone/models in the scaffold above, the Python code in first_module.py would be:

import os
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
model1_path = os.path.join(__location__, 'subpackageone', 'models', 'model1')
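On Python 3.4 and later, the same idea can be expressed with pathlib. This is a sketch rather than a recommendation from the original notes: resource_path is a hypothetical helper, and the example path in the comment assumes the scaffold shown earlier.

```python
from pathlib import Path


def resource_path(module_file, *parts):
    """Return an absolute path built relative to the directory of module_file.

    Hypothetical helper: pass the calling module's __file__ as module_file.
    """
    module_dir = Path(module_file).resolve().parent
    return module_dir.joinpath(*parts)


# Inside first_module.py this would be called as:
#   model1_path = resource_path(__file__, "subpackageone", "models", "model1")
```

Because the path is derived from the module's own location, it stays valid whether the code runs from the source tree or from an installed copy.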

setup.py

The setup.py script is the main piece that describes your module distribution. All metadata about your modules is passed in through keyword arguments. Let's take a look at an example:

from setuptools import setup, find_packages

setup(name='projectname',
      version='0.1',
      description='Demo Modules',
      url='https://github.com/alexanderwaldin/projectname',
      author='The Author',
      author_email='theauthor@example.com',
      license='MIT',
      packages=find_packages(),
      include_package_data=True,
      test_suite='nose.collector',
      tests_require=['nose'],
      install_requires=[])

Many of the arguments are self explanatory, but I want to touch on a few of them:

packages: This argument takes a list of package names that will be installed. You can either manually list them, or you can use the find_packages function.
include_package_data: When this is set to True, all files inside your packages that are under version control or matched by the MANIFEST.in file will be included in the installation. There are more fine-grained ways to specify the files you want to include; see the documentation for details.
test_suite, tests_require: used to specify the test runner that finds all the tests to run. I like nose.
install_requires: a list of libraries that your project depends on. When installing your project, pip will check that these are installed and, if they are not, try to install them. install_requires is not a duplication of requirements.txt; see the section requirements.txt vs install_requires.
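To get a feel for which names find_packages would hand to the packages argument, here is a rough approximation of package discovery using only the standard library. It is an illustration under the assumption that a package is any directory containing an __init__.py; it is not the setuptools implementation, and discover_packages is a hypothetical name.

```python
import os


def discover_packages(where="."):
    """Return dotted names of directories under `where` that contain __init__.py.

    Rough approximation for illustration only, not the setuptools code.
    """
    packages = []
    for root, dirs, _files in os.walk(where):
        # Only keep (and descend into) directories that are packages themselves.
        dirs[:] = sorted(
            d for d in dirs
            if os.path.isfile(os.path.join(root, d, "__init__.py"))
        )
        for d in dirs:
            rel = os.path.relpath(os.path.join(root, d), where)
            packages.append(rel.replace(os.sep, "."))
    return packages
```

Run against the scaffold from earlier, this sketch would report projectname and projectname.subpackageone, and skip the tests and models directories because they have no __init__.py.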

Additional `setup()` keywords

As setuptools is an enhancement of distutils, the complete list of keyword arguments is a combination of the keyword arguments in distutils as well as the new and changed keywords from setuptools.

MANIFEST.in

Now that we've written the setup.py script, let's take a look at writing the MANIFEST.in file.

recursive-include projectname *.py
recursive-include projectname *.pickle
include README
include LICENSE
include requirements.txt

The manifest is used to specify files to include in the distribution that are not mentioned in the setup.py script. Usually these are files like unit tests, the README file, the requirements.txt file, and so on. You specify which files to include using Unix-style "glob" patterns.
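To preview which files a pattern such as recursive-include projectname *.pickle would pick up, you can apply the same kind of glob matching yourself. The sketch below uses the standard library's fnmatch module; recursive_include is a hypothetical helper that only approximates the MANIFEST.in command and is not part of distutils or setuptools.

```python
import fnmatch
import os


def recursive_include(root, pattern):
    """List files anywhere under `root` whose name matches the glob `pattern`.

    Approximates the MANIFEST.in command `recursive-include root pattern`.
    """
    matches = []
    for dirpath, _dirs, files in os.walk(root):
        for name in fnmatch.filter(files, pattern):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

For example, recursive_include("projectname", "*.pickle") would list every pickle file below the inner projectname directory.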

Check-Manifest

It can be tricky to make sure that everything that needs to be in the source distribution is actually getting included. A wonderful tool to help with this is check-manifest. It verifies that everything under source control is being included in the distribution, and if something is missing, it suggests how you should modify your MANIFEST.in file.

Virtual Environment and requirements.txt

virtualenv is a tool to create isolated Python environments. It addresses the problem of dependencies and versions. Before distributing your code, I highly recommend creating a virtual environment in which you run your test suite on the project. This will allow you to collect a list of dependencies your project requires using pip freeze. You can store these requirements in the requirements.txt file using pip freeze > requirements.txt. I don't want to go into details on how to use virtualenv, as there's already a good tutorial out there. I personally store the files for the project's virtual environment in a directory called projectname_env in the top-level projectname directory. The following instructions will get you started:

mkdir projectname_env
virtualenv projectname_env/
source projectname_env/bin/activate

This will create a new directory for the environment, create the virtual environment, and then activate it. To leave the virtual environment, simply type deactivate. Since you don't want the virtual environment under version control, add 'projectname_env' to your .gitignore.

You don't need to develop the project in the virtual environment. If you've written unit tests to cover your code, it suffices to activate the virtual environment when you are done developing and then run the test suite. Your tests will break until you've installed all required dependencies.

requirements.txt vs install_requires

Maybe you're like me, and you ask yourself why there are two places where you define the requirements for your project: the requirements.txt file and the install_requires keyword in setup.py. Doesn't this violate the DRY principle? You could write a function that parses requirements.txt to create a list of requirements to pass to the install_requires keyword:

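import os

# pip.req and pip.download are pip internals, not a supported API;
# from pip 10 onward they moved under pip._internal.
from pip.req import parse_requirements
from pip.download import PipSession

# directory containing setup.py and requirements.txt
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))

def read_requirements():
    '''parses requirements from requirements.txt'''
    reqs_path = os.path.join(__location__, 'requirements.txt')
    install_reqs = parse_requirements(reqs_path, session=PipSession())
    reqs = [str(ir.req) for ir in install_reqs]
    return reqs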

Unfortunately, this is probably not best practice. The two lists serve different purposes: install_requires names the libraries your project needs in general, while requirements.txt pins the exact versions of one complete, known-good environment, so it is supposed to be just a complement. There is a lot of talk about why this is the case. My solution is to generate requirements.txt with pip freeze, and then to manually copy this list of required packages into setup.py, stripping out the version numbers. That said, I don't like this solution, as it's easy to get the two lists out of sync if you're not careful, so if you have a better suggestion, please leave it in a comment.
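The copy-and-strip step can be partly automated without touching pip's internals. The following is a minimal sketch, assuming the input is plain pip freeze output; strip_versions is a hypothetical helper, not part of pip or setuptools, and you would still review the resulting list by hand before pasting it into install_requires.

```python
import re


def strip_versions(freeze_lines):
    """Turn pinned `pip freeze` lines into bare project names."""
    names = []
    for line in freeze_lines:
        line = line.strip()
        # Skip blank lines, comments, and option lines such as "-e ...".
        if not line or line.startswith(("#", "-")):
            continue
        # Keep only the text before the first version operator, marker, or space.
        name = re.split(r"[=<>!~;\s]", line, maxsplit=1)[0]
        names.append(name)
    return names


if __name__ == "__main__":
    with open("requirements.txt") as handle:
        print(strip_versions(handle))
```

The design choice here is deliberate: the script only prints candidates, so the decision about what really belongs in install_requires stays with you.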

Testing, Packaging And Installing Your Project

Before distributing your project I highly recommend writing a test suite for it (or perhaps you follow a test-driven approach). Among many other benefits, tests allow the people who are using your code to make sure all the dependencies are fulfilled, as well as gain confidence that your module works. You can test your project by executing python setup.py test from your top-level projectname directory.

Once you are confident that everything is working as you expect, you can build a source distribution using python setup.py sdist. This will create a dist directory, inside which you will find a tar.gz archive containing your project.
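Before shipping the archive, it is worth looking inside it to confirm that your data files made it in. A small sketch using the standard library's tarfile module follows; list_sdist is a hypothetical helper name, and the file name in the usage note assumes the name and version from the setup.py example above.

```python
import tarfile


def list_sdist(archive_path):
    """Return the sorted member names of a .tar.gz source distribution."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())
```

For the example project, list_sdist("dist/projectname-0.1.tar.gz") would return every path in the archive, which you can scan for the model files and the README.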

pip allows you to install from a local folder, so the easiest way to install your project is to simply unpack the archive, navigate into the directory, and execute: sudo pip install . (the trailing dot tells pip to install from the current directory).