How to Include a Data Folder in Your Python Project with `pyproject.toml`

What will you learn?

In this tutorial, you will learn how to include a folder of data files in your Python project using the pyproject.toml file. It is aimed at anyone who needs to package data alongside their code.

Introduction to the Problem and Solution

When working on Python projects that require packaging or distribution, it’s common to have non-code files, such as datasets or configuration files, that your code depends on. Previously, including these files meant listing them by hand in setup scripts or MANIFEST.in files when using setuptools. The pyproject.toml file, introduced by PEP 518 and adopted by packaging tools such as Poetry and Flit, gives a project a single place for its build configuration. PEP 518 itself only standardizes how build-system requirements are declared; how additional files are included is configured in each tool’s own section of the file.

This tutorial shows how to use the pyproject.toml file to include a data folder in your project. It walks through a Poetry configuration and explains each part, so that your data is packaged alongside your code whenever you build and distribute your package.

Code

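[tool.poetry]
name = "example_project"
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]
# include is a key of [tool.poetry], not a table of its own.
# Depending on the Poetry version, an entry without an explicit format
# may end up in the sdist only, so the formats are spelled out here.
include = [
    { path = "data_folder/*", format = ["sdist", "wheel"] },
]

[tool.poetry.dependencies]
python = "^3.8"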


If you already use Poetry for packaging, add or adjust this include setting within the Poetry configuration ([tool.poetry]) of your existing pyproject.toml instead of creating a second copy of the section.

Explanation

The TOML configuration above shows how to include extra non-code directories in your package using Poetry, a widely used dependency management and packaging tool for Python.

  • [tool.poetry]: This section holds the basic metadata about your project, such as name, version, and authors.
  • [tool.poetry.dependencies]: Project dependencies are specified here; only the Python version constraint is listed in this example.
  • include: This key belongs to the [tool.poetry] section and lists additional paths that should be included when the package is built; “data_folder/*” directs Poetry to package everything within the ‘data_folder’ directory along with the code. Each entry can also name the build formats (sdist, wheel) it applies to.

With this approach, anyone installing the package via pip (or another installer that supports Poetry-built packages) receives the data files together with the code, provided the files are included in the wheel and not only in the sdist.
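Once the data ships with the package, the code still has to find it after installation. The following is a minimal sketch of one way to do that with the standard library’s importlib.resources. It rests on assumptions that go beyond the configuration above: the data folder is assumed to live inside the importable package (example_project/data_folder/), the file greeting.txt and the helper names read_packaged_text and demo_read are hypothetical, and importlib.resources.files requires Python 3.9 or newer. The demo builds a throwaway package in a temporary directory so the sketch can run on its own.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path


def read_packaged_text(package: str, *parts: str) -> str:
    """Return the text of a data file that ships inside an importable package."""
    resource = resources.files(package)  # Python 3.9+
    for part in parts:
        resource = resource / part
    return resource.read_text(encoding="utf-8")


def demo_read() -> str:
    """Build a throwaway package with a data folder and read a file from it."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical layout: example_project/data_folder/greeting.txt
        package_dir = Path(tmp) / "example_project"
        data_dir = package_dir / "data_folder"
        data_dir.mkdir(parents=True)
        (package_dir / "__init__.py").write_text("", encoding="utf-8")
        (data_dir / "greeting.txt").write_text(
            "hello from data_folder\n", encoding="utf-8"
        )

        # Make the temporary package importable for the duration of the demo.
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            return read_packaged_text(
                "example_project", "data_folder", "greeting.txt"
            )
        finally:
            # Undo the import-system changes so the demo leaves no trace.
            sys.path.remove(tmp)
            sys.modules.pop("example_project", None)


if __name__ == "__main__":
    print(demo_read(), end="")
```

Looking the file up through the package name, rather than through a path relative to the current working directory, keeps the lookup independent of where the package ends up being installed.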

    1. What is pyproject.toml?

      • A configuration file introduced by PEP 518 that gives developers a single format for declaring their project’s build-system requirements; packaging tools also read their own settings from it.
    2. Why use Poetry’s include setting instead of directly adding my data folder into my source code?

      • An explicit include states exactly which static resources are part of your package’s distribution. It keeps code and data/resources separate, and each entry can be limited to particular build formats (sdist or wheel).
    3. Can I include multiple folders or specific types of files?

      • Yes. The include key takes a list, so you can give several paths, and glob patterns are supported; for example, “data_folder/*.json” matches all JSON files directly inside data_folder.
    4. Does this work with other tools besides Poetry?

      • This example uses Poetry-specific syntax under [tool.poetry], but similar options exist elsewhere: Flit reads include and exclude lists from [tool.flit.sdist], and setuptools uses MANIFEST.in or its package-data settings under [tool.setuptools]. Refer to the respective documentation for the precise syntax.
    5. Is it possible to exclude certain files from being included?

      • Yes. Just as include lists patterns to add, the exclude key in the [tool.poetry] section lists paths and patterns to leave out of the final build, for example files inside a packaged directory that should not be distributed.
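
Before building, it can help to preview which files a set of include and exclude patterns would select. The sketch below does this with pathlib globbing. It only approximates the behaviour described in the questions above: Poetry’s own matching may differ in details (for instance, this sketch lists matching files only and does not expand matched directories), and the function names, folder names, and file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Set


def preview_included_files(
    project_root: Path,
    include_patterns: Iterable[str],
    exclude_patterns: Iterable[str] = (),
) -> List[str]:
    """List files matched by the include patterns but not by the exclude patterns.

    Paths are returned relative to project_root, sorted, with forward slashes.
    """

    def expand(patterns: Iterable[str]) -> Set[str]:
        matched: Set[str] = set()
        for pattern in patterns:
            for path in project_root.glob(pattern):
                if path.is_file():
                    matched.add(path.relative_to(project_root).as_posix())
        return matched

    return sorted(expand(include_patterns) - expand(exclude_patterns))


def demo_preview() -> List[str]:
    """Create a small throwaway project tree and preview a pattern set on it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for relative in (
            "data_folder/a.json",
            "data_folder/b.csv",
            "data_folder/scratch.tmp",
            "config/settings.json",
            "config/notes.txt",
        ):
            target = root / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text("", encoding="utf-8")

        return preview_included_files(
            root,
            include_patterns=["data_folder/*", "config/*.json"],
            exclude_patterns=["data_folder/*.tmp"],
        )


if __name__ == "__main__":
    for name in demo_preview():
        print(name)
```

A preview like this is only a convenience; listing the contents of the built sdist and wheel remains the reliable check of what was actually packaged.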

Conclusion

Including non-code assets such as datasets and configuration files in a Python package is straightforward once you know where your packaging tool expects the setting. With Poetry, an include entry in pyproject.toml is enough to ship a data folder with your code, and Flit and setuptools offer comparable options. Whether you maintain a hobby project or an enterprise application, checking the contents of the built sdist and wheel before publishing confirms that the data actually made it into the package.
