PEP 402 – Simplified Package Layout and Partitioning
- Author:
- P.J. Eby
- Status:
- Rejected
- Type:
- Standards Track
- Topic:
- Packaging
- Created:
- 12-Jul-2011
- Python-Version:
- 3.3
- Post-History:
- 20-Jul-2011
- Replaces:
- 382
Rejection Notice
On the first day of sprints at US PyCon 2012 we had a long and fruitful discussion about PEP 382 and PEP 402. We ended up rejecting both but a new PEP will be written to carry on in the spirit of PEP 402. Martin von Löwis wrote up a summary: [3].
Abstract
This PEP proposes an enhancement to Python’s package importing to:
- Surprise users of other languages less,
- Make it easier to convert a module into a package, and
- Support dividing packages into separately installed components (ala “namespace packages”, as described in PEP 382)
The proposed enhancements do not change the semantics of any currently-importable directory layouts, but make it possible for packages to use a simplified directory layout (that is not importable currently).
However, the proposed changes do NOT add any performance overhead to
the importing of existing modules or packages, and performance for the
new directory layout should be about the same as that of previous
“namespace package” solutions (such as pkgutil.extend_path()
).
The Problem
“Most packages are like modules. Their contents are highly interdependent and can’t be pulled apart. [However,] some packages exist to provide a separate namespace. … It should be possible to distribute sub-packages or submodules of these [namespace packages] independently.”—Jim Fulton, shortly before the release of Python 2.3 [1]
When new users come to Python from other languages, they are often
confused by Python’s package import semantics. At Google, for example,
Guido received complaints from “a large crowd with pitchforks” [2]
that the requirement for packages to contain an __init__
module
was a “misfeature”, and should be dropped.
In addition, users coming from languages like Java or Perl are sometimes confused by a difference in Python’s import path searching.
In most other languages that have a similar path mechanism to Python’s
sys.path
, a package is merely a namespace that contains modules
or classes, and can thus be spread across multiple directories in
the language’s path. In Perl, for instance, a Foo::Bar
module
will be searched for in Foo/
subdirectories all along the module
include path, not just in the first such subdirectory found.
Worse, this is not just a problem for new users: it prevents anyone
from easily splitting a package into separately-installable
components. In Perl terms, it would be as if every possible Net::
module on CPAN had to be bundled up and shipped in a single tarball!
For that reason, various workarounds for this latter limitation exist,
circulated under the term “namespace packages”. The Python standard
library has provided one such workaround since Python 2.3 (via the
pkgutil.extend_path()
function), and the “setuptools” package
provides another (via pkg_resources.declare_namespace()
).
The workarounds themselves, however, fall prey to a third issue with Python’s way of laying out packages in the filesystem.
Because a package must contain an __init__
module, any attempt
to distribute modules for that package must necessarily include that
__init__
module, if those modules are to be importable.
However, the very fact that each distribution of modules for a package
must contain this (duplicated) __init__
module, means that OS
vendors who package up these module distributions must somehow handle
the conflict caused by several module distributions installing that
__init__
module to the same location in the filesystem.
This led to the proposing of PEP 382 (“Namespace Packages”) - a way to signal to Python’s import machinery that a directory was importable, using unique filenames per module distribution.
However, there was more than one downside to this approach. Performance for all import operations would be affected, and the process of designating a package became even more complex. New terminology had to be invented to explain the solution, and so on.
As terminology discussions continued on the Import-SIG, it soon became apparent that the main reason it was so difficult to explain the concepts related to “namespace packages” was because Python’s current way of handling packages is somewhat underpowered, when compared to other languages.
That is, in other popular languages with package systems, no special term is needed to describe “namespace packages”, because all packages generally behave in the desired fashion.
Rather than being an isolated single directory with a special marker module (as in Python), packages in other languages are typically just the union of appropriately-named directories across the entire import or inclusion path.
In Perl, for example, the module Foo
is always found in a
Foo.pm
file, and a module Foo::Bar
is always found in a
Foo/Bar.pm
file. (In other words, there is One Obvious Way to
find the location of a particular module.)
This is because Perl considers a module to be different from a package: the package is purely a namespace in which other modules may reside, and is only coincidentally the name of a module as well.
In current versions of Python, however, the module and the package are
more tightly bound together. Foo
is always a module – whether it
is found in Foo.py
or Foo/__init__.py
– and it is tightly
linked to its submodules (if any), which must reside in the exact
same directory where the __init__.py
was found.
On the positive side, this design choice means that a package is quite self-contained, and can be installed, copied, etc. as a unit just by performing an operation on the package’s root directory.
On the negative side, however, it is non-intuitive for beginners, and
requires a more complex step to turn a module into a package. If
Foo
begins its life as Foo.py
, then it must be moved and
renamed to Foo/__init__.py
.
Conversely, if you intend to create a Foo.Bar
module from the
start, but have no particular module contents to put in Foo
itself, then you have to create an empty and seemingly-irrelevant
Foo/__init__.py
file, just so that Foo.Bar
can be imported.
(And these issues don’t just confuse newcomers to the language, either: they annoy many experienced developers as well.)
So, after some discussion on the Import-SIG, this PEP was created as an alternative to PEP 382, in an attempt to solve all of the above problems, not just the “namespace package” use cases.
And, as a delightful side effect, the solution proposed in this PEP
does not affect the import performance of ordinary modules or
self-contained (i.e. __init__
-based) packages.
The Solution
In the past, various proposals have been made to allow more intuitive approaches to package directory layout. However, most of them failed because of an apparent backward-compatibility problem.
That is, if the requirement for an __init__
module were simply
dropped, it would open up the possibility for a directory named, say,
string
on sys.path
, to block importing of the standard library
string
module.
Paradoxically, however, the failure of this approach does not arise
from the elimination of the __init__
requirement!
Rather, the failure arises because the underlying approach takes for granted that a package is just ONE thing, instead of two.
In truth, a package comprises two separate, but related entities: a module (with its own, optional contents), and a namespace where other modules or packages can be found.
In current versions of Python, however, the module part (found in
__init__
) and the namespace for submodule imports (represented
by the __path__
attribute) are both initialized at the same time,
when the package is first imported.
And, if you assume this is the only way to initialize these two
things, then there is no way to drop the need for an __init__
module, while still being backwards-compatible with existing directory
layouts.
After all, as soon as you encounter a directory on sys.path
matching the desired name, that means you’ve “found” the package, and
must stop searching, right?
Well, not quite.
A Thought Experiment
Let’s hop into the time machine for a moment, and pretend we’re back
in the early 1990s, shortly before Python packages and __init__.py
have been invented. But, imagine that we are familiar with
Perl-like package imports, and we want to implement a similar system
in Python.
We’d still have Python’s module imports to build on, so we could
certainly conceive of having Foo.py
as a parent Foo
module
for a Foo
package. But how would we implement submodule and
subpackage imports?
Well, if we didn’t have the idea of __path__
attributes yet,
we’d probably just search sys.path
looking for Foo/Bar.py
.
But we’d only do it when someone actually tried to import
Foo.Bar
.
NOT when they imported Foo
.
And that lets us get rid of the backwards-compatibility problem
of dropping the __init__
requirement, back here in 2011.
How?
Well, when we import Foo
, we’re not even looking for Foo/
directories on sys.path
, because we don’t care yet. The only
point at which we care, is the point when somebody tries to actually
import a submodule or subpackage of Foo
.
That means that if Foo
is a standard library module (for example),
and I happen to have a Foo
directory on sys.path
(without
an __init__.py
, of course), then nothing breaks. The Foo
module is still just a module, and it’s still imported normally.
Self-Contained vs. “Virtual” Packages
Of course, in today’s Python, trying to import Foo.Bar
will
fail if Foo
is just a Foo.py
module (and thus lacks a
__path__
attribute).
So, this PEP proposes to dynamically create a __path__
, in the
case where one is missing.
That is, if I try to import Foo.Bar
the proposed change to the
import machinery will notice that the Foo
module lacks a
__path__
, and will therefore try to build one before proceeding.
And it will do this by making a list of all the existing Foo/
subdirectories of the directories listed in sys.path
.
If the list is empty, the import will fail with ImportError
, just
like today. But if the list is not empty, then it is saved in
a new Foo.__path__
attribute, making the module a “virtual
package”.
That is, because it now has a valid __path__
, we can proceed
to import submodules or subpackages in the normal way.
Now, notice that this change does not affect “classic”, self-contained
packages that have an __init__
module in them. Such packages
already have a __path__
attribute (initialized at import time)
so the import machinery won’t try to create another one later.
This means that (for example) the standard library email
package
will not be affected in any way by you having a bunch of unrelated
directories named email
on sys.path
. (Even if they contain
*.py
files.)
But it does mean that if you want to turn your Foo
module into
a Foo
package, all you have to do is add a Foo/
directory
somewhere on sys.path
, and start adding modules to it.
But what if you only want a “namespace package”? That is, a package that is only a namespace for various separately-distributed submodules and subpackages?
For example, if you’re Zope Corporation, distributing dozens of
separate tools like zc.buildout
, each in packages under the zc
namespace, you don’t want to have to make and include an empty
zc.py
in every tool you ship. (And, if you’re a Linux or other
OS vendor, you don’t want to deal with the package installation
conflicts created by trying to install ten copies of zc.py
to the
same location!)
No problem. All we have to do is make one more minor tweak to the
import process: if the “classic” import process fails to find a
self-contained module or package (e.g., if import zc
fails to find
a zc.py
or zc/__init__.py
), then we once more try to build a
__path__
by searching for all the zc/
directories on
sys.path
, and putting them in a list.
If this list is empty, we raise ImportError
. But if it’s
non-empty, we create an empty zc
module, and put the list in
zc.__path__
. Congratulations: zc
is now a namespace-only,
“pure virtual” package! It has no module contents, but you can still
import submodules and subpackages from it, regardless of where they’re
located on sys.path
.
(By the way, both of these additions to the import protocol (i.e. the
dynamically-added __path__
, and dynamically-created modules)
apply recursively to child packages, using the parent package’s
__path__
in place of sys.path
as a basis for generating a
child __path__
. This means that self-contained and virtual
packages can contain each other without limitation, with the caveat
that if you put a virtual package inside a self-contained one, it’s
gonna have a really short __path__
!)
Backwards Compatibility and Performance
Notice that these two changes only affect import operations that
today would result in ImportError
. As a result, the performance
of imports that do not involve virtual packages is unaffected, and
potential backward compatibility issues are very restricted.
Today, if you try to import submodules or subpackages from a module
with no __path__
, it’s an immediate error. And of course, if you
don’t have a zc.py
or zc/__init__.py
somewhere on sys.path
today, import zc
would likewise fail.
Thus, the only potential backwards-compatibility issues are:
- Tools that expect package directories to have an
__init__
module, that expect directories without an__init__
module to be unimportable, or that expect__path__
attributes to be static, will not recognize virtual packages as packages.(In practice, this just means that tools will need updating to support virtual packages, e.g. by using
pkgutil.walk_modules()
instead of using hardcoded filesystem searches.) - Code that expects certain imports to fail may now do something unexpected. This should be fairly rare in practice, as most sane, non-test code does not import things that are expected not to exist!
The biggest likely exception to the above would be when a piece of
code tries to check whether some package is installed by importing
it. If this is done only by importing a top-level module (i.e., not
checking for a __version__
or some other attribute), and there
is a directory of the same name as the sought-for package on
sys.path
somewhere, and the package is not actually installed,
then such code could be fooled into thinking a package is installed
that really isn’t.
For example, suppose someone writes a script (datagen.py
)
containing the following code:
try:
import json
except ImportError:
import simplejson as json
And runs it in a directory laid out like this:
datagen.py
json/
foo.js
bar.js
If import json
succeeded due to the mere presence of the json/
subdirectory, the code would incorrectly believe that the json
module was available, and proceed to fail with an error.
However, we can prevent corner cases like these from arising, simply
by making one small change to the algorithm presented so far. Instead
of allowing you to import a “pure virtual” package (like zc
),
we allow only importing of the contents of virtual packages.
That is, a statement like import zc
should raise ImportError
if there is no zc.py
or zc/__init__.py
on sys.path
. But,
doing import zc.buildout
should still succeed, as long as there’s
a zc/buildout.py
or zc/buildout/__init__.py
on sys.path
.
In other words, we don’t allow pure virtual packages to be imported directly, only modules and self-contained packages. (This is an acceptable limitation, because there is no functional value to importing such a package by itself. After all, the module object will have no contents until you import at least one of its subpackages or submodules!)
Once zc.buildout
has been successfully imported, though, there
will be a zc
module in sys.modules
, and trying to import it
will of course succeed. We are only preventing an initial import
from succeeding, in order to prevent false-positive import successes
when clashing subdirectories are present on sys.path
.
So, with this slight change, the datagen.py
example above will
work correctly. When it does import json
, the mere presence of a
json/
directory will simply not affect the import process at all,
even if it contains .py
files. The json/
directory will still
only be searched in the case where an import like import
json.converter
is attempted.
Meanwhile, tools that expect to locate packages and modules by
walking a directory tree can be updated to use the existing
pkgutil.walk_modules()
API, and tools that need to inspect
packages in memory should use the other APIs described in the
Standard Library Changes/Additions section below.
Specification
A change is made to the existing import process, when importing
names containing at least one .
– that is, imports of modules
that have a parent package.
Specifically, if the parent package does not exist, or exists but
lacks a __path__
attribute, an attempt is first made to create a
“virtual path” for the parent package (following the algorithm
described in the section on virtual paths, below).
If the computed “virtual path” is empty, an ImportError
results,
just as it would today. However, if a non-empty virtual path is
obtained, the normal import of the submodule or subpackage proceeds,
using that virtual path to find the submodule or subpackage. (Just
as it would have with the parent’s __path__
, if the parent package
had existed and had a __path__
.)
When a submodule or subpackage is found (but not yet loaded),
the parent package is created and added to sys.modules
(if it
didn’t exist before), and its __path__
is set to the computed
virtual path (if it wasn’t already set).
In this way, when the actual loading of the submodule or subpackage
occurs, it will see a parent package existing, and any relative
imports will work correctly. However, if no submodule or subpackage
exists, then the parent package will not be created, nor will a
standalone module be converted into a package (by the addition of a
spurious __path__
attribute).
Note, by the way, that this change must be applied recursively: that
is, if foo
and foo.bar
are pure virtual packages, then
import foo.bar.baz
must wait until foo.bar.baz
is found before
creating module objects for both foo
and foo.bar
, and then
create both of them together, properly setting the foo
module’s
.bar
attribute to point to the foo.bar
module.
In this way, pure virtual packages are never directly importable:
an import foo
or import foo.bar
by itself will fail, and the
corresponding modules will not appear in sys.modules
until they
are needed to point to a successfully imported submodule or
self-contained subpackage.
Virtual Paths
A virtual path is created by obtaining a PEP 302 “importer” object for
each of the path entries found in sys.path
(for a top-level
module) or the parent __path__
(for a submodule).
(Note: because sys.meta_path
importers are not associated with
sys.path
or __path__
entry strings, such importers do not
participate in this process.)
Each importer is checked for a get_subpath()
method, and if
present, the method is called with the full name of the module/package
the path is being constructed for. The return value is either a
string representing a subdirectory for the requested package, or
None
if no such subdirectory exists.
The strings returned by the importers are added to the path list
being built, in the same order as they are found. (None
values
and missing get_subpath()
methods are simply skipped.)
The resulting list (whether empty or not) is then stored in a
sys.virtual_package_paths
dictionary, keyed by module name.
This dictionary has two purposes. First, it serves as a cache, in the event that more than one attempt is made to import a submodule of a virtual package.
Second, and more importantly, the dictionary can be used by code that
extends sys.path
at runtime to update imported packages’
__path__
attributes accordingly. (See Standard Library
Changes/Additions below for more details.)
In Python code, the virtual path construction algorithm would look something like this:
def get_virtual_path(modulename, parent_path=None):
if modulename in sys.virtual_package_paths:
return sys.virtual_package_paths[modulename]
if parent_path is None:
parent_path = sys.path
path = []
for entry in parent_path:
# Obtain a PEP 302 importer object - see pkgutil module
importer = pkgutil.get_importer(entry)
if hasattr(importer, 'get_subpath'):
subpath = importer.get_subpath(modulename)
if subpath is not None:
path.append(subpath)
sys.virtual_package_paths[modulename] = path
return path
And a function like this one should be exposed in the standard
library as e.g. imp.get_virtual_path()
, so that people creating
__import__
replacements or sys.meta_path
hooks can reuse it.
Standard Library Changes/Additions
The pkgutil
module should be updated to handle this
specification appropriately, including any necessary changes to
extend_path()
, iter_modules()
, etc.
Specifically the proposed changes and additions to pkgutil
are:
- A new
extend_virtual_paths(path_entry)
function, to extend existing, already-imported virtual packages’__path__
attributes to include any portions found in a newsys.path
entry. This function should be called by applications extendingsys.path
at runtime, e.g. when adding a plugin directory or an egg to the path.The implementation of this function does a simple top-down traversal of
sys.virtual_package_paths
, and performs any necessaryget_subpath()
calls to identify what path entries need to be added to the virtual path for that package, given that path_entry has been added tosys.path
. (Or, in the case of sub-packages, adding a derived subpath entry, based on their parent package’s virtual path.)(Note: this function must update both the path values in
sys.virtual_package_paths
as well as the__path__
attributes of any corresponding modules insys.modules
, even though in the common case they will both be the samelist
object.) - A new
iter_virtual_packages(parent='')
function to allow top-down traversal of virtual packages fromsys.virtual_package_paths
, by yielding the child virtual packages of parent. For example, callingiter_virtual_packages("zope")
might yieldzope.app
andzope.products
(if they are virtual packages listed insys.virtual_package_paths
), but notzope.foo.bar
. (This function is needed to implementextend_virtual_paths()
, but is also potentially useful for other code that needs to inspect imported virtual packages.) ImpImporter.iter_modules()
should be changed to also detect and yield the names of modules found in virtual packages.
In addition to the above changes, the zipimport
importer should
have its iter_modules()
implementation similarly changed. (Note:
current versions of Python implement this via a shim in pkgutil
,
so technically this is also a change to pkgutil
.)
Last, but not least, the imp
module (or importlib
, if
appropriate) should expose the algorithm described in the virtual
paths section above, as a
get_virtual_path(modulename, parent_path=None)
function, so that
creators of __import__
replacements can use it.
Implementation Notes
For users, developers, and distributors of virtual packages:
- While virtual packages are easy to set up and use, there is still
a time and place for using self-contained packages. While it’s not
strictly necessary, adding an
__init__
module to your self-contained packages lets users of the package (and Python itself) know that all of the package’s code will be found in that single subdirectory. In addition, it lets you define__all__
, expose a public API, provide a package-level docstring, and do other things that make more sense for a self-contained project than for a mere “namespace” package. sys.virtual_package_paths
is allowed to contain entries for non-existent or not-yet-imported package names; code that uses its contents should not assume that every key in this dictionary is also present insys.modules
or that importing the name will necessarily succeed.- If you are changing a currently self-contained package into a
virtual one, it’s important to note that you can no longer use its
__file__
attribute to locate data files stored in a package directory. Instead, you must search__path__
or use the__file__
of a submodule adjacent to the desired files, or of a self-contained subpackage that contains the desired files.(Note: this caveat is already true for existing users of “namespace packages” today. That is, it is an inherent result of being able to partition a package, that you must know which partition the desired data file lives in. We mention it here simply so that new users converting from self-contained to virtual packages will also be aware of it.)
- XXX what is the __file__ of a “pure virtual” package?
None
? Some arbitrary string? The path of the first directory with a trailing separator? No matter what we put, some code is going to break, but the last choice might allow some code to accidentally work. Is that good or bad?
For those implementing PEP 302 importer objects:
- Importers that support the
iter_modules()
method (used bypkgutil
to locate importable modules and packages) and want to add virtual package support should modify theiriter_modules()
method so that it discovers and lists virtual packages as well as standard modules and packages. To do this, the importer should simply list all immediate subdirectory names in its jurisdiction that are valid Python identifiers.XXX This might list a lot of not-really-packages. Should we require importable contents to exist? If so, how deep do we search, and how do we prevent e.g. link loops, or traversing onto different filesystems, etc.? Ick. Also, if virtual packages are listed, they still can’t be imported, which is a problem for the way that
pkgutil.walk_modules()
is currently implemented. - “Meta” importers (i.e., importers placed on
sys.meta_path
) do not need to implementget_subpath()
, because the method is only called on importers corresponding tosys.path
entries and__path__
entries. If a meta importer wishes to support virtual packages, it must do so entirely within its ownfind_module()
implementation.Unfortunately, it is unlikely that any such implementation will be able to merge its package subpaths with those of other meta importers or
sys.path
importers, so the meaning of “supporting virtual packages” for a meta importer is currently undefined!(However, since the intended use case for meta importers is to replace Python’s normal import process entirely for some subset of modules, and the number of such importers currently implemented is quite small, this seems unlikely to be a big issue in practice.)
References
Copyright
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/main/pep-0402.txt
Last modified: 2022-06-14 21:22:20 GMT