PEP 467 – Minor API improvements for binary sequences
- Author: Nick Coghlan <ncoghlan at gmail.com>, Ethan Furman <ethan at stoneleaf.us>
- Status: Draft
- Type: Standards Track
- Created: 30-Mar-2014
- Python-Version: 3.12
- Post-History: 30-Mar-2014, 15-Aug-2014, 16-Aug-2014, 07-Jun-2016, 01-Sep-2016, 13-Apr-2021, 03-Nov-2021
Abstract
This PEP proposes five small adjustments to the APIs of the bytes and
bytearray types to make it easier to operate entirely in the binary domain:
- Add fromsize alternative constructor
- Add fromint alternative constructor
- Add ascii alternative constructor
- Add getbyte byte retrieval method
- Add iterbytes alternative iterator
Rationale
During the initial development of the Python 3 language specification, the
core bytes type for arbitrary binary data started as the mutable type
that is now referred to as bytearray. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series, for example with PEP 461.
Motivation
With Python 3 and the split between str and bytes, one small but
important area of programming became slightly more difficult, and much more
painful: wire format protocols.
This area of programming is characterized by a mixture of binary data and ASCII compatible segments of text (aka ASCII-encoded text). The addition of the new constructors, methods, and iterators will aid both in writing new wire format code, and in porting any remaining Python 2 wire format code.
Common use-cases include dbf and pdf file formats, email formats, and FTP
and HTTP communications, among many others.
Proposals
Addition of explicit “count and byte initialised sequence” constructors
To replace the now discouraged behavior of passing a single integer to the
constructors, this PEP proposes the addition of an explicit fromsize
alternative constructor as a class method on both bytes and bytearray
whose first argument is the count, and whose second argument is the fill
byte to use (defaults to \x00):
>>> bytes.fromsize(3)
b'\x00\x00\x00'
>>> bytearray.fromsize(3)
bytearray(b'\x00\x00\x00')
>>> bytes.fromsize(5, b'\x0a')
b'\n\n\n\n\n'
>>> bytearray.fromsize(5, fill=b'\x0a')
bytearray(b'\n\n\n\n\n')
fromsize will behave just as the current constructors behave when passed a
single integer, while allowing for non-zero fill values when needed.
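Until such a constructor exists, the proposed behavior can be approximated with sequence repetition. The following is a minimal sketch only: the helper name fromsize, its cls parameter, and the single-byte check on fill are assumptions modelled on the examples above, not an implementation taken from this PEP.

```python
def fromsize(cls, size, fill=b"\x00"):
    """Hypothetical stand-in for the proposed cls.fromsize(size, fill)."""
    if cls not in (bytes, bytearray):
        raise TypeError("cls must be bytes or bytearray")
    if len(fill) != 1:
        # Assumption: the fill value is exactly one byte long.
        raise ValueError("fill must be a single byte")
    # Sequence repetition builds the filled buffer; cls() converts it
    # to the requested type.
    return cls(bytes(fill) * size)


zeros = fromsize(bytes, 3)
newlines = fromsize(bytearray, 5, fill=b"\x0a")
print(zeros)     # b'\x00\x00\x00'
print(newlines)  # bytearray(b'\n\n\n\n\n')
```

With the default fill, the helper returns the same value as the existing bytes(3) call; the explicit fill argument is the only addition.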
Addition of explicit “single byte” constructors
As binary counterparts to the text chr function, this PEP proposes
the addition of an explicit fromint alternative constructor as a class
method on both bytes and bytearray:
>>> bytes.fromint(65)
b'A'
>>> bytearray.fromint(65)
bytearray(b'A')
These methods will only accept integers in the range 0 to 255 (inclusive):
>>> bytes.fromint(512)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: integer must be in range(0, 256)
>>> bytes.fromint(1.0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'float' object cannot be interpreted as an integer
The documentation of the ord builtin will be updated to explicitly note
that bytes.fromint is the primary inverse operation for binary data, while
chr is the inverse operation for text data, and that bytearray.fromint
also exists.
Behaviorally, bytes.fromint(x) will be equivalent to the current
bytes([x]) (and similarly for bytearray). The new spelling is
expected to be easier to discover and easier to read (especially when used
in conjunction with indexing operations on binary sequence types).
As a separate method, the new spelling will also work better with higher
order functions like map.
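The equivalence with bytes([x]) can be tried in current Python. In this sketch the module-level function fromint is a hypothetical stand-in for the proposed class method; it illustrates why a named callable composes more easily with map than the list-wrapping spelling does.

```python
def fromint(value):
    """Hypothetical stand-in for the proposed bytes.fromint(value)."""
    # bytes([value]) raises ValueError outside range(0, 256) and
    # TypeError for non-integers such as 1.0.
    return bytes([value])


single = fromint(65)

# Iterating over a bytes object yields integers, so map() can turn
# each one back into a length 1 bytes object.
pieces = list(map(fromint, b"ABC"))

print(single)  # b'A'
print(pieces)  # [b'A', b'B', b'C']
```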
These new methods intentionally do NOT offer the same level of general integer
support as the existing int.to_bytes conversion method, which allows
arbitrarily large integers to be converted to arbitrarily long bytes objects.
The restriction to only accept non-negative integers that fit in a single byte
means that no byte order information is needed, and there is no need to handle
negative numbers. The documentation of the new methods will refer readers to
int.to_bytes for use cases where handling of arbitrary integers is needed.
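For comparison, the snippet below uses the existing int.to_bytes method to show the extra information a multi-byte conversion needs: a length, a byte order, and an explicit opt-in for negative values. The variable names are illustrative only.

```python
# One byte: the byte order makes no difference, so this matches bytes([65]).
one = (65).to_bytes(1, "big")

# Two bytes: the same integer has two possible layouts.
big = (1024).to_bytes(2, "big")
little = (1024).to_bytes(2, "little")

# Negative numbers are rejected unless signed=True is passed.
negative = (-1).to_bytes(2, "big", signed=True)

print(one)       # b'A'
print(big)       # b'\x04\x00'
print(little)    # b'\x00\x04'
print(negative)  # b'\xff\xff'
```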
Addition of “ascii” constructors
In Python 2, converting an object, such as the integer 123, to bytes (aka the
Python 2 str) was as simple as:
>>> str(123)
'123'
With Python 3 that became the more verbose:
>>> b'%d' % 123
b'123'
or even:
>>> str(123).encode('ascii')
b'123'
This PEP proposes that an ascii method be added to bytes and bytearray
to handle this use-case:
>>> bytes.ascii(123)
b'123'
Note that bytes.ascii() would handle simple ASCII-encodable text correctly,
unlike the ascii() built-in:
>>> ascii("hello").encode('ascii')
b"'hello'"
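The difference from the ascii() built-in can be seen with a small helper. This is a sketch under one stated assumption: that the proposed method returns the ASCII encoding of str(obj), which matches the examples above but is not spelled out in the text. The name ascii_bytes is hypothetical.

```python
def ascii_bytes(obj):
    """Hypothetical stand-in for the proposed bytes.ascii(obj)."""
    # Assumption: the result is the ASCII encoding of str(obj).
    # Text containing non-ASCII characters raises UnicodeEncodeError.
    return str(obj).encode("ascii")


number = ascii_bytes(123)
text = ascii_bytes("hello")

# The ascii() built-in returns a repr-style string, so the quotes
# become part of the encoded result.
quoted = ascii("hello").encode("ascii")

print(number)  # b'123'
print(text)    # b'hello'
print(quoted)  # b"'hello'"
```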
Addition of “getbyte” method to retrieve a single byte
This PEP proposes that bytes and bytearray gain the method getbyte
which will always return bytes:
>>> b'abc'.getbyte(0)
b'a'
If the requested index does not exist, IndexError is raised:
>>> b'abc'.getbyte(9)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index out of range
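The existing ways of reading one position from a binary sequence each differ from the proposed method in one respect: indexing returns an integer, and slicing silently returns an empty result for a missing index. The helper below is a hypothetical approximation of getbyte written with current Python, not code from this PEP.

```python
def getbyte(data, index):
    """Hypothetical stand-in for the proposed data.getbyte(index)."""
    # data[index] raises IndexError for a missing index; wrapping the
    # resulting integer in a list rebuilds a length 1 bytes object.
    return bytes([data[index]])


data = b"abc"
as_int = data[0]          # indexing yields an integer
as_slice = data[0:1]      # slicing yields bytes
empty_slice = data[9:10]  # slicing past the end does not raise
as_byte = getbyte(data, 0)

print(as_int)       # 97
print(as_slice)     # b'a'
print(empty_slice)  # b''
print(as_byte)      # b'a'
```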
Addition of optimised iterator methods that produce bytes objects
This PEP proposes that bytes and bytearray gain an optimised
iterbytes method that produces length 1 bytes objects rather than
integers:
for x in data.iterbytes():
    # x is a length 1 ``bytes`` object, rather than an integer
    ...
For example:
>>> tuple(b"ABC".iterbytes())
(b'A', b'B', b'C')
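The same results can be produced today with a plain generator, although without the optimisation the proposal is aiming for. The function name iterbytes below is a hypothetical stand-in for the proposed method.

```python
def iterbytes(data):
    """Hypothetical stand-in for the proposed data.iterbytes()."""
    for i in range(len(data)):
        # A one-element slice keeps the binary type; bytes() also
        # converts a bytearray slice to an immutable bytes object.
        yield bytes(data[i:i + 1])


ints = tuple(b"ABC")              # default iteration yields integers
chunks = tuple(iterbytes(b"ABC"))

print(ints)    # (65, 66, 67)
print(chunks)  # (b'A', b'B', b'C')
```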
Design discussion
Why not rely on sequence repetition to create zero-initialised sequences?
Zero-initialised sequences can be created via sequence repetition:
>>> b'\x00' * 3
b'\x00\x00\x00'
>>> bytearray(b'\x00') * 3
bytearray(b'\x00\x00\x00')
However, this was also the case when the bytearray type was originally
designed, and the decision was made to add explicit support for it in the
type constructor. The immutable bytes type then inherited that feature
when it was introduced in PEP 3137.
This PEP isn’t revisiting that original design decision, just changing the
spelling as users sometimes find the current behavior of the binary sequence
constructors surprising. In particular, there’s a reasonable case to be made
that bytes(x) (where x is an integer) should behave like the
bytes.fromint(x) proposal in this PEP. Providing both behaviors as separate
class methods avoids that ambiguity.
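The surprise is easy to reproduce with the current constructors, where an integer argument is a count but the same integer inside a list is a byte value:

```python
# An integer argument is treated as a length, producing zero bytes.
zero_filled = bytes(3)

# The same integer wrapped in a list is treated as a byte value.
single_byte = bytes([3])

# A reader expecting a chr()-like result from bytes(65) gets
# 65 zero bytes instead of b'A'.
not_a_letter = bytes(65)

print(zero_filled)        # b'\x00\x00\x00'
print(single_byte)        # b'\x03'
print(len(not_a_letter))  # 65
```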
Omitting the originally proposed builtin function
When submitted to the Steering Council, this PEP proposed the introduction of
a bchr builtin (with the same behavior as bytes.fromint), recreating
the ord/chr/unichr trio from Python 2 under a different naming
scheme (ord/bchr/chr).
The SC indicated they didn’t think this functionality was needed often enough
to justify offering two ways of doing the same thing, especially when one of
those ways was a new builtin function. That part of the proposal was therefore
dropped as being redundant with the bytes.fromint alternative constructor.
Developers who use this method frequently will instead have the option to
define their own bchr = bytes.fromint aliases.
Scope limitation: memoryview
Updating memoryview with the new item retrieval methods is outside the scope
of this PEP.
Copyright
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/main/pep-0467.txt
Last modified: 2022-08-24 22:39:36 GMT