Python bytes

Saturday, September 29, 2012

Python 3.3 is my Favorite Python Release

Today, Python 3.3 was released. During the 4.5 years I've been a CPython core developer, 6 major Python releases (2.6, 2.7, 3.0, 3.1, 3.2, and 3.3) have past by me. In this post, I will explain why 3.3 is the most exciting Python release to me. I will be cherrypicking, consult "What's New in Python 3.3" and the Misc/NEWS file for complete details.

Unicode

PEP 393 completely changed the internal format of Python's Unicode implementation. It does away with the concept of wide and narrow unicode builds. The encoding of a string now depends on its maximum codepoint; there are 1-byte, 2-byte, or 4-byte strings internally. This means, for example, that strings with only ASCII characters can be represented in their most compact format. Partially as a consequence, Unicode standard compilance has improved. Indexing strings always gives code points not surrogates like on < 3.3 narrow builds. str.lower(), str.upper(), and str.title() have been fixed to use full Unicode case-mappings instead of the simple 1-1 ones. The str.casefold method implements the Unicode casefolding algorithm.
If the gods of PyCon talk selection smile on me, I will be giving a talk about this and the history of Unicode in Python.

Glorious Return of the "u" Prefix

Python 3.3 allows the u in front of strings again. Since the b prefix is supported from Python 2.6, code which wants to support 2.x and 3.3 shouldn't need to use unpleasant kludges like six's u() and b() functions. I don't think it would be unreasonable for libraries to only support 2.7 and 3.3+ now just to have the more natural string syntaxes.

Many Nice Things

One of the annoyances in previous Python 3 versions was it was impossible to turn off PEP 3134's implicit exception chaining. The raise exc from None syntax introduced in 3.3 prevents the __context__ of an exception from being printed.
There were improvements in exceptions themselves. PEP 3151 merged IOError, OSError, WindowsError, and various error types in the standard library. It also created a hierarchy of specialized exception subclasses. This means that most code dealing with IO errors won't have to dig into the errno module. For example, this standard pattern

try:
    fp = open("data", "rb")
except OSError as e:
    if e.errno != errno.ENOENT:
        raise
    # Create file

can become

try:
    fp = open("data", "rb")
except FileNotFoundError:
    # Create file

. (Of course, for this sort of thing you can also use the new "x" mode in open().) The errors from incorrect call signatures have improved:

Python 3.3.0+ (3.3:7e83c8ccb1ba, Sep 29 2012, 10:34:54) 
[GCC 4.5.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> def f(a, b, c=5, *, kw1, kw2): pass
... 
>>> f(1, kw2=42)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: f() missing 1 required positional argument: 'b'
>>> f(1, 2, kw2=42)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: f() missing 1 required keyword-only argument: 'kw1'

In the future, I think there should be a ArgumentsError subclass of TypeError which provides programmatic access to the signature mismatch, but this is a start. The new standard library modules, ipaddress, lzma, a dn unittest.mock are certainly worth a look.
The Windows installer has an option to set up PATH for you.

Saturday, May 12, 2012

The Architecture of Open Source Applications volume 2 has been published. It includes my chapter on PyPy. You can buy the dead tree version for $35 on Lulu where all the proceeds go to Amnesty International.

Saturday, July 9, 2011

Behind the scenes of py.test's new assertion rewriting

py.test 2.1 was just released. py.test, which uses the Python assert statement to check test conditions, has long had support for displaying intermediate values in subexpressions of a failing assert statement. This feature is called assertion introspection. Historically, py.test performed assertion introspection by reinterpreting failed assertions in order to glean information about subexpressions. In assertion reinterpreting, py.test actually reruns the assertion noting intermediate values during interpretation. This works pretty well but is subject to several problems, most importantly that assert statements with side-effects can produce strange results because they are evaluated twice on failure. py.test 2.1's main new feature, which I wrote (with generous sponsorship from Merlinux GmbH), is a new assertion introspection technique called assertion rewriting. Assertion rewriting modifies the AST of test modules to produce subexpression information when assertions fail. This blog post will give a peek into how this is done and what the rewritten tests look like.

py.test tries to rewrite every module that it collects as a test module. Assertion rewriting uses a PEP 302 import hook to capture test modules for rewriting. I'm happy to report doing this was easier than I expected. Most of the code in the import hook I had to write was dealing with detecting test modules rather than supporting import's extremely complicated API. Rewriting has a non-zero cost during test collection, so py.test compiles rewritten modules to bytecode and caches them in the PEP 3147 PYC repository, __pycache__. One major thing I did have to account for was the possibility that multiple py.test processes would be writing PYC files. (This is a very real possibility when the xdist plugin is being used. Therefore, py.test uses only atomic operations on the rewritten PYC file. Windows, lacking atomic rename, was a pain here.

I'm now going to demonstrate what py.test's rewriting phase does to a test module. Let's dive in with a failing test for a (broken) function that is supposed to create empty files:

import os

def make_empty_file(name):
    with open(name, "w") as fp:
        fp.write("hello")

def test_make_empty_file():
    name = "/tmp/empty_test"
    make_empty_file(name)
    with open(name, "r") as fp:
        assert not fp.read()

This test nicely demonstrates the problem with py.test's old assertion method mentioned in the first paragraph. If we force the old assertion interpretation mode with --assert=reinterp, we see:

    def test_make_empty_file():
        name = "/tmp/empty_test"
        make_empty_file(name)
        with open(name, "r") as fp:
>           assert not fp.read()
E           AssertionError: (assertion failed, but when it was re-run for printing intermediate values, it did not fail.  Suggestions: compute assert expression before the assert or use --no-assert)                                                                                                                                                                  

test_empty_file.py:11: AssertionError

The problem is that assert statement has the side-effect of reading the file. When py.test reinterprets the assert statement, it uses the same file object, now at EOF, and read() returns an empty string. py.test's new rewriting mode fixes this by scanning the assert for introspection information before executing the test. Running py.test with assertion rewriting enabled gives the desired result:

    def test_make_empty_file():
        name = "/tmp/empty_test"
        make_empty_file(name)
        with open(name, "r") as fp:
>           assert not fp.read()
E           assert not 'hello'
E            +  where 'hello' = ()
E            +    where  = .read

test_empty_file.py:11: AssertionError

So what magic has py.test worked to display such nice debugging information? This is what Python is actually executing:

def test_make_empty_file():
    name = '/tmp/empty_test'
    make_empty_file(name)
    with open(name, 'r') as fp:
        @py_assert1 = fp.read
        @py_assert3 = @py_assert1()
        @py_assert5 = (not @py_assert3)
        if (not @py_assert5):
            @py_format6 = ('assert not %(py4)s\n{%(py4)s = %(py2)s\n{%(py2)s = %(py0)s.read\n}()\n}' %
            {'py0': (@pytest_ar._saferepr(fp) if ('fp' in @py_builtins.locals() is not @py_builtins.globals()) else 'fp'),
             'py2': @pytest_ar._saferepr(@py_assert1),
             'py4': @pytest_ar._saferepr(@py_assert3)})
            raise AssertionError(@pytest_ar._format_explanation(@py_format6))
        del @py_assert5, @py_assert1, @py_assert3

As you can see, it's not going to be winning any awards for beautiful Python! (Ideally, though, you'll never have to see or think about it.) Examining the rewritten code, we see a lot of internal variables starting with "@" have been created. The "@", invalid in Python identifiers, is to make sure internal names don't conflict with any user-defined names which might be in the scope. In the first four written lines under the with statement, the test of the assert statement has been expanded into its component subexpressions. This allows py.test to display the values of subexpressions should the assertion fail. If the assertion fails, the if statement in the fifth line of rewriting evaluates to True and a AssertionError will be raised. Under the if statement is the real mess. This is where the helpful error message is generated. The line starting with @py_format6 is simply does string formatting (with %) on a template generated from the structure of the assert statement. This template is filled in with the intermediate values of the expressions collected above. @py_builtins is the builtins module, used in case the test is shadowing builtins the rewriting code uses. The @pytest_ar variable is a special module of assertion formatting helpers. For example, @pytest_ar._saferepr is like builtin repr but gracefully handles long reprs and __repr__ methods that raise exceptions. A non-obvious trick in the format dict is the expression @pytest_ar._saferepr(fp) if ('fp' in @py_builtins.locals() is not @py_builtins.globals()) else 'fp'. This checks whether fp is a local variable or not and customizes the display accordingly. After the initial formatting, the helper function _format_explanation is called. This function produces the indentation and "+" you see in the error message. Finally, we note that if the assertion doesn't fail, py.test cleans up after itself by deleting temporary variables.

The example above is a fairly tame (and luckily also typical) assertion. Rewriting gets more "exciting" when boolean operations and comparisons enter because they require short circuit evaluation, which complicates both the expansion of expressions and formatting (think lots of nested ifs).

In conclusion, py.test's new assertion rewriting fixes some long standing issues with assertion introspection and continues py.test's long tradition of excellent debugging support. (There are now three(!) assertion introspection methods in py.test: two reinterpretation implementations as well as rewriting) I just hope I haven't scared you completely off py.test! :)

Tuesday, March 15, 2011

six 1.0.0 final finally released

I finally found time to release six 1.0.0. six is a Python 2 and 3 compatibility library. You can find the documentation and download it on PyPI.

There haven't been many changes since the beta: one bugfix and one new advanced feature. The bugfix is that unicode escapes are now properly decoded with the u() fake literal in both Python 2 and 3. The feature is that there is now an api for adding items to the "six.moves" interface. This was requested by ActiveState, which uses six in the ActivePython package manager.

Enjoy!

Saturday, November 20, 2010

New version of six, lean and mean

I just released a new version a of six, my Python 2/3 compatibility library. The main feature in this release has that six has been flattened into one source file on the philosophy of "flat is better than nested" and for ease of distributing in projects. I've also switched from Bazaar to Mercurial, since the latter seems more popular and it's all the same to me. The issue tracker and source code is no on BitBucket.

I'm calling this version, 1.0.0 beta 1. Assuming no one complains, I think I'd like to release a final version in the next month or so.

Your feedback is appreciated.

Tuesday, June 29, 2010

Six: Python 2/3 compatibility helpers

Increasingly, I've seen a movement towards supporting Python 2 and Python 3 in the same code base. Having ported a few projects myself, I decided to collect the code I've duplicated between them into a library. The result is six. It includes fake byte and unicode literals, b() and u() and has wrappers for syntax changes such as print and exec. You can check out the documentation on PyPi.The license is MIT, so I hope it can see wide use in projects planning to support Python 2 and 3 simultaneously.

Friday, March 26, 2010

On commit messages

I would like to address the issue of commit messages. Good commit messages can make finding bugs and understanding the timeline of a project easy, and bad ones can result in an infuriating waste of time reading diffs and trying to locate information.

First of all, all commits should be atomic, that is they shouldn't include unrelated changes. Fixing a typo or spacing while fixing bug in related code is acceptable, but fixing 6 bugs and adding 2 features in the same commit makes it hard for people to parse out what change was for in the future. A good rule of thumb is that if a summary of your changes can't fit in one line, it's probably too big.

The first line of the commit message is most important part. This is especially true today, where many DVCSes only show the first line of the commit by default in their log command. The summary line should succinctly summarize what your change is and what it accomplishes. It need not be a full sentence, but just a bug number or general statement ("fix this") is not appropriate. The best summary lines quickly inform any log browser of the purpose and changes in the commit. Summary lines should also never be wrapped. Nothing is more annoying than reading a summary line which is cut off in the middle by a line break. Simple typo fixes do not require complicated messages. Good examples:

fix #2345 by preventing add() from accepting strings

fix a segfault in foo_my_bars() #4563

fix spelling

add a Python interface to the tokenizer #3222

and bad ones:

test and a fix

ugg

bah

a huge change to Foo class

why does this not work?

bug #4543

After the summary line can optionally come a body. A blank line should always separate the commit message from the body and different sections of the body from another. Bodies should also always be line wrapped. The body can include any of the following:

Bullet points describing various aspect of the change in more detail.

A paragraph description explaining why how something was implemented or why it's written a certain way.

A reference to mailing list discussions or decisions that lead to the commit.

Authors and attributions.

Any other significant information about the commit. For example, explain how it affects external components or might result in unexpected behavior.

Some projects follow the convention of listing affected files in bullet points and describing the individual changes to each. I personally find a prose summary of the changes in the body along with a diff or the verbose version of the log which shows changed files more helpful than this technique.
Good examples of complete commit messages:


"""
normalize encoding before opening file #3242

This change requires that tokenizer.c be linked with the Unicode
library.
"""

"""
silence foo warnings by default

Approved by BDFL in
http://mail.python.org/pipermail/mailinglist/bladh.html
"""

"""
support unicode in shlex module #4523

This is implemented by providing a separate class for Unicode and
requiring a locale to be set before parsing commences.

Patch by J. Hacker and J. Programmer
"""

"""
boost the speed of keyword argument comparisons

This improves some function calls by over 30% by comparing for
identity before falling back to the regular comparison. stringobject.c
was modified to provide faster access to a string's value.
"""