Saturday, September 29, 2012

Python 3.3 is my Favorite Python Release

Today, Python 3.3 was released. During the 4.5 years I've been a CPython core developer, 6 major Python releases (2.6, 2.7, 3.0, 3.1, 3.2, and 3.3) have passed by me. In this post, I will explain why 3.3 is the most exciting Python release to me. I will be cherry-picking; consult "What's New in Python 3.3" and the Misc/NEWS file for complete details.

Unicode

PEP 393 completely changed the internal format of Python's Unicode implementation. It does away with the concept of wide and narrow Unicode builds. The encoding of a string now depends on its maximum code point; internally there are 1-byte, 2-byte, and 4-byte strings. This means, for example, that strings with only ASCII characters can be represented in their most compact format. Partially as a consequence, Unicode standard compliance has improved. Indexing strings always gives code points, not surrogates as it could on narrow builds before 3.3. str.lower(), str.upper(), and str.title() have been fixed to use full Unicode case mappings instead of the simple 1-1 ones. The str.casefold method implements the Unicode case folding algorithm.
If the gods of PyCon talk selection smile on me, I will be giving a talk about this and the history of Unicode in Python.
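A few of these behaviors are easy to check at the prompt. The snippet below is a minimal sketch that assumes Python 3.3 or newer; the sample strings are my own picks, and the size comparison relies on CPython's sys.getsizeof(), so treat it as an illustration rather than a guarantee about other implementations.

```python
import sys

# One code point outside the Basic Multilingual Plane. A narrow build
# before 3.3 stored this as a surrogate pair and reported a length of 2.
astral = "\U0001F40D"
print(len(astral))

# Full case mappings may change the length of a string.
print("ß".upper())

# casefold() is intended for caseless comparisons.
print("Straße".casefold() == "STRASSE".casefold())

# PEP 393: storage per character depends on the widest code point present.
ascii_size = sys.getsizeof("a" * 100)
wide_size = sys.getsizeof(astral * 100)
print(ascii_size < wide_size)
```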

Glorious Return of the "u" Prefix

Python 3.3 allows the u in front of strings again. Since the b prefix has been supported since Python 2.6, code which wants to support 2.x and 3.3 shouldn't need to use unpleasant kludges like six's u() and b() functions. I don't think it would be unreasonable for libraries to only support 2.7 and 3.3+ now just to have the more natural string syntaxes.
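As a small illustration, the following lines should parse and behave the same on 2.6, 2.7, and 3.3+ with no helper functions. The variable names and sample values are just mine.

```python
# Text and binary literals that are valid on Python 2.6+, and on 3.3+.
text = u"caf\xe9"
data = b"caf\xc3\xa9"

# The bytes are the UTF-8 encoding of the text.
print(data.decode("utf-8") == text)
```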

Many Nice Things

One of the annoyances in previous Python 3 versions was that it was impossible to turn off PEP 3134's implicit exception chaining. The raise exc from None syntax introduced in 3.3 prevents the __context__ of an exception from being printed.
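Here is a minimal sketch of the new syntax. ConfigError and get_port are hypothetical names I made up for the example; the attributes inspected at the end are the standard ones on exception objects in 3.3+.

```python
class ConfigError(Exception):
    """Hypothetical application-level error."""


def get_port(settings):
    try:
        return settings["port"]
    except KeyError:
        # "from None" keeps the KeyError out of the printed traceback.
        raise ConfigError("no port configured") from None


try:
    get_port({})
except ConfigError as exc:
    caught = exc

print(caught.__suppress_context__)
print(caught.__cause__)
```

Note that the original KeyError is still stored on __context__; it just isn't displayed.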
There were improvements in exceptions themselves. PEP 3151 merged IOError, OSError, WindowsError, and various error types in the standard library. It also created a hierarchy of specialized exception subclasses. This means that most code dealing with IO errors won't have to dig into the errno module. For example, this standard pattern
import errno

try:
    fp = open("data", "rb")
except OSError as e:
    if e.errno != errno.ENOENT:
        raise
    # Create file
can become
try:
    fp = open("data", "rb")
except FileNotFoundError:
    pass  # Create file
(Of course, for this sort of thing you can also use the new "x" mode in open().) The errors from incorrect call signatures have improved:
Python 3.3.0+ (3.3:7e83c8ccb1ba, Sep 29 2012, 10:34:54) 
[GCC 4.5.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> def f(a, b, c=5, *, kw1, kw2): pass
... 
>>> f(1, kw2=42)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: f() missing 1 required positional argument: 'b'
>>> f(1, 2, kw2=42)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: f() missing 1 required keyword-only argument: 'kw1'
In the future, I think there should be an ArgumentsError subclass of TypeError which provides programmatic access to the signature mismatch, but this is a start. The new standard library modules, ipaddress, lzma, and unittest.mock, are certainly worth a look.
The Windows installer has an option to set up PATH for you.
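To round off the exception changes mentioned earlier, here is a rough sketch of the "x" mode of open() combined with one of the new OSError subclasses. create_exclusively is a hypothetical helper name, and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile


def create_exclusively(path):
    """Create an empty file at path; return False if it already exists."""
    try:
        # "x" fails with FileExistsError instead of truncating the file.
        with open(path, "x"):
            pass
    except FileExistsError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data")
    first = create_exclusively(target)
    second = create_exclusively(target)

print(first, second)
```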







Saturday, May 12, 2012

The Architecture of Open Source Applications volume 2 has been published. It includes my chapter on PyPy. You can buy the dead tree version for $35 on Lulu where all the proceeds go to Amnesty International.

Saturday, July 9, 2011

Behind the scenes of py.test's new assertion rewriting

py.test 2.1 was just released. py.test, which uses the Python assert statement to check test conditions, has long had support for displaying intermediate values in subexpressions of a failing assert statement. This feature is called assertion introspection. Historically, py.test performed assertion introspection by reinterpreting failed assertions in order to glean information about subexpressions. In assertion reinterpreting, py.test actually reruns the assertion noting intermediate values during interpretation. This works pretty well but is subject to several problems, most importantly that assert statements with side-effects can produce strange results because they are evaluated twice on failure. py.test 2.1's main new feature, which I wrote (with generous sponsorship from Merlinux GmbH), is a new assertion introspection technique called assertion rewriting. Assertion rewriting modifies the AST of test modules to produce subexpression information when assertions fail. This blog post will give a peek into how this is done and what the rewritten tests look like.

py.test tries to rewrite every module that it collects as a test module. Assertion rewriting uses a PEP 302 import hook to capture test modules for rewriting. I'm happy to report doing this was easier than I expected. Most of the code in the import hook I had to write was dealing with detecting test modules rather than supporting import's extremely complicated API. Rewriting has a non-zero cost during test collection, so py.test compiles rewritten modules to bytecode and caches them in the PEP 3147 PYC repository, __pycache__. One major thing I did have to account for was the possibility that multiple py.test processes would be writing PYC files. (This is a very real possibility when the xdist plugin is being used.) Therefore, py.test uses only atomic operations on the rewritten PYC file. Windows, lacking atomic rename, was a pain here.
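To show what "atomic operations" means in practice, here is a generic sketch of the write-to-a-temporary-file-then-rename pattern. It is not py.test's actual code: write_atomically is a hypothetical helper, and it leans on os.replace(), which only exists in Python 3.3 and later.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write data to path so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as fp:
            fp.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        # Don't leave the temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise


with tempfile.TemporaryDirectory() as cache_dir:
    pyc = os.path.join(cache_dir, "test_example.pyc")
    write_atomically(pyc, b"first")
    write_atomically(pyc, b"second")
    with open(pyc, "rb") as fp:
        final_contents = fp.read()
    leftover = os.listdir(cache_dir)

print(final_contents, leftover)
```

Two processes racing on the same path each write their own temporary file, so the worst case is that one complete file replaces another complete file.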

I'm now going to demonstrate what py.test's rewriting phase does to a test module. Let's dive in with a failing test for a (broken) function that is supposed to create empty files:

import os

def make_empty_file(name):
    with open(name, "w") as fp:
        fp.write("hello")

def test_make_empty_file():
    name = "/tmp/empty_test"
    make_empty_file(name)
    with open(name, "r") as fp:
        assert not fp.read()


This test nicely demonstrates the problem with py.test's old assertion method mentioned in the first paragraph. If we force the old assertion interpretation mode with --assert=reinterp, we see:


    def test_make_empty_file():
        name = "/tmp/empty_test"
        make_empty_file(name)
        with open(name, "r") as fp:
>           assert not fp.read()
E           AssertionError: (assertion failed, but when it was re-run for printing intermediate values, it did not fail. Suggestions: compute assert expression before the assert or use --no-assert)

test_empty_file.py:11: AssertionError


The problem is that the assert statement has the side effect of reading the file. When py.test reinterprets the assert statement, it uses the same file object, now at EOF, and read() returns an empty string. py.test's new rewriting mode fixes this by scanning the assert for introspection information before executing the test. Running py.test with assertion rewriting enabled gives the desired result:


    def test_make_empty_file():
        name = "/tmp/empty_test"
        make_empty_file(name)
        with open(name, "r") as fp:
>           assert not fp.read()
E           assert not 'hello'
E            +  where 'hello' = ()
E            +    where = .read

test_empty_file.py:11: AssertionError


So what magic has py.test worked to display such nice debugging information? This is what Python is actually executing:

def test_make_empty_file():
    name = '/tmp/empty_test'
    make_empty_file(name)
    with open(name, 'r') as fp:
        @py_assert1 = fp.read
        @py_assert3 = @py_assert1()
        @py_assert5 = (not @py_assert3)
        if (not @py_assert5):
            @py_format6 = ('assert not %(py4)s\n{%(py4)s = %(py2)s\n{%(py2)s = %(py0)s.read\n}()\n}' %
                {'py0': (@pytest_ar._saferepr(fp) if ('fp' in @py_builtins.locals() is not @py_builtins.globals()) else 'fp'),
                 'py2': @pytest_ar._saferepr(@py_assert1),
                 'py4': @pytest_ar._saferepr(@py_assert3)})
            raise AssertionError(@pytest_ar._format_explanation(@py_format6))
        del @py_assert5, @py_assert1, @py_assert3


As you can see, it's not going to be winning any awards for beautiful Python! (Ideally, though, you'll never have to see or think about it.) Examining the rewritten code, we see a lot of internal variables starting with "@" have been created. The "@", invalid in Python identifiers, is to make sure internal names don't conflict with any user-defined names which might be in the scope. In the first three rewritten lines under the with statement, the test of the assert statement has been expanded into its component subexpressions. This allows py.test to display the values of subexpressions should the assertion fail. If the assertion fails, the condition of the if statement in the fourth rewritten line evaluates to True and an AssertionError will be raised. Under the if statement is the real mess. This is where the helpful error message is generated. The line starting with @py_format6 simply does string formatting (with %) on a template generated from the structure of the assert statement. This template is filled in with the intermediate values of the expressions collected above. @py_builtins is the builtins module, used in case the test is shadowing builtins the rewriting code uses. The @pytest_ar variable is a special module of assertion formatting helpers. For example, @pytest_ar._saferepr is like builtin repr but gracefully handles long reprs and __repr__ methods that raise exceptions. A non-obvious trick in the format dict is the expression @pytest_ar._saferepr(fp) if ('fp' in @py_builtins.locals() is not @py_builtins.globals()) else 'fp'. This checks whether fp is a local variable or not and customizes the display accordingly. After the initial formatting, the helper function _format_explanation is called. This function produces the indentation and "+" you see in the error message. Finally, we note that if the assertion doesn't fail, py.test cleans up after itself by deleting temporary variables.
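If you'd like to play with the general idea of rewriting asserts at the AST level, here is a deliberately tiny sketch. It is not py.test's rewriter: it only attaches the source text of the test expression as the assertion message and records no intermediate values. AssertMessageAdder and run_rewritten are hypothetical names, and ast.unparse() requires Python 3.9 or newer.

```python
import ast


class AssertMessageAdder(ast.NodeTransformer):
    """Give bare assert statements a message containing their source."""

    def visit_Assert(self, node):
        if node.msg is None:
            source = ast.unparse(node.test)
            node.msg = ast.Constant(value="assertion failed: " + source)
        return node


def run_rewritten(source, namespace):
    """Parse, rewrite, compile, and execute source in namespace."""
    tree = AssertMessageAdder().visit(ast.parse(source))
    # Newly created nodes have no line numbers; compile() needs them.
    ast.fix_missing_locations(tree)
    exec(compile(tree, "<rewritten>", "exec"), namespace)


try:
    run_rewritten("text = 'hello'\nassert not text", {})
except AssertionError as exc:
    message = str(exc)
    print(message)
```

The important property is the same as in the real thing: the test expression is evaluated exactly once, so side effects can't produce a different answer the second time around.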

The example above is a fairly tame (and luckily also typical) assertion. Rewriting gets more "exciting" when boolean operations and comparisons enter because they require short circuit evaluation, which complicates both the expansion of expressions and formatting (think lots of nested ifs).

In conclusion, py.test's new assertion rewriting fixes some long-standing issues with assertion introspection and continues py.test's long tradition of excellent debugging support. (There are now three(!) assertion introspection methods in py.test: two reinterpretation implementations as well as rewriting.) I just hope I haven't scared you completely off py.test! :)

Tuesday, March 15, 2011

six 1.0.0 final finally released

I finally found time to release six 1.0.0. six is a Python 2 and 3 compatibility library. You can find the documentation and download it on PyPI.

There haven't been many changes since the beta: one bugfix and one new advanced feature. The bugfix is that unicode escapes are now properly decoded with the u() fake literal in both Python 2 and 3. The feature is that there is now an API for adding items to the "six.moves" interface. This was requested by ActiveState, which uses six in the ActivePython package manager.

Enjoy!

Saturday, November 20, 2010

New version of six, lean and mean

I just released a new version of six, my Python 2/3 compatibility library. The main feature in this release is that six has been flattened into one source file, following the philosophy of "flat is better than nested" and for ease of distributing in projects. I've also switched from Bazaar to Mercurial, since the latter seems more popular and it's all the same to me. The issue tracker and source code are now on BitBucket.

I'm calling this version 1.0.0 beta 1. Assuming no one complains, I think I'd like to release a final version in the next month or so.

Your feedback is appreciated.

Tuesday, June 29, 2010

Six: Python 2/3 compatibility helpers

Increasingly, I've seen a movement towards supporting Python 2 and Python 3 in the same code base. Having ported a few projects myself, I decided to collect the code I've duplicated between them into a library. The result is six. It includes fake byte and unicode literals, b() and u(), and has wrappers for syntax changes such as print and exec. You can check out the documentation on PyPI. The license is MIT, so I hope it can see wide use in projects planning to support Python 2 and 3 simultaneously.
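To give a flavor of what a "fake literal" is, here is a rough sketch of how such helpers can be written. Treat it as an illustration of the technique under my own assumptions (for instance, that byte strings are spelled with latin-1 characters), not as a copy of the library's source.

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    def b(s):
        """Fake bytes literal: a str of latin-1 characters becomes bytes."""
        return s.encode("latin-1")

    def u(s):
        """Fake unicode literal: str is already unicode on Python 3."""
        return s
else:
    def b(s):
        """Fake bytes literal: str is already bytes on Python 2."""
        return s

    def u(s):
        """Fake unicode literal: decode escapes into a unicode object."""
        return unicode(s, "unicode_escape")  # noqa: F821 (Python 2 only)

print(b("\x00\xff") == b"\x00\xff")
print(u("text") == "text")
```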

Friday, March 26, 2010

On commit messages

I would like to address the issue of commit messages. Good commit messages can make finding bugs and understanding the timeline of a project easy, and bad ones can result in an infuriating waste of time reading diffs and trying to locate information.

First of all, all commits should be atomic; that is, they shouldn't include unrelated changes. Fixing a typo or spacing while fixing a bug in related code is acceptable, but fixing 6 bugs and adding 2 features in the same commit makes it hard for people to work out in the future what each change was for. A good rule of thumb is that if a summary of your changes can't fit in one line, the commit is probably too big.

The first line of the commit message is the most important part. This is especially true today, when many DVCSes only show the first line of the commit by default in their log command. The summary line should succinctly summarize what your change is and what it accomplishes. It need not be a full sentence, but just a bug number or general statement ("fix this") is not appropriate. The best summary lines quickly inform any log browser of the purpose and changes in the commit. Summary lines should also never be wrapped. Nothing is more annoying than reading a summary line which is cut off in the middle by a line break. Simple typo fixes do not require complicated messages. Good examples:
fix #2345 by preventing add() from accepting strings

fix a segfault in foo_my_bars() #4563

fix spelling

add a Python interface to the tokenizer #3222

and bad ones:
test and a fix

ugg

bah

a huge change to Foo class

why does this not work?

bug #4543


After the summary line can optionally come a body. A blank line should always separate the summary line from the body and different sections of the body from one another. Bodies should also always be line wrapped. The body can include any of the following:

  • Bullet points describing various aspects of the change in more detail.

  • A paragraph explaining how something was implemented or why it's written a certain way.

  • A reference to mailing list discussions or decisions that led to the commit.

  • Authors and attributions.

  • Any other significant information about the commit. For example, explain how it affects external components or might result in unexpected behavior.
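The mechanical parts of these rules (a real summary line, a blank separator, a wrapped body) are simple enough to check with a script. The following is only a sketch: check_commit_message is a hypothetical helper, the 72-column limit is my assumption since I haven't named a width above, and no script can judge whether a summary is actually informative.

```python
import re

MAX_BODY_WIDTH = 72  # assumed wrap width; pick whatever your project uses


def check_commit_message(message):
    """Return a list of mechanical problems found in a commit message."""
    problems = []
    lines = message.strip("\n").split("\n")
    summary = lines[0].strip()
    if not summary:
        problems.append("missing summary line")
    elif re.fullmatch(r"(bug\s*)?#?\d+", summary, re.IGNORECASE):
        problems.append("summary is only a bug number")
    if len(lines) > 1:
        if lines[1].strip():
            problems.append("no blank line between summary and body")
        for line in lines[2:]:
            # URLs can't be wrapped, so they are exempt from the limit.
            if len(line) > MAX_BODY_WIDTH and not line.startswith("http"):
                problems.append("body line is not wrapped")
                break
    return problems


print(check_commit_message("fix a segfault in foo_my_bars() #4563"))
print(check_commit_message("bug #4543"))
```

Something like this could run from a commit hook, but I'd treat its output as a nudge rather than a gate.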


Some projects follow the convention of listing affected files in bullet points and describing the individual changes to each. I personally find a prose summary of the changes in the body along with a diff or the verbose version of the log which shows changed files more helpful than this technique.
Good examples of complete commit messages:

"""
normalize encoding before opening file #3242

This change requires that tokenizer.c be linked with the Unicode
library.
"""

"""
silence foo warnings by default

Approved by BDFL in
http://mail.python.org/pipermail/mailinglist/bladh.html
"""

"""
support unicode in shlex module #4523

This is implemented by providing a separate class for Unicode and
requiring a locale to be set before parsing commences.

Patch by J. Hacker and J. Programmer
"""

"""
boost the speed of keyword argument comparisons

This improves some function calls by over 30% by comparing for
identity before falling back to the regular comparison. stringobject.c
was modified to provide faster access to a string's value.
"""

Saturday, October 3, 2009

% formatting to str.format converter

Recent discussions on Python-dev have revolved around transitioning the standard library to the new str.format method. One suggestion was to write an automatic converter for old format strings to new ones. I've taken on the task and written mod2format at https://code.launchpad.net/~gutworth/+junk/mod2format. You can try it out by running "python3 -m mod2format [your format strings here]".
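To give an idea of what such a conversion involves, here is a toy sketch that handles only positional %s, %r, and %d plus the %% escape. It is not mod2format's code; convert_simple is a hypothetical name, and widths, precisions, flags, and mapping keys are ignored entirely.

```python
import re

_SPEC = re.compile(r"%(%|[srd])")


def convert_simple(fmt):
    """Convert a simple %-style format string to str.format syntax."""
    # Literal braces must be doubled in str.format templates.
    fmt = fmt.replace("{", "{{").replace("}", "}}")
    counter = [0]

    def repl(match):
        code = match.group(1)
        if code == "%":
            return "%"
        index = counter[0]
        counter[0] += 1
        if code == "r":
            return "{%d!r}" % index
        if code == "d":
            return "{%d:d}" % index
        return "{%d}" % index  # %s: str() is what format() does by default

    return _SPEC.sub(repl, fmt)


old = "%s is %d years old (100%% sure, says %r)"
new = convert_simple(old)
print(new)
print(new.format("Al", 3, "Bo") == old % ("Al", 3, "Bo"))
```

Even this toy version shows the awkward corners: braces need escaping, %% becomes a plain percent sign, and %d is stricter after conversion because {:d} refuses floats.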

Friday, September 4, 2009

Review: IronPython in Action

Disclaimer: Manning Press and Michael Foord very generously sent me a free copy of the book.

One thing that always slightly annoys me when I'm reading a book about Python programming is having the first few chapters devoted to introducing the Python language. However, I'm sure experienced .NET people felt the same while scanning through the introductory chapters on .NET, which was totally new to me. (I'm also glad there was an appendix about C# syntax; I learned that C# seems to have invented a new syntax or keyword for every possible programming paradigm.) IronPython in Action seems to do a very good job, overall, of catering to both Python programmers tiptoeing into IronPython and .NET and C# developers finding the light of dynamic programming.

I found the web programming part of the book, especially the part on Silverlight, most interesting, since embedding Python in the browser seems like a lot more fun than writing cross-browser JavaScript. Michael Foord's Try Python (source) is a good demonstration of what can be accomplished. (Though, I wonder if PyPy's sandboxing could someday be used in the browser to do the same thing.)

I would have appreciated a chapter or section on parallel processing, since IronPython offers much better threading and concurrency primitives than CPython. Perhaps an example where IronPython can perform a task that would be impossible on other implementations of Python is in order. I want to see how .NET can make concurrency easy and pythonic.

Before reading this book, I had dismissed .NET as a non-cross-platform hunk of Javaish APIs. I see now, though, that IronPython is able to combine the beauty of Python with some of .NET's better APIs (I would still rather use PyQt for GUI programming. Windows Forms has not improved.) to make a powerful development platform.

Tuesday, August 25, 2009

parser-compiler branch merged

The PyPy project I have been working on over the summer, rewriting the parser and compiler, has finally been merged back to trunk (during the now-ending JIT sprint in Gothenburg, Sweden). I wrote a little summary up about it on the PyPy blog.

Saturday, June 27, 2009

Python 3.1 released!

I'm happy to announce that today Python 3.1 was released. I won't dwell on the new features, since those are more completely listed elsewhere. I'm quite happy with this release. A lot of work has been put into making 3.x as stable as its older 2.x siblings. I would like to see a lot of libraries and applications start seriously looking at the port to 3.x now. As always, there's a bunch of core developers waiting to help on the python-porting mailing list.

Anyway, 3.1 is available for download in source and several binary formats on python.org.

Wednesday, May 6, 2009

Python 3.1 beta 1 released

I'm pleased to announce that the first beta of Python 3.1 has been released. In addition to the features found in the previous alphas, the beta has several more improvements. Most important, I think, is PEP 383. It defines a way for undecodable paths in file systems to be safely round-tripped through Unicode strings. The repr of floats also now uses a new algorithm which determines the shortest string that still round-trips to the same value.
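Both changes are easy to see in a couple of lines. This is a minimal sketch with a made-up file name; the "surrogateescape" error handler is the mechanism PEP 383 adds.

```python
# A byte string that is not valid UTF-8, as an undecodable file name might be.
raw = b"report-\xff.txt"

# PEP 383: the bad byte is smuggled through as a lone surrogate...
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))

# ...and comes back out unchanged when encoded with the same handler.
print(name.encode("utf-8", "surrogateescape") == raw)

# The new float repr picks the shortest string that round-trips.
print(repr(0.1))
print(float(repr(1.1)) == 1.1)
```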

It is planned that this will be the only beta in order for 3.1 to make a final in late June. Please download it and try it out. This is 3.x's future, and in my opinion, very much an improvement. As always, you can submit any problems you see to bugs.python.org.

Tuesday, March 24, 2009

unittest - now with test skipping (finally)

Yesterday, I was happy to commit a patch which added test skipping and expected failure support to the venerable unittest module. It adds a skip() method to TestCase, which marks the current test being run as skipped, as well as a set of useful decorators. Here's a short example:

import sys
import unittest

class SkippingExample(unittest.TestCase):

    @unittest.skip("testing skipping")
    def test_skip_me(self):
        self.fail("shouldn't happen")

    def test_normal(self):
        self.assertEqual(1, 1)

    @unittest.skipIf(sys.version_info < (2, 6),
                     "not supported in this version")
    def test_show_skip_if(self):
        # testing some things here
        pass

    @unittest.expectedFailure
    def test_expected_failure(self):
        self.fail("this should happen unfortunately")


# Yes, you can skip whole classes, too!
@unittest.skip("classing skipping")
class CompletelySkippedTest(unittest.TestCase):

    def test_not_run_at_all(self):
        self.fail("shouldn't happen")


if __name__ == "__main__":
    unittest.main()


Running it in verbose mode gives:


__main__.CompletelySkippedTest ... skipped 'classing skipping'
test_expected_failure (__main__.SkippingExample) ... expected failure
test_normal (__main__.SkippingExample) ... ok
test_show_skip_if (__main__.SkippingExample) ... ok
test_skip_me (__main__.SkippingExample) ... skipped 'testing skipping'

----------------------------------------------------------------------
Ran 5 tests in 0.010s

OK (skipped=2, expected failures=1)


I have high hopes for this and Python's regression tests. Hopefully it will simplify the ugly system of test skipping we have now. It should also help us pacify other implementations who want CPython implementation detail tests skipped.

Saturday, February 14, 2009

Python 3.0.1 released

The first bugfix release, 3.0.1, of the new Python 3.x series has been released! Many embarrassing bugs have been fixed. Among other things:


  • The wsgiref package has been fixed for 3.x.

  • A few hideous bugs in the new IO implementation have been squashed. In addition, a few cases have been optimized. (Note that IO in 3.x is still quite a bit slower than 2.x; more on that later.)

  • Unbuffered standard streams (the "-u" flag) have been restored.



This is actually a bit more than an average point release. Somehow, the builtin cmp() and __cmp__ slipped into the final release; they have been removed in 3.0.1.

Our next goal is 3.1. We plan to compress our rather huge release cycle to make 3.1 between 1/2 and 1 year after 3.0. The focus of 3.1 will be stabilizing the feature set and changes in 3.x. This includes the rewrite of IO in C for speed and Brett's rewrite of import in Python.

Thursday, January 1, 2009

MRO magic

Here's some more of the less-publicized evil you can accomplish with metaclasses. Here's a simple example file. (Note that although I'm using Python 3.0, this works with all versions supporting new-style classes.)

# mromagic.py
class A(object):

    def a_method(self):
        print("A")

class B(object):

    def b_method(self):
        print("B")

class MROMagicMeta(type):

    def mro(cls):
        return (cls, B, object)

class C(A, metaclass=MROMagicMeta):

    def c_method(self):
        print("C")


Now let's play with this a little:

>>> import mromagic
>>> mycls = mromagic.C()
>>> mycls
<mromagic.C object at 0x622890>
>>> mycls.c_method()
C
>>> mycls.a_method()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'C' object has no attribute 'a_method'
>>> mycls.b_method()
B
>>> type(mycls).__mro__
(<class 'mromagic.C'>, <class 'mromagic.B'>, <class 'object'>)
>>> type(mycls).__bases__
(<class 'mromagic.A'>,)


How does this work? By overriding mro() on a metaclass, we can define a custom __mro__ for our class. Python will then traverse it instead of the one computed by the default implementation, which is provided by type.mro().
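You can see the two side by side by calling the default implementation directly. This is a small self-contained sketch using Python 3 syntax; FixedMROMeta and the empty classes are stand-ins I made up so the snippet runs on its own.

```python
class A(object):
    pass


class B(object):
    pass


class FixedMROMeta(type):

    def mro(cls):
        # Ignore the declared bases and substitute our own order.
        return (cls, B, object)


class C(A, metaclass=FixedMROMeta):
    pass


# What attribute lookup actually uses:
print(C.__mro__ == (C, B, object))

# What type.mro() would have computed from the declared bases:
print(type.mro(C) == [C, A, object])
```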

Thursday, December 11, 2008

Get help porting your package!

To help the community with porting their packages to Python 3, we have created the python-porting mailing list. Many core developers are subscribed to the list, so you should be able to get excellent advice on 2to3, the bytes/unicode split, C-API, or other incompatibilities.

Wednesday, December 3, 2008

Is it all futile?

Today, the much-anticipated Python 3.0 final was released. Truly, this is a historic release for the Python community, the first intentionally incompatible Python release. It's been a long time in the making, and I applaud everybody who is responsible for proving Py3k is not vaporware. Guido and the other decision makers on Python-dev also deserve credit for not making py3k changes too gratuitous; many revolutionary ideas and features were proposed and rejected. I have every confidence that every incompatibility was well thought out. Much thought has also been given to making the transition as easy as possible for the community. The suggested migration path, fixing Py3k warnings in 2.6 and then applying 2to3, has been used with a fair amount of success on several projects. We've even had several Py3k love letters!

Still, I can't help being a little worried. I think the bytes and str divide will be difficult for people, especially with IO, where everything has to be dealt with in bytes. We may see many "x.encode('ascii')" lines popping up all over codebases. Userland libraries will need to maintain compatibility with 2.4 and 2.5 for a while; that significantly complicates the dream of maintaining just one branch for 2.x and py3k. 2to3 is not even close to perfect and will only correct the surface incompatibilities of syntax between the versions. I'm also concerned about burnout. The excitement of a new major version will certainly spur an interest in porting for a few months, but I suspect it won't be so fun after the aura wears off a bit. I hope common base libraries (PIL, Twisted, lxml, etc.) are ported soon. It will build the bridge for everything else to cross over too.

Of course, what I'm forgetting is the amazing Python community. Whatever the results, a new era has certainly begun. We just need time.

Sunday, October 19, 2008

Pure Python Dictionary Implementation

For those curious about how CPython's dict implementation works, I've written a Python implementation using the same algorithms. Aside from the educational value, it's pretty useless because it doesn't support None as a value and is extremely slow. You can get the source in a Bazaar repo: http://code.python.org/python/users/benjamin.peterson/pydict/

"""
A Python dict implementation.
"""

import collections

MINSIZE = 8
PERTURB_SHIFT = 5
dummy = "<dummy key>"


class Entry(object):
    """
    A hash table entry.

    Attributes:
       * key - The key for this entry.
       * hash - The hash of the key.
       * value - The value associated with the key.
    """

    __slots__ = ("key", "value", "hash")

    def __init__(self):
        self.key = None
        self.value = None
        self.hash = 0

    def __repr__(self):
        return "<Entry: key={0} value={1}>".format(self.key, self.value)



class Dict(object):
    """
    A mapping interface implemented as a hash table.

    Attributes:
        * used - The number of entries used in the table.
        * filled - used + number of entries with a dummy key.
        * table - List of entries; contains the actual dict data.
        * mask - Length of table - 1. Used to fetch values.
    """

    __slots__ = ("filled", "used", "mask", "table")


    def __init__(self, arg=None, **kwargs):
        self.clear()
        self._update(arg, kwargs)

    @classmethod
    def fromkeys(cls, keys, value=0):
        """
        Return a new dictionary from a sequence of keys.
        """
        d = cls()
        for key in keys:
            d[key] = value
        return d

    def clear(self):
        """
        Clear the dictionary of all data.
        """
        self.filled = 0
        self.used = 0
        self.mask = MINSIZE - 1
        self.table = []
        # Initialize the table to a clean slate of entries.
        for i in range(MINSIZE):
            self.table.append(Entry())

    def pop(self, *args):
        """
        Remove and return the value for a key.
        """
        have_default = len(args) == 2
        try:
            v = self[args[0]]
        except KeyError:
            if have_default:
                return args[1]
            raise
        else:
            del self[args[0]]
            return v

    def popitem(self):
        """
        Remove and return any key-value pair from the dictionary.
        """
        if self.used == 0:
            raise KeyError("empty dictionary")
        entry0 = self.table[0]
        entry = entry0
        i = 0
        if entry0.value is None:
            # The first entry in the table's hash is abused to hold the index to
            # the next place to look for a value to pop.
            i = entry0.hash
            if i > self.mask or i < 1:
                i = 1
            entry = self.table[i]
            while entry.value is None:
                i += 1
                if i > self.mask:
                    i = 1
                entry = self.table[i]
        res = entry.key, entry.value
        self._del(entry)
        # Set the next place to start.
        entry0.hash = i + 1
        return res

    def setdefault(self, key, default=0):
        """
        If key is in the dictionary, return it. Otherwise, set it to the default
        value.
        """
        val = self._lookup(key).value
        if val is None:
            self[key] = default
            return default
        return val

    def _lookup(self, key):
        """
        Find the entry for a key.
        """
        key_hash = hash(key)
        i = key_hash & self.mask
        entry = self.table[i]
        if entry.key is None or entry.key is key:
            return entry
        free = None
        if entry.key is dummy:
            free = entry
        elif entry.hash == key_hash and key == entry.key:
            return entry

        perturb = key_hash
        while True:
            i = (i << 2) + i + perturb + 1
            entry = self.table[i & self.mask]
            if entry.key is None:
                return entry if free is None else free
            if entry.key is key or \
                    (entry.hash == key_hash and key == entry.key):
                return entry
            elif entry.key is dummy and free is None:
                # Remember the first reusable slot seen along the probe path.
                free = entry
            perturb >>= PERTURB_SHIFT

        assert False, "not reached"

    def _resize(self, minused):
        """
        Resize the dictionary to at least minused.
        """
        newsize = MINSIZE
        # Find the smallest value for newsize.
        while newsize <= minused and newsize > 0:
            newsize <<= 1
        oldtable = self.table
        # Create a new table newsize long.
        newtable = []
        while len(newtable) < newsize:
            newtable.append(Entry())
        # Replace the old table.
        self.table = newtable
        self.used = 0
        self.filled = 0
        # The mask must match the new table before entries are reinserted.
        self.mask = newsize - 1
        # Copy the old data into the new table.
        for entry in oldtable:
            if entry.value is not None:
                self._insert_into_clean(entry)
            elif entry.key is dummy:
                entry.key = None

    def _insert_into_clean(self, entry):
        """
        Insert an item in a clean dict. This is a helper for resizing.
        """
        i = entry.hash & self.mask
        new_entry = self.table[i]
        perturb = entry.hash
        while new_entry.key is not None:
            i = (i << 2) + i + perturb + 1
            new_entry = self.table[i & self.mask]
            perturb >>= PERTURB_SHIFT
        new_entry.key = entry.key
        new_entry.value = entry.value
        new_entry.hash = entry.hash
        self.used += 1
        self.filled += 1

    def _insert(self, key, value):
        """
        Add a new value to the dictionary or replace an old one.
        """
        entry = self._lookup(key)
        if entry.value is None:
            self.used += 1
            if entry.key is not dummy:
                self.filled += 1
        entry.key = key
        entry.hash = hash(key)
        entry.value = value

    def _del(self, entry):
        """
        Mark an entry as free with the dummy key.
        """
        entry.key = dummy
        entry.value = None
        self.used -= 1

    def __getitem__(self, key):
        value = self._lookup(key).value
        if value is None:
            # Check if we're a subclass.
            if type(self) is not Dict:
                # Try to call the __missing__ method, if the subclass has one.
                missing = getattr(self, "__missing__", None)
                if missing is not None:
                    return missing(key)
            raise KeyError("no such key: {0!r}".format(key))
        return value

    def __setitem__(self, key, what):
        # None is used as a marker for empty entries, so it can't be in a
        # dictionary.
        assert what is not None and key is not None, \
            "key and value must not be None"
        old_used = self.used
        self._insert(key, what)
        # Maybe resize the dict.
        if not (self.used > old_used and
                self.filled*3 >= (self.mask + 1)*2):
            return
        # Large dictionaries (more than 5000 entries) are only doubled in size.
        factor = 2 if self.used > 5000 else 4
        self._resize(factor*self.used)

    def __delitem__(self, key):
        entry = self._lookup(key)
        if entry.value is None:
            raise KeyError("no such key: {0!r}".format(key))
        self._del(entry)

    def __contains__(self, key):
        """
        Check if a key is in the dictionary.
        """
        return self._lookup(key).value is not None

    def __eq__(self, other):
        if not isinstance(other, Dict):
            try:
                # Try to coerce the other to a Dict, so we can compare it.
                other = Dict(other)
            except TypeError:
                return NotImplemented
        if self.used != other.used:
            # They're not the same size.
            return False
        # Look through the table and compare every entry, breaking out early if
        # we find a difference.
        for entry in self.table:
            if entry.value is not None:
                try:
                    bval = other[entry.key]
                except KeyError:
                    return False
                if not bval == entry.value:
                    return False
        return True

    def __ne__(self, other):
        return not self == other

    def keys(self):
        """
        Return a list of keys in the dictionary.
        """
        return [entry.key for entry in self.table if entry.value is not None]

    def values(self):
        """
        Return a list of values in the dictionary.
        """
        return [entry.value for entry in self.table if entry.value is not None]

    def items(self):
        """
        Return a list of key-value pairs.
        """
        return [(entry.key, entry.value) for entry in self.table
                if entry.value is not None]

    def __iter__(self):
        return DictKeysIterator(self)

    def itervalues(self):
        """
        Return an iterator over the values in the dictionary.
        """
        return DictValuesIterator(self)

    def iterkeys(self):
        """
        Return an iterator over the keys in the dictionary.
        """
        return DictKeysIterator(self)

    def iteritems(self):
        """
        Return an iterator over key-value pairs.
        """
        return DictItemsIterator(self)

    def _merge(self, mapping):
        """
        Update the dictionary from a mapping.
        """
        for key in mapping.keys():
            self[key] = mapping[key]

    def _from_sequence(self, seq):
        for double in seq:
            if len(double) != 2:
                raise ValueError("{0!r} doesn't have a length of 2".format(
                    double))
            self[double[0]] = double[1]

    def _update(self, arg, kwargs):
        if arg:
            if isinstance(arg, collections.Mapping):
                self._merge(arg)
            else:
                self._from_sequence(arg)
        if kwargs:
            self._merge(kwargs)

    def update(self, arg=None, **kwargs):
        """
        Update the dictionary from a mapping or sequence containing key-value
        pairs. Any existing values are overwritten.
        """
        self._update(arg, kwargs)

    def get(self, key, default=None):
        """
        Return the value for key if it exists, otherwise the default.
        """
        try:
            return self[key]
        except KeyError:
            return default

    def __len__(self):
        return self.used

    def __repr__(self):
        r = ["{0!r} : {1!r}".format(k, v) for k, v in self.iteritems()]
        return "Dict({" + ", ".join(r) + "})"

collections.Mapping.register(Dict)


class DictIterator(object):

    def __init__(self, d):
        self.d = d
        self.used = self.d.used
        self.len = self.d.used
        self.pos = 0

    def __iter__(self):
        return self

    def next(self):
        # Check if the dictionary has been mutated under us.
        if self.used != self.d.used:
            # Make this state permanent.
            self.used = -1
            raise RuntimeError("dictionary size changed during iteration")
        i = self.pos
        # Skip over empty and deleted slots.
        while i <= self.d.mask and self.d.table[i].value is None:
            i += 1
        self.pos = i + 1
        if i > self.d.mask:
            # We're done.
            raise StopIteration
        self.len -= 1
        return self._extract(self.d.table[i])

    __next__ = next

    def _extract(self, entry):
        return getattr(entry, self.kind)

    def __len__(self):
        return self.len

class DictKeysIterator(DictIterator):
    kind = "key"

class DictValuesIterator(DictIterator):
    kind = "value"

class DictItemsIterator(DictIterator):

    def _extract(self, entry):
        return entry.key, entry.value
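
The mutation check in DictIterator.next mirrors what CPython's built-in dict does. As a rough illustration (using the built-in dict rather than the Dict class above, and a helper name, mutate_while_iterating, that I made up for this sketch), adding a key in the middle of a loop trips the same kind of RuntimeError:

```python
def mutate_while_iterating():
    """Return the message of the error raised when a dict grows mid-iteration."""
    d = {"a": 1, "b": 2}
    try:
        for key in d:
            # Adding a new key changes the size while the iterator is live.
            d[key + "x"] = 0
    except RuntimeError as exc:
        return str(exc)
    # Not reached on CPython; returned only if no error was raised.
    return None


message = mutate_while_iterating()
print(message)
```

Comparing sizes is a cheap check, but it only catches mutations that change the number of entries; deleting one key and adding another between two calls to next would slip through.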

Monday, October 13, 2008

First impressions of darcs

This week, I've been playing around with darcs, a relatively little-known distributed version control system. (The name stands for David's Advanced Revision Control System.)

Darcs is based on the theory of patches developed by its creator, David Roundy. Simply put, darcs' fundamental type is a patch: the difference between two trees.
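
To make the "difference between two trees" idea concrete, here is a toy model in Python. It is only a sketch under my own assumptions: a tree is a dict mapping file paths to contents, a patch records whole-file before/after pairs, and the names diff_trees and apply_patch are invented for illustration. It says nothing about how darcs actually stores or commutes patches.

```python
def diff_trees(old, new):
    """Build a patch mapping path -> (before, after) that turns old into new.

    None stands for "file does not exist" on that side.
    """
    patch = {}
    for path in set(old) | set(new):
        before, after = old.get(path), new.get(path)
        if before != after:
            patch[path] = (before, after)
    return patch


def apply_patch(tree, patch):
    """Return a copy of tree with patch applied, refusing on a mismatch."""
    result = dict(tree)
    for path, (before, after) in patch.items():
        if result.get(path) != before:
            raise ValueError(
                "patch does not apply cleanly at {0!r}".format(path))
        if after is None:
            del result[path]      # the file was removed
        else:
            result[path] = after  # the file was added or changed
    return result


old_tree = {"README": "hello\n", "main.py": "print(1)\n"}
new_tree = {"README": "hello, world\n", "setup.py": "# setup\n"}

patch = diff_trees(old_tree, new_tree)
rebuilt = apply_patch(old_tree, patch)
print(rebuilt == new_tree)
```

Notice that nothing in the patch refers to a revision number; it only knows the state it expects to find and the state it leaves behind.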

Creating a simple repo was quick and painless with "darcs initialize". I recorded a few patches easily, and was feeling quite happy about the fast pace with which darcs went about its business. Then, I decided to review my work. Apparently, darcs has no concept of a revision number; every "commit" is just a patch. This makes selecting patches to review rather difficult, since everything is relative to the current state of the repo. Perhaps this isn't a problem in practice, though, because advanced patch matching (with regular expressions) is provided. Another thing I disliked was the lack of history when merging between repos. Although pulling a patch is simple to do, nothing in the log besides the author's name indicates that the patch was pulled.

Obviously, this is just a first step into the exciting darcs world; I'll continue to use it for some of my projects, and report back later.