Extra Cheese

Me: Gary Bernhardt

Email: gary.bernhardt at gmail

Work: BitBacker

May 30

Processes spawn faster than threads?

In general, processes take longer to start than threads. This makes sense if you think about it - a thread lives within the memory space of its parent process, so it takes less work to set one up. (This is a gross oversimplification, but to be honest I find the details of process management incredibly uninteresting in 2008.) I assumed that this difference would hold for the Python processing module. Apparently it doesn't, at least on Mac OS X. Surprise!

Spawning 100 children with Thread took 1.04s
Spawning 100 children with Process took 0.60s

The above result is for starting and joining the children serially. I get the same results in all of these variations:

  • Starting them all at once, then joining them all at once.
  • Using 10 children or 1,000 children.
  • Having each child sleep for one second (to ensure that they're all actually alive at the same time).

I don't know whether this is due to goodness in OS X, or processing, or fork(), or just Unix in general. In any case, it's very good news. I'd dismissed processing for use on the client side of BitBacker because "process management is hard and they're too heavyweight." Clearly at least one of those complaints is invalid; maybe the other is as well. It would be a wonderful relief if I could use processes. I'm going to need parallelization of one form or another soon, and I'm definitely not going to start sprinkling threads around. Only madness lies down that path.

Here's the code that generated those results, in case you're interested:

import time, threading, processing
for cls in [threading.Thread, processing.Process]:
    start = time.time()
    for _ in range(100):
        child = cls(target=lambda: None)
        child.start()
        child.join()
    print 'Spawning 100 children with %s took %.2fs' % (
        cls.__name__, time.time() - start)
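
The processing module was later folded into the standard library as multiprocessing, so a rough modern equivalent of the benchmark might look like the sketch below. It assumes Python 3. The names noop and time_spawns are made up for illustration, and the target is a module-level function rather than a lambda so that it still works where child processes are spawned instead of forked. Timings depend heavily on the OS and start method, so no output is shown.

```python
import multiprocessing
import threading
import time


def noop():
    """Do nothing; defined at module level so it can be pickled if needed."""


def time_spawns(cls, count=100):
    """Start and join `count` children of type `cls` serially; return seconds."""
    start = time.time()
    for _ in range(count):
        child = cls(target=noop)
        child.start()
        child.join()
    return time.time() - start


if __name__ == "__main__":
    for cls in (threading.Thread, multiprocessing.Process):
        print("Spawning 100 children with %s took %.2fs"
              % (cls.__name__, time_spawns(cls)))
```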

Dec 28

Human-Readable Encryption Keys

For BitBacker, we use 128-bit AES encryption, which means our keys are really long and annoying - 32 characters long when printed in hex. And not only do the users sometimes have to type them in, but they have to write them down on paper. (We can't store the key on our servers because then we'd be able to read the user's files; and we obviously can't trust it to their hard drive because that's what we're backing up.)

Somehow, we have to present these random 128-bit keys to the user, and I think I've found a pretty good way. We use RFC 1751, which defines a "Convention for Human-Readable 128-bit Keys" - basically just a mapping of blocks of bits to strings of English words. Here's an example in Python using the RFC 1751 module in PyCrypto:

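>>> import os
>>> from Crypto.Util import RFC1751 # PyCrypto's RFC 1751 module
>>> key = os.urandom(16) # Generate 16 random bytes (128 bits)
>>> # bin_to_hex isn't defined in this post; presumably a helper that hex-encodes a byte string
>>> bin_to_hex(key) # Show the key in hex (32 characters)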
'61aa60e43a5e7fdb4b86a4897b52a0dc'
>>> y = RFC1751.key_to_english(key)
>>> y # Show the pass phrase version of the key
'BUSY BARN RUB DOLE TAUT TOOK ALTO PRY KIT WALL MUG CURT'
>>> # The transformation is always reversible
>>> bin_to_hex(RFC1751.english_to_key(y))
'61aa60e43a5e7fdb4b86a4897b52a0dc'
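
To show why a 128-bit key comes out as exactly twelve words, here is a rough sketch of the bit-slicing as I understand RFC 1751: each 64-bit block gets two parity bits appended, and the resulting 66 bits are cut into six 11-bit indices into a 2048-word dictionary. The function names are made up for illustration, the word list itself is left out, and this is not a drop-in replacement for PyCrypto's module.

```python
def rfc1751_word_indices(block):
    """Split one 8-byte block into six 11-bit dictionary indices."""
    if len(block) != 8:
        raise ValueError("expected exactly 8 bytes")
    value = int.from_bytes(block, "big")
    # Parity: add up the 32 two-bit pairs and keep the low two bits.
    parity = sum((value >> shift) & 0b11 for shift in range(0, 64, 2)) & 0b11
    bits66 = (value << 2) | parity
    # Six 11-bit groups, most significant first; each picks one of 2048 words.
    return [(bits66 >> (55 - 11 * i)) & 0x7FF for i in range(6)]


def key_word_indices(key):
    """Return the word indices for a key made of 8-byte blocks."""
    return [index
            for offset in range(0, len(key), 8)
            for index in rfc1751_word_indices(key[offset:offset + 8])]
```

A 16-byte key is two blocks, so key_word_indices returns twelve indices, one per word in the pass phrase.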

The keys are still *very* long, of course, and this is unavoidable for our application. But when translated to words, I think it's easier to write them down or type them in without making a mistake. The image below shows BitBacker giving me a pass phrase. (This feature hasn't even gone into beta yet - it's little more than a mockup. So please don't judge it too harshly!)

[Image: Screen shot of BitBacker's pass phrase handling]

When the user clicks "Continue" here, BitBacker actually makes him re-enter the generated pass phrase he wrote down. To be honest, BitBacker's pass phrase handling is quite annoying. But that's a heck of a lot better than losing your pass phrase, which would make your backups inaccessible! This is the one place in all of BitBacker that isn't optimized for "least user annoyance". Encryption keys are just way too important to mess around with, and I think that most existing software is far too lax with them (including BitBacker's competitors).

(This was derived from a comment I left on Jeff Atwood's "Software Registration Keys" post.)


Dec 22

My blog woes have been soothed

It seems I've mostly solved my "blog woes". I got some quite helpful replies (still visible on Blogger, although the comments didn't come over to my new blog). I also got emails from Will Guaraldi about PyBlosxom, and from Lloyd Dalton about blog_my.

I took at least a brief look at each system mentioned in the comments and emails, but I decided on PyBlosxom. If you're reading this in a web browser, what you're seeing is PyBlosxom rendering a theme I ported from Tumblr, with all of my old Blogger blog's content imported. Quite frankensteinian indeed, as far as blogs go.

It turns out that my impression of PyBlosxom's size when I wrote my "blog woes" post was a bit off - I didn't realize just how little functionality resides in the core. It's pretty slim, but with a decent selection of plugins. I only needed tags, wbgarchives, and metadate, but there are plenty more for those who want more features. With the tag and metadate plugins, I managed to keep my blog posts in almost exactly the format I've always used, so that was nice.

PyBlosxom nicely solves my biggest concern, which I didn't explicitly state in my original post: I want to keep all of the files related to my blog in a Mercurial repository. I've succeeded in that - my entire blog is in Mercurial now. That includes configuration files, the .htaccess file, the template, the entries, and even the queue of unfinished entries. If I ever need to, I should be able to move the blog to another host in a matter of minutes. Not that I ever intend to leave WebFaction (note: that's an affiliate link), which is where it's happily hosted now.

With that all out of the way, hopefully I can quit the detestable practice of metablogging, which I'd managed to avoid for my entire first year. Thanks to everyone who made a suggestion, and special thanks to the PyBlosxom developers.


Dec 19

Blog woes

Last night, I switched my blog to Tumblr; this morning, I switched it back to Blogger. Over the last two days, I've spent a ridiculous amount of time and effort trying to make the switch without breaking any links. I'll spare the details, but it involved a lot of mod_rewrite, a PHP script written by Henrik Nyh that proxies all requests to Tumblr, and a huge list of URLs mapping the old Blogger ones to the new Tumblr ones.

I found the proxy-with-a-PHP-script thing distasteful, but the lack of decent tag support was the thing that ultimately made me give up. I have a Python-specific feed that gets aggregated by the Unofficial Planet Python, so whatever I switch to needs to be able to generate tag-specific feeds. Tumblr does let you add tags to your posts, but it doesn't seem to actually do anything with them.

So, I'm in search of blogging software. I want something that:

  • Is written in Python or Ruby. Python because I know it very well; Ruby because I want to know it better.
  • Is simple. I'm not interested in anything built in Django or TurboGears or Rails. Preferably, it would be something that runs on my local machine, generating the static files that make up the live blog.
  • Allows custom URLs but has sane defaults. (I.e., no exposed serial integer keys.)
  • Supports tags and can generate subfeeds based on them, as well as human-readable post lists that are filtered by them. (For example, with my current blog you can look at only Python posts if you want to.)
  • Does not involve a database. (Not even SQLite.)
  • Reads the posts out of plain HTML files, which I will write by hand.

I've looked around for something like this, but everything seems to be either big (e.g., PyBlosxom and Typo) or someone's weekend project that never got touched again. I'm sure that the bigger ones are quite good at what they do, but I just want something that takes my plaintext files and generates an appropriate URL structure. It can do it offline or online - I don't care - but it's got to be simple and require no complicated installation or configuration.

So, any ideas, or am I starting a new project?


Sep 07

Globals and cargo culting

Matt Wilson wants a module's functions to log to one logger, but he can't change their interface and he doesn't want to use a global variable. This is the kind of thing that decorators are very good at, for better or worse. Here's a decorator that will do the job:

def with_logger(fn):
    def new_fn(*args, **kwargs):
        logger = get_singleton_logger()
        return fn(logger=logger, *args, **kwargs)
    return new_fn

And here's how to use it to define a function that gets a logger instance without the caller passing it in:

>>> @with_logger
... def add(x, y, logger):
...     logger.warning('x + y = %i' % (x + y))
...
>>> add(1, 2)
WARNING:foo:x + y = 3

This seems to be exactly what Matt was looking for: there are no globals, a logger gets injected, and the function's interface hasn't changed. But is it a good idea?

No, it's a ridiculous idea! It's just a reimplementation of global variables! All I've done is come up with a complicated scheme for injecting a single logger instance into every function in the module. But that's exactly what a global does! This is something that programmers have done over and over again in the name of OO. Everyone wants globals, but they go to great lengths to hide it.

Here are three possible ways to solve the original logger problem:

  1. Use a singleton, and have each function retrieve the instance that way.
  2. Use a decorator that injects the instance into the function's argument list each time. In with_logger, I combined this with a singleton.
  3. Just use a global.

If you choose (1) or (2), the joke's on you. You're still using a global, but now you have two problems: global state and a complicated method for managing it. There's no need for that, because we already have a simple method for injecting instances into a module's functions: globals!
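
For comparison, option (3) might look something like the sketch below. It assumes Python 3 and the standard logging module. The logger name "foo" is only there to mirror the output shown earlier, and returning the sum is an addition of mine to make the function easy to check.

```python
import logging

# One long-lived, module-level logger shared by every function in the module.
logger = logging.getLogger("foo")


def add(x, y):
    # No decorator and no singleton lookup: just use the global.
    logger.warning("x + y = %i", x + y)
    return x + y
```

The caller's view is unchanged (add(1, 2) still takes two arguments), and there is nothing extra to maintain.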

Of course, sometimes singletons or decorators like with_logger make sense, but only when they actually do something. If all they do is allow multiple functions to access a single long-living instance, they're dangerous and needlessly complex. In almost all cases, singleton and related techniques are nothing more than cargo cult programming.


Jul 23

When JSON isn't JSON

JSON is so simple that you can specify it on an index card, but we still can't get it right. For example, here's what happens when simplejson and python-cjson talk about slashes:

# simplejson correctly decodes cjson's data
>>> print simplejson.loads(cjson.encode('/'))
/
# cjson fails to decode simplejson's data
>>> print cjson.decode(simplejson.dumps('/'))
\/

In this case, the problem is that cjson doesn't handle backslashes correctly. There are two ways to say "/" in JSON: "/" and "\/". When encoding, simplejson always escapes slashes, but cjson never does:

>>> print simplejson.dumps('/')
"\/"
>>> print cjson.encode('/')
"/"

The reverse is also true: simplejson knows how to decode "\/", but cjson decodes it incorrectly:

>>> print simplejson.loads('"\/"')
/
>>> print cjson.decode('"\/"')
\/

So there you go: simplejson and cjson don't interoperate. This bit me when I tried to move BitBacker from simplejson to cjson for performance reasons. The live alpha server had a few thousand records encoded with simplejson, all of which included slashes. When I switched to cjson, everything broke because every "/foo/bar" entry in the database came back as "\/foo\/bar".
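
For reference, here is a small sketch of the behavior a conforming decoder should have, using the json module from the Python 3 standard library (an assumption on my part; it is neither of the two libraries above). Both spellings of a slash decode to the same string:

```python
import json

# Both spellings of a slash are valid JSON and decode to the same string.
plain = json.loads('"/foo/bar"')
escaped = json.loads(r'"\/foo\/bar"')
assert plain == escaped == "/foo/bar"

# The standard library encoder leaves slashes unescaped.
encoded = json.dumps("/foo/bar")
assert encoded == '"/foo/bar"'
```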

As far as I'm concerned, this problem with JSON is actually an argument for simple data formats like JSON. If we can't get full interoperability for something as stupidly simple as JSON, how did anyone ever expect WS-* to work?


Apr 07

Are your tests lying to you?

If you've written a test for a module, and the module is changed in the future, there are three things that can happen:

  1. The test keeps passing because nothing is broken. (Good.)
  2. The test fails because something is wrong. (Great - this is the test's job!)
  3. The test keeps passing, but it silently stops testing the thing it claims to (BAD, BAD, BAD!).

Scenario 3 above is very dangerous, and it's a major problem in testing. What you have in that situation is a lying test: it says "I'm testing feature x," but actually passes without doing so. In other words, you have a test that no longer warns you if you break something.

If you've not been bitten by this, it might not be an obvious problem. To make it a little clearer, let's look at a toy example (in Python, of course!). Here's a silly WebClient class and its test.

class WebClient:
    """An HTTP client that supports both SSL and plain connections"""
    def __init__(self):
        self.use_ssl = False

    def get(self, url):
        # Hand any request off to external functions
        if self.use_ssl:
            return get_with_ssl(url)
        else:
            return get_without_ssl(url)

def test_web_client():
    # Make sure everything works with normal HTTP
    client = WebClient()
    assert client.get('/') == expected_data #defined elsewhere

    # Make sure everything works with SSL as well
    client.use_ssl = True
    assert client.get('/') == expected_data

This works fine - the test passes and it tests what it claims to. But what happens if someone renames the use_ssl attribute later?

class WebClient:
    def __init__(self):
        self.using_ssl = False

    def get(self, url):
        # Hand any request off to external functions
        if self.using_ssl:
            return get_with_ssl(url)
        else:
            return get_without_ssl(url)

Take a look back at the test. It's no longer testing what it claims to, because "use_ssl" no longer means anything to WebClient. The test still passes, though - it's just that neither of the two get() calls actually uses SSL.

This is a serious problem - you need to be able to trust your tests, but for all you know your tests are giving you false positives. The question, then, is how can we detect this type of mistake? Well, there is a simple method that will catch at least some of them. What you need is a meta-test: a test that ensures that the tests aren't lying to you. It's really not that bad; here's the pseudocode:

for each test in the suite:
    for each line of code that isn't an assertion:
        remove that line of code (but not the rest)
        run the test and make sure that it fails

Basically, this meta-test is ensuring that every line of code in the test is required: removing any line should cause the test to fail. This sounds complicated, but it only has to be implemented once. Once it exists as a nose plugin, for example, you can use it without writing any extra code.

Let's look at how this would affect the example. Here's the testing code again:

def test_web_client():
    # Make sure everything works with normal HTTP
    client = WebClient()
    assert client.get('/') == expected_data #defined elsewhere

    # Make sure everything works with SSL as well
    client.use_ssl = True
    assert client.get('/') == expected_data

The meta-test will step through, removing each relevant line and making sure that the test fails. The only executable lines that aren't assertions are 3 and 7. When it removes line 3, the test will fail because "client" won't be defined. So that iteration of the meta-test passes. When it removes line 7, the test will still pass. Because the test passes with a line removed, the meta-test will fail. The meta-test has detected the fact that line 7 isn't necessary, which is a red flag that says "this test might lie to you later!"
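
The walk-through above can be reproduced with a small sketch. This is not an existing nose plugin, just a rough Python 3 illustration of the pseudocode. The function name unneeded_lines, the stand-in get_with_ssl and get_without_ssl functions, and the expected_data value are all made up, and the sketch only looks at top-level statements in the test body.

```python
import ast
import copy


def unneeded_lines(test_source, test_name, namespace):
    """Return line numbers of non-assert statements the test passes without."""
    tree = ast.parse(test_source)
    suspicious = []

    def find_test(module):
        return next(node for node in module.body
                    if isinstance(node, ast.FunctionDef) and node.name == test_name)

    for index, statement in enumerate(find_test(tree).body):
        if isinstance(statement, ast.Assert):
            continue  # assertions are the checks themselves; leave them alone
        mutated = copy.deepcopy(tree)
        body = find_test(mutated).body
        del body[index]  # remove this one statement, keep the rest
        if not body:
            body.append(ast.Pass())
        ast.fix_missing_locations(mutated)
        env = dict(namespace)
        exec(compile(mutated, "<meta-test>", "exec"), env)
        try:
            env[test_name]()
        except Exception:
            continue  # the test failed without the line, so the line matters
        suspicious.append(statement.lineno)
    return suspicious


# Stand-ins for the pieces the post leaves undefined.
expected_data = "<html>"

def get_with_ssl(url):
    return expected_data

def get_without_ssl(url):
    return expected_data

class WebClient:
    def __init__(self):
        self.use_ssl = False

    def get(self, url):
        return get_with_ssl(url) if self.use_ssl else get_without_ssl(url)

TEST_SOURCE = '''\
def test_web_client():
    # Make sure everything works with normal HTTP
    client = WebClient()
    assert client.get('/') == expected_data

    # Make sure everything works with SSL as well
    client.use_ssl = True
    assert client.get('/') == expected_data
'''

flagged = unneeded_lines(
    TEST_SOURCE, "test_web_client",
    {"WebClient": WebClient, "expected_data": expected_data})
print(flagged)  # prints [7]
```

Removing line 3 makes the test blow up with a NameError, so it counts as needed; removing line 7 changes nothing, so it gets flagged.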

It's important to note that the meta-test will fail even when the test is working. It really is a meta-test: it's only testing the test. This is a good thing. It tells you when you've written a crappy test - a test that isn't paying enough attention.

Let's return to the example and try to fix it. To make the meta-test pass again, the test could be changed to be more sensitive to WebClient's state:

def test_web_client():
    # Make sure everything works with normal HTTP
    client = WebClient()
    assert client.get('/') == expected_data #defined elsewhere
    assert client.use_ssl == False

    # Make sure everything works with SSL as well
    client.use_ssl = True
    assert client.get('/') == expected_data
    assert client.use_ssl == True

Now the meta-test passes, and the original test_web_client is more resilient to silent failures. If someone renames WebClient's use_ssl attribute, the test won't silently stop testing like it did before. Instead, line 5 will raise an exception and the test will fail.

Of course, this isn't foolproof. If you added line 10 but not line 5, you wouldn't be doing yourself any good (figuring out why is left as an exercise for the reader :). The meta-test would still pass, though, and you would still have a test that may lie to you in the future. So this meta-testing method isn't a magic bullet that will force you to write good tests. For a careful tester, though, it throws up a red flag for tests that might be susceptible to very subtle errors.

(Nitpicker's corner: Yes, the problem in this test was caused by questionable design in WebClient itself. Using an instance variable to control a class's behavior in this way is error-prone to begin with. This testing problem also arises in much more subtle situations, though; I have the scars to prove it.)


Mar 06

Zero to Slashdot in Three Days

The Genesis

A few days before PyCon, Brian suggested that we build a web app in one night. It took a little longer than that to polish it up, but we launched sucks-rocks.com on Tuesday. Since then, it's had over 40,000 page views and been slashdotted. (OK... it was the Japanese Slashdot, but it's still a Slashdot.)

Sucks/rocks rates the terms you enter by doing web searches and counting results. For example, if you search for "Windows sucks" using Google, you'll get many more results than for "Windows rocks". The opposite is true for FreeBSD. From this, we can infer that people probably like FreeBSD more than Windows. The actual searches that are done by sucks/rocks are more complex than this, but they follow a similar pattern.
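
To make the idea concrete, here is a toy sketch of turning two result counts into a rating. The real scoring isn't shown in this post, so the formula and the function name rate are purely hypothetical; the sketch only illustrates comparing "sucks" hits against "rocks" hits.

```python
def rate(sucks_hits, rocks_hits):
    """Return the share of 'rocks' hits in [0, 1], or None if there is no data."""
    total = sucks_hits + rocks_hits
    if total == 0:
        return None  # nothing to compare, so no rating can be given
    return rocks_hits / total
```

A value near 1 would mean mostly "rocks" results, and a value near 0 mostly "sucks" results.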

The Search Engine Arms Race

Once we started getting a lot of traffic, it was very hard for us to keep sucks/rocks going because we kept running out of searches. Here are the search APIs we used, in the order that we added them:

Search Engine   Queries/Day   Interface   Suckiness of Results
Google          1,000         SOAP        Low
Yahoo           5,000         REST        Pretty low
live.com        10,000        SOAP        IMMEASURABLY HIGH!

We started with Google, but ran out of queries before we even launched. We then used Yahoo, but ran out when 100shiki.com linked to us, forcing me to add support for live.com. Unfortunately, live.com's search results are terrible. Terrible! If you search sucks/rocks for "lord of the rings", you'll get a "?" back. This means that the engine whose results are cached (which is live.com, of course) reported that there were 0 "total results available". Great.

Now we have a cache of almost 60,000 searches, most of which are from live.com. Many of those are totally wrong, of course. My next task is to add a background thread that slowly replaces all of the cached live.com results with Yahoo results.

The Code

Sucks/rocks runs on top of web.py, but only uses it for URL dispatching. Paste does the HTTP serving, with WebFaction's Apache instance on the front end (disclaimer: the WebFaction link is an affiliate link). This simple setup handled about a million HTTP requests in four days, using less than 5% of the CPU almost all of the time (except when it was at the top of slashdot.jp).

[Image: Sucks/Rocks Traffic]

Easy Come, Easy Go

With our slashdotting over, we've gone from 10,000+ pageviews per day to about 1,000. Slashdot giveth, and Slashdot taketh away. That's OK, because I need some time to push all of the crappy live.com results out of the cache anyway.

(Brian has also posted about sucks/rocks: 1, 2).


Feb 27

PyCon 2007: The Untold Stories

Most of the PyCon posts are about the sessions, so here are some of the interesting things I did outside of the scheduled talks. I have pictures for many of them thanks to Mike Pirnat's diligent photography.

Pagoda CMS

Brian, Chris, and Ian demoed Pagoda, their upcoming open-source CMS. It's very user-centric, and they're spending a lot of effort on the user experience. Even though I don't use CMSes, I'm excited about this project because I'm so sick of crappy UIs. People's responses seemed positive, but I think some people were disappointed that Pagoda takes the easy-to-use approach rather than the kitchen-sink approach. That's ok; that's why we have Zope - the kitchen sink is there for the taking!

Python Is Basically DOS, Right?

I headed up to my room to grab my hoodie, and on the way back I was in the elevator with a 40ish couple. They asked me what this conference was about; I told them it was about Python, which is a programming language. The guy asked me whether "that's anything like DOS". It was kind of funny, but mostly just jarring. After being in close quarters with lots of smart programmers for 2 days, it was weird to suddenly talk to someone whose computer experience apparently began and ended around 1990.

The Mysterious Ellipsis

Dave, Mike, and I were at the hotel's bar, and the topic of Python's ellipsis operator ("...") came up somehow. From the grammar in the slicing docs, we could tell that the ellipsis could appear in slices, but we couldn't trick Python into taking it without throwing an exception. I figured it out later - in a slice, the "..." token is just translated into an "Ellipsis" object:

>>> class Foo:
...     def __getitem__(self, x):    
...         return x
... 
>>> f = Foo()
>>> f[1:2:3]
slice(1, 2, 3)
>>> f[...]
Ellipsis
>>> f[1, 2, ..., 100]
(1, 2, Ellipsis, 100)

Apparently, it's mostly used for numeric stuff like NumPy. I definitely understand Python's slicing much better after that confusing night.
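
As a rough illustration of what a numeric library can do with that object, here is a sketch of expanding "..." into "as many full slices as needed" for an array with a known number of dimensions. The function expand_ellipsis is my own toy, written for Python 3, and not code taken from any library.

```python
def expand_ellipsis(index, ndim):
    """Replace a single Ellipsis in `index` with enough full slices for `ndim` axes."""
    if not isinstance(index, tuple):
        index = (index,)
    if index.count(Ellipsis) > 1:
        raise IndexError("only one ellipsis is allowed")
    if Ellipsis not in index:
        # No ellipsis: pad the missing trailing axes with full slices.
        return index + (slice(None),) * (ndim - len(index))
    position = index.index(Ellipsis)
    # The ellipsis stands in for every axis that isn't indexed explicitly.
    fill = (slice(None),) * (ndim - (len(index) - 1))
    return index[:position] + fill + index[position + 1:]
```

With this, an index like (..., 0) on a three-dimensional array means the same thing as (slice(None), slice(None), 0), which is the a[:, :, 0] spelling.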

Mischief on The Open Space Board

Brian and Chris posted a "Python in The Adult Entertainment Industry" card on the open spaces board with my name on it. It was up for about 20 minutes before Brian pointed it out to me and I took it down. Hopefully, I escaped without too many prominent Python hackers associating me with pornography.

I have to wonder whether anyone saw that card and was actually interested in going to the session. Maybe Chris and Brian's silliness prompted an interesting discussion of Python and porn somewhere...

RESTDB

The open space I actually did lead was on "REST, Databases, and RESTful Databases" rather than pornography. Unfortunately, I dove into explaining RESTDB right at the start. It turned out that not everyone was familiar with REST, or convinced of its usefulness, or both. So, we ended up talking about REST for the second half. I think the session would've been more useful to everyone involved if we'd discussed REST first, then moved on to RESTful databases. I'm not sure how much everyone else got out of it, but I learned a lot about how to explain what RESTDB is and why we might want it.

Django vs. The World

I didn't fly back until Monday, so I was still there on Sunday night. Most of the people who were still there were staying for the sprints, so the conference area was pretty quiet as everyone quietly hacked away (with the exception of the Wii room).

I was on my way back to the "quiet room", which was full of Django guys. On my way there, a big group of people appeared and asked where the Django guys were. I pointed them towards the quiet room and joined them on their way there. The group was made up of TurboGears guys, Pylons guys, Paste guys, and some that I didn't recognize. They busted into the Django room and caused some friendly commotion, with one notable result being this post on Ian Bicking's blog. I'm pretty sure that EWT's bathtub full of alcohol (pictured to the right) was a factor in this incident.

Magic URL Mapping

After the ruckus in the quiet room ended, I hacked up some crazy URL mapping code based on an experiment Brian did a while back. Here's a controller defined using it:

class UserController(_/'users'/User):
    def get(self, user):
        return dict(
            email=user.email,
            name=user.name)

The _/'users'/User part defines the controller's URL, and User is actually a RESTDB resource type. So, for example, if someone requests /users/Bob, this controller will be invoked and the Bob RESTDB resource will automatically be retrieved and passed in. This works for multiple records, so you could also have more complex controllers like:

class BlogController(_/'users'/User/'blogs'/Blog):
    def get(self, user, blog):
        assert blog.user == user # yep!

BlogController would be called for URLs like /users/Bob/blogs/TheBobBlog and, once again, both Bob and his blog would automatically be pulled out of the database. Of course, it's fully RESTful (hence the get method).

Keep in mind that this is just a silly experiment; please don't freak out because I'm overloading division to produce a URL mapping object that I then subclass. (Although, to be honest, the code isn't that bad; it's only about 60 lines long.)
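
To give a flavor of the division trick without the subclassing part, here is a much smaller sketch. It is not the 60-line experiment described above: the Pattern class, its match method, and the lookup classmethod on User are all hypothetical stand-ins (the real User is a RESTDB resource type), and it's written for Python 3, where the hook is __truediv__.

```python
class Pattern:
    """A URL pattern built up with the / operator."""

    def __init__(self, parts=()):
        self.parts = tuple(parts)

    def __truediv__(self, part):
        # pattern / 'users' or pattern / User returns a new, longer pattern.
        return Pattern(self.parts + (part,))

    def match(self, path):
        """Return the looked-up resources for `path`, or None if it doesn't match."""
        segments = [segment for segment in path.split("/") if segment]
        if len(segments) != len(self.parts):
            return None
        resources = []
        for part, segment in zip(self.parts, segments):
            if isinstance(part, str):
                if part != segment:
                    return None  # a literal segment didn't match
            else:
                resources.append(part.lookup(segment))
        return resources


_ = Pattern()  # the empty root pattern


class User:
    """Stand-in resource type; a real one would hit the database."""

    def __init__(self, name):
        self.name = name

    @classmethod
    def lookup(cls, key):
        return cls(key)


pattern = _ / "users" / User
found = pattern.match("/users/Bob")
```

Here found is a one-element list holding the User named Bob, which is the object a controller's get method would receive.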

Overall, PyCon was awesome, and I'm really glad I went. It's going to be in Chicago next year, so I won't have to lose two full days to travel (awesome!).