January 19, 2008 will

Netstring theory

I have created a Google Code project for my Python netstring module.

So what is a netstring? A netstring is a way of encoding strings of data in a file or network stream. The classic way of doing this is to terminate the string with a special character, such as carriage return, line-feed or a null byte. But this means that when reading the encoded data you have to check every character in the stream to see if it is the terminator character -- which can be inefficient. It also makes it impossible to encode a string that contains the terminator character, because it will be incorrectly interpreted as the end of the string. Netstrings solve both these problems by encoding the size of the string up-front.

A string encoded as a netstring consists of the length of the string in ASCII, followed by a colon, the string itself and a comma character. For example, here is my name encoded as two netstrings:

4:Will,7:McGugan,

It's a very simple protocol, but it can simplify writing file formats (Not everything need be XML!) and encoding network streams. For the official documentation on netstrings see http://cr.yp.to/proto/netstrings.txt.
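To make the format concrete, here is a minimal sketch of an encoder written from the description above. It is not the code from the netstring module; `encode_netstring` is a hypothetical helper name, and it assumes the data is already a byte string.

```python
def encode_netstring(data: bytes) -> bytes:
    """Frame data as a netstring: ASCII length, colon, payload, comma."""
    return str(len(data)).encode("ascii") + b":" + data + b","


# Two netstrings written back to back, as in the example above.
encoded = encode_netstring(b"Will") + encode_netstring(b"McGugan")
print(encoded)  # b'4:Will,7:McGugan,'
```

Because the length comes first, the payload can contain any byte, including colons and commas, without any escaping.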

You can install the netstring module with the following command:

easy_install netstring

You can encode individual netstrings with the netstring.encode function, but it will probably be simpler to use the netstring.FileEncoder class, which takes a writable file-like object as its only parameter. You then call its write method to encode and write netstrings to the file. For network streams I would suggest writing a small proxy class that implements a write method that calls the socket's send method.
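A proxy class of that kind might look like the sketch below. The names `SocketWriter` and `encode_netstring` are made up for illustration, and the example uses a stand-in encoder rather than the netstring module so that it runs on its own. It also calls `sendall` instead of `send`, since `send` may transmit only part of the data. In principle an object like this could be passed wherever a writable file-like object is expected, but that has not been tested against netstring.FileEncoder here.

```python
import socket


def encode_netstring(data: bytes) -> bytes:
    """Stand-in encoder (hypothetical helper, not the module's function)."""
    return str(len(data)).encode("ascii") + b":" + data + b","


class SocketWriter:
    """Minimal file-like proxy that exposes write() on top of a socket."""

    def __init__(self, sock):
        self.sock = sock

    def write(self, data: bytes) -> None:
        # sendall() keeps sending until every byte has been written.
        self.sock.sendall(data)


# Demonstration with a connected pair of local sockets.
sender, receiver = socket.socketpair()
writer = SocketWriter(sender)
writer.write(encode_netstring(b"Will"))
writer.write(encode_netstring(b"McGugan"))
sender.close()

# Read until the sending side is closed.
received = b""
while True:
    chunk = receiver.recv(1024)
    if not chunk:
        break
    received += chunk
receiver.close()

print(received)  # b'4:Will,7:McGugan,'
```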

Decoding the netstring file or stream is done with the netstring.Decoder class. Simply create a Decoder object, then call its feed method with data from the stream. The feed method is a generator that yields strings as they are decoded. Note that the data you feed to the decoder need not be whole netstrings; it could be a portion of a netstring or a group of netstrings. The decoder will buffer the data until it has decoded a full string, or strings. This means that feed may yield zero or more strings, depending on the data you give it.
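The buffering behaviour can be illustrated with a small stand-alone decoder. This is only a sketch of the idea, not the module's actual implementation: `SketchDecoder` is a hypothetical class, it works on byte strings, and raising `ValueError` on malformed input is an assumption of this example.

```python
class SketchDecoder:
    """Illustrative incremental netstring decoder (not netstring.Decoder)."""

    def __init__(self):
        self._buffer = b""

    def feed(self, data: bytes):
        """Buffer data and yield every complete payload decoded so far."""
        self._buffer += data
        while True:
            colon = self._buffer.find(b":")
            if colon == -1:
                # Length prefix not complete yet; it must be digits only.
                if self._buffer and not self._buffer.isdigit():
                    raise ValueError("invalid length prefix")
                return
            length_text = self._buffer[:colon]
            if not length_text.isdigit():
                raise ValueError("invalid length prefix")
            end = colon + 1 + int(length_text)
            if len(self._buffer) <= end:
                # Wait for the rest of the payload and the trailing comma.
                return
            if self._buffer[end:end + 1] != b",":
                raise ValueError("missing terminating comma")
            payload = self._buffer[colon + 1:end]
            self._buffer = self._buffer[end + 1:]
            yield payload


decoder = SketchDecoder()
# The chunks deliberately split the netstrings at awkward places.
chunks = [b"4:Wi", b"ll,7:McGu", b"gan,"]
results = [list(decoder.feed(chunk)) for chunk in chunks]
print(results)  # [[], [b'Will'], [b'McGugan']]
```

The first chunk yields nothing because the payload is incomplete; each later chunk completes one string.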

The following pseudo-code shows how you might use netstrings as the basis of a network stream (which is the purpose they were intended for).

import netstring

decoder = netstring.Decoder()

# Assumes that 'sock' is a previously opened socket
while True:
    data = sock.recv(1024)
    if not data:
        break
    for packet in decoder.feed(data):
        handle_packet(packet)

The license is Public Domain, even though Google Code claims otherwise (there was no setting for PD).

Nick Moffitt

The reason that there is no "Public Domain" option in Google Code is that it's a strictly US-centric concept and it is not clear that even saying "I hereby put this work in the public domain." has the effect that you would expect in all US jurisdictions. In fact, it's possible that in many places this would be seen as an attempt to avoid certain responsibilities of Copyright (I know, I know...) and would revert immediately back to "all rights reserved" (which is the opposite of what you are trying to do).

Your best bet is to just slog through the broken worldwide copyright system and use a license like this one: http://sam.zoy.org/wtfpl/

It's hard to find that ambiguous in *any* jurisdiction. If you really want, you can probably disclaim warranty by adding "1. WHATEVER HAPPENS ISN'T MY FAULT."

Tane Piper

You should check out sharesource (http://sharesource.org). They offer both SVN and Mercurial, and a lot more licence types (although I don't see public domain - but you could always request it to be added).

The library itself looks interesting, and I may look at integrating it into my Django app.

larrytheliquid

The colon before the string is necessary as a known end marker for the string length integer. However, once the string begins the absolute length of it is known. Hence, the end of the current string, and thus the beginning of the next string is also known.

With this being the case, it seems that the terminating comma is unnecessary. Besides just being nicer to look at, I'm thinking you may be using the comma for error checking... though it still feels superfluous.

Disclaimer: I have no experience with network streams and am just inquiring out of interest.

Will

Larry, your analysis is correct. The terminating comma is superfluous; it's there to conform to the spec. I'm guessing it is a form of syntactic sugar for mentally parsing netstrings. There is some advantage in using it for error checking, so that the decoder can be sure that it is actually being fed netstrings, and not some other data.

Andy Goth

Find more discussion of netstrings here: http://wiki.tcl.tk/15074