Since Net Communities have their new developer, I can post the coder test I wrote to find a replacement. The test was not meant to be particularly challenging, and we only requested that developers do one out of the three problems. Most developers did all three--presumably in an effort to impress--but nobody got all three correct, first time around. That was probably my fault, because the third problem had a subtle issue that is easy to overlook.

I'll quote the problems here, including typos. Occasionally the way I phrased things was slightly ambiguous, but I didn't penalize if the candidate misunderstood a requirement due to my use of language.

1. Write a script that takes a file containing a number of tags words (one per line) and sorts them by the number of times they occur in the file (descending order), but exclude words that only occur once. Ignore the case of the words, and filter out any puncutation characters. The output should be the word, followed by a space and the number of times it occurs in the file.

Example input:

microsoft

apple

microsoft.

Apple?

security

microsoft

internet

Example output:

microsoft 3

apple 2

This is something I would expect any programmer to be able to solve in any language. What I was looking for is simplicity in the solution. If a developer solves a simple problem like this with complex code, then he* would probably solve a complex program with unmaintainable code. And although efficiency was not a requirement for any of the problems, I was disappointed by the number of developers that didn't know when to use a dictionary / set over a list. Many candidates failed on this problem because they didn't fulfill all the requirements, not because the code was buggy but because the a requirement was simply ignored. This left me in a bit of a quandary, because I suspected that most of the candidates could have done it adequately if prompted--but then shouldn't a developer pay attention to details?

2. Write a script that removes blank paragraphs from a piece of html. A blank paragraph should be considered to be a <p> </p> tag containing only white space.

I was looking for a regular expression here. Even if a developer hasn't had a great deal of experience with regular expressions, he should be able to recognize that this is a situation where an R.E. would come in handy. One developer parsed the HTML to find blank paragraphs, which was hard to fault because it did work--but I tend to prefer the simplest solution that works, and a handful of lines is generally preferably to several pages of code.

3. Write a decorator that stores the result of a function call and returns the cached version in subsequent calls (with the same parameters) for 5 minutes, or ten times -- whichever comes first.

This was the one that everyone failed. The main issue was that to do it properly you need to create an immutable key out of the parameters to the function to be cached. But that was only the final hurdle, many candidates failed on simpler issues. In fact, many 'solutions' didn't actually run at all! This surprised me--if you have plenty of time to write the code and there is potentially a job riding on it, don't you test your solution?

Although nobody got that last problem correct, we did interview some good developers with a mix of skills and abilities. Most of whom were very passionate about working Django, and passion is a good thing in any field.

Being on the other side of an interview was an interesting experience for me, I hope I wasn't too much of a jerk! Thanks to all who applied.

If anyone wants to post a solution to any of the problems, they are more than welcome--but I'm afraid the job has been filled.

* I hate to type 'he or she', please assume thats what I mean.

This blog post was posted to It's All Geek to Me on Saturday January 17th, 2009 at 6:09PM
 

28 Responses to "Python Coder Test"

  • January 17th, 2009, 9:26 p.m.

    Question #1:


    from operator import itemgetter
    import string

    filename = "input.txt"

    fo = open(filename, "r")
    data = fo.readlines()
    fo.close()

    set_data = {}
    exclude = set(string.punctuation)

    for line in data:
    line = line.strip() # remove newlines
    line = line.lower() # ignore case
    line = ''.join(ch for ch in line if ch not in exclude) # ignore punctuation
    if line in set_data:
    set_data[line] = set_data[line] + 1
    else:
    set_data[line] = 1

    for i in sorted(set_data.items(), key=itemgetter(1), reverse=True):
    if i[1] == 1:
    continue # skip single entries
    print "%s %s" % (i[0], i[1])

    The "ignore punctuation" bit came from stack overflow.

  • andre
    January 17th, 2009, 9:30 p.m.

    #1 is easy

    #2 I don't know a lot of HTML. Isn't it allowed to have e.g attribute strings contain tags? In that case it you would have to resort to a parser to be sure you were cleaning only HTML markup and not attribute strings.

    #3 Would require pickling the parameters in order to deal with mutable data.

  • ido
    January 17th, 2009, 9:34 p.m.

    lets see:

    1)

     
    import re
    from collections import defaultdict
    from operator import itemgetter

    not_letters = re.compile(r'[^a-zA-Z]')

    def clean(line):
    return not_letters.sub('',line.strip().lower())

    tags = defaultdict(lambda: 0)
    f = open(filename,'r')
    for line in f:
    tags[clean(line)] += 1
    f.close()

    for tag, count in sorted(filter(lambda x: x[1] != 1, tags.items()), key=itemgetter(1), reverse=True):
    print tag, count

    2)

    import re
    empty_p = re.compile(r'\s*',re.MULTILINE) # subsitute \s with ' 'if you don't include tabs and newlines as white space)
    input = emtpy.sub('',input)

    empty_p could also be re.compile(r'\s*) for a more general solution.

    3) hmm... now this is more interesting:

    import datetime
    import functools
    def cache(time=datetime.timedelta(minutes=5), max_count=10):
    CACHE={}
    def decorator(f):
    @functools.wraps(f)
    def wrapped(*args, **kwargs):
    key = repr(f) + '|' + repr(args) + '|'+ repr(kwargs)
    if key in CAHCE:
    first_access, count, res = CACHE[key]
    if first_access + max_time = max_count:
    del CACHE[key]
    else:
    CACHE[key] = (first_access, count + 1, res)
    return res
    res = f(*args, **kwargs)
    CACHE[key] = (datetime.datetime.now(), 1, res)
    return res
    return wrapped
    return decorator



    I hope the formatting won't be way off...

  • ido
    January 17th, 2009, 9:37 p.m.

    hmm... wordpress seems to cut html formatting from comments.

    so i guess this won't work as well:

    def f(x):
    pass

  • January 18th, 2009, 12:55 a.m.

    For #2, I'd expect a regexp from a general developer, or one with perl experience - but I'd push harder on someone claiming python experience :-) I've moved a number of perl programmers over to python and one sign of being "early" in the transition is using regexps for things with more robust and readable functions available - str.endswith, os.path.basename, and for this particular example, BeautifulSoup:


    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup("...")
    for paratag in soup.findAll(name="p"):
    if not paratag.string or not paratag.string.strip():
    paratag.extract()
    #end if
    #end for
    print soup

  • Dave Kirby
    January 18th, 2009, 1:05 a.m.

    Here is my solution to the first problem:

    $ cat test.txt | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | grep -v '^ *1 '

    3 microsoft
    2 apple

    The second problem is trivial with a regex, although as andre says, you could have attribute strings with tags in.


    The third problem is tricky - strictly speaking pickling will not work, since it is not guaranteed that pickling the same parameters twice will always produce the same string - this is something that has bitten me in the past. In this case it will work most of the time, but you will get an occasional cache miss.

  • Dave Kirby
    January 18th, 2009, 1:15 a.m.

    I forgot that uniq has an option to remove non-duplicated entries, so the grep at the end is unnecessary:

    $ cat test.txt | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | sort | uniq -cd | sort -nr
    3 microsoft
    2 apple

  • January 18th, 2009, 2:29 a.m.

    I am quite a Python newbie (and I have not come across writing decorators yet, so the 3rd problem is screwed) but the following are my solutions to the test.

    #1:


    from __future__ import with_statement

    from itertools import groupby
    from string import punctuation

    def groupwords(words):
    words = [normalize(word) for word in words]
    result = [(k, len(list(g))) for k, g in groupby(sorted(words))]
    return sorted(result, cmp=lambda x, y: cmp(y[1], x[1]))

    def normalize(word):
    return ''.join(c for c in word.strip().lower() if c not in punctuation)

    # Input/Output
    with open('input.txt', 'r') as f:
    groups = groupwords(f.readlines())
    for group in filter(lambda g: g[1] > 1, groups):
    print '%s %d' % (group[0], group[1])

    I like using with/as for the input file; using itertools.groupby to do the "uniq" thingy. Not sure about normalize, though: perhaps a regexp would be better?

    #2:

    from xml.dom import Node
    from xml.dom.minidom import parse

    dom = parse('input.html')
    paragraphs = dom.getElementsByTagName('p')
    for p in paragraphs:
    text = p.firstChild
    if (not text or
    (text.nodeType == Node.TEXT_NODE and not text.nodeValue.strip())):
    p.parentNode.removeChild(p)
    p.unlink()

    html = dom.toxml()
    dom.unlink()
    print html[html.index('?>')+2:]

    So, I stand by the idea behind my code, in that using a regexp for parsing HTML is, uhm, really not exactly what I would like to see if I were the interviewer. Libraries to parse and manipulate HTML do exist, and choosing the right tool would be a matter of additional context given to the problem. (This is the same reason why I liked the idea of using itertools instead of grouping "by hand".) Here I used minidom because it's simple and in the standard library, even if for producing output it's not exactly practical. If external libraries were allowed, I would have tried something like BeautifulSoup (which BTW has been indeed used by another commenter).

  • January 18th, 2009, 2:38 a.m.

    Uh, for Dave Kirby: the output format seems to prescribe that the word should be placed before the number of its occurrencies, not the other way around.

  • January 18th, 2009, 3:53 a.m.

    Having written this blog entry, is there no more 'context implied' information that you could have supplied?

    For example:

    In 1, Would it be correct to assume ASCII?

    In 2, might you have said beforehand, that a regex would do, or give some representative examples of input HTML?

    In 3, You say no one "did it properly". This flips your comments on problem 2, where you wanted "the simplest solution". Without telling the candidates you require them to be mind readers to produce the right level of detail.

    In summary, I think it is easy to write insufficient specs. you might have tried your problem on a few anonymous coders, taking care to remember what extra needed to be imparted to get the answers you were thinking of.

    - Paddy.

  • January 18th, 2009, 5:23 a.m.

    The third problem is indeed nontrivial. In the documentation of my decorator module I use a frozenset to ensure the function arguments are hashable:


    def memoize25(func):
    func.cache = {}
    def memoize(*args, **kw):
    if kw: # frozenset is used to ensure hashability
    key = args, frozenset(kw.iteritems())
    else:
    key = args
    cache = func.cache
    if key in cache:
    return cache[key]
    else:
    cache[key] = result = func(*args, **kw)
    return result
    return functools.update_wrapper(memoize, func)

  • ido
    January 18th, 2009, 11:12 a.m.

    Michele, your implementation has a problem with unhashable objects:


    @memoize25
    def f(x):
    return x['v']*x['v']

    >>> f({'v':5})
    TypeError: dict objects are unhashable

    doing it with repr() as I have done it, allows using unhashable objects, but has the drawback that objects need to implement __repr__ as a meaningful function, so that if X and Y differ in semantics and/or functionality, repr(X) != repr(Y).
    which of-course can't always be guaranteed.

    I'm not sure that there is a correct implementation which does not assume some sort of predefined structure on the input.


    oh, and for all those using full blown html parsers to solve problem #2, when the problem is simple, the additional cost of parsing the html, and building objects around it (a-la BeautifulSoup or ElementTree) , just isn't worth it.

  • Dave Kirby
    January 18th, 2009, 12:17 p.m.

    """Uh, for Dave Kirby: the output format seems to prescribe that the word should be placed before the number of its occurrencies, not the other way around."""

    My bad - I shouldn't write code at 1am. It is easily fixed by adding
    | awk '{print $2, $1}'
    to the end of the pipline.

    For the python version of the same problem, I would use the string translate method to strip out punctuation - it is faster and simpler than processing a character at a time then joining the list back into a string:


    import string
    from collections import defaultdict
    from operator import itemgetter

    counts = defaultdict(int)
    trans = string.maketrans('', '')
    for line in open('test.txt'):
    word = line.lower().translate(trans, string.punctuation+string.whitespace)
    counts[word] += 1
    #end for

    for word, count in sorted(counts.iteritems(), key=itemgetter(1), reverse=True):
    if count > 1:
    print word, count
    #end if
    #end for


    The #end if/for comments are because the indentation gets mangled by the blog - there is a script 'pindent.py' included in the python distro that should re-indent code marked with these comments.

  • Roger
    January 18th, 2009, 12:35 p.m.

    If you have any examples of the good vs bad code that you could put up it would be interesting. Just to put myself on the firing line here's a quick solution for the first one:


    import string

    def readword(file):
    return file.readline().strip().lower().translate(None, string.punctuation)

    words = {}
    wordfile = open("inputfile.txt")

    word = readword(wordfile)

    while(word):
    words[word] = words.get(word, 0) + 1
    word = readword(wordfile)

    sortedWords = sorted(words.items(), key=lambda(k, v) : (v, k), reverse = True)

    for word in sortedWords:
    if(word[1] > 1):
    print("%s %d" % (word[0], word[1]))


    microsoft 3
    apple 2

  • Ralph Corderoy
    January 18th, 2009, 12:55 p.m.

    Knowledge of the standard library can make #1 be succinct. Note, that #1's spec. is incomplete. For tags occurring the same number of times, it isn't defined in what order they should be output. I've gone for alphabetical.

    In order to preserve indentation, I've replaced some of the spaces at the start of the line with underscores.


    import string
    import collections
    import fileinput

    lower = ''.join(map(chr, range(0x100))).lower()
    delete = string.whitespace + string.punctuation

    freq = collections.defaultdict(int)
    for l in fileinput.input():
    ___ l = l.translate(lower, delete)
    ___ freq[l] += 1

    words = collections.defaultdict(list)
    for w, f in freq.iteritems():
    ___ if f != 1:
    _______ words[f].append(w)

    for f in sorted(words, reverse = True):
    ___ for w in sorted(words[f]):
    _______ print w, f

  • January 18th, 2009, 4:12 p.m.

    I've added pre tags arround the code. It should make this thread easier to read for future visitors! Just wait for my new blog system that has syntax highlighting in the comments. :-)

    Dave Kirby, show off! Your Python solutions is nice a succinct though.

    _Mark_ The beautiful soup solution is very nice, and more general.

    Paddy3118, Any assumption would be fine if it wasn't stated explicitly in the problem. I wouldn't penalize someone on a technicality. I only 'marked' the tests on a fail or pass basis. For three, the solutions that got close would treat floats and ints as the same, and unicode and strings as the same. I think that requirement is at least implicit from the description of the problem, but something only more experienced coders would pick up on.

    For the first problem, I'm surprised nobody used the setdefault method on dictionaries. A defaultdict class is cleaner though. Using itertools.groupby is another possibility that avoids the need for a dictionary at all.

    For problem 2, any of the solutions that actually parsed the html are better from a technical point of view. It depends on the context though, sometimes a simple RE will suffice.

    For problem 3, my solution for generating the key was to use repr, as others have done. The simplest way to generate a key, I think, is just do do repr( (args, kwargs) ). I had also thought that picking, but Dave Kirby suggests that wont work -- which is a new one on me! I wonder if there are other Python serializer mechanisms can guarantee consistent output with the same input.

    Another issue in problem 3, is where to you store the dictionary that contains the cached output? Most solutions have one dictionary and use the function as part of the key. I don't like that solution much, it feels a little 'dirty'. IMHO it is better to store the dictionary as a function attribute of the function being cached.

  • ido
    January 18th, 2009, 5:02 p.m.

    Actually, as long as you don't store the cache dictionary in "module level", you will get a new cache dictionary for each function you wrap with the decorator. adding the function as part of the key is redundant, but I have added it for good measure. (just so you could reconstruct the call from the key, if you wanted to)

    but I do like adding the cache to the function as an attribute, it has the same semantics, without the redundant part.

  • Dave Kirby
    January 18th, 2009, 6:15 p.m.

    This python bug report gives more information on the pickling issue - it was closed as invalid, since pickled string equality is not guaranteed:

    http://mail.python.org/pipermail/python-bugs-list/2004-January/021745.html

    The main problem is object references - if you pickle two identical object (or the same object twice), but there are a different number of references to the objects then you may get a different pickle.

    >>> import cPickle as pickle
    >>> a = {}
    >>> pickle.dumps(a)
    '(dp1\n.'
    >>> pickle.dumps({})
    '(d.'

    this does not seem to happen with the pure python version of pickle, but that is likely to be so slow that it is useless for memoizing (and equality is still not guaranteed).

  • January 18th, 2009, 9:08 p.m.

    I'll also note that for all that a regexp solution is supposed to be "simpler", we've seen only one (that apparently misunderstood the question) and from the comments, none of the suggestions that a regexp would be simpler considered things like html comments, cdata, or whether nbsp is whitespace... I've found it quite common that when you've got a simple regexp for something - you've missed some part of the problem :-)

    (Also, like lambdas, regexps don't have docstrings...)

  • ido
    January 18th, 2009, 10:41 p.m.

    I'm guessing you are talking about my comment, unfortunately, wordpress deletes everything between angle brackets. I'll post it again substituting angle brackets with curly ones.


    import re
    empty_p = re.compile(r'{p}\s*{/p}') # subsitute \s with ' 'if you don't include tabs and newlines as white space)
    input = emtpy_p.sub('',input)


    empty_p could be generalized to be more robust:

    empty_p = re.compile(r'{\s*p\s*}\s*{\s*/\s*p\s*}',re.IGNORECASE)


    this will catch also p tags which are uppercase or contains spaces inside the tag.

    the regexp solution IS simpler, given that you can speak regexp fluently, and it has the advantage of being quicker than other solutions.
    but it falls down to the problem, I won't use regexp in a situation where the the structure needed extraction is complicated, when I need to do multiple tasks with the html, or when the problem can't be solved with a regular language automata.
    And remember that most of the times, you write html parsers for specific sites, where you know how the html behaves.

    and while I'm at it, I'll re-post my decorator which the wordpress-monster ate...


    def cache(max_time=datetime.timedelta(minutes=5), max_count=10):
    CACHE = {}
    def decorator(f):
    @functools.wraps(f)
    def wrapped(*args, **kwargs):
    key = repr(f)+'|'+repr(args)+'|'+repr(kwargs)
    if key in CACHE:
    first_access, count, res = CACHE[key]
    if datetime.datetime.now() - first_access >= max_time or count >= max_count:
    del CACHE[key]
    else:
    CACHE[key] = (first_access, count + 1, res)
    return res
    res = f(*args, **kwargs)
    CACHE[key] = (datetime.datetime.now(), 1, res)
    return res
    return wrapped
    return decorator

  • ido
    January 18th, 2009, 10:42 p.m.

    damn, pre tags, don't work :(

  • Ralph Corderoy
    January 19th, 2009, 11:40 a.m.

    @Dave Kirby: Can't the ''.translate() also do the ''.lower() at the same time? (Also, the delete string is being built each time around the loop.) I'd also be interested to see your approach if the output had to be sorted by descending frequency and then ascending tag.

  • January 19th, 2009, 8:48 p.m.

    Just for completeness: ido, thank you for reposting that regexp - it is simple, and matches the most-naive interpretation of the request; it's just that *any* simple regexp will get the wrong answer in the face of html comments or cdata sections (which I won't try to enter here, since there's no preview to see if it eats them.) regexps have their place - my point is just that it isn't as broad a place as people seem to think :-)

  • Dave Kirby
    January 20th, 2009, 12:21 a.m.

    @Ralph Corderoy: yes, .translate() could be used to do the lowercase conversion as well - I didnt do that originally because it would not work with unicode, but in hindsight my code would not work with unicode anyway, especially if there are unicode punctuation characters (I think that is true of all the other solutions here).

    The unicode type also has a translate method, but it works differently - it takes a dictionary of characters to translate, and does not do deletions.

  • nes
    January 21st, 2009, 5:26 p.m.

    Because I am in golfing mood:

    from itertools import groupby;filter(lambda p:p[1]>1,((x[0],len(list(x[1]))) for x in groupby(sorted(e.strip('.? ').lower() for e in 'microsoft apple microsoft. Apple? security microsoft internet'.split()))))

  • nes
    January 21st, 2009, 5:32 p.m.

    ups forgot to sort:

    from operator import itemgetter;from itertools import groupby;sorted(filter(lambda p:p[1]>1,((x[0],len(list(x[1]))) for x in groupby(sorted(e.strip('.? ').lower() for e in 'microsoft apple microsoft. Apple? security microsoft internet'.split())))),key=itemgetter(1),reverse=True)

  • Ralph Corderoy
    January 22nd, 2009, 1:05 p.m.

    For golf, you could import longname as l to save space later.

  • Mustafa
    January 26th, 2009, 12:01 a.m.

    My solution to #1 was very similar the the one in the first comment, so I came up with this:

    import string

    raw_tags = ["".join([c for c in tag if c not in string.punctuation]).lower().strip() for tag in open('tags').readlines()]

    tags = [(tag, len(raw_tags) - len([a for a in raw_tags if a != tag]) ) for tag in set(raw_tags)]

    print "\n".join(["%s %s"% (k,v) for k, v in \
    sorted([(tag, len(raw_tags) - len([a for a in raw_tags if a != tag]))\
    for tag in set(raw_tags)], key=lambda x:(x[1], x[0]), reverse=True) if v != 1])



    Is it elegant?

Leave a Comment

You can use bbcode in the comment: e.g. [b]This is bold[/b], [url]http://www.willmcgugan.com[/url], [code python]import this[/code]
Preview Posting...
Previewing comment, please wait a moment...
Will McGugan

My name is Will McGugan. I am an unabashed geek, an author, a hacker and a Python expert – amongst other things!

You are reading my tech blog. See the homepage for my other blogs.

Search for Posts
Possibly related posts
Tags
Popular Tags
 
Archives
2010
 
Recent Comments
http://www.iclshoes.com/alexander-mcqueen-c-13.html [iclshoes.com] http://www.iclshoes.com/jimmy-choo-shoes-c-2.html [iclshoes.com] http://www.iclshoes.com/ [iclshoes.com] http://www.zentai-mart.com/Play-Costumes-c-5.html [zentai-mart.com] http://www.zentai-mart.com/Latex-Catsuits-Clothes-c-3.html [zentai-mart.com] http://www.zentai-mart.com/PVC-Catsuits-Clothes-c-6.html [zentai-mart.com] http://www.zentai-mart.com/ [zentai-mart.com] http://www.hereshoes.com/miu-miu-shoes-c-31.html [hereshoes.com] http://www.hereshoes.com/giuseppe-zanotti-c-43.html [hereshoes.com] http://www.hereshoes.com/lanvin-shoes-c-50.html [hereshoes.com] ...
- Christian Louboutin on Turning website favicons in to 3D
What are the charmings of?a href="http://www.iclshoes.com [iclshoes.com]cl shoes/a?They are quality,comfort and style.The a href=http://www.iclshoes.com/jimmy-choo-shoes-c-2.html [iclshoes.com]Jimmy Choo shoes/a are made from ...
- Christian Louboutin on Turning website favicons in to 3D
Thanks a lot for that: I had first tried sudo aptitude purge adobe-flashplugin then sudo aptitude install flashplugin-nonfree but that ...
Andre, the name is derived from the class name (camel case converted to lower case with underscores). But, you can ...
is this using some kind of name mangling to map sorethumb:rounded_corners_edged to the RoundedCornersEdged class? that seems a little unnecessary ...
 
© 2008 Will McGugan.

A technoblog blog, design by Will McGugan