January 17, 2009, by Will

Python Coder Test

Since Net Communities have their new developer, I can post the coder test I wrote to find a replacement. The test was not meant to be particularly challenging, and we only requested that developers do one out of the three problems. Most developers did all three--presumably in an effort to impress--but nobody got all three correct the first time around. That was probably my fault, because the third problem had a subtle issue that is easy to overlook.

I'll quote the problems here, including typos. Occasionally the way I phrased things was slightly ambiguous, but I didn't penalize if the candidate misunderstood a requirement due to my use of language.

1. Write a script that takes a file containing a number of tags words (one per line) and sorts them by the number of times they occur in the file (descending order), but exclude words that only occur once. Ignore the case of the words, and filter out any puncutation characters. The output should be the word, followed by a space and the number of times it occurs in the file.

Example input:

microsoft

apple

microsoft.

Apple?

security

microsoft

internet

Example output:

microsoft 3

apple 2

This is something I would expect any programmer to be able to solve in any language. What I was looking for is simplicity in the solution. If a developer solves a simple problem like this with complex code, then he* would probably solve a complex problem with unmaintainable code. And although efficiency was not a requirement for any of the problems, I was disappointed by the number of developers who didn't know when to use a dictionary / set over a list. Many candidates failed on this problem because they didn't fulfill all the requirements--not because the code was buggy, but because a requirement was simply ignored. This left me in a bit of a quandary, because I suspected that most of the candidates could have done it adequately if prompted--but then shouldn't a developer pay attention to details?
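
To make that concrete, here is a minimal dictionary-based sketch of the sort of simplicity I mean (written in Python 3 syntax, with the example input inlined; an illustration, not the 'official' answer):

```python
import string
from collections import Counter

def count_tags(lines):
    """Count normalized tag words, dropping words that occur only once."""
    table = str.maketrans('', '', string.punctuation)
    counts = Counter(line.strip().lower().translate(table) for line in lines)
    return sorted(((word, n) for word, n in counts.items() if n > 1),
                  key=lambda item: item[1], reverse=True)

# the example input from the problem statement
lines = ["microsoft", "apple", "microsoft.", "Apple?",
         "security", "microsoft", "internet"]
for word, count in count_tags(lines):
    print(word, count)
```

A dictionary (here via Counter) makes the counting a single pass over the input; a list-based approach ends up quadratic.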

2. Write a script that removes blank paragraphs from a piece of html. A blank paragraph should be considered to be a <p> </p> tag containing only white space.

I was looking for a regular expression here. Even if a developer hasn't had a great deal of experience with regular expressions, he should be able to recognize that this is a situation where an R.E. would come in handy. One developer parsed the HTML to find blank paragraphs, which was hard to fault because it did work--but I tend to prefer the simplest solution that works, and a handful of lines is generally preferable to several pages of code.
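
For illustration, the regular-expression approach can be a couple of lines (Python 3 syntax; the exact pattern is my sketch, not any candidate's submission):

```python
import re

def remove_blank_paragraphs(html):
    # a blank paragraph is a <p>...</p> pair containing only white space
    return re.sub(r'<p>\s*</p>', '', html)

print(remove_blank_paragraphs('<p>Hello</p><p> \n </p>'))
```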

3. Write a decorator that stores the result of a function call and returns the cached version in subsequent calls (with the same parameters) for 5 minutes, or ten times -- whichever comes first.

This was the one that everyone failed. The main issue was that to do it properly you need to create an immutable key out of the parameters to the function to be cached. But that was only the final hurdle; many candidates failed on simpler issues. In fact, many 'solutions' didn't actually run at all! This surprised me--if you have plenty of time to write the code and there is potentially a job riding on it, don't you test your solution?
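
The immutable-key step can be sketched like this (Python 3 syntax; note it still conflates values that compare equal, such as 1 and 1.0, which is exactly the kind of subtlety that tripped people up):

```python
def make_key(args, kwargs):
    """Build a hashable cache key from a call's positional and keyword arguments.

    frozenset makes keyword order irrelevant; the key is hashable as long
    as the arguments themselves are.
    """
    return (args, frozenset(kwargs.items()))

# the same call with keywords in a different order produces the same key
print(make_key((1, 2), {'a': 1, 'b': 2}) == make_key((1, 2), {'b': 2, 'a': 1}))
```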

Although nobody got that last problem correct, we did interview some good developers with a mix of skills and abilities. Most of them were very passionate about working with Django, and passion is a good thing in any field.

Being on the other side of an interview was an interesting experience for me; I hope I wasn't too much of a jerk! Thanks to all who applied.

If anyone wants to post a solution to any of the problems, they are more than welcome--but I'm afraid the job has been filled.

* I hate to type 'he or she', please assume that's what I mean.

Kevin

Question #1:


from operator import itemgetter
import string

filename = "input.txt"

fo = open(filename, "r")
data = fo.readlines()
fo.close()

set_data = {}
exclude = set(string.punctuation)

for line in data:
    line = line.strip()  # remove newlines
    line = line.lower()  # ignore case
    line = ''.join(ch for ch in line if ch not in exclude)  # ignore punctuation
    if line in set_data:
        set_data[line] = set_data[line] + 1
    else:
        set_data[line] = 1

for i in sorted(set_data.items(), key=itemgetter(1), reverse=True):
    if i[1] == 1:
        continue  # skip single entries
    print "%s %s" % (i[0], i[1])

The "ignore punctuation" bit came from stack overflow.

andre

#1 is easy

#2 I don't know a lot of HTML. Isn't it allowed to have e.g. attribute strings contain tags? In that case you would have to resort to a parser to be sure you were cleaning only HTML markup and not attribute strings.

#3 Would require pickling the parameters in order to deal with mutable data.

ido

let's see:

1)

 
import re
from collections import defaultdict
from operator import itemgetter

not_letters = re.compile(r'[^a-zA-Z]')

def clean(line):
    return not_letters.sub('', line.strip().lower())

filename = 'input.txt'
tags = defaultdict(lambda: 0)
f = open(filename, 'r')
for line in f:
    tags[clean(line)] += 1
f.close()

for tag, count in sorted(filter(lambda x: x[1] != 1, tags.items()), key=itemgetter(1), reverse=True):
    print tag, count

2)

import re
empty_p = re.compile(r'<p>\s*</p>', re.MULTILINE) # substitute \s with ' ' if you don't include tabs and newlines as white space
input = empty_p.sub('', input)

empty_p could also be re.compile(r'<\s*p\s*>\s*<\s*/\s*p\s*>') for a more general solution.

3) hmm... now this is more interesting:

import datetime
import functools

def cache(max_time=datetime.timedelta(minutes=5), max_count=10):
    CACHE = {}
    def decorator(f):
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            key = repr(f) + '|' + repr(args) + '|' + repr(kwargs)
            if key in CACHE:
                first_access, count, res = CACHE[key]
                if first_access + max_time < datetime.datetime.now() or count >= max_count:
                    del CACHE[key]
                else:
                    CACHE[key] = (first_access, count + 1, res)
                    return res
            res = f(*args, **kwargs)
            CACHE[key] = (datetime.datetime.now(), 1, res)
            return res
        return wrapped
    return decorator



I hope the formatting won't be way off...

ido

hmm... wordpress seems to cut html formatting from comments.

so i guess this won't work as well:

def f(x):
pass

_Mark_

For #2, I'd expect a regexp from a general developer, or one with perl experience - but I'd push harder on someone claiming python experience :-) I've moved a number of perl programmers over to python and one sign of being "early" in the transition is using regexps for things with more robust and readable functions available - str.endswith, os.path.basename, and for this particular example, BeautifulSoup:


from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("...")
for paratag in soup.findAll(name="p"):
    if not paratag.string or not paratag.string.strip():
        paratag.extract()
    #end if
#end for
print soup

Dave Kirby

Here is my solution to the first problem:

$ cat test.txt | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | grep -v '^ *1 '

3 microsoft
2 apple

The second problem is trivial with a regex, although as andre says, you could have attribute strings with tags in.


The third problem is tricky - strictly speaking pickling will not work, since it is not guaranteed that pickling the same parameters twice will always produce the same string - this is something that has bitten me in the past. In this case it will work most of the time, but you will get an occasional cache miss.

Dave Kirby

I forgot that uniq has an option to remove non-duplicated entries, so the grep at the end is unnecessary:

$ cat test.txt | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | sort | uniq -cd | sort -nr
3 microsoft
2 apple

Giulio Piancastelli

I am quite a Python newbie (and I have not come across writing decorators yet, so the 3rd problem is screwed) but the following are my solutions to the test.

#1:


from __future__ import with_statement

from itertools import groupby
from string import punctuation

def groupwords(words):
    words = [normalize(word) for word in words]
    result = [(k, len(list(g))) for k, g in groupby(sorted(words))]
    return sorted(result, cmp=lambda x, y: cmp(y[1], x[1]))

def normalize(word):
    return ''.join(c for c in word.strip().lower() if c not in punctuation)

# Input/Output
with open('input.txt', 'r') as f:
    groups = groupwords(f.readlines())
    for group in filter(lambda g: g[1] > 1, groups):
        print '%s %d' % (group[0], group[1])

I like using with/as for the input file; using itertools.groupby to do the "uniq" thingy. Not sure about normalize, though: perhaps a regexp would be better?
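
For comparison, a regexp version of normalize might look like this (a sketch in Python 3 syntax, keeping only letters and digits rather than deleting punctuation):

```python
import re

def normalize(word):
    # lower-case, then strip anything that isn't a letter or a digit
    return re.sub(r'[^a-z0-9]', '', word.strip().lower())

print(normalize('Apple?'))
```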

#2:

from xml.dom import Node
from xml.dom.minidom import parse

dom = parse('input.html')
paragraphs = dom.getElementsByTagName('p')
for p in paragraphs:
    text = p.firstChild
    if (not text or
            (text.nodeType == Node.TEXT_NODE and not text.nodeValue.strip())):
        p.parentNode.removeChild(p)
        p.unlink()

html = dom.toxml()
dom.unlink()
print html[html.index('?>')+2:]

So, I stand by the idea behind my code, in that using a regexp for parsing HTML is, uhm, really not exactly what I would like to see if I were the interviewer. Libraries to parse and manipulate HTML do exist, and choosing the right tool would be a matter of additional context given to the problem. (This is the same reason why I liked the idea of using itertools instead of grouping "by hand".) Here I used minidom because it's simple and in the standard library, even if for producing output it's not exactly practical. If external libraries were allowed, I would have tried something like BeautifulSoup (which BTW has indeed been used by another commenter).

Giulio Piancastelli

Uh, for Dave Kirby: the output format seems to prescribe that the word should be placed before the number of its occurrences, not the other way around.

Paddy3118

Having written this blog entry, is there no more 'context implied' information that you could have supplied?

For example:

In 1, would it be correct to assume ASCII?

In 2, might you have said beforehand that a regex would do, or given some representative examples of input HTML?

In 3, you say no one "did it properly". This flips your comments on problem 2, where you wanted "the simplest solution". Without telling the candidates, you require them to be mind readers to produce the right level of detail.

In summary, I think it is easy to write insufficient specs. You might have tried your problems on a few anonymous coders, taking care to remember what extra needed to be imparted to get the answers you were thinking of.

- Paddy.

Michele Simionato

The third problem is indeed nontrivial. In the documentation of my decorator module I use a frozenset to ensure the function arguments are hashable:


import functools

def memoize25(func):
    func.cache = {}
    def memoize(*args, **kw):
        if kw: # frozenset is used to ensure hashability
            key = args, frozenset(kw.iteritems())
        else:
            key = args
        cache = func.cache
        if key in cache:
            return cache[key]
        else:
            cache[key] = result = func(*args, **kw)
            return result
    return functools.update_wrapper(memoize, func)

ido

Michele, your implementation has a problem with unhashable objects:


@memoize25
def f(x):
    return x['v']*x['v']

>>> f({'v':5})
TypeError: dict objects are unhashable

doing it with repr() as I have done allows using unhashable objects, but has the drawback that objects need to implement __repr__ as a meaningful function, so that if X and Y differ in semantics and/or functionality, repr(X) != repr(Y)--which of course can't always be guaranteed.

I'm not sure that there is a correct implementation which does not assume some sort of predefined structure on the input.
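
To illustrate that drawback: a class that never defines __repr__ falls back to the default, which embeds the object's memory address, so two semantically equal arguments produce different keys (Python 3 sketch):

```python
class Point:
    # no __repr__: the default repr embeds the object's memory address
    def __init__(self, x):
        self.x = x

p, q = Point(1), Point(1)
# equal in meaning, but a repr-based cache key would treat them as different
print(repr(p) == repr(q))
```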


oh, and for all those using full blown html parsers to solve problem #2: when the problem is simple, the additional cost of parsing the html and building objects around it (a la BeautifulSoup or ElementTree) just isn't worth it.

Dave Kirby

"""Uh, for Dave Kirby: the output format seems to prescribe that the word should be placed before the number of its occurrencies, not the other way around."""

My bad - I shouldn't write code at 1am. It is easily fixed by adding | awk '{print $2, $1}' to the end of the pipeline.

For the python version of the same problem, I would use the string translate method to strip out punctuation - it is faster and simpler than processing a character at a time then joining the list back into a string:


import string
from collections import defaultdict
from operator import itemgetter

counts = defaultdict(int)
trans = string.maketrans('', '')
for line in open('test.txt'):
    word = line.lower().translate(trans, string.punctuation + string.whitespace)
    counts[word] += 1
#end for

for word, count in sorted(counts.iteritems(), key=itemgetter(1), reverse=True):
    if count > 1:
        print word, count
    #end if
#end for


The #end if/for comments are because the indentation gets mangled by the blog - there is a script 'pindent.py' included in the python distro that should re-indent code marked with these comments.

Roger

If you have any examples of the good vs bad code that you could put up it would be interesting. Just to put myself on the firing line here's a quick solution for the first one:


import string

def readword(file):
    return file.readline().strip().lower().translate(None, string.punctuation)

words = {}
wordfile = open("inputfile.txt")

word = readword(wordfile)

while word:
    words[word] = words.get(word, 0) + 1
    word = readword(wordfile)

sortedWords = sorted(words.items(), key=lambda (k, v): (v, k), reverse=True)

for word in sortedWords:
    if word[1] > 1:
        print "%s %d" % (word[0], word[1])


microsoft 3
apple 2

Ralph Corderoy

Knowledge of the standard library can make #1 succinct. Note that #1's spec is incomplete: for tags occurring the same number of times, it isn't defined in what order they should be output. I've gone for alphabetical.

import string
import collections
import fileinput

lower = ''.join(map(chr, range(0x100))).lower()
delete = string.whitespace + string.punctuation

freq = collections.defaultdict(int)
for l in fileinput.input():
    l = l.translate(lower, delete)
    freq[l] += 1

words = collections.defaultdict(list)
for w, f in freq.iteritems():
    if f != 1:
        words[f].append(w)

for f in sorted(words, reverse=True):
    for w in sorted(words[f]):
        print w, f

Will

I've added pre tags around the code. It should make this thread easier to read for future visitors! Just wait for my new blog system that has syntax highlighting in the comments. :-)

Dave Kirby, show off! Your Python solution is nice and succinct though.

_Mark_, the BeautifulSoup solution is very nice, and more general.

Paddy3118, any assumption would be fine if it wasn't stated explicitly in the problem. I wouldn't penalize someone on a technicality. I only 'marked' the tests on a pass or fail basis. For problem three, the solutions that got close would treat floats and ints as the same, and unicode and strings as the same. I think that requirement is at least implicit from the description of the problem, but something only more experienced coders would pick up on.

For the first problem, I'm surprised nobody used the setdefault method on dictionaries. A defaultdict class is cleaner though. Using itertools.groupby is another possibility that avoids the need for a dictionary at all.
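
A setdefault version of the counting loop, for the record (Python 3 sketch):

```python
counts = {}
for word in "microsoft apple microsoft".split():
    # setdefault stores 0 the first time a word is seen, then we add one
    counts[word] = counts.setdefault(word, 0) + 1
print(counts)
```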

For problem 2, any of the solutions that actually parsed the html are better from a technical point of view. It depends on the context though, sometimes a simple RE will suffice.

For problem 3, my solution for generating the key was to use repr, as others have done. The simplest way to generate a key, I think, is just to do repr( (args, kwargs) ). I had also thought of pickling, but Dave Kirby suggests that won't work--which is a new one on me! I wonder if there are other Python serializer mechanisms that can guarantee consistent output for the same input.
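
One candidate for a deterministic serializer--offered here as a suggestion rather than anything settled in this thread--is json.dumps with sort_keys=True. It only handles JSON-serializable arguments and conflates tuples with lists, but its output is stable for equal inputs (Python 3 sketch):

```python
import json

def json_key(args, kwargs):
    # sort_keys pins down dictionary order, so equal inputs give equal strings
    return json.dumps([args, kwargs], sort_keys=True)

print(json_key((1, 2), {'b': 2, 'a': 1}))
```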

Another issue in problem 3 is where do you store the dictionary that contains the cached output? Most solutions have one dictionary and use the function as part of the key. I don't like that solution much; it feels a little 'dirty'. IMHO it is better to store the dictionary as an attribute of the function being cached.
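
A minimal sketch of the function-attribute approach (Python 3 syntax; the _cache name is illustrative):

```python
import functools

def memoized(func):
    @functools.wraps(func)
    def wrapper(*args):
        cache = wrapper._cache      # the cache lives on the wrapper itself
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    wrapper._cache = {}             # one private dictionary per decorated function
    return wrapper

@memoized
def square(x):
    return x * x

print(square(4))
```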

ido

Actually, as long as you don't store the cache dictionary at module level, you will get a new cache dictionary for each function you wrap with the decorator. Adding the function as part of the key is redundant, but I have added it for good measure (just so you could reconstruct the call from the key, if you wanted to).

but I do like adding the cache to the function as an attribute, it has the same semantics, without the redundant part.

Dave Kirby

This python bug report gives more information on the pickling issue - it was closed as invalid, since pickled string equality is not guaranteed:

http://mail.python.org/pipermail/python-bugs-list/2004-January/021745.html

The main problem is object references - if you pickle two identical objects (or the same object twice), but there are a different number of references to the objects, then you may get a different pickle.

>>> import cPickle as pickle
>>> a = {}
>>> pickle.dumps(a)
'(dp1\n.'
>>> pickle.dumps({})
'(d.'

This does not seem to happen with the pure python version of pickle, but that is likely to be so slow that it is useless for memoizing (and equality is still not guaranteed).

_Mark_

I'll also note that for all that a regexp solution is supposed to be "simpler", we've seen only one (that apparently misunderstood the question) and from the comments, none of the suggestions that a regexp would be simpler considered things like html comments, cdata, or whether nbsp is whitespace... I've found it quite common that when you've got a simple regexp for something - you've missed some part of the problem :-)

(Also, like lambdas, regexps don't have docstrings...)

ido

I'm guessing you are talking about my comment. Unfortunately, wordpress deletes everything between angle brackets, so I'll post it again, substituting angle brackets with curly ones.


import re
empty_p = re.compile(r'{p}\s*{/p}') # substitute \s with ' ' if you don't include tabs and newlines as white space
input = empty_p.sub('', input)


empty_p could be generalized to be more robust:

empty_p = re.compile(r'{\s*p\s*}\s*{\s*/\s*p\s*}',re.IGNORECASE)


this will also catch p tags which are uppercase or contain spaces inside the tag.

the regexp solution IS simpler, given that you can speak regexp fluently, and it has the advantage of being quicker than other solutions.
But it depends on the problem: I won't use a regexp in a situation where the structure needing extraction is complicated, when I need to do multiple tasks with the html, or when the problem can't be solved with a regular language automaton.
And remember that most of the time you write html parsers for specific sites, where you know how the html behaves.

and while I'm at it, I'll re-post my decorator which the wordpress-monster ate...


import datetime
import functools

def cache(max_time=datetime.timedelta(minutes=5), max_count=10):
    CACHE = {}
    def decorator(f):
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            key = repr(f) + '|' + repr(args) + '|' + repr(kwargs)
            if key in CACHE:
                first_access, count, res = CACHE[key]
                if datetime.datetime.now() - first_access >= max_time or count >= max_count:
                    del CACHE[key]
                else:
                    CACHE[key] = (first_access, count + 1, res)
                    return res
            res = f(*args, **kwargs)
            CACHE[key] = (datetime.datetime.now(), 1, res)
            return res
        return wrapped
    return decorator

ido

damn, pre tags don't work :(

Ralph Corderoy

@Dave Kirby: Can't the ''.translate() also do the ''.lower() at the same time? (Also, the delete string is being built each time around the loop.) I'd also be interested to see your approach if the output had to be sorted by descending frequency and then ascending tag.

_Mark_

Just for completeness: ido, thank you for reposting that regexp - it is simple, and matches the most-naive interpretation of the request; it's just that *any* simple regexp will get the wrong answer in the face of html comments or cdata sections (which I won't try to enter here, since there's no preview to see if it eats them.) regexps have their place - my point is just that it isn't as broad a place as people seem to think :-)

Dave Kirby

@Ralph Corderoy: yes, .translate() could be used to do the lowercase conversion as well - I didn't do that originally because it would not work with unicode, but in hindsight my code would not work with unicode anyway, especially if there are unicode punctuation characters (I think that is true of all the other solutions here).

The unicode type also has a translate method, but it works differently - it takes a dictionary of characters to translate (and mapping a character to None deletes it).
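
For what it's worth, the dict-based translate can delete: mapping a code point to None removes the character (shown here with Python 3's str.translate, which works like the unicode one):

```python
# build a mapping from code point to None: those characters are deleted
table = {ord(c): None for c in "!?."}
print("Hello, world!?".translate(table))
```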

nes

Because I am in golfing mood:

from itertools import groupby;filter(lambda p:p[1]>1,((x[0],len(list(x[1]))) for x in groupby(sorted(e.strip('.? ').lower() for e in 'microsoft apple microsoft. Apple? security microsoft internet'.split()))))

nes

oops, forgot to sort:

from operator import itemgetter;from itertools import groupby;sorted(filter(lambda p:p[1]>1,((x[0],len(list(x[1]))) for x in groupby(sorted(e.strip('.? ').lower() for e in 'microsoft apple microsoft. Apple? security microsoft internet'.split())))),key=itemgetter(1),reverse=True)

Ralph Corderoy

For golf, you could import longname as l to save space later.

Mustafa

My solution to #1 was very similar to the one in the first comment, so I came up with this:

import string

raw_tags = ["".join([c for c in tag if c not in string.punctuation]).lower().strip()
            for tag in open('tags').readlines()]

tags = [(tag, len(raw_tags) - len([a for a in raw_tags if a != tag])) for tag in set(raw_tags)]

print "\n".join(["%s %s" % (k, v) for k, v in
                 sorted([(tag, len(raw_tags) - len([a for a in raw_tags if a != tag]))
                         for tag in set(raw_tags)], key=lambda x: (x[1], x[0]), reverse=True) if v != 1])



Is it elegant?

WTF!?
str.replace('<p> </p>', ' ')

Where did I fail?
Will McGugan
 """<p>
</p>
""".replace('<p> </p>', '')
Alex Rohde
#Test 1
import re
input_set = [ ("").join(re.findall(r'\w+', piece)).lower() for piece in input_set ]
answer = [ name + " " + str(input_set.count(name)) for name in set(input_set) if input_set.count(name) > 1 ]
print (answer)
Thangappan
 import string
import sys 
import os
exclude = str(string.punctuation) # This is to filter the punctuation characters in the string.
linesep = os.linesep
CountDetails = dict() # The dictionary which holds the word and its count.
try:
    (file ) = sys.argv[1]
except IndexError:
    print "Usage:[Script] [ Tag File Name]"
    sys.exit(2)
def CheckAndCount(line):
    
    ''' Checking each line as a word and do the following things 
            - Filtering the punctuation marks in the word
            - convert into lower 
            - Storing the each word in the dictionary 
    '''
    line = line.split()[0] # Forcing to get one word 
    word = '' # string for storing the replaced string 
    for chr in line:
        if chr not in exclude:
            word += chr 
    word = word.lower() # Converting into lower 
    if not CountDetails.get(word,0): # Increment the counter.
        CountDetails[word] = 1
    else:
        CountDetails[word] = CountDetails[word] + 1
    
fh = open(file,"r")
for line in fh:
    line = line.strip()
    if line:
       CheckAndCount(line) # Assuming only one word will be there in the line.
fh.close()
   
print CountDetails
wfh = open("%s.out" % file, "w")
for key,value in CountDetails.iteritems():
    if value != 1:
        wfh.write("%s : %s%s" % (key,value,linesep))
wfh.close()

I have just tried it. Let me know the status.
islight
#1
c = open("input.txt").read()
a =
b = filter(lambda x:x>1, sorted(, cmp=lambda x,y:-cmp(x,y)))
for w in b: print w,w
TestCode
c = open("input.txt").read()
a = [string.translate(x,None,string.punctuation) for x in c.lower().split()]
b = filter(lambda x:x[1]>1, sorted([(x,a.count(x)) for x in set(a)], cmp=lambda x,y:-cmp(x[1],y[1])))
for w in b: print w[0],w[1]
TestCode
import datetime
import functools

def cache(max_time=datetime.timedelta(minutes=5), max_count=10):
    CACHE = cache.CACHE = {}
    def decorator(f):
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            key = repr(f) + '|' + repr(args) + '|' + repr(kwargs)
            if key in CACHE:
                first_access, count, res = CACHE[key]
                if first_access + max_time < datetime.datetime.now() or count >= max_count:
                    del CACHE[key]
                else:
                    CACHE[key] = (first_access, count + 1, res)
                    return res
            res = f(*args, **kwargs)
            CACHE[key] = (datetime.datetime.now(), 1, res)
            return res
        return wrapped
    return decorator
the jeffster


#1

import string
from collections import Counter


def tagwords():
    tagcounter = Counter()

    with open("tagwords.txt", "r") as wordfile:
        words = list(filter(None, wordfile.read().split("\n")))

    for word in words:
        tagcounter.update([
            "".join(c for c in word if c not in string.punctuation).lower()
        ])

    data = dict(tagcounter)

    for key, value in dict(tagcounter).items():
        if value == 1:
            data.pop(key, None)

    return data


if __name__ == "__main__":
    print(tagwords())

I might post the others later if I feel like doing them. How's my code look?
Mike
don't know why this was tempting.. (#1)
import re
from collections import Counter, OrderedDict

cnt = Counter()
with open('./t') as f:
    #--- strip punctuation and lowercase ---#
    for kw in [re.sub(r'\n|\?|\.|:', '', x).lower() for x in f.readlines()]:
        cnt[kw] += 1
#--- reverse sort the dict by value ---#
od = OrderedDict(sorted(cnt.items(), key=lambda t: t[1], reverse=True))
#--- print if count > 1 ---#
print '\n'.join([k + " " + str(v) for k, v in od.iteritems() if v > 1])