December 2, 2016 will

Strings hiding in plain sight

It's not often I come across something in Python that surprises me, especially in something as mundane as string operations, but I guess Python still has a trick or two up its sleeve.

Have a look at this string:

>>> s = "A"

How many possible sub-strings are in s? To put it another way, how many values of x are there where the expression x in s is true?

Turns out it is 2.

2?

Yes, 2.

>>> "A" in s
True
>>> "" in s
True

The empty string is in the string "A". In fact, it's in all the strings.

>>> "" in "foo"
True
>>> "" in ""
True
>>> "" in "here"
True

Turns out the empty string has been hiding everywhere in my code.

Not a complaint, I'm sure the rationale is completely sound. And it turned out to be quite useful. I had a couple of lines of code that looked something like this:

if character in ('', '\n'):
    do_something(character)

In essence I wanted to know if character was an empty string or a newline. But knowing the empty string thang, I can replace it with this:

if character in '\n':
    do_something(character)

Which has exactly the same effect (as long as character is a string), but I suspect is a few nanoseconds faster.

Don't knock it. When you look after the nanoseconds, the microseconds look after themselves.
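A quick way to sanity-check the "same effect" claim is to run both forms over a handful of inputs. This is only a sketch: the helper names `in_tuple` and `in_string` and the sample values are made up for illustration. It also shows the one place the two forms part ways, which is when the value is not a string at all.

```python
def in_tuple(character):
    # The original, explicit form.
    return character in ('', '\n')


def in_string(character):
    # The shorter form, relying on '' being a substring of every string.
    return character in '\n'


# Both forms agree for these string inputs.
samples = ['', '\n', 'a', ' ', '\n\n', 'ab']
agree = all(in_tuple(c) == in_string(c) for c in samples)
print(agree)

# Non-string input: the tuple check just returns False,
# while the string check raises TypeError.
print(in_tuple(None))
try:
    in_string(None)
except TypeError as exc:
    print('TypeError:', exc)
```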

Adam Barnes

I'd argue this is an anti-feature.

Looking at if character in '\n':, I'd consider "" matching a bug.

Man this is a nice comment box so far. If I have to sign up to disqus or something I'm gonna be mad.

Anon Anon

Why not just do if character == '\n': then it won't match an empty string?

Also, did you have to sign up to disqus? I guess I'll figure it out by replying.

Anonymous

It's set theory, not an anti-feature.

Will McGugan

Looking at if character in '\n':, I'd consider "" matching a bug.

It is odd looking. But these edge cases tend to be well thought out in Python by some very smart people. So I'm guessing there is some solid thinking behind it.

Man this is a nice comment box so far. If I have to sign up to disqus or something I'm gonna be mad.

The comment system is home grown. Did it work well? There may be some glitches left.

Terry Jan Reedy

String containment is about substrings (slices), NOT about single characters: (sub in string) == any(sub == string[i:i+len(sub)] for i in range(len(string) - len(sub) + 1)).
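The slice-based definition can be turned into a small function and compared against the built-in operator. This is a sketch, not how CPython implements `in`; `contains` is a hypothetical name, and the slice is written as `string[i:i + n]` with `len(string) - n + 1` start positions so that every possible slice of the right length is tried.

```python
def contains(sub, string):
    """Return True if some slice of `string` equals `sub`."""
    n = len(sub)
    # Try every start position where a slice of length n could begin.
    return any(sub == string[i:i + n] for i in range(len(string) - n + 1))


cases = [
    ('', 'A'), ('A', 'A'), ('', ''), ('b', 'abc'),
    ('bc', 'abc'), ('abcd', 'abc'), ('x', ''),
]
for sub, string in cases:
    # The sketch should agree with the built-in operator on every case.
    assert contains(sub, string) == (sub in string)
```

With `sub = ''` the slice `string[i:i]` is always empty, so the very first comparison succeeds. That is the "empty string is in every string" result falling out of the definition.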

mborus

Interesting find. Thanks for posting it. Personally I'd prefer the readability of ('', '\n') so that it's clear that you're checking for empty chars, which is not obvious when using the string method.

I also have a problem with your assumption

Which has exactly the same effect, but I suspect is a few nanoseconds faster. Don't knock it. When you look after the nanoseconds, the microseconds look after themselves.

Did you measure the speed improvement or guess?

I did a small test (Python 3.5.2, 32-bit) and the behaviour is inconsistent. With the code below, exact matches to the set are slightly faster than checking the string.

import timeit
char = ''

for run in range(10):
    # Time the string containment check.
    start = timeit.default_timer()
    for _ in range(100000):
        if char in '\n':
            pass
    print('string:{}'.format(timeit.default_timer() - start))

    # Time the ('', '\n') containment check.
    start = timeit.default_timer()
    for _ in range(100000):
        if char in ('', '\n'):
            pass
    print('set   :{}'.format(timeit.default_timer() - start))

Will McGugan

Good point! You're right, I didn't test it.

Your test doesn't try all the possible inputs, which may perform differently. I tweaked it to try the empty string, a newline, and another character.

import timeit

for run in range(10):
    # String version: empty string, newline, and a non-matching character.
    start = timeit.default_timer()
    for _ in range(1000000):
        '' in '\n'
        '\n' in '\n'
        'f' in '\n'
    print('string:{}'.format(timeit.default_timer() - start))

    # ('', '\n') version with the same three inputs.
    start = timeit.default_timer()
    for _ in range(1000000):
        '' in ('', '\n')
        '\n' in ('', '\n')
        'f' in ('', '\n')
    print('set   :{}'.format(timeit.default_timer() - start))

It does look like the string version is a tiny bit faster (and I do mean tiny).

You're definitely right about the readability. I'd only use this in a very tight loop and with a comment.

BTW the ('', '\n') is just a tuple. Didn't occur to me to try with a set.
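For anyone who wants to try the set variant as well, here is one way to time all three forms with `timeit.timeit`. Treat it as a rough sketch: the names `INPUTS`, `CANDIDATES` and `time_all` are invented for this example, the iteration count is arbitrary, and the numbers will vary between machines and Python versions.

```python
import timeit

# Same three inputs as above: empty string, newline, and a non-match.
INPUTS = ('', '\n', 'f')

# Each statement is timed with the name `c` bound to one input.
CANDIDATES = {
    'string': "c in '\\n'",
    'tuple': "c in ('', '\\n')",
    'set': "c in {'', '\\n'}",
}


def time_all(number=100000):
    """Return {(label, input): seconds} for every form and input."""
    results = {}
    for label, stmt in CANDIDATES.items():
        for c in INPUTS:
            results[(label, c)] = timeit.timeit(
                stmt, globals={'c': c}, number=number)
    return results


timings = time_all()
for (label, c), seconds in sorted(timings.items()):
    print('{:6} {!r:5} {:.4f}s'.format(label, c, seconds))
```

Timing each input separately makes it easier to see whether a difference comes from the match case or the non-match case.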

Albert Hopkins

I actually did know about the "empty string is contained in any string" thing, but for me

if character in ('', '\n'):

is clearer (to the reader) than

if character in '\n':

Because though the latter works and may be a few "nanoseconds" faster, the former makes it clear to the user that the effect is intentional.

Will McGugan

No argument there.

I used it in a tight CPU-bound loop and with a comment. Still feels a little dirty...

Inyeol Lee

As an elaboration, guess HOW MANY empty strings are in 'abc'.

>>> 'abc'.replace('', '_')
'_a_b_c_'

Will McGugan

That is peculiar. Replacing the empty string with something else actually makes more empty strings.
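The count can be checked directly with `str.count`. A small sketch (the variable names are just for illustration): there is one empty string before each character and one at the end, so a string of length n contains n + 1 of them.

```python
s = 'abc'

# One empty string at each gap: before 'a', 'b', 'c', and at the end.
gaps = s.count('')
print(gaps, len(s) + 1)

# Filling every gap makes a longer string, which has even more gaps.
replaced = s.replace('', '_')
print(replaced, replaced.count(''))
```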