August 25, 2008, by Will

A Python Challenge. Better than rentacoder?

Python programmers love a challenge, so I thought I would throw this one to the lazyweb.

I have an app that generates a large sequence of files; the nature of the files is unimportant. What is important is generating a location to store them. To keep the directory structure manageable, there should be no more than 100 files plus sub-directories in any given directory. For instance, if a directory contains 99 files, then there is only room for one more file or one more sub-directory, bringing the total to 100. There is an additional requirement that the directory structure should be as flat as possible, i.e. foo/bar for the nth item is preferable to foo/bar/baz. To summarize the requirements:

  • No more than a total of 100 files plus folders in any given directory.
  • Directory structure should be as flat as possible.
  • Algorithm should be theoretically unbounded, i.e. work for any length of sequence.
  • Character length of the path is unimportant.
  • Entries should be in Python (but feel free to post solutions in other languages for comparison).

Entries should be a function called make_path, which takes a single parameter, n, which is the number of the file in the sequence (starting at 0). The return value should be a path to the file, such as foo/bar/baz (where baz is a file in directory foo/bar). Feel free to add any extension to the filename, to make it clearer that the path references a file. You can't rely on the function being called in sequence, so don't keep any kind of persistent state across calls to make_path (and don't do any kind of file access).
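
For illustration, the simplest possible function with the required signature would be the hypothetical baseline below. It is stateless and deterministic, but it ignores the 100-entry rule entirely, so it is not a valid entry.

```python
def make_path(n):
    # Hypothetical baseline: dump every file into a single directory.
    # Correct signature, stateless, deterministic -- but it violates the
    # 100-entries-per-directory requirement, so a real entry must do better.
    return "%d.dat" % n
```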

WordPress comments tend to mess up the formatting of Python code, so please post your entries on Pygments.org or a similar site and leave a comment here with the URL.

The winner gets the respect of his/her peers and will be owed a favor by me, because I figured it would be more fun to write this blog entry than solve it myself! ;-) All entries grant the rights for anyone to freely use and modify the code.

Alex

http://pygments.org/demo/957/

"Directory structure should be as flat as possible", I had to assume this needed to be satisfied for an infinitely large number of files. (Otherwise, for example, if only 99 files are being stored, no child directories should be created).

My solution is inspired by binary heap storage, where node i's parent is at index (i - 1) / 2. In this case it's n-ary heap storage, where 'n' is the number of directories per node.

In deciding how many directories/files should be stored in each node (summing to 100), I tried all combinations for a large number of files (test code included above). The ideal distribution is always 49 directories and 51 files. I'd be interested if someone can give a mathematical reason for this.
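
For readers who don't follow the link, here is a rough sketch of how the heap-style indexing could work. The 49/51 split and the d/f naming convention are assumed for illustration; the actual entry is at the Pygments URL above.

```python
import os

DIRS_PER_NODE = 49    # assumed split: 49 subdirectories...
FILES_PER_NODE = 51   # ...plus 51 files = 100 entries per directory

def make_path(n):
    # Which directory-node holds file n, and which of its 51 file slots it uses.
    node, slot = divmod(n, FILES_PER_NODE)
    parts = ["f%02d.dat" % slot]
    # Walk from that node up to the root using k-ary heap indexing:
    # the parent of node i is (i - 1) // DIRS_PER_NODE.
    while node > 0:
        node, child = divmod(node - 1, DIRS_PER_NODE)
        parts.append("d%02d" % child)
    parts.reverse()
    return os.path.join(*parts)
```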

S. Lott

See http://pygments.org/demo/958/.

For n files per directory, we transform the entry number into a base-n number. In this case, it's base 100. Each "digit" in the base-100 number is a node on the path. Each level will have no more than 100 numbers. The structure will be as flat as possible given the number of files.

Note that low-numbered files will be in a flatter structure and later files will be in a deeper structure. Each order of magnitude base n will lead to deeper paths.

The base100.reverse() operation is only present to make the paths match the numbers in an obvious way. Removing this step will make things somewhat faster and somewhat less obvious.
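
For reference, the transformation sketched in code (the f-prefixed names are an assumption for illustration; the posted code is at the URL above):

```python
import os

def make_path(n):
    # Write n in base 100 and turn each "digit" into one path component,
    # most significant digit first.
    base100 = []
    while True:
        n, digit = divmod(n, 100)
        base100.append(digit)
        if n == 0:
            break
    base100.reverse()   # the reordering described above as optional
    return os.path.join(*["f%02d" % d for d in base100])
```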

Adam Collard

http://pygments.org/demo/959/

Convert the number into path segments by successive powers of 50 (i.e. the number converted to base 50).

Each directory will contain at most 50 files and 50 subdirectories.
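
One possible reading of this, sketched in code (the d/f naming and the extension are assumed for illustration; the entry itself is at the link above): the last base-50 digit picks one of 50 file slots, and every remaining digit picks one of 50 subdirectories.

```python
import os

def make_path(n):
    # The last base-50 digit selects one of 50 file slots in the final directory.
    n, file_slot = divmod(n, 50)
    parts = []
    # Each remaining base-50 digit selects one of 50 subdirectories per level.
    while n:
        n, d = divmod(n, 50)
        parts.append("d%02d" % d)
    parts.reverse()
    parts.append("f%02d.dat" % file_slot)
    return os.path.join(*parts)
```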

Bill Mill

S. Lott, your code requires the user to move files before they're overwritten? For example, the first file, initially named f01, needs to be moved before it's overwritten by the 100th file which creates a directory named f01; to where should it be moved?

Bill Mill

s/first/second/ in my comment

Χρήστος Γεωργίου

I would go with the above (50 subdirs / 50 files, although the 49 / 51 ratio sounds intriguing), but I would use the more… uh… familiar 2**7 base, and a 64/64 (or 63/65) ratio. I know, it's Python, it doesn't matter if we do divmod(n, 50) or ((n & 0x7f); n >>= 7), but still… old habits.

S. Lott

Excellent point, Bill Mill. Thanks.

path = ["d%02d" % d for d in base100[:-1]] + ["f%02d" % base100[-1]]
return os.path.join(*path)

The directories must be distinguished from the files.

Mike

Is it permissible for the make_path function to restructure the directory structure for the previously added files? For example, if the directory is getting unbalanced, can files be moved around to rebalance?

Will

Mike, actually no. Think of make_path like a one-way hash function. The output for each value of n should never change. I'll update the post to clarify.

Rodney

S. Lott - aren't you assuming a constant ratio of files to subdirectories at each directory level? Suppose you use a higher ratio of subdirectories within a directory at shallower depths in the path and decrease this ratio as you go deeper into the directory structure.

Craig Smith

How big are the files? I finally read the Jim Gray paper "To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem" which concludes:

"As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors."

So if your files are smaller than 256K, do consider an SQL datastore. If your files are bigger than a megabyte, good luck with the filesystem hashing. If the files are between those sizes, then the nature of the files is important.

Rodney

This version varies the number of files in a directory as a function of directory depth.

Rodney

http://pygments.org/demo/963/

Let's try to get the pygments.org link into the post this time.

S. Lott

Rodney asks "S. Lott - aren’t you assuming a constant ratio of files/subdirectories for each directory level?"

The requirements are "...the directory structures should be as flat as possible, i.e. foo/bar for the nth item is preferable over foo/bar/baz."

So, I'm not assuming -- I'm forcing. I ensure that the files are packed as densely as possible. Symmetry of the distribution doesn't enter into it, AFAIK. I'll create the densest possible packing of files (100 per directory) for 100 directories before going to another level of the path.

That's why it looks (to me, anyway) like a simple transformation of the file number (n) to a path/file# based strictly on a base-100 representation of n.

Alex

S. Lott, I think you are ending up with 200 files and directories per directory, especially when you say "I'll create the densest possible packing of files (100 per directory) for 100 directories before going to another level of the path." If you add 100 files to a directory, that directory cannot sustain another path level below it.

Rodney

Consider this: if you have a total of 195 files to place in your tree, have the topmost directory contain 99 regular files and 1 directory. The subdirectory is organized to contain at least 96 regular files and at least 1 directory. Now your 195 files can be split into 99 in the top directory and 96 in the subdirectory. (Total path length = 1*99 + 2*96.)

Now suppose you've got 50,000 files to place in your tree. It turns out that, for the shortest average path, you'll want to place something like 100 directories at your top level, and in each of those subdirectories you'll want 95 regular files and 5 subdirectories. That gives you 9500 files accessible at the 2nd level, and up to 95*5*99 = 47025 (> 40500 = 50000 - 9500) files accessible at the 3rd level. (Total path length = 1*0 + 2*9500 + 3*40500.) If I were to place 96 or more files in each 2nd-level subdirectory, I wouldn't be able to get all 50,000 files into the top 3 levels. You'll need at least 85 directories at the top level to ensure you can access all files within the first 3 levels, from what I've figured.

If you were to use this second layout to contain the 195 files from the first situation, the total path length would be 1*0 + 2*195 = 390, which is greater than the 291 = 1*99 + 2*96 achieved using the layout designed for 195 files.

So really, in order to achieve the densest packing for n files, the number of subdirectories in a directory is going to be a function of n.

Now, if you really don't know how many files your tree has to contain up front, my suggestion is to start with a high percentage of directories at the top level, and decrease this percentage by a factor at each level, with a minimum of two or three directories at the lower levels. This gives a tree that broadens quickly at the top but is able to contain large numbers of files at depths of 3 or 4. It can also contain several thousand files in the first two levels.
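
A quick sanity check of the 195-file totals above, as a throwaway snippet (the layouts are hard-coded from the description; none of this is part of an entry):

```python
# Layout tuned for 195 files: 99 files at depth 1 plus 96 files at depth 2.
layout_a = 1 * 99 + 2 * 96       # total path length: 291

# Layout tuned for ~50,000 files: 100 directories at the top level,
# so all 195 files land at depth 2.
layout_b = 1 * 0 + 2 * 195       # total path length: 390

print("195-file layout: %d, 50000-file layout: %d" % (layout_a, layout_b))
```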

sage

def make_path(n):
    dirs, file = divmod(n, 100)
    filenameparts = ["99", "%02d" % file]
    while dirs != 0:
        dirs, lastdir = divmod(dirs, 99)
        filenameparts.insert(0, "%02d" % lastdir)
    return os.path.join(filenameparts)

sage

Obviously, it was a mistake to put code directly into the comments, so here is a link...

http://pastebin.com/f49417074

I'll skip the mathematical argument for why I consider 99 dirs / 1 file better than every other split. I'll just point out that I replaced that single file slot with a directory containing 100 files, which gives a x100 capacity bonus for a given max depth; you lose one directory level for the first 100 files, but that cost is quickly repaid by the x100 bonus. (I hope at least one person will understand this...)

max depth 0 : 0 files can be stored
max depth 1 : 100 files can be stored
max depth 2 : 9900 files can be stored
max depth 3 : 980100 files can be stored

max depth n : 100*(99^(n-1)) files can be stored

Note: this can be optimized, because the "00" directory is never used at the root, only in subdirs. I could optimize it on demand, but I prefer this non-optimized solution, which I consider easier to understand than the one I could have posted. (With that optimization, max depth n: 10000*(99^(n-2)) files can be stored.)

Note bis: I didn't read any of the other solutions, so this comment may already be superseded by earlier and better ones.
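
These figures follow from the non-optimized formula; a quick check:

```python
# Files storable at each max depth d, with 99 subdirs plus one special
# 100-file directory per level: 100 * 99 ** (d - 1).
for d in range(1, 5):
    print(d, 100 * 99 ** (d - 1))   # 100, 9900, 980100, 97029900
```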

sage

Ok, I just forgot a "*" in the os.path.join call...

http://pastebin.com/f644a29c5 is the right solution...

S.Lott

Aha.

I think I finally get it. It's impossible to *maximally* pack the files in. Thanks for all the thoughtful comments on my code samples; I wasn't seeing the issue very clearly. There has to be a compromise somewhere: either space has to be left somewhere (and non-optimal space at that), or the upper limit has to be known so that the wasted space can be minimized.

The flattest, most densely packed structure of n files, packed at 100 per directory, has n files, n/100 leaf directory entries, (n + n/100)/100 second-level directory entries, and so on up through log100(n + n/100) levels of directory entries. Since you don't know n in advance, you don't know how large n/100 or log100(n) is. It could be 1. Or, if the number of files is 1 million, you have 10,000 leaf-level directory entries for the 1,000,000 files. You'll have 100 second-level directory entries, each of which is full of leaf-level directory names.
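
The million-file arithmetic, spelled out:

```python
n = 1000000
leaf_dirs = n // 100              # 10000 leaf directories of 100 files each
second_level = leaf_dirs // 100   # 100 second-level directories of leaf dirs
print(leaf_dirs, second_level)    # 10000 100
```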

If there was some kind of order-of-magnitude cap, it would be easy to allocate the right number of levels and allow enough space for the directory entries.

For a *maximally* dense packing, the structure has to be filled depth-first. The paths can't start out short and then get longer. For the million-file case, we don't want to put any actual files in the top directories -- if we waste top-level directories on files, we'll need more subdirectories, making the structure not as flat as possible.

If we fill from the top down to keep the first few paths short, we have to guess about the number of subdirectory slots to reserve for later. No matter what we guess, it can't lead to *maximal* flatness.