January 13, 2019

Filesystem Magic with Python

I recently added a number of examples showing how to work with files and directories using Python and PyFilesystem.

Counting Bytes of Python

The first example is count_py.py, which recursively sums the number of bytes of Python source stored in a directory and renders the total in a "friendly" format.

import sys

from fs import open_fs
from fs.filesize import traditional


fs_url = sys.argv[1]
count = 0

with open_fs(fs_url) as fs:
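    # Request the "details" namespace so that info.size is available.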
    for _path, info in fs.walk.info(filter=["*.py"], namespaces=["details"]):
        count += info.size

print(f'There is {traditional(count)} of Python in "{fs_url}"')

This script takes a path or an FS URL. For instance, python3 count_py.py ~/projects will sum up the bytes of Python in your projects folder.

For me, this gives the following output:

There is 227.1 MB of Python in "/Users/will/projects/"

This script accepts a path, but it will also work with any supported filesystem, so you could just as easily sum the bytes of Python in an archive or on a cloud server.
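For example, assuming you have a local archive called projects.zip, the following should report the bytes of Python inside the archive (the zip:// opener is part of the core fs library):

python3 count_py.py zip://projects.zip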

Finding Duplicates

Next up is find_dups.py, which recursively finds duplicate files (even when the filenames differ). It does this by calculating the md5 hash of every file and looking for files with matching hashes. It is possible for two different files to have the same hash, but the chances are vanishingly small.

As always, this script works with local files, archives, or network filesystems like S3 or Google Cloud Storage.


from collections import defaultdict
import hashlib
import sys

from fs import open_fs


def get_hash(bin_file):
    """Get the md5 hash of a file."""
    file_hash = hashlib.md5()
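    # Read in 1 MB chunks so large files never need to fit in memory.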
    while True:
        chunk = bin_file.read(1024 * 1024)
        if not chunk:
            break
        file_hash.update(chunk)
    return file_hash.hexdigest()


# Map each md5 hash to a list of the paths that produced it.
hashes = defaultdict(list)
with open_fs(sys.argv[1]) as fs:
    for path in fs.walk.files():
        with fs.open(path, "rb") as bin_file:
            file_hash = get_hash(bin_file)
        hashes[file_hash].append(path)

for paths in hashes.values():
    if len(paths) > 1:
        for path in paths:
            print(f" {path}")
        print()

This could be made a little smarter. Rather than calculating hashes for every file (which is comparatively slow), it could first compare file sizes, since a file with a unique size can't have a duplicate. I've left that as an 'exercise for the reader'!
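If you'd like a head start on that exercise, here's a minimal, untested sketch of the idea: group paths by file size first (a cheap metadata lookup), then hash only the files that share a size with at least one other file. The get_hash function is repeated from above so the script is self-contained.

from collections import defaultdict
import hashlib
import sys

from fs import open_fs


def get_hash(bin_file):
    """Get the md5 hash of a file."""
    file_hash = hashlib.md5()
    while True:
        chunk = bin_file.read(1024 * 1024)
        if not chunk:
            break
        file_hash.update(chunk)
    return file_hash.hexdigest()


sizes = defaultdict(list)
hashes = defaultdict(list)
with open_fs(sys.argv[1]) as fs:
    # First pass: group paths by file size.
    for path, info in fs.walk.info(namespaces=["details"]):
        if info.is_file:
            sizes[info.size].append(path)

    # Second pass: hash only files that share a size with another file.
    for paths in sizes.values():
        if len(paths) > 1:
            for path in paths:
                with fs.open(path, "rb") as bin_file:
                    hashes[get_hash(bin_file)].append(path)

for paths in hashes.values():
    if len(paths) > 1:
        for path in paths:
            print(f" {path}")
        print()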

Notice the get_hash function: it will work with a file opened on your local drive, in an archive, or streamed over the network. PyFilesystem hides the gory details from you, so you can write generic code.
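For instance, the same function could hash a file streamed from an FTP server, with no changes (using a hypothetical host and path):

with open_fs("ftp://ftp.example.org") as remote_fs:
    with remote_fs.open("uploads/starwars.mov", "rb") as bin_file:
        print(get_hash(bin_file))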

Uploading files from your local drive

This next script is upload.py, which opens a local file and uploads it to a server.

import os
import sys

from fs import open_fs

_, file_path, fs_url = sys.argv
filename = os.path.basename(file_path)

with open_fs(fs_url) as fs:
    if fs.exists(filename):
        print("destination exists! aborting.")
    else:
        # Copy the local file to the destination filesystem.
        with open(file_path, "rb") as bin_file:
            fs.upload(filename, bin_file)
        print("upload successful!")

Here's how you might upload a file to an FTP server:

python3 upload.py starwars.mov ftp://ftp.example.org/uploads

Or to Amazon S3 (this requires the fs-s3fs extension to be installed):

python3 upload.py starwars.mov s3://mybucket/movies/

Deleting your .pyc files

Finally, rm_pyc.py is a super simple script that recursively deletes .pyc files.

import sys

from fs import open_fs


with open_fs(sys.argv[1]) as fs:
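    # Globber.remove() deletes every matching file and returns the number removed.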
    count = fs.glob("**/*.pyc").remove()
print(f"{count} .pyc files removes")

More examples

I'll be adding more scripts to the "examples/" directory in the future. Let me know if you have any ideas or requests for new examples.
