Filesystem Magic with Python
I recently added a number of examples showing how to work with files and directories using Python and PyFilesystem.
Counting Bytes of Python
The first example is count_py.py, which recursively sums up the number of bytes of Python code stored in a directory and renders the total in a "friendly" format.
import sys

from fs import open_fs
from fs.filesize import traditional

fs_url = sys.argv[1]
count = 0

with open_fs(fs_url) as fs:
    for _path, info in fs.walk.info(filter=["*.py"], namespaces=["details"]):
        count += info.size

print(f'There is {traditional(count)} of Python in "{fs_url}"')
This script takes a path or an FS URL. For instance, python3 count_py.py ~/projects will sum up the bytes of Python in your projects folder.
For me, this gives the following output:
There is 227.1 MB of Python in "/Users/will/projects/"
This script accepts a path, but it will also work with any supported filesystem. So you could sum the bytes of Python in an archive or cloud server.
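For example, you could point it at a zip archive or an S3 bucket (the archive name and bucket below are just placeholders, and S3 support needs the fs-s3fs package installed):

python3 count_py.py zip://projects.zip
python3 count_py.py s3://mybucket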
Finding Duplicates
Next up is find_dups.py, which recursively finds duplicate files (even if the filenames are different). It does this by calculating the MD5 hash of every file and looking for files with matching hashes. It is possible for two different files to have the same hash, but the chances are vanishingly small.
As always, this script works with local files, archives, or network filesystems like S3 or Google Cloud Storage.
from collections import defaultdict
import hashlib
import sys

from fs import open_fs


def get_hash(bin_file):
    """Get the md5 hash of a file."""
    file_hash = hashlib.md5()
    while True:
        chunk = bin_file.read(1024 * 1024)
        if not chunk:
            break
        file_hash.update(chunk)
    return file_hash.hexdigest()


hashes = defaultdict(list)
with open_fs(sys.argv[1]) as fs:
    for path in fs.walk.files():
        with fs.open(path, "rb") as bin_file:
            file_hash = get_hash(bin_file)
            hashes[file_hash].append(path)

for paths in hashes.values():
    if len(paths) > 1:
        for path in paths:
            print(f" {path}")
        print()
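For instance, to look for duplicates in your projects folder (substitute your own path):

python3 find_dups.py ~/projects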
This could be made a little smarter. Rather than calculating hashes for every file (which is comparatively slow), it could first compare filesizes. I've left that as an 'exercise for the reader'!
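If you're curious what that might look like, here's a minimal sketch of the size-first approach (not the original script): it groups paths by file size, then hashes only the groups that contain more than one file, reading each candidate file in one go for brevity.

from collections import defaultdict
import hashlib
import sys

from fs import open_fs

# Group paths by size; only files that share a size can possibly be duplicates.
sizes = defaultdict(list)
with open_fs(sys.argv[1]) as fs:
    for path, info in fs.walk.info(namespaces=["details"]):
        if info.is_file:
            sizes[info.size].append(path)

    # Hash only the candidate groups (whole-file read, for brevity).
    hashes = defaultdict(list)
    for paths in sizes.values():
        if len(paths) > 1:
            for path in paths:
                with fs.open(path, "rb") as bin_file:
                    file_hash = hashlib.md5(bin_file.read()).hexdigest()
                hashes[file_hash].append(path)

for paths in hashes.values():
    if len(paths) > 1:
        for path in paths:
            print(f" {path}")
        print()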
Notice the get_hash function. This will work with a file opened on your local drive, in an archive, or streamed over the network. PyFilesystem hides the gory details from you, so you can write generic code.
Uploading files from your local drive
This next script is upload.py, which opens a local file and uploads it to a server.
import os
import sys

from fs import open_fs

_, file_path, fs_url = sys.argv
filename = os.path.basename(file_path)

with open_fs(fs_url) as fs:
    if fs.exists(filename):
        print("destination exists! aborting.")
    else:
        with open(file_path, "rb") as bin_file:
            fs.upload(filename, bin_file)
        print("upload successful!")
Here's how you might upload a file to an FTP server:
python3 upload.py starwars.mov ftp://ftp.example.org/uploads
Or to Amazon S3:
python3 upload.py starwars.mov s3://mybucket/movies/
Deleting your .pyc files
Finally, rm_pyc.py is a super simple script that recursively deletes .pyc files.
import sys

from fs import open_fs

with open_fs(sys.argv[1]) as fs:
    count = fs.glob("**/*.pyc").remove()

print(f"{count} .pyc files removed")
More examples
I'll be adding more scripts to the "examples/" directory in the future. Let me know if you have any ideas or requests for those...