January 15, 2019 · Will McGugan

PyFilesystem is greater than or equal to Pathlib

I was reading a post by Trey Hunner on why pathlib is great, where he makes the case that pathlib is a better choice than the standard library alternatives that preceded it. I wouldn't actually disagree with a word of it. He's entirely correct. You should probably be using pathlib where it fits.

Personally, however, I rarely use pathlib, because I find that for the most part, PyFilesystem is a better choice. I'd like to take some of the code examples from Trey's post and re-write them using PyFilesystem, just so we can compare.

Create a folder, move a file

The first example from Trey's post creates a folder, then moves a file into it. Here it is:

from pathlib import Path

Path('src/__pypackages__').mkdir(parents=True, exist_ok=True)
Path('.editorconfig').rename('src/.editorconfig')

The code above is straightforward, and it hides the gory platform details, which is a major benefit of pathlib over os.path.

The PyFilesystem version also does this, and the code is remarkably similar:

from fs import open_fs

with open_fs('.') as cwd:
    cwd.makedirs('src/__pypackages__', recreate=True)
    cwd.move('.editorconfig', 'src/.editorconfig')

The two lines that do the work are quite similar -- you can probably figure them out without looking at the docs. The first line of non-import code may need some explanation, though. In PyFilesystem the abstraction is not a path but a directory: open_fs('.') returns an FS object for the current working directory, and it's this object which contains the methods for making directories, moving files, and so on.
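
Because the FS object is the abstraction, the same calls work against filesystems other than the OS one. As a rough sketch (not from Trey's post), the 'mem://' opener below gives an in-memory filesystem, but the directory-level API is identical:

from fs import open_fs

# the same calls as above, backed by an in-memory filesystem
with open_fs('mem://') as mem_fs:
    mem_fs.makedirs('src/__pypackages__', recreate=True)
    print(mem_fs.listdir('src'))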

Create a directory if it doesn't already exist, write a blank file

This next example from Trey's post, creates a directory then creates an empty file if it doesn't already exist:

from pathlib import Path


def make_editorconfig(dir_path):
    """Create .editorconfig file in given directory and return filepath."""
    path = Path(dir_path, '.editorconfig')
    if not path.exists():
        path.parent.mkdir(exist_ok=True, parents=True)
        path.touch()
    return path

This function is tricky to compare, as it does things you might not consider doing in a project built with PyFilesystem, but if I were to translate it literally, it would be something like the following:

from fs import open_fs


def make_editorconfig(dir_path):
    """Create .editorconfig file in given directory and return filepath."""
    with open_fs(dir_path, create=True) as fs:
        fs.touch(".editorconfig")
        return fs.getsyspath(".editorconfig")

The reason you wouldn't write this code with PyFilesystem is that you rarely need to pass around paths. You typically pass around FS objects which represent a subdirectory. It's perhaps not the best example to demonstrate this, but the PyFilesystem code would more likely look like the following:

def make_editorconfig(directory_fs):
    directory_fs.create(".editorconfig")

with open_fs("foo", create=True) as directory_fs:
    make_editorconfig(directory_fs)

Rather than a str or a Path object, the function expects an FS object. An advantage of this is that file and directory operations are sandboxed under that directory, unlike the pathlib version, which has access to the entire filesystem. For a trivial example this won't matter, but in more complex code it can prevent you from unintentionally deleting or overwriting files if there is a bug.
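
As a rough illustration of that sandboxing (a sketch, not from Trey's post): opendir returns a SubFS object rooted at a subdirectory, and any code you hand it to can only see paths beneath that root.

from fs import open_fs

with open_fs('.') as project_fs:
    project_fs.makedirs('src', recreate=True)
    # opendir returns a SubFS rooted at 'src'
    src_fs = project_fs.opendir('src')
    src_fs.create('.editorconfig')
    # '/' here is the root of the sub-filesystem, i.e. 'src' itself,
    # so this code has no way to write outside that directory
    print(src_fs.listdir('/'))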

Counting files by extension

Next up, we have a short script which counts the Python files under the current directory using pathlib:

from pathlib import Path


extension = '.py'
count = 0
for filename in Path.cwd().rglob(f'*{extension}'):
    count += 1
print(f"{count} Python files found")

Nice and simple. PyFilesystem has glob functionality (although no rglob yet). The code looks quite similar:

from fs import open_fs

extension = '.py'

with open_fs('.') as fs:
    count = fs.glob(f"**/*{extension}").count().files
print(f"{count} Python files found")

There's no for loop in the code above, because there is built-in file counting functionality, but otherwise it is much the same.
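
If you'd rather keep an explicit loop, the walker can filter by wildcard as well. A rough equivalent (counting matching files only) might look like this:

from fs import open_fs

extension = '.py'

with open_fs('.') as fs:
    # walk.files visits files recursively; filter takes glob-style patterns
    count = sum(1 for _ in fs.walk.files(filter=[f'*{extension}']))
print(f"{count} Python files found")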

I think Trey was using this example to compare performance. I haven't actually compared performance of PyFilesystem's globbing versus os.path or pathlib. That could be the subject for another post.

Write a file to the terminal if it exists

The next example is a simple one for both pathlib and PyFilesystem. Here's the pathlib version:

from pathlib import Path
import sys


directory = Path(sys.argv[1])
ignore_path = directory / '.gitignore'
if ignore_path.is_file():
    print(ignore_path.read_text(), end='')

And here's the PyFilesystem equivalent:

import sys
from fs import open_fs


with open_fs(sys.argv[1]) as fs:
    if fs.isfile(".gitignore"):
        print(fs.readtext('.gitignore'), end='')

Note that there's no equivalent of directory / '.gitignore'. You don't need to join paths in PyFilesystem as often, but when you do, you don't need to worry about platform details. All paths in PyFilesystem are a sort of idealized path with a common format.
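
When you do need to build or pick apart paths, the fs.path module works on these idealized forward-slash paths, whatever the host OS. A quick sketch:

from fs import path

# fs.path functions always use '/' as the separator
config_path = path.join('src', '.editorconfig')
print(config_path)                  # src/.editorconfig
print(path.dirname(config_path))    # src
print(path.basename(config_path))   # .editorconfig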

Finding duplicates

Trey offered a fully working script to find duplicates in a subdirectory with and without pathlib. Coincidentally I'd recently added a similar example to PyFilesystem.

Here is Trey's pathlib version:

from collections import defaultdict
from hashlib import md5
from pathlib import Path


def find_files(filepath):
    for path in Path(filepath).rglob('*'):
        if path.is_file():
            yield path


file_hashes = defaultdict(list)
for path in find_files(Path.cwd()):
    file_hash = md5(path.read_bytes()).hexdigest()
    file_hashes[file_hash].append(path)

for paths in file_hashes.values():
    if len(paths) > 1:
        print("Duplicate files found:")
        print(*paths, sep='\n')

And here we have equivalent functionality with PyFilesystem:

from collections import defaultdict
from hashlib import md5
from fs import open_fs

file_hashes = defaultdict(list)
with open_fs('.') as fs:
    for path in fs.walk.files():
        file_hash = md5(fs.readbytes(path)).hexdigest()
        file_hashes[file_hash].append(path)

for paths in file_hashes.values():
    if len(paths) > 1:
        print("Duplicate files found:")
        print(*paths, sep='\n')

The PyFilesystem version compares quite favourably here (in terms of lines of code, at least), mostly because there was already a built-in method for iterating over file paths.

Conclusion

First off, I would like to emphasise that I'm not suggesting you never use pathlib. It is better than the alternatives in the standard library. Pathlib also has the advantage that it is actually in the standard library, whereas PyFilesystem is a pip install fs away.

I think PyFilesystem results in cleaner code for the most part, though that could just be down to the fact that I've been working with it for a lot longer and it 'fits my brain' better. I'll let you be the judge. Also note that, as the primary author of PyFilesystem, I obviously bring a bucket-load of bias to this.

There is one area where I think PyFilesystem is a clear winner: the PyFilesystem code above would work virtually unaltered with files in an archive, in memory, on an FTP server, on S3, or on any of the other supported filesystems.
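
To make that concrete, here's a sketch that wraps the duplicate-finding loop in a function taking any FS object, then points it at the working directory and at a (hypothetical) zip archive; only the open_fs call changes:

from collections import defaultdict
from hashlib import md5
from fs import open_fs


def find_duplicate_hashes(fs):
    # group file paths by MD5 hash; works on any FS object
    file_hashes = defaultdict(list)
    for path in fs.walk.files():
        file_hashes[md5(fs.readbytes(path)).hexdigest()].append(path)
    return file_hashes


with open_fs('.') as fs:
    hashes = find_duplicate_hashes(fs)

# 'archive.zip' is a hypothetical archive; the function is unchanged
with open_fs('zip://archive.zip') as zip_fs:
    zip_hashes = find_duplicate_hashes(zip_fs)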

I'd like to apologise to Trey Hunner if I misrepresented anything he said in his post!

Nick

"The PyFilesystem code above would work virtually unaltered with files in an archive, in memory, on a ftp server, S3 etc. or any of the supported filesystems."

That's the winner.

Evgeny Kovalchuk

It's nice to have convenience methods like "count" or "files" right out of the box, and I hope that one day pathlib will have them too.

Counting files by extension

I agree, it is not cool to do this thing with pathlib. But there's a shorter way (not sure if it's better though).

Instead of

extension = '.py'
count = 0
for filename in Path.cwd().rglob(f'*{extension}'):
    count += 1
print(f"{count} Python files found")

we can use

extension = '.py'
count = sum(1 for _ in Path('.').rglob(f'*{extension}'))
print(f"{count} Python files found")

Finding duplicates

Well, it's fair to say that

def find_files(filepath):
    for path in Path(filepath).rglob('*'):
        if path.is_file():
            yield path

file_hashes = defaultdict(list)
for path in find_files(Path.cwd()):
    file_hash = md5(path.read_bytes()).hexdigest()
    file_hashes[file_hash].append(path)

can be replaced with

file_hashes = defaultdict(list)
for path in filter(Path.is_file, Path.cwd().rglob('*')):
    file_hash = md5(path.read_bytes()).hexdigest()
    file_hashes[file_hash].append(path)

Also, I am not a big fan of Path.cwd(). I still prefer Path('.') in cases when I need to chain something else onto it.

Write a file to the terminal if it exists

Hm-m, I imagine how pathlib's code would look with Python 3.8's PEP 572. Will this work?

with Path(sys.argv[1]) as fs:
    if (ignore_path := fs / '.gitignore').is_file():
        print(ignore_path.read_text(), end='')

By the way, notice that pathlib can be used with context managers as well :)

Pawel

So basically, because pathlib outputs paths as strings when repr is called, you thought that it actually passes strings around and not Path objects, and you didn't bother to check if pathlib has context managers, which of course it does, so you can use it the same way as you use PyFilesystem.

Paul Prescod

Have you considered implementing a more "object oriented" API for PyFilesystem where it would be:

ignore_path.is_file()

Instead of:

fs.isfile(".gitignore")

I was trying to write some code which would automatically handle paths from Pathlib and PyFilesystem and it was unclear how to do that. I was hoping that if my code was handed an object with a method called "open" I could just use that to open the object as a stream, but I don't think that PyFilesystem has such an object.

The truth in 2020 is that Pathlib defines the interface for how to work with path-like objects. Incompatible libraries are swimming upstream.

Will McGugan

@Paul PyFilesystem has a different abstraction than pathlib, but it is just as object oriented. I would argue that is_file is not something you ask of a path; it is something you ask of the filesystem. By abstracting the filesystem you gain a lot of freedom about how and where the data is stored, whereas the pathlib approach is tied to the OS filesystem.

Bear in mind that PyFilesystem pre-dates Pathlib by nearly 6 years. One doesn't preclude the other.

Paul Prescod

Thank you for the response Will. And thank you for PyFilesystem, which I consider a brilliant and important innovation. I am only contributing because I would like to see it be more widely adopted.

By "object oriented" I mean that a single object includes both the address and all information needed to "dereference" (i.e. open the file).

In the pathlib model, one variable contains what must be represented as a tuple in PyFilesystem.

In the majority of cases that I am accustomed to, there is no additional benefit to keeping the two pieces of information separate.

Let's imagine for the sake of argument that I want to take the entire body of this block and turn it into a function:

with open_fs('.') as fs:
    for path in fs.walk.files():
        file_hash = md5(fs.readbytes(path)).hexdigest()
        file_hashes[file_hash].append(path)

The signature for that function must take two arguments, fs and path. Why? Why not a single argument that represents a "file address"?

def doit(fs, path):
    for path in fs.walk.files(path):
        file_hash = md5(fs.readbytes(path)).hexdigest()
        file_hashes[file_hash].append(path)

with open_fs('.') as fs:
    doit(fs, '/')

In my opinion, keeping the information together is better. But moving beyond the realm of taste and opinion: pathlib is part of the standard library. There now exists a stdlib "interface" for file-address objects, just as there exists an "interface" for open file objects, for sequences, for web server gateways, etc.

PyFilesystem could keep its own API but it would be good if there were an adapter which allowed it to conform to the interface so that people could write code which would support both Pathlib and PyFilesystem at the same time.

Will McGugan

If you have an operation that works on a directory, then you only need to pass in one thing: you can open a subdirectory with opendir and pass that, no need to pass a path. If you have an operation that works on data, it's better to pass in an open file-like object -- which gives you the freedom to locate your data anywhere: an OS file, a memory file, the network, etc.
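
As a rough sketch of the "pass an open file, not a path" idea (the file name below is just an example):

from fs import open_fs

def count_lines(file_obj):
    # works with any file-like object, wherever the bytes actually live
    return sum(1 for _ in file_obj)

with open_fs('.') as fs:
    with fs.open('setup.py') as f:   # 'setup.py' is just an example path
        print(count_lines(f))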

"pathlib is part of the standard library"

There exist numerous third-party libraries that compete with the standard library and are generally preferred. In fact it's often said that the standard library is "where modules go to die".

That said, of course I would want compatibility with the standard library if possible, but a pathlib adapter for PyFilesystem is essentially unworkable. Pathlib represents a standard for OS paths and operations. There is much of pathlib that has no analogue in PyFilesystem, and vice versa. What should home() and cwd() do for a memory-based filesystem or a zipfile? Path objects can be converted to a str, but a PyFilesystem Path object may not be able to return a path that is understood by the OS. There are a host of other issues.

It just wouldn't be possible to provide a PyFilesystem Path object that would work everywhere a regular Path object is expected. It might work some of the time, but you could never be certain that a function that works with a regular Path object wouldn't break with a PyFilesystem adapter path.

So it's not that I haven't considered it; it's been discussed many times. If you do need a Path for a location on a PyFilesystem, you can call getsyspath to get the OS path, if one exists.

Wolfgang Goritschnig

Thanks for that lib - it's a pearl!

Will McGugan

Glad you like it, Wolfgang!

Evgeny Kovalchuk

2 years later, I am here again! Now I am appreciating much more how seamlessly PyFilesystem works with zip-archives! It is just SO CONVENIENT! Thank you for this!

Will McGugan

😁👍