Many desktop computers have gigabytes of free disk space, and in a local network this storage often goes to waste, especially if there is a central file server, too. Why can't we use the space already available on hard drives to construct one distributed file system? MogileFS comes closest, but it is not fully distributed, since it relies on a central MySQL database for the index. The ultimate solution would be a system that scales from three computers in a flat share to thousands of desktops in a global company. Let's think about a fully distributed file system.

There are already distributed key-value databases (keystores for short) and we understand them reasonably well, so it makes sense to build a file system on top of them. On the API level these databases look like a hash map (a dict in Python). The two basic operations are to put a (key, value) pair into the store and to get the value identified by a key. Additionally, deleting a key and testing whether a key is present may be supported. So what does it take to implement a file system on a dict?
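
To make the interface concrete, here is a minimal sketch of such a keystore as a thin wrapper around a Python dict; the class name is invented for illustration, but the four methods match the operations named above and reappear later as the API (put, get, delete, contains).

    class KeyStore:
        """A minimal keystore: the dict API our file system builds on.

        Any backend offering these four methods can be swapped in.
        """

        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data[key]

        def delete(self, key):
            del self._data[key]

        def contains(self, key):
            return key in self._data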

Hierarchy

The first naive idea: every file has a unique path in the file system, so every path is a key and the file content is the value. The problem with this approach is that listing the files of a directory is expensive, because a keystore cannot efficiently enumerate all keys that start with a given path prefix. So directory listings must be stored explicitly, and file contents need a key that is independent of the path. The SHA256 checksum of the file contents should be unique enough, so let's use it as the key for file contents.
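
As a sketch, assuming values are bytes and the keystore API from above, storing content under its own checksum takes only a few lines (put_content is a made-up helper, not from the kvfs code):

    import hashlib

    def put_content(store, content):
        """Store file content (bytes) under its SHA256 checksum."""
        key = hashlib.sha256(content).hexdigest()
        store.put(key, content)
        return key

A pleasant side effect of content addressing: identical file contents map to the same key, so duplicates are stored only once.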

A directory can be seen as a special file which contains a list of (name, key) pairs. Each name describes a file or subdirectory that can be retrieved from the keystore using the associated key. To store metadata about files and directories, we simply extend the pairs to (name, key, metadata) tuples. To store this list we use the same mechanism as for files: the checksum of the content is the key. A value prefix identifies a value as file or directory, so nested directories, and thus a hierarchy, are no problem.
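
A hedged sketch of storing such a directory listing, reusing the keystore from above; the 'D' prefix byte and the JSON serialization are assumptions for illustration (the real kvfs encoding may differ, and the earlier put_content sketch omitted the prefix for brevity):

    import hashlib
    import json

    def put_directory(store, entries):
        """Store a list of (name, key, metadata) tuples as a directory.

        The b'D' prefix marks the value as a directory; file contents
        would carry a b'F' prefix instead.
        """
        value = b'D' + json.dumps(entries).encode()
        key = hashlib.sha256(value).hexdigest()
        store.put(key, value)
        return key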

Since every change of a value changes its key, the data is effectively immutable (as long as every participant behaves correctly). This also means that every change to a file requires a change to every parent directory up to the root. Since the root key also changes every time, we store a special pair ('root', key-of-current-root-dir) in the database. Thus every change to the file system boils down to an atomic exchange of this special root entry.
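
To make the cascading update concrete, here is a sketch reusing put_content and put_directory from above; it assumes all parent directories already exist and that a 'root' entry was initialized with an empty directory:

    import json

    def read_directory(store, parts):
        """Resolve path components to a directory listing, starting at root."""
        key = store.get('root')
        for name in parts:
            entries = json.loads(store.get(key)[1:])  # strip the b'D' prefix
            key = next(k for n, k, meta in entries if n == name)
        return json.loads(store.get(key)[1:])

    def write_file(store, path, content):
        """Write content at path, rewriting every directory up to the root."""
        parts = path.strip('/').split('/')
        key = put_content(store, content)
        for depth in range(len(parts) - 1, -1, -1):
            entries = read_directory(store, parts[:depth])
            # Replace the entry for this path component with the new key.
            entries = [e for e in entries if e[0] != parts[depth]]
            entries.append([parts[depth], key, {}])  # metadata handling omitted
            key = put_directory(store, entries)
        store.put('root', key)  # the atomic exchange of the root entry

For example, after store.put('root', put_directory(store, [])), a call to write_file(store, '/readme.txt', b'hello') rewrites one directory and swaps the root pointer in a single put.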

File system

An additional layer can now easily provide file system properties. The metadata needs structure, for example to hold permissions. Additional functionality like symbolic links or file renaming can be implemented on this level, as can reading or writing parts of files. File systems also manage side effects: writing to a file changes its metadata, because the modification time is updated. And finally, error codes need to be issued according to the specification of the operating system.
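
For instance, the write path of this layer might refresh the modification time; treating st_mtime as a metadata field is an assumption of this sketch:

    import time

    def touched(metadata):
        """Return a copy of the metadata with the modification time updated.

        A copy is needed because stored values are immutable: the new
        metadata ends up in a new directory tuple under a new key.
        """
        metadata = dict(metadata)
        metadata['st_mtime'] = time.time()
        return metadata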

On Linux and OS X it is possible to implement a file system in user space with Fuse. Another layer can translate Fuse calls to our file system's API. Some additional work is needed to fake features our file system doesn't support so far. For example, user and group ids (uid and gid attributes) could be stored as metadata, but that data is only as safe as the place where it is stored. The file system itself can't keep it safe, since anybody may connect to the keystore. Thus our file system shouldn't pretend to offer permission and authentication features. Since Fuse needs these attributes anyway, they can be set to global default values.
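
A sketch of that translation layer using the fusepy bindings (an assumption; kvfs may use different Fuse bindings). Here getattr reports the current user's uid and gid as global defaults instead of pretending per-file ownership; self.fs.meta is a hypothetical lookup into our file system layer:

    import os
    import stat
    from fuse import Operations  # fusepy

    class KVFuse(Operations):
        def __init__(self, fs):
            self.fs = fs  # the file system layer on top of the keystore

        def getattr(self, path, fh=None):
            meta = self.fs.meta(path)  # hypothetical metadata lookup
            return {
                'st_mode': stat.S_IFREG | 0o644,  # directories need S_IFDIR
                'st_nlink': 1,
                'st_size': meta.get('size', 0),
                'st_mtime': meta.get('st_mtime', 0),
                'st_uid': os.getuid(),  # global default, not stored per file
                'st_gid': os.getgid(),
            }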

Dict substitutions

This file system can be initialized with a dict and provides an in-memory file system via Fuse. It is also possible to use other objects, as long as they provide the same API (put, get, delete, contains). One could reuse this file system and just implement a dict, because the dict API is much easier to implement than the badly documented Fuse API.

Berkeley DB is a dict on disk, so the data is persistent. Another possibility is Scalaris, a distributed in-memory keystore, and with this sleight of hand we have a fully distributed file system.
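
A sketch of such a persistent backend; Python's dbm module stands in for the BerkeleyDB bindings here, since it exposes the same dict-like behaviour in a self-contained example:

    import dbm

    class DiskStore:
        """The same four-method API as before, but persistent on disk."""

        def __init__(self, filename):
            self._db = dbm.open(filename, 'c')  # create the file if missing

        def put(self, key, value):
            self._db[key] = value

        def get(self, key):
            return self._db[key]

        def delete(self, key):
            del self._db[key]

        def contains(self, key):
            return key in self._db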

A Python implementation of this file system can be found at http://github.com/beza1e1/kvfs (this version at the time of this writing). It can be mounted via Fuse and works with dict, BerkeleyDB and Scalaris.

Testing other distributed keystores like MemCached, Voldemort, Cassandra or LightCloud would also be interesting. Sadly, there is no mature solution available.

The distributed stores out there [are] currently pretty half-baked at best right now. Your comfort-level running in prod may vary, but for most sane people, I doubt you’d want to. ― Leonard Lin

Problems and ideas

The current implementation using Scalaris is quite slow. I can think of a lot of tricks like caching to improve performance, but the architecture itself could be revised. An additional "inode" type (apart from file and directory) could introduce a layer of indirection between the list of files and the content-addressable data. The file system becomes stateful by using file handles that reference inodes, and (hopefully) faster, because the path hierarchy isn't traversed on every request. Files could also be partitioned via inodes, which should make updates faster, because less data needs to be rewritten. The downside is that the immutability of the hierarchy and atomic changes are lost.
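
A hedged sketch of the inode idea; the key layout and helper names are invented for illustration. The point is that a directory entry references a stable inode key, and only the small inode record changes on a write:

    import uuid

    def create_inode(store, content_key):
        """Create a mutable inode record pointing at immutable content."""
        inode_key = 'inode:' + uuid.uuid4().hex  # stable random key
        store.put(inode_key, content_key)
        return inode_key

    def update_inode(store, inode_key, new_content_key):
        """Repoint the inode at new content. No parent directory up to the
        root changes, which is exactly where immutability and atomicity
        of the hierarchy are lost."""
        store.put(inode_key, new_content_key)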

Another problem with the current architecture is that hard links are impossible. Let there be two files A and B, where B was created as a hard link to A. Modifying A would not change B, since the directory entry of B would still point to the old checksum key. Currently a hard link is just a (very fast) copy, which means it isn't a hard link but a bug. The inode idea above could also solve this problem, because both directory entries would reference the same mutable inode.

© 2009-06-25