tag(1): del.icio.us-style file tagging

I’ve been using the social bookmarking service del.icio.us for a while now, and have watched similar tagging features move from site to site as a flexible way to lightly categorize various resources: bookmarks, blog posts, photos, etc. It’s therefore a little strange that no one has written a simple utility to give you similar category construction capabilities on your typical Unix-like file system. In this post, I propose a simple, portable design, contrast my design choices against some other possibilities, and then provide an initial implementation of that design in Perl.

Tagging

If you’ve not used del.icio.us, Technorati’s tag system, or any of the Web tagging systems, then a brief introduction is worthwhile. If you’re already familiar with tagging, then this section will alert you to my personal terminology around tagging operations and the resulting metadata.

By tagging, we mean the association of one or more keywords, called tags, with an entity. The tag set is the set of tags associated with the entity, which for the purposes of this post, is always a file.

Once you’ve accumulated a set of tagged files, the next step is to examine sets of files with common tags: each such set is a query. The link set is the set of all possible queries against the set of tags. That is, the set of files tagged with the same set of tags is a member of the link set. A particular member of the link set can be calculated dynamically, or the entirety of the set can be calculated in advance.
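
As a rough illustration of these terms, here is a minimal Python sketch (not the Perl implementation) that computes one member of the link set dynamically. The dictionary standing in for the tag database, and the tags given to pause.c and tag.pl, are assumptions made for the example.

```python
# Hypothetical in-memory tag database: file path -> tag set.
# Only gumemd.pl's tags (code, umem) come from the post; the rest are assumed.
tag_sets = {
    "/home/sch/old-home/gumemd.pl": {"code", "umem"},
    "/home/sch/old-home/pause.c": {"code"},
    "/home/sch/play/tag/tag.pl": {"code", "tagging", "utility"},
}


def query(tag_sets, *tags):
    """Return the sorted files whose tag set contains every requested tag."""
    wanted = set(tags)
    return sorted(path for path, tags_on_file in tag_sets.items()
                  if wanted <= tags_on_file)


print(query(tag_sets, "code"))          # all three files
print(query(tag_sets, "code", "umem"))  # only gumemd.pl
```

Computing each query this way costs a scan of every tag set; precalculating trades that time for storage.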

Querying tags

del.icio.us appears to calculate the per-tag queries on the fly, which is acceptable as the primary interface is the web page presented to the reader. (Caching is probably involved.) We instead have a choice: we can use our command, tag(1), or we can use the filesystem itself. Our choice is accompanied by a tradeoff between time (to compute queries) and space (inodes consumed). Although, as we will see, the space consumption is significant, the filesystem approach presents an easy interface for exploration.

By default, the link set is constructed in $HOME/tag:

$ ls ~/tag/
code/        doc/         linux/       rm/          thread/      utility/
conference/  greenline/   note/        solaris/     umem/

If we look in ~/tag/code, we see

$ ls ~/tag/code/
gumemd.pl@  pause.c@    tag.pl@     tagging/    umem/       utility/

symbolic links to files with that tag, and directories representing additional tags. This directory hierarchy is the link set: all possible queries are precalculated:

$ ls ~/tag/code/umem/
gumemd.pl@
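
One way such a hierarchy could be produced is sketched below in Python. This is an illustration under assumptions, not the code of tag(1): the function names link_paths and build_links are hypothetical, and the demo builds its tree in a temporary directory rather than in $HOME/tag.

```python
import itertools
import os
import tempfile


def link_paths(tags):
    """Yield every ordered arrangement of 1..n distinct tags as a relative path."""
    for r in range(1, len(tags) + 1):
        for combo in itertools.permutations(sorted(tags), r):
            yield os.path.join(*combo)


def build_links(root, target, tags):
    """Create root/<t1>/<t2>/.../<basename> symlinks pointing at target."""
    created = []
    for rel in link_paths(tags):
        directory = os.path.join(root, rel)
        os.makedirs(directory, exist_ok=True)
        link = os.path.join(directory, os.path.basename(target))
        if not os.path.lexists(link):
            os.symlink(target, link)
        created.append(link)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "tag")
        target = os.path.join(tmp, "gumemd.pl")
        open(target, "w").close()
        for link in build_links(root, target, {"code", "umem"}):
            print(os.path.relpath(link, tmp))
```

For the two tags code and umem this creates four links: one under each tag, and one under each of the two orderings of the pair.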

If we go to the original file, tag(1) will tell us the tags on the file:

$ tag gumemd.pl
code umem

We can also query using the tag command, but it is merely walking the directory structure we constructed.

$ tag -q code
/home/sch/old-home/gumemd.pl
/home/sch/old-home/pause.c
/home/sch/play/tag/tag.pl

(This operation is relatively fast, as we would expect:

$ time tag -q code
/home/sch/old-home/gumemd.pl
/home/sch/old-home/pause.c
/home/sch/play/tag/tag.pl

real    0m0.068s
user    0m0.057s
sys     0m0.009s

Timing run performed on a 1.8 GHz Athlon64 running Solaris 10 and Perl 5.8.4.)
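
A query that merely walks the precalculated directories can be sketched as follows. This is a Python illustration rather than the actual tag -q code, and it assumes, as in the listings above, that tagged files appear as symbolic links and further tags as real subdirectories.

```python
import os
import tempfile


def query(root, *tags):
    """List the targets of the symlinks stored in root/<tag1>/<tag2>/..."""
    directory = os.path.join(root, *tags)
    results = []
    for name in sorted(os.listdir(directory)):
        entry = os.path.join(directory, name)
        if os.path.islink(entry):  # real subdirectories are further tags; skip them
            results.append(os.readlink(entry))
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "pause.c")
        open(target, "w").close()
        os.makedirs(os.path.join(tmp, "tag", "code", "umem"))
        os.symlink(target, os.path.join(tmp, "tag", "code", "pause.c"))
        print(query(os.path.join(tmp, "tag"), "code"))
```

The cost is a single directory listing plus one readlink per match, which is consistent with the short run time reported above.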

Adding tags is simple:

$ tag -a perl gumemd.pl
$ tag gumemd.pl
code umem perl
$ ls ~/tag/perl
code/       gumemd.pl@  umem/

As is deleting them:

$ tag -d perl gumemd.pl
$ ls ~/tag/perl
/home/sch/tag/perl: No such file or directory

So we have a design for basic tagging of files.

Tags and file system metadata consumption

It’s worth noting that precalculating the entire link set and storing it in a file system consumes file system metadata combinatorially, but that this consumption is capped by the (typical) limit on path length. Generally, we make M symbolic links for the n distinct tags on a single file, with M given by

$$ M(n) = \sum_{r = 1}^{n} \frac{n!}{(n - r)!} $$

That’s 325 links for 5 tags, plus as many directories again, for a total of 650 metadata units. On file systems that preallocate a fixed size pool of inodes, this kind of feature could lead to metadata exhaustion.
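
The growth is easy to tabulate. The short Python sketch below evaluates M(n) directly from the formula; the function name links_per_file is chosen here for illustration only.

```python
from math import factorial


def links_per_file(n):
    """M(n): number of symbolic links for one file carrying n distinct tags."""
    return sum(factorial(n) // factorial(n - r) for r in range(1, n + 1))


# Columns: number of tags, links, links plus an equal number of directories.
for n in range(1, 8):
    links = links_per_file(n)
    print(n, links, 2 * links)
```

For n = 5 this gives 5 + 20 + 60 + 120 + 120 = 325 links, matching the figure above.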

Tagging and file system operations

The key file system operation in our tagging system is renaming: how do we update the link and tag sets when a particular file is relocated within the filesystem? Deletion is a subset of this case; copying can be treated similarly to a set of tag-add operations on a previously untagged file.

If we just use mv(1), and have the link set and tag set stored externally to the file, then we end up with a depressing result:

[Figure: mv(1) with the link set and tag set stored outside the file; both still refer to the old location]

The link set points to the old location, and the tag set is still associated with it, so operations on these objects will fail. The file in its new location has lost all of its tag-derived metadata, and so has dropped out of our tagspace.
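
The stale link is easy to reproduce. The following Python sketch (an illustration in a temporary directory, with hypothetical file names) creates one symbolic link, renames its target as mv(1) would within a single filesystem, and checks whether the link dangles before and after.

```python
import os
import tempfile


def is_dangling(link):
    """True if link is a symbolic link whose target no longer exists."""
    return os.path.islink(link) and not os.path.exists(link)


def demo():
    """Return (dangling before the rename, dangling after the rename)."""
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "gumemd.pl")
        new = os.path.join(tmp, "moved-gumemd.pl")
        link = os.path.join(tmp, "link-to-gumemd.pl")
        open(old, "w").close()
        os.symlink(old, link)
        before = is_dangling(link)
        os.rename(old, new)  # the link still names the old path
        after = is_dangling(link)
        return before, after


print(demo())  # (False, True)
```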

On some Unix-like OSes, certain filesystems, such as Solaris’s UFS and Linux’s ext3fs, support file attributes. With attributes, the metadata is stored with the file itself, so that filesystem operations leave the association invariant (unless they involve the attribute data specifically). This modifies our operation to something more acceptable:

[Figure: mv(1) with the tag set stored in file attributes; the tags follow the file, but the link set is stale]

The link set is still stale, however: we would have to rebuild the entire link set to remove the stale links to the old location and to generate the correct links to the new one.

It is clear that what we want to occur is simultaneous updating of the link and tag sets with each operation: we want transactional operations.

[Figure: a transactional move; the link set and tag set are updated together]

One way to achieve a transactional operation is to get beneath the rename(2) system call, and to update both the link set and the tag set within an underlying tagging file system. It would be straightforward to implement a translucent file system that offered a namespace for the link set, and used an underlying on-disk filesystem to store the appropriate data. But, as with file attributes, we pass on an in-kernel implementation because of the associated portability costs. Instead, we have to encapsulate the various steps into an apparent transaction, by offering rename and remove operations in the command itself.

We implement encapsulated operations in the prototype version by wrapping the standard Unix-like commands: cp(1), mv(1), and rm(1). The syntax is

$ tag --mv srcfile dstfile

and similarly for the other commands.
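
The wrapped move can be pictured roughly as follows. This Python sketch is not the prototype's code: the function name tagged_mv is hypothetical, a plain dictionary stands in for the tag database, and the steps are only an apparent transaction, since a failure partway through would leave the link set inconsistent.

```python
import os
import shutil
import tempfile


def tagged_mv(tag_db, root, src, dst):
    """Move src to dst, then bring the tag set and the link set along."""
    src = os.path.abspath(src)
    # shutil.move returns the final path (useful when dst is a directory).
    dst = os.path.abspath(shutil.move(src, dst))
    tags = tag_db.pop(src, set())
    if tags:
        tag_db[dst] = tags
    # Repoint every link under root that referred to the old location.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            link = os.path.join(dirpath, name)
            if os.path.islink(link) and os.readlink(link) == src:
                os.remove(link)
                os.symlink(dst, os.path.join(dirpath, os.path.basename(dst)))
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "tag")
        os.makedirs(os.path.join(root, "code"))
        src = os.path.join(tmp, "pause.c")
        open(src, "w").close()
        os.symlink(src, os.path.join(root, "code", "pause.c"))
        db = {src: {"code"}}
        tagged_mv(db, root, src, os.path.join(tmp, "wait.c"))
        print(sorted(os.listdir(os.path.join(root, "code"))))
```

Walking the whole link set on every move is the simplest approach; an implementation that knows the file's tags could visit only the directories built from those tags.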

Implementation

As I noted earlier, the code is a draft implementation and is grossly undertested. There are numerous improvements that could be made to the implementation—the tying and untying for multiple tag operations is an obvious performance sink. Architecturally, the script should be separated into a module providing tagging functionality and a command that presents the module’s functionality as an elegant command line interface. (It would be a reasonable position to move tagging into a small C library, such that a simple common tagging infrastructure could be made available. We could alternatively commit to a specific DBM file format and offer access from multiple symbolic link-capable languages.)

Implementation improvements would include: recognizing when an operation (tag retrievals in particular) is on an in-the-link-set symbolic link, resynchronizing/repairing databases and link hierarchies, better documentation (although the script does embed minimal documentation using POD), a test suite, and use of ExtUtils::MakeMaker or Module::Install.

As always, I’m interested in your comments, suggestions, and references to similar work.


  • http://www.schulzone.org josh

    That’s a great idea for browsing filesystems.

  • http://www.socio-kybernetics.net/saurierduval/ saurier duval

    Brilliant idea!

  • http://www.martiansoftware.com Marty Lamb

    Great idea! I’ve tinkered with a different approach, but your filesystem-based approach makes tags/queries accessible to existing programs. Very slick.

    You might be able to eliminate your combinatorial explosion of tag subdirectories by using a virtual filesystem (http://lwn.net/Articles/13325/). Such an approach would come at the cost of platform-specificity, but since you’re using symlinks already, I suspect that’s acceptable.

    Nice work!

    • Marty
  • http://www.thespoke.net/MyBlog/travisowens/MyBlog.aspx Travis Owens

    Well actually Microsoft’s upcoming file system addon for Longhorn WinFS supports the ability to tag files by category, and much more.

    It’s a shame it won’t be ready for Longhorn and will instead be an add-on.

    Hopefully WinFS will finally allow Windows NTFS to have unix style soft/hard links, as “shortcuts” are half-arsed at best (ever tried to open a shortcut in your text editor or mp3 player?).

  • http://www.base4.net Alex

    This is an excellent idea. It reminds me of something I created called XTend. I represented additional tags as directories too, and by drilling into a directory the system added that tag to the filter. I am no longer working on XTend; I have moved on to something more like WinFS called Base4. But I invite you to take a look here: http://www.base4.net/XTend.htm, it might give you some ideas.

    That page links to 2 flash demos. http://www.base4.net/xtend/interface.htm http://www.base4.net/xtend/search.htm

    The second demo is especially relevant. XTend allowed you to filter a file system by relationship. One type of relationship I supported was from File to Tag (or what I called Concepts, note: this was before ‘Tagging’ became so hot). So this allowed you to filter the filesystem by tag or tags.

    I would love to hear your thoughts on XTend, even though I am no longer working on it.

  • Stephen

    @*: Thanks to all for the kind words. There will be a follow up post soon—I need to work through the various notes and comments on other people’s feeds/blogs/link collections.

    @Marty: you’re right that a filesystem implementation, where the directories can be synthesized as needed, can avoid some of the combinatorial problems, but it opens the system to a different type of resource exhaustion, as an antagonist can open deep paths and exhaust kernel memory. This moves the denial of service from per-filesystem to system wide. Ultimately, I think a cap on the depth of the link set is needed by any implementation.

    @Travis and Alex: I’m a bit ignorant about WinFS’s capabilities, but standardized user and application metadata conventions sound like the right direction for making categorization a meaningful addition to an operating system’s programming model. The XTend demos show that there’s a benefit to representing such attributes in file browsers, and not just via a command line interface.

    — Stephen

  • http://blogs.sun.com/cmh Claire Giordano

    Very, very cool, Stephen. So this is what keeps you up late at night. ;-)

  • Thrax

    I don’t understand why there are subdirectories in the tag directories? What’s in those? How does Tag know when to make them?

    I can see it would be handy with pipes, like %tag print_me|lpr. And I guess renaming a tag would be easy when browsing the tag directory. I hear Flickr has introduced ‘for:person’ tags. Seems like a slippery slope to complexity.

  • Stephen

    @Claire: Actually fighting with making MathML work kept me up late, but thanks.

    @Thrax: The subdirectories contain links to the files that match both the tag of the current directory and the tag of each subdirectory. That is, they contain the two-tag matches. The subdirectories of those subdirectories are the three-tag matches, and so forth. tag(1) makes the subdirectories based on the set of tags on each file; adding a tag can create additional directories, while deleting a tag may cause the removal of directories (if that was the last use of the tag). Don’t rename the tag directories! This operation needs to be added to tag(1) to be safe: the tag sets of each file with the given tag (to be renamed) need to be updated as well.

    — Stephen

  • http://tv.debian.net/ Tommi Virtanen

    For an actual filesystem that did this, see

    http://citeseer.ist.psu.edu/gifford91semantic.html

  • Stephen

    @Tommi: Thanks for the reference. I’ll try to check it out in the next few weeks. — Stephen