Jay's blog

Don't Rely On mtime For Anything Important

Rachel Kroll posted "rsync's defaults are not always enough" on her well-known blog yesterday. Everything she said is true. It's a short post, just over 500 words. Go read it.

The gist is that the venerable data transfer application rsync, by default, compares only file size and modification time (mtime) to decide whether a file differs between the source and destination. There is an option to compare files by checksum instead. The rsync man page lists it as --checksum, -c: "skip based on checksum, not mod-time & size".

In Rachel's case, a particular file's contents had changed, but the overall file size hadn't changed, and the mtime hadn't changed. The first part makes sense. If you swap one byte in a file with another byte, the file's contents have changed but its overall size has not. But how would the mtime stay the same? Probably shenanigans from a human being somewhere. But let's stop a moment and answer the question...

What is mtime?

Depending on the features of the file system being used, there may be lots of metadata attached to a file. Speaking broadly, most file systems used with *nix operating systems include things like the owning user's id, the owning group's id, read/write/execute permissions, atime, and mtime.

Atime stands for access time. Whenever a file is read, the atime is updated. This is theoretically great for audit logging, except that it doesn't tell you who accessed the file, and it creates a lot of disk activity overhead. Imagine having to write to disk every time you read from disk. That's why strict atime updates are usually turned off or throttled (the noatime and relatime mount options on Linux) on most modern systems I've used.

The very fact that a configuration option can disable one of these should automatically give you trust issues.

Mtime, as previously established, stands for modification time. But what does that really mean? I'm running Pop!_OS and the oldest file on my system is /usr/share/doc/cron/THANKS from September 1st, 1994. But I bought this laptop earlier this year! I doubt there's a single part inside that was manufactured before 2023. Consumer SSDs were barely a thing in 1994. How could this file's mtime be from the mid-1990s?

Here's our first problem. Is mtime supposed to represent the last time the file was modified anywhere, or the last time it was modified on this particular storage device? I guess that's up for interpretation. When you transfer files using rsync's archive mode (--archive, -a), it preserves metadata from the source on the destination, including the mtime. The same is true when files are extracted from a tarball. So, depending on how a file arrived on your disk, its mtime may predate your device.
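To see the tarball case for yourself, here's a minimal Python sketch. It assumes the standard library's tarfile module behaves like the tar command in this respect, and the file and archive names are made up for illustration. It stamps a brand new file with a 1994 mtime, archives it, extracts it, and reads back the extracted copy's mtime.

```python
import os
import tarfile
import tempfile
from datetime import datetime, timezone

# Midnight UTC on the date of that cron THANKS file.
OLD_MTIME = int(datetime(1994, 9, 1, tzinfo=timezone.utc).timestamp())


def round_trip_mtime(mtime):
    """Archive a file, extract it, and return the extracted copy's mtime."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "THANKS")  # hypothetical file name
        with open(original, "w") as f:
            f.write("thanks\n")
        os.utime(original, (mtime, mtime))  # set atime and mtime

        archive = os.path.join(tmp, "docs.tar")
        with tarfile.open(archive, "w") as tar:
            tar.add(original, arcname="THANKS")

        out_dir = os.path.join(tmp, "extracted")
        os.mkdir(out_dir)
        # Newer Pythons want an explicit extraction filter; older ones lack it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(archive) as tar:
            tar.extract("THANKS", path=out_dir, **kwargs)

        return int(os.stat(os.path.join(out_dir, "THANKS")).st_mtime)


if __name__ == "__main__":
    extracted = round_trip_mtime(OLD_MTIME)
    print(datetime.fromtimestamp(extracted, tz=timezone.utc).date())
```

The extracted file was created seconds ago, yet it claims to be from 1994.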

Our second problem is the touch command.

NAME
     touch - change file access and modification times

This command, which any user can run on files they own, exists explicitly to modify atime and mtime. Here's a bit of an ontological question. If I can update the modification time of a file without actually modifying the contents of the file, does mtime still have any real meaning?

The most common usage of touch is with default settings, touch filename, which simply updates atime and mtime to the current date and time. However, it's not hard to set mtime to whatever value you want.

jaysherby@framework13:~$ touch filename
jaysherby@framework13:~$ ls -lah filename
-rw-rw-r-- 1 jaysherby jaysherby 0 Jun  1 15:54 filename
jaysherby@framework13:~$ touch --date="1970-01-01 16:20:00 UTC" filename
jaysherby@framework13:~$ ls -lah filename
-rw-rw-r-- 1 jaysherby jaysherby 0 Jan  1  1970 filename

It's just that easy to forge a file's mtime. I suspect that's what happened to Rachel's system file. Why would someone change a file's contents making sure not to change its length and then modify its mtime to match the file before the change occurred? It seems very suspicious if not necessarily malicious.
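Here's a minimal Python sketch of that whole scenario. The quick_check_same function is my own stand-in for a size-and-mtime comparison, not rsync's actual code, and I'm using SHA-256 for the content check purely for convenience. The script copies a file, swaps one byte in the source, puts the old timestamps back, and then asks both checks whether anything changed.

```python
import hashlib
import os
import shutil
import tempfile


def quick_check_same(a, b):
    """Stand-in for a size-and-mtime comparison (whole seconds)."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)


def checksum_same(a, b):
    """Compare actual contents by hashing both files."""
    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    return digest(a) == digest(b)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.txt")
        dst = os.path.join(tmp, "dst.txt")
        with open(src, "wb") as f:
            f.write(b"hello world\n")
        shutil.copy2(src, dst)  # copy2 carries the mtime over to the copy

        before = os.stat(src)
        with open(src, "r+b") as f:
            f.write(b"j")  # overwrite the first byte; the size stays the same
        # Put the old timestamps back, like touch with an explicit date.
        os.utime(src, ns=(before.st_atime_ns, before.st_mtime_ns))

        return quick_check_same(src, dst), checksum_same(src, dst)


if __name__ == "__main__":
    print(demo())  # prints (True, False)
```

The size-and-mtime check says the two files match. The hashes say they don't.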

Mtime is so notoriously unreliable that git, for example, doesn't record file mtimes in its history at all. It records commit timestamps, and it notably does not set the mtime of the files it checks out to match them. I found this out once upon a time when I was curious about the oldest files in a repository and found that I had to parse git logs to find them.

What checksum does rsync use?

I was curious about this. It seems that when you pass --checksum (or -c), older versions of rsync use MD5, which is probably fine. Newer versions (3.2 and later, as far as I can tell) negotiate the algorithm between the two sides and may pick one of the xxhash variants instead. If you don't like what you get and would prefer something else, there's also the --checksum-choice=STR flag: "choose the checksum algorithm (aka --cc)". You can run rsync --version locally to see which algorithms your version of rsync supports. My local copy supports some xxhash algorithms, MD5, MD4, and SHA1.

Will this prevent corruptions during transfer, like BitTorrent?

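Not by itself. The --checksum comparison runs before the transfer, only to decide whether the destination file needs to be updated. Separately, the rsync man page notes that rsync always verifies each transferred file against a whole-file checksum generated as the file is transferred, regardless of this option. That check covers the transfer itself. As far as I can tell, it does not re-read the finished file from the destination disk, so it can't vouch for what actually landed there.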

This is important for my use cases. I'm usually using rsync to move files from one place to another, not to back them up. That means I will delete the source files after the transfer is complete. I have been bitten by this before when files end up mysteriously corrupted. That's why I know so much about this aspect of the rsync tool.

Think about the complete path a file needs to traverse when it's transferred via rsync. It goes from the disk, to memory, through application code, over the network, through application code, to memory, to disk. There are several places in that pipeline where something could technically go wrong and I'd never know.

You can create your own utility or script to hash the files on both sides afterwards and compare them, or you can just run the rsync command again with --checksum. I do the latter because it's simpler. If everything transferred fine the first time, nothing should need to be transferred on the second run. After that, I feel safe deleting the local copy of the files.
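If you'd rather go the script route, here's a minimal sketch in Python. It assumes both directory trees are reachable as local paths (say, the destination is a mounted drive or network share), and the function names are my own. For a remote host, you'd have to run the hashing on each side and compare the results yourself.

```python
import hashlib
import os
import sys


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(src_root, dst_root):
    """Return relative paths that are missing from dst_root or differ in content."""
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst) or file_digest(src) != file_digest(dst):
                mismatches.append(rel)
    return sorted(mismatches)


if __name__ == "__main__":
    # Usage: python verify.py SOURCE_DIR DEST_DIR
    bad = find_mismatches(sys.argv[1], sys.argv[2])
    for rel in bad:
        print("MISMATCH:", rel)
    sys.exit(1 if bad else 0)
```

It only walks the source tree, so extra files on the destination are ignored. That's all I need before deleting the source copies.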

I don't think rsync or network issues were at fault for the occasional file corruptions I've experienced, but it's nice to be able to reasonably rule it out.

Don't use mtime for anything important. Its meaning is debatable, and it's a convenient fiction at best.

#rsync