The task at hand was easy. Platform Engineering (PE) wants to re-create the filesystem on an NVRAM drive which holds a database. The database itself is around 9 TB, but that size is nothing unusual for us. Plus we have procedures to take this database out of production for a prolonged period. All good. Or so I thought.

First confirm that the database is no longer in production. Confirmed. Shut it down. Done. Ask PE to give me a Ceph mount point on this host, which I can use to back up the entire directory with tar. Got it. There’s plenty of disk space on our Ceph cluster, so this should not be a problem:

xyz.xyz.xyz.xyz:6789:/backend-temp-postgres-backup  1.6P  284T  1.3P  19% /mnt/ceph

tar

Soon I started the tar process:

cd /var/lib/postgresql/ && tar -cp --sparse --atime-preserve --acls --xattrs --one-file-system -v data > /mnt/ceph/data.tar

I was surprised when, after a short while, the process failed:

data/79481/1905107673.4
tar: -: Wrote only 4096 of 10240 bytes
tar: Error is not recoverable: exiting now

Huh? Still enough disk space on the Ceph mount. How big is the archive so far?

postgres@backend-10 ~ $ ls -ld /mnt/ceph/data.tar
-rw-r--r-- 1 postgres postgres 1099511627776 Mar 20 19:17 /mnt/ceph/data.tar
postgres@backend-10 ~ $ ls -ldh /mnt/ceph/data.tar
-rw-r--r-- 1 postgres postgres 1.0T Mar 20 19:17 /mnt/ceph/data.tar

Interesting. Exactly 1 TB. The Ceph documentation brings clarity:

CephFS has a configurable maximum file size, and it’s 1TB by default. You may wish to set this limit higher if you expect to store large files in CephFS. It is a 64-bit field.

The documentation also notes that setting max_file_size to 0 does not disable the limit - it would simply limit clients to only creating empty files. OK, that doesn’t make much sense, but anyway. I can’t change this setting right now, and although PE could give me a new storage mount point, I figured there’s another way to do this backup.
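
For the record: with admin access to the Ceph cluster, the limit can be raised. A sketch - “cephfs” stands in for the actual filesystem name, and we didn’t go this route:

# Raise the CephFS maximum file size to 16 TiB (the value is given in bytes).
# "cephfs" is a placeholder for the real filesystem name; requires cluster admin access.
ceph fs set cephfs max_file_size 17592186044416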

Split the archive into multiple files

In my original approach I “just” redirected the tar output into a single file on the Ceph mount point. That works, up to 1 TB, as we have learned. One could pipe the output through the split utility to break the archive into multiple files, but that adds an extra process to the pipeline.
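
For reference, such a split pipeline would look roughly like this (a sketch, not the approach used here; chunk size and paths are examples):

# Create: pipe the tar stream through split, writing 50 GiB chunks.
cd /var/lib/postgresql/ && tar -cp --sparse --one-file-system -f - data | split -b 50G - /mnt/ceph/data.tar.part-

# Restore: concatenate the chunks back into one stream and feed it to tar.
cat /mnt/ceph/data.tar.part-* | tar -xp -f - -C /restore/directory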

tar was originally designed to handle tape archives - the name derives from “tape archive”. Because tapes have limited storage capacity, an archive sometimes spans more than one tape, and tar therefore knows how to split its output across separate tapes - or files. From the tar manpage:

-L, --tape-length=N
Change tape after writing Nx1024 bytes.  If N is followed by a size suffix (see the subsection Size suffixes below), the suffix specifies the multiplicative factor to be used instead of 1024.

This option implies -M.

-M, --multi-volume
Create/list/extract multi-volume archive.

Since we are no longer using tapes, but want to split this into files, we need the following additional options:

  • --tape-length: the length of each tape/file, in units of 1024 bytes (KiB)
  • --file: specify the output file(s)

A 50 GiB file is 52428800 such units of 1024 bytes. When unpacking and restoring the archive later, the operating system needs to fetch each file from Ceph, so the files should be small enough to reasonably fit into the available memory. Let’s not do 1 TB splits.
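
For reference, the value for --tape-length can be computed directly in the shell:

postgres@backend-10 ~ $ echo $((50 * 1024 * 1024))
52428800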

For the output we need to specify multiple filenames. Either list each file one by one, or use shell brace expansion:

postgres@backend-10 ~ $ echo data-{1..10}.tar
data-1.tar data-2.tar data-3.tar data-4.tar data-5.tar data-6.tar data-7.tar data-8.tar data-9.tar data-10.tar

This changes the tar command line to:

cd /var/lib/postgresql/ && tar -cp --sparse --atime-preserve --acls --xattrs --one-file-system -v --tape-length=52428800 --file=/mnt/ceph/data-{1..500}.tar data
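
Note that the brace expansion happens inside the word, so the shell repeats the --file= prefix for every generated filename - tar effectively receives 500 separate --file options:

postgres@backend-10 ~ $ echo --file=/mnt/ceph/data-{1..3}.tar
--file=/mnt/ceph/data-1.tar --file=/mnt/ceph/data-2.tar --file=/mnt/ceph/data-3.tar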

Start the shell pattern with 1, not with 0, so that the file numbers match tar’s volume numbers, which start at 1 - the restore step below relies on that. And make sure the upper bound of the pattern is big enough. It doesn’t matter if the number is too large, tar will only use as many filenames as it needs.

Once this operation is completed, sure enough the result is split into multiple files:

-rw-r--r-- 1 postgres postgres 824633722880 Mar 20 21:16 /mnt/ceph/data-1.tar
-rw-r--r-- 1 postgres postgres 824633722880 Mar 20 21:35 /mnt/ceph/data-2.tar
-rw-r--r-- 1 postgres postgres 824633722880 Mar 20 21:55 /mnt/ceph/data-3.tar
…
-rw-r--r-- 1 postgres postgres 408477992960 Mar 20 22:04 /mnt/ceph/data-x.tar

Restore from multiple archives

Now that we have a backup, let’s make sure that we can actually use it. Let’s try to unpack it.

cd /restore/directory
tar -xpvv -f /mnt/ceph/data-1.tar -F 'echo /mnt/ceph/data-${TAR_VOLUME}.tar >&${TAR_FD}'

This shell line requires that the archive numbering starts at 1 (data-1.tar). The -F option selects the next tape - here the next filename - and relies on tar increasing the variable ${TAR_VOLUME} for every newly requested volume; the script prints the new filename to the file descriptor ${TAR_FD}, from which tar reads it. The first iteration asks for volume number 2 and therefore builds the filename /mnt/ceph/data-2.tar. The second iteration asks for volume 3 and builds /mnt/ceph/data-3.tar. And so on.

There’s surprisingly little information available about these variables and features.
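
For convenience, the same logic can also live in a small standalone script. A minimal sketch (the script name and paths are examples; tar exports TAR_VOLUME and TAR_FD into the script’s environment):

#!/bin/bash
# next-volume.sh - tell tar which archive file to open next.
# TAR_VOLUME holds the number of the volume tar is asking for,
# TAR_FD is the file descriptor tar reads the new archive name from.
echo "/mnt/ceph/data-${TAR_VOLUME}.tar" >&${TAR_FD}

Make the script executable and point -F at it:

chmod +x next-volume.sh
cd /restore/directory && tar -xpvv -f /mnt/ceph/data-1.tar -F ./next-volume.sh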

Summary

The Unix/Linux utility tar can be used to archive and restore large amounts of data, even when the target can’t store single files of that size. Splitting the archive is possible with built-in options. Keeping the individual files reasonably small also avoids pulling huge files into the operating system’s memory and polluting the cache during a restore.

Note on rsync

Why not use rsync, you may ask.

This PostgreSQL data directory holds 15k directories and 167k files. Remember, it’s about 9 TB of data. rsync is really slow at handling such a large number of files on Ceph: syncing the entire data directory to Ceph took almost 4 days, compared to less than one day for tar. I tried both approaches.

A simple rsync run just to check whether all files are in sync takes around 15 minutes. It has to read the metadata for each file from Ceph, which results in close to 200k network operations.
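
Such a check looks roughly like this (a sketch; the exact flags, source and target paths will differ):

# Dry run: compare source and target without copying anything, listing differences.
rsync -a --dry-run --itemize-changes /var/lib/postgresql/data/ /mnt/ceph/data/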

Image Credit

Image by Gerd Altmann from Pixabay