If you haven’t heard about Amazon Glacier already, it’s definitely something to be excited about. Amazon Glacier is a service that makes it extremely affordable to store gigabytes upon gigabytes of data for the long term in the cloud. Your data is stored immediately, but retrieval requests take at least 4 hours before your data is available for download again. Let’s back up a ton of files on Linux to Glacier.
For my purposes, I wanted to back up my entire music library, which is about 35 gigabytes of data. Because Amazon doesn’t store folder structures, we’ll need to create archives of what we’d like to store. We’ll do that and split our archive into 200MB blocks for convenience. I’m not currently aware of any per-file maximum size on Glacier, but it’s much easier to retry a 200MB upload than a 35GB one. Plus, this will make it much easier to script our uploads in a way that lets us validate that each upload completed properly. To zip up our music library, we’ll use tar and split, standard Linux utilities. I’ll be applying gzip compression to my tar archive, though this is not strictly necessary or even beneficial in our case, since most music files are already compressed using much more advanced algorithms than general-purpose compression methods.
$ BACKUP_TIME="$(date +%Y%m%d%H%M%S)"
$ tar cvzf - ~/Music | split --bytes=200MB - "music.backup.$BACKUP_TIME.tar.gz."
What this will do is create a tar.gz archive split every 200MB into files named like "music.backup.20130120120503.tar.gz.aa". The date is generated before we do any real work, and the aa suffix increments with each chunk, so you’ll have archives ending in ab, ac, etc. If you’re as paranoid as you should be, you’d now encrypt these archives using a GPG key, but that’s outside the scope of this tutorial for now.
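Before uploading anything, it’s worth a quick sanity check that the split pieces reassemble into a valid archive. The aa/ab/ac suffixes sort alphabetically, so a shell glob feeds the chunks to tar in the right order, and listing the stream without extracting will catch a corrupt or missing chunk. The glob below assumes the naming scheme from the split command above:

```shell
# Reassemble the chunks in order and let tar verify the whole
# stream without extracting anything to disk.
cat music.backup.*.tar.gz.* | tar tzf - > /dev/null \
  && echo "Archive OK" \
  || echo "Archive is corrupt -- do not upload!"
```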
Now that we have our data, we’ll use a Python-based command-line tool called glacier-cmd to upload our files and perform other operations on our Glacier vaults. Let’s get it installed:
$ sudo apt-get install python-setuptools git
$ git clone git://github.com/uskudnik/amazon-glacier-cmd-interface.git
$ cd amazon-glacier-cmd-interface
$ sudo python setup.py install
Everything should now be installed for us to be able to start our backup process. All we need to do before we start uploading is to set up our configuration file and create our vault. First, create a file at ~/.glacier-cmd with the following contents:
[aws]
access_key=YOUR_AWS_ACCESS_KEY
access_secret=YOUR_AWS_ACCESS_SECRET

[glacier]
region=YOUR_AWS_REGION
logfile=~/.glacier-cmd.log
loglevel=INFO
output=print
You’ll need to supply your AWS access/secret keys in the proper variables, as well as configure the AWS region. I’m in us-west-1, but you may wish to store your information in a different region.
Last step before uploading: create our Glacier vault. Choose a really awesome globally-unique vault name and create it like so:
$ glacier-cmd mkvault "my-super-ridiculously-awesome-longterm-backup-solution"
Provided that it completes properly, we’re in business. Now, let’s get on to the uploading. Since I’ve split my file into 200MB chunks, I’d like to upload them one at a time, moving completed uploads into a "Completed Backups" folder (create it beforehand with mkdir). For this, I’ve devised a pretty quick while loop to find my backup files and run a series of commands on them.
$ find . -maxdepth 1 -type f -name "music.backup.*.tar.gz.*" | sort | while read file ; do
    echo "Uploading $(basename "$file") to Amazon Glacier."
    glacier-cmd upload --description "$(basename "$file")" \
        my-super-ridiculously-awesome-longterm-backup-solution "$file" && \
        mv "$file" "Completed Backups"
done
I use a find command to locate all files matching the "music.backup.*.tar.gz.*" pattern in the current directory only. Each found file is passed to the while loop, which makes it available as the $file variable. Before each file is uploaded, I echo a status message to let me know which file I’m currently working on. I then do the actual upload using the glacier-cmd tool we installed earlier, setting the description of each uploaded file to its filename. If the upload completes without an issue, I move the uploaded file into the "Completed Backups" folder so I know it was uploaded successfully.
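One extra safeguard I’d suggest on top of this (my own addition, not part of the glacier-cmd workflow): record a checksum manifest of the chunks before they leave your machine. When you eventually retrieve files from Glacier, you can then prove the downloads are byte-for-byte identical to what you uploaded. The filename pattern is the same one used above:

```shell
# Record a SHA-256 checksum for every chunk before uploading.
sha256sum music.backup.*.tar.gz.* > music.backup.manifest.sha256

# Later, after retrieving the chunks from Glacier, verify them:
sha256sum --check music.backup.manifest.sha256
```

Keep the manifest somewhere other than Glacier alone, since you need it before the retrieved data can be trusted.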
After your upload completes (for 35GB, possibly a few days later), you can use other commands of the glacier-cmd utility to query the status of your vault. Remember that inventory jobs take about 4 hours, so don’t expect to get a directory listing back immediately. Welcome to long-term backup with Glacier!
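For completeness, restoring is just the reverse of the split once you’ve retrieved and downloaded every chunk: concatenate the pieces in order and hand them to tar. The timestamp below is the hypothetical one from the example earlier; note that because we archived ~/Music by absolute path, GNU tar strips the leading slash, so extraction recreates a home/... tree under the current directory.

```shell
# The aa/ab/ac suffixes sort alphabetically, so the glob feeds
# the chunks to tar in the correct order.
cat music.backup.20130120120503.tar.gz.* | tar xzvf -
```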