Creating Long-Term Backups with Amazon Glacier on Linux

If you haven’t heard about Amazon Glacier yet, it’s something to be excited about. Amazon Glacier is a service that makes it extremely affordable to store gigabytes upon gigabytes of data long-term in the cloud. Your data is stored immediately, but retrieval requests take at least four hours before your data is available for download. Let’s back up a ton of files on Linux to Glacier.

For my purposes, I wanted to back up my entire music library, which is about 35 gigabytes of data. Because Amazon doesn’t store folder structures, we’ll need to create archives of what we’d like to store. We’ll do that and split the archive into 200MB blocks for convenience. I’m not currently aware of any per-file maximum size on Glacier, but it’s much easier to retry a 200MB upload than a 35GB one. Plus, this makes it much easier to script our uploads in a way that lets us verify each one completed properly. To package our music library, we’ll use tar and split, two standard Linux utilities. I’ll be applying gzip compression to my tar archive, though this isn’t strictly necessary or even beneficial in our case, since most music files are already compressed with far more effective codecs than general-purpose compression.

$ BACKUP_TIME="$(date +%Y%m%d%H%M%S)"
$ tar cvzf - ~/Music | split --bytes=200MB - "music.backup.$BACKUP_TIME.tar.gz."

What this will do is create a tar.gz archive split every 200MB into files named like "music.backup.20130120120503.tar.gz.aa". The date is generated before we do any real work, and the two-letter suffix increments with each chunk, so you’ll have archives ending in aa, ab, ac, and so on. If you’re as paranoid as you should be, you’d now encrypt these archives using a GPG key, but that’s outside the scope of this tutorial for now.
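One cheap bit of insurance worth taking before uploading: record checksums of every chunk so you can verify them after a retrieval. This is a minimal sketch using only standard coreutils; the manifest filename is my own choice:

```shell
# Record a checksum for every chunk before uploading; the manifest is
# tiny compared to the archives, so keep a copy somewhere safe.
sha256sum music.backup.*.tar.gz.* > manifest.sha256

# After re-downloading from Glacier, verify every chunk survived intact.
sha256sum -c manifest.sha256

# split's suffixes (aa, ab, ac, ...) sort lexically, so a plain shell
# glob restores the original byte order before extraction.
cat music.backup.*.tar.gz.* | tar xzf -
```

Since the glob expands in sorted order, no manual reordering of the chunks is needed on the way back in.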

Now that we have our data, we’ll use a Python script library called glacier-cmd to upload our files and perform other operations on our Glacier vaults. Let’s get that library installed:

$ sudo apt-get install python-setuptools git
$ git clone git://github.com/uskudnik/amazon-glacier-cmd-interface.git
$ cd amazon-glacier-cmd-interface
$ sudo python setup.py install

Everything should now be installed for us to start our backup process. All we need to do before uploading is set up our configuration file and create our vault. First, create a file at ~/.glacier-cmd with the following contents:

[aws]
access_key=YOUR_AWS_ACCESS_KEY
access_secret=YOUR_AWS_ACCESS_SECRET

[glacier]
region=YOUR_AWS_REGION
logfile=~/.glacier-cmd.log
loglevel=INFO
output=print

You’ll need to supply your AWS access and secret keys to the appropriate variables, as well as configure the AWS region. I’m in us-west-1, but you may wish to store your data in a different region.

Last step before uploading: create our Glacier vault. Choose a really awesome globally-unique vault name and create it like so:

$ glacier-cmd mkvault "my-super-ridiculously-awesome-longterm-backup-solution"

Provided that it completes properly, we’re in business. Now, on to the uploading. Since I’ve split my archive into 200MB chunks, I’d like to upload them one at a time, moving completed uploads into another folder. For this, I’ve devised a quick find/while loop that locates my backup files and runs a series of commands on each one.

$ find . -maxdepth 1 -type f -name "music.backup.*.tar.gz.*" | sort | while read file ; do
    echo "Uploading $(basename "$file") to Amazon Glacier."
    glacier-cmd upload --description "$(basename "$file")" \
        my-super-ridiculously-awesome-longterm-backup-solution "$file" && \
        mv "$file" "Completed Backups" 
done

I use a find command to locate all files matching the "music.backup.*.tar.gz.*" pattern in the current directory only. Each found file is piped into the while loop, which makes it available as the $file variable. Before each file is uploaded, I echo a status message so I know which file is currently uploading. I then do the actual upload using the glacier-cmd tool we installed earlier, setting the description of each upload to the actual filename. If the upload completes without issue, I move the uploaded file into the “Completed Backups” folder (make sure that folder exists beforehand) so I know the file was uploaded successfully.
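Since the whole point of splitting into 200MB chunks is that retries are cheap, you could wrap the upload command in a small retry helper. This is just a sketch; the attempt count and delay are arbitrary choices of mine:

```shell
# Run a command up to three times, sleeping between failed attempts.
# Returns 0 on the first success, or 1 if every attempt fails.
upload_with_retry() {
    local attempt
    for attempt in 1 2 3; do
        "$@" && return 0
        echo "Attempt $attempt of 3 failed." >&2
        [ "$attempt" -lt 3 ] && sleep 5
    done
    return 1
}

# Inside the loop above, the upload line would then become something like:
# upload_with_retry glacier-cmd upload --description "$(basename "$file")" \
#     my-super-ridiculously-awesome-longterm-backup-solution "$file"
```

Because the `&& mv` only fires when the whole helper succeeds, a chunk that fails all three attempts simply stays in place for the next run of the loop.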

After your upload completes, perhaps days later, you can use the other glacier-cmd subcommands to query the status of your vault. Remember that inventory jobs also take about four hours, so don’t expect a directory listing back right away. Welcome to long-term backup with Glacier!

18 thoughts on “Creating Long-Term Backups with Amazon Glacier on Linux”

  1. Nice blog post.

    I wanted to have a practical example that illustrates the usage of Glacier which I found here :-).

If I were you, however, I would create a longer script that makes groups of MP3s that are almost 200MB in total and then make a tarball out of each group. Because if you get corruption in one of the archives, your whole MP3 collection is gone. Just my 2 cents.

  2. That’s for sure. Another thing to add would definitely be file encryption, too. I don’t want someone looking at my data ;)

    I might just have to do another tutorial on this to fill these missing gaps. Getting tarballs of only 200MB would be kind of difficult; I wouldn’t know how to do that without the script getting too complicated. I’d probably have to write a Python script to find the files and group them into 200MB chunks.

    Is there a way to group them in 200MB chunks without making it one big, long, split tarball?

  3. Thanks, very useful.

    To keep it updated: the new version changes the name of "access_secret" to "secret_key".

  4. Hello,

    We are proud to present Glaciate Archive.
    It’s the only browser-based, multi-platform Amazon Glacier client (service) that has built-in search,
    metadata management, a retrieval calculator, pausable downloads, an automatically updating inventory list, upload/download notifications and, last but not least, user management with the possibility of setting credit limits per user.

    You can check it out at http://www.glaciatearchive.com and request your demo account

    _
    Tonis Leissoo
    Glaciate Archive

  5. I’m using version a2d5763 of glacier-cmd from Tue Mar 5 2013, and "access_key" and "access_secret" do not work anymore. You get "glacier-cmd: error: argument --aws-secret-key is required".
    Using "aws-secret-key" and "aws-access-key" in ~/.glacier-cmd works.

  6. For a performance increase, try pigz instead of gzip.

    tar cvf - /mnt/s10_1 | pigz -4 | split -a 4 --bytes=200M - "s10_1.$BACKUP_TIME.tar.gz."

    I also used 'split -a 4' because of the amount of data I was splitting.

    For installing on RH based systems, try:

    # yum install python-setuptools git
    # pip-2.6 install boto --upgrade
    # git clone git://github.com/uskudnik/amazon-glacier-cmd-interface.git
    # cd amazon-glacier-cmd-interface
    # python setup.py install

    Also, as per Marc above, .glacier-cmd needs to look like this now:

    [aws]
    access_key=REMOVED
    secret_key=REMOVED

    [glacier]
    region=us-east-1 # (or whatever)
    logfile=~/.glacier-cmd.log
    loglevel=INFO
    output=print

  7. Hi,

    I am using this tool but I am not able to upload data into Glacier.

    Also not getting any error.

    Steps:

    1. I have made the tar file
    2. Vault created
    3. Then ran the script as you guide:

    find . -maxdepth 1 -type f -name "files.tar*" | sort | while read file ; do
        echo "Uploading $(basename "$file") to Amazon Glacier."
        glacier-cmd upload --description "$(basename "$file")" vaultname "$file" && mv "$file" "Completed Backups"
    done

    Please help

  8. Thanks for this but alas, nothing happened for me – all I got was a '>' and a flashing cursor for ages. I don’t think I understand what to do exactly with your series of commands – certainly trying to cut and paste them didn’t work, nor did typing it all out on one line. Oh well.
    What I would like to know, and can’t seem to do a successful search on, is how do I remove the '$ BACKUP_TIME="$(date +%Y%m%d%H%M%S)"' reference from my system? It most likely does no harm, but I just want to remove it if possible.

    • There’s not much to do to remove the BACKUP_TIME variable; it only persists for your current shell session. This article was written mostly from a tutorial point of view, targeted at people already comfortable with Linux scripting and automation.

  9. I wrote some Python code to keep file integrity instead of splitting the tar with split. Archives will likely be smaller than 200 MB since I only measure the input size, not the output. I look forward to any comments:
    [code]#!/usr/bin/env python
    import logging
    import argparse
    import tarfile
    import os
    import os.path as pa

    def parseInputs():
        parser = argparse.ArgumentParser()
        parser.add_argument("-d", "--directory", help="Directory containing files to compress", default="music", required=True)
        parser.add_argument("-s", "--size", help="Maximum input size per archive in megabytes (MB)", default=200, type=int)
        parser.add_argument("-o", "--output", help="The output filename template", default="compressed")
        parser.add_argument("-v", "--verbose", help="Verbosity level (1-5, 5 is least verbose)", default=1, type=int)
        args = parser.parse_args()
        return (args.verbose, args.size, args.directory, args.output)

    class tarlimit:
        """Creates multiple tar files from a given input directory,
        limiting the size of each tar file to "size" MB. Additionally,
        it breaks archives along file boundaries, which is safer,
        should corruption occur, than the tar xyz | split method.
        Output files are auto-incremented starting at 0000.tar.bz2.
        NOTE: this will always deliver a size less than "size" MB,
        since it only sums the input file sizes and assumes compression
        will deliver a smaller output.
        Possible decompression method:
        for ii in *.bz2; do tar xvf $ii; done
        """

        loggingLevels = {1: logging.DEBUG,
                         2: logging.INFO,
                         3: logging.WARNING,
                         4: logging.ERROR,
                         5: logging.CRITICAL}

        def __init__(self, verboseLevel, size, directory, output):
            self.logger = logging.getLogger('tarlimit')
            if verboseLevel in range(1, 6):
                logging.basicConfig(format='%(levelname)s:%(message)s', level=self.loggingLevels[verboseLevel])
            else:
                logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)
                self.logger.critical("Incorrect logging level specified!")

            self.directory = directory
            self.size = size
            self.outputfiletemplate = output
            self.logger.debug("Using root directory {0}".format(self.directory))

        def run(self):
            """Walk the directory tree, summing input sizes and rolling
            over to a new archive whenever the limit is exceeded."""
            directorylist = os.walk(self.directory, onerror=self.exitgracefully)
            sumsize = 0
            numberofiterations = 0
            filenametocompress = self._createfilename(numberofiterations)
            self.logger.info("Using compressed file: {0}".format(filenametocompress))
            tar = tarfile.open(filenametocompress, "w:bz2")  # open first file
            for root, dirs, files in directorylist:
                totalpathnames = (pa.join(root, name) for name in files)  # generator, not list
                for filetoadd in totalpathnames:
                    sumsize += pa.getsize(filetoadd)
                    tar.add(filetoadd)
                    self.logger.debug("Adding file {0}".format(filetoadd))
                    if round(sumsize / 1.0e6) > self.size:
                        tar.close()
                        self.logger.debug("Input size is {0} MB".format(round(sumsize / 1.0e6)))
                        numberofiterations += 1
                        sumsize = 0
                        filenametocompress = self._createfilename(numberofiterations)
                        tar = tarfile.open(filenametocompress, "w:bz2")  # open next file
                        self.logger.info("Using compressed file: {0}".format(filenametocompress))
            # write the last file out
            tar.close()
            logging.shutdown()
            return numberofiterations

        def _createfilename(self, iter):
            return "{0}{1:04d}.tar.bz2".format(self.outputfiletemplate, iter)

        def exitgracefully(self, errorname):
            self.logger.debug("Exiting with error {0}".format(errorname))
            self.logger.debug("Error filename {0}".format(errorname.filename))
            return 0

    if __name__ == "__main__":
        verboseLevel, numberOfBytes, filename, outputtmplt = parseInputs()
        psid = tarlimit(verboseLevel, numberOfBytes, filename, outputtmplt)
        dat = psid.run()
        print("output {0} files.".format(dat))
    [/code]

  10. Pingback: Amazon AWS Glacier backup and file-archive solution is the cheapest on the market.

  11. Hi, I am not able to proceed with python setup.py install on a CentOS distro. I have tried on more than three servers and on all of them setup.py ends with the error below. Any help would be appreciated.

    PYTHON VERSION
    Python 2.4.3 (#1, Jan 9 2013, 06:47:03)
    [GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2

    {'__file__': setup_script, '__name__': '__main__'}
    File "setup.py", line 45
    with open("README.rst") as f:
    ^
    SyntaxError: invalid syntax

    • Forget it, I think I used /usr/bin/python2.6 and got rid of the error. Hope it’s installed now; I’ll let you folks know in case I need any more help.

