Survey Archives

Tuesday, November 19th, 02013 at 14:41 UTC

One of the big projects we work on is an online survey system. We have a system which we can easily deploy pretty much anywhere that Python and Django are supported. At the moment we are making full use of Heroku for our hosting, but in the future we might move to other platforms because of price, speed or geography.

The nice thing about the survey software is that it only exists and works for a short window of time during the collection phase. Once that is done, we create a static report for the customers along with separate online access for the derived data and graphs. This should be the end of our involvement and of the use of the original raw survey data. Storage is cheap, so we are opting to keep the anonymous data archived in case we need to revisit it in the future for reports, updates, or improvements to the system that require the source data to be reanalyzed.

Our original plan was pretty solid. We’d keep a copy of all the pieces in the chain. We save the survey data into a database, so let’s keep a copy of the database dump. That way we can always re-import it and breathe life back into the survey. For small data explorations, rebuilding the database seems like overkill, so we’re also going to save a copy of the CSV data dump. This is most commonly what we use for reports and other derived data sets. We also want to keep a copy of the JSON report syntax just in case we find a tiny typo; we can fix it without having to re-calculate any information. We’re also keeping the original survey syntax file. This is the file that describes the survey, software models and the underlying database structure. It is smart to keep this so we can quickly start a new survey based on these questions or make minor edits for a survey in the future.

These are all text files, so after compressing and encrypting they don’t take up much space, and space is cheap! Having all the pieces of the puzzle makes it very easy to replicate the state of the survey some time later.
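As a rough illustration of the bundling step, here is a minimal Python sketch that packs the archive pieces into one gzip-compressed tarball. The function names and the file names in the usage note are hypothetical, not the ones we actually use, and encryption is deliberately left out because it would be a separate step with its own tooling.

```python
import tarfile
from pathlib import Path


def bundle_survey_archive(files, archive_path):
    """Pack the given files into one gzip-compressed tarball.

    Encryption is not handled here; it would be a separate step.
    """
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for f in files:
            f = Path(f)
            # Store each file under its bare name so the archive unpacks flat.
            tar.add(str(f), arcname=f.name)
    return str(archive_path)


def list_archive(archive_path):
    """Return the sorted member names, handy for checking nothing is missing."""
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling `bundle_survey_archive(["survey.sql", "responses.csv", "report.json"], "survey.tar.gz")` and then `list_archive("survey.tar.gz")` gives a quick check that every piece of the puzzle made it into the bundle.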

Then we realized that we were actually missing the biggest piece: the actual code! The survey software itself has settled down and we’re not making any major changes, but since it is built on Django, new versions keep appearing, and we use South to handle database migrations. We quickly realized that as we continue development, no matter how small the changes, we can no longer load our old data into the current software. The “shape” of the database isn’t the same any more. We always want to make use of our most current, bug-free and optimized version of the survey and report generating software, but the delta between the data we have and the actual code keeps growing.

Imagine having backed up and archived all your word-processing files from the year 02000. They are safely stored away and now you want to re-open those archives and see what treasures await. Hm… you don’t seem to have a version of that software anymore. No matter, you have the newest, which doesn’t seem to import or open your ancient version of the files. But you have the original 3.5 inch floppy disk! Now to find the A: drive and a system which can run OS/2.

Having just the data isn’t enough!

We realized we need a way to bootstrap our way from the old system into the newest. All our code is in a version control system, so we try to tag each release of the survey software that is deployed live. This allows us to check out that tag, but we worry that if we forget to tag a release we’ll have problems. Our current best solution is to archive the entire virtual environment that is running the survey software, no matter what the version.
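A lightweight companion to archiving the whole virtual environment is a small manifest that records what was running. The sketch below is only an assumption about how that could look, not our actual tooling: `environment_manifest` and `write_manifest` are hypothetical names, and the release tag is passed in by hand rather than read from version control. It relies on the standard library's `importlib.metadata` (Python 3.8 or newer).

```python
import json
import sys
from importlib import metadata


def environment_manifest(release_tag):
    """Describe the running environment: release tag, Python version, pinned packages."""
    packages = sorted(
        "{}=={}".format(dist.metadata.get("Name"), dist.version)
        for dist in metadata.distributions()
        if dist.metadata.get("Name")  # skip broken installs that have no name
    )
    return {
        "release_tag": release_tag,          # supplied by the caller (assumption)
        "python": sys.version.split()[0],    # e.g. the interpreter's x.y.z version
        "packages": packages,                # "name==version" pins, sorted
    }


def write_manifest(path, release_tag):
    """Write the manifest as JSON next to the other archive files."""
    manifest = environment_manifest(release_tag)
    with open(path, "w") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
    return manifest
```

Stored alongside the archive, a file like this would tell us which tag and which package versions to rebuild even if the tag itself was never created.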

If this is part of our archiving workflow, we’ll have the database, the code that was running, the input and output files as well as files we can work with such as CSV and SPSS. Then if, for some reason, we need to go back to that survey, we have a way to bootstrap ourselves up to our current version. Using version control, we can start up an instance of the old version we have in the archive, load the data into the database while it is the same “shape”, then migrate our way up to the present day. That is a much easier solution than trying to retrofit current code to import and load old data.
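The re-activation path described above can be sketched as an ordered list of commands. This is a hedged outline, not a tested procedure: it assumes git for version control, a PostgreSQL dump loaded with `psql` into an already-created empty database, and migrations run through `manage.py migrate`. The function name and all arguments are hypothetical. The function only builds the plan, it does not run anything.

```python
def reactivation_plan(archived_tag, current_tag, dump_file, db_name):
    """Return the ordered shell commands for reviving an archived survey.

    Nothing is executed here; the plan is plain data that can be reviewed first.
    """
    return [
        # 1. Put the code back into the state that matches the archived data.
        ["git", "checkout", archived_tag],
        # 2. Load the dump while the schema still has the same "shape".
        ["psql", "-d", db_name, "-f", dump_file],
        # 3. Move the code forward to the current release.
        ["git", "checkout", current_tag],
        # 4. Let the migrations reshape the old data for the new code.
        ["python", "manage.py", "migrate"],
    ]
```

Each entry could then be passed to `subprocess.run(cmd, check=True)` in turn, so that a failure at any step stops the process before the data is touched by newer code.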

Luckily we haven’t needed to do this yet, but this is our current archival and re-activation strategy: save a snapshot of the whole state of the system as it was when it was turned off. It’s cheap, easy and gives us a direct path back up to the current production systems.

The biggest hurdle will be if we ever break away and switch systems. Migrating completely away from one platform like Django to something else like Rails would make our pathway to the current production code more difficult, though not impossible if we have written some converting code between the two systems. Hopefully we’ve done our best to back up, archive and re-activate our system from a cold start. That gives us the confidence to develop fast and worry less about compatibility issues between instances of surveys, as long as we have an established migration path.

Stop and think about your important data. As you back it up, are your projects so agile and lean that you are potentially sacrificing the ability to even get at that old data? Can you save a snapshot of the entire system, code, virtual environment and all the necessary files to get it going again?