Over the past few months the Permanent engineering team has been hard at work making fundamental improvements to the way people upload data to their archives. A lot of these changes work behind the scenes, but they were a massive undertaking that is core to the Permanent mission so we wanted to share a little bit about what got done, why, and what we thought about when we designed the new architecture.
The Before Times
Last year the Permanent code base was having trouble with uploads. This was a serious problem because uploading content for permanent storage is kind of an important part of what people want to do when they sign up for Permanent!
The symptoms were clear enough: user uploads would sometimes fail to process correctly, occasionally even silently. In some cases that meant certain files never made it to permanent storage! The situation needed quick attention, so the team got to work with a two-phase plan:
Phase 1: Patch the current system to make it more reliable right away, and expose errors more effectively so the engineering team can respond more quickly if something seems wrong.
Phase 2: Design a new system for the long term… one in which each file makes it into storage right away before anything else can possibly go wrong.
Understanding the Old
Since Permanent is all about legacy, let’s start with a quick description of the legacy system. Permanent’s code base is divided into a few core components:
- The front end web application itself, written in Angular, packaged up, and hosted on S3.
- An internal API, written in PHP and running on AWS, which is called directly by the front end. It accepts uploaded data and serves up the data you see on your screen.
- A series of processes, written in PHP, that are triggered by the API via a message queue. We call these “Task Runners”. These processes are where the “permanence” of Permanent comes from: they route data around behind the scenes to make sure it is safe and secure (there’s a rough sketch of this pattern right after this list).
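To make that pattern concrete: the API and task runners are PHP in real life, but the shape of "trigger a task runner via a message queue" is simple enough to sketch in a few lines of TypeScript. The queue technology (SQS here), the queue URL, and the task names below are all illustrative assumptions, not Permanent's actual code.

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

// Illustrative values only; Permanent's real queue URL and task names are not shown here.
const TASK_QUEUE_URL =
  "https://sqs.us-east-1.amazonaws.com/123456789012/task-runner-queue";

const sqs = new SQSClient({ region: "us-east-1" });

// The API never does the heavy lifting itself. It just drops a message on the
// queue describing the work, and a task runner picks it up asynchronously.
export async function enqueueTask(
  taskType: "thumbnail" | "convert" | "metadata",
  recordId: number
): Promise<void> {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: TASK_QUEUE_URL,
      MessageBody: JSON.stringify({ taskType, recordId }),
    })
  );
}
```

In other words, a call like `enqueueTask("thumbnail", 42)` is all the API has to do; the actual work happens later, somewhere else.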
Up until this week, there was also an essential little service called “Uploader”.
Uploader was a separate, simple, service that had two jobs: (1) accepting files directly and (2) telling the task runners to process those files.
This meant the original upload process went something like this:
1. The user agent tells the API it wants to upload a file. This generates a new record in the Permanent database.
2. The user agent then sends the file to the Uploader service, which stores it temporarily on an ephemeral drive (in other words, the file is still in transit).
3. The Uploader service kicks off some processes: thumbnail generation, file format conversion, metadata extraction, and other database updates.
4. Once all of those processes finish successfully, and ONLY once those processes finish successfully, the file is put in its permanent storage location.
If you are an engineer you might see the trouble here… What happens if something goes wrong in step 2 or 3? Suddenly the file never makes it to its safe, sound, cryogenically frozen storage! A recipe for disaster!
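To make that failure window concrete, here's a minimal sketch of the old ordering. This is not Permanent's actual PHP; the function names are hypothetical stand-ins with stubbed-out bodies. The only thing that matters is the order of operations.

```typescript
// Hypothetical stand-ins for the real processing steps. In the old system each
// of these ran against a temp file on an ephemeral drive, and each could fail.
async function generateThumbnail(tempPath: string): Promise<void> { /* ... */ }
async function convertFileFormats(tempPath: string): Promise<void> { /* ... */ }
async function extractMetadata(tempPath: string): Promise<void> { /* ... */ }
async function copyToPermanentStorage(tempPath: string): Promise<void> { /* ... */ }

// Steps 3 and 4 from the list above: processing first, durable storage last.
export async function processUploadTheOldWay(tempPath: string): Promise<void> {
  await generateThumbnail(tempPath);      // any of these can throw...
  await convertFileFormats(tempPath);     // ...or the process can crash...
  await extractMetadata(tempPath);        // ...or the ephemeral drive can vanish...
  await copyToPermanentStorage(tempPath); // ...before the file is ever safe.
}
```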
Designing the New
The key to making uploads live up to the promise of Permanent was to design a new process where all uploaded files were put into their final, safe location as soon as possible.
The way we see it, there’s nothing sooner than “as the first step.”
The new flow relies on Amazon S3, a popular cloud storage service with built-in redundancy and file protection, as the first destination for any upload. Specifically, the user agent uploads files directly to S3, so they never touch a Permanent server until AFTER they have arrived in a safe location.
To help with this, we built a little microservice whose job is to authenticate with Amazon and reserve the storage space. If you are writing software that involves user uploads, you may want to consider spinning up a copy yourself.
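To give a flavor of what such a service looks like, here is a minimal sketch in TypeScript using Express and the AWS SDK. This is not the actual open source uploader service's code; the endpoint path, bucket name, and key scheme are made up for illustration.

```typescript
import express from "express";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

// Hypothetical bucket and paths; the real service's interface may differ.
const UPLOAD_BUCKET = "example-upload-bucket";
const s3 = new S3Client({ region: "us-east-1" });

const app = express();
app.use(express.json());

// The whole job: authenticate with Amazon (via the SDK's credentials) and hand
// back a short-lived, pre-signed URL that reserves a specific key in S3.
app.post("/presign", async (req, res) => {
  const { fileName, contentType } = req.body;
  const key = `pending/${Date.now()}-${fileName}`;
  const command = new PutObjectCommand({
    Bucket: UPLOAD_BUCKET,
    Key: key,
    ContentType: contentType,
  });
  // Valid for 15 minutes; only the holder of this URL can PUT to this key.
  const url = await getSignedUrl(s3, command, { expiresIn: 900 });
  res.json({ url, key });
});

app.listen(3000);
```

The nice property of this design is that the service never touches the file itself; it only hands out short-lived permission to write one specific object.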
The flow looks like this now:
1. The user agent tells the API it wants to upload a file. This generates a signed request to S3 (a pre-signed, time-limited credential that the user agent can pass along to Amazon).
2. The user agent uploads to S3 *directly*. After the file safely arrives in S3, and ONLY after that, the user agent tells the Permanent API that the upload is complete. This is the moment when database entries are generated for the new record. The file is already way safer than before, and there's no risk that the system will be confused about the state of its existence!
3. The Permanent API immediately puts copies of the file in various permanent locations, and tells the user that the file upload is complete.
4. Only then (after the file is safe and sound, and the user has been told of this great victory) does the Permanent API go on to trigger the various processing task runners.
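From the browser's point of view, the whole thing boils down to three requests. The endpoint paths and response shapes below are invented for the sake of the sketch; Permanent's real API naturally differs.

```typescript
// A browser-side sketch of the new flow, using hypothetical API endpoints.
async function uploadFile(file: File): Promise<void> {
  // Ask the API for a signed S3 request.
  const presignResponse = await fetch("/api/presign", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ fileName: file.name, contentType: file.type }),
  });
  const { url, key } = await presignResponse.json();

  // Upload the bytes straight to S3. No application server touches the file in transit.
  await fetch(url, {
    method: "PUT",
    headers: { "Content-Type": file.type },
    body: file,
  });

  // Only AFTER S3 has the file, tell the API so it can create the database
  // record, copy the file to its permanent locations, and queue up processing.
  await fetch("/api/register-upload", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ key, fileName: file.name }),
  });
}
```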
The result is an upload process where the promise of permanence is met first, rather than relying on the successful completion of complex operations (which will ALWAYS fail on occasion, no matter how well tested a system is).
And the world even has a nifty little open source uploader microservice to boot!
I didn't make note of when I first saw the changes roll out, but as an end user I certainly noticed the difference in how uploads look and how quickly they move.
Thanks!