Can your business keep winning while it's down?
Let me tell you about a time when Duolingo won, even though it was down.
Learning new things is fun, and I strive to do so every day. I have used Duolingo for more than a year, but I only recently completed a 365-day streak.
I believe this is mainly because of two things:
Let's skip the first point, as we humans all have different goals/habits.
Improving on what we have:
We all know that an application outage is very difficult to handle. It is almost crippling. In such situations, users are usually redirected to a (hopefully branded) static site informing them that the application is down for maintenance, and maybe even when it is expected to be back online. Of course, it would be better to have only parts of the application down instead of the whole thing out of service. However, the "partially functional" subject is outside the scope of this post.
So how does Duolingo's architecture let Duolingo win even when it is down? Or maybe an even more important question: how do they keep track of my daily streak, even when the application is down for maintenance? Because if you don't know how important the daily streak is, have a look at this bride on her wedding day.
In architecture work we use a lot of good terms that translate to any other business or solution, inside and outside of the cloud. Some that touch on this subject are:
- Automatically recover from failure
- Design for higher availability
- Manage change through automation
All of these points fall under the Reliability pillar, one of the six pillars of the AWS Well-Architected Framework.
In short, a reliable solution can perform its intended function correctly and consistently when it's expected to.
When I dug into it, I was delighted to learn that implementing a streak-protecting feature for when the app is down is not that complicated at all!
So how does the improved Duolingo architecture help track my streak?
Originally Duolingo planned a microservice, but that introduced many additional dependencies for them. According to their engineering blog (listed below), they decided on Amazon S3 instead. It is simple to use, is highly available, and has a cool feature called server access logging.
Server access logging provides detailed records for the requests made to a bucket.
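As a rough illustration of how little setup this takes, server access logging can be switched on for a bucket with boto3's `put_bucket_logging` call. This is a minimal sketch and not Duolingo's actual setup: the helper names, bucket names, and prefix are hypothetical, and it assumes the target bucket already allows the S3 logging service to write to it.

```python
def build_logging_status(target_bucket: str, target_prefix: str) -> dict:
    """Build the BucketLoggingStatus payload that put_bucket_logging expects."""
    return {
        "LoggingEnabled": {
            "TargetBucket": target_bucket,  # bucket that receives the log files
            "TargetPrefix": target_prefix,  # key prefix for the delivered logs
        }
    }


def enable_access_logging(source_bucket: str, target_bucket: str, target_prefix: str) -> None:
    """Turn on server access logging for source_bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the payload builder also works without boto3 installed

    s3 = boto3.client("s3")
    s3.put_bucket_logging(
        Bucket=source_bucket,
        BucketLoggingStatus=build_logging_status(target_bucket, target_prefix),
    )
```

Calling `enable_access_logging("example-streak-bucket", "example-log-bucket", "streak-logs/")` would then ask S3 to deliver access log files for the first bucket into the second one under the given prefix.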
While the app is down, the users' clients request one or more files in S3, and every one of those requests ends up in the server access logs. In this way, they could group the requesting clients by which file they requested and have a rolling update for one group after another. These groups of clients would then be listed to have their streaks repaired once the Duolingo app is up and running again.
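To make the idea concrete, here is a small sketch of how access log records could be turned into per-file groups of users. It is an illustration under assumptions, not Duolingo's implementation: I assume each client identifies itself with a hypothetical `user_id` query parameter on the request, and the bucket name, keys, and sample records are made up. Only the leading fields of the S3 server access log format are parsed.

```python
import re
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit

# Leading fields of an S3 server access log record; trailing fields are ignored.
LOG_PATTERN = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request_uri>[^"]*)"'
)


def users_by_requested_file(log_lines):
    """Group user IDs by the object key they requested while the app was down."""
    groups = defaultdict(set)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        # Only object downloads count as "the client tried to reach us".
        if match is None or match["operation"] != "REST.GET.OBJECT":
            continue
        parts = match["request_uri"].split(" ")  # e.g. ["GET", "/key?user_id=42", "HTTP/1.1"]
        if len(parts) < 2:
            continue
        query = parse_qs(urlsplit(parts[1]).query)
        # "user_id" is a hypothetical parameter chosen for this sketch.
        for user_id in query.get("user_id", []):
            groups[match["key"]].add(user_id)
    return dict(groups)


# Made-up sample records in the shape of S3 server access logs.
SAMPLE_LOG = [
    'owner-id example-streak-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 - REQ1 '
    'REST.GET.OBJECT maintenance/group-a.json '
    '"GET /maintenance/group-a.json?user_id=42 HTTP/1.1" 200 - 113 113 7 - "-" "App/1.0" -',
    'owner-id example-streak-bucket [06/Feb/2019:00:01:10 +0000] 192.0.2.4 - REQ2 '
    'REST.GET.OBJECT maintenance/group-b.json '
    '"GET /maintenance/group-b.json?user_id=7 HTTP/1.1" 200 - 113 113 6 - "-" "App/1.0" -',
    'owner-id example-streak-bucket [06/Feb/2019:00:02:05 +0000] 192.0.2.3 - REQ3 '
    'REST.GET.OBJECT maintenance/group-a.json '
    '"GET /maintenance/group-a.json?user_id=42 HTTP/1.1" 200 - 113 113 5 - "-" "App/1.0" -',
    'owner-id example-streak-bucket [06/Feb/2019:00:03:00 +0000] 192.0.2.9 - REQ4 '
    'REST.PUT.OBJECT maintenance/group-a.json '
    '"PUT /maintenance/group-a.json HTTP/1.1" 200 - - 113 9 - "-" "Deploy/1.0" -',
]

streaks_to_repair = users_by_requested_file(SAMPLE_LOG)
```

Using a set per file means a client that retries many times during the outage still appears only once in the list of streaks to repair.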
That is so simple yet so creative! How cool is that?
Here are some of the articles from the Duolingo engineering team that helped in writing this article.
... and the second one talks about protecting the streak.
How would you have solved it? Also, Duolingo, looking back at this with all the insight your team has now, would you recommend a different solution if you were to do this again?