Just Eat have somewhere around 800 components and about 95 engineering teams at the time of writing.
Keeping the beast running, and its customers fed, is pretty important since we make most of our money from orders. If you can't place an order, we can't make money, so uptime and reliability are crucial to the business. Over the years we've added small rituals and strategies that we've found really helpful in keeping things running.
Let's start with how we deploy things into production in the first place.
Deployments, canaries, and release warranty
Our process for deploying changes to production is usually pretty dull, which is just the way we like it. When you merge a change into master it ticks through a very straightforward pipeline. Our CI builds the latest master code and then triggers our deployment pipeline, which deploys the code onto a couple of QA environments representing different configurations. Once it's on a QA environment, any automated tests run, and assuming they pass it does the same thing with a staging environment. The fun comes when it gets to production.
Once we've gone through QA and Staging, we have a button to press to start the production deployment. This gives us time to verify or double-check changes if we need to. When you hit the button to approve the production deployment, it goes into a canary state: it deploys into production and points 20% of the production traffic to the canary instances. The amount is actually configurable per component, but most teams use 20% as a starting point.
We then leave it running in canary for a while; how long depends on how critical the component is. Some critical components will sit in canary overnight or over the weekend. This lets us collect some data and monitor metrics to make sure things are working as we expect and that we definitely haven't broken anything. Finally, when we've hopped through any other canary steps (again, configured on a per-component basis), we promote the canary to full production. That redirects 100% of the traffic to our new instances and tears down the old deployments.
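The post doesn't describe the actual routing layer, but the idea behind a weighted canary split can be sketched in a few lines. This is a minimal illustration assuming simple per-request probabilistic weighting; `pick_pool` and the pool names are hypothetical, not part of Just Eat's real infrastructure.

```python
import random

def pick_pool(canary_weight: float) -> str:
    """Send a request to the canary pool with probability
    `canary_weight`, otherwise to the stable production pool."""
    return "canary" if random.random() < canary_weight else "stable"

# With a 20% weight, roughly one in five requests hits the canary.
random.seed(42)
counts = {"canary": 0, "stable": 0}
for _ in range(100_000):
    counts[pick_pool(0.20)] += 1
```

Promoting the canary is then just setting the weight to 1.0 before tearing down the old instances.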
We also practice a "release warranty": we actively monitor the deployment for a short time after it's been released to catch anything we may have missed. The warranty period is usually about 20 minutes, but again it's set per component, depending on how important that feature is to the platform.
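A warranty-style check can be as simple as watching an error-rate metric and flagging a rollback if it stays elevated. This is a hedged sketch, not the real tooling; `should_roll_back` and its thresholds are invented for illustration.

```python
from typing import Sequence

def should_roll_back(error_rates: Sequence[float],
                     threshold: float = 0.01,
                     consecutive: int = 3) -> bool:
    """Flag a release for rollback when the error rate stays above
    `threshold` for `consecutive` samples in a row during the
    warranty window."""
    streak = 0
    for rate in error_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

Requiring consecutive breaches means a single blip during the warranty window doesn't trigger a rollback, while a sustained spike does.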
All in all, seeing a release out the door can be done in just over an hour if your builds and tests are fast. Some legacy components take longer, but that's OK; we'd rather they were done carefully, with lots of time to spot issues.
That process is all well and good, but sometimes things do go wrong. When they do, we've got a pretty good process for getting through incidents.
PagerDuty's going mental, you're half asleep, and your spouse is pissed because "that bloody thing" woke them up at 2am as well. What happens next? Once you've taken the abuse and escaped the bedroom to let your partner go back to bed, the incident process kicks off. It's pretty well drilled into us before we go on-call for the first time, and it's not complicated either.
If the alerts are from automated checks then the on-call team member is likely the first responder and it's up to them to triage and figure out if it's "just a blip" or something that needs real attention. Usually we're trying to figure out if something is actually broken, or if it's just an overly sensitive check. If something is broken and it's going to have an impact they'll raise the alarm with the SOC (Service Operations Centre) team.
Alternatively, the issue is reported by the customer care teams or by customers, or the SOC has noticed it themselves. In that situation they confirm it's an issue, immediately page the relevant teams, and raise a Production Incident (PI) ticket in Jira.
If there is a legitimate issue, the SOC will raise the PI, which automatically creates a video call link and a room in Slack. Then we get to work. The next goals after triage are:
- Mitigate the issue: find a way to resolve it so it's not impacting customers or restaurants directly.
- Diagnose the issue: often we do this as part of mitigating the problem, but sometimes we don't get that luxury. When we can mitigate early we will, and we'll then diagnose once the fire is quenched.
- Learn and improve: once we understand what the issue was, we make our long-term fixes and share the knowledge where relevant.
Getting a PI assigned to you also earns you a spot at the daily leadership meeting. These happen every weekday morning and review all the incidents from the past 24 hours (and the weekend, if it's Monday). Our engineering leadership get a chance to ask questions about the incident; these are never to assign blame or shout at people, they're purely to understand the trends in issues across the organisation. It's also a place where the leadership get to understand any ongoing risks and give the relevant people a nudge when there are long-standing issues. I'd enjoy these meetings a lot more if they weren't the first thing in my calendar every time I have to attend one. Aside from that, they're great learning opportunities and they work really well to keep the leadership plugged into the nitty-gritty aspects of our tech.
The OpEx is a simple monthly all-hands meeting for all engineers. The host walks us through the month's operational performance against our reliability goals, and some teams present notable incidents from the past month. These are usually a 1-2 minute explanation of what happened, why, and what we've learned from it. These slots are a fantastic way for the organisation as a whole to learn about notable foot-guns. We wrap up with a couple of slots for major operational announcements, which cover a wide range of things, from upcoming audits to notes on change freezes for major holidays.
I think this meeting was originally set up by our (now ex-) CIO, Dave Williams, who left the organisation a few months ago. It's now hosted by a Director of Engineering who pulls together the content and finds speakers from incidents over the last month.
The best thing about OpEx is that you can sit there and soak everything in without any pressure to contribute if you're not on the lineup that month. The whole thing reinforces our no-blame approach to incidents, and we've learned an awful lot from these meetings.
We have a number of other processes and rituals, but those are the cornerstones of reliability at Just Eat. At the end of it all, we want to work on a stable and reliable platform that doesn't wake us up every night. Over the last 3-4 years we've made great strides forward in our operations and reliability, to the point where we're now removing many of the processes we put in place to get us to this stage. They've served their purpose, and we're able to move faster with higher reliability without some of them. I don't see these ones going away any time soon though, and that's probably for the best.