There’s a saying in the tech industry that the world’s digital infrastructure is held together by duct tape. From my limited experience working as a software engineer over the past few years, this seems to be true more often than not.
The global Microsoft computer outage in mid-July — caused by a bug in a widely-used cybersecurity program — is a perfect example of how one broken component can topple the precarious Jenga tower of code that keeps millions of vital systems running. It’s easy to forget how reliant we are on these machines until thousands of flights are grounded, emergency phone lines are knocked out of service for hours at a time, or doctors are forced to record patient data on pen and paper.
So, why did this happen?
In a nutshell, every computer that crashed was running a program built by the company CrowdStrike, which sells cybersecurity services to other businesses. Their program detects hacking attempts by monitoring the data that a computer transmits and receives over the Internet. Here’s the problem: in order to watch all of this data, CrowdStrike’s program needed access to a piece of code called the “kernel”, which exists on every computer and is necessary for it to function properly.
If you think of a computer like a car, the applications that you use day-to-day — like Zoom or Google Chrome — are the steering wheel and pedals. They’re the controls. The hardware, like the CPU or hard drive, is the engine and wheels. It’s the foundation. The kernel is everything in between: the gears, axles, brakes, hydraulics, and so on. Without the kernel, you wouldn’t be able to operate your computer, much like how you wouldn’t be able to drive your car if it didn’t have those intermediary bits of machinery.
Because the kernel manages all of a computer’s Internet connections, it’s the perfect place for a company like CrowdStrike to embed their security software. If their program detects a virus being downloaded, for example, it can alert the user or even shut down the affected connection by using the kernel. There weren’t any issues with this setup until CrowdStrike released a faulty update on the night of July 19th. Once installed, this update would crash their own security software, which would in turn crash the kernel — crippling the computer from the inside out. The error wasn’t caught until the update had been installed on millions of devices all over the world.
Over $5 billion was lost because of the resulting downtime. Ironically, it was a cybersecurity company that accidentally caused damage on the scale of the most destructive computer viruses of all time. These types of incidents — from outages to security breaches — happen surprisingly often, and they can affect almost anyone who uses the Internet.
It’s simultaneously fascinating and worrying to consider how much of your personal information you’ve disclosed to a variety of miscellaneous companies online. All of your passwords, credit card numbers, banking details, and other types of sensitive information are stored on servers somewhere in the world, so it’s worth asking exactly how these digital systems work, how they’re all linked together, and why they sometimes fail.
Internet troubles
When I was about six years old, my dad showed me something on the computer called the command line. The command line was this little black window that housed a small blinking cursor, a bit of text, and nothing else. It looked like the displays from one of those old boxy computers made in the 1970s, but in plain black and white instead of lime green. I tried clicking around it with my mouse, but to my dismay, nothing happened. My dad explained. Before the days of drag-and-drop icons and customizable wallpapers, you had to manually type in commands just to navigate your computer — hence the command line. This blew my mind.
People in the past must have been geniuses! I thought.
Thankfully, it turned out not to be too complicated. Within a few hours, my dad had taught me a few basic commands that I could use to fiddle around with the computer. If I typed in “mkdir”, I could summon a folder from nothing. If I wanted to delete it, I could simply type in “rm” and it would vanish just as quickly as I had created it. It was magic. Those commands were great fun, but of all the ones I learned, “ping” quickly became my favorite. Though it bears no relation to ping-pong, it’s interesting because it can help us understand an actual technique that hackers use to take down websites.
The way it works is simple. First, you type in the word “ping”, then follow it with any website you want, like “google.com”. Your computer will start sending little packets of data to Google’s servers to see if it gets a response, and the command line will report back each time it hears one. You can use this command to check if there’s a problem with your WiFi, but hackers can also use a similar technique to bring down entire servers.
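If you’re curious what this looks like in code, here’s a rough Python sketch of the same idea. It isn’t the real ping command (which speaks a lower-level protocol called ICMP); instead, it simply times how long a basic connection to the server takes, which captures the spirit of sending something and measuring how long the reply takes.

```python
# A rough stand-in for "ping": time how long it takes to reach a server.
# (The real command uses ICMP; this sketch just opens a normal connection.)
import socket
import time

def ping_like_check(host: str, port: int = 443, attempts: int = 4) -> None:
    for _ in range(attempts):
        start = time.time()
        try:
            # Open (and immediately close) a connection to the server.
            with socket.create_connection((host, port), timeout=2):
                elapsed_ms = (time.time() - start) * 1000
                print(f"Reply from {host}: time={elapsed_ms:.0f}ms")
        except OSError:
            print(f"Request to {host} timed out")
        time.sleep(1)  # pause between attempts, like the real command does

ping_like_check("google.com")
```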
They do this by scaling up the amount of data that their computer sends — by a lot. Responding to a couple packets of data per second isn’t a problem for most websites, but what about a couple hundred? A couple thousand? A couple million? Ultimately, the more data they send, the more of a problem it is for the website. If a website isn’t equipped to handle sudden floods of traffic, it can become overwhelmed and crash.
This is called a distributed denial-of-service attack, or a DDoS for short. DDoS attacks are what brought down massive sites like Netflix and Amazon in 2016, and they’re also what crashed most of Estonia’s web infrastructure back in 2007. At best, a successful attack might mean that you’re not able to watch movies for a couple hours. At worst, it could mean not being able to access your bank account online for days at a time.
You might be wondering why DDoS attacks don’t happen more frequently. In fact, they’re attempted much more often than you might think. Cloudflare, a company that protects websites from cyberattacks, reported that they mitigated almost nine million DDoS attacks in 2023. It’s worth asking how Cloudflare does this, because it can teach us a lot about how the internet works behind the scenes.
To explain, I’ll turn to an analogy that I hope many of you are familiar with: alcohol. Visiting a website is kind of like ordering a drink at a bar. When you type in a URL like “facebook.com”, you’re being served a cocktail of ones and zeroes that happen to present themselves in the form of a usable website. Your internet service provider, like Comcast or AT&T, is the bartender. Just like how a bartender knows to reach for the ginger beer when you order a Moscow mule, your internet service provider knows how to reach Facebook’s servers by using something called the Border Gateway Protocol, which works like a massive, constantly updated map of the routes between networks all over the world.
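As a small aside for the curious, the “looking up where Facebook lives” step is something you can watch happen from a few lines of Python. This is just an illustrative sketch: the name-to-address lookup shown here is handled by the Domain Name System, while BGP does its work deeper in the network, deciding how your request travels between providers.

```python
# Translating a website's name into a numeric address, the first step in
# knowing where to fetch the data from. The lookup below uses DNS; BGP then
# determines the route your request takes across networks to reach it.
import socket

hostname = "facebook.com"
address = socket.gethostbyname(hostname)
print(f"{hostname} lives at {address}")
```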
This arrangement works if the bartender only has to fill a couple orders at a time, but what happens if a hundred drunk college kids storm in and demand ten beers apiece? Either the bartender will start to panic or the bouncer will have to kick them out. In this case, Cloudflare is the bouncer. They stand in between you and your favorite websites to ensure that malicious or excessive traffic is stopped before it can do real damage. Although Cloudflare isn’t the only company that provides this service, they’re one of the most prominent players in the network security industry — estimated to monitor almost 16% of global web traffic alone.
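To make the bouncer a little more concrete, here’s a minimal Python sketch of one of the simplest checks a service like this might run. Real mitigation systems are far more sophisticated, but the core idea is the same: keep track of how much each visitor is asking for, and turn away anyone who floods the bar.

```python
# A minimal "bouncer": a sliding-window rate limiter.
# Track recent requests per visitor and refuse anyone over the limit.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 1            # how far back we look
MAX_REQUESTS_PER_WINDOW = 10  # how many requests we tolerate in that window

recent_requests = defaultdict(deque)  # visitor address -> recent request timestamps

def allow_request(visitor_ip: str) -> bool:
    now = time.time()
    timestamps = recent_requests[visitor_ip]
    # Forget anything older than the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS_PER_WINDOW:
        return False  # too many orders at once; turn them away
    timestamps.append(now)
    return True

# A flood from one address gets cut off after the first ten requests.
print([allow_request("203.0.113.7") for _ in range(15)])
```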
DDoS attacks are just the tip of the iceberg when it comes to hacking, and although cyberattacks are always a concern for software engineers, it’s common for systems to fail simply because of poor design. Take the botched launch of the U.S. HealthCare.gov website in 2013, for example. The website crashed almost immediately when nearly three million Americans tried to sign up for health insurance on the first day that it was available. In its entire first week, it’s estimated that only one percent of people interested in purchasing health coverage were able to enroll on HealthCare.gov.
Perhaps the biggest issue was that its development team grossly underestimated the number of people that would attempt to use the website. A day before it was launched, a test done by government contractors showed that HealthCare.gov slowed down when faced with just a few thousand concurrent users — nowhere close to the hundreds of thousands of people who would simultaneously access the site in the coming days. One reason was how inefficiently the site served users information from its database. Per a 2014 Time Magazine report:
HealthCare.gov had been constructed so that every time a user had to get information from the website's vast database, the website had to make what's called a query into that database. Well-constructed, high-volume sites, especially e-commerce sites, will instead store or assemble the most frequently accessed information in a layer above the entire database, called a cache.
Returning to our alcohol analogy, caching is akin to pre-batching cocktails before service. When customers ask for one of the pre-made drinks, the bartender can just pour it for them as opposed to mixing one from scratch every time. The same goes for websites: instead of needing to assemble random bits and pieces of data strewn across a database, the cache provides a neat repository of pre-arranged data for the website to quickly serve to a user.
The trade-off here is that the user won’t be served fresh data from the cache because it might only be updated, say, every hour. Having a large delay between updates might not be great for a website that needs to show real-time data, like a sports website covering every play of the Super Bowl, but it can work well for sites that consistently show the same information, like an FAQ page.
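Here’s a bare-bones Python sketch of that idea, with a made-up fetch_from_database function standing in for the slow query. Data younger than the cache’s lifetime is served instantly; anything older gets rebuilt from scratch, which is exactly the freshness trade-off described above.

```python
# A tiny cache with an expiry time. Fresh-enough entries are returned
# immediately; stale ones trigger the slow database query again.
import time

CACHE_LIFETIME_SECONDS = 3600  # refresh at most once an hour
_cache = {}                    # key -> (value, time it was stored)

def fetch_from_database(key):
    time.sleep(2)  # stand-in for an expensive database query
    return f"fresh data for {key}"

def get(key):
    if key in _cache:
        value, stored_at = _cache[key]
        if time.time() - stored_at < CACHE_LIFETIME_SECONDS:
            return value              # cache hit: no database work needed
    value = fetch_from_database(key)  # cache miss: do the expensive work
    _cache[key] = (value, time.time())
    return value

print(get("faq_page"))  # slow the first time...
print(get("faq_page"))  # ...instant the second time
```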
The problem is that adding a feature like caching takes work, and when there are tight deadlines, we programmers sometimes try to minimize the work that has to be done by taking shortcuts. Whether that means not documenting code, not testing code thoroughly, writing code that doesn’t scale well, or, in the case of HealthCare.gov, all three of these — we end up with something called “technical debt”.
It’s a figurative debt that accumulates over time from these poor decisions, and it implies that someone down the line will have to pay to absolve the sins that the software development team committed during the early stages of development. HealthCare.gov embodied technical debt in the most literal sense. Although its initial budget was just shy of $100 million, it ended up costing American taxpayers an astounding $1.7 billion. While the total cost cannot be entirely attributed to its initial technological failure, it’s likely that implementing a more scalable system architecture from the beginning could have saved the U.S. government significant time and money.
Understanding technical debt
So, why not just write good code from the start? you might ask. It can’t be that hard.
This is a completely fair question, and I think that the best way to answer it is to show you firsthand the thought process of a software engineer building an app from scratch. Technical debt isn’t always a result of laziness, but often comes from genuine oversights or from the use of outdated software in a rapidly evolving landscape of programming languages and tools.
I want you to imagine that the year is 2010 and that you’re a programmer working on the very first version of Instagram. The other engineers on the team have already laid the groundwork for you; they’ve written the code that allows users to upload photos, follow their friends, and scroll endlessly through their feed. Your only job for the week is to make sure that users can comment on each other’s posts.
Easy enough, you think.
A tinge of stress hits you as you realize that launch day is only a couple weeks away. Not only do you need to add this feature quickly, you also need to ensure that thousands of people will be able to post comments when the app is live. Uh oh. Even though it’s almost midnight and you’re still at the office, you decide to spend a minute at the whiteboard on your way out to plan everything you’re going to code the next day.
The user interface might be the first thought that comes to your mind. At minimum, you’ll need to add a little box under each post that allows a user to type in some text. When they hit the submit button, their comment should be saved to a database. Now, when a user views a post, you can retrieve all of that post’s comments from the database to display underneath it. This is where it gets tricky. There are a couple of ways that you can store comments in the database, and depending on what method you choose, you’ll either be in for a relatively pleasant development process later on or a world of pain.
For simplicity’s sake, we can think of the database as a big Excel spreadsheet. Each column will store some unique information about each post, like the date it was posted or the user who uploaded it. The database will need to store similar information about each comment, so we’ll split our spreadsheet into two pages: posts and comments.
The first way that you could store comments would be to add a column to the posts page called “comments”. That column would store a list of unique ID numbers, where each ID corresponds to a different comment. These IDs function a bit like phone numbers: every number in your contacts corresponds to a different person. When you text one of your friends to see if they’re in town, you’re using their unique number to reach exactly the right person, much like how ID numbers allow the database to look up information that’s stored across different pages.
In computer science, this method of storing comments would be classified as a one-to-many relationship: on one post, we store many comments. Let’s compare that to the inverse method: a many-to-one relationship, where each comment stores the ID of its corresponding post. That way, when we want to retrieve all of a post’s comments, we can just filter the comments in the database down to those that have a certain post ID attached to them.
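To make the second option concrete, here’s a toy sketch in Python using SQLite. The table and column names are invented for illustration and bear no relation to how Instagram actually stores anything; the important part is that each comment carries a post_id, so fetching a post’s comments is just a filter.

```python
# A toy version of the many-to-one design, sketched with SQLite.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts    (id INTEGER PRIMARY KEY, caption TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY,
                           post_id INTEGER REFERENCES posts(id),  -- each comment remembers its post
                           body TEXT);
""")

db.execute("INSERT INTO posts (id, caption) VALUES (1, 'first post!')")
db.executemany("INSERT INTO comments (post_id, body) VALUES (?, ?)",
               [(1, "nice"), (1, "love it"), (1, "cool")])

# Retrieving a post's comments is just a filter on post_id. (The first design
# would instead cram a growing list of comment IDs into one column on posts.)
rows = db.execute("SELECT body FROM comments WHERE post_id = ?", (1,)).fetchall()
print(rows)  # [('nice',), ('love it',), ('cool',)]
```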
But wait, you say. Isn’t the second method slower?
After all, we would need to filter through every single comment that has ever been posted — every time a user views a post. This is a valid concern, but consider the following scenario if we use a one-to-many relationship. Suppose a hypothetical user by the name of Hailey Welch posts an unexpectedly hilarious video on her account. The post ends up going viral, and by the next morning, over a million people have commented on it. In our Excel database, roughly one million comment IDs will be stored in a single bloated row. Not only will this be extraordinarily cumbersome to work with as a software engineer, it will drastically slow down the rate at which we can transfer information from the database to our app.
By storing comment IDs in a single list, we’ve lost the ability to break it into smaller chunks that can be loaded sequentially, forcing the database to load all one million comments at once. In other words, loading data using this method is like trying to eat a footlong Subway sandwich in one gigantic bite instead of patiently nibbling at it like a normal person.
Using a many-to-one relationship saves us from this problem. Because comments are stored modularly in their own page instead of in one inseparable list, we can ask the database to only send over, say, ten comments at a time. This technique is called “lazy loading”. If you’ve ever scrolled to the bottom of an Instagram comment section, you’ve probably noticed there’s a slight delay before the next comments appear. This is because the website needs a moment to pull the next batch of comments from the database, and it will get very nervous if you keep swiping repeatedly while it tries to do this. If you ever become frustrated because it’s taking too long, please know that the website has had a long day and is trying its best.
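Sticking with the toy SQLite sketch from earlier, lazy loading amounts to asking the database for one small, ordered slice of the comments at a time rather than all of them at once. The page size and function name here are, again, just for illustration.

```python
# Lazy loading: fetch one small batch of comments per request.
PAGE_SIZE = 10

def load_comments(db, post_id, page):
    return db.execute(
        "SELECT body FROM comments WHERE post_id = ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (post_id, PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()

# Each swipe at the bottom of the comment section maps to the next page.
first_batch = load_comments(db, post_id=1, page=0)   # comments 1-10
second_batch = load_comments(db, post_id=1, page=1)  # comments 11-20
```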
In this example, we’ve barely scratched the surface of how commenting is realistically implemented on massive social media sites, but the point is to emphasize the importance and difficulty of getting the foundations of a system right. As we saw with the HealthCare.gov website, one wrong technical decision at a fundamental level can have disastrous consequences. Ideally, no user should notice anything amiss, but this is much easier said than done.
Working on software at a large scale is like building a thousand-piece jigsaw puzzle. It can be a meticulous, slow, frustrating process, but I think that’s also where many programmers find the attraction. There's a certain fun in studying each little piece, flipping it over and over again to find the correct orientation, and sometimes, scrapping the whole thing and just starting from scratch. Every piece of software in existence has been built on this principle of constant revision, and it’s remarkable that despite the occasional hiccup, the systems that we use day-to-day hold up as well as they do. As long as there are people writing code, there will be tech debt; it’s the inevitable cost of living in an age as technologically advanced as ours. Tech debt may never be completely eliminated, but, like that proverbial duct tape, it’s part of what holds up both the necessities and the luxuries of life in the modern world.
View our sources for this piece here.