1. Integration Points
Every integration point will eventually fail in some way, and you need to be prepared for that failure.
Integration point failures take several forms, ranging from various network errors to semantic errors. You will not get nice error responses delivered through the defined protocol; instead, you’ll see some kind of protocol violation, slow response, or outright hang.
Debugging integration point failures usually requires peeling back a layer of abstraction. Failures are often difficult to debug at the application layer because most of them violate the high-level protocols. Packet sniffers and other network diagnostics can help.
Failure in a remote system quickly becomes your problem, usually as a cascading failure when your code isn’t defensive enough.
Defensive programming via Circuit Breaker, Timeouts, Decoupling Middleware, and Handshaking will all help you avoid the dangers of integration points.
2. Chain Reactions
Chain reactions are sometimes caused by blocked threads. This happens when all the request-handling threads in an application get blocked and that application stops responding. Incoming requests will get distributed out to the applications on other servers in the same layer, increasing their chance of failure.
What effect could a chain reaction have on the rest of the system? Well, for one thing, a chain reaction failure in one layer can easily lead to a cascading failure in a calling layer.
Remember This
- Recognize that one server down jeopardizes the rest.
A chain reaction happens because the death of one server makes the others pick up the slack. The increased load makes them more likely to fail. A chain reaction will quickly bring an entire layer down. Other layers that depend on it must protect themselves, or they will go down in a cascading failure.
- Hunt for resource leaks.
Most of the time, a chain reaction happens when your application has a memory leak. As one server runs out of memory and goes down, the other servers pick up the dead one’s burden. The increased traffic means they leak memory faster.
- Hunt for obscure timing bugs.
Obscure race conditions can also be triggered by traffic. Again, if one server goes down to a deadlock, the increased load on the others makes them more likely to hit the deadlock too.
- Use Autoscaling.
In the cloud, you should create health checks for every autoscaling group. The scaler will shut down instances that fail their health checks and start new ones. As long as the scaler can react faster than the chain reaction propagates, your service will be available.
- Defend with Bulkheads.
Partitioning servers with Bulkheads can prevent chain reactions from taking out the entire service, though they won’t help the callers of whichever partition does go down. Use Circuit Breaker on the calling side for that.
3. Cascading Failures
A cascading failure occurs when a crack in one layer triggers a crack in a calling layer.
An obvious example is a database failure. If an entire database cluster goes dark, then any application that calls the database is going to experience problems of some kind. What happens next depends on how the caller is written. If the caller handles it badly, then the caller will also start to fail, resulting in a cascading failure.
Cascading failures often result from resource pools that get drained because of a failure in a lower layer. Integration points without timeouts are a surefire way to create cascading failures.
In one case, the calling layer was using 100 percent of its CPU making calls to the lower layer and logging failures in calls to the lower layer. A Circuit Breaker would really have helped here.
Remember This
- Stop cracks from jumping the gap.
A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. Your system surely calls out to other enterprise systems; make sure you can stay up when they go down.
- Scrutinize resource pools.
A cascading failure often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return. The threads that get the connections block forever; all other threads get blocked waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource.
- Defend with Timeouts and Circuit Breaker.
A cascading failure happens after something else has already gone wrong. Circuit Breaker protects your system by avoiding calls out to the troubled integration point. Using Timeouts ensures that you can come back from a call out to the troubled point.
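The Circuit Breaker idea described above can be sketched in a few lines. This is a minimal, single-threaded illustration, not a production implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and their default values are all assumptions made for the example, and a real breaker would also need locking and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout = reset_timeout          # seconds, assumed default
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of tying up a thread on a troubled
                # integration point.
                raise CircuitOpenError("integration point marked as failing")
            # Reset timeout elapsed: allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            trial_failed = self.opened_at is not None
            if trial_failed or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open)
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The wrapped `func` should itself use a timeout. The breaker only avoids making new calls; it cannot rescue a call that has already hung.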
4. Users
Remember This
- Users consume memory.
Each user’s session requires some memory. Minimize that memory to improve your capacity. Use a session only for caching so you can purge the session’s contents if memory gets tight.
- Users do weird, random things.
Users in the real world do things that you won’t predict (or sometimes understand). If there’s a weak spot in your application, they’ll find it through sheer numbers. Test scripts are useful for functional testing but too predictable for stability testing. Look into fuzzing toolkits, property-based testing, or simulation testing.
- Malicious users are out there.
Become intimate with your network design; it should help avert attacks. Make sure your systems are easy to patch—you’ll be doing a lot of it. Keep your frameworks up-to-date, and keep yourself educated.
- Users will gang up on you.
Sometimes they come in really, really big mobs. When Taylor Swift tweets about your site, she’s basically pointing a sword at your servers and crying, “Release the legions!” Large mobs can trigger hangs, deadlocks, and obscure race conditions. Run special stress tests to hammer deep links or hot URLs.
5. Blocked Threads
The majority of system failures I have dealt with do not involve outright crashes. The process runs and runs but does nothing because every thread available for processing transactions is blocked waiting on some impossible outcome.
Metrics can reveal problems quickly too. Counters like “successful logins” or “failed credit cards” will show problems long before an alert goes off.
Remember This
- Recall that the Blocked Threads antipattern is the proximate cause of most failures.
Application failures nearly always relate to Blocked Threads in one way or another, including the ever-popular “gradual slowdown” and “hung server.” The Blocked Threads antipattern leads to the Chain Reactions and Cascading Failures antipatterns.
- Scrutinize resource pools.
Like Cascading Failures, the Blocked Threads antipattern usually happens around resource pools, particularly database connection pools. A deadlock in the database can cause connections to be lost forever, and so can incorrect exception handling.
- Use proven primitives.
Learn and apply safe primitives. It might seem easy to roll your own producer/consumer queue: it isn’t. Any library of concurrency utilities has more testing than your newborn queue.
- Defend with Timeouts.
You cannot prove that your code has no deadlocks in it, but you can make sure that no deadlock lasts forever. Avoid infinite waits in function calls; use a version that takes a timeout parameter. Always use timeouts, even though it means you need more error-handling code.
- Beware the code you cannot see.
All manner of problems can lurk in the shadows of third-party code. Be very wary. Test it yourself. Whenever possible, acquire and investigate the code for surprises and failure modes. You might also prefer open source libraries to closed source for this very reason.
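As an illustration of the advice on resource pools and timeouts, here is a minimal sketch of a pool whose checkout never waits forever. It is built on the standard library's `queue.Queue`, following the "use proven primitives" advice; the `BoundedPool` and `PoolExhaustedError` names and the two-second default are assumptions made for the example.

```python
import queue
from contextlib import contextmanager


class PoolExhaustedError(Exception):
    """Raised when no resource becomes free within the timeout."""


class BoundedPool:
    """Hypothetical pool that limits how long a caller waits to check out."""

    def __init__(self, resources):
        self._idle = queue.Queue()
        for resource in resources:
            self._idle.put(resource)

    @contextmanager
    def checkout(self, timeout=2.0):  # assumed default, in seconds
        try:
            # Queue.get with a timeout raises queue.Empty instead of
            # blocking the calling thread indefinitely.
            resource = self._idle.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhaustedError(
                f"no resource free after {timeout}s") from None
        try:
            yield resource
        finally:
            # Always return the resource, even if the caller raised, so
            # that sloppy exception handling cannot leak it.
            self._idle.put(resource)
```

A caller would write `with pool.checkout(timeout=0.5) as conn: ...` and handle `PoolExhaustedError` as an ordinary failure. This is the extra error-handling code the text says is worth paying for.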
Self-Denial Attacks
Self-denial is only occasionally a virtue in people and never in systems. A self-denial attack describes any situation in which the system—or the extended system that includes humans—conspires against itself.
The classic example of a self-denial attack is the email from marketing to a “select group of users” that contains some privileged information or offer.
Remember This
- Keep the lines of communication open.
Self-denial attacks originate inside your own organization, when people cause self-inflicted wounds by creating their own flash mobs and traffic spikes. You can aid and abet these marketing efforts and protect your system at the same time, but only if you know what’s coming. Make sure nobody sends mass emails with deep links. Send mass emails in waves to spread out the peak load. Create static “landing zone” pages for the first click from these offers. Watch out for embedded session IDs in URLs.
- Protect shared resources.
Programming errors, unexpected scaling effects, and shared resources all create risks when traffic surges. Watch out for Fight Club bugs, where increased front-end load causes exponentially increasing back-end processing.
- Expect rapid redistribution of any cool or valuable offer.
Anybody who thinks they’ll release a special deal for limited distribution is asking for trouble. There’s no such thing as limited distribution. Even if you limit the number of times a fantastic deal can be redeemed, you’ll still get crushed with people hoping beyond hope that they, too, can get a PlayStation Twelve for $99.
Scaling Effects
Anytime you have a “many-to-one” or “many-to-few” relationship, you can be hit by scaling effects when one side increases. For instance, a database server that holds up just fine when ten machines call it might crash miserably when you add the next fifty machines.
Shared Resources
The following figure should give you an idea of how the callers can put a hurting on the shared resource.
The most scalable architecture is the shared-nothing architecture. Each server operates independently, without need for coordination or calls to any centralized services. In a shared-nothing architecture, capacity scales more or less linearly with the number of servers.
The trouble with a shared-nothing architecture is that it might scale better at the cost of failover. For example, consider session failover. A user’s session resides in memory on an application server. When that server goes down, the next request from the user will be directed to another server. Obviously, we’d like that transition to be invisible to the user, so the user’s session should be loaded into the new application server. That requires some kind of coordination between the original application server and some other device. Perhaps the application server sends the user’s session to a session backup server after each page request. Maybe it serializes the session into a database table or shares its sessions with another designated application server. There are numerous strategies for session failover, but they all involve getting the user’s session off the original server. Most of the time, that implies some level of shared resources.
In some cases, the shared resource will be allocated for exclusive use while a client is processing some unit of work. In these cases, the probability of contention scales with the number of transactions processed by the layer and the number of clients in that layer. When the shared resource saturates, you get a connection backlog. When the backlog exceeds the listen queue, you get failed transactions. At that point, nearly anything can happen. It depends on what function the caller needs the shared resource to provide. Particularly in the case of cache managers (providing coherency for distributed caches), failed transactions lead to stale data or, worse, loss of data integrity.
Remember This
- Examine production versus QA environments to spot Scaling Effects.
You get bitten by Scaling Effects when you move from small one-to-one development and test environments to full-sized production environments. Patterns that work fine in small environments or one-to-one environments might slow down or fail completely when you move to production sizes.
- Watch out for point-to-point communication.
Point-to-point communication scales badly, since the number of connections increases as the square of the number of participants. Consider how large your system can grow while still using point-to-point connections—it might be sufficient. Once you’re dealing with tens of servers, you will probably need to replace it with some kind of one-to-many communication.
- Watch out for shared resources.
Shared resources can be a bottleneck, a capacity constraint, and a threat to stability. If your system must use some sort of shared resource, stress-test it heavily. Also, be sure its clients will keep working if the shared resource gets slow or locks up.
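To make the point-to-point claim concrete, the short sketch below counts the links in a full mesh. It assumes every participant keeps one connection to every other participant, which gives n(n-1)/2 links; the function name is made up for the example.

```python
def point_to_point_links(participants: int) -> int:
    """Links in a full mesh where each pair shares one connection."""
    return participants * (participants - 1) // 2


for n in (2, 10, 50, 100):
    print(n, point_to_point_links(n))
# 2 participants   -> 1 link
# 10 participants  -> 45 links
# 50 participants  -> 1225 links
# 100 participants -> 4950 links
```

Growing from 10 to 100 participants multiplies the participant count by 10 but the link count by 110, which is why a pattern that is comfortable in QA can fall over at production size.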
Unbalanced Capacities
Whether your resources take months, weeks, or seconds to provision, you can end up with mismatched ratios between different layers. That makes it possible for one tier or service to flood another with requests beyond its capacity. This especially holds when you deal with calls to rate-limited or throttled APIs!
In the illustration, the front-end service has 3,000 request-handling threads available. During peak usage, the majority of these will be serving product catalog pages or search results. Some smaller number will be in various corporate “telling” pages. A few will be involved in a checkout process.
So if you can’t build every service large enough to meet the potentially overwhelming demand from the front end, then you must build both callers and providers to be resilient in the face of a tsunami of requests. For the caller, Circuit Breaker will help by relieving the pressure on downstream services when responses get slow or connections get refused. For service providers, use Handshaking and Backpressure to inform callers to throttle back on the requests. Also consider Bulkheads to reserve capacity for high-priority callers of critical services.
What can you do if your service serves such unpredictable callers? Be ready for anything. First, use capacity modeling to make sure you’re at least in the ballpark. Three thousand threads calling into seventy-five threads is not in the ballpark. Second, don’t just test your system with your usual workloads. See what happens if you take the number of calls the front end could possibly make, double it, and direct it all against your most expensive transaction. If your system is resilient, it might slow down, or even start to fail fast if it can’t process transactions within the allowed time (see Fail Fast), but it should recover once the load goes down. Crashing, hung threads, empty responses, or nonsense replies indicate your system won’t survive and might just start a cascading failure. Third, if you can, use autoscaling to react to surging demand. It’s not a panacea, since it suffers from lag and can just pass the problem down the line to an overloaded platform service. Also, be sure to impose some kind of financial constraint on your autoscaling as a risk management measure.
Remember This
- Examine server and thread counts.
In development and QA, your system probably looks like one or two servers, and so do all the QA versions of the other systems you call. In production, the ratio might be more like ten to one instead of one to one. Check the ratio of front-end to back-end servers, along with the number of threads each side can handle in production compared to QA.
- Observe near Scaling Effects and users.
Unbalanced Capacities is a special case of Scaling Effects: one side of a relationship scales up much more than the other side. A change in traffic patterns—seasonal, market-driven, or publicity-driven—can cause a usually benign front-end system to suddenly flood a back-end system, in much the same way as a hot Reddit post or celebrity tweet causes traffic to suddenly flood websites.
- Virtualize QA and scale it up.
Even if your production environment is a fixed size, don’t let your QA languish at a measly pair of servers. Scale it up. Try test cases where you scale the caller and provider to different ratios. You should be able to automate this all through your data center automation tools.
- Stress both sides of the interface.
If you provide the back-end system, see what happens if it suddenly gets ten times the highest-ever demand, hitting the most expensive transaction. Does it fail completely? Does it slow down and recover? If you provide the front-end system, see what happens if calls to the back end stop responding or get very slow.
Dogpile
When a bunch of servers impose this transient load all at once, it’s called a dogpile. (“Dogpile” is a term from American football in which the ball-carrier gets compressed at the base of a giant pyramid of steroid-infused flesh.)
A dogpile can occur in several different situations:
- When booting up several servers, such as after a code upgrade and restart
- When a cron job triggers at midnight (or on the hour for any hour, really)
- When the configuration management system pushes out a change
Remember This
- Dogpiles force you to spend too much to handle peak demand.
A dogpile concentrates demand. It requires a higher peak capacity than you’d need if you spread the surge out.
- Use random clock slew to diffuse the demand.
Don’t set all your cron jobs for midnight or any other on-the-hour time. Mix them up to spread the load out.
- Use increasing backoff times to avoid pulsing.
A fixed retry interval will concentrate demand from callers on that period. Instead, use a backoff algorithm so different callers will be at different points in their backoff periods.
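The last two points can be sketched as follows. This is one possible approach under stated assumptions, not a prescribed algorithm: the backoff uses a randomized "full jitter" variant of exponential backoff, the clock slew derives a stable per-host minute from a hash of the host name, and all names and default values are invented for the example.

```python
import hashlib
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=random):
    """Yield one randomized delay (seconds) per retry attempt.

    The ceiling doubles on each attempt up to `cap`; picking a random
    point below the ceiling keeps callers from retrying in lockstep.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0, ceiling)


def slewed_minute(host_name: str) -> int:
    """Pick a stable pseudo-random minute (0-59) for a host's cron job.

    Hashing the host name spreads hosts across the hour while keeping
    each host's schedule the same from one run to the next.
    """
    digest = hashlib.sha256(host_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 60
```

A retry loop would sleep for each value from `backoff_delays()` in turn and give up when the generator is exhausted. A configuration template could call `slewed_minute()` with each host's name instead of hard-coding minute zero.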
Force Multiplier
Like a lever, automation allows administrators to make large movements with less effort. It’s a force multiplier.
Outage Amplification
On August 11, 2016, link aggregator Reddit.com suffered an outage. It was unavailable for approximately ninety minutes and had degraded service for about another ninety minutes.[11] In their postmortem, Reddit admins described a conflict between deliberate, manual changes and their automation platform:
First, the admins shut down their autoscaler service so that they could upgrade a ZooKeeper cluster.[12]
Sometime into the upgrade process, the package management system detected the autoscaler was off and restarted it.
The autoscaler came back online and read the partially migrated ZooKeeper data. The incomplete ZooKeeper data reflected a much smaller environment than was currently running.
The autoscaler decided that too many servers were running. It therefore shut down many application and cache servers. This is the start of the downtime.
Sometime later, the admins identified the autoscaler as the culprit. They overrode the autoscaler and started restoring instances manually. The instances came up, but their caches were empty. They all made requests to the database at the same time, which led to a dogpile on the database. Reddit was up but unusably slow during this time.
Finally, the caches warmed sufficiently to handle typical traffic. The long nightmare ended and users resumed downvoting everything they disagree with. In other words, normal activity resumed.
The most interesting aspect of this outage is the way it emerged from a conflict between the automation platform’s “belief” about the expected state of the system and the administrator’s belief about the expected state. When the package management system reactivated the autoscaler, it had no way to know that the autoscaler was expected to be down. Likewise, the autoscaler had no way to know that its source of truth (ZooKeeper) was temporarily unable to report the truth. Like HAL 9000, the automation systems were stuck between two conflicting sets of instructions.
A similar condition can occur with service discovery systems. A service discovery service is a distributed system that attempts to report on the state of many distributed systems to other distributed systems. When things are running normally, they work as shown in the figure.
The nodes of the discovery system gossip among themselves to synchronize their knowledge of the registered services. They run health checks periodically to see if any of the services’ nodes should be taken out of rotation. If a single instance of one of the services stops responding, then the discovery service removes that node’s IP address. No wonder they can amplify a failure.
One especially challenging failure mode occurs when a service discovery node is itself partitioned away from the rest of the network. As shown in the next figure, node 3 of the discovery service can no longer reach any of the managed services. Node 3 kind of panics. It can’t tell the difference between “the rest of the universe just disappeared” and “I’ve got a blindfold on.” But if node 3 can still gossip with nodes 1 and 2, then it can propagate its belief to the whole cluster. All at once, service discovery reports that zero services are available. Any application that needs a service gets told, “Sorry, but it looks like a meteor hit the data center. It’s a smoking crater.”
Consider a similar failure, but with a platform management service instead. This service is responsible for starting and stopping machine instances. If it forms a belief that everything is down, then it would necessarily start a new copy of every single service required to run the enterprise.
This situation arises mostly with “control plane” software. The “control plane” refers to software that exists to help manage the infrastructure and applications rather than directly delivering user functionality. Logging, monitoring, schedulers, scalers, load balancers, and configuration management are all parts of the control plane.
The common thread running through these failures is that the automation is not being used to simply enact the will of a human administrator. Rather, it’s more like industrial robotics: the control plane senses the current state of the system, compares it to the desired state, and effects changes to bring the current state into the desired state.
In the Reddit failure, ZooKeeper held a representation of the desired state. That representation was (temporarily) incorrect.
In the case of the discovery service, the partitioned node was not able to correctly sense the current state.
A failure can also result when the “desired” state is computed incorrectly and may be impossible or impractical. For example, a naive scheduler might try to run enough instances to drain a queue in a fixed amount of time. Depending on the individual jobs’ processing time, the number of instances might be “infinity.” That will smart when the Amazon Web Services bill arrives!
Controls and Safeguards
The United States has a government agency called the Occupational Safety and Health Administration (OSHA). We don’t see them too often in the software field, but we can still learn from their safety advice for robots.[13]
Industrial robots have multiple layers of safeguards to prevent damage to people, machines, and facilities. In particular, limiting devices and sensors detect when the robot is not operating in a “normal” condition. For example, suppose a robot arm has a rotating joint. There are limits on how far the arm is allowed to rotate based on the expected operating envelope. These will be much, much smaller than the full range of motion the arm could reach. The rate of rotation will be limited so it doesn’t go flinging car doors across an assembly plant if the grip fails. Some joints even detect if they are not working against the expected amount of weight or resistance (as might happen when the front falls off).
We can implement similar safeguards in our control plane software:
- If observations report that more than 80 percent of the system is unavailable, it’s more likely to be a problem with the observer than the system.
- Apply hysteresis. (See Governor.) Start machines quickly, but shut them down slowly. Starting new machines is safer than shutting old ones off.
- When the gap between expected state and observed state is large, signal for confirmation. This is equivalent to a big yellow rotating warning lamp on an industrial robot.
- Systems that consume resources should be stateful enough to detect if they’re trying to spin up infinity instances.
- Build in deceleration zones to account for momentum. Suppose your control plane senses excess load every second, but it takes five minutes to start a virtual machine to handle the load. It must make sure not to start 300 virtual machines because the high load persists.
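The safeguards above can be combined into a single decision function. The sketch below is only an illustration under stated assumptions: the function name, the parameter names, and every default except the 80 percent threshold are invented for the example, and a real control plane would keep more state than this.

```python
def plan_scale_change(desired, running, starting, unavailable_fraction,
                      max_instances=50, max_stop_per_cycle=1,
                      suspicious_fraction=0.8, confirm_gap=10):
    """Return an (action, count) pair for one control-loop cycle.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    # If nearly everything looks down, distrust the observer and do nothing.
    if unavailable_fraction > suspicious_fraction:
        return ("hold", 0)

    # Clamp runaway targets, including an "infinite" desired count.
    target = int(min(desired, max_instances))

    # Count instances that are still booting, so that a persistent load
    # signal does not trigger a fresh batch of starts on every cycle.
    gap = target - (running + starting)

    # A large gap between expected and observed state needs a human.
    if abs(gap) > confirm_gap:
        return ("confirm", gap)

    if gap > 0:
        return ("start", gap)  # start quickly
    if gap < 0:
        return ("stop", min(-gap, max_stop_per_cycle))  # shut down slowly
    return ("hold", 0)
```

For example, `plan_scale_change(10, running=4, starting=6, unavailable_fraction=0.0)` returns `("hold", 0)` because the six instances already booting cover the shortfall.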
Force MultiplierLike a lever, automation allows administrators to make large movements with less effort. It’s a force multiplier.
Outage Amplification
On August 11, 2016, link aggregator Reddit.com suffered an outage. It was unavailable for approximately ninety minutes and had degraded service for about another ninety minutes.[11] In their postmortem, Reddit admins described a conflict between deliberate, manual changes and their automation platform:
First, the admins shut down their autoscaler service so that they could upgrade a ZooKeeper cluster.[12]
Sometime into the upgrade process, the package management system detected the autoscaler was off and restarted it.
The autoscaler came back online and read the partially migrated ZooKeeper data. The incomplete ZooKeeper data reflected a much smaller environment than was currently running.
The autoscaler decided that too many servers were running. It therefore shut down many application and cache servers. This is the start of the downtime.
Sometime later, the admins identified the autoscaler as the culprit. They overrode the autoscaler and started restoring instances manually. The instances came up, but their caches were empty. They all made requests to the database at the same time, which led to a dogpile on the database. Reddit was up but unusably slow during this time.
Finally, the caches warmed sufficiently to handle typical traffic. The long nightmare ended and users resumed downvoting everything they disagree with. In other words, normal activity resumed.
The most interesting aspect of this outage is the way it emerged from a conflict between the automation platform’s “belief” about the expected state of the system and the administrator’s belief about the expected state. When the package management system reactivated the autoscaler, it had no way to know that the autoscaler was expected to be down. Likewise, the autoscaler had no way to know that its source of truth (ZooKeeper) was temporarily unable to report the truth. Like HAL 9000, the automation systems were stuck between two conflicting sets of instructions.
A similar condition can occur with service discovery systems. A service discovery service is a distributed system that attempts to report on the state of many distributed systems to other distributed systems. When things are running normally, they work as shown in the figure.
The nodes of the discovery system gossip among themselves to synchronize their knowledge of the registered services. They run health checks periodically to see if any of the services’ nodes should be taken out of rotation. If a single instance of one of the services stops responding, then the discovery service removes that node’s IP address. No wonder they can amplify a failure. One especially challenging failure mode occurs when a service discovery node is itself partitioned away from the rest of the network. As shown in the next figure, node 3 of the discovery service can no longer reach any of the managed services. Node 3 kind of panics. It can’t tell the difference between “the rest of the universe just disappeared” and “I’ve got a blindfold on.” But if node 3 can still gossip with nodes 1 and 2, then it can propagate its belief to the whole cluster. All at once, service discovery reports that zero services are available. Any application that needs a service gets told, “Sorry, but it looks like a meteor hit the data center. It’s a smoking crater.”
Consider a similar failure, but with a platform management service instead. This service is responsible for starting and stopping machine instances. If it forms a belief that everything is down, then it would necessarily start a new copy of every single service required to run the enterprise.
This situation arises mostly with “control plane” software. The “control plane” refers to software that exists to help manage the infrastructure and applications rather than directly delivering user functionality. Logging, monitoring, schedulers, scalers, load balancers, and configuration management are all parts of the control plane.
The common thread running through these failures is that the automation is not being used to simply enact the will of a human administrator. Rather, it’s more like industrial robotics: the control plane senses the current state of the system, compares it to the desired state, and effects changes to bring the current state into the desired state.
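The sense-compare-act cycle described here can be sketched in a few lines. This is only an illustration, not any particular platform's implementation; the `Action` type and the `reconcile` function are hypothetical names, and the state is reduced to a bare instance count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical corrective action emitted by the control loop."""
    kind: str   # "start" or "stop"
    count: int


def reconcile(desired: int, observed: int) -> list[Action]:
    """Compare desired and observed instance counts and return corrections.

    Note that this naive version trusts both inputs completely.
    """
    gap = desired - observed
    if gap > 0:
        return [Action("start", gap)]
    if gap < 0:
        return [Action("stop", -gap)]
    return []
```

For example, `reconcile(5, 3)` returns `[Action("start", 2)]`. The sketch acts on whatever it is told, so a wrong `desired` or a wrong `observed` value translates directly into a wrong action.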
In the Reddit failure, ZooKeeper held a representation of the desired state. That representation was (temporarily) incorrect.
In the case of the discovery service, the partitioned node was not able to correctly sense the current state.
A failure can also result when the “desired” state is computed incorrectly and may be impossible or impractical. For example, a naive scheduler might try to run enough instances to drain a queue in a fixed amount of time. Depending on the individual jobs’ processing time, the number of instances might be “infinity.” That will smart when the Amazon Web Services bill arrives!
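As a rough sketch of the scheduler example, the calculation below shows where the “infinity” comes from and one way to bound it. The function name, its parameters, and the default cap of 50 instances are assumptions made for illustration.

```python
import math


def instances_needed(queue_depth: int, seconds_per_job: float,
                     drain_deadline_s: float, max_instances: int = 50) -> int:
    """Instances required to drain the queue by the deadline, with a hard cap.

    max_instances is a hypothetical safety limit, not a recommended value.
    """
    if seconds_per_job <= 0:
        raise ValueError("seconds_per_job must be positive")
    if queue_depth <= 0:
        return 0
    # How many jobs one instance can finish before the deadline.
    jobs_per_instance = math.floor(drain_deadline_s / seconds_per_job)
    if jobs_per_instance < 1:
        # One job takes longer than the deadline, so the naive formula
        # queue_depth / jobs_per_instance would divide by zero.
        # One instance per job is the most that can help; cap that too.
        return min(queue_depth, max_instances)
    return min(math.ceil(queue_depth / jobs_per_instance), max_instances)
```

With 1,000 one-second jobs and a 100-second deadline this yields 10 instances. With 200-second jobs and the same deadline, the uncapped formula has no finite answer, and the cap is the only thing standing between the scheduler and the bill.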
Controls and Safeguards
The United States has a government agency called the Occupational Safety and Health Administration (OSHA). We don’t see them too often in the software field, but we can still learn from their safety advice for robots.[13]
Industrial robots have multiple layers of safeguards to prevent damage to people, machines, and facilities. In particular, limiting devices and sensors detect when the robot is not operating in a “normal” condition. For example, suppose a robot arm has a rotating joint. There are limits on how far the arm is allowed to rotate based on the expected operating envelope. These will be much, much smaller than the full range of motion the arm could reach. The rate of rotation will be limited so it doesn’t go flinging car doors across an assembly plant if the grip fails. Some joints even detect if they are not working against the expected amount of weight or resistance (as might happen when the front falls off).
We can implement similar safeguards in our control plane software:
- If observations report that more than 80 percent of the system is unavailable, it’s more likely to be a problem with the observer than the system.
- Apply hysteresis. (See Governor.) Start machines quickly, but shut them down slowly. Starting new machines is safer than shutting old ones off.
- When the gap between expected state and observed state is large, signal for confirmation. This is equivalent to a big yellow rotating warning lamp on an industrial robot.
- Systems that consume resources should be stateful enough to detect if they’re trying to spin up infinity instances.
- Build in deceleration zones to account for momentum. Suppose your control plane senses excess load every second, but it takes five minutes to start a virtual machine to handle the load. It must make sure not to start 300 virtual machines (one for each second of the five-minute startup) just because the high load persists.
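Several of these safeguards can be combined in a single decision function. The following is a minimal sketch under stated assumptions: the 80 percent threshold comes from the text, while `confirm_gap`, `max_stop_per_cycle`, and the `Decision` type are hypothetical values and names chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Hypothetical result of one control-plane evaluation."""
    action: str      # "none", "start", "stop", or "ask_human"
    count: int = 0
    reason: str = ""


def decide(total: int, unavailable: int, desired: int,
           suspect_ratio: float = 0.8, confirm_gap: int = 10,
           max_stop_per_cycle: int = 1) -> Decision:
    """Apply limiters before acting on the observed state."""
    # Safeguard: mass unavailability more likely means a blind observer.
    if total > 0 and unavailable / total > suspect_ratio:
        return Decision("ask_human", 0, "observer may be partitioned")
    running = total - unavailable
    gap = desired - running
    # Safeguard: a large gap needs confirmation (the yellow warning lamp).
    if abs(gap) > confirm_gap:
        return Decision("ask_human", abs(gap), "gap too large to act on")
    if gap > 0:
        # Hysteresis: starting is the safer direction, so do it at once.
        return Decision("start", gap, "scale up quickly")
    if gap < 0:
        # Hysteresis: shut down only a little in each cycle.
        return Decision("stop", min(-gap, max_stop_per_cycle),
                        "scale down slowly")
    return Decision("none")
```

With these defaults, `decide(100, 90, 100)` asks for a human instead of starting 90 machines, while `decide(10, 0, 5)` stops just one instance in this cycle even though five are surplus.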
Remember This
- Ask for help before causing havoc.
Infrastructure management tools can make very large impacts very quickly. Build limiters and safeguards into them so they won’t destroy your whole system at once.
- Beware of lag time and momentum.
Actions initiated by automation take time. That time is usually longer than a monitoring interval, so make sure to account for some delay in the system’s response to the action.
- Beware of illusions and superstitions.
Control systems sense the environment, but they can be fooled. They compute an expected state and a “belief” about the current state. Either can be mistaken.
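The effect of lag and momentum can be seen in a small simulation. This sketch assumes a control loop that runs once per tick and a fixed boot time; the `simulate` function and its parameters are hypothetical, and real platforms have far messier timing.

```python
def simulate(ticks: int, boot_ticks: int, needed: int,
             count_pending: bool) -> int:
    """Return how many instances the control loop starts in total.

    count_pending=False models a loop that forgets about instances
    that are still booting; True models one that accounts for them.
    """
    running = 1
    pending: list[int] = []   # tick at which each booting instance is ready
    started = 0
    for now in range(ticks):
        # Promote instances whose boot time has elapsed.
        running += sum(1 for ready in pending if ready <= now)
        pending = [ready for ready in pending if ready > now]
        in_flight = len(pending) if count_pending else 0
        shortfall = max(0, needed - running - in_flight)
        pending.extend([now + boot_ticks] * shortfall)
        started += shortfall
    return started
```

With one-second ticks, a 300-second boot time, and a need for just one more instance, `simulate(300, 300, 2, count_pending=False)` starts 300 instances, whereas the same call with `count_pending=True` starts 1.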
Slow Response
Remember This
- Slow Responses trigger Cascading Failures.
Upstream systems experiencing Slow Responses will themselves slow down and might be vulnerable to stability problems when the response times exceed their own timeouts.
- For websites, Slow Responses cause more traffic.
Users waiting for pages frequently hit the Reload button, generating even more traffic to your already overloaded system.
- Consider Fail Fast.
If your system tracks its own responsiveness, then it can tell when it’s getting slow. Consider sending an immediate error response when the average response time exceeds the system’s allowed time (or at the very least, when the average response time exceeds the caller’s timeout!).
- Hunt for memory leaks or resource contention.
Contention for an inadequate supply of database connections produces Slow Responses. Slow Responses also aggravate that contention, leading to a self-reinforcing cycle. Memory leaks cause excessive effort in the garbage collector, resulting in Slow Responses. Inefficient low-level protocols can cause network stalls, also resulting in Slow Responses.
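The Fail Fast advice above depends on the system tracking its own responsiveness. One way to sketch that is a moving average over recent response times; the `ResponseTimeGuard` class, the window size, and the threshold below are all assumptions made for illustration.

```python
from collections import deque


class ResponseTimeGuard:
    """Tracks recent response times and signals when to fail fast."""

    def __init__(self, allowed_s: float, window: int = 100):
        self.allowed_s = allowed_s
        # Only the most recent `window` samples are kept.
        self.samples: deque[float] = deque(maxlen=window)

    def record(self, elapsed_s: float) -> None:
        """Record how long one request took, in seconds."""
        self.samples.append(elapsed_s)

    def should_fail_fast(self) -> bool:
        """True when the average recent response time exceeds the limit."""
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.allowed_s
```

A request handler would call `should_fail_fast()` on entry and return an immediate error when it is true. Note one gap in this sketch: once every request is rejected, no new samples arrive, so a real implementation would need to let some requests through or age out old samples in order to recover.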
Unbounded Result Sets
Processing a row means adding a new data object to a collection. What happens when the database suddenly returns five million rows instead of the usual hundred or so? Unless your application explicitly limits the number of results it’s willing to process, it can end up exhausting its memory or spinning in a while loop long after the user loses interest.
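An explicit limit on the caller's side might look like the following sketch, which uses Python's standard `sqlite3` module. The `fetch_bounded` helper, the `TooManyRows` exception, and the default limit of 1,000 rows are hypothetical choices, not a prescribed design.

```python
import sqlite3


class TooManyRows(Exception):
    """Raised when a query returns more rows than the caller will accept."""


def fetch_bounded(conn: sqlite3.Connection, sql: str,
                  params: tuple = (), max_rows: int = 1000) -> list:
    """Run a query but refuse to materialize more than max_rows rows."""
    cursor = conn.execute(sql, params)
    # Ask for one row beyond the limit so overflow can be detected
    # without reading the entire result set into memory.
    rows = cursor.fetchmany(max_rows + 1)
    if len(rows) > max_rows:
        raise TooManyRows(f"query returned more than {max_rows} rows")
    return rows
```

Whether to raise an error or silently truncate is a design choice. Raising makes the surprise visible to the caller; truncating keeps the page rendering but hides the fact that data was dropped.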
Remember This
- Use realistic data volumes.
Typical development and test data sets are too small to exhibit this problem. You need production-sized data sets to see what happens when your query returns a million rows that you turn into objects. As a side benefit, you’ll also get better information from your performance testing when you use production-sized test data.
- Paginate at the front end.
Build pagination details into your service call. The request should include a parameter for the first item and the count. The reply should indicate (roughly) how many results there are.
- Don’t rely on the data producers.
Even if you think a query will never have more than a handful of results, beware: it could change without warning because of some other part of the system. The only sensible numbers are “zero,” “one,” and “lots,” so unless your query selects exactly one row, it has the potential to return too many. Don’t rely on the data producers to create a limited amount of data. Sooner or later, they’ll go berserk and fill up a table for no reason, and then where will you be?
- Put limits into other application-level protocols.
Service calls, RMI, DCOM, XML-RPC, and any other kind of request/reply call are vulnerable to returning huge collections of objects, thereby consuming too much memory.
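The pagination advice above (a first item and a count in the request, a rough total in the reply) can be sketched as follows. The `Page` type, the `get_page` function, and the `MAX_PAGE_SIZE` of 100 are hypothetical, and an in-memory list stands in for whatever the service really queries.

```python
from dataclasses import dataclass

# Hypothetical server-side ceiling; callers cannot ask for more than this.
MAX_PAGE_SIZE = 100


@dataclass(frozen=True)
class Page:
    items: list
    first: int          # index of the first item in this page
    approx_total: int   # roughly how many results exist overall


def get_page(source: list, first: int, count: int) -> Page:
    """Return at most MAX_PAGE_SIZE items starting at index `first`."""
    first = max(0, first)
    count = max(0, min(count, MAX_PAGE_SIZE))
    return Page(items=list(source[first:first + count]),
                first=first,
                approx_total=len(source))
```

A caller that asks for 1,000 items from a 250-item source gets 100 back, along with `approx_total == 250`, which is enough to decide whether to request another page. The key point is that the server enforces the ceiling instead of trusting the caller's `count`.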