Sunday, 25 December 2022

Time Management

Be selective - select most important work first

Biggest, hardest and most important task


-- tackle the most important task first thing in the morning

   -- plan, start and complete the task on time

   Decision - take a decision

   Discipline - do the things needed to accomplish what you decided

   Determination - persist until what you decided is accomplished


What are the most important results you have to get to be successful in your work life today?

What is the biggest task that you can complete that will make the biggest difference in your life right now?


Set the table - be very clear

Steps:

1. Decide what you want and discuss the goal and priority

2. Write it down on paper

3. Set a deadline for each goal

4. Make a list of tasks needed to complete the goals

5. Organise the list into a plan - create a checklist, prioritise and order it

6. Take action on your plan - execute it

7. Do something every day that moves you closer to your goals


-- Take a clean sheet of paper and make a list of 10 goals

   - choose the goal that will have the highest impact on your life and plan to work on it every day


Plan every day in advance

80/20 rule - of all tasks, the top 20% are the most important and deliver more benefit than the remaining 80%.

-- make a list of goals, activities, projects and responsibilities and then prioritise them based on the 80/20 rule

-- spend more time on the 20% that matters most


Consider the consequences

Long-term thinking improves short-term decisions

There is never enough time to do everything, but there is always enough time to complete the most important work


What are my highest-value activities?

What can I, and only I, do that, if done well, will make a real difference?

What is the most valuable use of my time right now?


-- Review your list of tasks regularly and find out which ones have the greatest consequences


Practise zero-based thinking - zero-based thinking gives you the opportunity to start over. Some things in life simply aren’t worth continuing.

Select one activity to abandon


ABCDE Method: place an A, B, C, D, or E next to each item on your list before you begin the first task.

An “A” item is defined as something that is very important, something that you must do. This is a task that will have serious positive or negative consequences if you do it or fail to do it, like visiting a key customer or finishing a report that your boss needs for an upcoming board meeting. These items are the frogs of your life.

If you have more than one A task, you prioritize these tasks by writing “A-1,” “A-2,” “A-3,” and so on in front of each item. Your A-1 task is your biggest, ugliest frog of all.

“Shoulds” versus “Musts”

A “B” item is defined as a task that you should do. But it has only mild consequences. These are the tadpoles of your work life. This means that someone may be unhappy or inconvenienced if you don’t do one of these tasks, but it is nowhere as important as an A task. Returning an unimportant telephone message or reviewing your e-mail would be a B task.

The rule is that you should never do a B task when an A task is left undone. You should never be distracted by a tadpole when a big frog is sitting there waiting to be eaten.

A “C” task is defined as something that would be nice to do but for which there are no consequences at all, whether you do it or not. C tasks include phoning a friend, having coffee or lunch with a coworker, and completing some personal business during work hours. These sorts of activities have no effect at all on your work life.

A “D” task is defined as something you can delegate to someone else. The rule is that you should delegate everything that someone else can do so that you can free up more time for the A tasks that only you can do.

An “E” task is defined as something that you can eliminate altogether, and it won’t make any real difference. This may be a task that was important at one time but is no longer relevant to you or anyone else. Often it is something you continue to do out of habit or because you enjoy it. But every minute that you spend on an E task is time taken away from a task or activity that can make a real difference in your life.

1. Review your work list right now and put an A, B, C, D, or E next to each task or activity. Select your A-1 job or project and begin on it immediately. Discipline yourself to do nothing else until this one job is complete.

2. Practice this ABCDE Method every day for the next month on every work or project list before you begin work. After a month, you will have developed the habit of setting and working on your highest-priority tasks, and your future will be assured!


Focus on the Key Result Areas

What one skill would have the greatest positive impact on my career?


Grade your key result areas and determine your one key skill

Get feedback from your boss, coworkers and family



      


Tuesday, 1 November 2022

 Stability Pattern

1. Timeouts
The timeout is a simple mechanism allowing you to stop waiting for an answer once you think it won’t come. 
Well-placed timeouts provide fault isolation—a problem in some other service or device does not have to become your problem.

Commercial software client libraries are notoriously devoid of timeouts. These libraries often do direct socket calls on behalf of the system. By hiding the socket from your code, they also prevent you from setting vital timeouts.
Any resource pool can be exhausted. It’s essential that any resource pool that blocks threads must have a timeout to ensure that calling threads eventually unblock, whether resources become available or not.

Also beware of language-level synchronization or mutexes. Always use the form that takes a timeout argument.
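A minimal Python sketch of the "form that takes a timeout argument", assuming a plain threading.Lock guards the shared resource. The helper name update_with_timeout and the half-second default are illustrative assumptions, not something from the notes above.

```python
import threading

def update_with_timeout(lock, action, timeout_s=0.5):
    """Run action() while holding lock, or give up after timeout_s seconds."""
    # acquire() with a timeout returns False instead of blocking forever.
    if not lock.acquire(timeout=timeout_s):
        raise TimeoutError(f"could not acquire lock within {timeout_s}s")
    try:
        return action()
    finally:
        lock.release()
```

The caller gets a TimeoutError it can handle or report, instead of a thread that stays blocked for as long as the lock holder is stuck.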

Use a generic gateway to provide the template for connection handling, error handling, query execution, and result processing. That way you only need to get it right in one place, and calling code can provide just the essential logic. Collecting this common interaction pattern into a single class also makes it easier to apply the Circuit Breaker pattern.

Timeouts are often found in the company of retries. Under the philosophy of “best effort,” the software attempts to repeat an operation that timed out. Immediately retrying an operation after a failure has a number of consequences, but only some of them are beneficial. If the operation failed because of any significant problem, it’s likely to fail again if retried immediately.

From the client’s perspective, making me wait longer is a very bad thing. If you cannot complete an operation because of some timeout, it is better for you to return a result. It can be a failure, a success, or a note that you’ve queued the work for later execution (if I should care about the distinction). In any case, just come back with an answer. Making me wait while you retry the operation might push your response time past my timeout. It certainly keeps my resources busy longer than needed.


Timeouts have natural synergy with circuit breakers. A circuit breaker can tabulate timeouts, tripping to the “off” state if too many occur.

Remember This

Apply Timeouts to Integration Points, Blocked Threads, and Slow Responses.

The Timeouts pattern prevents calls to Integration Points from becoming Blocked Threads. Thus, timeouts avert Cascading Failures.

Apply Timeouts to recover from unexpected failures.

When an operation is taking too long, sometimes we don’t care why…we just need to give up and keep moving. The Timeouts pattern lets us do that.

Consider delayed retries.

Most of the explanations for a timeout involve problems in the network or the remote system that won’t be resolved right away. Immediate retries are liable to hit the same problem and result in another timeout. That just makes the user wait even longer for her error message. Most of the time, you should queue the operation and retry it later.
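One way to picture "queue the operation and retry it later" is a small in-memory retry queue. This is only a sketch under assumptions: the class name RetryQueue, the 30-second default delay, and passing the current time in as a plain number are all hypothetical choices made to keep the example self-contained.

```python
import heapq
import itertools

class RetryQueue:
    """Holds timed-out operations and releases them only after a delay."""

    def __init__(self, delay_s=30.0):
        self.delay_s = delay_s
        self._heap = []                 # entries: (due_time, sequence, operation)
        self._seq = itertools.count()   # tie-breaker so operations are never compared

    def defer(self, operation, now):
        """Schedule operation to be attempted delay_s seconds after now."""
        heapq.heappush(self._heap, (now + self.delay_s, next(self._seq), operation))

    def run_due(self, now):
        """Run every operation whose delay has elapsed; return how many completed."""
        # Collect due operations first so a requeued one is not re-run in this pass.
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        completed = 0
        for operation in due:
            try:
                operation()
                completed += 1
            except TimeoutError:
                self.defer(operation, now)   # still failing: wait another full delay
        return completed
```

The caller that hit the timeout calls defer() and returns an answer straight away; a periodic job calls run_due() with the current time.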

2. Circuit Breaker

Principle: detect excess usage, fail first, and open the circuit.
The circuit breaker exists to allow one subsystem (an electrical circuit) to fail (excessive current draw, possibly from a short circuit) without destroying the entire system (the house). Furthermore, once the danger has passed, the circuit breaker can be reset to restore full function to the system.

In the normal “closed” state, the circuit breaker executes operations as usual. These can be calls out to another system, or they can be internal operations that are subject to timeout or other execution failure. If the call succeeds, nothing extraordinary happens. If it fails, however, the circuit breaker makes a note of the failure. Once the number of failures (or the frequency of failures, in more sophisticated cases) exceeds a threshold, the circuit breaker trips and “opens” the circuit, as shown in the following figure.

[Figure: circuit breaker state diagram showing the closed, open, and half-open states]

When the circuit is “open,” calls to the circuit breaker fail immediately, without any attempt to execute the real operation. After a suitable amount of time, the circuit breaker decides that the operation has a chance of succeeding, so it goes into the “half-open” state. In this state, the next call to the circuit breaker is allowed to execute the dangerous operation. Should the call succeed, the circuit breaker resets and returns to the “closed” state, ready for more routine operation. If this trial call fails, however, the circuit breaker returns to the open state until another timeout elapses.
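The closed, open, and half-open states described above can be sketched as a small Python class. This is an illustration under assumptions, not a production implementation: the names CircuitBreaker and CircuitOpenError, the defaults, and the choice to count consecutive failures are mine, and the sketch is not thread-safe.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock              # injectable so the timeout can be tested
        self.state = "closed"
        self.failures = 0               # consecutive failures (an assumption of this sketch)
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit is open; failing immediately")
            self.state = "half-open"    # timeout elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        # Success closes the circuit and clears the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or reaching the threshold, opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Wrapping each integration point in its own breaker instance keeps one failing dependency from tripping calls to the others.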

Circuit breakers are a way to automatically degrade functionality when the system is under stress. No matter the fallback strategy, it can have an impact on the business of the system. Therefore, it’s essential to involve the system’s stakeholders when deciding how to handle calls made when the circuit is open. For example, should a retail system accept an order if it can’t confirm availability of the customer’s items? What about if it can’t verify the customer’s credit card or shipping address? Of course, this conversation is not unique to the use of a circuit breaker, but discussing the circuit breaker can be a more effective way of broaching the topic than asking for a requirements document.


I like the Leaky Bucket pattern from Pattern Languages of Program Design 2 [VCK96]. It’s a simple counter that you can increment every time you observe a fault. In the background, a thread or timer decrements the counter periodically (down to zero, of course.) If the count exceeds a threshold, then you know that faults are arriving quickly.
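A rough Python sketch of that counter, assuming some timer or background thread (not shown) calls leak() periodically. The class and method names are illustrative, not taken from the cited book.

```python
import threading

class LeakyBucket:
    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0
        self._lock = threading.Lock()   # faults and the leak timer may run on different threads

    def record_fault(self):
        """Increment on each observed fault; True means faults are arriving too quickly."""
        with self._lock:
            self.count += 1
            return self.count > self.threshold

    def leak(self):
        """Periodic decrement; the counter never drops below zero."""
        with self._lock:
            self.count = max(0, self.count - 1)
```

A circuit breaker could trip whenever record_fault() returns True, so occasional faults drain away while a burst of faults opens the circuit.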

The state of the circuit breakers in a system is important to another set of stakeholders: operations. Changes in a circuit breaker’s state should always be logged, and the current state should be exposed for querying and monitoring. In fact, the frequency of state changes is a useful metric to chart over time; it is a leading indicator of problems elsewhere in the enterprise. Likewise, Operations needs some way to directly trip or reset the circuit breaker. The circuit breaker is also a convenient place to gather metrics about call volumes and response times.

Circuit breakers are effective at guarding against integration points, cascading failures, unbalanced capacities, and slow responses. They work so closely with timeouts that they often track timeout failures separately from execution failures.

Remember This

Don’t do it if it hurts.

Circuit Breaker is the fundamental pattern for protecting your system from all manner of Integration Points problems. When there’s a difficulty with Integration Points, stop calling it!

Use together with Timeouts.

Circuit Breaker is good at avoiding calls when Integration Points has a problem. The Timeouts pattern indicates that there’s a problem in Integration Points.

Expose, track, and report state changes.

Popping a Circuit Breaker always indicates something abnormal. It should be visible to Operations. It should be reported, recorded, trended, and correlated.


Bulkheads

In a ship, bulkheads are partitions that, when sealed, divide the ship into separate, watertight compartments. With hatches closed, a bulkhead prevents water from moving from one section to another. In this way, a single penetration of the hull does not irrevocably sink the ship. The bulkhead enforces a principle of damage containment.

You can employ the same technique. By partitioning your systems, you can keep a failure in one part of the system from destroying everything. Physical redundancy is the most common form of bulkheads. If there are four independent servers, then a hardware failure in one can’t affect the others. Likewise, if there are two application instances running on a server and one crashes, the other will still be running (unless, of course, the first one crashed because of some external influence that would also affect the second).

You can partition the threads inside a single process, with separate thread groups dedicated to different functions. For example, it’s often helpful to reserve a pool of request-handling threads for administrative use. That way, even if all request-handling threads on the application server are hung, it can still respond to admin requests—perhaps to collect data for postmortem analysis or a request to shut down.
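A small Python sketch of thread-pool partitioning, assuming concurrent.futures is an acceptable stand-in for an application server's thread groups. The pool sizes and the function names handle_request and admin_status are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Two separate pools: exhausting one cannot starve the other.
request_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="request")
admin_pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="admin")

def handle_request(release):
    """Stands in for a request that hangs until `release` is set."""
    release.wait()
    return "done"

def admin_status():
    """Administrative call that must keep working while requests hang."""
    return "alive"
```

If every worker in request_pool is stuck inside handle_request, a call such as admin_pool.submit(admin_status) still runs, because it is served by a thread the request pool cannot touch.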

Remember This

Save part of the ship.

The Bulkheads pattern partitions capacity to preserve partial functionality when bad things happen.

Pick a useful granularity.

You can partition thread pools inside an application, CPUs in a server, or servers in a cluster.

Consider Bulkheads particularly with shared services models.

Failures in service-oriented or microservice architectures can propagate very quickly. If your service goes down because of a Chain Reaction, does the entire company come to a halt? Then you’d better put in some Bulkheads.

3. Steady State

Compliance for log files:
Various compliance regimes require you to retain logs for years. Individual machines can’t possibly retain logs that long. Most of the machines don’t live that long, especially if you’re in the cloud! The best thing to do is get logs off of production machines as quickly as possible. Store them on a centralized server and monitor it closely for tampering.


The third edition of Roget’s Thesaurus offers the following definition for the word fiddling: “To handle something idly, ignorantly, or destructively.” It offers helpful synonyms such as fool, meddle, tamper, tinker, and monkey. Fiddling is often followed by the “ohnosecond”: that very short moment in time during which you realize that you have pressed the wrong key and brought down a server, deleted vital data, or otherwise damaged the peace and harmony of stable operations.

Every single time a human touches a server is an opportunity for unforced errors. I know of one incident in which an engineer, attempting to be helpful, observed that a server’s root disk mirror was out of sync. He executed a command to “resilver” the mirror, bringing the two disks back into synchronization. Unfortunately, he made a typo and synced the good root disk from the new, totally empty drive that had just been swapped in to replace a bad disk, thereby instantly annihilating the operating system on that server.

It’s best to keep people off production systems to the greatest extent possible. If the system needs a lot of crank-turning and hand-holding to keep running, then administrators develop the habit of staying logged in all the time. This situation probably indicates that the servers are “pets” rather than “cattle” and inevitably leads to fiddling. To that end, the system should be able to run at least one release cycle without human intervention. The logical extreme on the “no fiddling” scale is immutable infrastructure: it can’t be fiddled with!

“One release cycle” may be pretty tough if the system is deployed once a quarter. On the other hand, a microservice being continuously deployed from version control should be pretty easy to stabilize for a release cycle.

Unless the system is crashing every day (in which case, look for the presence of the stability antipatterns), the most common reason for logging in will probably be cleaning up log files or purging data.

Any mechanism that accumulates resources (whether it’s log files in the filesystem, rows in the database, or caches in memory) is like a bucket from a high-school calculus problem. The bucket fills up at a certain rate, based on the accumulation of data. It must be drained at the same rate, or greater, or it will eventually overflow. When this bucket overflows, bad things happen: servers go down, databases get slow or throw errors, response times head for the stars. The Steady State pattern says that for every mechanism that accumulates a resource, some other mechanism must recycle that resource. Let’s look at several types of sludge that can accumulate and how to avoid the need for fiddling.

Data Purging

It certainly seems like a simple enough principle. Computing resources are always finite; therefore, you cannot continually increase consumption without limit. Still, in the rush of excitement about rolling out a new killer application, the next great mission-critical, bet-the-company whatever, data purging always gets the short end of the stick. It certainly doesn’t demo as well as…well, anything demos better than purging, really. It sometimes seems that you’ll be lucky if the system ever runs at all in the real world. The notion that it’ll run long enough to accumulate too much data to handle seems like a “high-class problem”—the kind of problem you’d love to have.

Nevertheless, someday your little database will grow up. When it hits the teenage years—about two in human years—it’ll get moody, sullen, and resentful. In the worst case, it’ll start undermining the whole system (and it will probably complain that nobody understands it, too).

The most obvious symptom of data growth will be steadily increasing I/O rates on the database servers. You may also see increasing latency at constant loads.

Data purging is nasty, detail-oriented work. Referential integrity constraints in a relational database are half the battle. It can be difficult to cleanly remove obsolete data without leaving orphaned rows. The other half of the battle is ensuring that applications still work once the data is gone. That takes coding and testing.

There are few general rules here. Much depends on the database and libraries in use. RDBMS plus ORM tends to deal badly with dangling references, for example, whereas a document-oriented database won’t even notice.

As a consequence, data purging always gets left until after the first release is out the door. The rationale is, “We’ve got six months after launch to implement purging.” (Somehow, they always say “six months.” It’s kind of like a programmer’s estimate of “two weeks.”)

Of course, after launch, there are always emergency releases to fix critical defects or add “must-have” features from marketers tired of waiting for the software to be done. The first six months can slip away pretty quickly, but when that first release launches, a fuse is lit.

Another type of sludge you will commonly encounter is old log files.

Log Files

One log file is like one pile of cow dung—not very valuable, and you’d rather not dig through it. Collect tons of cow dung and it becomes “fertilizer.” Likewise, if you collect enough log files you can discover value.

Left unchecked, however, log files on individual machines are a risk. When log files grow without bound, they’ll eventually fill up their containing filesystem. Whether that’s a volume set aside for logs, the root disk, or the application installation directory (I hope not), it means trouble. When log files fill up the filesystem, they jeopardize stability. That’s because of the different negative effects that can occur when the filesystem is full. On a UNIX system, the last 5 to 10 percent (depending on the configuration of the filesystem) of space is reserved for root. That means an application will start getting I/O errors when the filesystem is 90 or 95 percent full. Of course, if the application is running as root, then it can consume the very last byte of space. On a Windows system, an application can always use the very last byte. In either case, the operating system will report errors back to the application.

What happens next is anyone’s guess. In the best-case scenario, the logging filesystem is separate from any critical data storage (such as transactions), and the application code protects itself well enough that users never realize anything is amiss. Significantly less pleasant, but still tolerable, is a nicely worded error message asking the users to have patience with us and please come back when we’ve got our act together. Several rungs down the ladder is serving a stack trace to the user.

Worse yet, the developers in one system I saw had added a “universal exception handler” to the servlet pipeline. This handler would log any kind of exception. It was reentrant, so if an exception occurred while logging an exception, it would log both the original and the new exception. As soon as the filesystem got full, this poor exception handler went nuts, trying to log an ever-increasing stack of exceptions. Because there were multiple threads, each trying to log its own Sisyphean exception, this application server was able to consume eight entire CPUs—for a little while, anyway. The exceptions, multiplying like Leonardo of Pisa’s rabbits, rapidly consumed all available memory. This was followed shortly by a crash.

Of course, it’s always better to avoid filling up the filesystem in the first place. Log file rotation requires just a few minutes of configuration.

In the case of legacy code, third-party code, or code that doesn’t use one of the excellent logging frameworks available, the logrotate utility is ubiquitous on UNIX. For Windows, you can try building logrotate under Cygwin, or you can hand roll a vbs or bat script to do the job. Logging can be a wonderful aid to transparency. Make sure that all log files will get rotated out and eventually purged, though, or you’ll eventually spend time fixing the tool that’s supposed to help you fix the system.
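For code that does use a logging framework, rotation is usually a few lines of setup. A minimal sketch using Python's standard logging.handlers.RotatingFileHandler is below; the logger name "app", the helper name, and the size and backup limits are placeholder assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler

def build_rotating_logger(path, max_bytes=10_000_000, backups=5):
    """Return a logger whose file is rolled at max_bytes, keeping `backups` old files."""
    logger = logging.getLogger("app")   # placeholder logger name
    logger.setLevel(logging.INFO)
    # Once path reaches max_bytes it is renamed to path.1, path.1 to path.2, and so on;
    # files beyond backupCount are deleted, so disk use stays bounded.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With these settings the logger can never hold more than roughly (backups + 1) times max_bytes on disk, which is the bound the Steady State pattern asks for.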

Log files on production systems have a terrible signal-to-noise ratio. It’s best to get them off the individual hosts as quickly as possible. Ship the log files to a centralized logging server, such as Logstash, where they can be indexed, searched, and monitored.

Between data in the database and log files on the disk, persistent data can find plenty of ways to clog up your system. Like a jingle from an old commercial, sludge stuck in memory clogs up your application.

In-Memory Caching

To a long-running server, memory is like oxygen. Cache, left untended, will suck up all the oxygen. Low memory conditions are a threat to both stability and capacity. 

If the number of possible keys has no upper bound, then cache size limits must be enforced and the cache needs some form of cache invalidation. The simplest mechanism is a time-based cache flush. You can also investigate least recently used (LRU) or working-set algorithms, but nine times out of ten, a periodic flush will do.
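A sketch of a cache that combines a size limit with a periodic time-based flush, assuming a single-threaded caller. The class name FlushingCache, the defaults, and the oldest-inserted eviction rule are illustrative assumptions rather than a recommendation from the text.

```python
import time

class FlushingCache:
    def __init__(self, max_entries=1000, flush_interval_s=300.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.flush_interval_s = flush_interval_s
        self.clock = clock              # injectable so the flush can be tested
        self._data = {}
        self._last_flush = clock()

    def _maybe_flush(self):
        # Time-based invalidation: drop everything once the interval has elapsed.
        if self.clock() - self._last_flush >= self.flush_interval_s:
            self._data.clear()
            self._last_flush = self.clock()

    def get(self, key, default=None):
        self._maybe_flush()
        return self._data.get(key, default)

    def put(self, key, value):
        self._maybe_flush()
        if key not in self._data and len(self._data) >= self.max_entries:
            # Size limit reached: evict the oldest inserted entry (dicts keep insertion order).
            del self._data[next(iter(self._data))]
        self._data[key] = value

    def __len__(self):
        return len(self._data)
```

Memory use is capped by max_entries no matter how many distinct keys arrive, and stale entries disappear at the next flush.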

Improper use of caching is the major cause of memory leaks, which in turn lead to horrors like daily server restarts. Nothing gets administrators in the habit of being logged onto production like daily (or nightly) chores.

Remember This

Avoid fiddling.

Human intervention leads to problems. Eliminate the need for recurring human intervention. Your system should run for at least a typical deployment cycle without manual disk cleanups or nightly restarts.

Purge data with application logic.

DBAs can create scripts to purge data, but they don’t always know how the application behaves when data is removed. Maintaining logical integrity, especially if you use an ORM tool, requires the application to purge its own data.

Limit caching.

In-memory caching speeds up applications, until it slows them down. Limit the amount of memory a cache can consume.

Roll the logs.

Don’t keep an unlimited amount of log files. Configure log file rotation based on size. If you need to retain them for compliance, do it on a nonproduction server.

Fail Fast

If slow responses are worse than no response, the worst must surely be a slow failure response. It’s like waiting through the interminable line at the DMV, only to be told you need to fill out a different form and go back to the end of the line. Can there be any bigger waste of system resources than burning cycles and clock time only to throw away the result?

If the system can determine in advance that it will fail at an operation, it’s always better to fail fast. That way, the caller doesn’t have to tie up any of its capacity waiting and can get on with other work.

How can the system tell whether it will fail? Do we need Deep Learning? Don’t worry, you won’t need to hire a cadre of data scientists.

It’s actually much more mundane than that. There’s a large class of “resource unavailable” failures. For example, when a load balancer gets a connection request but not one of the servers in its service pool is functioning, it should immediately refuse the connection. Some configurations have the load balancer queue the connection request for a while in the hopes that a server will become available in a short period of time. This violates the Fail Fast pattern.

The application or service can tell from the incoming request or message roughly what database connections and external integration points will be needed. The service can quickly check out the connections it will need and verify the state of the circuit breakers around the integration points. This is sort of the software equivalent of the chef’s mise en place—gathering all the ingredients needed to perform the request before it begins. If any of the resources are not available, the service can fail immediately, rather than getting partway through the work.

Another way to fail fast in a web application is to perform basic parameter-checking in the servlet or controller that receives the request, before talking to the database. This would be a good reason to move some parameter checking out of domain objects into something like a “Query object.”

Even when failing fast, be sure to report a system failure (resources not available) differently than an application failure (parameter violations or invalid state). Reporting a generic “error” message may cause an upstream system to trip a circuit breaker just because some user entered bad data and hit Reload three or four times.
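The ordering described above (check parameters, then check resources, then do the work) can be sketched in Python. The exception names, the required parameter names, and the resources_ready and do_work callables are all hypothetical stand-ins for whatever the real service uses.

```python
class ApplicationError(Exception):
    """Caller's problem: parameter violations or invalid state."""

class SystemUnavailableError(Exception):
    """System's problem: a required resource is not available."""

def place_order(params, resources_ready, do_work):
    # 1. Cheap parameter checks first, before touching any resource.
    missing = [name for name in ("customer_id", "sku") if not params.get(name)]
    if missing:
        raise ApplicationError(f"missing required parameters: {missing}")
    # 2. Verify connections and circuit breaker state before starting.
    if not resources_ready():
        raise SystemUnavailableError("required resources unavailable; failing fast")
    # 3. Only now do the expensive work.
    return do_work(params)
```

Because the two failures are distinct exception types, an upstream circuit breaker can count SystemUnavailableError and ignore ApplicationError caused by bad user input.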

The Fail Fast pattern improves overall system stability by avoiding slow responses. Together with timeouts, failing fast can help avert impending cascading failures. It also helps maintain capacity when the system is under stress because of partial failures.

Remember This

Avoid Slow Responses and Fail Fast.

If your system cannot meet its SLA, inform callers quickly. Don’t make them wait for an error message, and don’t make them wait until they time out. That just makes your problem into their problem.

Reserve resources, verify Integration Points early.

In the theme of “don’t do useless work,” make sure you’ll be able to complete the transaction before you start. If critical resources aren’t available—for example, a popped Circuit Breaker on a required callout—then don’t waste work by getting to that point. The odds of it changing between the beginning and the middle of the transaction are slim.

Use for input validation.

Do basic user input validation even before you reserve resources. Don’t bother checking out a database connection, fetching domain objects, populating them, and calling validate just to find out that a required parameter wasn’t entered.

Let it Crash

Remember This

Crash components to save systems.

It may seem counterintuitive to create system-level stability through component-level instability. Even so, it may be the best way to get back to a known good state.

Restart fast and reintegrate.

The key to crashing well is getting back up quickly. Otherwise you risk loss of service when too many components are bouncing. Once a component is back up, it should be reintegrated automatically.

Isolate components to crash independently.

Use Circuit Breakers to isolate callers from components that crash. Use supervisors to determine what the span of restarts should be. Design your supervision tree so that crashes are isolated and don’t affect unrelated functionality.

Don’t crash monoliths.

Large processes with heavy runtimes or long startups are not the right place to apply this pattern. Applications that couple many features into a single process are also a poor choice.

Wednesday, 19 October 2022

Stability antipatterns

1. Integration Points

Beware this necessary evil.

Every integration point will eventually fail in some way, and you need to be prepared for that failure.

Prepare for the many forms of failure.

Integration point failures take several forms, ranging from various network errors to semantic errors. You will not get nice error responses delivered through the defined protocol; instead, you’ll see some kind of protocol violation, slow response, or outright hang.

Know when to open up abstractions.

Debugging integration point failures usually requires peeling back a layer of abstraction. Failures are often difficult to debug at the application layer because most of them violate the high-level protocols. Packet sniffers and other network diagnostics can help.

Failures propagate quickly.

Failure in a remote system quickly becomes your problem, usually as a cascading failure when your code isn’t defensive enough.

Apply patterns to avert integration point problems.

Defensive programming via Circuit Breaker, Timeouts (see Timeouts), Decoupling Middleware, and Handshaking (see Handshaking) will all help you avoid the dangers of integration points.

2. Chain Reactions

Chain reactions are sometimes caused by blocked threads. This happens when all the request-handling threads in an application get blocked and that application stops responding. Incoming requests will get distributed out to the applications on other servers in the same layer, increasing their chance of failure.
What effect could a chain reaction have on the rest of the system? Well, for one thing, a chain reaction failure in one layer can easily lead to a cascading failure in a calling layer.

Remember This

Recognize that one server down jeopardizes the rest.

A chain reaction happens because the death of one server makes the others pick up the slack. The increased load makes them more likely to fail. A chain reaction will quickly bring an entire layer down. Other layers that depend on it must protect themselves, or they will go down in a cascading failure.

Hunt for resource leaks.

Most of the time, a chain reaction happens when your application has a memory leak. As one server runs out of memory and goes down, the other servers pick up the dead one’s burden. The increased traffic means they leak memory faster.

Hunt for obscure timing bugs.

Obscure race conditions can also be triggered by traffic. Again, if one server goes down to a deadlock, the increased load on the others makes them more likely to hit the deadlock too.

Use Autoscaling.

In the cloud, you should create health checks for every autoscaling group. The scaler will shut down instances that fail their health checks and start new ones. As long as the scaler can react faster than the chain reaction propagates, your service will be available.

Defend with Bulkheads.

Partitioning servers with Bulkheads can prevent chain reactions from taking out the entire service, though they won’t help the callers of whichever partition does go down. Use Circuit Breaker on the calling side for that.

3. Cascading Failures

A cascading failure occurs when a crack in one layer triggers a crack in a calling layer.
An obvious example is a database failure. If an entire database cluster goes dark, then any application that calls the database is going to experience problems of some kind. What happens next depends on how the caller is written. If the caller handles it badly, then the caller will also start to fail, resulting in a cascading failure.
Cascading failures often result from resource pools that get drained because of a failure in a lower layer. Integration points without timeouts are a surefire way to create cascading failures.
In one such failure, the calling layer was using 100 percent of its CPU making calls to the lower layer and logging the failures of those calls. A Circuit Breaker would really have helped here.

Remember This

Stop cracks from jumping the gap.

A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. Your system surely calls out to other enterprise systems; make sure you can stay up when they go down.

Scrutinize resource pools.

A cascading failure often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return. The threads that get the connections block forever; all other threads get blocked waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource.
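
A small Python sketch of a pool that limits how long a thread can wait to check out a resource, built on the standard library `queue.Queue`. The class names `BoundedPool` and `PoolExhausted` are made up for this example.

```python
import queue


class PoolExhausted(Exception):
    """No resource became free within the allowed wait."""


class BoundedPool:
    def __init__(self, resources, checkout_timeout_s):
        self._free = queue.Queue()
        for resource in resources:
            self._free.put(resource)
        self._timeout = checkout_timeout_s

    def checkout(self):
        try:
            # Wait at most checkout_timeout_s instead of blocking forever.
            return self._free.get(timeout=self._timeout)
        except queue.Empty:
            raise PoolExhausted(
                f"no resource free after {self._timeout}s"
            ) from None

    def checkin(self, resource):
        self._free.put(resource)
```

When the pool is drained, waiting threads get an error they can handle instead of joining the pile of blocked threads.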

Defend with Timeouts and Circuit Breaker.

A cascading failure happens after something else has already gone wrong. Circuit Breaker protects your system by avoiding calls out to the troubled integration point. Using Timeouts ensures that you can come back from a call out to the troubled point.
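
The sketch below shows one simple way a Circuit Breaker could be written in Python. It is an illustration under stated assumptions, not a reference implementation: the thresholds, the `CircuitBreaker` and `CircuitOpenError` names, and the injectable `clock` are all hypothetical. The wrapped function is expected to use its own timeout so that a hung call shows up as an exception.

```python
import time


class CircuitOpenError(Exception):
    """The breaker is open; the call was not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast without touching the troubled point.
                raise CircuitOpenError("circuit open; failing fast")
            # Reset period has passed: allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers spend no threads or connections on the failing integration point, which is what keeps the crack from jumping the gap.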

4. Users

Remember This

Users consume memory.

Each user’s session requires some memory. Minimize that memory to improve your capacity. Use a session only for caching so you can purge the session’s contents if memory gets tight.

Users do weird, random things.

Users in the real world do things that you won’t predict (or sometimes understand). If there’s a weak spot in your application, they’ll find it through sheer numbers. Test scripts are useful for functional testing but too predictable for stability testing. Look into fuzzing toolkits, property-based testing, or simulation testing.

Malicious users are out there.

Become intimate with your network design; it should help avert attacks. Make sure your systems are easy to patch—you’ll be doing a lot of it. Keep your frameworks up-to-date, and keep yourself educated.

Users will gang up on you.

Sometimes they come in really, really big mobs. When Taylor Swift tweets about your site, she’s basically pointing a sword at your servers and crying, “Release the legions!” Large mobs can trigger hangs, deadlocks, and obscure race conditions. Run special stress tests to hammer deep links or hot URLs.

5. Blocked Threads

The majority of system failures I have dealt with do not involve outright crashes. The process runs and runs but does nothing because every thread available for processing transactions is blocked waiting on some impossible outcome.

Metrics can reveal problems quickly too. Counters like “successful logins” or “failed credit cards” will show problems long before an alert goes off.

Remember This

Recall that the Blocked Threads antipattern is the proximate cause of most failures.

Application failures nearly always relate to Blocked Threads in one way or another, including the ever-popular “gradual slowdown” and “hung server.” The Blocked Threads antipattern leads to Chain Reactions and Cascading Failures antipatterns.

Scrutinize resource pools.

Like Cascading Failures, the Blocked Threads antipattern usually happens around resource pools, particularly database connection pools. A deadlock in the database can cause connections to be lost forever, and so can incorrect exception handling.

Use proven primitives.

Learn and apply safe primitives. It might seem easy to roll your own producer/consumer queue: it isn’t. Any library of concurrency utilities has more testing than your newborn queue.

Defend with Timeouts.

You cannot prove that your code has no deadlocks in it, but you can make sure that no deadlock lasts forever. Avoid infinite waits in function calls; use a version that takes a timeout parameter. Always use timeouts, even though it means you need more error-handling code.
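
For example, Python's `threading.Lock.acquire` accepts a `timeout` argument and returns `False` when the lock could not be obtained in time. The helper name `update_with_timeout` and the two-second default are assumptions made for this sketch.

```python
import threading


def update_with_timeout(lock, action, timeout_s=2.0):
    # acquire(timeout=...) returns False rather than waiting forever.
    if not lock.acquire(timeout=timeout_s):
        raise TimeoutError(f"could not get lock within {timeout_s}s")
    try:
        return action()
    finally:
        lock.release()
```

The extra error path is the price of the timeout: the caller now has to decide what to do when the lock is not available.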

Beware the code you cannot see.

All manner of problems can lurk in the shadows of third-party code. Be very wary. Test it yourself. Whenever possible, acquire and investigate the code for surprises and failure modes. You might also prefer open source libraries to closed source for this very reason.

Self-Denial Attacks

Self-denial is only occasionally a virtue in people and never in systems. A self-denial attack describes any situation in which the system—or the extended system that includes humans—conspires against itself.

The classic example of a self-denial attack is the email from marketing to a “select group of users” that contains some privileged information or offer. 

Remember This

Keep the lines of communication open.

Self-denial attacks originate inside your own organization, when people cause self-inflicted wounds by creating their own flash mobs and traffic spikes. You can aid and abet these marketing efforts and protect your system at the same time, but only if you know what’s coming. Make sure nobody sends mass emails with deep links. Send mass emails in waves to spread out the peak load. Create static “landing zone” pages for the first click from these offers. Watch out for embedded session IDs in URLs.

Protect shared resources.

Programming errors, unexpected scaling effects, and shared resources all create risks when traffic surges. Watch out for Fight Club bugs, where increased front-end load causes exponentially increasing back-end processing.

Expect rapid redistribution of any cool or valuable offer.

Anybody who thinks they’ll release a special deal for limited distribution is asking for trouble. There’s no such thing as limited distribution. Even if you limit the number of times a fantastic deal can be redeemed, you’ll still get crushed with people hoping beyond hope that they, too, can get a PlayStation Twelve for $99.


Scaling Effects

Anytime you have a “many-to-one” or “many-to-few” relationship, you can be hit by scaling effects when one side increases. For instance, a database server that holds up just fine when ten machines call it might crash miserably when you add the next fifty machines.

Shared Resources

The following figure should give you an idea of how the callers can put a hurting on the shared resource.

[Figure: many-to-one communication, images/stability_antipatterns/many_to_one_comm.png]

The most scalable architecture is the shared-nothing architecture. Each server operates independently, without need for coordination or calls to any centralized services. In a shared nothing architecture, capacity scales more or less linearly with the number of servers.

The trouble with a shared-nothing architecture is that it might scale better at the cost of failover. For example, consider session failover. A user’s session resides in memory on an application server. When that server goes down, the next request from the user will be directed to another server. Obviously, we’d like that transition to be invisible to the user, so the user’s session should be loaded into the new application server. That requires some kind of coordination between the original application server and some other device. Perhaps the application server sends the user’s session to a session backup server after each page request. Maybe it serializes the session into a database table or shares its sessions with another designated application server. There are numerous strategies for session failover, but they all involve getting the user’s session off the original server. Most of the time, that implies some level of shared resources.

When the shared resource is allocated for exclusive use while a client is processing some unit of work, the probability of contention scales with the number of transactions processed by the layer and the number of clients in that layer. When the shared resource saturates, you get a connection backlog. When the backlog exceeds the listen queue, you get failed transactions. At that point, nearly anything can happen. It depends on what function the caller needs the shared resource to provide. Particularly in the case of cache managers (providing coherency for distributed caches), failed transactions lead to stale data or, worse, loss of data integrity.

Remember This

Examine production versus QA environments to spot Scaling Effects.

You get bitten by Scaling Effects when you move from small one-to-one development and test environments to full-sized production environments. Patterns that work fine in small environments or one-to-one environments might slow down or fail completely when you move to production sizes.

Watch out for point-to-point communication.

Point-to-point communication scales badly, since the number of connections increases as the square of the number of participants. Consider how large your system can grow while still using point-to-point connections—it might be sufficient. Once you’re dealing with tens of servers, you will probably need to replace it with some kind of one-to-many communication.
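
To make the growth concrete: if every participant holds a connection to every other participant, n participants need n(n-1)/2 connections. The short Python function below just evaluates that formula; the name `point_to_point_connections` is chosen for this example.

```python
def point_to_point_connections(participants):
    """Connections needed when every participant talks to every other."""
    return participants * (participants - 1) // 2


for n in (2, 10, 50, 100):
    print(n, "servers ->", point_to_point_connections(n), "connections")
# 2 servers -> 1 connections
# 10 servers -> 45 connections
# 50 servers -> 1225 connections
# 100 servers -> 4950 connections
```

Going from 10 to 100 servers multiplies the server count by 10 but the connection count by 110.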

Watch out for shared resources.

Shared resources can be a bottleneck, a capacity constraint, and a threat to stability. If your system must use some sort of shared resource, stress-test it heavily. Also, be sure its clients will keep working if the shared resource gets slow or locks up.

Unbalanced Capacities

Whether your resources take months, weeks, or seconds to provision, you can end up with mismatched ratios between different layers. That makes it possible for one tier or service to flood another with requests beyond its capacity. This especially holds when you deal with calls to rate-limited or throttled APIs!

In the illustration, the front-end service has 3,000 request-handling threads available. During peak usage, the majority of these will be serving product catalog pages or search results. Some smaller number will be in various corporate “telling” pages. A few will be involved in a checkout process.

[Figure: unbalanced capacities, images/stability_antipatterns/unbalanced_capacities.png]

So if you can’t build every service large enough to meet the potentially overwhelming demand from the front end, then you must build both callers and providers to be resilient in the face of a tsunami of requests. For the caller, Circuit Breaker will help by relieving the pressure on downstream services when responses get slow or connections get refused. For service providers, use Handshaking and Backpressure to inform callers to throttle back on the requests. Also consider Bulkheads to reserve capacity for high-priority callers of critical services.

What can you do if your service serves such unpredictable callers? Be ready for anything. First, use capacity modeling to make sure you’re at least in the ballpark. Three thousand threads calling into seventy-five threads is not in the ballpark. Second, don’t just test your system with your usual workloads. See what happens if you take the number of calls the front end could possibly make, double it, and direct it all against your most expensive transaction. If your system is resilient, it might slow down, or even start to fail fast if it can’t process transactions within the allowed time (see Fail Fast), but it should recover once the load goes down. Crashing, hung threads, empty responses, or nonsense replies indicate your system won’t survive and might just start a cascading failure. Third, if you can, use autoscaling to react to surging demand. It’s not a panacea, since it suffers from lag and can just pass the problem down the line to an overloaded platform service. Also, be sure to impose some kind of financial constraint on your autoscaling as a risk management measure.

Remember This

Examine server and thread counts.

In development and QA, your system probably looks like one or two servers, and so do all the QA versions of the other systems you call. In production, the ratio might be more like ten to one instead of one to one. Check the ratio of front-end to back-end servers, along with the number of threads each side can handle in production compared to QA.

Observe near Scaling Effects and users.

Unbalanced Capacities is a special case of Scaling Effects: one side of a relationship scales up much more than the other side. A change in traffic patterns—seasonal, market-driven, or publicity-driven—can cause a usually benign front-end system to suddenly flood a back-end system, in much the same way as a hot Reddit post or celebrity tweet causes traffic to suddenly flood websites.

Virtualize QA and scale it up.

Even if your production environment is a fixed size, don’t let your QA languish at a measly pair of servers. Scale it up. Try test cases where you scale the caller and provider to different ratios. You should be able to automate this all through your data center automation tools.

Stress both sides of the interface.

If you provide the back-end system, see what happens if it suddenly gets ten times the highest-ever demand, hitting the most expensive transaction. Does it fail completely? Does it slow down and recover? If you provide the front-end system, see what happens if calls to the back end stop responding or get very slow.



Dogpile

When a bunch of servers impose a transient load all at once, it’s called a dogpile. (“Dogpile” is a term from American football in which the ball-carrier gets compressed at the base of a giant pyramid of steroid-infused flesh.)

A dogpile can occur in several different situations:

  • When booting up several servers, such as after a code upgrade and restart
  • When a cron job triggers at midnight (or on the hour for any hour, really)
  • When the configuration management system pushes out a change

Remember This

Dogpiles force you to spend too much to handle peak demand.

A dogpile concentrates demand. It requires a higher peak capacity than you’d need if you spread the surge out.

Use random clock slew to diffuse the demand.

Don’t set all your cron jobs for midnight or any other on-the-hour time. Mix them up to spread the load out.

Use increasing backoff times to avoid pulsing.

A fixed retry interval will concentrate demand from callers on that period. Instead, use a backoff algorithm so different callers will be at different points in their backoff periods.
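
Both ideas can be sketched in a few lines of Python. The first function is an exponential backoff with random jitter; the second derives a stable per-host minute offset for scheduled jobs. The function names, the default base and cap values, and the use of a host name as the seed are assumptions for illustration only.

```python
import random


def backoff_delay(attempt, base_s=0.5, cap_s=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # The ceiling doubles with each attempt, up to cap_s.
    ceiling = min(cap_s, base_s * (2 ** attempt))
    # A random point below the ceiling keeps callers out of step.
    return rng.uniform(0.0, ceiling)


def slewed_minute(host_id, window_minutes=60):
    """Stable per-host minute offset so jobs do not all fire at :00."""
    return random.Random(host_id).randrange(window_minutes)
```

Because each caller draws its own random delay, retries spread across the interval instead of pulsing together.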


Force Multiplier

Like a lever, automation allows administrators to make large movements with less effort. It’s a force multiplier.

Outage Amplification

On August 11, 2016, link aggregator Reddit.com suffered an outage. It was unavailable for approximately ninety minutes and had degraded service for about another ninety minutes.[11] In their postmortem, Reddit admins described a conflict between deliberate, manual changes and their automation platform:

  1. First, the admins shut down their autoscaler service so that they could upgrade a ZooKeeper cluster.[12]

  2. Sometime into the upgrade process, the package management system detected the autoscaler was off and restarted it.

  3. The autoscaler came back online and read the partially migrated ZooKeeper data. The incomplete ZooKeeper data reflected a much smaller environment than was currently running.

  4. The autoscaler decided that too many servers were running. It therefore shut down many application and cache servers. This is the start of the downtime.

  5. Sometime later, the admins identified the autoscaler as the culprit. They overrode the autoscaler and started restoring instances manually. The instances came up, but their caches were empty. They all made requests to the database at the same time, which led to a dogpile on the database. Reddit was up but unusably slow during this time.

  6. Finally, the caches warmed sufficiently to handle typical traffic. The long nightmare ended and users resumed downvoting everything they disagree with. In other words, normal activity resumed.

The most interesting aspect of this outage is the way it emerged from a conflict between the automation platform’s “belief” about the expected state of the system and the administrator’s belief about the expected state. When the package management system reactivated the autoscaler, it had no way to know that the autoscaler was expected to be down. Likewise, the autoscaler had no way to know that its source of truth (ZooKeeper) was temporarily unable to report the truth. Like HAL 9000, the automation systems were stuck between two conflicting sets of instructions.

A similar condition can occur with service discovery systems. A service discovery service is a distributed system that attempts to report on the state of many distributed systems to other distributed systems. When things are running normally, they work as shown in the figure.

[Figure: service discovery in normal operation, images/stability_antipatterns/force_multiplier_service_discovery.png]

The nodes of the discovery system gossip among themselves to synchronize their knowledge of the registered services. They run health checks periodically to see if any of the services’ nodes should be taken out of rotation. If a single instance of one of the services stops responding, then the discovery service removes that node’s IP address. No wonder they can amplify a failure. One especially challenging failure mode occurs when a service discovery node is itself partitioned away from the rest of the network. As shown in the next figure, node 3 of the discovery service can no longer reach any of the managed services. Node 3 kind of panics. It can’t tell the difference between “the rest of the universe just disappeared” and “I’ve got a blindfold on.” But if node 3 can still gossip with nodes 1 and 2, then it can propagate its belief to the whole cluster. All at once, service discovery reports that zero services are available. Any application that needs a service gets told, “Sorry, but it looks like a meteor hit the data center. It’s a smoking crater.”

[Figure: service discovery with node 3 partitioned, images/stability_antipatterns/force_multiplier_service_discovery_partitioned.png]

Consider a similar failure, but with a platform management service instead. This service is responsible for starting and stopping machine instances. If it forms a belief that everything is down, then it would necessarily start a new copy of every single service required to run the enterprise.

This situation arises mostly with “control plane” software. The “control plane” refers to software that exists to help manage the infrastructure and applications rather than directly delivering user functionality. Logging, monitoring, schedulers, scalers, load balancers, and configuration management are all parts of the control plane.

The common thread running through these failures is that the automation is not being used to simply enact the will of a human administrator. Rather, it’s more like industrial robotics: the control plane senses the current state of the system, compares it to the desired state, and effects changes to bring the current state into the desired state.

In the Reddit failure, ZooKeeper held a representation of the desired state. That representation was (temporarily) incorrect.

In the case of the discovery service, the partitioned node was not able to correctly sense the current state.


A failure can also result when the “desired” state is computed incorrectly and may be impossible or impractical. For example, a naive scheduler might try to run enough instances to drain a queue in a fixed amount of time. Depending on the individual jobs’ processing time, the number of instances might be “infinity.” That will smart when the Amazon Web Services bill arrives!

Controls and Safeguards

The United States has a government agency called the Occupational Safety and Health Administration (OSHA). We don’t see them too often in the software field, but we can still learn from their safety advice for robots.[13]

Industrial robots have multiple layers of safeguards to prevent damage to people, machines, and facilities. In particular, limiting devices and sensors detect when the robot is not operating in a “normal” condition. For example, suppose a robot arm has a rotating joint. There are limits on how far the arm is allowed to rotate based on the expected operating envelope. These will be much, much smaller than the full range of motion the arm could reach. The rate of rotation will be limited so it doesn’t go flinging car doors across an assembly plant if the grip fails. Some joints even detect if they are not working against the expected amount of weight or resistance (as might happen when the front falls off).

We can implement similar safeguards in our control plane software:

  • If observations report that more than 80 percent of the system is unavailable, it’s more likely to be a problem with the observer than the system.

  • Apply hysteresis. (See Governor.) Start machines quickly, but shut them down slowly. Starting new machines is safer than shutting old ones off.

  • When the gap between expected state and observed state is large, signal for confirmation. This is equivalent to a big yellow rotating warning lamp on an industrial robot.

  • Systems that consume resources should be stateful enough to detect if they’re trying to spin up infinity instances.

  • Build in deceleration zones to account for momentum. Suppose your control plane senses excess load every second, but it takes five minutes to start a virtual machine to handle the load. It must make sure not to start 300 virtual machines because the high load persists.
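
The safeguards above could be combined in a single control-loop decision roughly as follows. This is a sketch under stated assumptions, not any real autoscaler's logic: `plan_scaling`, its parameters, and the default limits are hypothetical. Only the 80 percent threshold comes from the list above.

```python
def plan_scaling(desired, running, observed_up, in_flight=0,
                 max_instances=100, max_stop_per_cycle=1):
    """Return (action, count) for one pass of a control loop."""
    # If more than 80% looks down, suspect the observer and ask a human.
    if running > 0 and (running - observed_up) / running > 0.8:
        return ("confirm", 0)

    # Refuse impossible targets such as "infinity" instances.
    target = min(desired, max_instances)

    # Deceleration zone: count instances that are already starting.
    gap = target - (running + in_flight)
    if gap > 0:
        # Start quickly.
        return ("start", gap)
    if gap < 0:
        # Hysteresis: shut down slowly, a few per cycle.
        return ("stop", min(-gap, max_stop_per_cycle))
    return ("hold", 0)
```

With these rules, a wrong desired state that is far too small costs at most `max_stop_per_cycle` instances per cycle, which gives operators time to notice before most of the fleet is gone.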


Force Multiplier

Like a lever, automation allows administrators to make large movements with less effort. It’s a force multiplier.

Outage Amplification

On August 11, 2016, link aggregator Reddit.com suffered an outage. It was unavailable for approximately ninety minutes and had degraded service for about another ninety minutes.[11] In their postmortem, Reddit admins described a conflict between deliberate, manual changes and their automation platform:

  1. First, the admins shut down their autoscaler service so that they could upgrade a ZooKeeper cluster.[12]

  2. Sometime into the upgrade process, the package management system detected the autoscaler was off and restarted it.

  3. The autoscaler came back online and read the partially migrated ZooKeeper data. The incomplete ZooKeeper data reflected a much smaller environment than was currently running.

  4. The autoscaler decided that too many servers were running. It therefore shut down many application and cache servers. This is the start of the downtime.

  5. Sometime later, the admins identified the autoscaler as the culprit. They overrode the autoscaler and started restoring instances manually. The instances came up, but their caches were empty. They all made requests to the database at the same time, which led to a dogpile on the database. Reddit was up but unusably slow during this time.

  6. Finally, the caches warmed sufficiently to handle typical traffic. The long nightmare ended and users resumed downvoting everything they disagree with. In other words, normal activity resumed.

The most interesting aspect of this outage is the way it emerged from a conflict between the automation platform’s “belief” about the expected state of the system and the administrator’s belief about the expected state. When the package management system reactivated the autoscaler, it had no way to know that the autoscaler was expected to be down. Likewise, the autoscaler had no way to know that its source of truth (ZooKeeper) was temporarily unable to report the truth. Like HAL 9000, the automation systems were stuck between two conflicting sets of instructions.

A similar condition can occur with service discovery systems. A service discovery service is a distributed system that attempts to report on the state of many distributed systems to other distributed systems. When things are running normally, it works as shown in the figure.

[Figure: service discovery under normal operation (images/stability_antipatterns/force_multiplier_service_discovery.png)]

The nodes of the discovery system gossip among themselves to synchronize their knowledge of the registered services. They run health checks periodically to see if any of the services’ nodes should be taken out of rotation. If a single instance of one of the services stops responding, then the discovery service removes that node’s IP address. No wonder these systems can amplify a failure.

One especially challenging failure mode occurs when a service discovery node is itself partitioned away from the rest of the network. As shown in the next figure, node 3 of the discovery service can no longer reach any of the managed services. Node 3 kind of panics. It can’t tell the difference between “the rest of the universe just disappeared” and “I’ve got a blindfold on.”

But if node 3 can still gossip with nodes 1 and 2, then it can propagate its belief to the whole cluster. All at once, service discovery reports that zero services are available. Any application that needs a service gets told, “Sorry, but it looks like a meteor hit the data center. It’s a smoking crater.”

[Figure: service discovery with node 3 partitioned from the managed services (images/stability_antipatterns/force_multiplier_service_discovery_partitioned.png)]

Consider a similar failure, but with a platform management service instead. This service is responsible for starting and stopping machine instances. If it forms a belief that everything is down, then it would necessarily start a new copy of every single service required to run the enterprise.

This situation arises mostly with “control plane” software. The “control plane” refers to software that exists to help manage the infrastructure and applications rather than directly delivering user functionality. Logging, monitoring, schedulers, scalers, load balancers, and configuration management are all parts of the control plane.

The common thread running through these failures is that the automation is not being used to simply enact the will of a human administrator. Rather, it’s more like industrial robotics: the control plane senses the current state of the system, compares it to the desired state, and effects changes to bring the current state into the desired state.

In the Reddit failure, ZooKeeper held a representation of the desired state. That representation was (temporarily) incorrect.

In the case of the discovery service, the partitioned node was not able to correctly sense the current state.

A failure can also result when the “desired” state is computed incorrectly and may be impossible or impractical. For example, a naive scheduler might try to run enough instances to drain a queue in a fixed amount of time. Depending on the individual jobs’ processing time, the number of instances might be “infinity.” That will smart when the Amazon Web Services bill arrives!
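To make the naive-scheduler problem concrete, here is a minimal sketch of the calculation with a ceiling bolted on. The function name, its parameters, and the `MAX_INSTANCES` value are all assumptions made for illustration; they do not come from any real scheduler.

```python
import math

MAX_INSTANCES = 50  # hypothetical ceiling an operator might choose


def instances_to_drain(queue_depth, seconds_per_job, deadline_seconds,
                       max_instances=MAX_INSTANCES):
    """Return (instances, capped) for draining the queue by the deadline.

    `capped` is True when the computed "desired" state was impossible or
    exceeded the ceiling, which is a cue to alert a human.
    """
    if seconds_per_job <= 0:
        raise ValueError("seconds_per_job must be positive")
    if deadline_seconds <= 0 or seconds_per_job > deadline_seconds:
        # A single job already overruns the deadline, so no finite fleet
        # can drain the queue in time. The naive answer is "infinity".
        return max_instances, True
    # Whole jobs that one instance can finish before the deadline.
    jobs_per_instance = deadline_seconds // seconds_per_job
    wanted = math.ceil(queue_depth / jobs_per_instance)
    return min(wanted, max_instances), wanted > max_instances
```

With these made-up numbers, 1,000 two-second jobs and a 60-second deadline need 34 instances, which is under the ceiling. 100,000 jobs would ask for thousands, so the function returns the ceiling and raises the flag instead.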

Controls and Safeguards

The United States has a government agency called the Occupational Safety and Health Administration (OSHA). We don’t see them too often in the software field, but we can still learn from their safety advice for robots.[13]

Industrial robots have multiple layers of safeguards to prevent damage to people, machines, and facilities. In particular, limiting devices and sensors detect when the robot is not operating in a “normal” condition. For example, suppose a robot arm has a rotating joint. There are limits on how far the arm is allowed to rotate based on the expected operating envelope. These will be much, much smaller than the full range of motion the arm could reach. The rate of rotation will be limited so it doesn’t go flinging car doors across an assembly plant if the grip fails. Some joints even detect if they are not working against the expected amount of weight or resistance (as might happen when the front falls off).

We can implement similar safeguards in our control plane software:

  • If observations report that more than 80 percent of the system is unavailable, it’s more likely to be a problem with the observer than the system.

  • Apply hysteresis. (See Governor.) Start machines quickly, but shut them down slowly. Starting new machines is safer than shutting old ones off.

  • When the gap between expected state and observed state is large, signal for confirmation. This is equivalent to a big yellow rotating warning lamp on an industrial robot.

  • Systems that consume resources should be stateful enough to detect if they’re trying to spin up infinity instances.

  • Build in deceleration zones to account for momentum. Suppose your control plane senses excess load every second, but it takes five minutes to start a virtual machine to handle the load. It must make sure not to start 300 virtual machines (one for each of the 300 one-second observations) just because the high load persists while the first machine is still booting.
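Several of the safeguards above can be sketched as one planning function. This is only an illustration under stated assumptions: the `Decision` type, the `plan_change` name, and the 25 percent gap and one-stop-per-cycle limits are hypothetical. Only the 80 percent figure comes from the list above.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    start: int = 0
    stop: int = 0
    needs_confirmation: bool = False
    reason: str = ""


def plan_change(desired, running, unreachable,
                distrust_fraction=0.8,   # from the 80 percent rule above
                max_gap_fraction=0.25,   # hypothetical threshold
                max_stop_per_cycle=1):   # hypothetical: shut down slowly
    """Plan one control-loop cycle.

    running     -- instances the control plane believes exist
    unreachable -- how many of those are failing health checks
    """
    # If nearly everything looks dead, suspect the observer first.
    if running > 0 and unreachable / running > distrust_fraction:
        return Decision(needs_confirmation=True, reason="observer suspect")

    gap = desired - running
    # A large gap between desired and observed state needs a human.
    if running > 0 and abs(gap) / running > max_gap_fraction:
        return Decision(needs_confirmation=True, reason="gap too large")

    if gap > 0:
        return Decision(start=gap, reason="scale up")  # start quickly
    if gap < 0:
        # Hysteresis: stop at most a few instances per cycle.
        return Decision(stop=min(-gap, max_stop_per_cycle),
                        reason="scale down slowly")
    return Decision(reason="steady")
```

Under these assumptions, a desired state of 2 against 10 running instances (roughly the Reddit situation) produces a request for confirmation rather than eight shutdowns.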


Remember This

Ask for help before causing havoc.

Infrastructure management tools can make very large impacts very quickly. Build limiters and safeguards into them so they won’t destroy your whole system at once.

Beware of lag time and momentum.

Actions initiated by automation take time. That time is usually longer than a monitoring interval, so make sure to account for some delay in the system’s response to the action.
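One way to account for that delay is to count instances that are still booting before launching more. The small simulation below is a sketch, not a real autoscaler: `simulate`, `instances_to_start`, and the tick model (one tick per monitoring interval) are assumptions chosen for illustration.

```python
def instances_to_start(needed, running, pending):
    """Instances to launch now, given those already running or booting."""
    return max(0, needed - running - pending)


def simulate(ticks, needed, running, boot_ticks, count_pending=True):
    """Run the control loop for `ticks` monitoring intervals.

    Each launched instance takes `boot_ticks` intervals to come online.
    Returns the total number of instances launched.
    """
    booting = []          # intervals remaining for each booting instance
    total_started = 0
    for _ in range(ticks):
        # Advance boots and promote finished instances to running.
        booting = [t - 1 for t in booting]
        running += sum(1 for t in booting if t == 0)
        booting = [t for t in booting if t > 0]

        # A naive loop ignores instances that are still booting.
        pending = len(booting) if count_pending else 0
        to_start = instances_to_start(needed, running, pending)
        booting.extend([boot_ticks] * to_start)
        total_started += to_start
    return total_started
```

With a one-second interval and a five-minute boot (`ticks=300`, `boot_ticks=300`), a loop that needs one extra instance launches 300 when it ignores pending boots and exactly one when it counts them.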

Beware of illusions and superstitions.

Control systems sense the environment, but they can be fooled. They compute an expected state and a “belief” about the current state. Either can be mistaken.


Slow Response

Remember This

Slow Responses trigger Cascading Failures.

Upstream systems experiencing Slow Responses will themselves slow down and might be vulnerable to stability problems when the response times exceed their own timeouts.

For websites, Slow Responses cause more traffic.

Users waiting for pages frequently hit the Reload button, generating even more traffic to your already overloaded system.

Consider Fail Fast.

If your system tracks its own responsiveness, then it can tell when it’s getting slow. Consider sending an immediate error response when the average response time exceeds the system’s allowed time (or at the very least, when the average response time exceeds the caller’s timeout!).
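A minimal sketch of that idea follows, assuming a time-windowed average and an HTTP-style 503 status for the immediate error. The class name, the `handle` wrapper, and the 30-second window are hypothetical. The window matters because rejected requests record no samples: old slow samples must age out, or the system would never accept work again.

```python
import time
from collections import deque


class ResponseTimeTracker:
    def __init__(self, allowed_seconds, window_seconds=30.0,
                 clock=time.monotonic):
        self.allowed_seconds = allowed_seconds
        self.window_seconds = window_seconds
        self.clock = clock
        self.samples = deque()  # (finished_at, duration) pairs

    def record(self, duration):
        self.samples.append((self.clock(), duration))

    def average(self):
        # Drop samples that fell out of the window so the tracker recovers.
        cutoff = self.clock() - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()
        if not self.samples:
            return 0.0
        return sum(d for _, d in self.samples) / len(self.samples)

    def should_fail_fast(self):
        return self.average() > self.allowed_seconds


def handle(tracker, do_work):
    """Reject immediately when the recent average exceeds the budget."""
    if tracker.should_fail_fast():
        return 503, None
    started = tracker.clock()
    result = do_work()
    tracker.record(tracker.clock() - started)
    return 200, result
```

The `clock` parameter exists only so the behavior can be exercised without waiting in real time.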

Hunt for memory leaks or resource contention.

Contention for an inadequate supply of database connections produces Slow Responses. Slow Responses also aggravate that contention, leading to a self-reinforcing cycle. Memory leaks cause excessive effort in the garbage collector, resulting in Slow Responses. Inefficient low-level protocols can cause network stalls, also resulting in Slow Responses.

Unbounded Result Sets
When code loops over a database result set, processing a row typically means adding a new data object to a collection. What happens when the database suddenly returns five million rows instead of the usual hundred or so? Unless your application explicitly limits the number of results it’s willing to process, it can end up exhausting its memory or spinning in a while loop long after the user loses interest.
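As an illustration of an explicit limit, the sketch below caps how many rows the caller will ever pull from a query, using Python's built-in `sqlite3` module. The `load_rows` helper and the `MAX_ROWS` value are assumptions made for this example.

```python
import sqlite3

MAX_ROWS = 1000  # hypothetical cap; pick one that fits your memory budget


def load_rows(conn, query, params=(), max_rows=MAX_ROWS):
    """Fetch at most `max_rows` rows; report whether more were available."""
    cursor = conn.execute(query, params)
    # Ask for one extra row so we can tell "exactly at the limit"
    # apart from "there was more that we refused to load".
    rows = cursor.fetchmany(max_rows + 1)
    truncated = len(rows) > max_rows
    return rows[:max_rows], truncated
```

The caller decides what a truncated result means: log it, show a "too many results" message, or ask for a narrower query. Either way, the collection can never grow past the cap.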


Remember This

Use realistic data volumes.

Typical development and test data sets are too small to exhibit this problem. You need production-sized data sets to see what happens when your query returns a million rows that you turn into objects. As a side benefit, you’ll also get better information from your performance testing when you use production-sized test data.

Paginate at the front end.

Build pagination details into your service call. The request should include a parameter for the first item and the count. The reply should indicate (roughly) how many results there are.
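A paginated call might look something like the sketch below. The `Page` type, the `get_page` function, and the `MAX_PAGE_SIZE` value are hypothetical, and an in-memory list stands in for whatever data store sits behind the real service.

```python
from dataclasses import dataclass

MAX_PAGE_SIZE = 100  # hypothetical server-side cap on any single reply


@dataclass
class Page:
    items: list
    first: int         # index of the first item in this page
    count: int         # number of items actually returned
    approx_total: int  # rough size of the full result


def get_page(all_items, first=0, count=20):
    """Return one page; the server, not the caller, sets the upper bound."""
    first = max(0, first)
    count = max(0, min(count, MAX_PAGE_SIZE))
    items = all_items[first:first + count]
    return Page(items=items, first=first, count=len(items),
                approx_total=len(all_items))
```

A caller that asks for 500 items starting at item 240 of 250 gets back the last 10 items plus the total, so it knows there is nothing more to fetch.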

Don’t rely on the data producers.

Even if you think a query will never have more than a handful of results, beware: it could change without warning because of some other part of the system. The only sensible numbers are “zero,” “one,” and “lots,” so unless your query selects exactly one row, it has the potential to return too many. Don’t rely on the data producers to create a limited amount of data. Sooner or later, they’ll go berserk and fill up a table for no reason, and then where will you be?

Put limits into other application-level protocols.

Service calls, RMI, DCOM, XML-RPC, and any other kind of request/reply call are vulnerable to returning huge collections of objects, thereby consuming too much memory.