Tuesday 1 November 2022

Stability Patterns

1. Timeouts
The timeout is a simple mechanism allowing you to stop waiting for an answer once you think it won’t come. 
Well-placed timeouts provide fault isolation—a problem in some other service or device does not have to become your problem.

Commercial software client libraries are notoriously devoid of timeouts. These libraries often do direct socket calls on behalf of the system. By hiding the socket from your code, they also prevent you from setting vital timeouts.
Any resource pool can be exhausted. It’s essential that any resource pool that blocks threads must have a timeout to ensure that calling threads eventually unblock, whether resources become available or not.

Also beware of language-level synchronization or mutexes. Always use the form that takes a timeout argument.
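As a sketch of what those timeout-taking forms can look like in Java (the class and pool here are purely illustrative): a ReentrantLock acquired with tryLock and a deadline replaces a bare synchronized block, and a blocking pool checked out with poll and a deadline guarantees the calling thread eventually unblocks.

```java
import java.sql.Connection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TimeoutForms {
    private final ReentrantLock lock = new ReentrantLock();
    // Hypothetical pool of pre-opened connections; any blocking pool works the same way.
    private final BlockingQueue<Connection> pool = new ArrayBlockingQueue<>(10);

    public void updateSharedState() throws InterruptedException {
        // Prefer the timeout-taking form over synchronized or lock():
        // the caller unblocks even if the current lock holder is wedged.
        if (!lock.tryLock(250, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("could not acquire lock within 250 ms");
        }
        try {
            // ...mutate shared state...
        } finally {
            lock.unlock();
        }
    }

    public Connection checkOutConnection() throws InterruptedException {
        // poll(timeout) ensures the calling thread eventually unblocks,
        // whether or not a connection becomes available.
        Connection conn = pool.poll(500, TimeUnit.MILLISECONDS);
        if (conn == null) {
            throw new IllegalStateException("no connection available within 500 ms");
        }
        return conn;
    }
}
```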

Use a generic gateway to provide the template for connection handling, error handling, query execution, and result processing. That way you only need to get it right in one place, and calling code can provide just the essential logic. Collecting this common interaction pattern into a single class also makes it easier to apply the Circuit Breaker pattern.
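A minimal sketch of such a gateway, assuming JDBC and a hypothetical QueryGateway class: connection handling, the query timeout, and error translation live in one place, and calling code supplies only the SQL and the row mapping.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

/** Hypothetical generic gateway: connection handling, timeouts, and error
 *  handling live here once; callers supply only the query and row mapping. */
public class QueryGateway {
    public interface RowMapper<T> { T map(ResultSet rs) throws SQLException; }

    private final DataSource dataSource;
    private final int queryTimeoutSeconds;

    public QueryGateway(DataSource dataSource, int queryTimeoutSeconds) {
        this.dataSource = dataSource;
        this.queryTimeoutSeconds = queryTimeoutSeconds;
    }

    public <T> List<T> query(String sql, RowMapper<T> mapper, Object... params) {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setQueryTimeout(queryTimeoutSeconds);   // the vital timeout, set in one place
            for (int i = 0; i < params.length; i++) {
                stmt.setObject(i + 1, params[i]);
            }
            List<T> results = new ArrayList<>();
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    results.add(mapper.map(rs));
                }
            }
            return results;
        } catch (SQLException e) {
            // One place to translate low-level failures; a Circuit Breaker could wrap this call.
            throw new RuntimeException("query failed: " + sql, e);
        }
    }
}
```

Calling code then reduces to something like gateway.query("select name from customers where id = ?", rs -> rs.getString("name"), customerId).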

Timeouts are often found in the company of retries. Under the philosophy of “best effort,” the software attempts to repeat an operation that timed out. Immediately retrying an operation after a failure has a number of consequences, but only some of them are beneficial. If the operation failed because of any significant problem, it’s likely to fail again if retried immediately.

From the client’s perspective, making me wait longer is a very bad thing. If you cannot complete an operation because of some timeout, it is better for you to return a result. It can be a failure, a success, or a note that you’ve queued the work for later execution (if I should care about the distinction). In any case, just come back with an answer. Making me wait while you retry the operation might push your response time past my timeout. It certainly keeps my resources busy longer than needed.


Timeouts have natural synergy with circuit breakers. A circuit breaker can tabulate timeouts, tripping to the “off” state if too many occur.

Remember This

Apply Timeouts to Integration Points, Blocked Threads, and Slow Responses.

The Timeouts pattern prevents calls to Integration Points from becoming Blocked Threads. Thus, timeouts avert Cascading Failures.

Apply Timeouts to recover from unexpected failures.

When an operation is taking too long, sometimes we don’t care why…we just need to give up and keep moving. The Timeouts pattern lets us do that.

Consider delayed retries.

Most of the explanations for a timeout involve problems in the network or the remote system that won’t be resolved right away. Immediate retries are liable to hit the same problem and result in another timeout. That just makes the user wait even longer for her error message. Most of the time, you should queue the operation and retry it later.
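One hedged sketch of a delayed retry, using a hypothetical DelayedRetry helper: the caller gets an immediate "queued" answer, and the work is re-attempted later on a background scheduler instead of while the caller waits.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical delayed-retry helper: on a timeout, acknowledge the caller
 *  immediately and schedule the work for a later attempt. */
public class DelayedRetry {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);

    /** Returns "QUEUED" right away instead of blocking the caller on a retry. */
    public String submit(Runnable operation, int maxAttempts) {
        scheduleAttempt(operation, 1, maxAttempts);
        return "QUEUED";
    }

    private void scheduleAttempt(Runnable operation, int attempt, int maxAttempts) {
        scheduler.schedule(() -> {
            try {
                operation.run();
            } catch (RuntimeException e) {
                if (attempt < maxAttempts) {
                    // Back off before the next attempt; the original caller is long gone.
                    scheduleAttempt(operation, attempt + 1, maxAttempts);
                }
                // else: log it and give up, or park the work for manual follow-up
            }
        }, 30, TimeUnit.SECONDS);
    }
}
```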

2. Circuit Breaker

Principle: detect excess usage, fail first, and open the circuit.
The circuit breaker exists to allow one subsystem (an electrical circuit) to fail (excessive current draw, possibly from a short circuit) without destroying the entire system (the house). Furthermore, once the danger has passed, the circuit breaker can be reset to restore full function to the system.

In the normal “closed” state, the circuit breaker executes operations as usual. These can be calls out to another system, or they can be internal operations that are subject to timeout or other execution failure. If the call succeeds, nothing extraordinary happens. If it fails, however, the circuit breaker makes a note of the failure. Once the number of failures (or the frequency of failures, in more sophisticated cases) exceeds a threshold, the circuit breaker trips and “opens” the circuit, as shown in the following figure.

[Figure: circuit breaker state diagram]

When the circuit is “open,” calls to the circuit breaker fail immediately, without any attempt to execute the real operation. After a suitable amount of time, the circuit breaker decides that the operation has a chance of succeeding, so it goes into the “half-open” state. In this state, the next call to the circuit breaker is allowed to execute the dangerous operation. Should the call succeed, the circuit breaker resets and returns to the “closed” state, ready for more routine operation. If this trial call fails, however, the circuit breaker returns to the open state until another timeout elapses.
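A minimal sketch of that state machine follows. The threshold and cooldown are illustrative, and a production breaker would avoid serializing every call under one lock and would log or publish its state transitions for Operations.

```java
import java.util.concurrent.Callable;

/** Minimal circuit breaker sketch: closed -> open after too many failures,
 *  open -> half-open after a cooldown, half-open -> closed on one success. */
public class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int failureCount = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Callable<T> operation) throws Exception {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= openMillis) {
                state = State.HALF_OPEN;          // allow one trial call
            } else {
                throw new IllegalStateException("circuit open: failing fast");
            }
        }
        try {
            T result = operation.call();
            failureCount = 0;
            state = State.CLOSED;                 // trial (or routine) call succeeded
            return result;
        } catch (Exception e) {
            failureCount++;
            if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
                state = State.OPEN;               // trip: log and expose this transition
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }
}
```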

Circuit breakers are a way to automatically degrade functionality when the system is under stress. No matter the fallback strategy, it can have an impact on the business of the system. Therefore, it’s essential to involve the system’s stakeholders when deciding how to handle calls made when the circuit is open. For example, should a retail system accept an order if it can’t confirm availability of the customer’s items? What about if it can’t verify the customer’s credit card or shipping address? Of course, this conversation is not unique to the use of a circuit breaker, but discussing the circuit breaker can be a more effective way of broaching the topic than asking for a requirements document.


I like the Leaky Bucket pattern from Pattern Languages of Program Design 2 [VCK96]. It’s a simple counter that you can increment every time you observe a fault. In the background, a thread or timer decrements the counter periodically (down to zero, of course). If the count exceeds a threshold, then you know that faults are arriving quickly.
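A sketch of that Leaky Bucket, with a hypothetical class name and illustrative drain period: faults increment the counter, a background timer drains it, and the threshold check tells you when faults arrive faster than they drain.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Leaky Bucket sketch: increment on each fault, leak one unit per period,
 *  and report when faults arrive faster than they drain. */
public class LeakyBucket {
    private final AtomicInteger level = new AtomicInteger(0);
    private final int threshold;

    public LeakyBucket(int threshold, long drainPeriodMillis) {
        this.threshold = threshold;
        ScheduledExecutorService drainer = Executors.newSingleThreadScheduledExecutor();
        // Background "leak": decrement periodically, never below zero.
        drainer.scheduleAtFixedRate(
                () -> level.updateAndGet(n -> Math.max(0, n - 1)),
                drainPeriodMillis, drainPeriodMillis, TimeUnit.MILLISECONDS);
    }

    /** Call on every observed fault; returns true when faults are arriving too quickly. */
    public boolean recordFault() {
        return level.incrementAndGet() > threshold;
    }
}
```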

The state of the circuit breakers in a system is important to another set of stakeholders: operations. Changes in a circuit breaker’s state should always be logged, and the current state should be exposed for querying and monitoring. In fact, the frequency of state changes is a useful metric to chart over time; it is a leading indicator of problems elsewhere in the enterprise. Likewise, Operations needs some way to directly trip or reset the circuit breaker. The circuit breaker is also a convenient place to gather metrics about call volumes and response times.

Circuit breakers are effective at guarding against integration points, cascading failures, unbalanced capacities, and slow responses. They work so closely with timeouts that they often track timeout failures separately from execution failures.

Remember This

Don’t do it if it hurts.

Circuit Breaker is the fundamental pattern for protecting your system from all manner of Integration Points problems. When there’s a difficulty with Integration Points, stop calling it!

Use together with Timeouts.

Circuit Breaker is good at avoiding calls when Integration Points has a problem. The Timeouts pattern indicates that there’s a problem in Integration Points.

Expose, track, and report state changes.

Popping a Circuit Breaker always indicates something abnormal. It should be visible to Operations. It should be reported, recorded, trended, and correlated.


3. Bulkheads

In a ship, bulkheads are partitions that, when sealed, divide the ship into separate, watertight compartments. With hatches closed, a bulkhead prevents water from moving from one section to another. In this way, a single penetration of the hull does not irrevocably sink the ship. The bulkhead enforces a principle of damage containment.

You can employ the same technique. By partitioning your systems, you can keep a failure in one part of the system from destroying everything. Physical redundancy is the most common form of bulkheads. If there are four independent servers, then a hardware failure in one can’t affect the others. Likewise, if there are two application instances running on a server and one crashes, the other will still be running (unless, of course, the first one crashed because of some external influence that would also affect the second).

You can partition the threads inside a single process, with separate thread groups dedicated to different functions. For example, it’s often helpful to reserve a pool of request-handling threads for administrative use. That way, even if all request-handling threads on the application server are hung, it can still respond to admin requests—perhaps to collect data for postmortem analysis or a request to shut down.
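A sketch of that kind of thread partitioning, with illustrative pool sizes: one fixed pool for ordinary request handling and a small reserved pool for admin traffic, so exhausting one cannot starve the other.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Bulkhead sketch: separate, fixed-size pools so exhausting one function's
 *  threads cannot starve another. Pool sizes here are illustrative. */
public class Bulkheads {
    // Ordinary request handling gets the big pool...
    private final ExecutorService requestPool = Executors.newFixedThreadPool(50);
    // ...while admin and health endpoints keep a small reserved pool of their own,
    // so Operations can still get in when every request thread is hung.
    private final ExecutorService adminPool = Executors.newFixedThreadPool(2);

    public void handleRequest(Runnable work) {
        requestPool.submit(work);
    }

    public void handleAdminRequest(Runnable work) {
        adminPool.submit(work);
    }
}
```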

Remember This

Save part of the ship.

The Bulkheads pattern partitions capacity to preserve partial functionality when bad things happen.

Pick a useful granularity.

You can partition thread pools inside an application, CPUs in a server, or servers in a cluster.

Consider Bulkheads particularly with shared services models.

Failures in service-oriented or microservice architectures can propagate very quickly. If your service goes down because of a Chain Reaction, does the entire company come to a halt? Then you’d better put in some Bulkheads.

4. Steady State


The third edition of Roget’s Thesaurus offers the following definition for the word fiddling: “To handle something idly, ignorantly, or destructively.” It offers helpful synonyms such as fool, meddle, tamper, tinker, and monkey. Fiddling is often followed by the “ohnosecond”—that very short moment in time during which you realize that you have pressed the wrong key and brought down a server, deleted vital data, or otherwise damaged the peace and harmony of stable operations.

Every single time a human touches a server is an opportunity for unforced errors. I know of one incident in which an engineer, attempting to be helpful, observed that a server’s root disk mirror was out of sync. He executed a command to “resilver” the mirror, bringing the two disks back into synchronization. Unfortunately, he made a typo and synced the good root disk from the new, totally empty drive that had just been swapped in to replace a bad disk, thereby instantly annihilating the operating system on that server.

It’s best to keep people off production systems to the greatest extent possible. If the system needs a lot of crank-turning and hand-holding to keep running, then administrators develop the habit of staying logged in all the time. This situation probably indicates that the servers are “pets” rather than “cattle” and inevitably leads to fiddling. To that end, the system should be able to run at least one release cycle without human intervention. The logical extreme on the “no fiddling” scale is immutable infrastructure—it can’t be fiddled with! (See Automated Deployments, for more about immutable infrastructure.)

“One release cycle” may be pretty tough if the system is deployed once a quarter. On the other hand, a microservice being continuously deployed from version control should be pretty easy to stabilize for a release cycle.

Unless the system is crashing every day (in which case, look for the presence of the stability antipatterns), the most common reason for logging in will probably be cleaning up log files or purging data.

Any mechanism that accumulates resources (whether it’s log files in the filesystem, rows in the database, or caches in memory) is like a bucket from a high-school calculus problem. The bucket fills up at a certain rate, based on the accumulation of data. It must be drained at the same rate, or greater, or it will eventually overflow. When this bucket overflows, bad things happen: servers go down, databases get slow or throw errors, response times head for the stars. The Steady State pattern says that for every mechanism that accumulates a resource, some other mechanism must recycle that resource. Let’s look at several types of sludge that can accumulate and how to avoid the need for fiddling.

Data Purging

It certainly seems like a simple enough principle. Computing resources are always finite; therefore, you cannot continually increase consumption without limit. Still, in the rush of excitement about rolling out a new killer application, the next great mission-critical, bet-the-company whatever, data purging always gets the short end of the stick. It certainly doesn’t demo as well as…well, anything demos better than purging, really. It sometimes seems that you’ll be lucky if the system ever runs at all in the real world. The notion that it’ll run long enough to accumulate too much data to handle seems like a “high-class problem”—the kind of problem you’d love to have.

Nevertheless, someday your little database will grow up. When it hits the teenage years—about two in human years—it’ll get moody, sullen, and resentful. In the worst case, it’ll start undermining the whole system (and it will probably complain that nobody understands it, too).

The most obvious symptom of data growth will be steadily increasing I/O rates on the database servers. You may also see increasing latency at constant loads.

Data purging is nasty, detail-oriented work. Referential integrity constraints in a relational database are half the battle. It can be difficult to cleanly remove obsolete data without leaving orphaned rows. The other half of the battle is ensuring that applications still work once the data is gone. That takes coding and testing.

There are few general rules here. Much depends on the database and libraries in use. RDBMS plus ORM tends to deal badly with dangling references, for example, whereas a document-oriented database won’t even notice.
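As an illustration only (the table names, SQL, and single-transaction approach are assumptions, not a recipe), a purge job that respects referential integrity deletes child rows before their parents. A real job would also chunk the deletes into bounded batches so transactions stay short and locks stay narrow.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import javax.sql.DataSource;

/** Hypothetical purge job: delete child rows before their parents so the
 *  foreign keys stay satisfied and no orphaned rows are left behind. */
public class OrderPurgeJob {
    private final DataSource dataSource;

    public OrderPurgeJob(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void purgeOlderThanDays(int days) throws SQLException {
        Timestamp cutoff = Timestamp.from(Instant.now().minus(days, ChronoUnit.DAYS));
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try {
                // Children first, so the parent delete never violates the foreign key.
                execute(conn, "DELETE FROM order_items WHERE order_id IN "
                            + "(SELECT id FROM orders WHERE created_at < ?)", cutoff);
                execute(conn, "DELETE FROM orders WHERE created_at < ?", cutoff);
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }

    private int execute(Connection conn, String sql, Timestamp cutoff) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setTimestamp(1, cutoff);
            return stmt.executeUpdate();
        }
    }
}
```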

As a consequence, data purging always gets left until after the first release is out the door. The rationale is, “We’ve got six months after launch to implement purging.” (Somehow, they always say “six months.” It’s kind of like a programmer’s estimate of “two weeks.”)

Of course, after launch, there are always emergency releases to fix critical defects or add “must-have” features from marketers tired of waiting for the software to be done. The first six months can slip away pretty quickly, but when that first release launches, a fuse is lit.

Another type of sludge you will commonly encounter is old log files.

Log Files

One log file is like one pile of cow dung—not very valuable, and you’d rather not dig through it. Collect tons of cow dung and it becomes “fertilizer.” Likewise, if you collect enough log files you can discover value.

Left unchecked, however, log files on individual machines are a risk. When log files grow without bound, they’ll eventually fill up their containing filesystem. Whether that’s a volume set aside for logs, the root disk, or the application installation directory (I hope not), it means trouble. When log files fill up the filesystem, they jeopardize stability. That’s because of the different negative effects that can occur when the filesystem is full. On a UNIX system, the last 5 to 10 percent (depending on the configuration of the filesystem) of space is reserved for root. That means an application will start getting I/O errors when the filesystem is 90 or 95 percent full. Of course, if the application is running as root, then it can consume the very last byte of space. On a Windows system, an application can always use the very last byte. In either case, the operating system will report errors back to the application.

What happens next is anyone’s guess. In the best-case scenario, the logging filesystem is separate from any critical data storage (such as transactions), and the application code protects itself well enough that users never realize anything is amiss. Significantly less pleasant, but still tolerable, is a nicely worded error message asking the users to have patience with us and please come back when we’ve got our act together. Several rungs down the ladder is serving a stack trace to the user.

Worse yet, the developers in one system I saw had added a “universal exception handler” to the servlet pipeline. This handler would log any kind of exception. It was reentrant, so if an exception occurred while logging an exception, it would log both the original and the new exception. As soon as the filesystem got full, this poor exception handler went nuts, trying to log an ever-increasing stack of exceptions. Because there were multiple threads, each trying to log its own Sisyphean exception, this application server was able to consume eight entire CPUs—for a little while, anyway. The exceptions, multiplying like Leonardo of Pisa’s rabbits, rapidly consumed all available memory. This was followed shortly by a crash.

Of course, it’s always better to avoid filling up the filesystem in the first place. Log file rotation requires just a few minutes of configuration.

In the case of legacy code, third-party code, or code that doesn’t use one of the excellent logging frameworks available, the logrotate utility is ubiquitous on UNIX. For Windows, you can try building logrotate under Cygwin, or you can hand roll a vbs or bat script to do the job. Logging can be a wonderful aid to transparency. Make sure that all log files will get rotated out and eventually purged, though, or you’ll eventually spend time fixing the tool that’s supposed to help you fix the system.

Log files on production systems have a terrible signal-to-noise ratio. It’s best to get them off the individual hosts as quickly as possible. Ship the log files to a centralized logging server, such as Logstash, where they can be indexed, searched, and monitored. Centralizing them also helps with compliance: some compliance regimes require you to retain logs for years, and individual machines can’t possibly retain them that long. Most machines don’t even live that long, especially in the cloud. Keep the long-term archive on the centralized server and monitor it closely for tampering.

Between data in the database and log files on the disk, persistent data can find plenty of ways to clog up your system. Like a jingle from an old commercial, sludge stuck in memory clogs up your application.

In-Memory Caching

To a long-running server, memory is like oxygen. Cache, left untended, will suck up all the oxygen. Low memory conditions are a threat to both stability and capacity. 

If the number of possible keys has no upper bound, then cache size limits must be enforced and the cache needs some form of cache invalidation. The simplest mechanism is a time-based cache flush. You can also investigate least recently used (LRU) or working-set algorithms, but nine times out of ten, a periodic flush will do.
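A minimal sketch of a size-limited LRU cache using LinkedHashMap's access order. It is not thread-safe as written; in concurrent code, wrap it with Collections.synchronizedMap or use a proper cache library, and consider adding a periodic full flush as described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Bounded LRU cache sketch: the cache must not grow without limit,
 *  so the eldest (least recently used) entry is evicted past maxEntries. */
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder = true gives LRU behavior
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;   // evict the least recently used entry
    }
}
```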

Improper use of caching is the major cause of memory leaks, which in turn lead to horrors like daily server restarts. Nothing gets administrators in the habit of being logged onto production like daily (or nightly) chores.

Remember This

Avoid fiddling.

Human intervention leads to problems. Eliminate the need for recurring human intervention. Your system should run for at least a typical deployment cycle without manual disk cleanups or nightly restarts.

Purge data with application logic.

DBAs can create scripts to purge data, but they don’t always know how the application behaves when data is removed. Maintaining logical integrity, especially if you use an ORM tool, requires the application to purge its own data.

Limit caching.

In-memory caching speeds up applications, until it slows them down. Limit the amount of memory a cache can consume.

Roll the logs.

Don’t keep an unlimited number of log files. Configure log file rotation based on size. If you need to retain them for compliance, do it on a nonproduction server.

5. Fail Fast

If slow responses are worse than no response, the worst must surely be a slow failure response. It’s like waiting through the interminable line at the DMV, only to be told you need to fill out a different form and go back to the end of the line. Can there be any bigger waste of system resources than burning cycles and clock time only to throw away the result?

If the system can determine in advance that it will fail at an operation, it’s always better to fail fast. That way, the caller doesn’t have to tie up any of its capacity waiting and can get on with other work.

How can the system tell whether it will fail? Do we need Deep Learning? Don’t worry, you won’t need to hire a cadre of data scientists.

It’s actually much more mundane than that. There’s a large class of “resource unavailable” failures. For example, when a load balancer gets a connection request but none of the servers in its service pool is functioning, it should immediately refuse the connection. Some configurations have the load balancer queue the connection request for a while in the hopes that a server will become available in a short period of time. This violates the Fail Fast pattern.

The application or service can tell from the incoming request or message roughly what database connections and external integration points will be needed. The service can quickly check out the connections it will need and verify the state of the circuit breakers around the integration points. This is sort of the software equivalent of the chef’s mise en place—gathering all the ingredients needed to perform the request before it begins. If any of the resources are not available, the service can fail immediately, rather than getting partway through the work.
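A hedged sketch of that preflight check, with a hypothetical Breaker interface standing in for whatever circuit breaker implementation is actually in use: if any required integration point is unavailable, the request fails immediately instead of partway through the work.

```java
import java.util.List;

/** Hypothetical "mise en place" check: before doing any real work, verify that
 *  every resource the request will need is currently usable. */
public class PreflightCheck {
    public interface Breaker {
        String name();
        boolean isOpen();
    }

    private final List<Breaker> requiredIntegrationPoints;

    public PreflightCheck(List<Breaker> requiredIntegrationPoints) {
        this.requiredIntegrationPoints = requiredIntegrationPoints;
    }

    /** Throws immediately if any required integration point is unavailable,
     *  instead of failing halfway through the transaction. */
    public void verifyOrFailFast() {
        for (Breaker breaker : requiredIntegrationPoints) {
            if (breaker.isOpen()) {
                throw new IllegalStateException(
                        "failing fast: integration point unavailable: " + breaker.name());
            }
        }
    }
}
```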

Another way to fail fast in a web application is to perform basic parameter-checking in the servlet or controller that receives the request, before talking to the database. This would be a good reason to move some parameter checking out of domain objects into something like a “Query object.”
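For example (the names and rules here are illustrative), a controller can check its parameters and fail fast with a distinct application-level exception before touching any resources, which also keeps bad input from being reported as a system failure.

```java
/** Basic parameter checking in the controller, before any database work.
 *  A validation failure is reported as a client/application error, distinct
 *  from a system failure, so upstream circuit breakers don't trip on bad input. */
public class OrderSearchController {
    public static class InvalidRequestException extends RuntimeException {
        public InvalidRequestException(String message) { super(message); }
    }

    public void search(String customerId, int pageSize) {
        // Fail fast: no connection checkout, no domain objects, no query yet.
        if (customerId == null || customerId.isBlank()) {
            throw new InvalidRequestException("customerId is required");
        }
        if (pageSize < 1 || pageSize > 200) {
            throw new InvalidRequestException("pageSize must be between 1 and 200");
        }
        // ...only now reserve resources and run the query...
    }
}
```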

Even when failing fast, be sure to report a system failure (resources not available) differently than an application failure (parameter violations or invalid state). Reporting a generic “error” message may cause an upstream system to trip a circuit breaker just because some user entered bad data and hit Reload three or four times.

The Fail Fast pattern improves overall system stability by avoiding slow responses. Together with timeouts, failing fast can help avert impending cascading failures. It also helps maintain capacity when the system is under stress because of partial failures.

Remember This

Avoid Slow Responses and Fail Fast.

If your system cannot meet its SLA, inform callers quickly. Don’t make them wait for an error message, and don’t make them wait until they time out. That just makes your problem into their problem.

Reserve resources, verify Integration Points early.

In the theme of “don’t do useless work,” make sure you’ll be able to complete the transaction before you start. If critical resources aren’t available—for example, a popped Circuit Breaker on a required callout—then don’t waste work by getting to that point. The odds of it changing between the beginning and the middle of the transaction are slim.

Use for input validation.

Do basic user input validation even before you reserve resources. Don’t bother checking out a database connection, fetching domain objects, populating them, and calling validate just to find out that a required parameter wasn’t entered.

6. Let It Crash

Remember This

Crash components to save systems.

It may seem counterintuitive to create system-level stability through component-level instability. Even so, it may be the best way to get back to a known good state.

Restart fast and reintegrate.

The key to crashing well is getting back up quickly. Otherwise you risk loss of service when too many components are bouncing. Once a component is back up, it should be reintegrated automatically.

Isolate components to crash independently.

Use Circuit Breakers to isolate callers from components that crash. Use supervisors to determine what the span of restarts should be. Design your supervision tree so that crashes are isolated and don’t affect unrelated functionality.

Don’t crash monoliths.

Large processes with heavy runtimes or long startups are not the right place to apply this pattern. Applications that couple many features into a single process are also a poor choice.