Sunday 3 September 2023

Notes - Leadership

 

  1. An ill-defined mess or pain point
  2. A goal we don’t know how to reach
  3. A solution someone fell in love with
  -> Ask: What is the problem we are trying to solve?
    Hard-to-reach goals require people to come up with new ideas rather than sticking to business as usual.
    For instance, many successful innovations hinge on rethinking what customers really care about, versus what the existing solutions in the market cater to.

    1. Is the statement true?
    When looking at a problem statement, a good first question to ask is: “How do we know this is true? Could this be incorrect?”

    2. Are there simple self-imposed limitations?
    To find self-imposed limitations, simply review the framing of the problem and ask: How are we framing this? Is it too narrow? Are we putting constraints on the solution that aren’t necessarily real?

    3. Is a solution “baked into” the problem framing?
    Are there other things in play? How about our promotion processes? How about informal connections? Do women get less exposure to senior decision makers?

    4. Is the problem clear?


    CHAPTER SUMMARY

    frame the problem

    Before you can reframe a problem, you first have to frame it, giving you something to work on. To do so:

    • Ask, “What problem are we trying to solve?” This triggers the reframing process. You might also ask “Are we solving the right problem?” or “Let’s revisit the problem for a second.”
    • If possible, quickly write a problem statement, describing the problem in a few sentences. Keep it short, and use full sentences.
    • Next to the statement, list the main stakeholders: Who is involved in the problem?

    Once you have the first framing, subject it to a quick review. Look for the following in particular:

    • Is the statement true? Is the elevator actually slow? Compared to what? How do we know this?
    • Are there self-imposed limitations? At TV2, the team asked “Where can we find money?” instead of assuming it had to come out of their own budget.
    • Is a solution “baked into” the problem framing? Often, problems are framed so that they point to a specific answer. This is not necessarily bad, but it’s important to notice.
    • Is the problem clear? Problems don’t always present as problems. Often, you are really looking at a goal or a pain point in disguise.
    • With whom is the problem located? Words like we, me, and they suggest who may “own” the problem. Who is not mentioned or implicated?
    • Are there strong emotions? Emotional words typically indicate areas you should explore in more depth.
    • Are there false trade-offs? Who defined the choices you are presented with? Can you create a better alternative than the ones presented?

    Once you have completed the initial review, step 1 (Frame) is done, setting you up to reframe the problem.


look outside the frame

For each problem, remember to look outside the frame:

  • Don’t get caught up in the visible details.
  • Think about what might be missing from your current framing of the problem.

Once you have done a general review, try to apply the four tactics described in the chapter, summarized here.

1. Look beyond your own expertise

Remember the law of the hammer: we tend to frame problems so that they match our preferred solutions. In Brazil, the finance people focused on the financial metrics of the stock price, overlooking the communications aspect.

Consider the following:

  • What is your own favorite “hammer,” meaning the type of solution you are good at applying?
  • What type of problem does your hammer match?
  • What if the problem was not that kind of problem: what else could it be?

2. Look to prior events

Recall the shouting match with the teacher in which a prior event may have caused the issue: “Did you eat breakfast this morning?”

Consider:

  • How are you framing the problem from a time perspective?
  • Did something important happen before the period of time you are looking at?
  • For that matter, is there something after the time period that you missed? For instance, do people act a certain way because they fear a future outcome?

3. Look for hidden influences

Remember the marshmallow test and how the researchers overlooked the influence of poverty. Or think about how Pierre figured out the influence his bank’s office building had on recruiting.

Consider:

  • Are there stakeholders whose influence you’re missing?
  • Are there higher-level, systemic factors at play that influence the people involved?

4. Look for nonobvious aspects of the situation

Remember the light bulb problem, in which a less salient quality—that light bulbs emit heat—led to a more efficient solution than the one most people come up with.

  • Are there nonobvious aspects of the problem or the situation that you could look into?
  • Do you have data that can help you, or other resources that are already available to you?
  • How is functional fixedness affecting you?

Finally, are there other things “outside the frame” that you are not paying attention to? Incentives? Emotions? People or groups you have forgotten about? Briefly consider this, and then move on.



Sunday 25 December 2022

Time Management


Be selective - select the most important work first

Biggest, hardest and most important task


-- tackle the most important task first thing in the morning

   -- plan, start, and complete the task on time

   Decision - decide to make completing your most important task first a habit

   Discipline - discipline yourself to do what you decided, every day

   Determination - persist with determination until the habit is locked in


What are the most important results you have to achieve to be successful in your work today?

What is the biggest task that you can complete that will make the biggest difference in your life right now?


Set the table - be very clear

Steps:

1. Decide exactly what you want and discuss goals and priorities

2. Write it down on paper

3. Set a deadline for each goal

4. Make a list of the tasks needed to achieve the goals

5. Organise the list into a plan - create a checklist, prioritise it, and order it

6. Take action on your plan - execute it

7. Do something every day that moves you closer to your goals


-- Take a clean sheet of paper and make a list of 10 goals

   - choose the goal that would have the highest impact on your life and work on it every day


Plan every day in advance

80/20 rule - of all your tasks, the most important 20% deliver more benefit than the remaining 80%.

-- make a list of goals, activities, projects, and responsibilities, then prioritise it using the 80/20 rule

-- spend more time on the 20% that matters most


Consider the consequences

Long-term thinking improves short-term decision making.

There is never enough time to do everything, but there is always enough time to do the most important work.


What are my highest-value activities?

What can I, and only I, do that, if done well, will make a real difference?

What is the most valuable use of my time right now?


-- Review your list of tasks regularly and identify which ones have the greatest consequences.


Practise zero-based thinking - it gives you the opportunity to start over. Some things in life simply aren’t worth continuing.

Select one activity to abandon.


The ABCDE Method: place an A, B, C, D, or E next to each item on your list before you begin the first task.

An “A” item is defined as something that is very important, something that you must do. This is a task that will have serious positive or negative consequences if you do it or fail to do it, like visiting a key customer or finishing a report that your boss needs for an upcoming board meeting. These items are the frogs of your life.

If you have more than one A task, you prioritize these tasks by writing “A-1,” “A-2,” “A-3,” and so on in front of each item. Your A-1 task is your biggest, ugliest frog of all.

“Shoulds” versus “Musts”

A “B” item is defined as a task that you should do. But it has only mild consequences. These are the tadpoles of your work life. This means that someone may be unhappy or inconvenienced if you don’t do one of these tasks, but it is nowhere as important as an A task. Returning an unimportant telephone message or reviewing your e-mail would be a B task.

The rule is that you should never do a B task when an A task is left undone. You should never be distracted by a tadpole when a big frog is sitting there waiting to be eaten.

A “C” task is defined as something that would be nice to do but for which there are no consequences at all, whether you do it or not. C tasks include phoning a friend, having coffee or lunch with a coworker, and completing some personal business during work hours. These sorts of activities have no effect at all on your work life.

A “D” task is defined as something you can delegate to someone else. The rule is that you should delegate everything that someone else can do so that you can free up more time for the A tasks that only you can do.

An “E” task is defined as something that you can eliminate altogether, and it won’t make any real difference. This may be a task that was important at one time but is no longer relevant to you or anyone else. Often it is something you continue to do out of habit or because you enjoy it. But every minute that you spend on an E task is time taken away from a task or activity that can make a real difference in your life.

1. Review your work list right now and put an A, B, C, D, or E next to each task or activity. Select your A-1 job or project and begin on it immediately. Discipline yourself to do nothing else until this one job is complete.

2. Practice this ABCDE Method every day for the next month on every work or project list before you begin work. After a month, you will have developed the habit of setting and working on your highest-priority tasks, and your future will be assured!


Focus on the Key Result Areas

What one skill would have the greatest positive impact on my career?


Grade your key result areas and determine your one key skill

Get feedback from your boss, coworkers and family



      


Tuesday 1 November 2022

 Stability Patterns

1. Timeouts
The timeout is a simple mechanism allowing you to stop waiting for an answer once you think it won’t come. 
Well-placed timeouts provide fault isolation—a problem in some other service or device does not have to become your problem.

Commercial software client libraries are notoriously devoid of timeouts. These libraries often do direct socket calls on behalf of the system. By hiding the socket from your code, they also prevent you from setting vital timeouts.
Any resource pool can be exhausted. It’s essential that any resource pool that blocks threads must have a timeout to ensure that calling threads eventually unblock, whether resources become available or not.

Also beware of language-level synchronization or mutexes. Always use the form that takes a timeout argument.
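For example, with java.util.concurrent the timed form of a lock looks like the sketch below (the class name and the 250 ms timeout are invented for illustration): a caller that cannot acquire the lock unblocks and can report a failure instead of hanging forever.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class TimedLockExample {
    private final ReentrantLock lock = new ReentrantLock();

    void updateSharedState() throws InterruptedException {
        // Timed form: give up after 250 ms instead of blocking indefinitely.
        if (!lock.tryLock(250, TimeUnit.MILLISECONDS)) {
            // The calling thread unblocks and can degrade, report, or retry later.
            throw new IllegalStateException("could not acquire lock within 250 ms");
        }
        try {
            // ... critical section ...
        } finally {
            lock.unlock();
        }
    }
}
```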

Use a generic gateway to provide the template for connection handling, error handling, query execution, and result processing. That way you only need to get it right in one place, and calling code can provide just the essential logic. Collecting this common interaction pattern into a single class also makes it easier to apply the Circuit Breaker pattern.

Timeouts are often found in the company of retries. Under the philosophy of “best effort,” the software attempts to repeat an operation that timed out. Immediately retrying an operation after a failure has a number of consequences, but only some of them are beneficial. If the operation failed because of any significant problem, it’s likely to fail again if retried immediately.

From the client’s perspective, making me wait longer is a very bad thing. If you cannot complete an operation because of some timeout, it is better for you to return a result. It can be a failure, a success, or a note that you’ve queued the work for later execution (if I should care about the distinction). In any case, just come back with an answer. Making me wait while you retry the operation might push your response time past my timeout. It certainly keeps my resources busy longer than needed.
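A minimal sketch of that idea (the remote call and the fallback value are placeholders, not from the book): the caller waits a bounded amount of time and then comes back with an answer, even if that answer is only a fallback.

```java
import java.util.concurrent.*;

class BoundedCall {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    String fetchQuote() {
        Future<String> result = pool.submit(() -> callRemoteService());
        try {
            return result.get(2, TimeUnit.SECONDS);   // bounded wait
        } catch (TimeoutException e) {
            result.cancel(true);                      // don't leave the worker blocked
            return "QUOTE_UNAVAILABLE";               // come back with an answer anyway
        } catch (ExecutionException e) {
            return "QUOTE_UNAVAILABLE";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "QUOTE_UNAVAILABLE";
        }
    }

    private String callRemoteService() {
        return "42.00";   // stand-in for a slow integration point
    }
}
```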


Timeouts have natural synergy with circuit breakers. A circuit breaker can tabulate timeouts, tripping to the “off” state if too many occur.


Remember This

Apply Timeouts to Integration Points, Blocked Threads, and Slow Responses.

The Timeouts pattern prevents calls to Integration Points from becoming Blocked Threads. Thus, timeouts avert Cascading Failures.

Apply Timeouts to recover from unexpected failures.

When an operation is taking too long, sometimes we don’t care why…we just need to give up and keep moving. The Timeouts pattern lets us do that.

Consider delayed retries.

Most of the explanations for a timeout involve problems in the network or the remote system that won’t be resolved right away. Immediate retries are liable to hit the same problem and result in another timeout. That just makes the user wait even longer for her error message. Most of the time, you should queue the operation and retry it later.

2. Circuit Breaker

Principle : detect excess usage, fail first, and open the circuit. 
The circuit breaker exists to allow one subsystem (an electrical circuit) to fail (excessive current draw, possibly from a short circuit) without destroying the entire system (the house). Furthermore, once the danger has passed, the circuit breaker can be reset to restore full function to the system.

In the normal “closed” state, the circuit breaker executes operations as usual. These can be calls out to another system, or they can be internal operations that are subject to timeout or other execution failure. If the call succeeds, nothing extraordinary happens. If it fails, however, the circuit breaker makes a note of the failure. Once the number of failures (or the frequency of failures, in more sophisticated cases) exceeds a threshold, the circuit breaker trips and “opens” the circuit, as shown in the following figure.

[Figure: circuit breaker state diagram showing the closed, open, and half-open states]

When the circuit is “open,” calls to the circuit breaker fail immediately, without any attempt to execute the real operation. After a suitable amount of time, the circuit breaker decides that the operation has a chance of succeeding, so it goes into the “half-open” state. In this state, the next call to the circuit breaker is allowed to execute the dangerous operation. Should the call succeed, the circuit breaker resets and returns to the “closed” state, ready for more routine operation. If this trial call fails, however, the circuit breaker returns to the open state until another timeout elapses.
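A rough sketch of that state machine (the threshold and open-timeout values are made up; a real implementation also needs careful concurrency handling and should log state changes, as noted below):

```java
import java.util.function.Supplier;

class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold = 5;
    private final long openTimeoutMillis = 30_000;

    private State state = State.CLOSED;
    private int failureCount = 0;
    private long openedAt = 0;

    synchronized <T> T call(Supplier<T> operation, T fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openTimeoutMillis) {
                return fallback;                 // fail immediately, don't attempt the operation
            }
            state = State.HALF_OPEN;             // let one trial call through
        }
        try {
            T result = operation.get();
            failureCount = 0;
            state = State.CLOSED;                // trial (or normal) call succeeded: reset
            return result;
        } catch (RuntimeException e) {
            failureCount++;
            if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
                state = State.OPEN;              // trip the breaker
                openedAt = System.currentTimeMillis();
            }
            return fallback;                     // a real breaker might rethrow instead
        }
    }
}
```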

Circuit breakers are a way to automatically degrade functionality when the system is under stress. No matter the fallback strategy, it can have an impact on the business of the system. Therefore, it’s essential to involve the system’s stakeholders when deciding how to handle calls made when the circuit is open. For example, should a retail system accept an order if it can’t confirm availability of the customer’s items? What about if it can’t verify the customer’s credit card or shipping address? Of course, this conversation is not unique to the use of a circuit breaker, but discussing the circuit breaker can be a more effective way of broaching the topic than asking for a requirements document.


I like the Leaky Bucket pattern from Pattern Languages of Program Design 2 [VCK96]. It’s a simple counter that you can increment every time you observe a fault. In the background, a thread or timer decrements the counter periodically (down to zero, of course.) If the count exceeds a threshold, then you know that faults are arriving quickly.
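A sketch of that Leaky Bucket counter, with an arbitrary drain interval and threshold:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class LeakyBucket {
    private final AtomicInteger faults = new AtomicInteger(0);
    private final int threshold;

    LeakyBucket(int threshold, long drainIntervalMillis) {
        this.threshold = threshold;
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Background "leak": decrement periodically, never below zero.
        timer.scheduleAtFixedRate(
            () -> faults.updateAndGet(n -> Math.max(0, n - 1)),
            drainIntervalMillis, drainIntervalMillis, TimeUnit.MILLISECONDS);
    }

    void recordFault() {
        faults.incrementAndGet();
    }

    boolean isOverflowing() {
        // Faults are arriving faster than the bucket drains.
        return faults.get() > threshold;
    }
}
```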

The state of the circuit breakers in a system is important to another set of stakeholders: operations. Changes in a circuit breaker’s state should always be logged, and the current state should be exposed for querying and monitoring. In fact, the frequency of state changes is a useful metric to chart over time; it is a leading indicator of problems elsewhere in the enterprise. Likewise, Operations needs some way to directly trip or reset the circuit breaker. The circuit breaker is also a convenient place to gather metrics about call volumes and response times.

Circuit breakers are effective at guarding against integration points, cascading failures, unbalanced capacities, and slow responses. They work so closely with timeouts that they often track timeout failures separately from execution failures.

Remember This

Don’t do it if it hurts.

Circuit Breaker is the fundamental pattern for protecting your system from all manner of Integration Points problems. When there’s a difficulty with Integration Points, stop calling it!

Use together with Timeouts.

Circuit Breaker is good at avoiding calls when Integration Points has a problem. The Timeouts pattern indicates that there’s a problem in Integration Points.

Expose, track, and report state changes.

Popping a Circuit Breaker always indicates something abnormal. It should be visible to Operations. It should be reported, recorded, trended, and correlated.


3. Bulkheads

In a ship, bulkheads are partitions that, when sealed, divide the ship into separate, watertight compartments. With hatches closed, a bulkhead prevents water from moving from one section to another. In this way, a single penetration of the hull does not irrevocably sink the ship. The bulkhead enforces a principle of damage containment.

You can employ the same technique. By partitioning your systems, you can keep a failure in one part of the system from destroying everything. Physical redundancy is the most common form of bulkheads. If there are four independent servers, then a hardware failure in one can’t affect the others. Likewise, if there are two application instances running on a server and one crashes, the other will still be running (unless, of course, the first one crashed because of some external influence that would also affect the second).

You can partition the threads inside a single process, with separate thread groups dedicated to different functions. For example, it’s often helpful to reserve a pool of request-handling threads for administrative use. That way, even if all request-handling threads on the application server are hung, it can still respond to admin requests—perhaps to collect data for postmortem analysis or a request to shut down.
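A sketch of that kind of partitioning, with invented pool sizes and names: the admin pool stays responsive even when the request pool is exhausted or hung.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PartitionedServer {
    // Separate pools act as bulkheads: exhausting one does not drain the other.
    private final ExecutorService requestPool = Executors.newFixedThreadPool(50);
    private final ExecutorService adminPool   = Executors.newFixedThreadPool(2);

    void handleUserRequest(Runnable work) {
        requestPool.submit(work);
    }

    void handleAdminRequest(Runnable work) {
        // Admin traffic (health checks, diagnostics, shutdown) still gets served
        // even when all 50 request-handling threads are blocked.
        adminPool.submit(work);
    }
}
```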

Remember This

Save part of the ship.

The Bulkheads pattern partitions capacity to preserve partial functionality when bad things happen.

Pick a useful granularity.

You can partition thread pools inside an application, CPUs in a server, or servers in a cluster.

Consider Bulkheads particularly with shared services models.

Failures in service-oriented or microservice architectures can propagate very quickly. If your service goes down because of a Chain Reaction, does the entire company come to a halt? Then you’d better put in some Bulkheads.

4. Steady State


Compliance note for log files:
Some compliance regimes require you to retain logs for years. Individual machines can’t possibly retain logs that long. Most of the machines don’t live that long, especially if you’re in the cloud! The best thing to do is get logs off of production machines as quickly as possible. Store them on a centralized server and monitor it closely for tampering.



The third edition of Roget’s Thesaurus offers the following definition for the word fiddling: “To handle something idly, ignorantly, or destructively.” It offers helpful synonyms such as fool, meddle, tamper, tinker, and monkey. Fiddling is often followed by the “ohnosecond”—that very short moment in time during which you realize that you have pressed the wrong key and brought down a server, deleted vital data, or otherwise damaged the peace and harmony of stable operations.

Every single time a human touches a server is an opportunity for unforced errors. I know of one incident in which an engineer, attempting to be helpful, observed that a server’s root disk mirror was out of sync. He executed a command to “resilver” the mirror, bringing the two disks back into synchronization. Unfortunately, he made a typo and synced the good root disk from the new, totally empty drive that had just been swapped in to replace a bad disk, thereby instantly annihilating the operating system on that server.

It’s best to keep people off production systems to the greatest extent possible. If the system needs a lot of crank-turning and hand-holding to keep running, then administrators develop the habit of staying logged in all the time. This situation probably indicates that the servers are “pets” rather than “cattle” and inevitably leads to fiddling. To that end, the system should be able to run at least one release cycle without human intervention. The logical extreme on the “no fiddling” scale is immutable infrastructure—it can’t be fiddled with! (See Automated Deployments, for more about immutable infrastructure.)

“One release cycle” may be pretty tough if the system is deployed once a quarter. On the other hand, a microservice being continuously deployed from version control should be pretty easy to stabilize for a release cycle.

Unless the system is crashing every day (in which case, look for the presence of the stability antipatterns), the most common reason for logging in will probably be cleaning up log files or purging data.

Any mechanism that accumulates resources (whether it’s log files in the filesystem, rows in the database, or caches in memory) is like a bucket from a high-school calculus problem. The bucket fills up at a certain rate, based on the accumulation of data. It must be drained at the same rate, or greater, or it will eventually overflow. When this bucket overflows, bad things happen: servers go down, databases get slow or throw errors, response times head for the stars. The Steady State pattern says that for every mechanism that accumulates a resource, some other mechanism must recycle that resource. Let’s look at several types of sludge that can accumulate and how to avoid the need for fiddling.

Data Purging

It certainly seems like a simple enough principle. Computing resources are always finite; therefore, you cannot continually increase consumption without limit. Still, in the rush of excitement about rolling out a new killer application, the next great mission-critical, bet-the-company whatever, data purging always gets the short end of the stick. It certainly doesn’t demo as well as…well, anything demos better than purging, really. It sometimes seems that you’ll be lucky if the system ever runs at all in the real world. The notion that it’ll run long enough to accumulate too much data to handle seems like a “high-class problem”—the kind of problem you’d love to have.

Nevertheless, someday your little database will grow up. When it hits the teenage years—about two in human years—it’ll get moody, sullen, and resentful. In the worst case, it’ll start undermining the whole system (and it will probably complain that nobody understands it, too).

The most obvious symptom of data growth will be steadily increasing I/O rates on the database servers. You may also see increasing latency at constant loads.

Data purging is nasty, detail-oriented work. Referential integrity constraints in a relational database are half the battle. It can be difficult to cleanly remove obsolete data without leaving orphaned rows. The other half of the battle is ensuring that applications still work once the data is gone. That takes coding and testing.

There are few general rules here. Much depends on the database and libraries in use. RDBMS plus ORM tends to deal badly with dangling references, for example, whereas a document-oriented database won’t even notice.

As a consequence, data purging always gets left until after the first release is out the door. The rationale is, “We’ve got six months after launch to implement purging.” (Somehow, they always say “six months.” It’s kind of like a programmer’s estimate of “two weeks.”)

Of course, after launch, there are always emergency releases to fix critical defects or add “must-have” features from marketers tired of waiting for the software to be done. The first six months can slip away pretty quickly, but when that first release launches, a fuse is lit.
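When the purge finally does get built, it can be as plain as the sketch below, done in application code so the logic matches how the application reads the data (the table and column names here are invented): delete child rows before parent rows so nothing is orphaned. A real purge would also batch the deletes to avoid long-held locks, and be covered by tests.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

class OrderPurger {
    // Purge orders older than `retentionDays`, children first.
    void purgeOldOrders(Connection conn, int retentionDays) throws SQLException {
        Timestamp cutoff = Timestamp.from(Instant.now().minus(retentionDays, ChronoUnit.DAYS));

        try (PreparedStatement items = conn.prepareStatement(
                 "DELETE FROM order_items WHERE order_id IN " +
                 "  (SELECT id FROM orders WHERE created_at < ?)");
             PreparedStatement orders = conn.prepareStatement(
                 "DELETE FROM orders WHERE created_at < ?")) {
            items.setTimestamp(1, cutoff);
            items.executeUpdate();          // remove child rows first: no orphans
            orders.setTimestamp(1, cutoff);
            orders.executeUpdate();         // then the parent rows
        }
    }
}
```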

Another type of sludge you will commonly encounter is old log files.

Log Files

One log file is like one pile of cow dung—not very valuable, and you’d rather not dig through it. Collect tons of cow dung and it becomes “fertilizer.” Likewise, if you collect enough log files you can discover value.

Left unchecked, however, log files on individual machines are a risk. When log files grow without bound, they’ll eventually fill up their containing filesystem. Whether that’s a volume set aside for logs, the root disk, or the application installation directory (I hope not), it means trouble. When log files fill up the filesystem, they jeopardize stability. That’s because of the different negative effects that can occur when the filesystem is full. On a UNIX system, the last 5--10 percent (depending on the configuration of the filesystem) of space is reserved for root. That means an application will start getting I/O errors when the filesystem is 90 or 95 percent full. Of course, if the application is running as root, then it can consume the very last byte of space. On a Windows system, an application can always use the very last byte. In either case, the operating system will report errors back to the application.

What happens next is anyone’s guess. In the best-case scenario, the logging filesystem is separate from any critical data storage (such as transactions), and the application code protects itself well enough that users never realize anything is amiss. Significantly less pleasant, but still tolerable, is a nicely worded error message asking the users to have patience with us and please come back when we’ve got our act together. Several rungs down the ladder is serving a stack trace to the user.

Worse yet, the developers in one system I saw had added a “universal exception handler” to the servlet pipeline. This handler would log any kind of exception. It was reentrant, so if an exception occurred while logging an exception, it would log both the original and the new exception. As soon as the filesystem got full, this poor exception handler went nuts, trying to log an ever-increasing stack of exceptions. Because there were multiple threads, each trying to log its own Sisyphean exception, this application server was able to consume eight entire CPUs—for a little while, anyway. The exceptions, multiplying like Leonardo of Pisa’s rabbits, rapidly consumed all available memory. This was followed shortly by a crash.

Of course, it’s always better to avoid filling up the filesystem in the first place. Log file rotation requires just a few minutes of configuration.

In the case of legacy code, third-party code, or code that doesn’t use one of the excellent logging frameworks available, the logrotate utility is ubiquitous on UNIX. For Windows, you can try building logrotate under Cygwin, or you can hand roll a vbs or bat script to do the job. Logging can be a wonderful aid to transparency. Make sure that all log files will get rotated out and eventually purged, though, or you’ll eventually spend time fixing the tool that’s supposed to help you fix the system.
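With a logging framework, size-based rotation is usually a few lines of configuration. For example, a sketch with java.util.logging (the file name, size limit, and file count are arbitrary): the handler rolls to a new file at roughly 10 MB and keeps at most five generations.

```java
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

class RotatedLogging {
    static Logger createLogger() throws IOException {
        Logger log = Logger.getLogger("app");
        // Rotate at ~10 MB, keep 5 generations (app-0.log ... app-4.log), append on restart.
        FileHandler handler = new FileHandler("app-%g.log", 10_000_000, 5, true);
        handler.setFormatter(new SimpleFormatter());
        log.addHandler(handler);
        return log;
    }
}
```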

Log files on production systems have a terrible signal-to-noise ratio. It’s best to get them off the individual hosts as quickly as possible. Ship the log files to a centralized logging server, such as Logstash, where they can be indexed, searched, and monitored.

Between data in the database and log files on the disk, persistent data can find plenty of ways to clog up your system. Like a jingle from an old commercial, sludge stuck in memory clogs up your application.

In-Memory Caching

To a long-running server, memory is like oxygen. Cache, left untended, will suck up all the oxygen. Low memory conditions are a threat to both stability and capacity. 

If the number of possible keys has no upper bound, then cache size limits must be enforced and the cache needs some form of cache invalidation. The simplest mechanism is a time-based cache flush. You can also investigate least recently used (LRU) or working-set algorithms, but nine times out of ten, a periodic flush will do.
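A sketch of a size-limited LRU cache built on LinkedHashMap’s access-order mode (the capacity is whatever the caller passes in); a periodic flush could be layered on top in the same way as the timer in the Leaky Bucket sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true);     // accessOrder = true: iteration order is least recently used first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the limit is exceeded,
        // so the cache cannot grow without bound and starve the heap.
        return size() > maxEntries;
    }
}
```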

Improper use of caching is the major cause of memory leaks, which in turn lead to horrors like daily server restarts. Nothing gets administrators in the habit of being logged onto production like daily (or nightly) chores.

Remember This

Avoid fiddling.

Human intervention leads to problems. Eliminate the need for recurring human intervention. Your system should run for at least a typical deployment cycle without manual disk cleanups or nightly restarts.

Purge data with application logic.

DBAs can create scripts to purge data, but they don’t always know how the application behaves when data is removed. Maintaining logical integrity, especially if you use an ORM tool, requires the application to purge its own data.

Limit caching.

In-memory caching speeds up applications, until it slows them down. Limit the amount of memory a cache can consume.

Roll the logs.

Don’t keep an unlimited amount of log files. Configure log file rotation based on size. If you need to retain them for compliance, do it on a nonproduction server.


5. Fail Fast

If slow responses are worse than no response, the worst must surely be a slow failure response. It’s like waiting through the interminable line at the DMV, only to be told you need to fill out a different form and go back to the end of the line. Can there be any bigger waste of system resources than burning cycles and clock time only to throw away the result?

If the system can determine in advance that it will fail at an operation, it’s always better to fail fast. That way, the caller doesn’t have to tie up any of its capacity waiting and can get on with other work.

How can the system tell whether it will fail? Do we need Deep Learning? Don’t worry, you won’t need to hire a cadre of data scientists.

It’s actually much more mundane than that. There’s a large class of “resource unavailable” failures. For example, when a load balancer gets a connection request but not one of the servers in its service pool is functioning, it should immediately refuse the connection. Some configurations have the load balancer queue the connection request for a while in the hopes that a server will become available in a short period of time. This violates the Fail Fast pattern.

The application or service can tell from the incoming request or message roughly what database connections and external integration points will be needed. The service can quickly check out the connections it will need and verify the state of the circuit breakers around the integration points. This is sort of the software equivalent of the chef’s mise en place—gathering all the ingredients needed to perform the request before it begins. If any of the resources are not available, the service can fail immediately, rather than getting partway through the work.

Another way to fail fast in a web application is to perform basic parameter-checking in the servlet or controller that receives the request, before talking to the database. This would be a good reason to move some parameter checking out of domain objects into something like a “Query object.”

Even when failing fast, be sure to report a system failure (resources not available) differently than an application failure (parameter violations or invalid state). Reporting a generic “error” message may cause an upstream system to trip a circuit breaker just because some user entered bad data and hit Reload three or four times.
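A sketch pulling these ideas together (the class, parameters, and availability flags are invented; in practice the flags would come from a circuit breaker’s state or a quick connection-pool check): validate the cheap things first, verify required resources next, and signal application failures differently from system failures.

```java
class OrderService {
    private final boolean paymentGatewayAvailable;   // e.g. derived from a circuit breaker's state
    private final boolean inventoryDbAvailable;      // e.g. a quick connection-pool check

    OrderService(boolean paymentGatewayAvailable, boolean inventoryDbAvailable) {
        this.paymentGatewayAvailable = paymentGatewayAvailable;
        this.inventoryDbAvailable = inventoryDbAvailable;
    }

    String placeOrder(String customerId, int quantity) {
        // 1. Cheap parameter checks before any resource is touched.
        if (customerId == null || customerId.isBlank() || quantity <= 0) {
            throw new IllegalArgumentException("invalid order parameters");    // application failure
        }
        // 2. "Mise en place": verify the resources this request will need.
        if (!paymentGatewayAvailable || !inventoryDbAvailable) {
            throw new IllegalStateException("required resources unavailable");  // system failure
        }
        // 3. Only now start the real (expensive) work.
        return reserveInventoryAndCharge(customerId, quantity);
    }

    private String reserveInventoryAndCharge(String customerId, int quantity) {
        return "ORDER-PLACED";   // stand-in for the actual transaction
    }
}
```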

The Fail Fast pattern improves overall system stability by avoiding slow responses. Together with timeouts, failing fast can help avert impending cascading failures. It also helps maintain capacity when the system is under stress because of partial failures.

Remember This

Avoid Slow Responses and Fail Fast.

If your system cannot meet its SLA, inform callers quickly. Don’t make them wait for an error message, and don’t make them wait until they time out. That just makes your problem into their problem.

Reserve resources, verify Integration Points early.

In the theme of “don’t do useless work,” make sure you’ll be able to complete the transaction before you start. If critical resources aren’t available—for example, a popped Circuit Breaker on a required callout—then don’t waste work by getting to that point. The odds of it changing between the beginning and the middle of the transaction are slim.

Use for input validation.

Do basic user input validation even before you reserve resources. Don’t bother checking out a database connection, fetching domain objects, populating them, and calling validate just to find out that a required parameter wasn’t entered.

6. Let It Crash

Remember This

Crash components to save systems.

It may seem counterintuitive to create system-level stability through component-level instability. Even so, it may be the best way to get back to a known good state.

Restart fast and reintegrate.

The key to crashing well is getting back up quickly. Otherwise you risk loss of service when too many components are bouncing. Once a component is back up, it should be reintegrated automatically.

Isolate components to crash independently.

Use Circuit Breakers to isolate callers from components that crash. Use supervisors to determine what the span of restarts should be. Design your supervision tree so that crashes are isolated and don’t affect unrelated functionality.

Don’t crash monoliths.

Large processes with heavy runtimes or long startups are not the right place to apply this pattern. Applications that couple many features into a single process are also a poor choice.
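The pattern comes from Erlang-style supervision, but the “restart fast and reintegrate” idea can be sketched in Java too (the worker body is a placeholder; real supervisors such as Erlang/OTP or Akka also limit restart rates and scope restarts within a supervision tree):

```java
class Supervisor {
    private final Runnable workerBody;

    Supervisor(Runnable workerBody) {
        this.workerBody = workerBody;
    }

    // Supervise in a simple loop: if the worker dies for any reason,
    // start a fresh one from a clean, known-good initial state.
    void superviseForever() throws InterruptedException {
        while (true) {
            Thread worker = new Thread(workerBody);
            worker.start();
            worker.join();   // returns when the worker finishes or crashes
            // Restart fast: the next iteration reintegrates a new worker immediately.
        }
    }
}
```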