Building highly available systems in Erlang

Joe Armstrong

The types of highly available systems

How do you get ten nines availability? Why even ten? The number is arbitrary.

  • Washing machine / pacemaker (!) Very specialised, embedded. Processor and the data are in the same place, so easy to program.
  • Deep space mission The only things that’ll be left of humans after we’re gone (so far)
  • Aircraft control system Wait until the plane is on the ground before changing the software. Shut down the nuclear plant before updating the software.
  • Internet systems This talk. Data and processing separate and distributed.

Internet available systems

Systems like this need highly available data.

Computation can be performed anywhere.

We want many routes to the data.

Where is my data?

If there are 10 million computers and my data is on ten of them, I can’t ask each computer if it has my data.

Algorithm: Chord
  • Hash the computers’ IPs
  • Sort the hashes
  • Hash the lookup key
  • Put the data on the first machine whose hash is equal to or follows the key’s hash, wrapping around to the start of the sorted list
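
A minimal sketch of the lookup steps, assuming every machine's IP is known up front and using the successor rule from Chord (the key belongs to the first machine whose hash is equal to or greater than the key's hash, wrapping around). Real Chord routes through finger tables so that no machine needs the whole list; that part is left out. The `HashRing` class, the IPs and the key are made up for illustration.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string onto the hash ring (SHA-1, read as a big integer)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class HashRing:
    def __init__(self, ips):
        # Hash every machine's IP and keep the (hash, ip) pairs sorted.
        self._ring = sorted((ring_hash(ip), ip) for ip in ips)
        self._hashes = [h for h, _ in self._ring]

    def lookup(self, key: str, replicas: int = 1):
        """Return the machines responsible for key: its successor and the next ones."""
        start = bisect.bisect_left(self._hashes, ring_hash(key))
        count = min(replicas, len(self._ring))
        # The modulo wraps past the end of the ring back to the first machine.
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = HashRing(["10.0.0.%d" % i for i in range(1, 21)])  # hypothetical IPs
owners = ring.lookup("user:42", replicas=3)
```

Any client that knows the same machine list computes the same `owners`, so nobody has to ask each computer whether it holds the data.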

How do I store replicas in a consistent way?

Collect data in parallel. Save data in parallel.

The problem of reliable storage of data has been solved. – Joe Armstrong

The six properties of available systems

1. Isolation

Formal definition: “my program should not fuck up your program”.

2. Concurrency

Programming in sequential languages is difficult because the world isn’t sequential.

“Embarrassingly parallel” problems: web servers.

3. Must be able to detect failure

If you can’t detect it, you can’t fix it. This must work across machine boundaries. If a machine dies you can’t tell if it’s the machine or the network.

If you make things synchronous you’ll bugger things up.

4. Fault identification

It’s not enough to know that there is a fault.

5. Live code upgrade

Wow. Why would you want to stop it? We want zero downtime. Early requirement for Erlang. At Ericsson you got told off if your system was down for more than four minutes in a year. That’s ~ five nines availability.
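
The "nines" arithmetic can be checked with a few lines. This is only a back-of-the-envelope sketch; it assumes a 365.25 day year, and the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines (5 means 99.999%)."""
    return MINUTES_PER_YEAR * 10 ** -nines


for n in (3, 4, 5):
    print(n, "nines:", round(downtime_minutes_per_year(n), 2), "minutes/year")
```

Five nines allows about 5.26 minutes of downtime a year, so a four minute budget sits just inside it.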

6. Stable storage

Suppose all computers crash: you want your data back.

Other thoughts

Jim Gray

Fail fast: software should either function correctly or detect the fault, signal failure and stop operating.

If you’ve got a single process you can’t let it die. If you have millions of processes, you can let a few thousand die.

See society: we wander around doing what we want, and if we fall down with a heart attack the ambulance rocks up and fixes us. Send in the medic.

Have threads that detect the failure of other threads.


The big idea is messaging. – Alan Kay

How to satisfy the properties in Erlang

Use a programming language designed for it. Armstrong can only think of one. Ha ha.

  1. Isolation

    Isolate processes so that they can’t damage one another. No shared memory, lightweight.

    Treating failure with shared memory is very difficult.

  2. Concurrency

    Run the processes in parallel. Hardware design will mean that soon we’re able to run many processes concurrently.

    Erlang has network transparency so the processes might be running elsewhere.

  3. Failure detecting

    Erlang processes can detect failure. This is out of band: not a normal message. It’s a signal. It’s messy if you handle failure in the same place that you handle normal stuff.

    Fix the failure somewhere else. What does failing A have to send to running B so that B can carry on doing the job that A didn’t manage?

  4. Fault identification

    Special processes that handle errors.

        receive
            {'EXIT', Pid, Why} ->
                error_log:log_error({erlang:now(), Pid, Why})
        end

  5. Live code upgrade

    In Erlang you can modify code as it runs.

    f1(X) ->
        foo:bar(X), % Call the latest version of this module
        bar(X). % Call this version of bar
    bar(X) ->

  6. Stable storage

    Use Mnesia, or use third-party storage.

Fault tolerance implies scalability


Armstrong’s PhD thesis

High availability at Heroku

Mark McGranaghan @mmcgrana

Lessons learned from PaaS at Heroku

Everyone was doing the same over and over again: routing, runtime, data. Package these three things, and then apps can use them.

Runs on AWS.

What has to be available?

API, routing, packaging, data, logging, runtime.

1,000 instances (virtualised servers), 1,000,000 apps.


Platform available HA routing

Simply, a load balancer. If you lose a back end, the balancer makes it transparent to the users.

Crashes and supervision

Code crashes, get used to it. Locally, things like upstart work fine. On a distributed platform, you need a global view of application health. A supervisor detects exit codes and restarts the app.
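
A rough sketch of the supervisor idea on a single machine: run a command, look at its exit code, restart it if it crashed. This is not how Heroku's platform does it; `supervise`, the restart limit and the always-crashing child are assumptions made for illustration.

```python
import subprocess
import sys


def supervise(command, max_restarts=3):
    """Run command, restarting it after every non-zero exit, up to max_restarts times.

    Returns the list of exit codes seen, oldest first.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.run(command).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean exit: nothing to restart
    return exit_codes


# A hypothetical app that always crashes with exit code 3.
crashing_app = [sys.executable, "-c", "import sys; sys.exit(3)"]
codes = supervise(crashing_app, max_restarts=2)
```

A distributed version would need the global view of health mentioned above, since a local loop like this dies with its machine.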

Crashes as the only code path

Crash is the same path as normal app exit. This enables you to handle the failure of an instance, eg if AWS nukes it.

Error kernel

Gets more reliable as you make it smaller.

Message passing

Nodes communicate to a message broker with narrow, versionable JSON messages.

At any given time there are going to be apps that speak the old version, and apps that speak the new version.
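
One way to picture narrow, versionable messages: the receiver accepts both versions and normalises them to one internal shape. The field names (`version`, `app`, `processes`) are invented for this sketch and are not Heroku's message format.

```python
import json


def handle_message(raw: str) -> dict:
    """Normalise a message from either protocol version into one internal shape."""
    msg = json.loads(raw)
    version = msg.get("version", 1)  # assumption: unversioned messages are version 1
    if version == 1:
        return {"app": msg["app"], "processes": 1}
    if version == 2:
        return {"app": msg["app"], "processes": msg["processes"]}
    raise ValueError("unsupported message version: %r" % version)


old = handle_message('{"app": "blog"}')
new = handle_message('{"version": 2, "app": "blog", "processes": 4}')
```

Because old senders keep working, new code can be deployed to receivers first and to senders later.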

See Erlang

Heroku tries to solve the same problems that Erlang does. It’s not surprising that there is a similarity in the approaches.

One broker is a single point of failure. To get around that:

Publish one, subscribe many

If one broker fails, transparently fail over in the client to a different broker.

Graceful degradation

If you’ve got distributed services and you can’t read from one, gracefully degrade.

If you can’t write to a service, persist an ‘owed’ write locally. When the service comes back online, replay the owed write. All billing writes tickets locally, then asynchronously writes the information to the service.
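
A minimal sketch of owed writes, assuming an in-memory queue (a real one would keep the owed writes on disk) and a `send` callable that raises `ConnectionError` while the service is down. `OwedWriter` and `fake_send` are hypothetical names.

```python
from collections import deque


class OwedWriter:
    """Write to a remote service, recording an 'owed' write locally when it is down."""

    def __init__(self, send):
        self._send = send      # callable that raises ConnectionError when the service is down
        self._owed = deque()   # local record of writes still owed, oldest first

    def write(self, record):
        self._owed.append(record)  # ticket locally first, then try to settle
        self.flush()

    def flush(self):
        """Settle owed writes in order; stop at the first failure and keep the rest."""
        while self._owed:
            try:
                self._send(self._owed[0])
            except ConnectionError:
                return False
            self._owed.popleft()
        return True

    @property
    def owed(self):
        return list(self._owed)


state = {"up": False}
delivered = []


def fake_send(record):  # hypothetical stand-in for the billing service
    if not state["up"]:
        raise ConnectionError("service down")
    delivered.append(record)


writer = OwedWriter(fake_send)
writer.write({"invoice": 1})   # service is down: the write is owed
state["up"] = True
writer.write({"invoice": 2})   # service is back: both writes are settled in order
```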


Everything outside of architecture: culture, organisation, etc.

Heroku recently had a problem. They had a post-mortem of the incident, which took a team about a week. They’ve noticed that most of the causes weren’t completely technical. They involved people.

What are the biggest causes of availability failures? Not the architecture. Failed deploys (too fast, too slow); bad visibility; cascading feedback.


Has to be repeatable.

bin/ship --component api --version 1234

Some initial pool of deploy servers: 0.5% of the nodes. Use the data coming back from the nodes to determine whether the deploy will be successful. When ready, deploy to the others over a period of between minutes and weeks.
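
Picking the initial pool can be made repeatable by hashing node names into buckets, so the same nodes are chosen on every run and a larger percentage always includes the earlier pool. This is a sketch under those assumptions; `in_rollout` and the node names are invented, and nothing here is claimed to be what `bin/ship` does.

```python
import hashlib


def in_rollout(node_id: str, percent: float) -> bool:
    """Deterministically decide whether node_id is in the first `percent` of the fleet."""
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999, in 0.01% steps
    return bucket < percent * 100


nodes = ["node-%d" % i for i in range(1000)]  # hypothetical node names
canaries = [n for n in nodes if in_rollout(n, 0.5)]
```

Raising `percent` step by step widens the deploy without ever dropping a node that already has the new version.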

Incremental rollout

Heroku had a large change to the way processes work, but they managed to roll it out incrementally without users noticing.

Feature flag.

Core orchestration app.

Ship the code incrementally then ship the feature incrementally to new users.


Graph it.

Real time visibility. Availability can be thought of as how often things go down, and how long they stay down. Keep an eye on it.

Service level assertions

Get the computer to keep an eye on the graphs. If the graph enters the red state, there’s probably a problem.

assert(p99_latency < 50)

Time of day isn’t accounted for, but mostly that’s not a problem: they’re looking out for catastrophic failure. Perhaps they’ll consider using the derivative and the second derivative.
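
A sketch of what an assertion like `p99_latency < 50` could compute, assuming latencies in milliseconds and the nearest-rank definition of a percentile. The function names and sample data are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency(samples_ms, limit_ms=50):
    """Return True while the service level assertion p99 < limit holds."""
    return percentile(samples_ms, 99) < limit_ms


healthy = [12] * 99 + [40]           # hypothetical latencies in milliseconds
degraded = [12] * 90 + [400] * 10
```

`check_latency(healthy)` holds and `check_latency(degraded)` fails, which is the point at which the graph would enter the red state.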

Flow control and backpressure

Eventually flow will get to a node that can’t handle the traffic that reaches it. Potentially the whole branch that leads to the node can get fried. If you have flow control, you can divert the excess traffic away from the sensitive node, and avoid breaking the path for everyone. Some traffic will get a 500, but not all.

echo 0 > /etc/rates/publish

This’ll get picked up by the controller.
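
One common way to implement this kind of rate control is a token bucket, sketched here as an assumption rather than as Heroku's controller. The rate would come from wherever the controller reads it; in this sketch it is just a constructor argument, and the clock is passed in so the behaviour is easy to follow.

```python
class TokenBucket:
    """Admit at most `rate` requests per second; shed the excess instead of queueing it."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate        # tokens added per second; 0 sheds everything once drained
        self.burst = burst      # maximum tokens the bucket can hold
        self._tokens = float(burst)
        self._last = now

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at the burst size.
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True   # pass the request downstream
        return False      # shed it, e.g. answer with a 500 straight away


bucket = TokenBucket(rate=2, burst=2)
decisions = [bucket.allow(now=t) for t in (0.0, 0.1, 0.2, 1.2)]
```

The third request is shed because the bucket is empty; by 1.2 seconds it has refilled and traffic flows again.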

Anomaly detection, fault tolerance, anticipation

John Allspaw @allspaw

Four cornerstones

  • Anticipation
  • Monitoring
  • Response
  • Learning

Monitoring and anomaly detection

Things break. It’s harder to find out than you’d think it would be.

Active health check

HTTP call to service

  • Pros
    • Easy to implement
    • Easy to understand
    • Well-known pattern
  • Cons
    • Messaging can fail
    • Limited scalability

Supervisor sensitivity

1 sec timeout, 1 retry, 3 sec interval.

Just because you want to poll something every three seconds doesn’t mean it’s going to happen every three seconds.

How many seconds of errors can you tolerate serving?
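
The settings above put a bound on how long errors are served. A small worked calculation, assuming checks fire exactly on schedule (which, as noted, they may not) and that a backend is only marked down once every attempt has timed out:

```python
def worst_case_detection_seconds(interval, timeout, retries):
    """Longest time a dead backend can keep receiving traffic before it is marked down.

    Worst case: the backend dies just after passing a check, so a full interval
    elapses before the next probe, and then every attempt has to time out.
    """
    return interval + timeout * (1 + retries)


detection = worst_case_detection_seconds(interval=3, timeout=1, retries=1)
```

With a 3 second interval, 1 second timeout and 1 retry that is 5 seconds of errors in the worst case, and more if the poller runs late.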

Passive health check

“I’m alive”

  • Pros
    • Efficient
    • Different scalability
    • Fewer moving parts
    • Less exposure
    • Can submit to multiple places
    • Can scale out monitoring to a much larger architecture
  • Cons
    • Non ideal for network

Passive event logging

True fire and forget

  • Pros
    • On demand publish
  • Cons
    • Onus is on the app


You’ve got to understand what’s happening at the time. Eg: at Christmas you may not have the same behaviour as normal.

Static thresholds are difficult.

148,000 metrics at Etsy.

Finding out what’s normal is a big deal. How do you know if a drop or a lift is something that you’ve got to do something about?


  • Moving average is a possibility
  • Holt-Winters exponential smoothing Make a forecast of time series data; the most recent data has an exponentially larger influence on the prediction than older data.

    Can use it to work out if something is out of bounds. You get a Holt-Winters aberration.
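
A much simplified sketch of aberration detection. It uses simple exponential smoothing only, with no trend or seasonal terms, so it is not Holt-Winters proper; it just shows how an exponentially weighted forecast and deviation give a band to test each new point against. The parameter values and metric samples are made up.

```python
def find_aberrations(series, alpha=0.5, width=3.0):
    """Flag points that fall outside a smoothed forecast band.

    Simple exponential smoothing only: no trend or seasonal terms, unlike Holt-Winters.
    """
    forecast = series[0]
    deviation = 0.0
    flagged = []
    for index, value in enumerate(series[1:], start=1):
        error = abs(value - forecast)
        if deviation > 0 and error > width * deviation:
            flagged.append(index)
        # Recent points get weight alpha; older ones decay by (1 - alpha) each step.
        forecast = alpha * value + (1 - alpha) * forecast
        deviation = alpha * error + (1 - alpha) * deviation
    return flagged


metric = [100, 102, 99, 101, 100, 180, 101, 100]  # hypothetical metric samples
spikes = find_aberrations(metric)
```

Only the jump to 180 is flagged; the ordinary wobble around 100 stays inside the band, which is the distinction a static threshold struggles to make.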


Fault tolerance

  • Detect
  • Correct
  • Clean up
  • Redundancy
    • Spatial - going to talk about this.
    • Temporal
    • Informational

Don’t confuse variation and faults. A fault is an unexpected variation that can’t be compensated for or masked.

Spatial redundancy
  • Active / Active
  • Active / Passive
  • Roaming spare
  • Dedicated spare
In-line fault tolerance


Fail closed

Check the dependencies of the thing that you’re checking too. Doesn’t work too well with many dependencies. If you go too crazy you move away from tolerating variance.

Fail open

Carry on without a feature on failure. For example, an IP location lookup probably isn’t important enough to fail the whole request if you don’t have it.
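
Failing open can be as small as a `try` around the optional dependency. A sketch, with a made-up greeting feature and a hypothetical `lookup` callable standing in for the IP location service:

```python
def locate(ip, lookup):
    """Return a location for ip, or None if the lookup service fails (fail open)."""
    try:
        return lookup(ip)
    except Exception:
        return None  # carry on without the feature


def render_greeting(ip, lookup):
    city = locate(ip, lookup)
    return "Hello from %s!" % city if city else "Hello!"


def broken_lookup(ip):  # hypothetical dependency that is currently down
    raise TimeoutError("geo service unavailable")
```

`render_greeting("203.0.113.7", broken_lookup)` still returns a page, just without the city.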

Systemic complexity

Cascading failure is often an example of resonance. “Each time we have an instance of resonance between components we have an opportunity to learn something new”.


Imagination, not paranoia. Encourage “What could go wrong?” thinking. People love to tell war-stories because they contain hard-won lessons.

  • FMEA
  • Architecture review
  • Go-or-no-go meeting
  • “Game Day” exercises

Either wait for something to break while we’re not watching, or break it ourselves while we’re watching.

When things go right, people are involved; when things go wrong, people are involved.

Building technology mid-flight

Sam Hamilton


Why mid-flight? The moment you have a customer, you need to serve them at the same time that you’re building your new site.

If your system is built for the next ten years, you need to ask how much of it is being used.

Expectations were that company size would correlate with the number of transactions per second that the stack would have to cope with.

Decisions, decisions

Dan North @tastapod

“Every decision you make is a trade-off”, or, there are no best practices.

Often we make decisions without realising that there is a trade-off.

If you can’t say what you’re trading off you’re not able to make a rational decision.

  • Team composition
  • Development style
  • Architecture
  • Deployment

Team composition

Why care? See Conway’s law.

What about co-located or distributed? What are you trading off? Play the trade off game.

  • Feature teams vs layer teams
  • Experienced or inexperienced What if there’s a bunch of grunt work? The junior guys could have fun learning and being mentored. Get the work appropriate for the people.
  • Small teams vs big teams

Pattern: shallow silos

Normal wisdom says that you shouldn’t work in silos. If you’re just in silos you have a bus count of 1. At the other end of the spectrum, if you pair all the time, you might find that context switching is a large overhead.

Balance pair rotation with context switching.

Try having a separate standup with your stakeholder. You can plan just what you need for the day.

Other trade-offs

  • Automated vs manual builds As soon as you automate anything you’re locking down the process. If you don’t try the manual process you might miss out information about the process.
  • Automated vs manual testing If you only have automated tests you miss out on exploratory tests. Not only are you testing when you manually test, you’re reviewing.
  • Test-first vs test-driven vs test-after vs test-whenever Test first is all of the tests at the start.

Pattern: spike and stabilise

Hack stuff together, see how it works, if you like it then you can make it ready for production.

TDD is walking through water up to your chest. You’re not drowning, but you aren’t going as fast as swimming.

Feedback from users vs feedback from defects

Invest in code based on evidence. Why put lots of effort into writing tests when you aren’t sure that you’re going to use it.

Development style


  • Monolith vs components As soon as you’ve got components you’ve got to consider the communications between the components.

Pattern: short software half-life

Small expendable co-operating components. If I make the assumption that half the code won’t be there in three months, how does that affect the way I approach it.

  • Each component is fit for purpose.
  • Hard shell, soft centre.
  • The message is the API.

Pattern: ginger cake

The idea that you can copy and paste, and it’s alright. If you know what you’re doing, you don’t need to worry too much, until it’s a problem.


  • Automated vs manual As soon as you automate you eliminate your ability to learn about it.
  • Vertical vs horizontal scaling
  • Hosted vs in-house
  • Bespoke or commodity

Pattern: dancing skeleton

Just get something working is the walking skeleton pattern. The dancing skeleton is putting something into production really really quickly. It’d use the full stack, and have a REPL. You have strings into the app, and pull the strings to make it dance.


If you don’t understand your trade off, you don’t understand the decision you’re making.

When you know what you are trading off you can make informed decisions.

Developers have a mental disorder

Greg Young

Likes to look inside the program for a conference.

Pattern: throw spaghetti at the wall

Write stuff. Ask what works.

Dysfunction: Spring, Tomcat, Hibernate, …, what are we building?

Despise ORMs: impedance mismatch. If you have a domain model that’s different from the relational DB then there’s a lot of pain.

Years ago the selling points for DBs were things that we don’t even consider now. Now the selling points for ORMs are that we can move from Oracle to MySQL.

When you have two sides to a thing, and one side is really easy to measure and the other is really hard, we end up over optimising for the really easy to measure side.

Find an example of a function that you extracted to stay DRY. Copy and paste it. Call the different copies from different places. Remove checks if you know that they’re not needed for the specific case. Get rid of things you know aren’t needed. If the functions are now different, you’ve found an example of coupling.

Developers love building things that no one wants. We love building abstractions.

In many cases it’s worth writing code two or three times, rather than extracting a common interface.

How many people think that they could move off of Hibernate within a fortnight? Not many.

What percentage of the framework do you actually use? By abstracting you’re often complicating, and making it harder to understand.

Understand that non-programmers can give as much value by hacking around in Access as we do.

We could learn a lot by hiring a programmer.

We’re trained to aim for perfection, when really business doesn’t have these problems. Business doesn’t really have problems that need optimising.

Simple made easy

Rich Hickey

Simplicity is prerequisite for reliability.

Word Origins

sim-plex one fold/braid

One role or task that something has to do. One concept. Different from one instance and one operation. Don’t interleave.

It’s objective: things are either twisted together or they’re not.

lie near Near - on our hard drive; near to our understanding; near to our capabilities. “No one’s really significantly smarter than anyone else”.

Easy is relative. Near to what?

We repeatedly choose things that are near. If we continue to do that we’ll not learn anything new.

Construct vs Artifact

We focus on experience of use of construct, rather than the long term results of the use of the artifact.

Don’t assess constructs by their artifacts.


You can’t make something reliable if you don’t understand it. If things are intertwined, and you need to examine one thing that has a problem, you get everything that it’s attached to.


I don’t think your test suite makes you able to change your code without fear.

Your ability to reason about your code is your ability to change it without fear.


Every bug that you’ve found: passed all of the tests, satisfied the type checker. Your ability to reason about your program is critical to debugging.

Development speed

If you emphasise ease, you’ll be speedy at the beginning; ignoring the complexity will slow you down over the long term.

If you aren’t careful an elephant will come into your standup and trample everyone.

Easy yet complex

Easy things can be complicating. What matters is the complexity that the ‘easy’ things yield. For example ‘x=5’ is very easy, but what does it mean when you find it in the middle of a block of code.

Simplicity benefits

Ease of understanding, change, debugging. Flexible policy, location. Just because you can test something doesn’t mean you can change it rapidly.

Making things easy

Get used to it. But what about thinking about it? How quickly can you change your ability to think about the problem? Not very. The distance between your problem and yourself is large. What are you going to do? Change the distance between you and the problem, by making it more simple.

Parens are hard!

You’re not used to them, so they’re not nearby. Are they simple? No! Not for Scheme. What could be simpler than having one thing? The one thing isn’t the problem. Because there’s one thing, the concept is overloaded. So the thing that’s simpler than one thing is more than one thing. In Clojure, parens almost always mean a call, vectors are used for grouping.

LISP programmers know the value of everything and the cost of nothing. –Alan Perlis

More recently s/value/benefit/.

  • Methods: functions, namespaces
  • ORM: declarative data manipulation


You complected my thing. To interleave, entwine or braid.

Best to avoid in the first place. It wasn’t a mistake in the beginning: we went out and did it. What’s the fix? Compose.


Composing simple components is the key to robust software.

Modularity and simplicity

Simplicity implies partitioning and stratification, but not the other way around. Be careful. Just because you’ve partitioned things, doesn’t mean that it is going to be simple. This is important. Component A may well ‘know’ about the operation of B, and B may know about A in the same way. This isn’t simple.

State is never simple

But it is easy. State makes your program more complicated even if you have only one thread.

Refs / vars

If you’re using more than one variable to represent a thing you’re doing it wrong.

  • State: everything that it touches
  • Objects: state, identity, value, ops, … There’s loads of stuff all baked in, everything just gets horrid.
  • Syntax: meaning, order. The meaning of something and the arrangement are combined.
  • Switch / matching: multiple who/what pairs
  • Variables: value, time
  • Imperative loops, fold: what / how
  • Conditionals: why, rest of program

Simplicity toolkit

Use the language or libraries.

Environmental complexity

The kind of complexity that you can’t do anything about. Individual good decisions don’t combine to make many good things.

Abstract for simplicity

Don’t do it too much. It’s certainly been bashed a lot here. But realise that it can be right. Choose, say, ten abstractions for your team. You won’t choose IThingyFactoryProcessor because then you’ve only got nine abstractions left.

I don’t know, I don’t want to know.


Information is simple, but don’t make it more complex by shoving it into a class with its own micro-language.


Encapsulation is for implementation details, not for information. Information doesn’t have implementation, unless you added one. Why? If the answer is, “because I’m using Java” that’s not a great answer.

Wrapping information: that’s the way the languages make you do it.

Litmus test: Can you move it? Can you move your subsystems to a different process, thread, language? How much do you need to change?

If you pull stuff out of process, then perhaps you’ll get IOExceptions pervading everything. Subsystems should have data as the interface: data in, data out. Not IPersonInfo.

See HTTP calls: are we going to be making circular HTTP calls to get stuff done? No! That’s stupid! So why do it in one process?

Choose simplicity

Get to dislike entanglement. Your tools don’t measure simplicity: tests, type checkers, refactoring. They’re not bad, but they don’t do it for you.

Simplicity made easy

  • Choose simple constructs
  • Create abstractions with simplicity as a basis
  • Often simple means more things, not fewer.

Progressive architectures at RBS

Many suits.

RBS pioneered the use of XML and made it easy. Ha ha ha.

New drivers

Change. There’s lots of it.

  • Regulation
  • Faster trades Nanosecond timing
  • Utilisation Just enough hardware

All has to be done with tighter budgets. “I’m sure you’ll all be glad to hear”.

What do they do?

  • UX Not just a pretty front end, improve the effectiveness of the users of the system.
  • Big data
  • Data visualisation
  • Data virtualisation

The Manhattan Processor

Avoiding GC pauses in high frequency trading.

In the past, things had to be written in C++ because it was seen as the only way to get away from the GC.

The Manhattan Processor is similar to the LMAX disruptor.

What happens if Apple trading at 595p announces good financial results, and therefore the price rises to 615p, but Java decides to perform a full GC at the same time? Big deal!

The GC

Cost: 1 second per GB of heap, at best. Tuning the GC isn’t really enough.

Two options:

  • Use massive Eden space, and don’t go over that space. Then you won’t have a GC run.
  • Use a small Eden space, then GC will run in known time.

RBS use the second option. For many reasons it’s easier: if you’re using third party libraries, you don’t know how much space they’re going to use.

What is the MP?

  • In-house, non-allocating queue specialised for a multiple-producer, single-consumer requirement.
  • Predictable pauses < 1ms
  • No full collection in 15hr trading day.


What’s the big deal?

Predicting the future.


  • Linear approximation
  • Single-step Monte Carlo
  • Multi-step Monte Carlo

The market is complex. It needs to be modelled as such.

Monte Carlo

Random number generator. More experiments give better results.

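Eg estimate the area of a circle: pick random points in the square around it and ask whether each one is inside the circle. The fraction of points that land inside, multiplied by the area of the square, estimates the area of the circle.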
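
The circle example as a runnable sketch. The seed, sample count and function name are arbitrary choices for illustration; the random generator is seeded only so that repeated runs give the same estimate.

```python
import random


def estimate_circle_area(radius, samples, seed=0):
    """Estimate the area of a circle by sampling points in its bounding square."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    hits = 0
    for _ in range(samples):
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            hits += 1
    square_area = (2 * radius) ** 2
    return square_area * hits / samples


estimate = estimate_circle_area(radius=1.0, samples=100_000)
```

For a unit circle the estimate should land close to pi, and the error shrinks roughly with the square root of the number of samples, which is why more experiments give better results.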

Multi step Monte Carlo

Many events that can affect the outcome. Have to be able to model the events at discrete points in time.

Multi step simulation needs thousands of compute cores. How do they manage that? Distributed computers. This can generate ~10 TB of data. How is this stored? It’s got to be accessed concurrently by many engines.


  • Speed
  • Scalability
  • Robustness
  • Interoperability
  • Support infrastructure

Distributed file systems

HBase, HDFS, etc. NAS. HDFS really stood out for them.

ODC: One version of the truth for everyone

ESB allows them to distribute information around the organisation. There are limits to messaging.


Leaves interpretation of the facts to the consumer. Different consumers can end up with different interpretations of the truth.

Copying data lies at the root of many of the bank’s problems.


Central store: gets all eyes on a single version of the truth.

Need: low latency, high throughput.

And if synergies can be identified....

Architecting for failure at the Guardian

Michael Brunton-Spall @bruntonspall

Sharing failure. Cock up less. Mistakes are often the most interesting part.


Systems are going to fail.

Architect for failure:

  • Prevent
  • Mitigate

Pre 2008

J2EE basics. Apply scaling basics, like load balancers and multiple app servers, but still only one DB. Can scale, in theory, by adding more balancers and servers.

Scaled architecture

  • Can’t scale DB in the same way


Global load balancer -> load balancers -> App servers -> Multiple DBs as active / passive.

Gives redundancy with multiple data servers. Multiple internet connections.

  • 3.5M daily browsers
  • 1.6M unique pieces of content
  • Hundreds of staff
  • Can create micro-sites
  • Monolithic
  • One system that understands everything
    • Football
    • Finance
    • Mortgage
    • Content
  • Deployment
  • Build time

Microapps circa 2011

Framework. See front page. In a monolithic system everything has to be rendered by the same thing. Decompose it.

  • Core content
  • Metadata
  • Microapps Can be managed independently. Just because you change one bit doesn’t mean you need to build and test everything else.
    • Tweets
    • More on this story


SSI-like technology, over HTTP. You just have a URL to be embedded in the page, and that’ll return all the HTML that needs to appear in the page.

  • Advantages
    • HTTP well known
    • Comes with caching
      • max-age
      • stale-if-error Squid extra thingy. I don’t mind seeing stale stuff if there’s really an error. The cache will continue returning data until you’ve fixed the microapp.
    • Don’t care about what renders the HTML, therefore you aren’t tied to J2EE for everything. Tired of getters and setters. They can be hosted where you want: AppEngine, internally.
      • If you host on AppEngine you don’t have to talk to your ops team.
      • Therefore development is faster.
      • “Actually I can do this better in a different language”.
      • Innovation improves.
Microapps don’t talk to the database directly. They talk through the Content API.

Problem: what if you put the new thing on the main page, and everyone sees it. All of a sudden, the quota is exceeded and everyone sees an error. Solution: put a cache between the main app and the microapps.
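
A sketch of the cache between the main app and the microapps, with `max-age` and `stale-if-error` behaviour in a single process. It is an illustration of the idea, not the Guardian's setup (which the notes say used Squid); the class name, the URL and the `fetch` callable are assumptions, and the clock is passed in to keep the example deterministic.

```python
class StaleIfErrorCache:
    """Cache fetched fragments; serve a stale copy when the fetch raises an error."""

    def __init__(self, fetch, max_age, stale_if_error):
        self._fetch = fetch                    # callable(url) -> html, may raise
        self._max_age = max_age                # seconds a copy counts as fresh
        self._stale_if_error = stale_if_error  # extra seconds a stale copy may be served on error
        self._entries = {}                     # url -> (stored_at, html)

    def get(self, url, now):
        entry = self._entries.get(url)
        if entry and now - entry[0] <= self._max_age:
            return entry[1]  # fresh hit: the microapp is not contacted
        try:
            html = self._fetch(url)
        except Exception:
            if entry and now - entry[0] <= self._max_age + self._stale_if_error:
                return entry[1]  # microapp is failing: serve the stale copy
            raise
        self._entries[url] = (now, html)
        return html


responses = {"ok": True}


def fetch(url):  # hypothetical microapp call
    if not responses["ok"]:
        raise ConnectionError("quota exceeded")
    return "<ul>tweets</ul>"


cache = StaleIfErrorCache(fetch, max_age=60, stale_if_error=3600)
first = cache.get("/microapp/tweets", now=0)
responses["ok"] = False
stale = cache.get("/microapp/tweets", now=120)  # past max-age, microapp down
```

Readers keep seeing the last good fragment while the microapp is broken, and the front page load no longer hits the microapp on every request.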

  • Cons
    • Support
    • Maintenance
    • Diversification
      • Decided to settle on the JVM. Willing to pay the cost of working with other things on the JVM.
      • The ops team have to know how to monitor what you’ve written. Some of it is on EC2, some on AppEngine etc.
      • Also increased complexity.
    • Big cache. Memcached. 40G cached HTML.
    • Latency (the big one)
      • Microapp latency affects CMS latency.
      • Slow is a big problem. Full GCs. 500s are better.

Emergency mode

Peaky traffic was a problem. They featured a cuddly rat that everyone liked.

Sometimes you need speed, and can get rid of dynamic stuff.

  • Caches don’t expire
  • Page pressing
    • Full page cache
    • Capable of 1k pages/second/server

Caching problems

It’s more important to cache page one than page 700. Cache the important stuff.


One of the most important things you can do. What changed? What can I turn off to stop it going wrong? Is CPU usage that important when something is going wrong? It’s normally a side effect.

Alerting is not monitoring.


You need them so you can turn things off. The Guardian uses release valves.

Some switches are automatic. For example emergency mode is automatic. Don’t alert just because you’ve gone into emergency mode, that’s fine. If you go into emergency mode every five minutes that’s a problem.

Don’t ping-pong automatic switches. Go in, wait, then leave.
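
"Go in, wait, then leave" can be sketched as a switch with a minimum hold time. The threshold, the hold time and the `EmergencySwitch` name are invented for illustration; the clock is passed in so the sequence below is easy to follow.

```python
class EmergencySwitch:
    """Automatic switch that turns on under load and stays on for a minimum hold time."""

    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.on = False
        self._entered_at = None

    def update(self, load, now):
        if not self.on and load > self.threshold:
            self.on = True            # go in
            self._entered_at = now
        elif self.on and load <= self.threshold:
            if now - self._entered_at >= self.hold_seconds:
                self.on = False       # leave, but only after waiting
        return self.on


switch = EmergencySwitch(threshold=1000, hold_seconds=300)
states = [switch.update(load, now) for now, load in
          [(0, 1500), (10, 800), (20, 1200), (200, 700), (310, 700)]]
```

The dip at 10 seconds does not switch emergency mode off, so the load bouncing around the threshold cannot make the switch ping-pong.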


Why care? You must be able to architect the system so you can diagnose the problem. Log analysis is important. Get the logs off the server, then you can restart the server.

If your logs are 30 gigs then copying will take a long time. Your logs must be parsable. You’ve got to be able to use grep to find the problem.

Cloud9 IDE in JavaScript

How do you write an IDE in the browser?

Everything is a DOM element. Everything is a span.

Problem: if you edit an 80,000 line file, how does the browser handle that? How big would the DOM be?

Solution: only show the bits that you can see.
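
The arithmetic behind "only show the bits that you can see" is small. A sketch assuming fixed-height lines measured in pixels; the function name and numbers are made up, and this is not Cloud9's actual renderer.

```python
def visible_lines(lines, scroll_top, viewport_height, line_height):
    """Return (first_index, lines) for only the rows that intersect the viewport."""
    first = max(0, scroll_top // line_height)
    # One extra row covers a partially visible line at the bottom edge.
    count = viewport_height // line_height + 1
    return first, lines[first:first + count]


document = ["line %d" % n for n in range(80_000)]  # hypothetical 80,000 line file
first, rendered = visible_lines(document, scroll_top=32_000, viewport_height=600, line_height=16)
```

Only a few dozen rows need DOM elements at any moment, however long the file is.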


240,000 lines of code.


Compare the good parts to the whole language.

Static file analysis.

Static analysis with JavaScript

Two things needed

  • Parse
  • Analyse


JavaScript in, AST (abstract syntax tree) out.
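
The same two steps, shown with Python's built-in `ast` module standing in for a JavaScript parser (an analogy only; treehugger.js has its own API, which is not shown here). The analysis is a deliberately naive "assigned but never read" check on a made-up snippet.

```python
import ast

source = """
def total(items):
    unused = 0
    return sum(items)
"""

tree = ast.parse(source)  # parse: source text in, syntax tree out

# analyse: walk the tree and collect names that are assigned but never read
assigned, read = set(), set()
for node in ast.walk(tree):
    if isinstance(node, ast.Name):
        if isinstance(node.ctx, ast.Store):
            assigned.add(node.id)
        elif isinstance(node.ctx, ast.Load):
            read.add(node.id)

never_read = sorted(assigned - read)
```

Here `never_read` contains only `unused`, the kind of warning an IDE can show without running the code.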


Enter treehugger.js

“The jquery of analysis”.

Big Data at Facebook

Big data

  • 25PB of compressed data
  • 150PB of uncompressed data
  • 400TB/day of uncompressed raw data

What’s it used for?

  • Reporting
  • Model generation
  • Analysis
  • Index generation

A/B testing

Non-intuitive results.

Friend map

Impossible without the data infrastructure that they’ve built.

Modelling process

Rich Hickey

What’s more fundamental?

Time, process


immutable magnitude, quantity, number
a putative entity we associate with a series of causally related values over time
Value of an identity at a moment in time
Relative before / after ordering of causal values


Deals with things that change. Problem: two cars vying for the same parking space. What is place?

Are places in charge?

Should a parking space have a picnic method? Should a park have a post-gig method?

What do we see?

Light. We don’t put our eyes on the table. Time elapses. We don’t see the table right now. We just see the past.

You don’t see the present in your program. Look at the caching architecture. L1 cache, L2 cache, multiple independent caches per CPU.


  1. Reality
  2. Perception
  3. Memory
  4. Logic

In our program how do we perceive things? We take our eye to the table.


take entirely

Sensory systems only ever perceive the past.


Mindful, remembering

New memories about the same identities don’t replace the old.

Program memory

Sometimes like brain memory, mostly like place memory.

  • Destroys the past
  • Corrupts remembering
  • Interferes with perception

Locking: freeze the universe while I can read this stuff. Perception should be highly parallelizable.


go forward, advance

Fan of functional programming, but got to allow change.

Leave the past behind. Not food calculators.

Connecting everything to the participants is inherently broken. Where’s the win method on the sumo wrestler?


Things don’t change in place. Place includes time. The future is a function of the past. Coordination is desirable locally.

Epochal time model

       F     F     F      pure events (functions)
      / \   / \   / \
    v1 => v2 => v3 => v4  states

All vs together: identity. Observers observe v1, v2, …

Persistent structures


  • Immutable
  • Safe on the disk
  • Great fit for perceptions and memories
  • Not idiomatic in Java

Identity constructs as gatekeepers of time


Functional model

Wow this is going fast.

CAS as time construct

(swap! an-atom f args)

(f vN args) ;; becomes vN + 1

Clojure, but you can make one in Java easily.
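
A sketch of the same construct outside Clojure, written in Python to show the shape of the retry loop. A lock stands in for a hardware compare-and-set instruction, and `Atom`, `deref` and `swap` are names borrowed from Clojure for illustration rather than an existing Python API.

```python
import threading


class Atom:
    """A reference whose value only changes via compare-and-set."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # stands in for a hardware CAS instruction

    def deref(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

    def swap(self, f, *args):
        """Apply pure function f to the current value, retrying if another thread won."""
        while True:
            old = self.deref()
            new = f(old, *args)   # (f vN args) computes the candidate vN+1
            if self.compare_and_set(old, new):
                return new


counter = Atom(0)


def bump_many():
    for _ in range(1000):
        counter.swap(lambda value, n: value + n, 1)


threads = [threading.Thread(target=bump_many) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because `f` may run more than once when a swap is retried, it has to be a pure function; no update is lost, so the counter ends at 4000.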


What if my logical unit of work involves a million steps? Creating a million interim values via pure function invocation is a waste.


Structure that is a symbiotic twin of persistent structures. Not lasting. When applied to structures, not persistent.

No one sees the transient in the middle of the op.


Function of transient to transient. Can’t affect the world.

Resilient response in complex systems

John Allspaw - Etsy


We have a responsibility to write operable software. As important as functionality. How it runs in production is your main concern.

Last year AWS went down for 80 hours. Failure can happen to anyone.

How can this happen in 2012? Complex systems:

  • Cascading
  • non-linear
  • feedback loops
  • etc

Do we know what we’re doing?

We don’t need to ask why it happens, just get used to the fact that it is going to happen. Instead ask: what are you going to do when it does happen?

If we only spend effort and time on prevention, we’re missing out on working on solutions to failure. Pay the firefighters.

Have a look at this.

Symptoms in an emergency

  • Forced beyond your learned roles.
  • Perform actions which have consequences that are unknown and/or hard to see.
  • Cognitively and perceptively noisy.

What shall we do about it?

Learn from others that have done this before.

Characteristics of response

  • Neglect how processes develop
  • Forget what happened, almost as soon as it happens
  • Difficulty dealing with exponential increases in speed
  • Think in causal series, not causal nets: it’s not dominoes.
  • Thematic vagabonding: flip from one thing to another.
  • Goal fixation: opposite of the last. “It’s got to be the x.”
  • Refusal to make decisions: too much stuff to think about, not enough authority.
  • Unnecessary heroism: the non-communicating lone wolf.
  • Distraction: noise that doesn’t help.
Find Jens Rasmussen paper
  • Skill based

Simple, routine

  • Rule based

Knowable, rule based. Probably spend most of our time here, or above

  • Knowledge based

WTF is going on? Lone-wolf, vagabonding etc.

High reliability organisations

see Managing the Unexpected

see The self-designing high-reliability organization

Aircraft carrier: shrink SF airport onto an aircraft carrier, reduce the time between take off and landing.

  • Close interdependence between groups
  • Close coordination and information sharing
  • High redundancy
  • Broad definition of who belongs on the team
  • Teammates included in the comms loop
  • Lots of error correction
  • Constant awareness of risk of failure
  • Detailed records, so you can learn from them
  • Authority patterns dispersed: people who handle an outage should have the authority to handle the situation
  • Reporting of errors is rewarded

What else can we do?

  • Drill: practice what you’re going to do. You’ve got to be familiar with the tools that you are going to use. Practice using tcpdump if that’s what you are going to have to use in an outage.
  • GameDay: in production. Affect the infrastructure, in production, while it’s being used. How much is it going to hurt? How quickly can we get the system rebuilt? Don’t freak out.
  • Learn to improvise: we can adapt. Realise it.
  • Learn from mistakes: postmortems.
    • Timelines: what happened when.
    • Put it in public, everyone invited.
    • Search for second stories, instead of human error.

Why did it make sense for someone to make that mistake? It did make sense to them at the time. They did something because it made sense. Don’t blame training or specific individual characteristics.

  • Cultivate blamelessness

This doesn’t mean everyone is off the hook.

  • Give people authority to improve things
  • Collaborative and skillful communication
  • Share near-miss events: say when you’ve nearly cocked up.


Ironies of automation - Lisanne Bainbridge

  • Move human from manual operator to supervisor.
  • Augments human ability
  • Doesn’t remove human error: it can occur when the automation is built
  • Brittle: it can only do what has been pre-programmed.

Law of stretched systems

Things are stretched to capacity. As soon as there is an improvement, it will be exploited to a new tempo and intensity of activity.

There are things to consider instead of blindly automating.

Near misses

  • Act like vaccines
  • Happen more often, so more data
  • Reminder of hazards

Parting thoughts

Don’t just think about failure. Think about why you don’t have an outage every day.

  1. Ways in which things go right are special cases of things that go wrong.
  2. Ways in which things go wrong are special cases of things that go right.

Which one? Perhaps both. Don’t just ask why did we fail, also ask why did we succeed?

Don’t ignore failure prevention.

Faith, Evolution and Programming Languages

Philip Wadler

Faith, evolution

Evolution happens in our field, much more rapidly than in life. What are the timescales? It seems that they’re short, but perhaps there are important things that evolve over larger timescales than are apparent.

Gerhard Gentzen

Nazi. Founder of field. Natural deduction. Boole 1820s

Want to be able to simplify proofs. Why? If a proof is in simplest form, you know that you only have to prove subformulas of the problem. E.g. to prove A => B you only need to work with A and B, not some unrelated C.

There are no proofs of false. You don’t want to be able to prove false.

The history of logic is filled with simple things that people didn’t see for 20, or 30 or 50 years. You should be on the lookout for these sorts of things.

Church, Lambda calculus 1940

Interested in a nice notation for writing logical proofs, and wanted to be able to abstract procedures for intellectual control.

Function and record construction.

These two things are isomorphic. The correspondence wasn’t published until 1980. This is why you should pay attention to theoretical computer science.

Curry-Howard isomorphism 1980

(Haskell Curry)

Curry noticed the correspondence between things that looked like programs and things that look like logic.

Every good idea in computing should first be discovered by a logician.

Milner polymorphic types paper

Great ideas are so great that you discover them twice.

Frege quantifiers (for all)

You can prove (x + 1)^2 is x^2 + 2x + 1 and it’s true for all x.

Proofs always terminate. Typed lambda calculus therefore always terminates. But you want to be able to do interesting stuff, that you can only do with things that you can’t prove will terminate. Therefore, you can add back into the typed lambda calculus recursion that’ll let you not terminate.

You can write a language that will have known complexity.

Haskell type classes

Look them up, they’re great. Once you’ve designed your library properly you immediately get things like comparison or printing for free.

Blame calculus

Derive the type checks that you need in untyped code so that you can communicate from a typed language to an untyped language.

Play framework reactive framework

What is it?

Web application framework for Scala and Java.

Why reactive?

Need to be able to model the streams of data. One thread per request doesn’t scale.

Java input stream

Blocks until complete. I’d like to be notified when data is available.
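
A hedged sketch of "notify me when data is available", using Python’s asyncio rather than Play’s own API, which these notes don’t show. The `produce` and `consume` coroutines and the `None` end-of-stream marker are assumptions made up for this example. While the consumer waits for the next chunk it is suspended, so no thread sits blocked on a read.

```python
import asyncio


async def produce(queue):
    """Stand-in for a socket: data arrives in chunks over time."""
    for chunk in (b"hel", b"lo"):
        await asyncio.sleep(0)   # let other work run between chunks
        await queue.put(chunk)
    await queue.put(None)        # assumption: None marks end of stream


async def consume(queue):
    """Resumed each time a chunk is available instead of blocking a thread."""
    received = []
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        received.append(chunk)
    return b"".join(received)


async def main():
    queue = asyncio.Queue()
    producer = asyncio.create_task(produce(queue))
    body = await consume(queue)
    await producer
    return body


result = asyncio.run(main())
```

One event loop can interleave many such consumers on a single thread, which is the scaling argument against one thread per request.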

A reactive model


Scala at the Guardian

Graham Tackley @tackers


3.5M unique browsers, second most popular after the Daily Mail.

Line counts

  • 185k lines of Java
  • 35k lines XML
  • 72k lines Velocity
  • 250k lines of test Java

It’s OK, but

  • Two week code to live time.
  • Slow to work with.
  • Don’t use an ORM.

Pre 2009 everything had to be part of that code base. Post that, they implemented the micro-app framework mentioned in the other talk.


They value small independent components over the monolith. The monolith: refactoring is easier in one big monolith. The relief of not having a massive code base was huge.

Clarity of intent over ceremonial abstraction.


They were using Python, but found that it was too different. They found that they didn’t want to throw away everything that they’d done already. Evolution not revolution.


Lots of the tools that they were used to they could continue to use. Same runtime. Still just a WAR deployed to production.

Huge drop in verbosity.

Started off writing tests. They were excited about writing the tests. They found that they were just writing Scala as more concise Java. They didn’t rewrite everything, but whenever they touched a class they rewrote it.

Eventually they decided to ditch the old Java libs that they were using and go to Scala + Lift + Solr.

Plan: didn’t really have one. The rest of the team saw their happiness with the new language. 21 back end developers.

What’s the best way of learning about Scala? Sitting and talking about it. Most people estimated that it took between one and three months to get as productive in Scala as they were in Java.



How on earth do they get good Scala programmers? They look for good Java developers who want to learn Scala. They attract the polyglots who are interested in learning the language.


They simply haven’t had the problems that have been popularly reported in the press.


  • Clarity over cleverness



They don’t really do functional programming. They find that writing components that are immutable helps a lot.

Erlang, the language from the future

Damien Katz

What is it?

Functional concurrent programming language. Built at Ericsson in the 80s and 90s.

Used in telecoms switches.

What is it like?

Weird. “Boy, this is really bizarre”.

Once he was writing CouchDB, he was really motivated to persist with learning it. Extremely productive.

Kind of slow.


If is completely useless. Case makes up for it.

Simple. Taken aback by the simplicity of it. No classes, no OO, but it has records. Syntactic sugar for tagged tuples.

Functional. Create closures, pass around functions.

Wow, this seems odd. Comparing it to OO languages.

How do you get anything done without OO?

You get used to it. Before you know it you’re productive.


Has processes. Communicate with messages.

Don’t get stop the world GC. Every process can be GCed individually.
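
To make the message-passing style concrete, here is a sketch in Python using a thread and queues as a stand-in for an Erlang process and its mailbox. The names `counter_process`, `mailbox` and `replies` are made up for this example. Python threads share one heap and one garbage collector, so this shows only the messaging discipline, not Erlang’s per-process isolation or GC.

```python
import queue
import threading


def counter_process(mailbox, replies):
    """Owns its state; the only way to touch it is to send a message."""
    total = 0
    while True:
        msg = mailbox.get()          # wait for the next message
        if msg[0] == "add":
            total += msg[1]
        elif msg[0] == "get":
            replies.put(total)       # answer by sending a message back
        elif msg[0] == "stop":
            break


mailbox = queue.Queue()
replies = queue.Queue()
worker = threading.Thread(target=counter_process, args=(mailbox, replies))
worker.start()

mailbox.put(("add", 2))
mailbox.put(("add", 3))
mailbox.put(("get",))
total = replies.get(timeout=5)

mailbox.put(("stop",))
worker.join(timeout=5)
```

Nothing outside `counter_process` ever reads or writes `total` directly, so no locks are needed around it.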

Error handling

Let it crash. The supervisor process will notice. Simplifies application code, leads to much more reliable software.

Hard to believe, but it’s true.
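
A toy illustration of the let-it-crash idea in Python. This is not OTP’s supervisor API, and `supervise`, `flaky_worker` and `max_restarts` are hypothetical names for this sketch. Real Erlang supervisors watch separate processes and offer several restart strategies. The point is only that the worker contains no defensive error handling and the restart policy lives in one place.

```python
def supervise(worker, max_restarts=3):
    """Run worker; if it crashes, restart it, giving up after max_restarts."""
    restarts = 0
    while True:
        try:
            return worker(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise                # escalate: the supervisor itself fails
            restarts += 1            # otherwise start the worker afresh


attempts = {"n": 0}


def flaky_worker():
    """No recovery code here: on a bad state it simply crashes."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"


result, restarts = supervise(flaky_worker)
```

The worker stays short because it only describes the happy path.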

Pattern matching

Makes it very easy to extract what you want.

Why is it from the future?

In the past, there was one CPU and uniform main memory. Concurrency was about making one process do many things.

Now we’ve got the multi-core present. NUMA : non uniform memory access.

Most languages don’t model reality.

We still have a big flat memory model, and each access has the same cost. But that’s not the case. Our hardware doesn’t work that way anymore.

Erlang maps very well to the multi-core reality that we have today. It looks a lot like the physical hardware that we’re running on.

Accidental multi-core awwwesomeness

Messaging between systems should be as easy as messaging within the system itself.

Erlang and CouchDB

Not so fast.

What’s the problem? Syntax. The weird syntax makes Erlang slow, indirectly. Weird syntax makes it hard to get massive adoption. Without massive adoption it’s hard to get lots of people making it very fast.

When would you use it?

Back end heavy lifting systems.

When would you avoid it?