Joe Armstrong
How do you get ten nines availability? Why even ten? The number is arbitrary.
- Washing machine / pacemaker (!) Very specialised, embedded. Processor and the data are in the same place, so easy to program.
- Deep space mission The only things that’ll be left of humans after we’re gone (so far)
- Aircraft control system Wait until the plane is on the ground before changing the software. Shut down the nuclear plant before updating the software.
- Internet systems This talk. Data and processing separate and distributed.
Systems like this need highly available data.
Computation can be performed anywhere.
We want many routes to the data.
If there are 10 million computers and my data is on ten of them, I can’t ask each computer if it has my data.
Algorithm: Chord
- Hash the computers' IPs
- Sort the hashes
- Hash the lookup key
- Put the data on the first machine with a hash that’s lower than the key’s hash
Collect data in parallel. Save data in parallel.
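A minimal sketch of that lookup, assuming SHA-1 as the hash and following the note's rule (the key goes to the first machine whose hash is at or below the key's hash, wrapping round the ring); the names are illustrative, not Armstrong's implementation:

```scala
import java.security.MessageDigest

object RingLookup {
  private def hash(s: String): BigInt =
    BigInt(1, MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8")))

  // Hash the computers' IPs and sort the hashes once.
  def ring(nodeIps: Seq[String]): Vector[(BigInt, String)] =
    nodeIps.map(ip => hash(ip) -> ip).sortBy(_._1).toVector

  // Hash the lookup key and pick the first machine whose hash is at or below
  // the key's hash, wrapping round to the highest node if none is.
  def nodeFor(nodes: Vector[(BigInt, String)], key: String): String = {
    val k = hash(key)
    nodes.reverse.find(_._1 <= k).getOrElse(nodes.last)._2
  }
}
```

Storing each key on several adjacent nodes of the ring is one way to get the "collect in parallel, save in parallel" behaviour above.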
The problem of reliable storage of data has been solved. – Joe Armstrong
Formal definition: “my program should not fuck up your program”.
Programming in sequential languages is difficult because the world isn't sequential.
“Embarrassingly parallel” problems: web servers.
If you can’t detect it, you can’t fix it. This must work across machine boundaries. If a machine dies you can’t tell if it’s the machine or the network.
If you make things synchronous you’ll bugger things up.
It’s not enough to know that there is a fault.
Wow. Why would you want to stop it? We want zero downtime. Early requirement for Erlang. At Ericsson you got told off if your system was down for more than four minutes in a year. That’s ~ five nines availability.
Suppose all computers crash: you want your data back.
Fail fast: software should either function correctly or detect the fault, signal failure and stop operating.
If you’ve got a single process you can’t let it die. If you have millions of processes, you can let a few thousand die.
See society: we wander around doing what we want, and if we fall down with a heart attack the ambulance rocks up and fixes us. Send in the medic.
Have threads that detect the failure of other threads.
The big idea is messaging. – Alan Kay
Use a programming language designed for it. Armstrong can only think of one. Ha ha.
- Isolation
Isolate processes so that they can’t damage one another. No shared memory, lightweight.
Treating failure with shared memory is very difficult.
- Concurrency
Run the processes in parallel. Hardware design will mean that soon we’re able to run many processes concurrently.
Erlang has network transparency so the processes might be running elsewhere.
- Failure detection
Erlang processes can detect failure. This is out of band: not a normal message. It's a signal. It's messy if you handle failure in the same place that you handle normal stuff.
Fix the failure somewhere else. What does failing A have to send to running B so that B can carry on doing the job that A didn't manage?
- Fault identification
Special processes that handle errors.
receive {'EXIT', Pid, Why} -> error_log:log_error({erlang:now(), Pid, Why}) end
- Live code upgrade
In Erlang you can modify code as it runs.
...
f1(X) ->
    foo:bar(X),  % Call the latest version of this module
    bar(X).      % Call this version of bar

bar(X) -> ...
- Stable storage. Use mnesia, or use third-party storage.
Mark McGranaghan @mmcgrana
Lessons learned from PaaS at Heroku
Everyone was doing the same over and over again: routing, runtime, data. Package these three things, and then apps can use them.
Runs on AWS.
API, routing, packaging, data, logging, runtime.
1,000 instances (virtualised servers), 1,000,000 apps.
Simply, a load balancer. If you lose a back end, the balancer makes it transparent to the users.
Code crashes, get used to it. Locally, things like upstart work fine. On a distributed platform, you need a global view of application health. Supervisor detects exit codes, restart the app.
Crash is the same path as normal app exit. This enables you to handle the failure of an instance, eg if AWS nukes it.
Gets more reliable as you make it smaller.
Nodes communicate to a message broker with narrow, versionable JSON messages.
At any given time there are going to be apps that speak the old version, and apps that speak the new version.
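A sketch of what version tolerance might look like on the consumer side; the message shape and field names here are assumptions for illustration, not Heroku's actual schema:

```scala
// A narrow, versioned message: consumers ignore fields they don't know about
// and still accept producers that speak the old version.
case class DeployEvent(version: Int, component: String, release: Option[String])

def parse(fields: Map[String, String]): DeployEvent =
  DeployEvent(
    version   = fields.get("version").map(_.toInt).getOrElse(1), // old producers omit it
    component = fields.getOrElse("component", "unknown"),
    release   = fields.get("release")                            // only present in newer versions
  )
```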
Heroku tries to solve the same problems that Erlang does. It's not surprising that there is a similarity in the approaches.
One broker is a single point of failure. To get around that:
If one broker fails, transparently fail over in the client to a different broker.
If you’ve got distributed services and you can’t read from one, gracefully degrade.
If you can't write to a service, persist an 'owed' write locally. When the service comes back online, replay the owed write. Billing writes all tickets locally, then asynchronously writes the information to the service.
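A minimal sketch of the owed-write idea, assuming a hypothetical BillingClient that reports whether the remote service accepted a write:

```scala
import scala.collection.mutable

trait BillingClient { def record(ticket: String): Boolean } // false if the service is unreachable

class OwedWrites(client: BillingClient) {
  private val owed = mutable.Queue.empty[String]

  // Billing writes every ticket locally first...
  def write(ticket: String): Unit = owed.enqueue(ticket)

  // ...and a separate, e.g. periodic, flush pushes owed writes to the service,
  // stopping as soon as the service stops accepting them.
  def flush(): Unit =
    while (owed.nonEmpty && client.record(owed.head)) owed.dequeue()
}
```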
Everything outside of architecture: culture, organisation, etc.
Heroku recently had a problem. They had a post-mortem of the incident, which took a team about a week. They’ve noticed that most of the causes weren’t completely technical. They involved people.
What are the biggest causes of availability failures? Not the architecture. Failed deploys (too fast, too slow); bad visibility; cascading feedback.
Has to be repeatable.
bin/ship --component api --version 1234
An initial pool of deploy servers: 0.5% of the nodes. Use the data coming back from those nodes to determine whether the deploy will be successful. When ready, deploy to the others over a period of between minutes and weeks.
Heroku had a large change to the way processes work, but they managed to roll it out incrementally without users noticing.
Feature flag.
Core orchestration app.
Ship the code incrementally then ship the feature incrementally to new users.
Real-time visibility. Availability can be thought of as how often things go down and how long they stay down. Keep an eye on it.
Get the computer to keep an eye on the graphs. If the graph enters the red state, there’s probably a problem.
assert(p99_latency < 50)
Time of day isn’t accounted for, but mostly that’s not a problem: they’re looking out for catastrophic failure. Perhaps they’ll consider using the derivative and the second derivative.
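A sketch of the kind of check that assertion implies; the percentile arithmetic and the 50 ms threshold are illustrative, not Heroku's tooling:

```scala
// Approximate 99th percentile of a non-empty set of latency samples, in milliseconds.
def p99(samplesMs: Seq[Double]): Double = {
  val sorted = samplesMs.sorted
  sorted(math.min((sorted.size * 99) / 100, sorted.size - 1))
}

// "Get the computer to keep an eye on the graphs": fail loudly in the red state.
def checkLatency(samplesMs: Seq[Double]): Unit =
  assert(p99(samplesMs) < 50.0, s"p99 latency is ${p99(samplesMs)} ms")
```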
Eventually flow will get to a node that can't handle the traffic that reaches it. Potentially the whole branch that leads to the node can get fried. If you have flow control, you can divert the excess traffic away from the sensitive node and avoid breaking the path for everyone. Some traffic will get a 500, but not all.
echo 0 > /etc/rates/publish
This’ll get picked up by the controller.
John Allspaw @allspaw
- Anticipation
- Monitoring
- Response
- Learning
Things break. It's harder to find out than you'd think it would be.
HTTP call to service
- Pros
- Easy to implement
- Easy to understand
- Well-known pattern
- Cons
- Messaging can fail
- Limited scalability
1 sec timeout, 1 retry, 3 sec interval.
Just because you want to poll something every three seconds doesn’t mean it’s going to happen every three seconds.
How many seconds of errors can you tolerate serving?
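Rough arithmetic with the numbers above: a poll that times out costs 1 s plus 1 s for the retry on top of the 3 s interval, so the gap between good samples can stretch to roughly 5 s; multiply that by however many consecutive failures you require before alerting, and that is roughly how many seconds of errors you may serve before anyone is paged.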
“I’m alive”
- Pros
- Efficient
- Different scalability
- Fewer moving parts
- Less exposure
- Can submit to multiple places
- Can scale out monitoring to a much larger architecture
- Cons
- Not ideal for the network
True fire and forget
- Pros
- On demand publish
- Cons
- Onus is on the app
You’ve got to understand what’s happening at the time. Eg: at Christmas you may not have the same behaviour as normal.
Static thresholds are difficult.
148,000 metrics at Etsy.
Finding out what’s normal is a big deal. How do you know if a drop or a lift is something that you’ve got to do something about?
- Moving average is a possibility
- Holt-Winters exponential smoothing
Make a forecast of time-series data; the most recent data has an exponentially larger influence on the prediction than older data.
Can use it to work out if something is out of bounds. You get a Holt-Winters aberration.
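A minimal sketch of the smoothing idea only (single exponential smoothing; full Holt-Winters adds trend and seasonal terms, and alpha and the band here are illustrative values):

```scala
// Each forecast is a weighted blend in which recent observations carry
// exponentially more weight than older ones. Assumes a non-empty series.
def smooth(series: Seq[Double], alpha: Double = 0.5): Seq[Double] =
  series.scanLeft(series.head)((level, x) => alpha * x + (1 - alpha) * level).tail

// An observation far enough from its forecast counts as an aberration.
def aberrant(observed: Double, forecast: Double, band: Double): Boolean =
  math.abs(observed - forecast) > band
```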
- Detect
- Correct
- Clean up
- Redundancy
- Spatial - going to talk about this.
- Temporal
- Informational
Don’t confuse variation and faults. A fault is an unexpected variation that can’t be compensated for or masked.
- Active / Active
- Active / Passive
- Roaming spare
- Dedicated spare
Timeouts
Check the dependencies of the thing that you're checking too. Doesn't work too well with many dependencies. If you go too crazy you move away from tolerating variance.
Carry on without a feature on failure. For example, IP location lookup probably isn't that important; degrade gracefully if you don't have it.
Cascading failure is often an example of resonance. “Each time we have an instance of resonance between components we have an opportunity to learn something new”.
Imagination, not paranoia. Encourage “What could go wrong?” thinking. People love to tell war-stories because they contain hard-won lessons.
- FMEA
- FMECA
- Architecture review
- Go-or-no-go meeting
- “Game Day” exercises
Either wait for something to break while we’re not watching, or break it ourselves while we’re watching.
When things go right, people are involved; when things go wrong, people are involved.
Sam Hamilton
Why mid-flight? The moment you have a customer, you need to serve them at the same time that you’re building your new site.
If your system is built for the next ten years, you need to ask how much of it is being used.
Expectations were that company size would correlate with the number of transactions per second that the stack would have to cope with.
Dan North @tastapod
“Every decision you make is a trade-off”, or, there are no best practices.
Often we make decisions without realising that there is a trade-off.
If you can’t say what you’re trading off you’re not able to make a rational decision.
- Team composition
- Development style
- Architecture
- Deployment
Why care? See Conway’s law.
What about co-located or distributed? What are you trading off? Play the trade off game.
- Feature teams vs layer teams
- Experienced or inexperienced What if there’s a bunch of grunt work? The junior guys could have fun learning and being mentored. Get the work appropriate for the people.
- Small teams vs big teams
Normal wisdom says that you shouldn’t work in silos. If you’re just in silos you have a bus count of 1. At the other end of the spectrum, if you pair all the time, you might find that context switching is a large overhead.
Balance pair rotation with context switching.
Try having a separate standup with your stakeholder. You can plan just what you need for the day.
- Automated vs manual builds As soon as you automate anything you’re locking down the process. If you don’t try the manual process you might miss out information about the process.
- Automated vs manual testing If you only have automated tests you miss out on exploratory tests. Not only are you testing when you manually test, you’re reviewing.
- Test-first vs test-driven vs test-after vs test-whenever Test first is all of the tests at the start.
Hack stuff together, see how it works, if you like it then you can make it ready for production.
TDD is walking through water up to your chest. You’re not drowning, but you aren’t going as fast as swimming.
- Feedback from users vs feedback from defects
Invest in code based on evidence. Why put lots of effort into writing tests when you aren't sure that the code is going to be used?
- Monolith vs components As soon as you've got components you've got to consider the communications between the components.
Small expendable co-operating components. If I make the assumption that half the code won't be there in three months, how does that affect the way I approach it?
- Each component is fit for purpose.
- Hard shell, soft centre.
- The message is the API.
The idea that you can copy and paste, and it’s alright. If you know what you’re doing, you don’t need to worry too much, until it’s a problem.
- Automated vs manual As soon as you automate you eliminate your ability to learn about it.
- Vertical vs horizontal scaling
- Hosted vs in-house
- Bespoke or commodity
Just get something working is the walking skeleton pattern. The dancing skeleton is putting something into production really really quickly. It’d use the full stack, and have a REPL. You have strings into the app, and pull the strings to make it dance.
If you don’t understand your trade off, you don’t understand the decision you’re making.
When you know what you are trading off you can make informed decisions.
Greg Young
Likes to look inside the program for a conference.
Write stuff. Ask what works.
Dysfunction: Spring, Tomcat, Hibernate, …, what are we building?
Despises ORMs: impedance mismatch. If you have a domain model that's different from the relational DB then there's a lot of pain.
Years ago the selling points for DBs were things that we don’t even consider now. Now the selling points for ORMs are that we can move from Oracle to MySQL.
When you have two sides to a thing, and one side is really easy to measure and the other is really hard, we end up over optimising for the really easy to measure side.
Find an example of a function that you’ve used to reduce DRY. Copy and paste it. Call different functions from different places. Remove checks if you know that they’re not needed for the specific case. Get rid of things you know aren’t needed. If the functions are now different, you’ve found an example of coupling.
Developers love building things that no one wants. We love building abstractions.
In many cases it’s worth writing code two or three times, rather than extracting a common interface.
How many people think that they could move off of Hibernate within a fortnight? Not many.
What percentage of the framework do you actually use? By abstracting you're often complicating things, and making them harder to understand.
Understand that non-programmers can give as much value by hacking around in Access as we do.
We could learn a lot by hiring a programmer.
We’re trained to aim for perfection, when really business doesn’t have these problems. Business doesn’t really have problems that need optimising.
Rich Hickey
Simplicity is prerequisite for reliability.
- Simple
- sim-plex one fold/braid
One role or task that something has to do. One concept. Different from one instance and one operation. Don't interleave.
It’s objective: things are either twisted together or they’re not.
- Easy
- lie near
Near: on our hard drive; near to our understanding; near to our capabilities. "No one's really significantly smarter than anyone else".
Easy is relative. Near to what?
We repeatedly choose things that are near. If we continue to do that we’ll not learn anything new.
We focus on the experience of using the construct, rather than the long-term results of using the artifact.
Assess constructs by their artifacts, not by the experience of using them.
You can’t make something reliable if you don’t understand it. If things are intertwined, and you need to examine one thing that has a problem, you get everything that it’s attached to.
I don’t think your test suite makes you able to change your code without fear.
Your ability to reason about your code is your ability to change it without fear.
Every bug that you’ve found: passed all of the tests, satisfied the type checker. Your ability to reason about your program is critical to debugging.
If you emphasise ease, you’ll be speedy at the beginning; ignoring the complexity will slow you down over the long term.
If you aren’t careful an elephant will come into your standup and trample everyone.
Easy things can be complicating. What matters is the complexity that the ‘easy’ things yield. For example ‘x=5’ is very easy, but what does it mean when you find it in the middle of a block of code.
Ease of understanding, change, debugging. Flexible policy, location. Just because you can test something doesn’t mean you can change it rapidly.
Get used to it. But what about thinking about it? How quickly can you change your ability to think about the problem? Not very. The distance between your problem and yourself is large. What are you going to do? Change the distance between you and the problem, by making it simpler.
You’re not used to them, so they’re not nearby. Are they simple? No! Not for Scheme. What could be simpler than having one thing? The one thing isn’t the problem. Because there’s one thing, the concept is overloaded. So the thing that’s simpler that one thing, is more than one thing. In Clojure, parens almost always mean a call, vectors are used for grouping.
LISP programmers know the value of everything and the cost of nothing. –Alan Perlis
More recently s/value/benefit/.
Complexity | Simplicity |
---|---|
State | Values |
Methods | Functions, namespaces |
ORM | Declarative data manipulation |
Syntax | Data |
You complected my thing. To interleave, entwine or braid.
Best to avoid in the first place. It wasn’t a mistake in the beginning: we went out and did it. What’s the fix? Compose.
Composing simple components is the key to robust software.
Simplicity implies partitioning and stratification, but not the other way around. Be careful. Just because you’ve partitioned things, doesn’t mean that it is going to be simple. This is important. Component A may well ‘know’ about the operation of B, and B may know about A in the same way. This isn’t simple.
But it is easy. State makes your program more complicated even if you have only one thread.
If you’re using more than one variable to represent a thing you’re doing it wrong.
Construct | Complects |
---|---|
State | Everything that it touches |
Objects | State, identity, value, ops, … Loads of stuff all baked in; everything just gets horrid. |
Syntax | Meaning, order. The meaning of something and the arrangement are combined. |
Inheritance | Types |
Switch / matching | Multiple who/what pairs |
Variables | Value, time |
Imperative loops, fold | What / how |
ORM | OMG! |
Conditionals | Why, rest of program |
Use the language or libraries.
The kind of complexity that you can’t do anything about. Individual good decisions don’t combine to make many good things.
Don’t do it too much. It’s certainly been bashed a lot here. But realise that it can be right. Choose, say, ten abstractions for your team. You won’t choose IThingyFactoryProcessor because then you’ve only got nine abstractions left.
I don’t know, I don’t want to know.
Information is simple, but don’t make it more complex by shoving it into a class with its own micro-language.
Encapsulation is for implementation details, not for information. Information doesn’t have implementation, unless you added one. Why? If the answer is, “because I’m using Java” that’s not a great answer.
Wrapping information: that's the way the languages make you do it.
Litmus test: Can you move it? Can you move your subsystems to a different process, thread, language? How much do you need to change?
If you pull stuff out of process, then perhaps you'll get IOExceptions pervading everything. Subsystems should have data as the interface: data in, data out. Not IPersonInfo.
See HTTP calls: are we going to be making circular HTTP calls to get stuff done? No! That’s stupid! So why do it in one process?
Get to dislike entanglement. Your tools don’t measure simplicity: tests, type checkers, refactoring. They’re not bad, but they don’t do it for you.
- Choose simple constructs
- Create abstractions with simplicity as a basis
- Often simple means more things, not fewer.
Many suits.
RBS pioneered the use of XML and made it easy. Ha ha ha.
Change. There’s lots of it.
- Regulation
- Faster trades Nanosecond timing
- Utilisation Just enough hardware
All has to be done with tighter budgets. “I’m sure you’ll all be glad to hear”.
- UX Not just a pretty front end, improve the effectiveness of the users of the system.
- Big data
- Data visualisation
- Data virtualisation
Avoiding GC pauses in high frequency trading.
In the past, things had to be written in C++ because it was seen as the only way to get away from the GC.
The Manhattan Processor is similar to the LMAX disruptor.
What happens if Apple trading at 595p announces good financial results, and therefore the price rises to 615p, but Java decides to perform a full GC at the same time? Big deal!
Cost: ~1 second of pause per GB of heap, at best. Tuning the GC isn't really enough.
Two options:
- Use massive Eden space, and don’t go over that space. Then you won’t have a GC run.
- Use a small Eden space, then GC will run in known time.
RBS use the second option. For many reasons it’s easier: if you’re using third party libraries, you don’t know how much space they’re going to use.
- In-house, non-allocating queue specialised for the multiple-producer, single-consumer requirement.
- Predictable pauses < 1ms
- No full collection in 15hr trading day.
Predicting the future.
- Linear approximation
- Single-step Monte Carlo
- Multi-step Monte Carlo
The market is complex. It needs to be modelled as such.
Random number generator. More experiments give better results.
E.g. estimate the area of a circle: is a random point inside the circle? The ratio of hits to total points, scaled by the area of the enclosing square, gives the area of the circle.
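A minimal sketch of that worked example: sample points in the 2 × 2 square around a unit circle; the estimate approaches π as the sample count grows.

```scala
import scala.util.Random

def estimateCircleArea(samples: Int, rng: Random = new Random()): Double = {
  // Count random points from the square that land inside the circle.
  val hits = (1 to samples).count { _ =>
    val x = rng.nextDouble() * 2 - 1
    val y = rng.nextDouble() * 2 - 1
    x * x + y * y <= 1.0
  }
  4.0 * hits / samples // square area (4) times the hit ratio
}
```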
Many events that can affect the outcome. Have to be able to model the events at discrete points in time.
Multi-step simulation needs thousands of compute cores. How do they manage that? Distributed computers. This can generate ~10 TB of data. How is this stored? It's got to be accessed concurrently by many engines.
- Speed
- Scalability
- Robustness
- Interoperability
- Support infrastructure
HBase, HDFS, etc. NAS. HDFS really stood out for them.
ESB allows them to distribute information around the organisation. There are limits to messaging.
Leaves interpretation of the facts to the consumer. Different consumers can end up with different interpretations of the truth.
Copying data lies at the root of many of the bank’s problems.
Central store: gets all eyes on a single version of the truth.
Need: low latency, high throughput.
And if synergies can be identified....
Michael Brunton-Spall @bruntonspall
Sharing failure. Cock up less. Mistakes are often the most interesting part.
content.guardianapis.com
Systems are going to fail.
Architect for failure:
- Prevent
- Mitigate
J2EE basics. Apply scaling basics, like load balancers and multiple app servers, but still only one DB. Can scale, in theory, by adding more balancers and servers.
- Can’t scale DB in the same way
Global load balancer -> load balancers -> App servers -> Multiple DBs as active / passive.
Gives redundancy with multiple data servers. Multiple internet connections.
- 3.5M daily browsers
- 1.6M unique pieces of content
- Hundreds of staff
- Can create micro-sites
- Monolithic
- One system that understands everything
- Football
- Finance
- Mortgage
- Content
- Deployment
- Build time
Framework. See front page. In a monolithic system everything has to be rendered by the same thing. Decompose it.
- Core content
- Metadata
- Microapps
Can be managed independently. Just because you change one bit
doesn’t mean you need to build and test everything else.
- Tweets
SSI like technology, over HTTP. You just have an URL to be embedded in the page, and that’ll return all the HTML that needs to appear in the page.
- Advantages
- HTTP well known
- Comes with caching
- max-age
- stale-if-error
- Squid extra thingy. I don't mind seeing stale stuff if there's really an error. The cache will continue returning data until you've fixed the microapp.
- Don’t care about what renders the HTML, therefore you aren’t tied to J2EE for everything. Tired of getters and setters. They can be hosted where you want: AppEngine, internally.
- If you host on AppEngine you don’t have to talk to your ops team.
- Therefore development is faster.
- “Actually I can do this better in a different language”.
- Innovation improves.
Microapps don’t talk to the database directly. They talk through the Content API.
Problem: what if you put the new thing on the main page, and everyone sees it. All of a sudden, the quota is exceeded and everyone sees an error. Solution: put a cache between the main app and the microapps.
- Cons
- Support
- Maintenance
- Diversification
- Decided to settle on the JVM. Willing to pay the cost of working with other things on the JVM.
- The ops team have to know how to monitor what you’ve written. Some of it is on EC2, some on AppEngine etc.
- Also increased complexity.
- Big cache. Memcached. 40G cached HTML.
- Latency (the big one)
- Microapp latency affects CMS latency.
- Slow is a big problem. Full GCs. 500s are better.
Peaky traffic was a problem. They featured a cuddly rat that everyone liked.
Sometimes you need speed, and can get rid of dynamic stuff.
- Caches don’t expire
- Page pressing
- Full page cache
- Capable of 1k pages/second/server
It’s more important to cache page one than page 700. Cache the important stuff.
One of the most important things you can do. What changed? What can I turn off to stop it going wrong? Is CPU usage that important when something is going wrong? It’s normally a side effect.
Alerting is not monitoring.
You need them so you can turn things off. The Guardian uses release valves.
Some switches are automatic. For example emergency mode is automatic. Don’t alert just because you’ve gone into emergency mode, that’s fine. If you go into emergency mode every five minutes that’s a problem.
Don’t ping-pong automatic switches. Go in, wait, then leave.
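A minimal sketch of the "go in, wait, then leave" rule; the thresholds and dwell time are illustrative:

```scala
// An automatic switch that won't ping-pong: once it trips it stays tripped
// for at least minDwellMs before it is allowed to clear.
class EmergencySwitch(minDwellMs: Long) {
  private var on = false
  private var enteredAt = 0L

  def update(errorRate: Double, nowMs: Long): Boolean = {
    if (!on && errorRate > 0.05) { on = true; enteredAt = nowMs }
    else if (on && errorRate < 0.01 && nowMs - enteredAt >= minDwellMs) on = false
    on
  }
}
```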
Why care? You must be able to architect the system so you can diagnose the problem. Log analysis is important. Get the logs off the server, then you can restart the server.
If your logs are 30 gigs then copying will take a long time. Your logs must be parsable. You’ve got to be able to use grep to find the problem.
Every thing is a DOM element. Everything is a span.
Problem: if you edit an 80,000 line file, how does the browser handle that? How big would the DOM be?
Solution: only show the bits that you can see.
240,000 lines of code.
Compare the good parts to the whole language.
Static file analysis.
Two things needed
- Parse
- Analyse
JavaScript in, AST (abstract syntax tree) out.
“The jquery of analysis”.
- 25PB of compressed data
- 150PB of uncompressed data
- 400TB/day of uncompressed raw data
- Reporting
- Model generation
- Analysis
- Index generation
Non-intuitive results.
Impossible without the data infrastructure that they’ve built.
Rich Hickey
Time, process
- Value
- immutable magnitude, quantity, number
- Identity
- a putative entity we associate with a series of causally related values over time
- State
- Value of an identity at a moment in time
- Time
- Relative before / after ordering of causal values
Deals with things that change. Problem: two cars vying for the same parking space. What is place?
Should a parking space have a picnic method? Should a park have a post-gig method?
Light. We don’t put our eyes on the table. Time elapses. We don’t see the table right now. We just see the past.
You don’t see the present in your program. Look at the caching architecture. L1 cache, L2 cache, multiple independent caches per CPU.
- Reality
- Perception
- Memory
- Logic
In our program how do we perceive things? We take our eye to the table.
- Perceive
- take entirely
Sensory systems only ever perceive the past.
- Memory
- Mindful, remembering
New memories about the same identities don’t replace the old.
Sometimes like brain memory, mostly like place memory.
- Destroys the past
- Corrupts remembering
- Interferes with perception
Locking: freeze the universe while I read this stuff. Perception should be highly parallelizable.
- Process
- go forward, advance
Fan of functional programming, but got to allow change.
Leave the past behind. Not food calculators.
Connecting everything to the participants is inherently broken. Where's the win method on the sumo wrestler?
Things don’t change in place. Place includes time. The future is a function of the past. Coordination is desirable locally.
    F        F        F              pure functions (events)
    v1  =>  v2  =>  v3  =>  v4       states
All vs together: identity. Observers observe v1, v2, …
Yay!
- Immutable
- Safe on the disk
- Great fit for perceptions and memories
- Not idiomatic in Java
???
Wow this is going fast.
(swap! an-atom f args)
(f vN args) ;; becomes vN + 1
Clojure, but you can make one in Java easily.
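A sketch of that on the JVM (Scala here rather than Java, but the same idea): swap applies a pure function inside a compare-and-set retry loop.

```scala
import java.util.concurrent.atomic.AtomicReference

class Atom[A](init: A) {
  private val ref = new AtomicReference[A](init)

  def deref: A = ref.get()

  // Read the current value, apply the pure function, and publish the result
  // only if nobody else changed the value in the meantime; otherwise retry.
  @annotation.tailrec
  final def swap(f: A => A): A = {
    val oldV = ref.get()
    val newV = f(oldV)
    if (ref.compareAndSet(oldV, newV)) newV else swap(f)
  }
}
```

`new Atom(0).swap(_ + 1)` behaves like `(swap! an-atom inc)`.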
What if my logical unit of work involves a million steps? Creating a million interim values via pure function invocation is a waste.
Structure that is a symbiotic twin of persistent structures. Not lasting. When applied to structures, not persistent.
No one sees the transient in the middle of the op.
Function of transient to transient. Can’t affect the world.
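A rough analogue in Scala: the interim mutable structure never escapes the function, so nothing outside can observe the transient state.

```scala
def buildSquares(n: Int): Vector[Int] = {
  val b = Vector.newBuilder[Int] // mutable, but local and invisible to callers
  var i = 0
  while (i < n) { b += i * i; i += 1 }
  b.result()                     // the only thing the world sees is an immutable value
}
```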
John Allspaw - Etsy
We have a responsibility to write operable software. It's as important as functionality. How it runs in production is your main concern.
Last year AWS went down for 80 hours. Failure can happen to anyone.
How can this happen in 2012? Complex systems:
- Cascading
- non-linear
- feedback loops
- etc
Do we know what we’re doing?
We don’t need to ask why it happens, just get used to the fact that it is going to happen. Instead ask: what are you going to do when it does happen?
If we only spend effort and time on prevention, we’re missing out on working on solutions to failure. Pay the firefighters.
Have a look at this.
- Forced beyond your learned roles.
- Perform actions which have consequences that are unknown and or hard to see.
- Cognitively and perceptively noisy.
Learn from others that have done this before.
- Neglect how processes develop
- Forget what happened, almost as soon as it happens
- Difficulty dealing with exponential increases in speed
- Think in causal series, not causal nets. Not dominoes.
- Thematic vagabonding Flip from one thing to another.
- Goal fixation Opposite of the last. It’s got to be the x.
- Refusal to make decisions Too much stuff to think about. Not enough authority.
- Unnecessary heroism Non communicating lone-wolf
- Distraction Noise that doesn’t help.
- Skill based
Simple, routine
- Rule based
Knowable, rule based. Probably spend most of our time here, or above
- Knowledge based
WTF is going on? Lone-wolf, vagabonding etc.
Aircraft carrier: shrink SF airport onto an aircraft carrier, reduce the time between take off and landing.
- Close interdependence between groups
- Close coordination and information sharing
- High redundancy
- Broad definition of who belongs on the team
- Teammates included in the comms loop
- Lots of error correction
- Constant awareness of risk of failure
- Detailed records, so you can learn from them
- Authority patterns dispersed People who handle an outage should have the authority to handle the situation
- Reporting of errors is rewarded
- Drill Practice what you’re going to do. You’ve got to be familiar with the tools that you are going to use. Practice using tcpdump if that’s what you are going to have to use in an outage.
- GameDay In production. Affect the infrastructure, in production, while it’s being used. How much is it going to hurt? How quickly can we get the system rebuilt. Don’t freak out.
- Learn to improvise We can adapt. Realise it.
- Learn from mistakes
Postmortems.
- Timelines: what happened when.
- Put it in public, everyone invited.
- Search for second stories, instead of human error.
Why did it make sense for someone to make a mistake? It made sense to them at the time. They did something because it made sense. Don't blame training or specific individual characteristics.
- Cultivate blamelessness
This doesn’t mean everyone is off the hook.
- Give people authority to improve things
- Collaborative and skillful communication
- Share near-miss events Say when you’ve nearly cocked up.
Ironies of automation - Lisanne Bainbridge
- Move human from manual operator to supervisor.
- Augments humans' ability
- Doesn't remove human error. Errors can now occur at automation (design) time instead.
- Brittle You can only do what has been pre-programmed.
Things are stretched to capacity. As soon as there is an improvement, it will be exploited to a new tempo and intensity of activity.
There are considerations to weigh instead of blindly automating.
- Act like vaccines
- Happen more often, so more data
- Reminder of hazards
Don’t just think about failure. Think about why you don’t have an outage every day.
- Ways in which things go right are special cases of things that go wrong.
- Ways in which things go wrong are special cases of things that go right.
Which one? Perhaps both. Don’t just ask why did we fail, also ask why did we succeed?
Don’t ignore failure prevention.
Philip Wadler
Evolution happens in our field, much more rapidly than in life. What are the timescales? It seems that they’re short, but perhaps there are important things that evolve over larger timescales than are apparent.
Nazi. Founder of the field. Natural deduction. Boole, 1820s.
Want to be able to simplify proofs. Why? If a proof is in its simplest form, you know that it only involves subformulas of what you're proving. E.g. a simplified proof of A => B only involves subformulas of A and B, never some unrelated C.
There are no proofs of false. You don’t want to be able to prove false.
The history of logic is filled with simple things that people didn’t see for 20, or 30 or 50 years. You should be on the lookout for these sorts of things.
Interested in a nice notation for writing logical proofs, and wanted to be able to abstract procedures for intellectual control.
Function and record construction.
These two things are isomorphic. The correspondence wasn't published until 1980. This is why you should pay attention to theoretical computer science.
(Haskell Curry)
Curry noticed the correspondence between things that looked like programs and things that look like logic.
Every good idea in computing should first be discovered by a logician.
Great ideas are so great that you discover them twice.
You can prove (x + 1)^2 is x^2 + 2x + 1 and it’s true for all x.
Proofs always terminate. Typed lambda calculus therefore always terminates. But you want to be able to do interesting stuff, that you can only do with things that you can’t prove will terminate. Therefore, you can add back into the typed lambda calculus recursion that’ll let you not terminate.
You can write a language that will have known complexity.
Look them up, they're great. Once you've designed your library properly you immediately get things like comparison or printing for free.
Derive the type checks that you need in untyped code so that you can communicate from a typed language to an untyped language.
Web application framework for Scala and Java.
Need to be able to model the streams of data. One request per thread doesn’t scale.
Blocks until complete. I’d like to be notified when data is available.
IoC (inversion of control).
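A minimal sketch of that inversion of control; fetchChunk is a hypothetical non-blocking read, not Play's actual API:

```scala
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

// Stands in for a read that completes some time later.
def fetchChunk(): Future[Array[Byte]] = Future(Array.emptyByteArray)

// Instead of a thread blocking until the read completes, register a callback
// that runs when the data becomes available.
def demo(): Unit =
  fetchChunk().foreach(bytes => println(s"got ${bytes.length} bytes"))
```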
Graham Tackley @tackers
3.5M unique browsers, second most popular after the Daily Mail.
Line counts
- 185k lines of java
- 35k lines XML
- 72k lines velocity
- 250k lines of test java
- Two week code to live time.
- Slow to work with.
- Don’t use an ORM.
Pre 2009 everything had to be part of that code base. Post that, they implemented the micro-app framework mentioned in the other talk.
They value small independent components over the monolith (even though refactoring is easier in one big monolith). The relief of not having a massive code base was huge.
Clarity of intent over ceremonial abstraction.
They were using Python, but found that it was too different. They found that they didn’t want to throw away everything that they’d done already. Evolution not revolution.
Lots of the tools that they were used to they could continue to use. Same runtime. Still just a war deployed to production.
Huge drop in verbosity.
Started off writing tests. They were excited about writing the tests. They found that they were just writing Scala as more concise Java. They didn’t rewrite everything, but whenever they touched a class they rewrote it.
Eventually they decided to ditch the old Java libs that they were using and go to Scala + Lift + Solr.
Plan: they didn't really have one. The rest of the team saw their happiness with the new language. 21 back-end developers.
What's the best way of learning Scala? Sitting and talking about it. Most people estimated that it took between one and three months to get as productive in Scala as they were in Java.
How on earth do they get good Scala programmers? They look for good Java developers who want to learn Scala. They attract the polyglots who are interested in learning the language.
They simply haven't had the problems that have been popularly reported in the press.
- Clarity over cleverness
They don’t really do functional programming. They find that writing components that are immutable helps a lot.
Damien Katz
Functional concurrent programming language. Built at Ericsson in the '80s and '90s.
Used in telecoms switches.
Weird. “Boy, this is really bizarre”.
Once he was writing CouchDB he was really motivated to persist with learning it. Extremely productive.
Kind of slow.
The if expression is completely useless. case makes up for it.
Simple. Taken aback by the simplicity of it. No classes, no OO, but it has records. Syntactic sugar for tagged tuples.
Functional. Create closures, pass around functions.
Wow, this seems odd. Comparing it to OO languages.
You get used to it. Before you know it you’re productive.
Has processes. Communicate with messages.
You don't get stop-the-world GC. Every process can be GCed individually.
Let it crash. The supervisor process will notice. Simplifies application code, leads to much more reliable software.
Hard to believe, but it’s true.
Makes it very easy to extract what you want.
In the past, there was one CPU and uniform main memory. Concurrency was about making one processor do many things.
Now we’ve got the multi-core present. NUMA : non uniform memory access.
We still program against a big flat memory model where each access has the same cost. But that's not the case: our hardware doesn't work that way any more.
Erlang maps very well to the multi-core reality that we have today. It looks a lot like the physical hardware that we’re running on.
Messaging between systems should be as easy as messaging within the system itself.
Not so fast.
What's the problem? Syntax. The weird syntax makes Erlang slow, indirectly: weird syntax makes it hard to get massive adoption, and without massive adoption it's hard to get lots of people making it very fast.
Back end heavy lifting systems.