Note: These lecture notes were slightly modified from the ones posted on the 6.858 course website from 2014.
- Historically, worried about EM signals leaking. NSA TEMPEST.
- Broadly, systems may need to worry about many unexpected ways in which
  information can be revealed.
Example setting: a server (e.g., Apache) has an RSA private key.
- Server uses RSA private key (e.g., decrypt message from client).
- Something about the server's computation is leaked to the client.
Many information leaks have been looked at:
- How long it takes to decrypt.
- How decryption affects shared resources (cache, TLB, branch predictor).
- Emissions from the CPU itself (RF, audio, power consumption, etc).
Side-channel attacks don't have to be crypto-related.
- E.g., operation time relates to which character of password was incorrect.
- Or time related to how many common friends you + some user have on Facebook.
- Or how long it takes to load a page in browser (depends if it was cached).
- Or recovering printed text based on sound from dot-matrix printer.
- But attacks on passwords or keys are usually the most damaging.
Adversary can analyze information leaks, use it to reconstruct private key.
- Currently, side-channel attacks on systems described in the paper are rare.
- E.g., Apache web server running on some Internet-connected machine.
- Often some other vulnerability exists and is easier to exploit.
- Slowly becoming a bigger concern: new side-channels (VMs), better attacks.
- Side-channel attacks are more commonly used to attack trusted/embedded hw.
- E.g., chip running cryptographic operations on a smartcard.
- Often these have a small attack surface, not many other ways to get in.
- As paper mentions, some crypto coprocessors designed to avoid this attack.
What's the "Remote timing attacks are practical" paper's contribution?
- Timing attacks known for a while.
- This paper: possible to attack standard Apache web server over the network.
- Uses lots of observations/techniques from prior work on timing attacks.
- To understand how this works, first let's look at some internals of RSA.
- Pick two random primes, p and q.
- Let n = p*q.
- A reasonable key length, i.e., |n| or |d|, is 2048 bits today.
- Euler's function phi(n): number of elements of Z_n^* relatively prime to n.
- Theorem [no proof here]: a^(phi(n)) = 1 mod n, for all a relatively prime to n.
- So, how to encrypt and decrypt?
  - Pick two exponents d and e, such that m^(e*d) = m (mod n), which
    means e*d = 1 mod phi(n).
  - Encryption will be c = m^e (mod n); decryption will be m = c^d (mod n).
- How to get such e and d?
  - For n=pq, phi(n) = (p-1)(q-1).
  - Easy to compute d=1/e, if we know phi(n) (extended Euclidean algorithm).
  - In practice, pick small e (e.g., 65537), to make encryption fast.
- Public key is (n, e).
- Private key is, in principle, (n, d).
  - Note: p and q must be kept secret!
    - Otherwise, adversary can compute d from e, as we did above.
    - Knowing p and q also turns out to be helpful for fast decryption.
  - So, in practice, private key includes (p, q) as well.
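The textbook RSA math above can be sketched in a few lines of Python. This is a toy illustration (tiny primes, no padding); the helper name `egcd` and the specific constants are illustrative, not from the notes:

```python
def egcd(a, b):
    # Extended Euclidean algorithm: returns (g, x, y) with a*x + b*y = g.
    if b == 0:
        return (a, 1, 0)
    g, x, y = egcd(b, a % b)
    return (g, y, x - (a // b) * y)

p, q = 61, 53              # two "random" primes (toy-sized for illustration)
n = p * q                  # n = 3233
phi = (p - 1) * (q - 1)    # phi(n) = (p-1)(q-1) = 3120
e = 17                     # small public exponent, gcd(e, phi) = 1
d = egcd(e, phi)[1] % phi  # d = 1/e mod phi(n), here 2753

m = 42
c = pow(m, e, n)           # encryption: c = m^e mod n
assert pow(c, d, n) == m   # decryption: m = c^d mod n
```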
- Note: RSA is tricky to use "securely" -- be careful if using RSA directly!
  - Ciphertexts are multiplicative: E(a)*E(b) = a^e * b^e = (ab)^e.
    - Can allow adversary to manipulate encryptions, generate new ones.
  - RSA is deterministic.
    - Encrypting the same plaintext will generate the same ciphertext each time.
    - Adversary can tell when the same thing is being re-encrypted.
  - Typically solved by "padding" messages before encryption.
    - OAEP is a common padding scheme:
      - Take plaintext message bits, add padding bits before and after plaintext.
      - Encrypt the combined bits (must be less than |n| bits total).
      - Padding includes randomness, as well as fixed bit patterns.
      - Helps detect tampering (e.g. ciphertext multiplication).
- Key problem: fast modular exponentiation.
- In general, quadratic complexity.
- Multiplying two 1024-bit numbers is slow.
- Computing the modulus for 1024-bit numbers is slow (1024-bit division).
- Recall what the CRT says:
  - if x == a1 (mod p) and x == a2 (mod q), where p and q are relatively prime,
  - then there's a unique solution x == a (mod pq),
  - and, there's an efficient algorithm for computing a.
- Suppose we want to compute m = c^d (mod pq).
  - Can compute m1 = c^d (mod p), and m2 = c^d (mod q).
  - Then use CRT to compute m = c^d (mod n) from m1, m2; it's unique and fast.
  - Computing m1 (or m2) is ~4x faster than computing m directly (~quadratic).
  - Computing m from m1 and m2 using CRT is ~negligible in comparison.
  - So, roughly a 2x speedup.
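A sketch of the CRT speedup, using a toy key (the recombination step below is Garner's formula, one standard way to implement the CRT step):

```python
p, q = 61, 53                # toy primes
n = p * q
e, d = 17, 2753              # toy public/private exponents
c = pow(42, e, n)            # some ciphertext

# Decrypt mod p and mod q separately; the exponent may also be reduced
# mod p-1 and q-1 (Fermat's little theorem), as implementations do.
m1 = pow(c, d % (p - 1), p)  # m1 = c^d mod p
m2 = pow(c, d % (q - 1), q)  # m2 = c^d mod q

# Recombine: the unique m mod pq with m == m1 (mod p) and m == m2 (mod q).
q_inv = pow(q, -1, p)        # q^-1 mod p (Python 3.8+)
h = (q_inv * (m1 - m2)) % p
m = m2 + h * q
assert m == pow(c, d, n)     # matches direct decryption
```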
- Naive approach to computing c^d: multiply c by itself, d times.
- Better approach, called repeated squaring:
  - c^(2x)   = (c^x)^2
  - c^(2x+1) = (c^x)^2 * c
  - To compute c^d, first compute c^(floor(d/2)), then use above for c^d.
  - Recursively apply until the computation hits c^0 = 1.
  - Number of squarings: |d| (the number of bits needed to represent d).
  - Number of multiplications: number of 1 bits in d.
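The recursion above translates directly into code; a minimal sketch (`modexp` is a name introduced here):

```python
def modexp(c, d, n):
    # Repeated squaring: halve the exponent, then apply
    # c^(2x) = (c^x)^2 or c^(2x+1) = (c^x)^2 * c.
    if d == 0:
        return 1 % n                   # c^0 = 1
    half = modexp(c, d // 2, n)        # c^(floor(d/2)) mod n
    if d % 2 == 0:
        return (half * half) % n
    return (half * half * c) % n

assert modexp(7, 2753, 3233) == pow(7, 2753, 3233)
```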
- Better yet (sometimes), called sliding window:
  - c^(2x)    = (c^x)^2
  - c^(32x+1) = (c^x)^32 * c
  - c^(32x+3) = (c^x)^32 * c^3
  - ...
  - c^(32x+z) = (c^x)^32 * c^z, generally (where z <= 31)
  - Can pre-compute a table of all necessary c^z powers, store in memory.
  - The choice of power-of-2 constant (e.g., 32) depends on usage.
  - Costs: extra memory, extra time to pre-compute powers ahead of time.
  - Note: only pre-compute odd powers of c (use first rule for even).
  - OpenSSL uses 32 (table with 16 pre-computed entries).
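One way to sketch sliding-window exponentiation with a table of odd powers. This is loosely modeled on the description above, not OpenSSL's actual code; the window width `w` and function name are choices made here:

```python
def modexp_window(c, d, n, w=5):
    # Precompute the odd powers c^1, c^3, ..., c^(2^w - 1) mod n.
    c2 = (c * c) % n
    table = {1: c % n}
    for z in range(3, 1 << w, 2):
        table[z] = (table[z - 2] * c2) % n

    bits = bin(d)[2:]
    result = 1
    i = 0
    while i < len(bits):
        if bits[i] == '0':
            result = (result * result) % n       # even case: just square
            i += 1
        else:
            # Greedily grab up to w bits, shrinking the window so it ends
            # in a 1 bit (so its value z is odd and in the table).
            j = min(i + w, len(bits))
            while bits[j - 1] == '0':
                j -= 1
            z = int(bits[i:j], 2)
            for _ in range(j - i):
                result = (result * result) % n   # square once per window bit
            result = (result * table[z]) % n     # one multiply by c^z
            i = j
    return result

assert modexp_window(7, 2753, 3233) == pow(7, 2753, 3233)
```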
- Reducing mod p each time (after square or multiply) is expensive.
  - Typical implementation: do long division, find remainder.
  - Hard to avoid reduction: otherwise, value grows exponentially.
- Idea (by Peter Montgomery): do computations in another representation.
  - Shift the base (e.g., c) into a different representation upfront.
  - Perform modular operations in this representation (will be cheaper).
  - Shift numbers back into the original representation when done.
  - Ideally, savings from reductions outweigh cost of shifting.
- Montgomery representation: multiply everything by some factor R.
  - a mod q <-> aR mod q
  - b mod q <-> bR mod q
  - c = a*b mod q <-> cR mod q = (aR * bR)/R mod q
  - Each mul (or sqr) in Montgomery-space requires division by R.
- Why is modular multiplication cheaper in Montgomery representation?
  - Choose R so division by R is easy: R = 2^|q| (2^512 for 1024-bit keys).
  - Because we divide by R, we will often not need to do mod q:
    - |aR| = |q|
    - |bR| = |q|
    - |aR * bR| = 2|q|
    - |aR * bR / R| = |q|
- How do we divide by R cheaply?
  - Only works if lower bits are zero.
  - Observation: since we care about value mod q, multiples of q don't matter.
  - Trick: add multiples of q to the number being divided by R, make low bits 0.
    - For example, suppose R=2^4 (10000), q=7 (111); divide x=26 (11010) by R.
      - x+2q    = (binary)  101000
      - x+2q+8q = (binary) 1100000
      - Now, can easily divide by R: result is binary 110 (or 6).
  - Generally, always possible:
    - Low bit of q is 1 (q is prime), so can "shoot down" any bits.
    - To "shoot down" bit k, add 2^k * q.
    - To shoot down low-order bits l, add q*(l*(-q^-1) mod R).
    - Then, dividing by R means simply discarding low zero bits.
- One remaining problem: result will be < R, but might be > q.
  - If the result happens to be greater than q, need to subtract q.
    - This is called the "extra reduction".
  - When computing x^d mod q, Pr[extra reduction] = (x mod q) / 2R.
    - Here, x is assumed to be already in Montgomery form.
    - Intuition: as we multiply bigger numbers, will overflow more often.
- How to multiply 512-bit numbers?
- Representation: break up into 32-bit values (or whatever hardware supports).
- Naive approach: pair-wise multiplication of all 32-bit components.
- Same as if you were doing digit-wise multiplication of numbers on paper.
- Requires O(nm) time if two numbers have n and m components respectively.
  - O(n^2) if the two numbers are close.
- Karatsuba multiplication: assumes both numbers have same number of components.
  - O(n^(log_2 3)) = O(n^1.585) time.
  - Split both numbers (x and y) into two components (x1, x0 and y1, y0).
    - x = x1 * B + x0
    - y = y1 * B + y0
    - E.g., B=2^32 when splitting 64-bit numbers into 32-bit components.
  - Naive: x*y = x1y1 * B^2 + x0y1 * B + x1y0 * B + x0y0
    - Four multiplies: O(n^2).
  - Faster: x*y = x1y1 * (B^2+B) - (x1-x0)(y1-y0) * B + x0y0 * (B+1)
               = x1y1 * B^2 + ( -(x1-x0)(y1-y0) + x1y1 + x0y0 ) * B + x0y0
    - Just three multiplies, and a few more additions.
  - Recursively apply this algorithm to keep splitting into more halves.
    - Sometimes called "recursive multiplication".
  - Meaningfully faster (no hidden big constants):
    - For 1024-bit keys, "n" here is 16 (512/32).
      - n^2 = 256
      - n^1.585 = 81
- Multiplication algorithm needs to decide when to use Karatsuba vs. Naive.
- Two cases matter: two large numbers, and one large + one small number.
- OpenSSL: if equal number of components, use Karatsuba, otherwise Naive.
- In some intermediate cases, Karatsuba may win too, but OpenSSL ignores
  this, according to the paper.
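A sketch of Karatsuba on Python integers, splitting at half the bit length each level (the 2^32 base-case threshold mirrors the 32-bit components mentioned above; negative intermediate differences simply fall through to the base case for simplicity):

```python
def karatsuba(x, y):
    # Base case: small (or negative-difference) operands use the
    # native multiply.
    if x < 2**32 or y < 2**32:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    B = 1 << half
    x1, x0 = divmod(x, B)                       # x = x1*B + x0
    y1, y0 = divmod(y, B)                       # y = y1*B + y0
    z2 = karatsuba(x1, y1)                      # x1*y1
    z0 = karatsuba(x0, y0)                      # x0*y0
    z1 = z2 + z0 - karatsuba(x1 - x0, y1 - y0)  # = x1*y0 + x0*y1
    return z2 * B * B + z1 * B + z0             # three multiplies total

a, b = 2**600 + 12345, 2**601 + 67890
assert karatsuba(a, b) == a * b
```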
- Server's SSL certificate contains public key.
- Server must use private key to prove its identity.
- Client sends random bits to server, encrypted with server's public key.
- Server decrypts client's message, uses these bits to generate session key.
- In reality, server also verifies message padding.
- However, can still measure time until server responds in some way.
Figure of decryption pipeline on the server:
* CRT * To Montgomery * Modular exp
--> * c_0 = c mod q --> * c'_0 = c_0*R mod q --> * m'_0 = (c'_0)^d mod q
/
/ * Use sliding window for bits
/ * of the exponent d
c * Karatsuba if c'_0 and q have
* same number of 32-bit parts
\
\ * Extra reductions proportional
\ * to ((c'_0)^z mod q) / 2R;
--> ... * z comes from sliding window
- Then, compute m_0 = m'_0/R mod q.
- Then, combine m_0 and m_1 using CRT to get m.
- Then verify padding in m.
- Finally, use payload in some way (SSL, etc).
- Victim Apache HTTPS web server using OpenSSL, has private key in memory.
- Connected to Stanford's campus network.
- Adversary controls some client machine on campus network.
- Adversary sends specially-constructed ciphertext in msg to server.
- Server decrypts ciphertext, finds garbage padding, returns an error.
- Client measures response time to get error message.
- Uses the response time to guess bits of q.
- Overall response time is on the order of 5 msec.
- Time difference between requests can be around 10 usec.
- What causes time variations?
- Karatsuba vs. Naive
- extra reductions
- Once guessed enough bits of q, can factor n=p*q, compute d from e.
- About 1M queries seem enough to obtain 512-bit p and q for 1024-bit key.
  - Only need to guess the top 256 bits of p and q, then use another algorithm.
- See the Remote timing attacks are practical paper cited in the References section at the end for more details.
- Let q = q_0 q_1 .. q_N, where N = |q| (say, 512 bits for 1024-bit keys).
- Assume we know some number j of high-order bits of q (q_0 through q_j).
- Construct two approximations of q, guessing q_{j+1} is either 0 or 1:
  - g    = q_0 q_1 .. q_j 0 0 0 .. 0
  - g_hi = q_0 q_1 .. q_j 1 0 0 .. 0
- Get the server to perform modular exponentiation (g^d) for both guesses.
  - We know g is necessarily less than q.
  - If g and g_hi are both less than q, time taken shouldn't change much.
  - If g_hi is greater than q, time taken might change noticeably:
    - g_hi mod q is small.
    - Less time: fewer extra reductions in Montgomery.
    - More time: switch from Karatsuba to normal multiplication.
  - Knowing the time taken can tell us if 0 or 1 was the right guess.
- How to get the server to perform modular exponentiation on our guess?
- Send our guess as if it were the encryption of randomness to server.
- One snag: server will convert our message to Montgomery form.
  - Since Montgomery's R is known, send (g/R mod n) as message to server.
- How do we know if the time difference should be positive or negative?
- Paper seems to suggest it doesn't matter: just look for large diff.
- Figure 3a shows the measured time differences for each bit's guess.
- Karatsuba vs. normal multiplication happens at 32-bit boundaries.
- First 32 bits: extra reductions dominate.
- Next bits: Karatsuba vs normal multiplication dominates.
- At some point, extra reductions start dominating again.
- What happens if the time difference from the two effects cancels out?
- Figure 3, key 3.
- Larger neighborhood changes the balance a bit, reveals a non-zero gap.
- How does the paper get accurate measurements?
  - Client machine uses processor's timestamp counter (rdtsc on x86).
  - Measure several times, take the median value.
    - Not clear why median; min seems like it would be the true compute time.
  - One snag: relatively few multiplications by g, due to sliding windows.
    - Solution: get more multiplications by values close to g (+ same for g_hi).
    - Specifically, probe a "neighborhood" of g (g, g+1, .., g+400).
- Why probe a 400-value neighborhood of g instead of measuring g 400 times?
  - Consider the kinds of noise we are trying to deal with.
  - (1) Noise unrelated to computation (e.g. interrupts, network latency).
    - This might go away when we measure the same thing many times.
    - See Figure 2a in the paper.
  - (2) "Noise" related to computation.
    - E.g., multiplying by g^3 and g_hi^3 in sliding window takes diff time.
    - Repeated measurements will return the same value.
    - Will not help determine whether mul by g or g_hi has more reductions.
    - See Figure 2b in the paper.
  - Neighborhood values average out the 2nd kind of noise.
    - Since neighborhood values are nearby, still have ~same # of reductions.
- Defense against the decryption-timing attack: RSA blinding.
  - Choose random r.
  - Multiply ciphertext by r^e mod n: c' = c*r^e mod n.
  - Due to multiplicative property of RSA, c' is an encryption of m*r.
  - Decrypt ciphertext c' to get message m'.
  - Divide plaintext by r: m = m'/r.
  - About a 10% CPU overhead for OpenSSL, according to Brumley's paper.
- Make all code paths predictable in terms of execution time.
- Hard, compilers will strive to remove unnecessary operations.
- Precludes efficient special-case algorithms.
- Difficult to predict execution time: instructions aren't fixed-time.
- Can we take away access to precise clocks?
- Yes for single-threaded attackers on a machine we control.
- Can add noise to legitimate computation, but attacker might average.
- Can quantize legitimate computations, at some performance cost.
- But with "sleeping" quantization, throughput can still leak info.
- Relatively tricky to develop an exploit (but that's a one-time problem).
- Possible to notice attack on server (many connection requests).
- Though maybe not so easy on a busy web server cluster?
- Adversary has to be close by, in terms of network.
- Not that big of a problem for adversary.
- Can average over more queries, co-locate nearby (Amazon EC2),
  run on a nearby bot or browser, etc.
- Adversary may need to know the version, optimization flags, etc of OpenSSL.
- Is it a good idea to rely on such a defense?
- How big of an impediment is this?
- If adversary mounts attack, effects are quite bad (key leaked).
- Page-fault timing for password guessing [Tenex system]
- Suppose the kernel provides a system call to check user's password.
- Checks the password one byte at a time, returns an error upon the first mismatch.
- Adversary aligns password, so that first byte is at the end of a page,
  rest of password is on next page.
  - Somehow arrange for the second page to be swapped out to disk.
  - Or just unmap the next page entirely (using equivalent of mmap).
- Measure time to return an error when guessing password.
  - If it took a long time, kernel had to read in the second page from disk.
    - Or, if unmapped, it crashed when the kernel tried to read the second page.
  - Means first character was right!
- Can guess an N-character password in 256*N tries, rather than 256^N.
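The underlying bug is an early-exit comparison. A sketch of the leaky check and a constant-time-style alternative (illustrative Python, not actual Tenex code):

```python
def check_password(guess, real):
    # Leaky: returns as soon as a byte mismatches, so the time taken (or
    # the page touched) reveals the position of the first wrong byte.
    for i in range(len(real)):
        if i >= len(guess) or guess[i] != real[i]:
            return False
    return len(guess) == len(real)

def check_password_ct(guess, real):
    # Safer: always scan everything, never branch on secret data.
    if len(guess) != len(real):
        return False
    diff = 0
    for a, b in zip(guess, real):
        diff |= a ^ b       # accumulate mismatches without branching
    return diff == 0

assert check_password(b"secret", b"secret")
assert not check_password_ct(b"secrex", b"secret")
```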
- Cache analysis attacks: processor's cache shared by all processes.
- E.g.: accessing one of the sliding-window multiples brings it in cache.
- Necessarily evicts something else in the cache.
- Malicious process could fill cache with large array, watch what's evicted.
- Guess parts of exponent (d) based on offsets being evicted.
- Cache attacks are potentially problematic with "mobile code".
- NaCl modules, Javascript, Flash, etc running on your desktop or phone.
- Network traffic timing / analysis attacks.
- Even when data is encrypted, its ciphertext size remains ~same as plaintext.
- Recent papers show can infer a lot about SSL/VPN traffic by sizes, timing.
- E.g., Fidelity lets customers manage stocks through an SSL web site.
- Web site displays some kind of pie chart image for each stock.
- User's browser requests images for all of the user's stocks.
- Adversary can enumerate all stock pie chart images, knows sizes.
- Can tell what stocks a user has, based on sizes of data transfers.
- Similar to CRIME attack mentioned in guest lecture earlier this term.
References:
- Remote timing attacks are practical
- Cache missing for fun and profit
- Efficient Cache Attacks on AES, and Countermeasures
- Get Your Hands Off My Laptop: Physical Side-Channel Key-Extraction Attacks on PCs
- Cross-VM Side Channels and Their Use to Extract Private Keys
- Ed25519: high-speed high-security signatures