URL Shortener

What is it?

Creates short aliases for long URLs; visiting the alias redirects the user to the original URL.
Refer: TinyURL, Bit.ly

Requirements

Functional requirements

  • Generate short URL
  • Short URL should redirect to original URL
  • Users should be able to pick a custom URL
  • Links will have an expiry time

Non-functional requirements

  • Minimal latency in redirection
  • System should be highly available
  • Autogenerated links must be non-guessable

Miscellaneous

  • The service should expose REST APIs

Capacity estimation

Traffic

Assume:

  • 500M shortenings/month
  • 100:1 read-write ratio

So,
URL writes/s = 500M / (30 days * 24 hours * 3600 seconds) ≈ 200 writes/s
URL reads/s = 200 * 100 = 20k reads/s //100:1 read-write ratio

Storage

Assume:

  • 500M shortenings/month
  • Each URL is stored for 5 years
  • Each URL object is 500 bytes (including related metadata)

So,
Total objects = 500M * 12 months * 5 years = 30B
Total storage = 30B * 500 bytes = 15 TB

Bandwidth

For write requests: 200 URLs/s * 500 bytes = 100 KB/s
For read requests: 20k URLs/s * 500 bytes = 10 MB/s

Memory

Requests/day = 20k URLs/s * 3600 seconds * 24 hours ≈ 1.7B
Assume only 20% of daily requests are cached; memory needed = 0.2 * 1.7B * 500 bytes ≈ 170 GB

Summary:
Assuming 500M URLs/month and 100:1 read-write ratio:

| Parameter                 | Value    |
| ------------------------- | -------- |
| New URLs                  | 200/s    |
| URL redirections          | 20k/s    |
| Incoming data (bandwidth) | 100 KB/s |
| Outgoing data (bandwidth) | 10 MB/s  |
| Storage for 5 years       | 15 TB    |
| Cache memory              | 170 GB   |
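The estimates above can be checked with a few lines of arithmetic (pure back-of-the-envelope math; all inputs are the assumptions stated earlier):

```python
# Back-of-the-envelope check of the capacity numbers above.
SECONDS_PER_MONTH = 30 * 24 * 3600       # ~2.59M seconds
NEW_URLS_PER_MONTH = 500_000_000         # 500M shortenings/month
READ_WRITE_RATIO = 100                   # 100:1 reads to writes
URL_OBJECT_BYTES = 500                   # URL object incl. metadata
RETENTION_YEARS = 5
CACHE_FRACTION = 0.20                    # cache 20% of daily traffic

writes_per_sec = NEW_URLS_PER_MONTH / SECONDS_PER_MONTH       # ~193, rounded to 200
reads_per_sec = writes_per_sec * READ_WRITE_RATIO             # ~19.3k, rounded to 20k
total_objects = NEW_URLS_PER_MONTH * 12 * RETENTION_YEARS     # 30 billion
total_storage_tb = total_objects * URL_OBJECT_BYTES / 1e12    # 15 TB
reads_per_day = reads_per_sec * 24 * 3600                     # ~1.7 billion
cache_gb = CACHE_FRACTION * reads_per_day * URL_OBJECT_BYTES / 1e9  # ~170 GB
```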

System APIs

Create API

POST createURL

Params:
url
api_dev_key //Optional, needs to sign up to create a large number of URLs
short_url //Optional
expiry_date //Optional

Delete API

DELETE deleteURL

Params:
api_dev_key
short_url

Database design

Observations:

  • Billions of objects
  • Each object is < 1 KB
  • No relationships between objects
  • Read-heavy

Each object has 2 parts - URL details, User details.
Let tables be:
URL

| Field         | Datatype     |
| ------------- | ------------ |
| hash (PK)     | varchar(16)  |
| url           | varchar(512) |
| creation_date | datetime     |
| expiry_date   | datetime     |
| user_id       | int          |

User

| Field         | Datatype    |
| ------------- | ----------- |
| user_id (PK)  | int         |
| name          | varchar(20) |
| email_id      | varchar(32) |
| creation_date | datetime    |
| last_login    | datetime    |

A NoSQL DB is more suitable in this case due to the absence of relationships.

Basic design and algorithm

Compute the hash (say MD5) of the given URL and then encode the digest.
Using base64 (A-Z, a-z, 0-9, +, /), a 6-character key gives 64^6 ≈ 68.7B possibilities.

However, hashing the same URL always results in the same short URL, so duplicate entries collide.
To prevent duplicates, we can append a random character or the unique user_id as a salt before hashing.
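The scheme above can be sketched as follows, assuming MD5 and the URL-safe base64 variant (which substitutes "-" and "_" for "+" and "/", so the key is safe inside a URL):

```python
# Hash-then-encode with an optional salt; assumes MD5 and URL-safe
# base64 so the resulting key contains no "+" or "/" characters.
import base64
import hashlib

def shorten(url: str, salt: str = "", length: int = 6) -> str:
    digest = hashlib.md5((salt + url).encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest)[:length].decode("ascii")
```

The same URL with the same salt always yields the same key; salting with a user_id gives different users different short URLs for the same long URL.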

Concurrency problems: two app-servers create a new key at the same time; both check the DB and see that the key is not already in use.
Both try to store that key in the DB; the second insert will fail.

To create unique keys at scale, Twitter's Snowflake can be referenced.

To prevent collisions and concurrency problems:
We can have a standalone key-generation service (KGS) that has a list of unique keys and provides that to app-servers to use.
Keys can be pre-generated and stored in the KGS, which hands them out to app-servers on request, so app-servers never have to hit the DB to check whether a key already exists.

Assuming 1 byte per character, storage required to store all keys: 6 (characters per key) * 68.7B (unique keys) = 412 GB.

When multiple KGS are in use, the keys can be generated centrally by a server and a certain number of keys can be distributed to each KGS in batches.
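A KGS handing out pre-generated keys in batches can be sketched as follows (toy alphabet and key length to keep the demo small; a production KGS would persist its free list and mark handed-out batches as used so a crashed app-server cannot cause key reuse):

```python
# Toy key-generation service: pre-generates every key up front and
# hands them to app-servers in batches under a lock.
import itertools
import string
import threading

class KeyGenerationService:
    def __init__(self, alphabet=string.ascii_letters + string.digits, length=2):
        # Pre-generate all keys of the given length (tiny demo parameters).
        self._free = ["".join(p) for p in itertools.product(alphabet, repeat=length)]
        self._lock = threading.Lock()

    def take_batch(self, n):
        # App-servers grab n keys at once instead of asking per request.
        with self._lock:
            batch, self._free = self._free[:n], self._free[n:]
            return batch
```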

Data Partitioning and Replication

We can use:

  • Range-based partitioning: URLs are stored in a partition chosen by the first letter of the short key. But some letters are far more common than others, so partitions become unbalanced.
  • Hash-based partitioning: URLs are stored in the DB partition based on the hash. Consistent hashing can be used.
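A minimal consistent-hash ring for the hash-based option might look like this (no virtual nodes; real deployments add several ring points per server to smooth the balance):

```python
# Minimal consistent-hash ring: each DB server owns the arc of hash
# space up to its point; a key maps to the first point clockwise.
import bisect
import hashlib

class Ring:
    def __init__(self, nodes):
        self._points = sorted((self._h(n), n) for n in nodes)
        self._keys = [p for p, _ in self._points]

    @staticmethod
    def _h(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, short_key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._keys, self._h(short_key)) % len(self._keys)
        return self._points[i][1]
```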

Caching

We can cache 20% of the URLs, which will need 170 GB.
The LRU cache eviction policy fits this use-case.
On a cache miss, fetch the URL from the DB, return it, and write it into the cache; any update to a cached entry should be propagated to all cache replicas.
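An LRU cache can be sketched with an `OrderedDict`; on a miss the caller fetches from the DB and `put`s the value, evicting the least-recently-used entry when full:

```python
# LRU cache sketch: get() returns None on a miss (caller then goes to
# the DB), and put() evicts the least-recently-used entry when full.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self._cap = capacity
        self._d = OrderedDict()

    def get(self, key):
        if key not in self._d:
            return None              # cache miss
        self._d.move_to_end(key)     # mark as most recently used
        return self._d[key]

    def put(self, key, value):
        self._d[key] = value
        self._d.move_to_end(key)
        if len(self._d) > self._cap:
            self._d.popitem(last=False)  # evict least recently used
```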

Load balancing

We can place load balancers between:

  • Client and app-servers
  • App-servers and DB servers
  • App-servers and cache servers

Distribution would depend on load.

DB cleanup

If a user tries to access an expired link, delete it lazily at that point and return an error.
In addition, a periodic cleanup service can scan for expired links and invalidate them in batches.
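Both strategies can be sketched together (`resolve` and `sweep` are hypothetical names; the store is in-memory and times are Unix timestamps):

```python
# Lazy deletion on access plus a periodic batch sweep; resolve() and
# sweep() are hypothetical names for this sketch.
import time

store = {}  # short_key -> (url, expiry_ts or None)

def resolve(key, now=None):
    now = time.time() if now is None else now
    entry = store.get(key)
    if entry is None:
        return None
    url, expiry = entry
    if expiry is not None and expiry <= now:
        del store[key]           # lazy cleanup on access
        return None
    return url

def sweep(now=None):
    now = time.time() if now is None else now
    expired = [k for k, (_, exp) in store.items() if exp is not None and exp <= now]
    for k in expired:            # batch invalidation
        del store[k]
    return len(expired)
```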