URL Shortener

What is it?

Creates short aliases for long URLs; visiting the alias redirects the user to the original URL.
Refer: TinyURL, Bit.ly

Requirements

Functional requirements

  • Generate short URL
  • Short URL should redirect to original URL
  • Users should be able to pick a custom URL
  • Links will have an expiry time

Non-functional requirements

  • Minimal latency in redirection
  • System should be highly available
  • Autogenerated links must be non-guessable

Miscellaneous

  • The service should expose REST APIs

Capacity estimation

Traffic

Assume:

  • 500M shortenings/month
  • 100:1 read-write ratio

So,
URL writes/s = 500M / (30 days * 24 hours * 3600 seconds) ≈ 200 writes/s
URL reads/s = 200 * 100 = 20k reads/s //100:1 read-write ratio

Storage

Assume:

  • 500M shortenings/month
  • Each URL is stored for 5 years
  • Each URL object is 500 bytes (including related metadata)

So,
Total objects = 500M * 12 months * 5 years = 30B
Total storage = 30B * 500 bytes = 15 TB

Bandwidth

For write requests: 200 URLs/s * 500 bytes = 100 KB/s
For read requests: 20k URLs/s * 500 bytes = 10 MB/s

Memory

Requests/day = 20k URLs/s * 3600 seconds * 24 hours ≈ 1.7B
Assume only 20% of daily requests are cached; memory needed = 0.2 * 1.7B * 500 bytes ≈ 170 GB

Summary:
Assuming 500M URLs/month and 100:1 read-write ratio:

| Parameter                 | Value    |
| ------------------------- | -------- |
| New URLs                  | 200/s    |
| URL redirections          | 20k/s    |
| Incoming data (bandwidth) | 100 KB/s |
| Outgoing data (bandwidth) | 10 MB/s  |
| Storage for 5 years       | 15 TB    |
| Cache memory              | 170 GB   |
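The estimates above can be checked with a few lines of arithmetic (pure back-of-the-envelope math; all inputs are the assumptions stated earlier):

```python
# Back-of-the-envelope check of the capacity numbers above.
SECONDS_PER_MONTH = 30 * 24 * 3600       # ~2.59M seconds
NEW_URLS_PER_MONTH = 500_000_000         # 500M shortenings/month
READ_WRITE_RATIO = 100                   # 100:1 reads to writes
URL_OBJECT_BYTES = 500                   # URL object incl. metadata
RETENTION_YEARS = 5
CACHE_FRACTION = 0.20                    # cache 20% of daily traffic

writes_per_sec = NEW_URLS_PER_MONTH / SECONDS_PER_MONTH       # ~193, rounded to 200
reads_per_sec = writes_per_sec * READ_WRITE_RATIO             # ~19.3k, rounded to 20k
total_objects = NEW_URLS_PER_MONTH * 12 * RETENTION_YEARS     # 30 billion
total_storage_tb = total_objects * URL_OBJECT_BYTES / 1e12    # 15 TB
reads_per_day = reads_per_sec * 24 * 3600                     # ~1.7 billion
cache_gb = CACHE_FRACTION * reads_per_day * URL_OBJECT_BYTES / 1e9  # ~170 GB
```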

System APIs

Create API

POST createURL

Params:
url
api_dev_key //Optional, needs to sign up to create a large number of URLs
short_url //Optional
expiry_date //Optional

Delete API

DELETE deleteURL

Params:
api_dev_key
short_url

Database design

Observations:

  • Billions of objects
  • Each object is < 1 KB
  • No relationships between objects
  • Read-heavy

Each object has 2 parts - URL details, User details.
Let tables be:
URL

| Field         | Datatype     |
| ------------- | ------------ |
| hash (PK)     | varchar(16)  |
| url           | varchar(512) |
| creation_date | datetime     |
| expiry_date   | datetime     |
| user_id       | int          |

User

| Field         | Datatype    |
| ------------- | ----------- |
| user_id (PK)  | int         |
| name          | varchar(20) |
| email_id      | varchar(32) |
| creation_date | datetime    |
| last_login    | datetime    |

A NoSQL DB is more suitable in this case due to the absence of relationships.

Basic design and algorithm

Compute the hash (say MD5) of the given URL and then encode the digest.
Using base64 (A-Z, a-z, 0-9, +, /), a 6-character key gives 64^6 ≈ 68.7B possibilities.

However, hashing the same URL always results in the same short URL, so duplicate entries collide.
To prevent duplicates, we can append a random character or the unique user_id as a salt before hashing.
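The scheme above can be sketched as follows, assuming MD5 and the URL-safe base64 variant (which substitutes "-" and "_" for "+" and "/", so the key is safe inside a URL):

```python
# Hash-then-encode with an optional salt; assumes MD5 and URL-safe
# base64 so the resulting key contains no "+" or "/" characters.
import base64
import hashlib

def shorten(url: str, salt: str = "", length: int = 6) -> str:
    digest = hashlib.md5((salt + url).encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest)[:length].decode("ascii")
```

The same URL with the same salt always yields the same key; salting with a user_id gives different users different short URLs for the same long URL.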

Concurrency problems: two app-servers create a new key at the same time; both check the DB and see that the key is not already in use.
Both try to store that key in the DB; the second insert will fail.

To create unique keys at scale, Twitter's Snowflake can be referenced.

To prevent collisions and concurrency problems:
We can have a standalone key-generation service (KGS) that has a list of unique keys and provides that to app-servers to use.
Keys can be pre-generated and stored in the KGS, which hands them out to app-servers on request, so app-servers never have to hit the DB to check whether a key already exists.

Assuming 1 byte per character, storage required to store all keys: 6 (characters per key) * 68.7B (unique keys) = 412 GB.

When multiple KGS are in use, the keys can be generated centrally by a server and a certain number of keys can be distributed to each KGS in batches.
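A KGS handing out pre-generated keys in batches can be sketched as follows (toy alphabet and key length to keep the demo small; a production KGS would persist its free list and mark handed-out batches as used so a crashed app-server cannot cause key reuse):

```python
# Toy key-generation service: pre-generates every key up front and
# hands them to app-servers in batches under a lock.
import itertools
import string
import threading

class KeyGenerationService:
    def __init__(self, alphabet=string.ascii_letters + string.digits, length=2):
        # Pre-generate all keys of the given length (tiny demo parameters).
        self._free = ["".join(p) for p in itertools.product(alphabet, repeat=length)]
        self._lock = threading.Lock()

    def take_batch(self, n):
        # App-servers grab n keys at once instead of asking per request.
        with self._lock:
            batch, self._free = self._free[:n], self._free[n:]
            return batch
```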

Data Partitioning and Replication

We can use:

  • Range-based partitioning: URLs are stored in a partition chosen by the first letter of the short key. But some letters are far more common than others, so partitions become unbalanced.
  • Hash-based partitioning: URLs are stored in the DB partition based on the hash. Consistent hashing can be used.
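A minimal consistent-hash ring for the hash-based option might look like this (no virtual nodes; real deployments add several ring points per server to smooth the balance):

```python
# Minimal consistent-hash ring: each DB server owns the arc of hash
# space up to its point; a key maps to the first point clockwise.
import bisect
import hashlib

class Ring:
    def __init__(self, nodes):
        self._points = sorted((self._h(n), n) for n in nodes)
        self._keys = [p for p, _ in self._points]

    @staticmethod
    def _h(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, short_key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._keys, self._h(short_key)) % len(self._keys)
        return self._points[i][1]
```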

Caching

We can cache 20% of the URLs, which will need 170 GB.
The LRU cache eviction policy fits this use-case.
On a cache miss, fetch the URL from the DB, return it, and write it into the cache; any update to a cached entry should be propagated to all cache replicas.
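An LRU cache can be sketched with an `OrderedDict`; on a miss the caller fetches from the DB and `put`s the value, evicting the least-recently-used entry when full:

```python
# LRU cache sketch: get() returns None on a miss (caller then goes to
# the DB), and put() evicts the least-recently-used entry when full.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self._cap = capacity
        self._d = OrderedDict()

    def get(self, key):
        if key not in self._d:
            return None              # cache miss
        self._d.move_to_end(key)     # mark as most recently used
        return self._d[key]

    def put(self, key, value):
        self._d[key] = value
        self._d.move_to_end(key)
        if len(self._d) > self._cap:
            self._d.popitem(last=False)  # evict least recently used
```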

Load balancing

We can place load balancers between:

  • Client and app-servers
  • App-servers and DB servers
  • App-servers and cache servers

Distribution would depend on load.

DB cleanup

If a user tries to access an expired link, delete it lazily at that point and return an error.
In addition, a periodic cleanup service can scan for expired links and invalidate them in batches.
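Both strategies can be sketched together (`resolve` and `sweep` are hypothetical names; the store is in-memory and times are Unix timestamps):

```python
# Lazy deletion on access plus a periodic batch sweep; resolve() and
# sweep() are hypothetical names for this sketch.
import time

store = {}  # short_key -> (url, expiry_ts or None)

def resolve(key, now=None):
    now = time.time() if now is None else now
    entry = store.get(key)
    if entry is None:
        return None
    url, expiry = entry
    if expiry is not None and expiry <= now:
        del store[key]           # lazy cleanup on access
        return None
    return url

def sweep(now=None):
    now = time.time() if now is None else now
    expired = [k for k, (_, exp) in store.items() if exp is not None and exp <= now]
    for k in expired:            # batch invalidation
        del store[k]
    return len(expired)
```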