Creates aliases for longer URLs that redirect to the original URL when hit.
Refer: TinyURL, Bit.ly
- Generate short URL
- Short URL should redirect to original URL
- Users should be able to pick a custom URL
- Links will have an expiry time
- Minimal latency in redirection
- System should be highly available
- Autogenerated links must be non-guessable
- Should have exposed REST APIs
Assume:
- 500M shortenings/month
- 100:1 read-write ratio
So,
URL writes/s = 500M / (30 days * 24 hours * 3600 seconds) ≈ 200 writes/s
URL reads/s = 200 * 100 = 20k reads/s //100:1 read-write ratio
Assume:
- 500M shortenings/month
- Each URL is stored for 5 years
- Each URL object is 500 bytes (including related metadata)
So,
Total objects = 500M * 12 months * 5 years = 30B
Total storage = 30B * 500 bytes = 15 TB
For write requests: 200 URLs/s * 500 bytes = 100 KB/s
For read requests: 20k URLs/s * 500 bytes = 10 MB/s
Requests/day = 20K URLs/s * 3600 seconds * 24 hours = 1.7B
Assume only 20% is cached; so memory needed for that is (0.2 * 1.7B * 500 bytes) = 170GB
Summary:
Assuming 500M URLs/month and 100:1 read-write ratio:
Parameter | Value |
---|---|
New URLs | 200/s |
URL redirections | 20k/s |
Incoming data (bandwidth) | 100 KB/s |
Outgoing data (bandwidth) | 10 MB/s |
Storage for 5 years | 15 TB |
Cache memory | 170 GB |
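The numbers in the table can be reproduced with a quick back-of-envelope script (the constants are the assumptions stated above; the exact write rate works out to ~193/s, rounded to 200/s):

```python
# Back-of-envelope estimates for the URL shortener, using the assumptions above.
SHORTENINGS_PER_MONTH = 500_000_000   # 500M new URLs/month
READ_WRITE_RATIO = 100                # 100:1 reads to writes
OBJECT_SIZE_BYTES = 500               # URL object incl. metadata
RETENTION_YEARS = 5
CACHE_FRACTION = 0.2                  # cache the hottest 20% of daily traffic

SECONDS_PER_MONTH = 30 * 24 * 3600

writes_per_sec = SHORTENINGS_PER_MONTH / SECONDS_PER_MONTH   # ~193, round to 200
reads_per_sec = writes_per_sec * READ_WRITE_RATIO            # ~19.3k, round to 20k

total_objects = SHORTENINGS_PER_MONTH * 12 * RETENTION_YEARS        # 30B
total_storage_tb = total_objects * OBJECT_SIZE_BYTES / 1e12         # 15 TB

requests_per_day = reads_per_sec * 24 * 3600                        # ~1.7B
cache_gb = CACHE_FRACTION * requests_per_day * OBJECT_SIZE_BYTES / 1e9   # ~170 GB

print(round(writes_per_sec), round(reads_per_sec), total_objects,
      total_storage_tb, round(cache_gb))
```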
POST createURL
Params:
url
api_dev_key //Optional; users need to sign up to create a large number of URLs
short_url //Optional
expiry_date //Optional
DELETE deleteURL
Params:
api_dev_key
short_url
Observations:
- Billions of objects
- Each object is < 1 KB
- No relationships between objects
- Read-heavy
Each object has 2 parts - URL details, User details.
Let tables be:
URL
Field | Datatype |
---|---|
hash (PK) | varChar(16) |
url | varChar(512) |
creation_date | datetime |
expiry_date | datetime |
user_id | int |
User
Field | Datatype |
---|---|
name | varChar(20) |
email_id | varChar(32) |
creation_date | datetime |
last_login | datetime |
A NoSQL DB is more suitable in this case due to the absence of relationships.
Compute the hash (say MD5) of the given URL and then encode it.
Using base64 (A-Z a-z 0-9 /+), 6 chars can have 64^6 = 68.7B possibilities.
However, entering the same URL will result in the same short URL.
To prevent duplicates, we can append a random character or the unique user_id as a salt before hashing.
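A minimal sketch of this scheme (MD5 of the user-salted URL, base64-encoded, truncated to 6 characters; `urlsafe_b64encode` is used here so the key avoids '/' and '+', which are awkward in URLs):

```python
import base64
import hashlib

def shorten(url: str, user_id: int) -> str:
    """Salt the URL with user_id so the same URL from different users yields
    different keys, MD5-hash it, base64-encode, and keep the first 6 chars."""
    salted = f"{user_id}:{url}".encode()
    digest = hashlib.md5(salted).digest()        # 16-byte digest
    return base64.urlsafe_b64encode(digest)[:6].decode()
```

Note that the mapping is deterministic: the same (url, user_id) pair always produces the same key, so duplicate submissions by one user are idempotent, while truncating to 6 characters still leaves a small collision risk that the DB must catch on insert.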
Concurrency problems: Two app-servers create a new key at the same time; both of them check the DB and see that the key is not already in use.
Both then try to store the same key in the DB; with a unique-key constraint, the second insert will fail and must be retried.
To create unique keys at scale, Twitter's snowflake can be referenced.
To prevent collisions and concurrency problems:
We can have a standalone key-generation service (KGS) that has a list of unique keys and provides that to app-servers to use.
Keys can be pre-generated and stored in the KGS, which serves this to app-servers that request it instead of hitting the DB to check if keys already exist.
Assuming 1 byte per character, storage required to store all keys: 6 (characters per key) * 68.7B (unique keys) = 412 GB.
When multiple KGS are in use, the keys can be generated centrally by a server and a certain number of keys can be distributed to each KGS in batches.
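The KGS idea can be sketched as follows (class and method names are assumptions; a real KGS would persist its cursor and mark handed-out batches as used so keys survive a crash):

```python
import itertools
import string
import threading

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # base-62

class KeyGenerationService:
    """Sketch of a KGS: lazily enumerates unique 6-char keys and hands them
    to app-servers in batches. The lock ensures no two servers ever receive
    the same key, which removes the DB-level concurrency check entirely."""

    def __init__(self, key_length=6):
        self._keys = itertools.product(ALPHABET, repeat=key_length)
        self._lock = threading.Lock()

    def get_batch(self, n=1000):
        with self._lock:
            return ["".join(k) for k in itertools.islice(self._keys, n)]
```

An app-server fetches a batch once, then serves create requests from local memory until the batch runs dry.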
We can use:
- Range-based partitioning: URLs are stored in a DB partition based on the first letter of the short-URL key. But since some letters occur far more frequently than others, this leads to unbalanced partitions.
- Hash-based partitioning: URLs are stored in the DB partition based on the hash. Consistent hashing can be used.
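A minimal consistent-hash ring for the second option might look like this (shard names and the vnode count are illustrative; virtual nodes spread keys evenly and limit data movement when shards are added or removed):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of hash-based partitioning with consistent hashing: each DB
    shard is placed on the ring at many virtual points, and a key maps to
    the first shard clockwise from its hash."""

    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, short_url_key):
        # Wrap around to the start of the ring past the last point.
        idx = bisect.bisect(self._points, self._hash(short_url_key)) % len(self._ring)
        return self._ring[idx][1]
```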
We can cache 20% of the URLs, which will need 170GB.
The LRU cache eviction policy will fit this use-case.
When there's a cache miss, the app-server fetches the entry from the DB, serves it, and updates the cache; the new entry should also be replicated across all cache servers.
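An LRU cache over the hot short-URL mappings can be sketched with an ordered dict (a single-node stand-in for a distributed cache like Memcached or Redis):

```python
from collections import OrderedDict

class LRUCache:
    """Sketch of an LRU cache for short-URL -> original-URL mappings."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None              # cache miss: caller fetches from DB, then put()
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry
```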
We can place load balancers between:
- Client and app-servers
- App-servers and DB servers
- App-servers and cache servers
Distribution would depend on load.
If a user tries to access an expired link, delete it lazily and return an error (e.g., HTTP 410 Gone).
Additionally, run a periodic cleanup service that scans the DB and removes expired links in batches.
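Both expiry paths can be sketched together (the dict store and function names are illustrative; `redirect` does the lazy per-access cleanup, `purge_expired` is the batch job):

```python
from datetime import datetime, timedelta

# Hypothetical store: short key -> (original_url, expiry datetime)

def redirect(store, key, now=None):
    """Lazy expiry: on access, delete an expired link and signal 'gone'."""
    now = now or datetime.utcnow()
    entry = store.get(key)
    if entry is None:
        return None
    url, expiry = entry
    if expiry <= now:
        del store[key]               # clean up lazily on access
        return None
    return url

def purge_expired(store, now=None):
    """Periodic batch job: sweep the store and remove all expired links."""
    now = now or datetime.utcnow()
    expired = [k for k, (_, exp) in store.items() if exp <= now]
    for k in expired:
        del store[k]
    return len(expired)
```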