- Return data about the last 100 public GitHub repositories. Allows you to limit the results based on different parameters (language, license, etc.)
- Particular attention must be paid to performance and to scaling the project. The project must respond as quickly as possible and be usable by as many simultaneous users as possible.
- Any provision put in place to make the project maintainable over time is a plus.
- Process the results in parallel to retrieve information concerning the programming languages they use.
- Filter data based on user search
- Aggregate data and view statistics from Git repositories with the language selected.
Example response:

```
{
  "repositories": [
    {
      "full_name": "FreeCodeCamp/FreeCodeCamp",
      "owner": "FreeCodeCamp/FreeCodeCamp",
      "repository": "FreeCodeCamp",
      "languages": {
        "javascript": {
          "bytes": 123456
        }
      }
    },
    "// ..."
  ],
  "//": "..."
}
```

The application is containerised and can be run using docker-compose. It is configured using environment variables.
Using docker-compose:

```
docker compose up
```

Run using docker compose with mocks and defaults:

```
USE_FETCHER=mock \
MOCK_FETCHER_AVG_REQUEST_SECONDS=2.5 \
MOCK_RATE_LIMIT=20 \
MOCK_RATE_LIMIT_WINDOW_SECONDS=60 \
FETCH_TIMEOUT_SECONDS=4 \
SLEEPOVER_DURATION_SECONDS=4 \
docker compose up
```

The API server is configured with the following environment variables:

| Variable | Default | Description |
|---|---|---|
| PORT | 5000 | The port the API server listens on |
| REDIS_HOSTPORT | redis:6379 | The HostPort of the redis database |
| REDIS_PREFIX |  | The prefix for the redis database |
| REQUEST_MEMCACHE_MAX_AGE_SECONDS | 10 | The TTL for the internal requests memory cache |
The worker is configured with the following environment variables:

| Variable | Default | Description |
|---|---|---|
| REDIS_HOSTPORT | redis:6379 | The HostPort of the redis database |
| REDIS_PREFIX |  | The prefix for the redis database |
| USE_FETCHER | mock | Which fetcher to use: `mock` uses static data from the ./worker/interfaces/fetcher/mock/data directory; `live` uses the Github API |
| MOCK_RATE_LIMIT | 20 | The mock fetcher simulates an API rate limiter. This is the number of requests allowed per MOCK_RATE_LIMIT_WINDOW_SECONDS |
| MOCK_RATE_LIMIT_WINDOW_SECONDS | 60 | The window for the rate limit (Max MOCK_RATE_LIMIT requests per window) |
| MOCK_FETCHER_AVG_REQUEST_SECONDS | 0.5 | The average time for an API request to return in the mock fetcher (max time=avg x2) |
| FETCH_TIMEOUT_SECONDS | 0.98 | The timeout for fetching data from the fetcher |
| SLEEPOVER_DURATION_SECONDS | 5 | When the API rate limit is exceeded, the workers will sleep until the reset time plus this duration |
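As a rough illustration of how this configuration might be read at startup, here is a minimal Go sketch; the helper names (`envOrDefault`, `envSeconds`) are assumptions for this example, not the project's actual code:

```go
package main

import (
    "log"
    "os"
    "strconv"
    "time"
)

// envOrDefault returns the value of the environment variable, or a fallback.
func envOrDefault(key, fallback string) string {
    if v, ok := os.LookupEnv(key); ok && v != "" {
        return v
    }
    return fallback
}

// envSeconds parses a variable such as FETCH_TIMEOUT_SECONDS into a time.Duration.
func envSeconds(key string, fallback float64) time.Duration {
    v := envOrDefault(key, "")
    if v == "" {
        return time.Duration(fallback * float64(time.Second))
    }
    f, err := strconv.ParseFloat(v, 64)
    if err != nil {
        log.Fatalf("invalid %s: %v", key, err)
    }
    return time.Duration(f * float64(time.Second))
}

func main() {
    redisHostPort := envOrDefault("REDIS_HOSTPORT", "redis:6379")
    useFetcher := envOrDefault("USE_FETCHER", "mock")
    fetchTimeout := envSeconds("FETCH_TIMEOUT_SECONDS", 0.98)
    log.Printf("redis=%s fetcher=%s timeout=%s", redisHostPort, useFetcher, fetchTimeout)
}
```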
```
curl 'localhost:5000/ping'
curl 'localhost:5000/repos'
curl 'localhost:5000/repos?language=python&license=apache&has_open_issues=true'
curl 'localhost:5000/stats'
```

```
# REPOS
# Call every second (useful for seeing the effect of the in-memory cache)
while true ; do curl -s 'localhost:5000/repos' > /dev/null ; sleep 1 ; done

# STATS
# Call every second (useful for seeing the effect of the in-memory cache)
while true ; do curl -s 'localhost:5000/stats' > /dev/null ; sleep 1 ; done
```

I have written some tests for this code base, but they are not exhaustive. I have:
- written tests for the mock and live implementations of the fetcher.
- written tests for the memory and redis implementations of the database.
These tests helped with development of the parallel processing logic in the worker without hitting the Github API.
The memory implementation of the database could be useful for testing the API server without needing a redis instance;
however, due to time constraints, the memory implementation of the db interface is not complete.
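The live Github API tests are the ones skipped by `-short` (see the commands below). A test of the live fetcher would typically guard itself with `testing.Short()`; the test name in this sketch is hypothetical:

```go
package fetcher_test

import "testing"

func TestLiveFetcherLatest100(t *testing.T) {
    // Skip the live Github API call when running `go test -short ./...`.
    if testing.Short() {
        t.Skip("skipping live Github API test in short mode")
    }
    // ... call the live fetcher and assert on the response (omitted).
}
```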
```
# Run all tests (including Github API tests)
# This will use some of your rate limit
go test ./...
```

Skip: use `-short` to skip the Github API tests.
```
# Run all tests (short skips Github API tests)
go test -short ./...
```

- The documentation below includes diagrams created using Mermaid (https://mermaid.js.org/).
- Github supports mermaid when rendering markdown.
To view the diagrams in your IDE, please install one of the following plugins:
- Vscode plugin: https://marketplace.visualstudio.com/items?itemName=bierner.markdown-mermaid
- Jetbrains plugin: https://plugins.jetbrains.com/plugin/20146-mermaid
- Other Options: https://mermaid.js.org/ecosystem/integrations-community.html#productivity-tools
- Query the Github API and return the last 100 public GitHub repositories
- Process the results in parallel to retrieve information concerning the programming languages they use
- Allow filtering of the results based on different parameters
- Aggregate data and view statistics with the language selected
- Must be written in Go
- Particular attention must be paid to performance and to scaling the project. The project must respond as quickly as possible and be usable by as many simultaneous users as possible.
- Any provision put in place to make the project maintainable over time is a plus.
- High quality of code and documentation
I will limit the scope of this exercise to the following:
- Use the Github unauthenticated API, which has a rate limit of 60 requests per hour.
- Allow filtering of the results based on different parameters
  - I will allow filtering on the results returned from Github.
  - I will not filter the results using the call to the Github API, as this would breach the rate limit and be slow.
- I will not implement the following as they are not required in this exercise, however I will include some of them in my proposed design:
  - Load Balancing
  - Rate Limiting
  - CDN Cache Headers in responses
  - Authentication
  - Tracing
  - Profiling
  - Service registration (e.g. ZooKeeper)
"as many simultaneous users as possible"... Let's go with 1 million requests per minute!
- Requests per minute (RPM)
  - 1 Million RPM
  - ~17k requests per second (RPS)
- Request size:
  - 10 filter parameters @ 8 bytes each, so approx 100 bytes per request
- Data transferred:
  - 1.7 MB/s request data
- Response size:
  - Assumptions:
    - 100 repositories @ 400 bytes each = 40 KB per request
    - Aggregated statistics: 100 languages @ 32 bytes each = 3.2 KB
  - Total response size approx 50 KB
- Data transferred:
  - 850 MB/s response data (uncompressed)
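For reference, these back-of-envelope figures can be reproduced with a few lines of Go using the same assumed sizes (the list above rounds the results up to ~17k RPS and ~850 MB/s):

```go
package main

import "fmt"

func main() {
    const (
        rpm          = 1_000_000 // requests per minute
        requestBytes = 100       // assumed request size in bytes
        responseKB   = 50        // assumed response size in KB
    )
    rps := rpm / 60 // ~16,666 requests per second
    fmt.Printf("RPS: ~%d\n", rps)
    fmt.Printf("Request bandwidth: ~%.1f MB/s\n", float64(rps*requestBytes)/1e6)
    fmt.Printf("Response bandwidth: ~%d MB/s\n", rps*responseKB/1000)
}
```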
When deciding on the system design we will consider the following objectives:
- Large number of requests per second... (17K RPS)
  - Horizontally scalable
  - Stateless
  - Load balancing
- Minimise bandwidth used (requests and responses) as it is costly
  - Cache request/responses
    - Local cache to reduce load on the shared cache
  - Rate limiting
    - by IP address only (no authenticated users)
    - to reduce the number of requests from malicious or naive clients
  - Efficient design of data in API
  - Appropriate cache headers in http response
- Minimise the compute power used (also costly)
  - Cache
  - Language choice (go)
  - Code efficiency
A good approach to this problem is to treat the system as a set of services. Each service has a specific responsibility and can be scaled and monitored independently.
In this application there are two main services and some supporting services:
The worker is an essential component of the system. It is responsible for periodically fetching the latest data from the Github API, processing the data, and storing it in the database. It also processes the language data in parallel. It gracefully manages the rate limits of the Github API.
- Periodically query the Github API to get the latest data
- Process the data
- Store the data
The API server is responsible for processing the requests from the client. It retrieves the data from the database and returns it to the client. It also processes the requests for the aggregated statistics.
- Respond to API requests
- Load balancer [Not implemented]
  - Distribute requests to the service
- Rate Limiting in Load Balancer [Not implemented]
  - The rate limiter can be horizontally scalable.
  - Internal to the load balancer for low latency, and highly scalable as part of cloud provider services.
  - Periodically caches data to a shared cache.
- Database Service
  - Store the results of the Github API calls
  - Provide aggregation logic
  - Optionally store pre-aggregated data [pre-aggregated data not implemented]
  - Options:
    - Redis (we'll go with Redis, as it fits the task well)
      - pros: distributed and scalable, fast, recoverable
      - cons: external dependency, data limited by memory available
    - Relational DB: Postgres, MySQL
      - pros: distributed and scalable, can handle large amounts of data
      - cons: external dependency, slower than Redis
- Caching API Requests/Responses
  - Option 1: Local in-memory cache in each API Server instance (basic implementation as an example for this exercise)
  - Option 2: Cache service in the local network (such as Varnish)
  - Option 3: CDN / Content Delivery Network (external cache)
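Option 1 is the cache controlled by REQUEST_MEMCACHE_MAX_AGE_SECONDS. Below is a minimal sketch of such a TTL cache; the type and method names are illustrative, not the project's actual implementation:

```go
package memcache

import (
    "sync"
    "time"
)

type entry struct {
    value   []byte
    expires time.Time
}

// Cache is a tiny TTL cache guarded by an RWMutex, suitable for caching
// rendered /repos and /stats responses inside a single API server instance.
type Cache struct {
    mu      sync.RWMutex
    ttl     time.Duration
    entries map[string]entry
}

func New(ttl time.Duration) *Cache {
    return &Cache{ttl: ttl, entries: make(map[string]entry)}
}

// Get returns a cached value if it exists and has not expired.
func (c *Cache) Get(key string) ([]byte, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    e, ok := c.entries[key]
    if !ok || time.Now().After(e.expires) {
        return nil, false
    }
    return e.value, true
}

// Set stores a value with the cache's TTL.
func (c *Cache) Set(key string, value []byte) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.entries[key] = entry{value: value, expires: time.Now().Add(c.ttl)}
}
```

A real implementation would also evict expired entries; in this sketch they are simply overwritten on the next Set.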
```mermaid
graph TD
C(Client) -->|Request| LB[Load Balancer]
subgraph "Load Balancing"
LB[Load Balancer]
RATE[[Rate Limiter in LB]]
LB <-->|Every request, low latency| RATE
end
LB <---->|Request| API1
subgraph APIServers[API Servers]
API1[Server 1]
API2[Server 2]
API3[Server ...]
end
subgraph Github Worker
Worker[Worker Process]
end
API1 ---|Get Data/Stats| REDIS[("Database [Redis]")]
API2 --- REDIS
Worker -->|Cache Github Data/Stats| REDIS
Worker -->|Get| GITHUB(((Github API)))
GITHUB -->|Data| Worker
```
A request to the Github API is rate limited, slow and expensive. Here is the output from a curl request to Github:
```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  668k    0  668k    0     0   264k      0 --:--:--  0:00:02 --:--:--  264k
```
It takes 2-3 seconds to download 668k of data.
With thousands of requests per second, we don't want to make thousands of requests to the Github API. Therefore, we cache the Github data, and update it periodically respecting the rate limits.
The worker is responsible for keeping the local copy of the Github data up to date.
The API servers are responsible for processing the /repos and /stats requests (17k RPS).
If the API Servers were also responsible for retrieving the Github data, we would need to manage locking or queuing, which could cause contention and bugs. By separating the worker from the API servers, we can avoid this problem.
- Write once, read many times
- Single responsibility
The API Server and the Worker follow a simplified version of the clean architecture pattern. https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-architecture.html
In the clean architecture, the application is divided into layers:
- Entities
- Use Cases
- Interface Adapters
- Frameworks and Drivers
- External Interfaces
Within our applications, we combine 3 of the layers into a single layer for simplicity:
- Interfaces (`/interfaces`), which contains all three:
  - External Interfaces
  - Frameworks and Drivers
  - Interface Adapters
- Use Cases (`/usecases`)
- Entities (`/entities`)
Entities: This layer contains internal data structures and "high level policies" used by the application.
Usecases: This layer implements the application-specific logic. In the case of the worker, we coordinate the fetching of data from the Github API, the processing and the storage of the data. In the case of the API server, we coordinate the processing of requests and responses.
Interfaces: This layer contains the external interfaces, the frameworks and drivers, and the interface adapters. The external interfaces are the interfaces to the external services (Github API, Redis). The frameworks and drivers are the implementations of the external interfaces. The interface adapters are the implementations of the interfaces to the usecases.
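To illustrate the dependency direction, a usecase depends only on small interfaces, while the concrete fetcher and database implementations live in /interfaces. The types below are a simplified, hypothetical example (in the real layout the Repository entity would live in /entities):

```go
package usecases

import "context"

// Repository is an entity-level view of a Github repository (simplified).
type Repository struct {
    FullName  string
    Languages map[string]int64
}

// Fetcher is implemented in /interfaces (e.g. by the mock and live fetchers).
type Fetcher interface {
    Latest100(ctx context.Context) ([]Repository, error)
    Languages(ctx context.Context, fullName string) (map[string]int64, error)
}

// Database is implemented in /interfaces (e.g. by the redis and memory databases).
type Database interface {
    StoreRepositories(ctx context.Context, repos []Repository) error
}

// Refresh is a usecase: it only knows about the interfaces, never the concrete
// Github client or redis driver, so either side can be swapped in tests.
func Refresh(ctx context.Context, f Fetcher, db Database) error {
    repos, err := f.Latest100(ctx)
    if err != nil {
        return err
    }
    return db.StoreRepositories(ctx, repos)
}
```

Because the usecase only sees the interfaces, the mock fetcher and the in-memory database can be swapped in for tests.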
There are two main applications: the API Server and the Worker. They each need to be built and deployed independently, however they share some common code.
- `/apiServer`: contains the code for the API Server
- `/worker`: contains the code for the Worker
- `/common`: contains the common code shared between the API Server and the Worker
Both the worker and the apiServer applications have the same directory structure. The common code
follows a similar convention.
The following sequence diagram shows the process of the worker.
- Get the latest 100 public repositories from the Github API
- Process the data
- Store the data in the database
- Create a queue of language requests
- Process the language requests in parallel
- Store the language data in the database
- Repeat
```mermaid
sequenceDiagram
participant DB
participant W as Worker
participant G as Github
loop repeat when langRequest queue is empty
note over W: Get Latest 100
opt Rate Limited
W --> W: Sleep
end
W ->> G: Request Latest 100
G -->> W: response
W ->> W: Process Data
W ->> DB: Store Response
W ->> W: Create Queue of Language Requests
note over W: Parallel
par Process the language requests in parallel
opt if langReq Queue is empty
note over W: Break
end
W ->> W: Get LangReq from Queue
opt Rate Limited
W --> W: Sleep
end
W ->> G: Request Languages
G -->> W: response
W ->> W: Process Lang Data
W ->> DB: Store Lang Data
end
note over W: Back to top of loop
end
```
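A rough sketch of the "process the language requests in parallel" step, using a bounded pool of goroutines draining a queue; the function signature is an assumption, and the real worker also handles rate limiting and sleeping:

```go
package worker

import (
    "context"
    "sync"
)

// fetchLanguagesFunc stands in for the fetcher call that returns the
// language byte counts for one repository (assumed signature).
type fetchLanguagesFunc func(ctx context.Context, fullName string) (map[string]int64, error)

// processLanguagesInParallel drains a queue of repository names with a fixed
// number of workers and returns the language data per repository.
func processLanguagesInParallel(ctx context.Context, repoNames []string, workers int, fetch fetchLanguagesFunc) map[string]map[string]int64 {
    queue := make(chan string)
    results := make(map[string]map[string]int64)

    var mu sync.Mutex
    var wg sync.WaitGroup

    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for name := range queue {
                langs, err := fetch(ctx, name)
                if err != nil {
                    continue // the real worker would log, back off on rate limits, etc.
                }
                mu.Lock()
                results[name] = langs
                mu.Unlock()
            }
        }()
    }

    for _, name := range repoNames {
        queue <- name
    }
    close(queue)
    wg.Wait()

    return results
}
```

Bounding the number of workers keeps the number of in-flight Github requests small, which makes it easier to respect the rate limit.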
The API server is responsible for processing the requests from the client and allows some filtering.
The /repos endpoint can be filtered with the following query parameters:
* name
* language
* license
* allow_forking
* has_open_issues
* has_projects [not implemented]
* has_downloads [not implemented]
* has_wiki [not implemented]
* has_pages [not implemented]
* has_discussions [not implemented]
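As a sketch of how these parameters could be applied to the cached repository list (illustrative only; the field names are assumptions and only a subset of the parameters is shown):

```go
package apiserver

import (
    "net/url"
    "strings"
)

// Repo is a trimmed-down view of a cached repository used for filtering.
type Repo struct {
    Name          string
    Language      string
    License       string
    AllowForking  bool
    HasOpenIssues bool
}

// filterRepos applies the supported query parameters to the cached list.
func filterRepos(repos []Repo, q url.Values) []Repo {
    out := make([]Repo, 0, len(repos))
    for _, r := range repos {
        if v := q.Get("language"); v != "" && !strings.EqualFold(r.Language, v) {
            continue
        }
        if v := q.Get("license"); v != "" && !strings.EqualFold(r.License, v) {
            continue
        }
        if v := q.Get("has_open_issues"); v != "" && (v == "true") != r.HasOpenIssues {
            continue
        }
        out = append(out, r)
    }
    return out
}
```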
```
curl 'localhost:5000/repos?language=go&license=apache&has_open_issues=false'
```

The /stats endpoint returns the aggregated statistics for the repositories.
The data is aggregated by:
- language to average number of forks per repository
- language to average number of open issues per repository
- language to average size of the repository
- language to number of repositories
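A sketch of the per-language aggregation described above (averages of forks, open issues and size, plus a repository count); the struct and field names are illustrative, not the project's actual types:

```go
package apiserver

// RepoStatsInput holds the per-repository fields needed for the aggregation.
type RepoStatsInput struct {
    Language   string
    Forks      int
    OpenIssues int
    Size       int
}

// LanguageStats is the aggregated view returned by /stats for one language.
type LanguageStats struct {
    Repositories  int
    AvgForks      float64
    AvgOpenIssues float64
    AvgSize       float64
}

// aggregateByLanguage groups repositories by language and computes the averages.
func aggregateByLanguage(repos []RepoStatsInput) map[string]LanguageStats {
    totals := map[string]*LanguageStats{}
    for _, r := range repos {
        s, ok := totals[r.Language]
        if !ok {
            s = &LanguageStats{}
            totals[r.Language] = s
        }
        s.Repositories++
        s.AvgForks += float64(r.Forks)
        s.AvgOpenIssues += float64(r.OpenIssues)
        s.AvgSize += float64(r.Size)
    }
    out := make(map[string]LanguageStats, len(totals))
    for lang, s := range totals {
        n := float64(s.Repositories)
        out[lang] = LanguageStats{
            Repositories:  s.Repositories,
            AvgForks:      s.AvgForks / n,
            AvgOpenIssues: s.AvgOpenIssues / n,
            AvgSize:       s.AvgSize / n,
        }
    }
    return out
}
```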
The following is a CURL command to get the last 100 public repositories:

```
curl -H 'Accept: application/vnd.github.json' \
  -o /tmp/list.json \
  'https://api.github.com/search/repositories?q=type:public&per_page=100&sort=created&order=desc'
```

The following is a CURL command to get the languages for a specific repository:
```
curl -v -H 'Accept: application/vnd.github.json' \
  -o /tmp/languages.json \
  'https://api.github.com/repos/pandas-dev/pandas-stubs/languages'
```

To try to get all the data at once (which will exceed the API rate limit), we can use the following command:
```
dir=./worker/interfaces/fetcher/mock/data
for i in `grep languages_url $dir/list.json | awk '{print $2}' | sed 's/[",]//g'` ; do \
  echo $i ; \
  file=$dir/`echo $i | sed 's,/,_,g'`.json ; \
  if [ -f $file ] ; then \
    echo "Skipping $i" ; \
    continue ; \
  fi ; \
  curl -H 'Accept: application/vnd.github.json' \
    $i -s -o $file ; \
done
```

- From this canvas, respond to the project which has been communicated to you by our team
- Feel free to change everything
```
docker compose up
```

The application will then be running on port 5000:

```
$ curl localhost:5000/ping
{ "status": "pong" }
```