Commit 0bbaa33

Merge pull request #8 from grey-box/sp25ccsu-api
Cumulative end-of-semester API/Middleware
2 parents 25b65ee + f6b37e0 commit 0bbaa33

13 files changed

+688
-97
lines changed

.github/workflows/main.yml

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        cd api
+        cd fastapi/
         pip install -r requirements.txt
         python -m nltk.downloader punkt

extras/uvicornrun.png

36.9 KB

fastapi/.editorconfig

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+root = true
+
+[*]
+indent_style = space
+indent_size = 4
+tab_width = 4
+end_of_line = lf
+charset = utf-8
+trim_trailing_whitespace = true
+insert_final_newline = true
+
+[*.md]
+trim_trailing_whitespace = false

fastapi/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ __pycache__/
 .idea/
 .venv/
 venv/
+.env

fastapi/README.MD

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
+# BACKEND GUIDE
+
+## I. **RUNNING APP**
+
+Refer to [INSTALLATION.md](../INSTALLATION.md) for more in-depth configuration details
+and dependency installation. This section is purely for running the backend.
+
+1) Ensure that your current working directory is `fastapi` (i.e. run `cd fastapi`)
+2) If you have not already, set up your virtual environment (`python3 -m venv venv`)
+3) Switch to the virtual environment (`source venv/bin/activate` on Linux/macOS or `venv\Scripts\activate.bat` on Windows)
+4) If you have not already, install dependencies (`pip install -r requirements.txt`)
+5) Consider enabling debug logging:
+    1) Create `.env` inside the `fastapi` folder
+    2) Add properties to enable FastAPI and general debug logging:
+    ```properties
+    LOG_LEVEL=DEBUG
+    FASTAPI_DEBUG=True
+    ```
+6) Finally, run the app with Uvicorn (`uvicorn app.main:app --reload`)
+    * The app runs on port 8000 by default. To use another port, add `--port <number>` to the command.
+
+You should see logs similar to the following:
+![result](../extras/uvicornrun.png)
+
+To test the endpoints, use a tool like Postman or curl to make GET or POST requests and receive formatted JSON responses.
+
+## II. **FILE STRUCTURE**
+
+`app` is the root package of the Python project.
+
+`app/ai` contains the backend ML functionality.
+
+`app/api` contains the API endpoints exposed by the web server.
+
+`app/model` contains request and response structures for communication between the front and back end.
+
+## III. **CURRENT FUNCTIONALITY**
+
+### `/symmetry/v1/wiki/articles`
+
+**Method:** `GET`
+
+**Parameters:**
+* `query`: A Wikipedia article URL or title
+* `lang`: Optional. A language short code (e.g. "en" or "fr") defaulting to "en" for English.
+
+**Response:**
+
+```json
+{
+    "sourceArticle": "<article content>",
+    "articleLanguages": [
+        // Article language data
+    ]
+}
+```
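As a minimal sketch of calling this endpoint from Python instead of Postman or curl, the helper below builds the request URL with the standard library. It assumes the server is running locally on Uvicorn's default host and port; `BASE_URL`, `build_articles_url`, and `fetch_article` are illustrative names and not part of the codebase.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the backend runs locally on Uvicorn's default host and port.
BASE_URL = "http://127.0.0.1:8000"


def build_articles_url(query: str, lang: str = "en") -> str:
    """Build the GET URL for the wiki articles endpoint."""
    # urlencode percent-encodes the values, so URLs and titles with spaces are safe.
    params = urlencode({"query": query, "lang": lang})
    return f"{BASE_URL}/symmetry/v1/wiki/articles?{params}"


def fetch_article(query: str, lang: str = "en") -> dict:
    """Call the endpoint and decode the JSON body (needs a running server)."""
    with urlopen(build_articles_url(query, lang)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `fetch_article("Barack Obama")` would be expected to return a dictionary with the `sourceArticle` and `articleLanguages` keys described above.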
+
+### `/symmetry/v1/articles/compare`
+
+> [!NOTE]
+> This endpoint is a work in progress and currently only returns dummy data.
+
+**Method:** `POST`
+
+**Parameters:**
+```json
+{
+    "article_text_blob_1": "<article 1 content>",
+    "article_text_blob_2": "<article 2 content>",
+    "article_text_blob_1_language": "<article 1 short language code>",
+    "article_text_blob_2_language": "<article 2 short language code>",
+    // Float from 0 to 1 representing the similarity threshold.
+    "comparison_threshold": 0.8,
+    "model_name": "<model name string>"
+}
+```
+
+**Response:**
+```json
+{
+    "comparisons": [
+        // Array of comparison objects.
+        {
+            "left_article_array": [
+                // Array of article 1 content divided into subsections.
+            ],
+            "right_article_array": [
+                // Array of article 2 content divided into subsections.
+            ],
+            "left_article_missing_info_index": [
+                // Array of indices of article 1 subsections that are not present in 2.
+            ],
+            "right_article_extra_info_index": [
+                // Array of indices of article 2 subsections that are not present in 1.
+            ]
+        }
+    ]
+}
+```
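To illustrate how a client might consume this response, the sketch below maps the two index arrays of a single comparison object back to the subsection text they point at. It assumes the indices are zero-based, which is consistent with the dummy data the endpoint currently returns; `unmatched_subsections` and the output key names are hypothetical.

```python
from typing import Dict, List


def unmatched_subsections(comparison: Dict[str, list]) -> Dict[str, List[str]]:
    """Resolve the index arrays of one comparison object to subsection text.

    Assumption: indices are zero-based positions into the article arrays.
    """
    left = comparison["left_article_array"]
    right = comparison["right_article_array"]
    return {
        "only_in_article_1": [left[i] for i in comparison["left_article_missing_info_index"]],
        "only_in_article_2": [right[i] for i in comparison["right_article_extra_info_index"]],
    }
```

A caller would typically loop over `response["comparisons"]` and apply this helper to each element.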
+
+## IV. **CONCEPTS & RESEARCH**
+
+### Simple UI
+The UI is meant to be as simple as possible: the UI team should focus its
+efforts on improving the user experience, not on covering every possible case.
+That means the API/middleware is responsible for the bulk of the logic related to input
+validation.
+
+For example, the dynamic search box allowing queries by URL or article title is handled
+by the wiki articles endpoint rather than the UI.
+
+### Semantic Comparison Without Translation
+The language comparison model in use does not require that the input texts be in the same language.
+
+### JSON vs XML
+We were tasked with investigating the benefits of using XML over JSON for communications,
+as it might be easier for some ML models to work with.
+
+Our findings were that the primary benefit of using XML for some ML models is that it can
+be used to more easily guard against prompt injection.
+XML is otherwise generally harder to work with, and whether it helps is model-specific.
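One way the prompt-injection point can be pictured: untrusted article text is wrapped in an XML element and its markup characters are escaped, so the text cannot close the element early and pose as instructions. This is only an illustration of the general idea under that assumption, not a description of how Symmetry's models are prompted; `wrap_untrusted` and the tag name are hypothetical.

```python
from xml.sax.saxutils import escape


def wrap_untrusted(tag: str, text: str) -> str:
    """Wrap untrusted text in an XML element, escaping &, < and >."""
    # After escaping, a literal "</article>" inside the text can no longer
    # terminate the surrounding element.
    return f"<{tag}>{escape(text)}</{tag}>"
```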
+
+### CORS
+A restrictive cross-origin resource sharing (CORS) policy helps protect against certain browser-based attacks.
+By allowing only whitelisted origins, we can prevent scripts on other sites from reading API responses.
+
+However, Symmetry currently operates as a local server. CORS will not be important until
+a future state where a standalone Symmetry server exists.
+
+### Endpoint Naming Conventions
+
+A quick read on RESTful resource naming: https://restfulapi.net/resource-naming/
+
+The current format is `/symmetry/v1/<path>/<to>/<resource>`.
+
+For a denser, more in-depth view of the RESTful philosophy, the above article
+links Roy Fielding's dissertation:
+https://ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm#sec_5_2_1_1
+
+## V. **WHAT'S NEXT**
+
+### External Request Rejection
+Because Symmetry runs a local web server, it would be a good idea to ensure that all
+remote requests are rejected.
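A minimal sketch of the check such a rejection could rest on, using only the standard library: decide whether a client address is a loopback address. In a FastAPI handler or middleware the address would typically come from `request.client.host`; the function name `is_local_client` and the choice to treat non-IP values as remote are assumptions made for illustration.

```python
import ipaddress


def is_local_client(host: str) -> bool:
    """Return True if the client address is a loopback address."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname): treat it as remote.
        return False
```

Uvicorn binds to `127.0.0.1` unless `--host` is changed, so a check like this would act as a second line of defence rather than the only one.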
+
+### CORS Configurability
+As stated above, CORS is not particularly necessary for Symmetry in its current state.
+However, in the future, it should likely be configurable to allow other origins.
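One possible shape for that configurability, sketched under assumptions: the allowed origins are read from a comma-separated environment variable, which could live in the same `.env` file used for logging. The variable name `CORS_ALLOWED_ORIGINS` and the helper `allowed_origins` are hypothetical; the commented wiring uses FastAPI's `CORSMiddleware` and is not taken from the current code.

```python
import os
from typing import List, Mapping


def allowed_origins(env: Mapping[str, str] = os.environ) -> List[str]:
    """Parse a comma-separated origin list; an unset variable allows no origins."""
    raw = env.get("CORS_ALLOWED_ORIGINS", "")  # hypothetical variable name
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


# Possible wiring in app/main.py (requires FastAPI, so shown as a comment only):
# from fastapi.middleware.cors import CORSMiddleware
# app.add_middleware(
#     CORSMiddleware,
#     allow_origins=allowed_origins(),
#     allow_methods=["GET", "POST"],
# )
```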
+
+### Caching
+There is a cache implementation in place for articles. However, it is global and
+designed only for use in Wikipedia article fetching. With a little effort, `ArticleCache`
+could be repurposed into a more general and reusable content cache.
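As a rough sketch of what a more general cache might look like, the class below keeps the same LRU-plus-TTL behaviour but stores arbitrary values under arbitrary hashable keys and is instantiated per use rather than globally. `ContentCache` and its `clock` parameter are hypothetical and not part of the codebase; the defaults mirror `CACHE_LIMIT` and `TTL_SECONDS`.

```python
from collections import OrderedDict
from time import monotonic
from typing import Callable, Generic, Hashable, Optional, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class ContentCache(Generic[K, V]):
    """LRU cache with per-entry expiry, independent of what is stored."""

    def __init__(self, max_size: int = 10, ttl: float = 4000.0,
                 clock: Callable[[], float] = monotonic) -> None:
        self._entries: "OrderedDict[K, tuple]" = OrderedDict()
        self._max_size = max_size
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def get(self, key: K) -> Optional[V]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # entry has expired
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: K, value: V) -> None:
        if key in self._entries:
            self._entries.move_to_end(key)
        elif len(self._entries) >= self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used entry
        self._entries[key] = (self._clock(), value)
```

Each consumer, such as the article fetcher, could then own its own instance with its own size and TTL instead of sharing one module-level object.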
+
+### Live Comparison Data
+The [comparison endpoint](#symmetryv1articlescompare) currently returns dummy data.
+When the ML team completes their side, this endpoint will need to be connected to
+return real data.
+

fastapi/app/api/cache.py

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+import hashlib
+import logging
+from time import time
+from typing import Dict, List, Optional, Tuple
+from collections import OrderedDict
+from sys import getsizeof
+
+CACHE_LIMIT = 10  # Max number of cached articles
+TTL_SECONDS = 4000  # Time to live for cached items in seconds
+
+"""
+This module implements an LRU (Least Recently Used) cache for storing articles fetched from Wikipedia.
+Cache keys are MD5 hashes of the lookup key. The above constants are the default constructor arguments; they set
+the time to live and the number of entries that can be stored, which bounds the cache's memory use.
+The current implementation is instantiated and used exclusively in the wiki_article.py file, but can be
+exported and used in other files in future implementations.
+"""
+
+
+# Internal LRU cache manager
+class ArticleCache:
+    # Initializes the cache
+    def __init__(self, max_size: int = CACHE_LIMIT, ttl: int = TTL_SECONDS):
+        self.cache: "OrderedDict[str, Dict]" = OrderedDict()
+        self.max_size = max_size
+        self.ttl = ttl
+        self.current_size = 0  # Approximate memory usage in bytes
+    # Creates cache key
+    def _get_cache_key(self, key: str) -> str:
+        return hashlib.md5(key.encode()).hexdigest()
+
+    def get(self, key: str) -> Tuple[Optional[str], Optional[List[str]]]:
+        cache_key = self._get_cache_key(key)
+        cached_data = self.cache.get(cache_key)
+        # Checks cache for the article, misses if no entry is found
+        if not cached_data:
+            logging.info(f"[CACHE MISS] No cache entry for key: {cache_key}")
+            return None, None
+        # Entry expires and is evicted once it is older than the TTL (default 4000 seconds, approx. 1.1 hours)
+        if time() - cached_data["timestamp"] > self.ttl:
+            self._evict(cache_key, reason="expired")
+            return None, None
+
+        self.cache.move_to_end(cache_key)
+        logging.info(f"[CACHE HIT] Returning cached data for key: {cache_key}")
+        return cached_data["content"], cached_data["languages"]
+    # Stores an article, replacing any existing entry for the same key
+    def set(self, key: str, content: str, languages: List[str]) -> None:
+        cache_key = self._get_cache_key(key)
+        item = {
+            "content": content,
+            "languages": languages,
+            "timestamp": time(),
+        }
+        # Approximate size of the entry (getsizeof is shallow: it measures the dict, not the strings it references)
+        item_size = getsizeof(item)
+
+        # Refreshes an existing key's LRU position, or evicts the least recently used entry if the cache is full
+        if cache_key in self.cache:
+            self.current_size -= getsizeof(self.cache[cache_key])
+            self.cache.move_to_end(cache_key)
+        elif len(self.cache) >= self.max_size:
+            evicted_key, evicted_val = self.cache.popitem(last=False)
+            self.current_size -= getsizeof(evicted_val)
+            logging.info(f"[CACHE EVICTED] LRU item: {evicted_key}")
+
+        self.cache[cache_key] = item
+        self.current_size += item_size
+
+        logging.info(f"[CACHE SET] Key: {cache_key} | Size: {len(self.cache)}/{self.max_size} | Approx. Memory: {self.current_size} bytes")
+    # Removes a specific entry from the cache (used for expired entries)
+    def _evict(self, key: str, reason: str = "manual") -> None:
+        if key in self.cache:
+            self.current_size -= getsizeof(self.cache[key])
+            del self.cache[key]
+            logging.info(f"[CACHE {reason.upper()}] Evicted key: {key}")
+
+# Instantiate global cache object
+_article_cache = ArticleCache()
+
+# For external use
+def get_article_cache_key(key: str) -> str:
+    return _article_cache._get_cache_key(key)
+
+def get_cached_article(title: str) -> Tuple[Optional[str], Optional[List[str]]]:
+    return _article_cache.get(title)
+
+def set_cached_article(key: str, content: str, languages: List[str]) -> None:
+    _article_cache.set(key, content, languages)

fastapi/app/api/comparison.py

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+from app.model.response import CompareResponse, ComparisonResult
+from fastapi import APIRouter
+from app.model.request import CompareRequest
+
+router = APIRouter(prefix="/symmetry/v1", tags=["comparison"])
+
+
+@router.post("/articles/compare", response_model=CompareResponse)
+def compare_articles(payload: CompareRequest):
+    """
+    This endpoint requests a comparison of two blobs of text.
+    The request includes the articles, the languages of the articles, the comparison threshold, and model name.
+
+    The response is an array of comparison results, allowing support for a future state where
+    output may be requested from multiple ML models in a single request.
+
+    The schema for this response is defined in the model/response.py file.
+
+    ------------------------------------------------------------------------------------------------
+    THIS ENDPOINT IS NOT FINALIZED!!!
+
+    This is currently returning dummy data, and still needs to be integrated with the ML model
+    in order to return the actual comparison results from the backend.
+    ------------------------------------------------------------------------------------------------
+    """
+
+    """
+
+    """
+
+    left_article_array = [
+        "Barack Hussein Obama II is an American politician who served as the 44th president of the United States from 2009 to 2017.",
+        "Obama previously served as a U.S. senator representing Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004.",
+        "Obama was born in Honolulu, Hawaii.",
+        "He graduated from Columbia University in 1983 with a Bachelor of Arts degree in political science and later worked as a community organizer in Chicago.",
+        "In 1988, Obama enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review.",
+        "In the 2008 presidential election, after a close primary campaign against Hillary Clinton, he was nominated by the Democratic Party for president.",
+        "Obama selected Joe Biden as his running mate and defeated Republican nominee John McCain and his running mate Sarah Palin.",
+    ]
+
+    right_article_array = [
+        "Barack Hussein Obama II is an American politician who served as the 44th president of the United States from 2009 to 2017.",
+        "A member of the Democratic Party, he was the first African-American president in American history.",
+        "He graduated from Columbia University in 1983 with a Bachelor of Arts degree in political science and later worked as a community organizer in Chicago.",
+        "In 1988, Obama enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review.",
+        "He became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004.",
+        "In 1996, Obama was elected to represent the 13th district in the Illinois Senate, a position he held until 2004, when he successfully ran for the U.S. Senate.",
+        "In the 2008 presidential election, after a close primary campaign against Hillary Clinton, he was nominated by the Democratic Party for president.",
+    ]
+
+    comparison = ComparisonResult(
+        left_article_array=left_article_array,
+        right_article_array=right_article_array,
+        left_article_missing_info_index=[1, 2, 6],  # Dummy indices
+        right_article_extra_info_index=[1, 4, 5],  # Dummy indices
+    )
+
+    return CompareResponse(comparisons=[comparison])
