~50 min
Build an ETL pipeline that fetches data from an external API and loads it into the database.
The database starts empty. The Autochecker API provides anonymized data on task completions. Your job is to build a pipeline that fetches this data and populates your database, so the system can serve it through the existing endpoints as analytics.
- 1. Steps
- 2. Acceptance criteria
1. Steps

1.1. Follow the Git workflow
Follow the Git workflow to complete this task.
- Create a GitHub issue titled: `[Task] Build the Data Pipeline`.

- To create a branch for the task,

  ```bash
  git checkout main
  git pull origin main
  git checkout -b task/1-build-data-pipeline
  ```

  We named the branch `task/1-build-data-pipeline` because:

  - The issue number (`1`) ties the branch to the task issue directly.
  - The short title (`build-data-pipeline`) makes the branch purpose clear in PR lists and Git history.
  - The pattern reduces naming collisions across the team.
Before writing code, let's explore the autochecker API.

The API uses HTTP Basic Auth; we'll use `curl` to send requests.
- To fetch the lab/task catalog,

  ```bash
  curl \
    -u <your-email>@innopolis.university:<github-username><telegram-alias> \
    "https://auche.namaz.live/api/items"
  ```

  Replace `<your-email>` and `<github-username><telegram-alias>` with the credentials you entered in the autochecker bot.

  You should see a JSON array of labs and tasks from this course:

  ```json
  [
    {"lab": "lab-01", "task": null, "title": "Lab 01 – ...", "type": "lab"},
    {"lab": "lab-01", "task": "setup", "title": "Repository Setup", "type": "task"},
    ...
  ]
  ```

  > [!NOTE]
  > If your terminal shows the JSON on one long line, you can format the output with an online JSON viewer.
- To fetch the first 5 check logs,

  ```bash
  curl \
    -u <your-email>@innopolis.university:<github-username><telegram-alias> \
    "https://auche.namaz.live/api/logs?limit=5"
  ```

  You should see a JSON object with a `logs` array:

  ```json
  {
    "logs": [
      {
        "id": 1,
        "student_id": "a1b2c3d4",
        "group": "B23-CS-01",
        "lab": "lab-01",
        "task": "setup",
        "score": 100.0,
        "passed": 4,
        "failed": 0,
        "total": 4,
        "checks": [...],
        "submitted_at": "2026-02-01T14:30:00Z"
      }
    ],
    "count": 5,
    "has_more": true
  }
  ```

  > [!NOTE]
  >
  > - `student_id` is an anonymized identifier (not a real student ID).
  > - `has_more: true` means there are more records; you need to paginate.
  > - `score` is a percentage (0.0–100.0).
  > - `passed`, `failed`, and `total` are the number of individual checks.
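  To preview what that pagination looks like in code, here is a minimal sketch of a `has_more` loop. It assumes an `httpx.AsyncClient`; the `offset` query parameter is a placeholder assumption, so use the pagination parameters described in the TODOs of `backend/app/etl.py`.

  ```python
  # Sketch of a has_more pagination loop; `offset` is an assumed parameter.
  import httpx

  async def fetch_all_logs(client: httpx.AsyncClient, limit: int = 100) -> list[dict]:
      logs: list[dict] = []
      offset = 0
      while True:
          resp = await client.get("/api/logs", params={"limit": limit, "offset": offset})
          resp.raise_for_status()
          payload = resp.json()
          logs.extend(payload["logs"])
          if not payload["has_more"]:  # nothing left to fetch
              break
          offset += limit  # advance past the records already fetched
      return logs
  ```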
- To fetch only recent logs,

  ```bash
  curl \
    -u <your-email>@innopolis.university:<github-username><telegram-alias> \
    "https://auche.namaz.live/api/logs?since=2026-03-01T00:00:00Z&limit=5"
  ```

  You should see only logs submitted after March 1, 2026.

  > [!NOTE]
  > The `since` parameter enables incremental sync: you can fetch just the new data each time.
  > Your pipeline will use the most recent `submitted_at` from the database as the `since` value.
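  A minimal sketch of deriving that `since` value, assuming SQLAlchemy and an `InteractionLog` model with a `submitted_at` datetime column (verify the names against `backend/app/models/`):

  ```python
  # Sketch: use the newest stored record as the incremental-sync cursor.
  from sqlalchemy import func, select

  from app.models import InteractionLog  # assumed import path

  def latest_submitted_at(session) -> str | None:
      # None on the first run (empty table) means: fetch everything.
      latest = session.scalar(select(func.max(InteractionLog.submitted_at)))
      return latest.isoformat() if latest is not None else None
  ```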
- 1.4.1. Read the code stubs
- 1.4.2. Implement the pipeline
- 1.4.3. Run and test locally
- 1.4.4. Verify the data locally
- 1.4.5. Test idempotency locally
- 1.4.6. Commit and push your work
- 1.4.7. Update and test on the VM
1.4.1. Read the code stubs

The code stubs in `backend/app/etl.py` contain detailed TODOs.
- Open the file: `backend/app/etl.py`.

  This file contains five functions with detailed TODO comments:

  | Function | Role |
  | --- | --- |
  | `fetch_items()` | Fetch the lab/task catalog from the API |
  | `fetch_logs()` | Fetch check logs with pagination |
  | `load_items()` | Insert items into the database |
  | `load_logs()` | Insert logs (with learner creation) into the database |
  | `sync()` | Orchestrate the full pipeline |
- Open the file: `backend/app/routers/pipeline.py`.

  This file provides the `POST /pipeline/sync` endpoint that calls `sync()`.
- Read the TODO comments in `etl.py` carefully. They specify:

  - Which API endpoints to call and how to authenticate (see the sketch after this list).
  - How to handle pagination (`has_more` flag).
  - How to match API data to database models.
  - How to ensure idempotent upserts (skip records that already exist).
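For the authentication point, here is a minimal sketch of an authenticated `httpx.AsyncClient`. The settings attribute names and import path are assumptions; the real fields live in `backend/app/settings.py`.

```python
# Sketch: an authenticated client for the autochecker API.
import httpx

from app.settings import settings  # assumed import path and attribute names

def make_client() -> httpx.AsyncClient:
    return httpx.AsyncClient(
        base_url=settings.autochecker_api_url,  # e.g. https://auche.namaz.live
        auth=(settings.autochecker_email, settings.autochecker_password),  # HTTP Basic Auth
        timeout=30.0,
    )

async def fetch_items() -> list[dict]:
    async with make_client() as client:
        resp = await client.get("/api/items")
        resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
        return resp.json()
```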
1.4.2. Implement the pipeline

- Start the Qwen Code coding agent in the terminal inside the project directory.

- Give it a prompt that asks for planning, implementation, and explanation:

  "Read the TODO comments in `backend/app/etl.py` and implement all five functions one by one. Use the existing models in `backend/app/models/` and the settings in `backend/app/settings.py`. The API uses HTTP Basic Auth. First give me a short numbered plan, then implement a function, deploy locally, then test, report what exactly you've done, and explain each function step by step as if teaching a junior engineer. Then confirm with me and proceed to the next function."

- Wait for the agent to generate the implementation.
- Review the generated code (a sketch of the find-or-create and skip-if-exists patterns follows this list). Make sure it:

  - Uses `httpx.AsyncClient` with HTTP Basic Auth for API calls.
  - Handles pagination in `fetch_logs()` (loops while `has_more` is `True`).
  - In `load_items()`, maps labs by their short ID (e.g. `"lab-01"`), not by title, so tasks can find their parent.
  - Passes the raw items catalog to `load_logs()` so it can map log fields (e.g. `"lab-01"`, `"setup"`) to item titles in the DB.
  - Creates learners by `external_id` in `load_logs()` (find-or-create pattern).
  - Uses `external_id` on `InteractionLog` for idempotent upserts (skip if exists).
  - Returns `{"new_records": N, "total_records": M}` from `sync()`.
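A minimal sketch of the find-or-create and skip-if-exists patterns from the checklist, assuming SQLAlchemy and hypothetical `Learner` and `InteractionLog` field names (verify against `backend/app/models/`):

```python
# Sketch of the two patterns above; model and column names are assumptions.
from sqlalchemy import select

from app.models import InteractionLog, Learner  # assumed import path

def get_or_create_learner(session, external_id: str, group: str) -> Learner:
    learner = session.scalar(select(Learner).where(Learner.external_id == external_id))
    if learner is None:
        learner = Learner(external_id=external_id, group=group)
        session.add(learner)
        session.flush()  # assigns a primary key without committing the transaction
    return learner

def insert_log_if_new(session, log: dict, learner: Learner) -> bool:
    # Skip-if-exists: the log's external_id marks it as already loaded.
    exists = session.scalar(
        select(InteractionLog).where(InteractionLog.external_id == log["id"])
    )
    if exists is not None:
        return False
    session.add(
        InteractionLog(external_id=log["id"], learner_id=learner.id, score=log["score"])
    )
    return True
```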
> [!TIP]
> To get educational answers from a coding agent, ask for these explicitly:
>
> - "Plan first, then code."
> - "Explain each function step by step."
> - "Call out assumptions and edge cases."
> - "After coding, summarize why this implementation is correct."
1.4.3. Run and test locally

- To deploy your changes locally,

  ```bash
  docker compose --env-file .env.docker.secret up --build -d
  ```
- Open Swagger UI at `http://localhost:<caddy-port>/docs`.

  Replace `<caddy-port>` with the value of `CADDY_HOST_PORT` in `.env.docker.secret` (default: `42002`).
- Authorize with your `API_KEY`.
- Trigger the pipeline: expand `POST /pipeline/sync`, click `Try it out`, then `Execute`.

  You should see a `200` response with a JSON body:

  ```json
  { "new_records": 150, "total_records": 150 }
  ```

  The exact numbers depend on how many check results exist in the autochecker.
> [!TIP]
> If you get a `500` error, the pipeline code has a bug. Use this debug loop:
>
> - Check the container logs:
>
>   ```bash
>   docker compose --env-file .env.docker.secret logs app --tail 50
>   ```
>
> - Copy the error traceback and give it to your coding agent.
> - Apply the fix, rebuild (`docker compose --env-file .env.docker.secret up --build -d`), and try again.
>
> It is normal to repeat this 2–3 times. AI agents often make mistakes with field names, imports, or database constraints on the first try. Each iteration gets you closer.
Troubleshooting

- Check that `AUTOCHECKER_EMAIL` and `AUTOCHECKER_PASSWORD` are set correctly in `.env.docker.secret`. The password is `<github-username><telegram-alias>` (no spaces, no `@`).

- To check the container logs for the error,

  ```bash
  docker compose --env-file .env.docker.secret logs app --tail 50
  ```

  Common issues: missing import, wrong field name, database constraint violation.

- Verify that `AUTOCHECKER_API_URL` is set to `https://auche.namaz.live` in `.env.docker.secret`.
1.4.4. Verify the data locally

- In local Swagger UI, try `GET /items/`.

  You should see a list of lab and task items created by the pipeline.

- Try `GET /learners/`.

  You should see a list of learners with anonymized `external_id` values and student groups.

- Try `GET /interactions/`.

  You should see interaction records with `score`, `checks_passed`, and `checks_total` fields.

- (Optional) Open pgAdmin and inspect the tables directly.
1.4.5. Test idempotency locally

- In local Swagger UI, run `POST /pipeline/sync` again.

  You should see:

  ```json
  { "new_records": 0, "total_records": 150 }
  ```

  `new_records: 0` confirms that the pipeline does not create duplicate records.

> [!NOTE]
> Idempotent upserts are important for production pipelines. If the pipeline is interrupted, you can safely re-run it without creating duplicates.
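If you want the database itself to enforce this, here is a sketch of a PostgreSQL-level alternative. It assumes a `UNIQUE` constraint on `InteractionLog.external_id`, which may or may not exist in the provided models.

```python
# Sketch: let PostgreSQL skip duplicates atomically instead of checking in Python.
from sqlalchemy.dialects.postgresql import insert

from app.models import InteractionLog  # assumed import path and constraint

def bulk_insert_logs(session, rows: list[dict]) -> None:
    stmt = (
        insert(InteractionLog)
        .values(rows)
        .on_conflict_do_nothing(index_elements=["external_id"])
    )
    session.execute(stmt)
```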
1.4.6. Commit and push your work

- Commit your changes. Use this commit message:

  ```
  feat: implement ETL pipeline for autochecker data
  ```
- To push your task branch,

  ```bash
  git push -u origin <task-branch>
  ```

  Replace `<task-branch>` with your branch name (e.g. `task/1-build-data-pipeline`).
1.4.7. Update and test on the VM

- To pull your branch and restart the services on your VM,

  ```bash
  cd se-toolkit-lab-5
  git fetch origin
  git checkout <task-branch>
  git pull origin <task-branch>
  docker compose --env-file .env.docker.secret up --build -d
  ```
- Open Swagger UI at `http://<your-vm-ip-address>:<caddy-port>/docs`.

  Replace:

  - `<your-vm-ip-address>` with your VM's IP address.
  - `<caddy-port>` with the value of `CADDY_HOST_PORT` (default: `42002`).
- Authorize with your `API_KEY`.

- Run `POST /pipeline/sync` once.

  You should get `200` with `new_records` and `total_records`.
- Go to your fork on GitHub and click `Pull requests` → `New pull request`.

- Change the base repository to your own fork: by default, GitHub sets the base to the upstream (`inno-se-toolkit/se-toolkit-lab-5`). Click `base repository` and select `<your-github-username>/se-toolkit-lab-5` instead.

- Set the base branch to `main` and the compare branch to your task branch (e.g. `task/1-build-data-pipeline`).

- Write a PR title and description. Link the PR to the issue by writing `Closes #<issue-number>` in the description.

- Click `Create pull request`.

- Ask your partner to review and approve the PR.

- Merge the PR and close the issue.
2. Acceptance criteria

Check the task using the autochecker Telegram bot.
- Issue has the correct title.
- `POST /pipeline/sync` returns `200` with a JSON body containing `new_records` and `total_records`.
- `GET /items/` returns items created by the pipeline (labs and tasks).
- `GET /learners/` returns learners created by the pipeline.
- `GET /interactions/` returns interactions with scores.
- Running `POST /pipeline/sync` a second time returns `new_records: 0` (idempotency).
- PR is approved.
- PR is merged.