
Build the Data Pipeline

Time

~50 min

Purpose

Build an ETL pipeline that fetches data from an external API and loads it into the database.

Context

The database starts empty. Anonymized data on task completions is available from the Autochecker API. Your job is to build a pipeline that fetches this data and populates your database, so the system can serve it as analytics through the existing endpoints.

Table of contents

1. Steps

1.1. Follow the Git workflow

Follow the Git workflow to complete this task.

1.2. Create a Lab Task issue

  1. Create a GitHub issue titled:

    [Task] Build the Data Pipeline
    
  2. To create a branch for the task,

    run in the VS Code Terminal:

    git checkout main
    git pull origin main
    git checkout -b task/1-build-data-pipeline
    

    We named the branch task/1-build-data-pipeline because:

    • The issue number (1) ties the branch to the task issue directly.
    • The short title (build-data-pipeline) makes branch purpose clear in PR lists and Git history.
    • The pattern reduces naming collisions across the team.

1.3. Part A: Explore the API

Before writing code, let's explore the autochecker API.

The API uses HTTP Basic Auth; we'll use curl to send authenticated requests.

1.3.1. Fetch the item catalog

  1. To fetch the lab/task catalog,

    run in the VS Code Terminal:

    curl \
      -u <your-email>@innopolis.university:<github-username><telegram-alias> \
      "https://auche.namaz.live/api/items"
    

    Replace <your-email> and <github-username><telegram-alias> with the credentials you entered in the autochecker bot.

    You should see a JSON array of labs and tasks from this course:

    [
      {"lab": "lab-01", "task": null, "title": "Lab 01 – ...", "type": "lab"},
      {"lab": "lab-01", "task": "setup", "title": "Repository Setup", "type": "task"},
      ...
    ]

Note

If your terminal shows JSON in one long line, you can format the output using an online JSON viewer.
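As an alternative to an online viewer, you can pretty-print JSON right in the terminal with Python's built-in json.tool module (assuming python3 is on your PATH). Shown here on a sample payload; pipe your real curl command through it instead of the echo:

```shell
# Pretty-print JSON in the terminal: pipe any curl output into json.tool.
echo '[{"lab": "lab-01", "task": null, "title": "Lab 01", "type": "lab"}]' \
  | python3 -m json.tool
```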

1.3.2. Fetch check logs

  1. To fetch the first 5 check logs,

    run in the VS Code Terminal:

    curl \
      -u <your-email>@innopolis.university:<github-username><telegram-alias> \
      "https://auche.namaz.live/api/logs?limit=5"
    

    You should see a JSON object with a logs array:

    {
      "logs": [
        {
          "id": 1,
          "student_id": "a1b2c3d4",
          "group": "B23-CS-01",
          "lab": "lab-01",
          "task": "setup",
          "score": 100.0,
          "passed": 4,
          "failed": 0,
          "total": 4,
          "checks": [...],
          "submitted_at": "2026-02-01T14:30:00Z"
        }
      ],
      "count": 5,
      "has_more": true
    }

Note

  • student_id is an anonymized identifier (not a real student ID).
  • has_more: true means there are more records — you need to paginate.
  • score is a percentage (0.0–100.0).
  • passed, failed, and total are the number of individual checks.
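The sample record above is internally consistent (score 100.0 with 4 of 4 checks passed). A small sanity check you could run over fetched logs, assuming score is derived from passed/total as a percentage (the sample suggests this, but the API does not guarantee it):

```python
# Sanity check: does `score` match passed/total expressed as a percentage?
# Assumption: score is derived this way; treat a mismatch as a record to inspect.
def score_consistent(log, tol=0.01):
    if log["total"] == 0:
        return log["score"] == 0.0
    return abs(log["score"] - 100.0 * log["passed"] / log["total"]) <= tol
```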

1.3.3. Test incremental sync

  1. To fetch only recent logs,

    run in the VS Code Terminal:

    curl \
      -u <your-email>@innopolis.university:<github-username><telegram-alias> \
      "https://auche.namaz.live/api/logs?since=2026-03-01T00:00:00Z&limit=5"
    

    You should see only logs submitted after March 1st 2026.

Note

The since parameter enables incremental sync: each run fetches only the records submitted after the given timestamp, instead of everything. Your pipeline will use the most recent submitted_at from the database as the since value.
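Deriving that since value can be sketched with a hypothetical helper (the real pipeline will query this from the database instead of a list). ISO-8601 "Z" timestamps of equal precision compare correctly as plain strings, so a string max works:

```python
# Hypothetical helper: pick the `since` value for the next incremental sync.
# ISO-8601 UTC ("Z") timestamps of equal precision sort lexicographically.
def next_since(stored_logs):
    return max((log["submitted_at"] for log in stored_logs), default=None)
```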

1.4. Part B: Build the pipeline

1.4.1. Read the code stubs

The code stubs in backend/app/etl.py contain detailed TODOs.

  1. Open the file: backend/app/etl.py.

    This file contains five functions with detailed TODO comments:

    Function       Role
    fetch_items()  Fetch the lab/task catalog from the API
    fetch_logs()   Fetch check logs with pagination
    load_items()   Insert items into the database
    load_logs()    Insert logs (with learner creation) into the database
    sync()         Orchestrate the full pipeline
  2. Open the file: backend/app/routers/pipeline.py.

    This file provides the POST /pipeline/sync endpoint that calls sync().

  3. Read the TODO comments in etl.py carefully. They specify:

    • Which API endpoints to call and how to authenticate.
    • How to handle pagination (has_more flag).
    • How to match API data to database models.
    • How to ensure idempotent upserts (skip records that already exist).

1.4.2. Implement the pipeline

  1. Start the Qwen Code coding agent in the terminal inside the project directory.

  2. Give it a prompt that asks for planning, implementation, and explanation:

    "Read the TODO comments in backend/app/etl.py and implement all five functions one by one. Use the existing models in backend/app/models/ and the settings in backend/app/settings.py. The API uses HTTP Basic Auth. First give me a short numbered plan, then implement a function, deploy locally, then test, report to me what exactly you've done and explain each function step by step as if teaching a junior engineer. Then confirm with me and proceed to the next function."

  3. Wait for the agent to generate the implementation.

  4. Review the generated code. Make sure it:

    • Uses httpx.AsyncClient with HTTP Basic Auth for API calls.
    • Handles pagination in fetch_logs() (loops while has_more is True).
    • In load_items(), maps labs by their short ID (e.g. "lab-01"), not by title, so tasks can find their parent.
    • Passes the raw items catalog to load_logs() so it can map log fields (e.g. "lab-01", "setup") to item titles in the DB.
    • Creates learners by external_id in load_logs() (find-or-create pattern).
    • Uses external_id on InteractionLog for idempotent upserts (skip if exists).
    • Returns {"new_records": N, "total_records": M} from sync().
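The last three checklist points can be illustrated with plain dicts standing in for the ORM tables. The field names mirror the API records shown earlier; everything else here is illustrative, not the required etl.py code:

```python
# Illustration of find-or-create plus skip-if-exists (idempotent upsert),
# keyed on external IDs. Dicts stand in for the database tables.
def sync_logs(db_logs, db_learners, api_logs):
    new = 0
    for log in api_logs:
        # Find-or-create the learner by anonymized external id.
        db_learners.setdefault(
            log["student_id"],
            {"external_id": log["student_id"], "group": log["group"]},
        )
        # Idempotent upsert: skip records that already exist.
        if log["id"] in db_logs:
            continue
        db_logs[log["id"]] = {"external_id": log["id"], "score": log["score"]}
        new += 1
    return {"new_records": new, "total_records": len(db_logs)}
```

Running this twice over the same API data yields new_records: 0 the second time, which is exactly what the idempotency test later in this task checks.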

Tip

To get educational answers from a coding agent, ask for these explicitly:

  • "Plan first, then code."
  • "Explain each function step by step."
  • "Call out assumptions and edge cases."
  • "After coding, summarize why this implementation is correct."

1.4.3. Run and test locally

  1. To deploy your changes locally,

    run in the VS Code Terminal:

    docker compose --env-file .env.docker.secret up --build -d
    
  2. Open Swagger UI at http://localhost:<caddy-port>/docs.

    Replace <caddy-port> with the value of CADDY_HOST_PORT in .env.docker.secret (default: 42002).

  3. Authorize with your API_KEY.

  4. Trigger the pipeline: expand POST /pipeline/sync, click Try it out, then Execute.

    You should see a 200 response with a JSON body:

    {
      "new_records": 150,
      "total_records": 150
    }

    The exact numbers depend on how many check results exist in the autochecker.

    Tip

    If you get a 500 error, the pipeline code has a bug. Use this debug loop:

    1. Check the container logs:
      docker compose --env-file .env.docker.secret logs app --tail 50
      
    2. Copy the error traceback and give it to your coding agent.
    3. Apply the fix, rebuild (docker compose --env-file .env.docker.secret up --build -d), and try again.

    It is normal to repeat this 2–3 times. AI agents often make mistakes with field names, imports, or database constraints on the first try. Each iteration gets you closer.

    Troubleshooting

    401 Unauthorized from the autochecker API

    Check that AUTOCHECKER_EMAIL and AUTOCHECKER_PASSWORD are set correctly in .env.docker.secret. The password is <github-username><telegram-alias> (no spaces, no @).

    500 Internal Server Error

    To check the container logs for the error,

    run in the VS Code Terminal:

    docker compose --env-file .env.docker.secret logs app --tail 50
    

    Common issues: missing import, wrong field name, database constraint violation.

    Connection refused to the autochecker API

    Verify that AUTOCHECKER_API_URL is set to https://auche.namaz.live in .env.docker.secret.
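Putting the three variables from this troubleshooting list together, the relevant part of .env.docker.secret should look roughly like this (placeholders, not real values):

```shell
# Hypothetical excerpt of .env.docker.secret; replace the placeholders.
AUTOCHECKER_API_URL=https://auche.namaz.live
AUTOCHECKER_EMAIL=<your-email>@innopolis.university
AUTOCHECKER_PASSWORD=<github-username><telegram-alias>
```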

1.4.4. Verify the data locally

  1. In local Swagger UI, try GET /items/.

    You should see a list of lab and task items created by the pipeline.

  2. Try GET /learners/.

    You should see a list of learners with anonymized external_id values and student groups.

  3. Try GET /interactions/.

    You should see interaction records with score, checks_passed, and checks_total fields.

  4. (Optional) Open pgAdmin and inspect the tables directly.

1.4.5. Test idempotency locally

  1. In local Swagger UI, run POST /pipeline/sync again.

    You should see:

    {
      "new_records": 0,
      "total_records": 150
    }

    new_records: 0 confirms that the pipeline does not create duplicate records.

Note

Idempotent upserts are important for production pipelines. If the pipeline is interrupted, you can safely re-run it without creating duplicates.

1.4.6. Commit and push your work

  1. Commit your changes.

    Use this commit message:

    feat: implement ETL pipeline for autochecker data
    
  2. To push your task branch,

    run in the VS Code Terminal:

    git push -u origin <task-branch>
    

    Replace <task-branch> with the name of your task branch (e.g. task/1-build-data-pipeline).

1.4.7. Update and test on the VM

  1. To pull your branch and restart the services on your VM,

    run in the VS Code Terminal:

    cd se-toolkit-lab-5
    git fetch origin
    git checkout <task-branch>
    git pull origin <task-branch>
    docker compose --env-file .env.docker.secret up --build -d
    
  2. Open Swagger UI at http://<your-vm-ip-address>:<caddy-port>/docs.

    Replace:

    • <your-vm-ip-address> with your VM's IP address.
    • <caddy-port> with the value of CADDY_HOST_PORT (default: 42002).
  3. Authorize with your API_KEY.

  4. Run POST /pipeline/sync once.

    You should get 200 with new_records and total_records.

1.5. Finish the task

  1. Go to your fork on GitHub and click Pull requests → New pull request.

  2. Change the base repository to your own fork — by default GitHub sets the base to the upstream (inno-se-toolkit/se-toolkit-lab-5). Click base repository and select <your-github-username>/se-toolkit-lab-5 instead.

  3. Set the base branch to main and the compare branch to your task branch (e.g. task/1-build-data-pipeline).

  4. Write a PR title and description. Link the PR to the issue by writing Closes #<issue-number> in the description.

  5. Click Create pull request.

  6. Ask your partner to review and approve the PR.

  7. Merge the PR and close the issue.

1.6. Check the task using the autochecker

Check the task using the autochecker Telegram bot.


2. Acceptance criteria

  • Issue has the correct title.
  • POST /pipeline/sync returns 200 with a JSON body containing new_records and total_records.
  • GET /items/ returns items created by the pipeline (labs and tasks).
  • GET /learners/ returns learners created by the pipeline.
  • GET /interactions/ returns interactions with scores.
  • Running POST /pipeline/sync a second time returns new_records: 0 (idempotency).
  • PR is approved.
  • PR is merged.