name: Update CIA CSV Data
# Refreshes every CSV already tracked under data/cia/** and cia-data/** from the
# upstream Hack23/cia sample-data tree:
# https://github.com/Hack23/cia/tree/master/service.data.impl/sample-data
#
# Upstream layout is NOT flat — root-level files coexist with sub-folders
# (distinct_values/, framework-validation/, risk-rule-tests/). The workflow
# therefore builds a basename→upstream-path index via the GitHub Tree API
# (recursive) before downloading, so every tracked local CSV is resolved to
# its correct upstream path regardless of sub-folder.
#
# The workflow is intentionally update-only: it never pulls in CSV files that
# are not already tracked in this repository.
#
# After the CSV refresh it runs the production-stats pipeline
# (scripts/load-cia-stats.ts + scripts/update-stats-from-cia.ts) so the
# homepage statistics (index*.html) stay in sync with
# cia-data/production-stats.json derived from extraction_summary_report.csv.
#
# When either stage produces changes, a single pull request is opened.
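#
# Manual runs can target any upstream ref via the `ref` input; for example,
# assuming an authenticated GitHub CLI against this repository:
#
#   gh workflow run update-cia-csv-data.yml -f ref=master
#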
on:
  schedule:
    # Daily at 03:30 UTC (~04:30 CET / 05:30 CEST), safely after the upstream
    # CIA extraction job which completes around 02:57 UTC. The 30-minute margin
    # absorbs upstream runtime variance and ensures we refresh from the latest
    # extraction rather than the previous day's snapshot.
    - cron: '30 3 * * *'
  workflow_dispatch:
    inputs:
      ref:
        description: 'Upstream Hack23/cia git ref (branch, tag, or SHA) to download from'
        required: false
        type: string
        default: 'master'
permissions:
  contents: read
jobs:
  refresh:
    name: Refresh CIA CSV data and production stats
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    env:
      UPSTREAM_REF: ${{ github.event.inputs.ref || 'master' }}
      UPSTREAM_REPO: Hack23/cia
      UPSTREAM_PATH: service.data.impl/sample-data
    steps:
      - name: Harden Runner
        uses: step-security/harden-runner@a5ad31d6a139d249332a2605b85202e8c0b78450 # v2.19.1
        with:
          egress-policy: audit
          allowed-endpoints: >
            api.github.com:443
            github.com:443
            raw.githubusercontent.com:443
            objects.githubusercontent.com:443
            codeload.github.com:443
            registry.npmjs.org:443
            nodejs.org:443
      - name: Checkout repository
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          fetch-depth: 1
      - name: Setup Node.js
        uses: actions/setup-node@48b55a011bda9f5d6aeb4c2d9c7362e8dae4041e # v6.4.0
        with:
          node-version: '26'
          cache: 'npm'
          cache-dependency-path: |
            package-lock.json
            .github/workflows/update-cia-csv-data.yml
      - name: Install dependencies
        run: npm ci
      - name: Resolve upstream ref → SHA
        id: resolve
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          set -euo pipefail
          # URL-encode the ref so branch names that legitimately contain '/'
          # (e.g. "feature/foo") survive interpolation into the GitHub API
          # path. jq is used as a dependency-free URL-encoder.
          encoded_ref="$(printf '%s' "$UPSTREAM_REF" | jq -rR @uri)"
          echo "🔎 Resolving ${UPSTREAM_REPO}@${UPSTREAM_REF} (encoded: ${encoded_ref})"
          # Ask the commits endpoint for the fully-resolved SHA. This accepts
          # branches, tags, and raw SHAs uniformly and returns the tip commit
          # SHA. All downstream URLs (tree API + raw downloads) use that SHA,
          # so there's no risk of a mid-run upstream update producing an
          # inconsistent CSV / stats snapshot, and no further URL-encoding
          # is required because a SHA is always [0-9a-f].
          sha="$(curl --silent --show-error --fail --location --retry 3 --max-time 30 \
            -H "Accept: application/vnd.github.v3+json" \
            -H "Authorization: Bearer ${GH_TOKEN}" \
            -H "X-GitHub-Api-Version: 2022-11-28" \
            "https://api.github.com/repos/${UPSTREAM_REPO}/commits/${encoded_ref}" \
            | jq -r '.sha')"
          if [ -z "$sha" ] || [ "$sha" = "null" ]; then
            echo "::error::Failed to resolve ${UPSTREAM_REPO}@${UPSTREAM_REF} to a commit SHA"
            exit 1
          fi
          echo "✅ Resolved SHA: ${sha}"
          echo "sha=${sha}" >> "$GITHUB_OUTPUT"
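      # Quick sanity check of the jq-based URL-encoding used in the resolve
      # step — a local sketch (any jq ≥ 1.5), not part of the workflow itself:
      #
      #   $ printf '%s' 'feature/foo' | jq -rR @uri
      #   feature%2Ffoo
      #
      # The encoded ref then slots safely into the
      # /repos/{owner}/{repo}/commits/{ref} API path.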
      - name: Build upstream basename → path index
        id: index
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          UPSTREAM_SHA: ${{ steps.resolve.outputs.sha }}
        run: |
          set -euo pipefail
          echo "📡 Fetching recursive tree for ${UPSTREAM_REPO}@${UPSTREAM_SHA}"
          tree_json="$(mktemp)"
          curl --silent --show-error --fail --location --retry 3 --max-time 60 \
            -H "Accept: application/vnd.github.v3+json" \
            -H "Authorization: Bearer ${GH_TOKEN}" \
            -H "X-GitHub-Api-Version: 2022-11-28" \
            "https://api.github.com/repos/${UPSTREAM_REPO}/git/trees/${UPSTREAM_SHA}?recursive=1" \
            -o "$tree_json"
          # The GitHub "get a tree" endpoint returns truncated=true when the
          # tree exceeds ~100k entries or 7 MB. The upstream Hack23/cia tree is
          # currently well under those limits (<10k entries), so a single
          # recursive call is sufficient. If the repo ever grows past the
          # threshold, this step fails fast rather than silently producing an
          # incomplete index — at which point the workflow should be updated
          # to fall back to per-subdirectory tree fetches.
          truncated="$(jq -r '.truncated' "$tree_json")"
          if [ "$truncated" = "true" ]; then
            echo "::error::GitHub tree response is truncated; upstream repo has grown past the single-call limit. Update the workflow to paginate sub-trees."
            exit 1
          fi
          # index.tsv: <basename>\t<upstream-relative-path> (duplicate basenames rejected)
          index_file="${RUNNER_TEMP:-/tmp}/cia_sample_data_index.tsv"
          jq -r --arg prefix "${UPSTREAM_PATH}/" '
            .tree[]
            | select(.type == "blob")
            | select(.path | startswith($prefix))
            | select(.path | endswith(".csv"))
            | .path | ltrimstr($prefix)
          ' "$tree_json" | awk -F/ '{ print $NF "\t" $0 }' | sort > "$index_file"
          total_csvs="$(wc -l < "$index_file")"
          dup_count="$(cut -f1 "$index_file" | sort | uniq -d | wc -l)"
          if [ "$dup_count" -gt 0 ]; then
            echo "::error::Upstream contains ${dup_count} duplicate CSV basenames; basename-based lookup is ambiguous:"
            cut -f1 "$index_file" | sort | uniq -d
            exit 1
          fi
          echo "📚 Upstream index: ${total_csvs} CSVs (no duplicate basenames)"
          echo "index_file=$index_file" >> "$GITHUB_OUTPUT"
          echo "total_upstream=$total_csvs" >> "$GITHUB_OUTPUT"
          rm -f "$tree_json"
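      # Worked example of the index transformation above (illustrative path):
      # a tree entry "service.data.impl/sample-data/distinct_values/view_x.csv"
      # is stripped to "distinct_values/view_x.csv" by ltrimstr, then the awk
      # stage keys it by basename:
      #
      #   $ echo 'distinct_values/view_x.csv' | awk -F/ '{ print $NF "\t" $0 }'
      #   view_x.csv	distinct_values/view_x.csv
      #
      # so the refresh step can look up any tracked basename regardless of
      # which upstream sub-folder it lives in.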
      - name: Refresh tracked CSV files from upstream
        id: refresh
        env:
          INDEX_FILE: ${{ steps.index.outputs.index_file }}
          UPSTREAM_SHA: ${{ steps.resolve.outputs.sha }}
        run: |
          set -euo pipefail
          # Use the resolved SHA (not the raw ref) so downloads are pinned to
          # the exact same commit the index was built from, and no URL-encoding
          # is needed for the raw.githubusercontent.com path.
          BASE_URL="https://raw.githubusercontent.com/${UPSTREAM_REPO}/${UPSTREAM_SHA}/${UPSTREAM_PATH}"
          echo "📥 Source base URL: ${BASE_URL}"
          echo "📚 Using index: ${INDEX_FILE}"
          # Only touch CSVs that are ALREADY tracked in the repository under
          # data/cia/** or cia-data/**. We never introduce new files.
          mapfile -t LOCAL_CSVS < <(
            find data/cia cia-data -type f -name '*.csv' 2>/dev/null | sort -u
          )
          total=${#LOCAL_CSVS[@]}
          updated=0
          unchanged=0
          missing=0
          skipped=0
          failed=0
          # Locally-curated files with no upstream equivalent; these must not be
          # overwritten by the upstream sample-data tree.
          declare -a SKIP_PATTERNS=(
            # Local hand-curated mini-samples used by dashboard demos; their
            # schemas (name,ministry,influence / impact / riskScore / quarter)
            # do not exist in upstream service.data.impl/sample-data.
            'data/cia/ministry/sample_influence.csv'
            'data/cia/ministry/sample_decision_impact.csv'
            'data/cia/ministry/sample_risk_levels.csv'
            'data/cia/ministry/sample_productivity.csv'
            # Locally-generated forecast products (election scenarios) that
            # are not published in the upstream CIA sample-data tree.
            'cia-data/election/election_forecast.csv'
            'cia-data/election/coalition_scenarios.csv'
          )
          is_skipped() {
            local path="$1"
            for pat in "${SKIP_PATTERNS[@]}"; do
              if [ "$path" = "$pat" ]; then
                return 0
              fi
            done
            return 1
          }
          # Resolve a local basename to its upstream-relative path. Falls back
          # to the "<stem>_sample.csv" alias, which the repo uses for a few
          # locally-renamed canonical files (e.g. view_riksdagen_committee_decisions.csv).
          resolve_upstream_path() {
            local base="$1"
            local hit
            hit="$(awk -v b="$base" -F'\t' '$1==b {print $2; exit}' "$INDEX_FILE")"
            if [ -n "$hit" ]; then
              printf '%s' "$hit"
              return 0
            fi
            local alias="${base%.csv}_sample.csv"
            hit="$(awk -v b="$alias" -F'\t' '$1==b {print $2; exit}' "$INDEX_FILE")"
            if [ -n "$hit" ]; then
              printf '%s' "$hit"
              return 0
            fi
            return 1
          }
          tmpdir="$(mktemp -d)"
          trap 'rm -rf "$tmpdir"' EXIT
          : > "$tmpdir/updated.txt"
          : > "$tmpdir/missing.txt"
          : > "$tmpdir/failed.txt"
          for local_path in "${LOCAL_CSVS[@]}"; do
            base="$(basename "$local_path")"
            if is_skipped "$local_path"; then
              echo " ⏭ skip $local_path (local-only sample, no upstream)"
              skipped=$((skipped + 1))
              continue
            fi
            if ! upstream_rel="$(resolve_upstream_path "$base")"; then
              echo " ❔ miss $local_path (no upstream basename match)"
              missing=$((missing + 1))
              echo "$local_path" >> "$tmpdir/missing.txt"
              continue
            fi
            url="${BASE_URL}/${upstream_rel}"
            tmpfile="$tmpdir/$base"
            # No --fail here: we want the HTTP status code, not curl's exit
            # status. A transport-level failure falls back to the sentinel 000
            # (assigned after the substitution so it can't concatenate with
            # whatever --write-out already printed).
            http_code="$(curl --silent --show-error --location --retry 3 \
              --max-time 60 --output "$tmpfile" --write-out '%{http_code}' \
              "$url")" || http_code='000'
            case "$http_code" in
              200)
                if [ ! -s "$tmpfile" ]; then
                  echo " ⚠️ empty $local_path ($url)"
                  failed=$((failed + 1))
                  echo "$local_path" >> "$tmpdir/failed.txt"
                  continue
                fi
                if cmp --silent "$tmpfile" "$local_path"; then
                  unchanged=$((unchanged + 1))
                else
                  mkdir -p "$(dirname "$local_path")"
                  mv "$tmpfile" "$local_path"
                  updated=$((updated + 1))
                  echo "$local_path <- ${upstream_rel}" >> "$tmpdir/updated.txt"
                  echo " ✅ update $local_path <- ${upstream_rel}"
                fi
                ;;
              404)
                echo " ❔ miss $local_path (404 at ${upstream_rel})"
                missing=$((missing + 1))
                echo "$local_path" >> "$tmpdir/missing.txt"
                ;;
              *)
                echo " ❌ http=$http_code $local_path ($url)"
                failed=$((failed + 1))
                echo "$local_path" >> "$tmpdir/failed.txt"
                ;;
            esac
          done
          {
            echo "total=$total"
            echo "updated=$updated"
            echo "unchanged=$unchanged"
            echo "missing=$missing"
            echo "skipped=$skipped"
            echo "failed=$failed"
          } >> "$GITHUB_OUTPUT"
          {
            echo "## 📊 CIA CSV Refresh Summary"
            echo ""
            echo "| Metric | Count |"
            echo "| --- | ---: |"
            echo "| Total tracked CSVs | $total |"
            echo "| ✅ Updated | $updated |"
            echo "| 🟰 Unchanged | $unchanged |"
            echo "| ❔ Missing upstream | $missing |"
            echo "| ⏭ Skipped (local-only) | $skipped |"
            echo "| ❌ Failed downloads | $failed |"
            echo ""
            echo "- Upstream ref: \`${UPSTREAM_REF}\`"
            echo "- Source base: ${BASE_URL}"
            echo "- Upstream index size: ${{ steps.index.outputs.total_upstream }} CSVs"
            if [ "$updated" -gt 0 ]; then
              echo ""
              echo "### Updated files"
              echo ""
              sed 's/^/- /' "$tmpdir/updated.txt"
            fi
            if [ "$missing" -gt 0 ]; then
              echo ""
              echo "### Missing upstream"
              echo ""
              sed 's/^/- /' "$tmpdir/missing.txt"
            fi
            if [ "$failed" -gt 0 ]; then
              echo ""
              echo "### Failed downloads"
              echo ""
              sed 's/^/- /' "$tmpdir/failed.txt"
            fi
          } >> "$GITHUB_STEP_SUMMARY"
          if [ "$failed" -gt 0 ]; then
            echo "::error::$failed CSV downloads failed; see job summary"
            exit 1
          fi
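      # The alias fallback in resolve_upstream_path is plain bash parameter
      # expansion; with an illustrative basename:
      #
      #   $ base='view_riksdagen_committee_decisions.csv'
      #   $ echo "${base%.csv}_sample.csv"
      #   view_riksdagen_committee_decisions_sample.csv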
      - name: Refresh CIA production statistics
        id: stats
        env:
          UPSTREAM_SHA: ${{ steps.resolve.outputs.sha }}
          # Pin the statistics fetch to the exact same commit SHA the CSV
          # refresh just used. `scripts/load-cia-stats.ts` respects the
          # CIA_EXTRACTION_SUMMARY_URL env var as an override of its default
          # master-branch URL, so the injected counts stay consistent with the
          # CSV snapshot when the workflow is dispatched with a non-master ref.
          CIA_EXTRACTION_SUMMARY_URL: https://raw.githubusercontent.com/${{ env.UPSTREAM_REPO }}/${{ steps.resolve.outputs.sha }}/${{ env.UPSTREAM_PATH }}/extraction_summary_report.csv
        run: |
          set -euo pipefail
          echo "📊 Fetching CIA production statistics from ${CIA_EXTRACTION_SUMMARY_URL}"
          npx --no-install tsx scripts/load-cia-stats.ts
          if [ ! -f "cia-data/production-stats.json" ]; then
            echo "::error::cia-data/production-stats.json was not produced"
            exit 1
          fi
          total_persons="$(jq -r '.counts.total_persons' cia-data/production-stats.json)"
          total_votes="$(jq -r '.counts.total_votes' cia-data/production-stats.json)"
          last_updated="$(jq -r '.metadata.last_updated' cia-data/production-stats.json)"
          {
            echo "total_persons=$total_persons"
            echo "total_votes=$total_votes"
            echo "last_updated=$last_updated"
          } >> "$GITHUB_OUTPUT"
          echo "🖊 Injecting statistics into index*.html (14 language variants)"
          npx --no-install tsx scripts/update-stats-from-cia.ts
          {
            echo ""
            echo "## 📈 Production Statistics Refresh"
            echo ""
            echo "| Metric | Value |"
            echo "| --- | ---: |"
            echo "| Total Persons | ${total_persons} |"
            echo "| Total Votes | ${total_votes} |"
            echo "| Last Extraction | ${last_updated} |"
            echo "| Pinned commit | \`${UPSTREAM_SHA}\` |"
          } >> "$GITHUB_STEP_SUMMARY"
      - name: Detect git changes
        id: diff
        run: |
          set -e
          # Paths touched by this workflow: CSV trees + stats JSON + injected HTML
          paths=(data/cia cia-data/production-stats.json cia-data/*.csv cia-data/*/*.csv index.html index_*.html)
          if [ -n "$(git status --porcelain -- "${paths[@]}")" ]; then
            echo "has_changes=true" >> "$GITHUB_OUTPUT"
            echo "📝 Changes detected"
            git status --porcelain -- "${paths[@]}" | head -30
          else
            echo "has_changes=false" >> "$GITHUB_OUTPUT"
            echo "✅ No changes detected"
          fi
      - name: Create pull request with refreshed data
        if: steps.diff.outputs.has_changes == 'true'
        uses: peter-evans/create-pull-request@5f6978faf089d4d20b00c7766989d076bb2fc7f1 # v8.1.1
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          add-paths: |
            data/cia/**
            cia-data/**
            index.html
            index_*.html
          commit-message: |
            data(cia): refresh CSV sample data + production stats from ${{ env.UPSTREAM_REPO }}@${{ steps.resolve.outputs.sha }}

            - Upstream ref: ${{ env.UPSTREAM_REF }} (resolved to ${{ steps.resolve.outputs.sha }})
            - CSV files: updated ${{ steps.refresh.outputs.updated }} / ${{ steps.refresh.outputs.total }} tracked
            - Production stats: ${{ steps.stats.outputs.total_persons }} persons, ${{ steps.stats.outputs.total_votes }} votes
            - Last extraction: ${{ steps.stats.outputs.last_updated }}
          branch: data/cia-csv-refresh
          delete-branch: true
          title: '📊 Refresh CIA data (${{ steps.refresh.outputs.updated }} CSVs + production stats)'
          body: |
            ## 📊 Automated CIA data refresh

            Pulled the latest `service.data.impl/sample-data/` CSVs from
            [`${{ env.UPSTREAM_REPO }}@${{ env.UPSTREAM_REF }}`](https://github.com/${{ env.UPSTREAM_REPO }}/tree/${{ steps.resolve.outputs.sha }}/${{ env.UPSTREAM_PATH }})
            (resolved to commit [`${{ steps.resolve.outputs.sha }}`](https://github.com/${{ env.UPSTREAM_REPO }}/commit/${{ steps.resolve.outputs.sha }}))
            and re-ran the production-statistics pipeline against the same commit.

            ### CSV refresh

            | Metric | Count |
            | --- | ---: |
            | Total tracked CSVs | ${{ steps.refresh.outputs.total }} |
            | ✅ Updated | ${{ steps.refresh.outputs.updated }} |
            | 🟰 Unchanged | ${{ steps.refresh.outputs.unchanged }} |
            | ❔ Missing upstream | ${{ steps.refresh.outputs.missing }} |
            | ⏭ Skipped (local-only) | ${{ steps.refresh.outputs.skipped }} |
            | ❌ Failed downloads | ${{ steps.refresh.outputs.failed }} |

            ### Production statistics

            - Total persons: **${{ steps.stats.outputs.total_persons }}**
            - Total votes: **${{ steps.stats.outputs.total_votes }}**
            - Last extraction: **${{ steps.stats.outputs.last_updated }}**

            ### How resolution works

            The workflow never introduces new files — it only refreshes CSVs
            already tracked under `data/cia/**` and `cia-data/**`. Upstream
            paths are resolved through a `basename → upstream-path` index
            built from the recursive GitHub Tree API, so files living in
            sub-folders (`distinct_values/`, `framework-validation/`,
            `risk-rule-tests/`) are located correctly even though local paths
            are organised differently. A `<stem>_sample.csv` alias is applied
            for the handful of locally-renamed canonical files.

            ### Review checklist

            - [ ] CSV diffs look reasonable (no schema regressions)
            - [ ] `production-stats.json` counts move monotonically or within reason
            - [ ] Dashboards still render with the refreshed data
            - [ ] No unexpected file additions or removals

            > Generated by `.github/workflows/update-cia-csv-data.yml`.
          labels: |
            automated-pipeline
            data-update
            cia-intelligence
# ISMS Compliance
# - ISO 27001:2022 A.5.33 Protection of records — Git audit trail + PR review gate
# - ISO 27001:2022 A.8.3 Information lifecycle management — nightly refresh + change detection
# - ISO 27001:2022 A.8.10 Information deletion — auto-delete PR branch after merge
# - ISO 27001:2022 A.8.19 Security in use — HTTPS-only downloads, SHA-pinned actions
# - NIST CSF 2.0 PR.DS-5 Data integrity — byte-level cmp vs upstream
# - NIST CSF 2.0 DE.CM-1 Network monitoring — step-security/harden-runner egress audit
# - CIS Controls v8.1 3.1 Data inventory — explicit local-file discovery
# - CIS Controls v8.1 3.14 Data integrity validation — diff + PR review before merge
# - GDPR Article 6(1)(e) — public interest processing (democratic transparency)
# - Swedish Offentlighetsprincipen — public access to government information