Add simple Hugo configuration for static PINVAL generation by jeancochrane · Pull Request #40 · ccao-data/homeval

jeancochrane · 2025-05-05T23:28:02Z

This PR adds a simple Hugo site that we can use to generate PINVAL reports. For now, the Hugo configuration is defined in parallel to the existing Quarto configuration, so that we can maintain the legacy Quarto process while we continue work on the Hugo process. Once the Hugo process is production-ready, I'll put up a follow-up PR that removes the legacy Quarto doc for the sake of cleanliness.

Testing

To test out the Hugo site:

1. Open WSL
2. Clone or navigate to the pinval repo
3. Make sure you have Hugo installed: sudo snap install hugo
   - This will prompt you for your terminal user password
   - Run hugo version to confirm that it installed correctly
     - You might need to open a new shell before the CLI is available in your path
4. Navigate to the pinval/pinval/ subdirectory
5. Run hugo serve
6. Navigate to http://localhost:1313/example-single-card/ to view the sample single-card report
7. Navigate to http://localhost:1313/example-multi-card/ to view the sample multi-card report
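
If you'd rather check the two sample pages from the command line, here is a minimal sketch. It assumes hugo serve is already running in another shell on the default port and that curl is installed. The function names and the BASE_URL variable are made up for illustration.

```shell
# Base URL of the local Hugo dev server; override if you serve on another port.
BASE_URL="${BASE_URL:-http://localhost:1313}"

# Print the two sample report URLs from the testing steps above.
sample_report_urls() {
  printf '%s\n' \
    "${BASE_URL}/example-single-card/" \
    "${BASE_URL}/example-multi-card/"
}

# Request each sample page and report its HTTP status.
# Returns non-zero if any page does not answer with 200.
check_sample_reports() {
  local url status rc=0
  for url in $(sample_report_urls); do
    status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    echo "${status} ${url}"
    [ "$status" = "200" ] || rc=1
  done
  return "$rc"
}

# Usage (with `hugo serve` running):
#   check_sample_reports
```

A non-200 status here most likely means the server is not running or was started from a different directory.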

Benchmarking

Build time

I was curious how fast Hugo can build this report, and how much memory/CPU we'd need to build a full tri's worth of reports. I used a quick command like this one to copy the single-card report N times:

$ export N=500000
$ time for i in $(seq 1 $N); do cp content/example-single-card.md "content/example-single-card-${i}.md"; done

I tried running on 500k reports but ended up running out of memory on my 16GB laptop. Public repo GitHub runners have 16GB RAM while private repo runners have 7GB RAM, so this indicates that we probably can't build an entire tri in one shot.

I started at 100k reports and incrementally increased the number of reports until I ran out of memory. On my machine, memory starts to max out and things start to slow down around 250k reports, but a 250k-report build still runs very fast:

$ time hugo build
Start building sites …
hugo v0.147.0-7d0039b86ddd6397816cc3383cb0cfa481b15f32+extended linux/amd64 BuildDate=2025-04-25T15:26:28Z VendorInfo=snap:0.147.0

                   |   EN
-------------------+---------
  Pages            | 250002
  Paginator pages  |      0
  Non-page files   |      0
  Static files     |      7
  Processed images |      0
  Aliases          |      0
  Cleaned          |      0

Total in 88588 ms

real    1m30.243s
user    12m14.501s
sys     1m50.962s

The important number there is the real wall clock time: 90 seconds to generate 250k reports.
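
To put that build in per-second terms, here is a rough worked calculation using the numbers from the build output above. The extrapolation to 500k reports assumes build time scales linearly with page count, which this benchmark did not verify, and it ignores the memory limit discussed above.

```shell
# Figures copied from the `hugo build` output above.
pages=250002      # "Pages" row of the build summary
build_ms=88588    # "Total in 88588 ms"

# Throughput in pages per second (shell integer arithmetic truncates).
pages_per_sec=$(( pages * 1000 / build_ms ))

# Assumption: linear scaling up to the ~500k reports tried earlier.
tri_reports=500000
tri_build_sec=$(( tri_reports * build_ms / pages / 1000 ))

echo "throughput: ${pages_per_sec} pages/s"
echo "estimated build time for ${tri_reports} reports: ${tri_build_sec}s"
```

That works out to roughly 2,800 pages per second, or about three minutes of build time for 500k reports if they were split across builds that fit in memory.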

While this benchmarking means we likely can't build a whole tri in one shot on a GitHub public repo runner (let alone a private repo runner), we do have one option for building a full tri in a single job without increasing the memory allocation on our runners: segmenting our reports by township. The biggest town (Lake) has ~190k PINs, which is below the ~250k reports where my 16GB machine started to max out, so public repo runners should have enough RAM to generate the biggest town in one shot. We can segment by town and then call hugo build with the --renderSegments flag once for each town in the tri in order to build an entire tri in one job. (If we want to get even fancier/more performant, we could parallelize each township in a separate GitHub workflow using dynamic job matrices.) If this seems good to you, I'll open a follow-up issue to perform this segmenting and we can pick it up once we're further along with #37.
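
The per-township loop could look something like the sketch below. It is only an illustration. The township names are placeholders, each name would need a matching segment definition in the Hugo config (which this PR does not include), and the HUGO_CMD override is a made-up convenience for dry runs.

```shell
# Build one township segment at a time instead of a whole tri at once.
# Assumptions: each argument is the name of a segment defined in the Hugo
# config; HUGO_CMD is a hypothetical override used for dry runs.
build_by_township() {
  local hugo_cmd="${HUGO_CMD:-hugo}"
  local town
  for town in "$@"; do
    echo "Building segment: ${town}" >&2
    # --renderSegments limits rendering to the named segment.
    $hugo_cmd build --renderSegments "$town" || return 1
  done
}

# Usage (placeholder township names):
#   build_by_township lake jefferson
# Dry run that only prints the commands:
#   HUGO_CMD="echo hugo" build_by_township lake jefferson
```

Stopping at the first failed township keeps a bad build from being silently mixed into the output of the others.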

Disk usage

While I was benchmarking performance, I was also curious about realistic disk usage.

200k reports use roughly 7.7GB of storage on disk:

$ du -sh public
7.7G    public

This is because each report is about 28KB:

$ du -sh public/example-single-card/index.html
28K     public/example-single-card/index.html

We can cut about a third off of that using the --minify argument to hugo build:

$ hugo build --minify
...
Total in 99ms
$ du -sh public/example-single-card/index.html
20K     public/example-single-card/index.html

This suggests a total disk usage of about 500k × 20KB ≈ 10GB per tri. However, we should expect this to be larger, because not all predictors are present in the characteristics table yet. I would guess the final report will be on the order of 1.5x as large as this test report, or ~15GB per tri.
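
The same back-of-the-envelope estimate as a quick calculation, using decimal units (1GB = 1,000,000KB). The 1.5x growth factor is the guess stated above, not a measurement.

```shell
reports_per_tri=500000   # approximate number of reports per tri
kb_per_report=20         # minified size from `du -sh` above
growth_pct=150           # guessed final size: ~1.5x the test report

# Decimal units: 1 GB = 1,000,000 KB.
current_gb=$(( reports_per_tri * kb_per_report / 1000000 ))
projected_gb=$(( current_gb * growth_pct / 100 ))

echo "minified today: ~${current_gb}GB per tri"
echo "projected:      ~${projected_gb}GB per tri"
```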

jeancochrane · 2025-05-06T18:48:01Z

hugo/content/example-multi-card.md

@@ -0,0 +1,278 @@
+---


This is an example of the Markdown frontmatter that Hugo needs for a multi-card PIN in order to render a PINVAL report. The higher-level vision here is that eventually we'll have a GitHub workflow (#37) that runs a Python script (#38) that queries the PINVAL tables in Athena (ccao-data/data-architecture#793) and generates one of these Markdown files in the hugo/content/ directory for every PIN in a tri (or in a list of input PINs); then it will call the hugo command to compile those Markdown files into output HTML pages using the template defined in the hugo/layouts/ subdir below.
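
The planned generator is a Python script (#38). Purely to illustrate the flow in the same shell used elsewhere in this PR, here is a stand-in sketch that writes one Markdown file per PIN. The front matter keys (pin, cards) and the PINs are placeholders; the real schema is whatever the example content files in this PR define.

```shell
# Write one Markdown file with YAML front matter per PIN.
# Assumptions: the keys below are placeholders, not the real report schema,
# and the function name is hypothetical.
write_report_stubs() {
  local content_dir="$1"
  shift
  local pin
  mkdir -p "$content_dir"
  for pin in "$@"; do
    cat > "${content_dir}/${pin}.md" <<EOF
---
pin: "${pin}"
cards: []
---
EOF
  done
}

# Usage with made-up PINs, followed by a build from the hugo/ directory:
#   write_report_stubs hugo/content 01011000040000 01011000050000
#   (cd hugo && hugo build)
```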

Makes sense to me. It's a sort of params file that our HTML layout reads to generate the static HTML page.

jeancochrane · 2025-05-06T18:49:34Z

hugo/content/example-single-card.md

@@ -0,0 +1,144 @@
+---


Here's a single-card example. The main difference is that the cards array only has one element.

… version

jeancochrane · 2025-05-07T20:59:30Z

hugo/config.toml

+baseURL = 'http://example.org/'
+languageCode = 'en-us'
+title = 'PINVAL'
+disableKinds = ['sitemap', 'rss']


This is just Hugo boilerplate. It's not really important at this point.

jeancochrane · 2025-05-07T21:00:21Z