feat: generate sitemap.xml from ES index with correct last_updated dates #2900

Open
reakaleek wants to merge 1 commit into main from feature/es-index-based-sitemap

@reakaleek

Summary

  • Adds a new assembler sitemap CLI command that queries the ES lexical index for real last_updated dates and generates sitemap.xml with a correct lastmod value for each URL
  • Uses search_after with a point in time (PIT) to page through more than 10k documents, which is beyond the default from/size result window
  • The build command still produces a sitemap that uses DateTime.UtcNow, for backwards compatibility; assembler sitemap overwrites it after ES indexing
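
For readers unfamiliar with the pagination pattern named above, here is a minimal sketch of how one page request for a PIT plus search_after scan could be built. It is not the code in this PR: SitemapSearchBody, the page size, the keep_alive value, and the _shard_doc sort are all assumptions for illustration, and the hidden-document filter mentioned in the test plan is left out.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not the PR's EsSitemapReader.BuildSearchBody.
public static class SitemapSearchBody
{
    // Builds the JSON body for one page of a PIT + search_after scan.
    public static string Build(string pitId, object[]? searchAfter, int pageSize = 1000)
    {
        var body = new Dictionary<string, object>
        {
            ["size"] = pageSize,
            // Fetch only the two fields the sitemap needs.
            ["_source"] = new[] { "url", "last_updated" },
            // With a PIT the target index comes from the PIT id, not the request path.
            ["pit"] = new Dictionary<string, object>
            {
                ["id"] = pitId,
                ["keep_alive"] = "1m" // assumed value
            },
            // _shard_doc is a sort key available on PIT searches; assumed here as the only sort.
            ["sort"] = new object[]
            {
                new Dictionary<string, object> { ["_shard_doc"] = "asc" }
            }
        };

        // The first page has no cursor; later pages pass the sort values of the last hit.
        if (searchAfter is { Length: > 0 })
            body["search_after"] = searchAfter;

        return JsonSerializer.Serialize(body);
    }
}
```

A caller would send Build(pitId, null) first, then keep calling Build(pitId, lastSortValues) with the sort array of the last hit until a page comes back empty.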

CI workflow change required

After merging, update .shared.assemble-build-and-deploy.yml in docs-internal-workflows to add the following steps after the assembler index step:

- name: 'docs-builder assembler sitemap --environment ${{ inputs.environment }}'
  run: |
    docs-builder assembler sitemap -c remote --environment "${ENVIRONMENT}"
  env:
    ENVIRONMENT: ${{ inputs.environment }}
    DOCUMENTATION_ELASTIC_APIKEY: ${{ steps.es_creds.outputs.es_apikey }}
    DOCUMENTATION_ELASTIC_URL: ${{ steps.es_creds.outputs.es_url }}

- name: Upload sitemap to S3
  run: |
    aws s3 cp .artifacts/assembly/docs/sitemap.xml "s3://${S3_BUCKET}/docs/sitemap.xml"
  env:
    S3_BUCKET: ${{ inputs.aws-s3-bucket }}

Test plan

  • Unit tests for SitemapBuilder.Generate (correct XML, ordering, directory creation)
  • Unit tests for EsSitemapReader.BuildSearchBody (pagination, hidden filter, escaping)
  • Integration test: run assembler sitemap against a dev ES cluster
  • Verify CI workflow produces correct sitemap after adding the new steps
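
As a reference point for the "correct XML, ordering" checks in the test plan, the following is a rough sketch of sitemap generation with per-URL lastmod values. It is not the PR's SitemapBuilder.Generate: the SitemapXml name, the tuple input, and sorting by URL are assumptions, and writing the file and creating the output directory are omitted.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

// Hypothetical stand-in for the PR's SitemapBuilder.
public static class SitemapXml
{
    // Namespace defined by the sitemaps.org protocol.
    private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    public static XDocument Build(IEnumerable<(string Url, DateTimeOffset LastUpdated)> entries)
    {
        var urls = entries
            // Ordinal sort keeps the output stable between runs (assumed ordering).
            .OrderBy(e => e.Url, StringComparer.Ordinal)
            .Select(e => new XElement(Ns + "url",
                new XElement(Ns + "loc", e.Url),
                // lastmod as a UTC W3C datetime, e.g. 2025-01-02T01:04:05Z.
                new XElement(Ns + "lastmod",
                    e.LastUpdated.ToUniversalTime()
                        .ToString("yyyy-MM-dd'T'HH:mm:ss'Z'", CultureInfo.InvariantCulture))));

        return new XDocument(
            new XDeclaration("1.0", "utf-8", null),
            new XElement(Ns + "urlset", urls));
    }
}
```

Calling SitemapXml.Build(entries).Save(path) would write the document to disk; a unit test can instead inspect the returned XDocument directly.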

🤖 Generated with Claude Code

Add a new CLI command that queries the Elasticsearch lexical index for
real last_updated dates and generates sitemap.xml with correct per-URL
lastmod values, replacing the DateTime.UtcNow placeholder.

The build command still produces a sitemap with DateTime.UtcNow for
backwards compatibility. The new `assembler sitemap` command overwrites
it with correct dates when run after ES indexing in the CI workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

Label error. Requires exactly 1 of: automation, breaking, bug, changelog:skip, chore, ci, dependencies, documentation, enhancement, feature, fix, redesign. Found:

Comment on lines +47 to +65
foreach (var hit in hits)
{
    if (hit is not IDictionary<string, object> dict
        || dict["_source"] is not IDictionary<string, object> source)
        continue;

    var url = source["url"]?.ToString();
    var lastUpdatedStr = source["last_updated"]?.ToString();

    if (url is null || lastUpdatedStr is null)
        continue;

    // Use sort array from the hit for search_after cursor
    if (dict.TryGetValue("sort", out var sortObj) && sortObj is object[] sortValues)
        lastSortValues = sortValues;

    var lastUpdated = DateTimeOffset.Parse(lastUpdatedStr, CultureInfo.InvariantCulture);
    yield return new SitemapEntry(url, lastUpdated);
}
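
One possible hardening of the loop above, offered as a sketch rather than as the change requested in review. In the quoted hunk, a hit that is skipped never updates the search_after cursor, the indexer lookups can throw if the dictionary implementation throws on missing keys, and DateTimeOffset.Parse throws on a malformed date. HitParsing, ParsePage, and the SitemapEntry record below are stand-ins that mirror the shapes visible in the diff, and the method returns a list with an out cursor instead of using yield return.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Globalization;

// Stand-in for the PR's SitemapEntry, matching the constructor call in the diff.
public sealed record SitemapEntry(string Url, DateTimeOffset LastUpdated);

public static class HitParsing
{
    // Parses one page of hits and reports the cursor for the next page.
    public static List<SitemapEntry> ParsePage(IEnumerable<object?> hits, out object[]? lastSortValues)
    {
        lastSortValues = null;
        var entries = new List<SitemapEntry>();

        foreach (var hit in hits)
        {
            if (hit is not IDictionary<string, object> dict)
                continue;

            // Advance the cursor before any skip, so a bad final hit cannot stall paging.
            if (dict.TryGetValue("sort", out var sortObj) && sortObj is object[] sortValues)
                lastSortValues = sortValues;

            // TryGetValue avoids an exception when a key is absent.
            if (!dict.TryGetValue("_source", out var sourceObj)
                || sourceObj is not IDictionary<string, object> source)
                continue;

            var url = source.TryGetValue("url", out var u) ? u?.ToString() : null;
            var raw = source.TryGetValue("last_updated", out var l) ? l?.ToString() : null;

            // Skip hits with a missing URL or an unparseable date instead of throwing.
            if (string.IsNullOrEmpty(url)
                || !DateTimeOffset.TryParse(raw, CultureInfo.InvariantCulture,
                       DateTimeStyles.AssumeUniversal, out var lastUpdated))
                continue;

            entries.Add(new SitemapEntry(url, lastUpdated));
        }

        return entries;
    }
}
```

Whether silently skipping a malformed document is preferable to failing the run is a design choice; logging the skipped URL would make such cases visible in CI.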
Comment on lines +104 to +107
catch (Exception ex)
{
    logger.LogWarning(ex, "Failed to close PIT (non-fatal)");
}