@mishig25 mishig25 commented Nov 13, 2025

Summary

This PR improves on #506.

This PR introduces a new command and workflow for populating the search engine with documentation from the HF doc-build dataset.

Changes

  • New markdown splitting logic: Split markdown content by headings (h1-h6) with hierarchy preservation
  • **New module `process_hf_docs.py`**: Fetches and processes pre-built docs from the hf-doc-build/doc-build dataset
  • **New CLI command `populate-search-engine`**: Replaces the old `build-embeddings` command (which has been removed)
  • New GitHub Actions workflow: Automated documentation processing with for dependency management
  • Skip embeddings flag: Allows testing doc processing without generating embeddings

Key Features

  1. Downloads pre-built documentation from HuggingFace dataset API
  2. Processes only markdown files (ignores HTML)
  3. Splits markdown by headings while maintaining hierarchy
  4. Generates proper URLs matching the HF docs structure
  5. Supports selective library processing for testing
  6. Uses for fast dependency management
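The heading-based splitting described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Chunk` type and the `split_markdown_by_headings` function are hypothetical names, and the sketch assumes ATX-style headings (`#` through `######`) and treats lines inside fenced code blocks as body text rather than headings.

```python
import re
from dataclasses import dataclass

# ATX headings only ("#" to "######"); an assumption of this sketch.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE_RE = re.compile(r"^\s*(```|~~~)")


@dataclass
class Chunk:
    """Hypothetical chunk type: heading path (outermost first) plus body text."""
    headings: list
    text: str


def split_markdown_by_headings(markdown: str) -> list:
    chunks = []
    stack = []      # (level, title) pairs for the currently open headings
    body = []       # lines collected since the last heading
    in_fence = False

    def flush():
        # Emit the collected body under the heading path that was open for it.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk([title for _, title in stack], text))
        body.clear()

    for line in markdown.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its full heading path, so a section under `## Install` inside `# Title` is stored with both titles. Keeping the path alongside the text is one way to preserve hierarchy for later URL or anchor generation.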

Testing

The workflow includes a skip-embeddings flag for initial testing without the full embedding pipeline.
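As a rough illustration of how a testing flag like this can gate the embedding step, here is a small sketch. It is not the CLI from this PR: the `--skip-embeddings` spelling, the `build_parser` and `run` helpers, and the `embed` callback are all assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real command's options may differ.
    parser = argparse.ArgumentParser(prog="populate-search-engine")
    parser.add_argument(
        "--skip-embeddings",
        action="store_true",
        help="process docs but do not generate embeddings",
    )
    return parser


def run(argv, chunks, embed=None):
    """Pair each chunk with its embedding, or with None when skipping."""
    args = build_parser().parse_args(argv)
    if args.skip_embeddings:
        # Doc processing still runs; only the embedding call is bypassed.
        return [(chunk, None) for chunk in chunks]
    return [(chunk, embed(chunk)) for chunk in chunks]
```

With the flag set, the download and chunking path can be exercised end to end without calling an embedding backend, which keeps a first test run cheap.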

- Add new markdown splitting logic based on headings (h1-h6)
- Implement process_hf_docs.py to fetch and process docs from HF dataset
- Add populate-search-engine CLI command (replaces build-embeddings)
- Add GitHub Actions workflow for automated doc processing
- Support downloading pre-built docs from hf-doc-build/doc-build dataset
- Handle markdown chunking with heading hierarchy preservation
- Add skip-embeddings flag for testing without embedding generation
…workflow was responsible for building embeddings for various Hugging Face repositories and has been deprecated in favor of the new populate-search-engine command and workflow.
- Gradio docs use separate JSON format from gradio/docs dataset
- Gradio is not in hf-doc-build dataset, needs separate processing
- Uncomment cleanup-job and update to depend on both jobs
- Remove accidental 'on: push' trigger
@mishig25 mishig25 force-pushed the add-populate-search-engine branch from 80954a9 to 79219c3 December 15, 2025 09:44
…k" in their names during processing. Added logging for skipped libraries.
@mishig25 mishig25 force-pushed the add-populate-search-engine branch from aa89954 to f1d3299 December 15, 2025 09:56
…ine workflow and simplify command execution. This change streamlines the workflow by eliminating unnecessary complexity related to library filtering.
This reverts commit 37a1630.
This reverts commit a6f3bc5.
@mishig25 mishig25 merged commit 833093d into main Dec 15, 2025
4 checks passed
@mishig25 mishig25 deleted the add-populate-search-engine branch December 15, 2025 11:44