This Python demo application scrapes online documentation and YouTube videos and loads them into a Spanner database as a knowledge graph using LangChain.
- Python 3.9+
- uv - An extremely fast Python package installer and resolver.
- A Google Cloud Project.
First, you need to authenticate with Google Cloud. If you have the Google Cloud CLI installed, you can run the following command:
gcloud auth application-default login

You need a Spanner instance with Graph capabilities and a database.
Create a Spanner instance:
gcloud spanner instances create <your-instance-id> --config=regional-us-central1 --description="Spanner KB instance" --nodes=1

Set the instance ID in the .env file.
Create a Spanner database:
gcloud spanner databases create <your-database-id> --instance=<your-instance-id>

Set the database ID in the .env file.
This loader will create the necessary tables and graph.
This application requires the following environment variables to be set to connect to your Spanner database:
- SPANNER_PROJECT: Your Google Cloud project ID.
- SPANNER_INSTANCE: Your Spanner instance ID.
- SPANNER_DATABASE: Your Spanner database ID.
Copy .env.example to a new .env file and set your variables there.
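As a rough illustration of how the three variables above might be read and validated at startup, here is a minimal sketch. The names `SpannerSettings` and `load_settings` are hypothetical and not taken from this project, and the sketch assumes the values are already present in the process environment (for example, exported in the shell or loaded from .env by a helper library).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# The three variables this README lists as required.
REQUIRED_VARS = ("SPANNER_PROJECT", "SPANNER_INSTANCE", "SPANNER_DATABASE")


@dataclass(frozen=True)
class SpannerSettings:
    """Hypothetical container for the connection settings."""
    project: str
    instance: str
    database: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> SpannerSettings:
    """Read the Spanner settings from a mapping (os.environ by default).

    Raises RuntimeError naming every variable that is missing or empty,
    so a misconfigured .env file fails early with a clear message.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return SpannerSettings(
        project=env["SPANNER_PROJECT"],
        instance=env["SPANNER_INSTANCE"],
        database=env["SPANNER_DATABASE"],
    )
```

Calling `load_settings()` with no arguments reads from the real environment; passing a plain dict makes the function easy to exercise in tests.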
It is recommended to use a virtual environment to manage dependencies.
uv venv

This will create a .venv directory in your project folder. Activate the virtual environment:

source .venv/bin/activate

Install the necessary Python packages from the requirements.txt file.
uv pip install -r requirements.txt

Once the setup is complete, you can run the application with the following command:

python main.py

The script will then:
- Scrape the documentation from the predefined list of URLs.
- Create LangChain Document objects.
- Save these documents to the specified Spanner table.
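The middle step above, turning a scraped page into a document object, can be sketched roughly as follows. This is not the project's actual code: `ScrapedDoc` is a hypothetical stand-in that mirrors the `page_content` and `metadata` fields of a LangChain Document, `html_to_doc` is an illustrative helper, and fetching the page and writing to Spanner are left out so the example stays self-contained.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


@dataclass
class ScrapedDoc:
    """Hypothetical stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def html_to_doc(html: str, source_url: str) -> ScrapedDoc:
    """Strip markup from already-fetched HTML and record where it came from."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return ScrapedDoc(
        page_content=" ".join(parser.chunks),
        metadata={"source": source_url},
    )
```

Keeping the source URL in the metadata is a common choice because it lets each stored record be traced back to the page it was scraped from.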
You can also run this application as a Cloud Run job, which is useful for long-running processes.
Enable the Artifact Registry and Cloud Run APIs:
gcloud services enable artifactregistry.googleapis.com run.googleapis.com

Create a repository to store your container images:

gcloud artifacts repositories create <your-repo-name> --repository-format=docker --location=us-central1

Build the container image using Cloud Build and push it to Artifact Registry.
gcloud builds submit --tag us-central1-docker.pkg.dev/<your-project-id>/<your-repo-name>/spanner-kb-generator

Update the Cloud Run job. If the job does not exist yet, use create in place of update. Since this job can take a while, set the task timeout to 1 hour (3600 seconds).

gcloud run jobs update spanner-kb-generator-job --image us-central1-docker.pkg.dev/<your-project-id>/<your-repo-name>/spanner-kb-generator --task-timeout=3600 --region us-central1

Execute the Cloud Run job.
gcloud run jobs execute spanner-kb-generator-job --region us-central1 --wait
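
The image path used in the build and job commands above follows the Artifact Registry pattern of location, project, repository, and image name. If you script these steps, a small helper along these lines can keep the path consistent between commands. The function name `artifact_image_uri` and its defaults are assumptions made for this illustration, not part of the project.

```python
from typing import Optional


def artifact_image_uri(
    project_id: str,
    repo_name: str,
    image: str = "spanner-kb-generator",
    location: str = "us-central1",
    tag: Optional[str] = None,
) -> str:
    """Build a Docker image path of the form
    LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE[:TAG].
    """
    uri = f"{location}-docker.pkg.dev/{project_id}/{repo_name}/{image}"
    # Append the tag only when one is given; the commands above use none.
    return f"{uri}:{tag}" if tag else uri


if __name__ == "__main__":
    # Placeholder values; substitute your own project and repository.
    print(artifact_image_uri("my-project", "my-repo"))
```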