A Spark SQL–based query engine for analyzing MusicBrainz data, served over gRPC. It is designed to support concurrent top-ten artist queries against a local PostgreSQL mirror of the MusicBrainz database (see the MusicBrainz Docker mirror project).
- Spark SQL engine for large-scale querying of MusicBrainz data
- gRPC API for remote, language-agnostic access to query results
- Docker and Docker Compose setup for local development
.
├── client/ # gRPC client code and examples
├── proto/ # Protocol buffers definitions
├── query_engine/ # Spark SQL application logic
├── server/ # gRPC server implementation
├── jars/ # Compiled dependencies and artifacts
├── Dockerfile # Container image definition
├── docker-compose.yml # Development compose configuration
├── requirements.txt # Python service dependencies
└── setup_musicbrainz_lite.sh # Optional local MusicBrainz mirror setup
- Python 3
- Java (JDK 8/11/17 depending on your Spark/PySpark version)
- PySpark (installed via pip/requirements.txt)
- PostgreSQL with a MusicBrainz mirror (local container or local install)
- PostgreSQL JDBC driver jar
- Create a local MusicBrainz PostgreSQL mirror (use setup_musicbrainz_lite.sh or external tools).
- Build Spark query engine artifacts and gRPC server code.
- Configure database connection in server settings.
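As a rough illustration of the configuration step above, the connection settings could be collected into the options that Spark's JDBC data source expects. This is a minimal sketch, not the project's actual settings code: the `MB_DB_*` environment variable names and their default values are assumptions made for the example, while `url`, `dbtable`, `user`, `password`, and `driver` are the standard Spark JDBC option keys.

```python
import os


def jdbc_options(table, env=None):
    """Build Spark JDBC options for a PostgreSQL MusicBrainz mirror.

    The MB_DB_* variable names and their defaults are hypothetical;
    adjust them to match the real server settings.
    """
    env = os.environ if env is None else env
    host = env.get("MB_DB_HOST", "localhost")
    port = env.get("MB_DB_PORT", "5432")  # default PostgreSQL port
    name = env.get("MB_DB_NAME", "musicbrainz_db")
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{name}",
        "dbtable": table,
        "user": env.get("MB_DB_USER", "musicbrainz"),
        "password": env.get("MB_DB_PASSWORD", ""),
        "driver": "org.postgresql.Driver",  # class provided by the PostgreSQL JDBC jar
    }
```

With a running SparkSession, a dictionary like this would typically be passed as `spark.read.format("jdbc").options(**jdbc_options("artist")).load()`, provided the PostgreSQL JDBC driver jar is on Spark's classpath.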
docker compose up --build

Use the provided client stubs (in client/) for sending queries to the server. See proto/ for RPC definitions.
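Because the engine is meant to handle concurrent queries, a client might fan several requests out over a thread pool. The sketch below shows only that pattern: `run_concurrently` and `fake_top_ten` are hypothetical names, and `fake_top_ten` is a stand-in that returns placeholder data rather than calling the real stubs, whose method names are defined in proto/ and are not assumed here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(query_fn, requests, max_workers=4):
    """Call query_fn once per request in a thread pool.

    Results are returned in the same order as the requests.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(query_fn, requests))


def fake_top_ten(request_id):
    """Hypothetical stand-in for a stub call; returns ten placeholder names."""
    return [f"{request_id}-artist-{rank}" for rank in range(1, 11)]


if __name__ == "__main__":
    results = run_concurrently(fake_top_ten, ["q1", "q2", "q3"])
    for request_id, artists in zip(["q1", "q2", "q3"], results):
        print(request_id, artists[:3])
```

In a real client, `query_fn` would wrap a call on the stub generated from the definitions in proto/. A gRPC channel in Python can generally be shared across threads, so one channel could plausibly serve all workers.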
- Query logic lives in query_engine/.
- The protocol definitions in proto/ drive both the server and client interfaces.
- Docker provides a consistent environment for local testing.
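To give a sense of what the query logic might look like, here is a sketch of one possible top-ten artist query. It is not taken from query_engine/: ranking artists by number of credited recordings is an assumption, the three tables are heavily simplified versions of the MusicBrainz `artist`, `artist_credit_name`, and `recording` tables, and SQLite stands in for Spark SQL so the example runs without a cluster.

```python
import sqlite3

# Assumed ranking: artists ordered by how many recordings credit them.
TOP_ARTISTS_SQL = """
SELECT a.name, COUNT(*) AS recording_count
FROM artist a
JOIN artist_credit_name acn ON acn.artist = a.id
JOIN recording r ON r.artist_credit = acn.artist_credit
GROUP BY a.id, a.name
ORDER BY recording_count DESC, a.name
LIMIT 10
"""


def make_toy_db():
    """Create an in-memory database with a tiny, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE artist_credit_name (artist_credit INTEGER, artist INTEGER);
        CREATE TABLE recording (id INTEGER PRIMARY KEY, artist_credit INTEGER);
    """)
    conn.executemany("INSERT INTO artist VALUES (?, ?)",
                     [(1, "Artist A"), (2, "Artist B")])
    conn.executemany("INSERT INTO artist_credit_name VALUES (?, ?)",
                     [(10, 1), (20, 2)])
    conn.executemany("INSERT INTO recording VALUES (?, ?)",
                     [(100, 10), (101, 10), (102, 20)])
    return conn


def top_artists(conn):
    """Return (name, recording_count) rows, highest count first."""
    return conn.execute(TOP_ARTISTS_SQL).fetchall()


if __name__ == "__main__":
    for name, count in top_artists(make_toy_db()):
        print(name, count)
```

The same statement should also be expressible in Spark SQL once the corresponding tables are loaded over JDBC and registered as temporary views, though the project's actual queries may rank artists differently.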