Open Cinema Index

Open Cinema Index (OCI) is a data ingestion and enrichment pipeline for building a structured, open index of films and related entities. It is designed for research, recommendation engines, and archival purposes. The repository contains scripts to fetch film data from open sources, enrich it with metadata, and prepare it for downstream applications.

It is not a recommendation engine, a rating platform, or an editorial system. Its sole responsibility is to build a reliable, inspectable index of film knowledge that other systems can depend on.

OCI is designed to treat cinema as it actually exists: messy, disputed, multilingual, and full of partial truths.

Features

Fetch films from Wikidata by year and retrieve basic metadata
Enrich films with properties such as genre, language, and age ratings
Designed for offline processing; production applications consume pre-built datasets

How OCI Thinks About Films

OCI is built around a few guiding ideas:

Films are stable entities; facts about them are not
Titles, genres, runtimes, and even credits are claims, not facts
Different sources disagree, and that disagreement is meaningful

Rather than flattening everything into a single record, OCI keeps track of:

who said what
when they said it
and how confident we are

Ambiguity is preserved, not "cleaned up".

See the Film Schema for more details.
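As a rough illustration of the "claims, not facts" idea, here is a minimal sketch of what a claim record could look like. The `Claim` class, its field names, and the `claims_for` helper are hypothetical and are not taken from the actual Film Schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """One source's statement about one property of a film (hypothetical shape)."""
    film_id: str                        # the stable entity the claim is about
    prop: str                           # e.g. "title" or "runtime_minutes"
    value: str                          # what was said
    source: str                         # who said it
    fetched_at: datetime                # when they said it
    confidence: Optional[float] = None  # how confident we are; None = not assessed


def claims_for(claims: list[Claim], film_id: str, prop: str) -> list[Claim]:
    """Return every claim about a property, oldest first, without picking a winner."""
    matching = [c for c in claims if c.film_id == film_id and c.prop == prop]
    return sorted(matching, key=lambda c: c.fetched_at)


claims = [
    Claim("film-1", "runtime_minutes", "116", "wikidata",
          datetime(2024, 1, 2, tzinfo=timezone.utc), 0.8),
    Claim("film-1", "runtime_minutes", "118", "tmdb",
          datetime(2024, 1, 1, tzinfo=timezone.utc)),
]
runtimes = claims_for(claims, "film-1", "runtime_minutes")
```

Both runtimes survive side by side; nothing in this sketch decides which one is "correct".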

The Ingestion Pipeline

OCI is structured as a pipeline of explicit, repeatable steps:

fetch -> normalize -> resolve -> enrich -> export

Each step has a narrow responsibility.
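The step order above can be sketched as a list of small functions applied in sequence. This is an illustrative sketch only: the function bodies, the `Records` alias, and the `run` helper are placeholders, not OCI's actual modules or CLI.

```python
from typing import Callable

Records = list[dict]
Step = Callable[[Records], Records]


def fetch(_: Records) -> Records:
    # Stand-in for a real source call: returns raw data untouched.
    return [{"Title": " Stalker ", "src": "wikidata"}]


def normalize(records: Records) -> Records:
    # Map raw keys onto a (hypothetical) canonical shape.
    return [{"title": r["Title"].strip(), "source": r["src"]} for r in records]


def resolve(records: Records) -> Records:
    # Placeholder: a real step would handle duplicates and collisions.
    return records


def enrich(records: Records) -> Records:
    # Add secondary metadata without touching existing keys.
    return [{**r, "genres": r.get("genres", [])} for r in records]


def export(records: Records) -> Records:
    # Placeholder: a real step would write files for downstream systems.
    return records


PIPELINE: list[tuple[str, Step]] = [
    ("fetch", fetch),
    ("normalize", normalize),
    ("resolve", resolve),
    ("enrich", enrich),
    ("export", export),
]


def run(pipeline: list[tuple[str, Step]]) -> tuple[Records, list[str]]:
    """Apply each step to the previous step's output and record the order."""
    data: Records = []
    executed: list[str] = []
    for name, step in pipeline:
        data = step(data)
        executed.append(name)
    return data, executed


result, executed = run(PIPELINE)
```

Keeping every step behind the same signature makes it easy to rerun or inspect a single stage in isolation.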

Fetch

Retrieves raw data from a source without interpretation or transformation.

Default Data Sources

OCI ships with two preconfigured sources:

tmdb — REST (https://api.themoviedb.org/3), seeded with a safe 40-requests-per-10-seconds limit (matching historical official limits).
wikidata — REST (https://query.wikidata.org/sparql), seeded with a 1-request-per-1-second limit (matching WDQS usage policy).

After running migrations, you can adjust each source's limits, capabilities, refresh policies, and credentials, or disable a source entirely, by editing the corresponding rows.
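To make the seeded limits concrete, here is a minimal sliding-window limiter sketch. The `SlidingWindowLimiter` class is hypothetical and is not OCI's implementation; it only shows what "40 requests per 10 seconds" means in practice. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_requests` within any `window_seconds` span."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._sent: deque = deque()  # timestamps of granted requests

    def try_acquire(self) -> bool:
        """Return True and record the request if it fits in the window."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            self._sent.append(now)
            return True
        return False


# Simulated clock, so the example runs instantly.
fake_now = [0.0]
tmdb_limiter = SlidingWindowLimiter(40, 10.0, clock=lambda: fake_now[0])

granted = sum(tmdb_limiter.try_acquire() for _ in range(45))  # burst at t=0
fake_now[0] = 10.0                                            # window has passed
granted_later = tmdb_limiter.try_acquire()
```

A caller that gets `False` back would typically wait and retry rather than drop the request.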

Normalize

Maps raw data into OCI's canonical schema.

Resolve

Handles duplicates, identity collisions, and uncertainty between entities.
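One way to picture this step is a sketch that groups candidate duplicates and flags identity collisions instead of silently merging them. The record keys (`title`, `year`, `external_id`) and the status labels are assumptions made for illustration; OCI's real matching rules may differ.

```python
from collections import defaultdict


def group_candidates(records: list[dict]) -> dict:
    """Group records that may describe the same film by (casefolded title, year)."""
    groups = defaultdict(list)
    for record in records:
        key = (record["title"].strip().casefold(), record.get("year"))
        groups[key].append(record)
    return dict(groups)


def resolve(records: list[dict]) -> list[dict]:
    """Label each group instead of collapsing it, so uncertainty stays visible."""
    resolved = []
    for key, members in group_candidates(records).items():
        ids = {m["external_id"] for m in members if m.get("external_id")}
        if len(ids) > 1:
            status = "collision"   # same title and year, different identities
        elif len(members) > 1:
            status = "merged"      # several records, one identity
        else:
            status = "single"
        resolved.append({"key": key, "status": status, "members": members})
    return resolved


records = [
    {"title": "Solaris", "year": 1972, "external_id": "A1", "source": "wikidata"},
    {"title": "solaris ", "year": 1972, "external_id": "A1", "source": "tmdb"},
    {"title": "Solaris", "year": 2002, "external_id": "B7", "source": "tmdb"},
    {"title": "Hamlet", "year": 2000, "external_id": "C3", "source": "wikidata"},
    {"title": "Hamlet", "year": 2000, "external_id": "D9", "source": "tmdb"},
]
statuses = [group["status"] for group in resolve(records)]
```

Every original record is kept in `members`, so a later step or a human reviewer can revisit the decision.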

Enrich

Adds secondary metadata (genres, assets, keywords, etc.) additively.
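"Additively" can be read as: enrichment may add new keys or extend lists, but never overwrites what is already there. The sketch below shows that reading; the `enrich` function and the record shape are hypothetical. A conflicting scalar value is simply left alone here, since handling disagreement belongs to other parts of the pipeline.

```python
def enrich(record: dict, extra: dict) -> dict:
    """Return a copy of `record` with `extra` merged in without overwriting."""
    enriched = dict(record)
    for key, value in extra.items():
        if key not in enriched:
            # New information: add it.
            enriched[key] = value
        elif isinstance(enriched[key], list) and isinstance(value, list):
            # Lists grow; existing items keep their position.
            enriched[key] = enriched[key] + [v for v in value if v not in enriched[key]]
        # Otherwise the existing value is left untouched.
    return enriched


film = {"title": "Stalker", "genres": ["science fiction"]}
extra = {
    "title": "Сталкер",
    "genres": ["drama", "science fiction"],
    "keywords": ["zone"],
}
enriched = enrich(film, extra)
```

The input record is not mutated, which keeps the step repeatable.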

Export

Emits the indexed data in formats suitable for downstream systems.
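As an example of a downstream-friendly format, the sketch below writes records as JSON Lines, one record per line. The choice of JSON Lines and the `export_jsonl` helper are assumptions for illustration; the formats OCI actually emits are not specified here.

```python
import io
import json
from typing import Iterable, TextIO


def export_jsonl(records: Iterable[dict], out: TextIO) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for record in records:
        # ensure_ascii=False keeps non-Latin titles readable in the output;
        # sort_keys=True makes repeated exports byte-for-byte comparable.
        out.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")
        count += 1
    return count


buffer = io.StringIO()
written = export_jsonl([{"year": 1979, "title": "Сталкер"}], buffer)
```

Writing to any text stream means the same function works for files, standard output, or in-memory buffers.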

Provenance and Confidence

Every piece of data stored by OCI is associated with:

a source (see Data Sources)
a fetch timestamp
an optional confidence level

Conflicting data is expected and preserved. "Unknown" and "uncertain" are valid outcomes.
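The sketch below shows how "unknown" and "uncertain" can be treated as ordinary results rather than errors. The `summarize` function and its status labels are hypothetical and only illustrate the principle.

```python
from typing import Optional


def summarize(values_by_source: dict[str, Optional[str]]) -> dict:
    """Describe what the sources say about one property without choosing a winner."""
    known = {src: v for src, v in values_by_source.items() if v is not None}
    distinct = set(known.values())
    if not distinct:
        status = "unknown"      # no source has a value
    elif len(distinct) == 1:
        status = "agreed"       # every source that answered says the same thing
    else:
        status = "uncertain"    # sources disagree; keep all values
    return {"status": status, "values": known}


disputed = summarize({"wikidata": "116", "tmdb": "118"})
missing = summarize({"wikidata": None})
```

In the disputed case both values remain available, keyed by the source that reported them.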

Why This Exists

Film culture is broader and stranger than most databases allow.

Many existing systems:

flatten ambiguity
privilege a single source
erase minority or regional perspectives

OCI exists to preserve the richness of cinema history without pretending it's tidy.

Project Status

Open Cinema Index is under active development.

The schema and CLI are expected to evolve. Detailed CLI instructions can be found in the CLI Usage Guide.

Contributions are welcome, especially those that respect the project's archival philosophy.
