@mare5x mare5x commented Oct 23, 2025

Adapted the schema inspection from Lighthouse to portus. Closes #23.

The PR is huge mostly due to copying a lot of files from Lighthouse.

You can review either this master PR or start with smaller ones (but later PRs make modifications to earlier ones):

  1. DiskCache (DiskCache from Lighthouse #37)
  2. DataSource files (Copy DataSource from Lighthouse #38)
  3. DuckDBCollection (DuckDBCollection data source #39)
  4. DataEngine (Adapt DataEngine from Lighthouse #40)
  5. Sync and async API for DataEngine and DataSource (Sync and async API for DataEngine and DataSource #36)

TODO:

  • Copy all tests (other tests require jetstat)
  • Full Lighthouse agent capabilities (at least list_all_tables): More Lighthouse agent features #43
  • Clean up docstrings and comments so they do not reference Lighthouse internals?

mare5x added 30 commits October 21, 2025 09:26
Apply necessary ruff and mypy changes.
# Conflicts:
#	portus/__init__.py
#	pyproject.toml
#	uv.lock
Not hooked up to agents yet.
Remove query caching
@mare5x mare5x marked this pull request as ready for review October 23, 2025 16:05

mare5x commented Oct 23, 2025

There is potentially a lot we could simplify, like removing SemanticDict from portus. But in that case, we can't reuse portus as a library in Lighthouse (without major changes in LH) and we end up with two similar but diverging code bases. WDYT?

mare5x commented Oct 23, 2025

This is what the inspected schema looks like for demo.py: *NEW example in #45

# Database schema
 Database type: duckdb

## Table `db1.public.netflix_shows`
 - `show_id` (VARCHAR) Contains no duplicated values. Contains no null values. Contains strings of character length varying between 2 and 5 (the mean being 4.87). The column consists of strings which contain:- only lowercase characters- numbers. 
 - `type` (VARCHAR) Contains 2 unique values. Contains no null values. Contains strings of character length varying between 5 and 7 (the mean being 5.61).  Values: Movie, TV Show
 - `title` (VARCHAR) Contains no duplicated values. Contains no null values. Contains strings of character length varying between 1 and 104 (the mean being 17.73). The column consists of strings which contain:- numbers- punctuation. 
 - `director` (VARCHAR) Contains duplicate values. Contains 29.91% null values. Contains strings of character length varying between 2 and 208 (the mean being 15.37). The column consists of strings which contain:- numbers- punctuation. 
 - `cast_members` (VARCHAR) Contains duplicate values. Contains 9.37% null values. Contains strings of character length varying between 3 and 771 (the mean being 119.75). The column consists of strings which contain:- numbers- punctuation. 
 - `country` (VARCHAR) Contains 748 unique values. Contains 9.44% null values. Contains strings of character length varying between 4 and 123 (the mean being 12.58). The column consists of strings which contain:- punctuation. 
 - `date_added` (DATE) Contains duplicate values. Contains almost no null values. 
 - `release_year` (INTEGER) Contains 74 unique values. Contains no null values. The values vary between 1925.0 and 2021.0 with a mean of 2014.18.
 - `rating` (VARCHAR) Contains 17 unique values. Contains almost no null values. Contains strings of character length varying between 1 and 8 (the mean being 4.43). The column consists of strings which contain:- numbers- punctuation. 
 - `duration` (VARCHAR) Contains 220 unique values. Contains almost no null values. Contains strings of character length varying between 5 and 10 (the mean being 7.04). The column consists of strings which contain:- numbers. 
 - `listed_in` (VARCHAR) Contains 514 unique values. Contains no null values. Contains strings of character length varying between 6 and 79 (the mean being 33.41). The column consists of strings which contain:- punctuation. 
 - `description` (VARCHAR) Contains duplicate values. Contains no null values. Contains strings of character length varying between 61 and 248 (the mean being 143.3). The column consists of strings which contain:- numbers- punctuation. 

## Table `temp.main.df1`
 - `show_id` (VARCHAR) Contains no duplicated values. Contains no null values. Contains strings of character length varying between 4 and 5 (the mean being 4.67). The column consists of strings which contain:- only lowercase characters- numbers.  Values: s1032, s1253, s706
 - `cancelled` (BOOLEAN) Contains duplicate values. Contains no null values. 
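
For readers who have not seen the Lighthouse inspection code, here is a minimal sketch of how per-column sentences like the ones above could be derived. It is not the implementation in this PR: `describe_column` is a hypothetical helper, it works on a plain Python sequence rather than a DuckDB column, and it covers only three of the statistics shown (duplicates, null share, string length).

```python
from statistics import mean
from typing import Any, Optional, Sequence


def describe_column(values: Sequence[Optional[Any]]) -> str:
    """Summarize one column in the style of the dump above (hypothetical helper)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    parts = []

    # Uniqueness: compare the number of distinct non-null values to their count.
    if len(set(non_null)) == len(non_null):
        parts.append("Contains no duplicated values.")
    else:
        parts.append("Contains duplicate values.")

    # Null share, reported as a percentage of all rows.
    null_pct = 100.0 * (total - len(non_null)) / total if total else 0.0
    if null_pct == 0:
        parts.append("Contains no null values.")
    else:
        parts.append(f"Contains {null_pct:.2f}% null values.")

    # String length statistics, only when every non-null value is a string.
    if non_null and all(isinstance(v, str) for v in non_null):
        lengths = [len(v) for v in non_null]
        parts.append(
            "Contains strings of character length varying between "
            f"{min(lengths)} and {max(lengths)} (the mean being {mean(lengths):.2f})."
        )
    return " ".join(parts)


# The three sample values listed for `temp.main.df1.show_id`.
print(describe_column(["s1032", "s1253", "s706"]))
```

On the three `show_id` sample values this reproduces the length range and mean reported for `temp.main.df1` (4 to 5, mean 4.67). The real output also has cases this sketch does not attempt, such as "Contains N unique values" and "almost no null values".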

@kosstbarz
Contributor

Based on the example, I suggest making the schema as small as possible, shortening it until performance starts to drop.
For example:
Contains no duplicated values. -> No duplicates.
Contains no null values. -> No NULLs
Contains strings of character length varying between 2 and 5 (the mean being 4.87). -> String length from 2 to 5, mean 4.87.
Contains duplicate values. -> Duplicates
Contains almost no null values. -> Almost no NULLs
My intuition says such changes should not affect performance.
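
The rewrites suggested above can be sketched as a small substitution table. This is only an illustration: the follow-up commits shorten the text in `column_value_stats_summary.jinja` itself, whereas the hypothetical `shorten` function below post-processes an already rendered description. Trailing periods are kept on every short form (an assumption, so that sentences stay separated).

```python
import re

# Rules mirror the suggested rewrites; names and structure are hypothetical.
_RULES = [
    (re.compile(r"Contains no duplicated values\."), "No duplicates."),
    (re.compile(r"Contains almost no null values\."), "Almost no NULLs."),
    (re.compile(r"Contains no null values\."), "No NULLs."),
    (re.compile(r"Contains duplicate values\."), "Duplicates."),
    (
        re.compile(
            r"Contains strings of character length varying between "
            r"(\d+) and (\d+) \(the mean being ([\d.]+)\)\."
        ),
        r"String length from \1 to \2, mean \3.",
    ),
]


def shorten(description: str) -> str:
    """Apply every rewrite rule to a rendered column description."""
    for pattern, replacement in _RULES:
        description = pattern.sub(replacement, description)
    return description


verbose = (
    "Contains no duplicated values. Contains no null values. "
    "Contains strings of character length varying between 2 and 5 "
    "(the mean being 4.87)."
)
print(shorten(verbose))
```

Whether the shorter wording affects agent performance would still need to be measured; the sketch only shows that the mapping is mechanical.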

@mare5x mare5x mentioned this pull request Oct 24, 2025
4 tasks
* Make column_value_stats_summary.jinja more readable

* Shorten column_value_stats_summary.jinja text descriptions

* Fix missing DatabaseSchema.description

* Simplify "Single constant" output
# Conflicts:
#	examples/ecom-customers.ipynb
#	portus/core/session.py
# Conflicts:
#	portus/core/pipe.py

mare5x commented Nov 5, 2025

> There is potentially a lot we could simplify, like removing SemanticDict from portus. But in that case, we can't reuse portus as a library in Lighthouse (without major changes in LH) and we end up with two similar but diverging code bases. WDYT?

We discussed this and agreed that we will not force compatibility with Lighthouse; we would rather let the code bases diverge. Therefore, I will simplify the code and remove the unnecessary baggage left over from the Lighthouse repo.

@mare5x mare5x marked this pull request as draft November 5, 2025 18:04
@mare5x mare5x marked this pull request as ready for review November 5, 2025 21:59
@Rauf-Kurbanov
Collaborator

@mare5x @kosstbarz I took another critical look at the PR and reflected on our conversation. I think we have reached the point where we're confident enough to diverge from the Lighthouse codebase and have fewer features, to reduce the repo entropy.

Given that, let's drop most of this PR and keep only the disk cache part. Everything else is not relevant for our user in the foreseeable future.

@Rauf-Kurbanov Rauf-Kurbanov left a comment

Let's keep only the disk cache mini-PR and drop everything else.

@mare5x mare5x closed this Nov 7, 2025
@mare5x mare5x deleted the lh-inspection/all-in-one branch November 21, 2025 16:14
Development

Successfully merging this pull request may close these issues.

Port database schema inspection from LH
