Schema inspection from Lighthouse (all in one) #42

mare5x · 2025-10-23T15:59:00Z

Adapted the schema inspection from Lighthouse to portus. Closes #23.

The PR is huge mostly due to copying a lot of files from Lighthouse.

You can review either this master PR or start with smaller ones (but later PRs make modifications to earlier ones):

- DiskCache: DiskCache from Lighthouse #37
- DataSource files: Copy DataSource from Lighthouse #38
- DuckDBCollection: DuckDBCollection data source #39
- ~~DataEngine: Adapt DataEngine from Lighthouse #40~~
- Sync and async API for ~~DataEngine~~ and DataSource: Sync and async API for DataEngine and DataSource #36

TODO:

- Copy all tests (other tests require jetstat)
- Full Lighthouse agent capabilities (at least list_all_tables): More Lighthouse agent features #43
- Clean up docstrings and comments so they do not reference Lighthouse internals?

mare5x · 2025-10-23T16:09:49Z

There is potentially a lot we could simplify, like removing SemanticDict from portus. But in that case, we can't reuse portus as a library in Lighthouse (without major changes in LH) and we end up with two similar but diverging code bases. WDYT?

mare5x · 2025-10-23T16:18:45Z

This is what the inspected schema looks like for demo.py: *NEW example in #45

# Database schema
 Database type: duckdb

## Table `db1.public.netflix_shows`
 - `show_id` (VARCHAR) Contains no duplicated values. Contains no null values. Contains strings of character length varying between 2 and 5 (the mean being 4.87). The column consists of strings which contain:- only lowercase characters- numbers. 
 - `type` (VARCHAR) Contains 2 unique values. Contains no null values. Contains strings of character length varying between 5 and 7 (the mean being 5.61).  Values: Movie, TV Show
 - `title` (VARCHAR) Contains no duplicated values. Contains no null values. Contains strings of character length varying between 1 and 104 (the mean being 17.73). The column consists of strings which contain:- numbers- punctuation. 
 - `director` (VARCHAR) Contains duplicate values. Contains 29.91% null values. Contains strings of character length varying between 2 and 208 (the mean being 15.37). The column consists of strings which contain:- numbers- punctuation. 
 - `cast_members` (VARCHAR) Contains duplicate values. Contains 9.37% null values. Contains strings of character length varying between 3 and 771 (the mean being 119.75). The column consists of strings which contain:- numbers- punctuation. 
 - `country` (VARCHAR) Contains 748 unique values. Contains 9.44% null values. Contains strings of character length varying between 4 and 123 (the mean being 12.58). The column consists of strings which contain:- punctuation. 
 - `date_added` (DATE) Contains duplicate values. Contains almost no null values. 
 - `release_year` (INTEGER) Contains 74 unique values. Contains no null values. The values vary between 1925.0 and 2021.0 with a mean of 2014.18.
 - `rating` (VARCHAR) Contains 17 unique values. Contains almost no null values. Contains strings of character length varying between 1 and 8 (the mean being 4.43). The column consists of strings which contain:- numbers- punctuation. 
 - `duration` (VARCHAR) Contains 220 unique values. Contains almost no null values. Contains strings of character length varying between 5 and 10 (the mean being 7.04). The column consists of strings which contain:- numbers. 
 - `listed_in` (VARCHAR) Contains 514 unique values. Contains no null values. Contains strings of character length varying between 6 and 79 (the mean being 33.41). The column consists of strings which contain:- punctuation. 
 - `description` (VARCHAR) Contains duplicate values. Contains no null values. Contains strings of character length varying between 61 and 248 (the mean being 143.3). The column consists of strings which contain:- numbers- punctuation. 

## Table `temp.main.df1`
 - `show_id` (VARCHAR) Contains no duplicated values. Contains no null values. Contains strings of character length varying between 4 and 5 (the mean being 4.67). The column consists of strings which contain:- only lowercase characters- numbers.  Values: s1032, s1253, s706
 - `cancelled` (BOOLEAN) Contains duplicate values. Contains no null values.
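
For readers who want to see where descriptions like the ones above come from, here is a minimal sketch of how the uniqueness, null share, and string length parts could be derived for a single VARCHAR column. It is an illustration only: `summarize_string_column` is a hypothetical helper, it works on an in-memory list instead of querying DuckDB, and it does not reproduce the rule the real inspection uses to choose between "Contains duplicate values." and "Contains N unique values."

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_string_column(values: Sequence[Optional[str]]) -> str:
    """Describe a string column in the verbose style shown above (hypothetical helper)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    parts = []

    # Uniqueness: compare the distinct count with the number of non-null values.
    # Assumption: the real inspection may apply a different rule here.
    distinct = len(set(non_null))
    if distinct == len(non_null):
        parts.append("Contains no duplicated values.")
    else:
        parts.append(f"Contains {distinct} unique values.")

    # Null share, reported as a percentage of all rows.
    null_pct = 100 * (total - len(non_null)) / total if total else 0.0
    if null_pct == 0:
        parts.append("Contains no null values.")
    else:
        parts.append(f"Contains {null_pct:.2f}% null values.")

    # String length range and mean over the non-null values.
    if non_null:
        lengths = [len(v) for v in non_null]
        parts.append(
            "Contains strings of character length varying between "
            f"{min(lengths)} and {max(lengths)} (the mean being {mean(lengths):.2f})."
        )
    return " ".join(parts)


if __name__ == "__main__":
    # The three sample values listed for temp.main.df1.show_id above.
    print(summarize_string_column(["s1032", "s1253", "s706"]))
```

Run on the three `show_id` values from `temp.main.df1`, the sketch reports no duplicates, no nulls, and a length range of 4 to 5.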

kosstbarz · 2025-10-24T09:36:40Z

Based on the example, I suggest making the schema as small as possible, stopping only when performance starts to drop.
For example:
Contains no duplicated values. -> No duplicates.
Contains no null values. -> No NULLs
Contains strings of character length varying between 2 and 5 (the mean being 4.87). -> String length from 2 to 5, mean 4.87.
Contains duplicate values. -> Duplicates
Contains almost no null values. -> Almost no NULLs
My intuition says such changes should not affect performance.
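
The rewrites suggested above can be sketched as a small post-processing step. This is only an illustration under stated assumptions: `shorten_summary` is a hypothetical function, the trailing periods on the compact phrases are added for consistency, and the actual change would more likely be made in the summary template itself than by rewriting its output.

```python
import re

# Fixed phrase replacements taken from the suggestions above
# (trailing periods added here for consistency; that is an assumption).
_PHRASES = {
    "Contains no duplicated values.": "No duplicates.",
    "Contains no null values.": "No NULLs.",
    "Contains duplicate values.": "Duplicates.",
    "Contains almost no null values.": "Almost no NULLs.",
}

# Matches the verbose string length sentence and captures min, max, and mean.
_LENGTH_RE = re.compile(
    r"Contains strings of character length varying between "
    r"(\d+) and (\d+) \(the mean being ([\d.]+)\)\."
)


def shorten_summary(text: str) -> str:
    """Replace verbose column descriptions with compact equivalents (hypothetical helper)."""
    for verbose, compact in _PHRASES.items():
        text = text.replace(verbose, compact)
    return _LENGTH_RE.sub(r"String length from \1 to \2, mean \3.", text)


if __name__ == "__main__":
    verbose = (
        "Contains no duplicated values. Contains no null values. "
        "Contains strings of character length varying between 2 and 5 "
        "(the mean being 4.87)."
    )
    print(shorten_summary(verbose))
```

Comparing the character count of the description before and after gives a rough idea of the prompt savings, which would still need to be checked against agent performance.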

mare5x · 2025-11-05T15:20:14Z

> There is potentially a lot we could simplify, like removing SemanticDict from portus. But in that case, we can't reuse portus as a library in Lighthouse (without major changes in LH) and we end up with two similar but diverging code bases. WDYT?

We discussed this and agreed not to force compatibility with Lighthouse; we would rather let the code bases diverge. Therefore, I will simplify and remove the unnecessary baggage left over from the Lighthouse repo.

Rauf-Kurbanov · 2025-11-07T09:01:23Z

@mare5x @kosstbarz I took another critical look at the PR and reflected on our conversation. I think we have reached the point where we are confident enough to diverge from the Lighthouse codebase and to have fewer features in order to reduce repo entropy.

Given that, let's drop most of this PR and keep only the disk cache part. Everything else is not relevant for our users in the foreseeable future.

Rauf-Kurbanov

Let's keep only the disk cache mini-PR and drop everything else.

mare5x added 30 commits October 21, 2025 09:26

Add dev setup instructions to README

81c1087

Copy DiskCache from LH

a91ff4e

Make DiskCache implement Cache

12e099c

Copy DiskCache tests from LH

7bdc802

fixup! Make DiskCache implement Cache

029437c

Merge DiskCache classes

2260386

Copy relevant data source files from LH (hash 33ffffa4)

93767b2

Apply necessary ruff and mypy changes.

Type alias for SemanticDict

28bbf3c

Consistent use of qualified table names during inspection

c9a0cbf

Add the schema name to the cache key

6a7f61d

Apply nest_asyncio workaround

8d4ea32

# Conflicts: # portus/__init__.py # pyproject.toml # uv.lock

Use qualified table names in schema summaries

87897e9

Compact schema summarization option

e728862

Refactor duckdb schema inspection utils

587ab32

Add sqlalchemy duckdb-engine dependency

1d040b6

Add DuckDBCollection

cbeeaa3

Not hooked up to agents yet.

Copy DataEngine from LH

5f8e9a8

Rename database_schema.py to schema_summary.py

5aeb97d

Move DataSource to the core package

765568e

Copy schema_inspection functions from LH[database_schema.py]

ed61da8

Simplify DataEngine

359b6b1

Remove query caching

Rename data_source package to data

26db856

Use DataEngine in agents

259823b

Merge LH changes from 7250a960

251a441

Remove DataSource.hash

1bc2457

Add execute_sync to DataEngine & DataSource

1cb41b6

Sync and async sqlalchemy_source.py

b1c62a0

Remove schema_inspection.py with cache_final_results

f462e1a

Sync and async DataEngine

12baa9f

Revert "Apply nest_asyncio workaround"

1e6816f

This reverts commit 8d4ea32

mare5x marked this pull request as ready for review October 23, 2025 16:05

mare5x requested review from Rauf-Kurbanov, kosstbarz, mrMakaronka and pickleerik October 23, 2025 16:12

Merge branch 'main' into lh-inspection/all-in-one

f467f48

mare5x mentioned this pull request Oct 24, 2025

Shorter schema inspection summaries #45

Merged

4 tasks

mare5x added 7 commits October 29, 2025 12:08

Shorter schema inspection summaries (#45)

12271e8

* Make column_value_stats_summary.jinja more readable * Shorten column_value_stats_summary.jinja text descriptions * Fix missing DatabaseSchema.description * Simplify "Single constant" output

Merge branch 'main' into lh-inspection/all-in-one

9995d12

# Conflicts: # examples/ecom-customers.ipynb # portus/core/session.py

Merge branch 'main' into lh-inspection/all-in-one

1ced35c

# Conflicts: # portus/core/pipe.py

Merge branch 'main' into lh-inspection/all-in-one

7a3a596

Workaround for DuckDBCollection when streaming

ea0c6e0

README fixup

4bd98a9

Use temporary directories for a file-based DuckDBCollection.

6d924ab

Fixes failing tests.

mare5x added 4 commits November 5, 2025 18:12

Remove LH DiskCache methods

196d40d

Remove SemanticDict

0c9614d

Remove reading data source configs

b9a632e

Rename parameter in docstring

83f9103

mare5x marked this pull request as draft November 5, 2025 18:04

Remove DataEngine and move data connection creation to AgentExecutor

29d8a07

mare5x marked this pull request as ready for review November 5, 2025 21:59

Reuse schema inspection results in AgentExecutor

8901c64

Rauf-Kurbanov reviewed Nov 7, 2025

mare5x closed this Nov 7, 2025

mare5x deleted the lh-inspection/all-in-one branch November 21, 2025 16:14
