Apps #32
base: dev
Conversation
Could you take a look at the items I highlighted, and we can sync on it?
| Catalog Size | Discovery Time* | Generation Time** | UI Response*** |
|--------------|-----------------|-------------------|----------------|
| Small (1K objects) | <2s | 5-15min | <100ms |
So, you've benchmarked this at a thousand tables taking 5-15 minutes to tag?
So it extrapolated values from the amount of time it took to generate 20 objects. And there wasn't any tagging here, just metadata generation. I could do a test run of 100-200 pretty easily, and that may be a better baseline to benchmark off of. The "Discovery Time" portion is a bit misleading, as it's just a quick query on the information_schema.
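For readers following along, the "discovery" step described here is a single metadata query against `information_schema`, not a per-table scan. A minimal sketch of what such a query might look like; the function name and column choices are illustrative assumptions, not the app's actual code:

```python
# Hypothetical sketch of the discovery step: one query against
# information_schema listing schemas, tables, and columns in a single pass.
def build_discovery_query(catalog: str) -> str:
    """Return a query that enumerates all columns in a catalog."""
    return (
        f"SELECT table_schema, table_name, column_name "
        f"FROM {catalog}.information_schema.columns "
        f"ORDER BY table_schema, table_name, ordinal_position"
    )

# In a Databricks notebook this would run as, e.g.:
#   spark.sql(build_discovery_query("main"))
```

Because it reads only catalog metadata, this scales with the number of objects rather than the data volume, which is why it completes in seconds.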
Got it, so it's just a query on the information schema itself; that makes sense.
I was wondering because the batch version is much slower, but it runs a wide variety of queries against different components of the metadata and data.
class PIIDetector:
    """
    Self-contained PII detection with pattern matching and data analysis.
    Provides similar functionality to Presidio but embedded and lightweight.
Have you benchmarked this against Presidio out of the box? I'd be very interested in exploring this further if the metrics are similar.
I should remove that, actually. I used vibe coding to resemble Presidio, but to be candid, it's not nearly as good per my review of Presidio. The most functional part of the PII detection at this moment is regex pattern detection; the LLM has been hit and miss, but I haven't prompted it well and it's an area I have to build out better.
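For context, the regex-based portion described as most functional here typically amounts to a set of named patterns matched against values. A minimal sketch under assumed patterns and names; this is not the actual `PIIDetector` implementation:

```python
import re

# Illustrative regex patterns for common PII types. Real detectors use
# many more patterns plus validation (e.g. checksums for card numbers).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def detect_pii(text: str) -> list[str]:
    """Return the names of all PII patterns that match the given text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
```

Regex matching like this is cheap and deterministic, which is why it tends to be the most reliable layer before the LLM-based pass matures.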
I see. I absolutely have the same challenges with Presidio: the installs on customers' networks can be a trial. But from what we've seen in our benchmarking, an LLM can beat Presidio, though it has to be well prompted. Presidio tends to have really low precision on its own, but it gets really high recall when combined with a well-prompted LLM.
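For readers less familiar with the metrics in this exchange, a quick illustration (with made-up numbers, not the benchmarks referenced above) of how low precision can coexist with high recall:

```python
# precision = of everything flagged, how much was actually PII
# recall    = of all actual PII, how much was flagged
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    return tp / (tp + fp), tp / (tp + fn)


# An aggressive detector: 90 true hits, 60 false alarms, 10 misses.
p, r = precision_recall(tp=90, fp=60, fn=10)  # precision 0.6, recall 0.9
```

A detector that over-flags (like the Presidio behavior described above) sacrifices precision to drive recall up, and a second, well-prompted LLM pass can then filter the false positives.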
    'confidence': confidence
}


def _determine_classification(self, pii_types: List[str]) -> str:
I think before putting this in main, we should align on how we should be classifying data
Maybe that's the wrong approach though, happy to discuss
I'm open to either approach. In its current design, it encapsulates the entire PII detection logic, and classification is just a task within it. Classification in the app is really just informative, which is less useful than dbxmetagen, which I understand facilitates identifying the domains of data objects. My only concern about merging it into main is the maintenance and extension challenges it may introduce. But yeah, let's discuss.
The **Quality tab** provides comprehensive metadata quality assessment and governance analytics:

**🏆 Metadata Quality Assessment Header**:
- **Trophy icon** emphasizing excellence and quality focus
Could you review some of this? I think there's a lot of content here that could be cleaned out - I'm working on doing the same with the main readme
Sure, let me take a look this week and whittle it down
- Llama models: ~3-8s per batch
- Gemma models: ~1-3s per batch (fastest)
- Claude models: ~2-6s per batch
- **Batch Size Impact**: Larger batches (20-50 objects) reduce total time but increase individual request time
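The per-batch latencies quoted here imply a simple back-of-envelope model for total generation time. A sketch with assumed example numbers, not measurements from this PR:

```python
import math


# Total generation time is roughly (number of batches) x (per-batch latency);
# larger batches mean fewer round trips at a higher per-request cost.
def estimate_total_seconds(n_objects: int, batch_size: int, secs_per_batch: float) -> float:
    return math.ceil(n_objects / batch_size) * secs_per_batch


# e.g. 1000 objects in batches of 25 at ~8s/batch: 40 batches -> 320s
estimate_total_seconds(1000, 25, 8.0)
```

At ~5 minutes for 1K objects, this rough model is at least consistent with the 5-15 min "Generation Time" figure in the table under review.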
are objects columns or tables here?
objects represent schemas, tables, and columns (basically just the load)
Ok got it, so 5 tables with 800 columns each, plus the schema, would be like 4006 'objects'?
I think it'd be helpful to clarify in the readme what objects are in this case
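The arithmetic in the question above checks out under the stated definition of "objects" (schemas + tables + columns); the helper name below is illustrative, not from the app:

```python
# Objects = schemas + tables + columns (the total generation load).
def count_objects(n_schemas: int, tables_per_schema: int, cols_per_table: int) -> int:
    n_tables = n_schemas * tables_per_schema
    return n_schemas + n_tables + n_tables * cols_per_table


# 1 schema, 5 tables of 800 columns each: 1 + 5 + 4000 = 4006
count_objects(1, 5, 800)
```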
@@ -0,0 +1 @@
This folder contains apps that support organizations seeking to streamline metadata generation. The first is UC Metadata Assistant, a self-contained app designed for business users looking to generate metadata. The second is built off of dbxmetagen and provides an interface for executing the utilities contained in the remainder of this repo and is provided for organizations with enterprise demands including CI/CD support, domain identification, and mature PII detection and data classification.
I think we need a gap analysis or a differentiator for this in its current state - what is different about the UC Metadata Assistant?
It's not that business users are generating metadata, that's PART of it, but really, your governance management tooling is what is particularly useful there, right?
Well, I think in its current state, the app is really designed as a lightweight mechanism for business users to populate metadata. That's where it's strongest (the remaining components need to be built out better, though they are functional today...I have ideas on where I want to take them). We can put this down as a discussion item for when we connect.
That makes sense. I think we want to state this clearly in the readme before we share this widely, and we should clarify the distinction between the two apps in the main readme.
@@ -0,0 +1,870 @@
# 🏢 Unity Catalog Metadata Assistant
Ideally, could we remove emojis that don't serve any purpose, just as a style-guide item?
In the app, sure, but in the readme I'd really prefer removing them. It reads as too AI-generated to me. Happy to discuss this.
Yeah, it was largely AI-generated; I just provided inputs along the way, had it build out sections, and then reviewed them here and there. I don't really have a strong opinion on emojis lol, so we can cut them out, that's fine.
Not trying to be picky about these things, just thinking it would be good to have a general style. In the app it's one thing, but in readmes and regular code, not ideal.
added apps folder, basic readme with description of the two apps, and then the uc-metadata-assistant app