feat: LLM-as-a-judge for REPLACE evaluation #98

### Priority Level

Medium (Nice to have)

### Is your feature request related to a problem?

When using Replace, it is hard to tell whether all the PII was found and whether it was replaced well.

### Describe the solution you'd like

The evaluation for Rewrite works well, and we would like to offer the option to extend it to Replace in some capacity.

The ideal evaluation is a human review. In the absence of that, could an LLM optionally review the output to help answer: "Did it actually (1) find all the PII and (2) replace it in a contextually relevant way?"
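To make the two questions concrete, here is a minimal sketch of what such a judge could look like. Everything in it is an assumption for illustration: `JUDGE_PROMPT`, `ReplaceVerdict`, `judge_replace`, and the JSON keys are hypothetical names and not part of Anonymizer's existing API, and `llm` stands in for whatever model client would actually be used.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical prompt; the wording and the JSON schema are assumptions.
JUDGE_PROMPT = """You are reviewing a PII replacement.

Original text:
{original}

Replaced text:
{replaced}

Answer in JSON with exactly these keys:
  "missed_pii": list of PII strings from the original that still appear in the replaced text
  "poor_replacements": list of replacements that do not fit their context
"""


@dataclass
class ReplaceVerdict:
    missed_pii: List[str]
    poor_replacements: List[str]

    @property
    def all_pii_found(self) -> bool:
        # Question (1): did it find all the PII?
        return not self.missed_pii

    @property
    def contextually_relevant(self) -> bool:
        # Question (2): were the replacements contextually relevant?
        return not self.poor_replacements


def judge_replace(original: str, replaced: str, llm: Callable[[str], str]) -> ReplaceVerdict:
    """Ask an LLM (passed in as a plain callable) to review one Replace result."""
    prompt = JUDGE_PROMPT.format(original=original, replaced=replaced)
    data = json.loads(llm(prompt))  # assumes the model returns valid JSON
    return ReplaceVerdict(
        missed_pii=list(data.get("missed_pii", [])),
        poor_replacements=list(data.get("poor_replacements", [])),
    )
```

Passing the model in as a callable keeps the judge optional and makes it easy to test with a stub. A real implementation would likely also need to handle malformed model output.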

This could also be very helpful when running Anonymizer on a language one is unfamiliar with, to get a sense of whether Anonymizer performs well on that language. Ultimately, it would also let us provide more benchmark information.
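As a rough illustration of how judge results could feed per-language benchmark numbers, the sketch below aggregates pass/fail outcomes by language. The function name `pass_rate_by_language`, the use of language codes as keys, and the single pass/fail boolean per sample are all assumptions, not an existing Anonymizer feature.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_language(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (language, passed) pairs into a pass rate per language."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        passes[language] += int(passed)
    return {language: passes[language] / totals[language] for language in totals}
```

For example, with made-up inputs, `pass_rate_by_language([("de", True), ("de", False), ("ja", True)])` gives a rate of 0.5 for `"de"` and 1.0 for `"ja"`.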

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
