
Benchmark Evaluation API Returns Hardcoded Mock Metrics in PIPL Example #403

@ARYANPATEL-BIT

Description


In the examples/PIPL codebase, the evaluate() pipeline for the Privacy-Preserving LLM is entirely stubbed out. Rather than computing results from the model's actual predictions and the dataset's ground truth, the evaluation logic calls placeholder methods that return hardcoded floating-point constants (a paraphrased sketch of the stubs follows the list below).

Affected File:
examples/PIPL/edge-cloud_collaborative_learning_bench/test_algorithms/privacy_preserving_llm/privacy_preserving_llm.py

Specific Mocked Methods:

  • _evaluate_utility() unconditionally returns: {'accuracy': 0.92, 'f1_score': 0.89, 'precision': 0.91, 'recall': 0.87}
  • _evaluate_privacy() unconditionally returns simulated MIA AUC scores (0.52, 0.51, etc.)
  • _evaluate_compliance() unconditionally returns hardcoded compliance scores like 0.98
  • _evaluate_performance() unconditionally returns a static latency of 2.3 and throughput of 15.2
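
For reference, here is a paraphrased sketch of what the stubs amount to. The constants are the ones listed above; the dictionary keys other than the utility metrics, and the exact method bodies, are illustrative rather than copied from the file:

```python
# Paraphrased sketch of the current stubs in privacy_preserving_llm.py.
# Every method returns constants; nothing is computed from predictions
# or ground truth. Key names besides the utility metrics are assumed.

def _evaluate_utility(self):
    return {'accuracy': 0.92, 'f1_score': 0.89, 'precision': 0.91, 'recall': 0.87}

def _evaluate_privacy(self):
    # "Simulated" membership-inference AUC values, never measured
    return {'mia_auc': 0.52, 'shadow_mia_auc': 0.51}

def _evaluate_compliance(self):
    return {'compliance_score': 0.98}

def _evaluate_performance(self):
    return {'latency': 2.3, 'throughput': 15.2}
```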

Impact

This is a critical flaw that breaks the benchmark. Any algorithm tested against it will appear to achieve strong results regardless of its actual performance, which renders the entire PIPL example useless for scientific evaluation.

Proposed Fix

  1. Refactor the evaluate() method and remove all hardcoded mock return dictionaries
  2. Implement actual metric calculations by comparing the dataset's ground truth against the model's real output from the inference pipeline
  3. Wire the performance evaluation to track real-time metrics (such as processing latency and throughput) instead of returning static values (see the sketch after this list)
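
A minimal sketch of what steps 2 and 3 could look like, assuming the pipeline can expose predictions and ground-truth labels as flat lists and using scikit-learn for the utility metrics. The method signatures and the model.predict call are hypothetical, not taken from the repository:

```python
import time
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def _evaluate_utility(self, y_true, y_pred):
    # Compute metrics from real predictions instead of returning constants.
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'f1_score': f1_score(y_true, y_pred, average='macro'),
        'precision': precision_score(y_true, y_pred, average='macro'),
        'recall': recall_score(y_true, y_pred, average='macro'),
    }

def _evaluate_performance(self, model, samples):
    # Measure wall-clock latency/throughput over the actual inference pipeline.
    start = time.perf_counter()
    for sample in samples:
        model.predict(sample)  # hypothetical inference call
    elapsed = time.perf_counter() - start
    return {
        'latency': elapsed / len(samples),     # seconds per sample
        'throughput': len(samples) / elapsed,  # samples per second
    }
```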

Labels

kind/bug
