Skip to content

Latest commit

 

History

History
597 lines (455 loc) · 18.2 KB

File metadata and controls

597 lines (455 loc) · 18.2 KB

CRITICAL ASSESSMENT: MNLNP Package

Date: 2025-11-18 Assessment Type: Comprehensive Package Review Overall Grade: D+ (60% - Functional prototype, not production-ready)


EXECUTIVE SUMMARY

Current State

Functional prototype with significant gaps

The MNLNP package successfully demonstrates core concepts for comparing MNL and MNP models, but falls short of production quality in several critical areas. While basic functionality works and tests pass, the package relies on fabricated benchmark data, has statistical validity concerns, and lacks essential documentation.

Key Findings

STRENGTHS:

  • Core model fitting functions work reliably
  • Robust error handling and fallback mechanisms
  • Flexible handling of various model specifications
  • Good test coverage for basic functionality (27 tests passing)
  • Honest labeling of limitations

CRITICAL WEAKNESSES:

  • Fake benchmark data undermines entire recommendation system
  • test_iia() statistical validity concerns (simplified implementation)
  • No vignettes or comprehensive documentation
  • Untested edge cases (missing data, rare alternatives, large models)
  • MNP dependency not properly managed

Risk Level: HIGH

Users could make poor methodological choices based on fabricated benchmarks or misinterpret results due to inadequate documentation.


DETAILED FINDINGS

1. FAKE BENCHMARK DATA - CRITICAL ISSUE ⚠️

Severity: CRITICAL Impact: Undermines package credibility

Problem: The entire recommendation system (recommend_model(), quick_decision(), required_sample_size()) relies on mnl_mnp_benchmark.rda which contains fabricated data:

  • Convergence rates are "educated guesses," not empirical findings
  • Win rates and RMSE values are simulated fiction
  • n_replications = 0 explicitly marks data as fake
  • Dataset labeled data_type = "illustrative_placeholder"

Example Risk:

recommend_model(n = 300, correlation = 0.5)
# Claims: "74% MNP convergence at n=250"
# Reality: Could be 20% or 95% - WE DON'T ACTUALLY KNOW

What Users Might Do:

  • Trust convergence estimates and choose wrong model
  • Cite package recommendations in publications
  • Base research design on false assumptions
  • Waste time/money collecting wrong sample sizes

Mitigation Status:

  • ✅ Data clearly labeled with warnings
  • ✅ Script ready to generate real benchmarks (run_pilot_benchmark.R)
  • ❌ Real benchmarks not yet run

Recommendation: Run pilot simulation immediately (1,800 reps, ~2 hours) before any serious use.


2. test_iia() STATISTICAL VALIDITY CONCERNS ⚠️

Severity: HIGH Impact: Could produce misleading test results

Issues Identified:

Issue 2a: Simplified Implementation

Code Evidence:

# R/iia_and_decision.R, lines 131-135
# For valid Hausman test, need same dimension
# Simplified: use only coefficients that match
min_len <- min(length(full_vec), length(restricted_vec))
full_vec <- full_vec[1:min_len]
restricted_vec <- restricted_vec[1:min_len]

Problem: The code explicitly says "Simplified" - this is NOT a rigorous Hausman test implementation. Proper implementation requires:

  • Careful matching of corresponding coefficients
  • Proper handling of different alternative structures
  • Correct variance-covariance matrix construction

Issue 2b: Frequent Warning

Observed: "Variance difference matrix not positive definite" appears regularly

Implication: The variance-covariance matrix difference should be positive definite for valid Hausman test. When it's not, the test statistic may be invalid. Current code:

  1. Adds small diagonal values (1e-6) to force positive definiteness
  2. Falls back to simplified chi-squared calculation

Risk: Test may produce:

  • False positives (rejecting IIA when it holds)
  • False negatives (failing to reject when IIA violated)
  • Unreliable p-values

Issue 2c: Documentation vs Implementation Gap

Documentation says: "Implements the Hausman-McFadden test" Reality: Implements simplified approximation with fallbacks

Recommendation:

  1. Get econometrician review of implementation
  2. Add explicit warnings about simplified approach
  3. Consider using established packages (mlogit has IIA test)
  4. Validate against known IIA violations

3. DOCUMENTATION GAPS

Severity: HIGH Impact: Users won't understand how to use package properly

Missing Components:

Component Status Impact
Package vignette ❌ None Users have no introduction
Function examples ⚠️ Use fake data Don't show real-world usage
Interpretation guide ❌ None Users may misinterpret results
Workflow tutorials ❌ None Don't know where to start
?mnlChoice help ❌ None Package-level docs missing

Example Problem: Current documentation shows:

#' @examples
#' \dontrun{
#' dat <- generate_choice_data(n = 300, correlation = 0.4, seed = 123)
#' iia_test <- test_iia(choice ~ x1 + x2, data_obj = dat$data)
#' }

What's Wrong: Users need examples showing:

  • How to prepare their own data
  • How to interpret test results
  • What to do when tests fail
  • Real datasets from different domains

Needed:

  1. "Getting Started" vignette
  2. "Interpreting IIA Tests" guide
  3. "Model Selection Workflow" tutorial
  4. "Publishing Results" guide with publication_table() examples

4. UNTESTED EDGE CASES

Severity: MEDIUM-HIGH Impact: Package may crash or give wrong results

What Happens When:

4a. Binary Choice (2 alternatives)

dat <- data.frame(choice = factor(c(rep("A", 50), rep("B", 50))), x1 = rnorm(100))
test_iia(choice ~ x1, data_obj = dat)
# Error: "IIA test requires at least 3 alternatives"

Problem: Error message is technically correct but unhelpful. Should say: "IIA test not applicable to binary choice. Use binary logit/probit instead."

4b. Rare Alternatives

# One alternative has only 5 observations
test_iia(..., omit_alternative = "Rare")
# Warning: "Restricted dataset very small (n < 50)"
# But still runs!

Problem: Test proceeds despite insufficient data. Should refuse to run.

4c. Missing Data

dat$x1[5] <- NA
test_iia(choice ~ x1, data_obj = dat)
# Crashes with cryptic error

Problem: No missing data handling anywhere in package.

4d. Perfect Separation

Problem: MNL won't converge but package doesn't detect or handle this.

4e. Very Large Models (50+ predictors)

Problem: Tested only up to 10 predictors. May fail or produce garbage with many predictors.

Recommendation: Add input validation and informative error messages for all edge cases.


5. MNP DEPENDENCY MANAGEMENT

Severity: MEDIUM Impact: Silent failures, misleading results

Problem: MNP package is in "Suggests" not "Depends", but functions assume it's available:

compare_mnl_mnp(...)  # Claims to compare BOTH models
# If MNP not installed: silently falls back to MNL-only
# User may think they got comparison when they didn't!

Issues:

  1. No clear warnings when MNP unavailable
  2. publication_table() doesn't warn about missing MNP comparison
  3. quick_decision() doesn't check if MNP is actually available
  4. Results structures inconsistent between MNP/no-MNP cases

Recommendation:

# Add to all functions using MNP:
if (!requireNamespace("MNP", quietly = TRUE)) {
  message("MNP package not available. Install with: install.packages('MNP')")
  message("Proceeding with MNL-only analysis.")
}

6. ARCHITECTURE & CODE QUALITY

Issues:

6a. R/create_data.R - Bad Practice

Problem: Script file that runs on source(), not a function

  • Creates side effects
  • Pollutes environment
  • Can't be unit tested
  • Runs every time package sources

Should Be: Function that's called explicitly

6b. No S3/S4 Classes

Problem: Results are raw lists with inconsistent structure

Current:

result <- compare_mnl_mnp(...)
str(result)  # Just a list
print(result)  # Ugly output

Should Have:

class(result) <- "mnlnp_comparison"
print.mnlnp_comparison()
summary.mnlnp_comparison()
plot.mnlnp_comparison()

6c. Parameter Naming Hack

Problem: test_iia(formula_obj, data_obj, ...) uses non-standard names

Why: Workaround for scoping issues with formula() and data() base R functions

Impact:

  • Violates R conventions
  • Confusing for users
  • Inconsistent with other functions
  • Suggests deeper architectural problems

Better Solution: Proper package environment management


7. TEST COVERAGE LIMITATIONS

What's Tested:

  • Basic functionality with clean data
  • 2-10 predictors
  • 3-5 alternatives
  • Core workflows

What's NOT Tested:

  • Error handling
  • Missing data
  • Convergence failures
  • Edge cases (rare alternatives, perfect separation)
  • Large models (20+ predictors)
  • Real datasets with messy data
  • Memory limits
  • Reproducibility across R versions
  • Integration with other packages

Test Quality Issues:

  • Tests verify functions run, not that results are correct
  • No comparison to known ground truth
  • No statistical validation of test statistics
  • No regression tests

8. USER EXPERIENCE PROBLEMS

8a. Confusing Function Names

  • fit_mnp_safe() - What does "safe" mean to users?
  • compare_mnl_mnp_cv() - CV not obvious (cross-validation)
  • quantify_model_choice_consequences() - Too verbose

8b. Inconsistent Interfaces

recommend_model(n, correlation, ...)       # Takes parameters
test_iia(formula_obj, data_obj, ...)       # Takes data
quick_decision(n, n_predictors, ...)       # Takes parameters

Why the inconsistency?

8c. Redundant Return Fields

result <- quick_decision(300, 4)
result$model          # "MNL"
result$recommendation # "MNL" (same!)
result$reason         # Text
result$reasoning      # Same text! (WHY TWO NAMES?)

9. COMPARISON TO PRODUCTION STANDARDS

Feature MNLNP Status Grade Notes
Empirical validation ❌ Fake data F Critical flaw
Vignettes ❌ None F Must have
Edge case testing ❌ Minimal D Basic only
Performance optimization ❌ None C Works but slow
S3/S4 classes ❌ Raw lists D Poor UX
CRAN checks ❓ Untested ? Unknown
CI/CD ❌ None F No automation
Code coverage ❓ ~50%? ? Not measured
Real examples ❌ Synthetic only D Need real data
Benchmark vs alternatives ❌ None F Not compared

Overall Grade: D+

Works for basic cases but not production-ready. Would be rejected from CRAN.


STRENGTHS (For Balance)

  1. Core fitting works - fit_mnp_safe() reliably fits or falls back
  2. Robust error handling - Functions don't crash easily
  3. Good basic test coverage - 27 tests all passing
  4. Flexibility - Handles various specifications (2-10 predictors, 3-5 alternatives)
  5. Honest about limitations - Fake data clearly labeled
  6. Quick decision tool useful - Provides instant guidance with reasoning
  7. Publication table generation - LaTeX/HTML/markdown output works

The foundation is solid. The package demonstrates good software engineering practices in error handling and testing. The problems are primarily about scientific validity, documentation, and production polish rather than fundamental implementation flaws.


RISK ASSESSMENT

If Someone Uses This Package Today:

HIGH RISK:

  1. Makes methodological choice based on fake benchmarks → Wrong model selection
  2. Trusts IIA test with small/rare groups → False conclusions
  3. Applies to non-standard data (missing values, rare categories) → Crashes
  4. Uses for publication without validation → Potential retraction

MEDIUM RISK:

  1. Doesn't realize MNP not installed → Thinks they compared both models
  2. Misinterprets output → Poor documentation leads to errors
  3. Doesn't validate results → Accepts incorrect conclusions

LOW RISK:

  1. Exploratory analysis only → Damage limited to wasted time
  2. Cross-validates with other methods → Catches errors

WHAT SHOULD BE DONE

CRITICAL (Must Fix Before Any Release)

Priority 1: Run Real Benchmarks

# Execute pilot simulation (ready to run)
Rscript run_pilot_benchmark.R
# This will generate 1,800 real simulations (~2 hours)
# Replace illustrative data with empirical results

Priority 2: Fix test_iia() Implementation

  • Get econometrician to review Hausman test code
  • Fix coefficient matching logic
  • Add proper warnings about simplified approach
  • Validate against known IIA violations from literature
  • Consider just wrapping mlogit's IIA test instead

Priority 3: Add Comprehensive Documentation

  • Package vignette: "Introduction to MNLNP"
  • Function examples using real datasets
  • Interpretation guide for test results
  • Common workflows and use cases

Priority 4: Handle MNP Dependency Properly

  • Add clear warnings when MNP unavailable
  • Document what features require MNP
  • Ensure graceful degradation
  • Make dependency status transparent to users

HIGH PRIORITY (Before CRAN Submission)

Priority 5: Write Vignettes

  1. "Getting Started with MNLNP"
  2. "When to Use MNL vs MNP" (with real examples)
  3. "Interpreting IIA Tests"
  4. "Creating Publication Tables"

Priority 6: Create S3 Classes

class(result) <- c("mnlnp_comparison", "list")
print.mnlnp_comparison <- function(x, ...) { ... }
summary.mnlnp_comparison <- function(object, ...) { ... }
plot.mnlnp_comparison <- function(x, ...) { ... }

Priority 7: Test Edge Cases

  • Missing data handling
  • Rare alternatives (< 10 obs)
  • Large models (50+ predictors)
  • Perfect separation
  • Convergence failures
  • Memory limits

Priority 8: Performance Improvements

  • Add progress bars for long operations
  • Estimate computation time before starting
  • Warn about memory requirements
  • Add parallel processing options

MEDIUM PRIORITY (Polish)

Priority 9: Standardize Interfaces

  • Consistent parameter names across functions
  • Consistent return structures
  • Consistent documentation style

Priority 10: Add Real Datasets

  • Transportation choice (already have commuter_choice)
  • Marketing/brand choice
  • Voting behavior
  • Healthcare choices
  • Multiple domains for validation

Priority 11: Benchmark Against Alternatives

  • Compare to mlogit package
  • Compare to VGAM package
  • Validate against published results
  • Document when MNLNP is better/worse

Priority 12: Performance Optimization

  • Parallel processing for simulations
  • Caching of expensive computations
  • Algorithm efficiency improvements

TIMELINE ESTIMATE

To Production Quality: 4-6 weeks of focused work

Task Time Estimate Dependencies
Run real benchmarks 1 week None - ready now
Fix test_iia() 2-3 days Expert review needed
Write vignettes 3-4 days Real benchmarks done
Add S3 classes 2 days None
Edge case testing 1 week None
Real dataset examples 2-3 days Datasets available
CRAN checks 2-3 days All above done
External review 2 weeks Package polished

Critical Path: Real benchmarks → Documentation → External review


BOTTOM LINE

Should This Be Published to CRAN?

Absolutely not in current state.

The fake benchmark data alone disqualifies it. CRAN reviewers would immediately question the scientific validity.

Is This Package Useful?

Potentially yes, but not yet.

The concept is sound and there's genuine need for MNL vs MNP guidance. But execution needs substantial improvement before users can trust it.

What's the Immediate Next Step?

Run the pilot benchmark simulation.

The script is ready (run_pilot_benchmark.R), it will take ~2 hours, and it will:

  1. Replace fake data with real results
  2. Validate/invalidate current recommendations
  3. Provide empirical foundation for package
  4. Enable honest scientific publication

Can This Become a Good Package?

Yes, absolutely.

The foundation is solid:

  • Core functionality works
  • Good error handling
  • Reasonable test coverage
  • Honest about limitations
  • Demonstrates real need

With 4-6 weeks of focused work, this could become a genuinely valuable contribution to the R ecosystem.


RECOMMENDATIONS

For Package Authors

Immediate Actions:

  1. ✅ Run pilot benchmark TODAY (script ready)
  2. ✅ Get test_iia() reviewed by statistician
  3. ✅ Write basic "Getting Started" vignette
  4. ✅ Test with real messy data from different domains
  5. ✅ Add input validation for edge cases

Before Any Public Use:

  • Replace all illustrative data with empirical results
  • Add comprehensive warnings about limitations
  • Document all known issues clearly
  • Get external review from domain expert

For CRAN Submission:

  • Complete all HIGH PRIORITY items above
  • Pass R CMD check with no errors/warnings
  • Add unit tests for edge cases
  • Benchmark against existing packages
  • Get 2-3 users to beta test

For Potential Users

Right Now:

  • ❌ Don't use for publication
  • ❌ Don't cite in papers
  • ❌ Don't recommend to others
  • ❌ Don't trust benchmark recommendations

Exploratory Use Only:

  • ✅ Can explore basic functionality
  • ✅ Can test workflow concepts
  • ✅ Can identify additional needs
  • ✅ Must validate all results independently

In 4-6 Weeks:

  • ✅ Re-evaluate after real benchmarks run
  • ✅ Check if test_iia() has been validated
  • ✅ Review updated documentation
  • ✅ Then consider for real use

FINAL ASSESSMENT

Current State: Interesting prototype demonstrating sound concepts Scientific Validity: Questionable due to fabricated benchmarks Production Readiness: 60% - functional but not trustworthy User Risk: HIGH without proper caveats

Potential: With systematic attention to the critical issues identified above, this package could become a valuable tool for applied researchers. The need is real, the approach is reasonable, but the execution requires substantial improvement before public release.

Grade: D+ (60/100)

  • Passing, but just barely
  • Needs significant work
  • Not ready for prime time
  • But fixable with effort

Compiled: 2025-11-18 Reviewer: Comprehensive Package Assessment Next Review: After pilot benchmarks complete