CRITICAL ASSESSMENT: MNLNP Package

Date: 2025-11-18 Assessment Type: Comprehensive Package Review Overall Grade: D+ (60% - Functional prototype, not production-ready)

EXECUTIVE SUMMARY

Current State

Functional prototype with significant gaps

The MNLNP package successfully demonstrates core concepts for comparing MNL and MNP models, but falls short of production quality in several critical areas. While basic functionality works and tests pass, the package relies on fabricated benchmark data, has statistical validity concerns, and lacks essential documentation.

Key Findings

✅ STRENGTHS:

Core model fitting functions work reliably
Robust error handling and fallback mechanisms
Flexible handling of various model specifications
Good test coverage for basic functionality (27 tests passing)
Honest labeling of limitations

❌ CRITICAL WEAKNESSES:

Fake benchmark data undermines entire recommendation system
test_iia() statistical validity concerns (simplified implementation)
No vignettes or comprehensive documentation
Untested edge cases (missing data, rare alternatives, large models)
MNP dependency not properly managed

Risk Level: HIGH

Users could make poor methodological choices based on fabricated benchmarks or misinterpret results due to inadequate documentation.

DETAILED FINDINGS

1. FAKE BENCHMARK DATA - CRITICAL ISSUE ⚠️

Severity: CRITICAL Impact: Undermines package credibility

Problem: The entire recommendation system (recommend_model(), quick_decision(), required_sample_size()) relies on mnl_mnp_benchmark.rda which contains fabricated data:

Convergence rates are "educated guesses," not empirical findings
Win rates and RMSE values are simulated fiction
n_replications = 0 explicitly marks data as fake
Dataset labeled data_type = "illustrative_placeholder"

Example Risk:

recommend_model(n = 300, correlation = 0.5)
# Claims: "74% MNP convergence at n=250"
# Reality: Could be 20% or 95% - WE DON'T ACTUALLY KNOW

What Users Might Do:

Trust convergence estimates and choose wrong model
Cite package recommendations in publications
Base research design on false assumptions
Waste time/money collecting wrong sample sizes

Mitigation Status:

✅ Data clearly labeled with warnings
✅ Script ready to generate real benchmarks (run_pilot_benchmark.R)
❌ Real benchmarks not yet run

Recommendation: Run pilot simulation immediately (1,800 reps, ~2 hours) before any serious use.

2. test_iia() STATISTICAL VALIDITY CONCERNS ⚠️

Severity: HIGH Impact: Could produce misleading test results

Issues Identified:

Issue 2a: Simplified Implementation

Code Evidence:

# R/iia_and_decision.R, lines 131-135
# For valid Hausman test, need same dimension
# Simplified: use only coefficients that match
min_len <- min(length(full_vec), length(restricted_vec))
full_vec <- full_vec[1:min_len]
restricted_vec <- restricted_vec[1:min_len]

Problem: The code explicitly says "Simplified" - this is NOT a rigorous Hausman test implementation. Proper implementation requires:

Careful matching of corresponding coefficients
Proper handling of different alternative structures
Correct variance-covariance matrix construction

Issue 2b: Frequent Warning

Observed: "Variance difference matrix not positive definite" appears regularly

Implication: The variance-covariance matrix difference should be positive definite for valid Hausman test. When it's not, the test statistic may be invalid. Current code:

Adds small diagonal values (1e-6) to force positive definiteness
Falls back to simplified chi-squared calculation

Risk: Test may produce:

False positives (rejecting IIA when it holds)
False negatives (failing to reject when IIA violated)
Unreliable p-values

Issue 2c: Documentation vs Implementation Gap

Documentation says: "Implements the Hausman-McFadden test" Reality: Implements simplified approximation with fallbacks

Recommendation:

Get econometrician review of implementation
Add explicit warnings about simplified approach
Consider using established packages (mlogit has IIA test)
Validate against known IIA violations

3. DOCUMENTATION GAPS

Severity: HIGH Impact: Users won't understand how to use package properly

Missing Components:

Component	Status	Impact
Package vignette	❌ None	Users have no introduction
Function examples	⚠️ Use fake data	Don't show real-world usage
Interpretation guide	❌ None	Users may misinterpret results
Workflow tutorials	❌ None	Don't know where to start
?mnlChoice help	❌ None	Package-level docs missing

Example Problem: Current documentation shows:

#' @examples
#' \dontrun{
#' dat <- generate_choice_data(n = 300, correlation = 0.4, seed = 123)
#' iia_test <- test_iia(choice ~ x1 + x2, data_obj = dat$data)
#' }

What's Wrong: Users need examples showing:

How to prepare their own data
How to interpret test results
What to do when tests fail
Real datasets from different domains

Needed:

"Getting Started" vignette
"Interpreting IIA Tests" guide
"Model Selection Workflow" tutorial
"Publishing Results" guide with publication_table() examples

4. UNTESTED EDGE CASES

Severity: MEDIUM-HIGH Impact: Package may crash or give wrong results

What Happens When:

4a. Binary Choice (2 alternatives)

dat <- data.frame(choice = factor(c(rep("A", 50), rep("B", 50))), x1 = rnorm(100))
test_iia(choice ~ x1, data_obj = dat)
# Error: "IIA test requires at least 3 alternatives"

Problem: Error message is technically correct but unhelpful. Should say: "IIA test not applicable to binary choice. Use binary logit/probit instead."

4b. Rare Alternatives

# One alternative has only 5 observations
test_iia(..., omit_alternative = "Rare")
# Warning: "Restricted dataset very small (n < 50)"
# But still runs!

Problem: Test proceeds despite insufficient data. Should refuse to run.

4c. Missing Data

dat$x1[5] <- NA
test_iia(choice ~ x1, data_obj = dat)
# Crashes with cryptic error

Problem: No missing data handling anywhere in package.

4d. Perfect Separation

Problem: MNL won't converge but package doesn't detect or handle this.

4e. Very Large Models (50+ predictors)

Problem: Tested only up to 10 predictors. May fail or produce garbage with many predictors.

Recommendation: Add input validation and informative error messages for all edge cases.

5. MNP DEPENDENCY MANAGEMENT

Severity: MEDIUM Impact: Silent failures, misleading results

Problem: MNP package is in "Suggests" not "Depends", but functions assume it's available:

compare_mnl_mnp(...)  # Claims to compare BOTH models
# If MNP not installed: silently falls back to MNL-only
# User may think they got comparison when they didn't!

Issues:

No clear warnings when MNP unavailable
publication_table() doesn't warn about missing MNP comparison
quick_decision() doesn't check if MNP is actually available
Results structures inconsistent between MNP/no-MNP cases

Recommendation:

# Add to all functions using MNP:
if (!requireNamespace("MNP", quietly = TRUE)) {
  message("MNP package not available. Install with: install.packages('MNP')")
  message("Proceeding with MNL-only analysis.")
}

6. ARCHITECTURE & CODE QUALITY

Issues:

6a. R/create_data.R - Bad Practice

Problem: Script file that runs on source(), not a function

Creates side effects
Pollutes environment
Can't be unit tested
Runs every time package sources

Should Be: Function that's called explicitly

6b. No S3/S4 Classes

Problem: Results are raw lists with inconsistent structure

Current:

result <- compare_mnl_mnp(...)
str(result)  # Just a list
print(result)  # Ugly output

Should Have:

class(result) <- "mnlnp_comparison"
print.mnlnp_comparison()
summary.mnlnp_comparison()
plot.mnlnp_comparison()

6c. Parameter Naming Hack

Problem: test_iia(formula_obj, data_obj, ...) uses non-standard names

Why: Workaround for scoping issues with formula() and data() base R functions

Impact:

Violates R conventions
Confusing for users
Inconsistent with other functions
Suggests deeper architectural problems

Better Solution: Proper package environment management

7. TEST COVERAGE LIMITATIONS

What's Tested: ✅

Basic functionality with clean data
2-10 predictors
3-5 alternatives
Core workflows

What's NOT Tested: ❌

Error handling
Missing data
Convergence failures
Edge cases (rare alternatives, perfect separation)
Large models (20+ predictors)
Real datasets with messy data
Memory limits
Reproducibility across R versions
Integration with other packages

Test Quality Issues:

Tests verify functions run, not that results are correct
No comparison to known ground truth
No statistical validation of test statistics
No regression tests

8. USER EXPERIENCE PROBLEMS

8a. Confusing Function Names

fit_mnp_safe() - What does "safe" mean to users?
compare_mnl_mnp_cv() - CV not obvious (cross-validation)
quantify_model_choice_consequences() - Too verbose

8b. Inconsistent Interfaces

recommend_model(n, correlation, ...)       # Takes parameters
test_iia(formula_obj, data_obj, ...)       # Takes data
quick_decision(n, n_predictors, ...)       # Takes parameters

Why the inconsistency?

8c. Redundant Return Fields

result <- quick_decision(300, 4)
result$model          # "MNL"
result$recommendation # "MNL" (same!)
result$reason         # Text
result$reasoning      # Same text! (WHY TWO NAMES?)

9. COMPARISON TO PRODUCTION STANDARDS

Feature	MNLNP Status	Grade	Notes
Empirical validation	❌ Fake data	F	Critical flaw
Vignettes	❌ None	F	Must have
Edge case testing	❌ Minimal	D	Basic only
Performance optimization	❌ None	C	Works but slow
S3/S4 classes	❌ Raw lists	D	Poor UX
CRAN checks	❓ Untested	?	Unknown
CI/CD	❌ None	F	No automation
Code coverage	❓ ~50%?	?	Not measured
Real examples	❌ Synthetic only	D	Need real data
Benchmark vs alternatives	❌ None	F	Not compared

Overall Grade: D+

Works for basic cases but not production-ready. Would be rejected from CRAN.

STRENGTHS (For Balance)

✅ Core fitting works - fit_mnp_safe() reliably fits or falls back
✅ Robust error handling - Functions don't crash easily
✅ Good basic test coverage - 27 tests all passing
✅ Flexibility - Handles various specifications (2-10 predictors, 3-5 alternatives)
✅ Honest about limitations - Fake data clearly labeled
✅ Quick decision tool useful - Provides instant guidance with reasoning
✅ Publication table generation - LaTeX/HTML/markdown output works

The foundation is solid. The package demonstrates good software engineering practices in error handling and testing. The problems are primarily about scientific validity, documentation, and production polish rather than fundamental implementation flaws.

RISK ASSESSMENT

If Someone Uses This Package Today:

HIGH RISK:

Makes methodological choice based on fake benchmarks → Wrong model selection
Trusts IIA test with small/rare groups → False conclusions
Applies to non-standard data (missing values, rare categories) → Crashes
Uses for publication without validation → Potential retraction

MEDIUM RISK:

Doesn't realize MNP not installed → Thinks they compared both models
Misinterprets output → Poor documentation leads to errors
Doesn't validate results → Accepts incorrect conclusions

LOW RISK:

Exploratory analysis only → Damage limited to wasted time
Cross-validates with other methods → Catches errors

WHAT SHOULD BE DONE

CRITICAL (Must Fix Before Any Release)

Priority 1: Run Real Benchmarks

# Execute pilot simulation (ready to run)
Rscript run_pilot_benchmark.R
# This will generate 1,800 real simulations (~2 hours)
# Replace illustrative data with empirical results

Priority 2: Fix test_iia() Implementation

Get econometrician to review Hausman test code
Fix coefficient matching logic
Add proper warnings about simplified approach
Validate against known IIA violations from literature
Consider just wrapping mlogit's IIA test instead

Priority 3: Add Comprehensive Documentation

Package vignette: "Introduction to MNLNP"
Function examples using real datasets
Interpretation guide for test results
Common workflows and use cases

Priority 4: Handle MNP Dependency Properly

Add clear warnings when MNP unavailable
Document what features require MNP
Ensure graceful degradation
Make dependency status transparent to users

HIGH PRIORITY (Before CRAN Submission)

Priority 5: Write Vignettes

"Getting Started with MNLNP"
"When to Use MNL vs MNP" (with real examples)
"Interpreting IIA Tests"
"Creating Publication Tables"

Priority 6: Create S3 Classes

class(result) <- c("mnlnp_comparison", "list")
print.mnlnp_comparison <- function(x, ...) { ... }
summary.mnlnp_comparison <- function(object, ...) { ... }
plot.mnlnp_comparison <- function(x, ...) { ... }

Priority 7: Test Edge Cases

Missing data handling
Rare alternatives (< 10 obs)
Large models (50+ predictors)
Perfect separation
Convergence failures
Memory limits

Priority 8: Performance Improvements

Add progress bars for long operations
Estimate computation time before starting
Warn about memory requirements
Add parallel processing options

MEDIUM PRIORITY (Polish)

Priority 9: Standardize Interfaces

Consistent parameter names across functions
Consistent return structures
Consistent documentation style

Priority 10: Add Real Datasets

Transportation choice (already have commuter_choice)
Marketing/brand choice
Voting behavior
Healthcare choices
Multiple domains for validation

Priority 11: Benchmark Against Alternatives

Compare to mlogit package
Compare to VGAM package
Validate against published results
Document when MNLNP is better/worse

Priority 12: Performance Optimization

Parallel processing for simulations
Caching of expensive computations
Algorithm efficiency improvements

TIMELINE ESTIMATE

To Production Quality: 4-6 weeks of focused work

Task	Time Estimate	Dependencies
Run real benchmarks	1 week	None - ready now
Fix test_iia()	2-3 days	Expert review needed
Write vignettes	3-4 days	Real benchmarks done
Add S3 classes	2 days	None
Edge case testing	1 week	None
Real dataset examples	2-3 days	Datasets available
CRAN checks	2-3 days	All above done
External review	2 weeks	Package polished

Critical Path: Real benchmarks → Documentation → External review

BOTTOM LINE

Should This Be Published to CRAN?

Absolutely not in current state.

The fake benchmark data alone disqualifies it. CRAN reviewers would immediately question the scientific validity.

Is This Package Useful?

Potentially yes, but not yet.

The concept is sound and there's genuine need for MNL vs MNP guidance. But execution needs substantial improvement before users can trust it.

What's the Immediate Next Step?

Run the pilot benchmark simulation.

The script is ready (run_pilot_benchmark.R), it will take ~2 hours, and it will:

Replace fake data with real results
Validate/invalidate current recommendations
Provide empirical foundation for package
Enable honest scientific publication

Can This Become a Good Package?

Yes, absolutely.

The foundation is solid:

Core functionality works
Good error handling
Reasonable test coverage
Honest about limitations
Demonstrates real need

With 4-6 weeks of focused work, this could become a genuinely valuable contribution to the R ecosystem.

RECOMMENDATIONS

For Package Authors

Immediate Actions:

✅ Run pilot benchmark TODAY (script ready)
✅ Get test_iia() reviewed by statistician
✅ Write basic "Getting Started" vignette
✅ Test with real messy data from different domains
✅ Add input validation for edge cases

Before Any Public Use:

Replace all illustrative data with empirical results
Add comprehensive warnings about limitations
Document all known issues clearly
Get external review from domain expert

For CRAN Submission:

Complete all HIGH PRIORITY items above
Pass R CMD check with no errors/warnings
Add unit tests for edge cases
Benchmark against existing packages
Get 2-3 users to beta test

For Potential Users

Right Now:

❌ Don't use for publication
❌ Don't cite in papers
❌ Don't recommend to others
❌ Don't trust benchmark recommendations

Exploratory Use Only:

✅ Can explore basic functionality
✅ Can test workflow concepts
✅ Can identify additional needs
✅ Must validate all results independently

In 4-6 Weeks:

✅ Re-evaluate after real benchmarks run
✅ Check if test_iia() has been validated
✅ Review updated documentation
✅ Then consider for real use

FINAL ASSESSMENT

Current State: Interesting prototype demonstrating sound concepts Scientific Validity: Questionable due to fabricated benchmarks Production Readiness: 60% - functional but not trustworthy User Risk: HIGH without proper caveats

Potential: With systematic attention to the critical issues identified above, this package could become a valuable tool for applied researchers. The need is real, the approach is reasonable, but the execution requires substantial improvement before public release.

Grade: D+ (60/100)

Passing, but just barely
Needs significant work
Not ready for prime time
But fixable with effort

Compiled: 2025-11-18 Reviewer: Comprehensive Package Assessment Next Review: After pilot benchmarks complete

FilesExpand file tree

CRITICAL_ASSESSMENT.md

Latest commit

History