Known Limitations

Word Counting

LLMs cannot reliably count exact words. The rhythm system uses clause-count and qualitative length categories (fragment/short/medium/long) instead of exact word counts. The approximate word references in rhythm-tables.md are rough guidance only.

Impact: Rhythm rules 2 and 3 are reliable because they depend on clause counting, which works. Rule 1 (no three consecutive sentences in the same length category) works well. Exact word-count constraints such as a "25-word cap" are replaced with a "3-clause cap."
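
Rule 1 can also be checked outside the LLM once each sentence has been labeled with a length category. The sketch below is only an illustration of that check, not part of the skill: the function name `first_rule1_violation` and the `max_run` parameter are hypothetical, and the labels are assumed to come from an earlier step.

```python
from typing import List, Optional

# Length categories named in this document.
CATEGORIES = {"fragment", "short", "medium", "long"}


def first_rule1_violation(categories: List[str], max_run: int = 2) -> Optional[int]:
    """Return the index where a run of identical categories exceeds max_run.

    Returns None when no run is longer than max_run. The function name and
    the max_run parameter are hypothetical, chosen for this illustration.
    """
    run = 0
    previous = None
    for index, category in enumerate(categories):
        if category not in CATEGORIES:
            raise ValueError(f"unknown length category: {category!r}")
        # Extend the current run, or start a new one when the category changes.
        run = run + 1 if category == previous else 1
        previous = category
        if run > max_run:
            return index
    return None
```

Calling `first_rule1_violation(["short", "long", "long", "long"])` points at the third consecutive "long" sentence, which is the one to restructure.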

Language Depth

English, Russian, Ukrainian, and German have the deepest coverage with native-level tone markers and cultural notes. French, Spanish, Portuguese, Italian, and Polish have been enhanced in v4 but may still have gaps compared to the primary four languages. Community contributions from native speakers are welcome.

Edge Cases

Poetry and creative writing: Apply with lightest possible touch. Use scenarios/creative-writing.md.
Code-heavy text: Only humanize prose sections. Code, commands, and config examples are never modified.
Medical, legal, safety-critical text: Skip the pipeline entirely. These require exact preservation.
Mixed-language text: Detect primary language. Quoted foreign-language passages are preserved verbatim.
Very short text (<30 words): Limited room for pipeline stages. Cleanup only in most cases.
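
The edge cases above amount to a routing decision made before the pipeline runs. A minimal sketch of that decision follows, assuming the caller already knows the text's domain. The function `choose_handling`, the domain labels, and the returned mode names are all hypothetical. The word count here is computed in code, so the LLM counting limitation does not apply to it.

```python
# Domains for which the document says to skip the pipeline entirely.
# The label strings are hypothetical; the skill does not define them.
SKIP_DOMAINS = {"medical", "legal", "safety-critical"}

# Threshold for "very short text" taken from the list above.
SHORT_TEXT_WORDS = 30


def choose_handling(text: str, domain: str = "general") -> str:
    """Pick a handling mode for a text (hypothetical helper and mode names)."""
    if domain in SKIP_DOMAINS:
        return "skip"            # exact preservation required
    if domain == "creative":
        return "light_touch"     # poetry and creative writing
    if len(text.split()) < SHORT_TEXT_WORDS:
        return "cleanup_only"    # the usual outcome for very short text
    return "full_pipeline"
```

Code-heavy and mixed-language inputs are not covered by this sketch, since separating prose from code or detecting the primary language needs more than a lookup.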

LLM Variance

Results vary significantly between models:

GPT-4, Claude 3.5+, Gemini 2.0: recommended, consistent results
Claude 3, GPT-4o-mini, DeepSeek V3: adequate, may need parameter tuning
Smaller models (<13B parameters): may struggle with the full 6-stage pipeline, produce inconsistent self-evaluation scores, or miss subtle AI markers

The self-evaluation score (Stage 5.6) is generated by the same LLM that did the rewriting. It is a self-assessment, not an independent verification. Always review output before publishing.

No Automated Quality Guarantee

The skill provides no external validation. The Quality Score is an LLM self-assessment. The EVAL.md framework provides a protocol for independent evaluation, but it requires a separate LLM call with access to the evaluation prompt and reference files.

False Positives

The pre-flight guard (AI Probability scoring) uses heuristic markers. Human-written text that legitimately uses words like "moreover," "robust," or "leverage" may score higher than 20 and trigger the pipeline unnecessarily. If the pipeline degrades human-written text, run "audit mode" first to verify the score before rewriting.
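
When a human-written text is flagged, it can help to see which marker words are responsible. The sketch below only counts the three example words named above. It does not reproduce the skill's AI Probability scoring, whose formula is not described here, and the name `count_markers` is hypothetical.

```python
import re
from typing import Dict, Iterable

# Only the example words named in this section; the skill's real marker
# list is not reproduced here.
EXAMPLE_MARKERS = ("moreover", "robust", "leverage")


def count_markers(text: str, markers: Iterable[str] = EXAMPLE_MARKERS) -> Dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each marker."""
    counts: Dict[str, int] = {}
    for marker in markers:
        pattern = rf"\b{re.escape(marker)}\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[marker] = len(hits)
    return counts
```

Because the match is whole-word, inflected forms such as "leveraged" are not counted. A reviewer can use the counts to judge whether each use is legitimate in context before deciding to run the pipeline.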

Pipeline Determinism

Running the same input through different LLM instances, or even through the same instance twice, may produce different output. The skill guides the LLM but does not enforce deterministic behavior. For more reproducible results, use a low temperature setting (0.1-0.3) on the LLM API; this reduces run-to-run variance but does not eliminate it.
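
One way to see how much two runs differ is to compare their outputs directly. The sketch below uses Python's standard `difflib` module to compute a word-level similarity ratio. The function name `run_similarity` is hypothetical, and the two strings are assumed to be outputs collected from separate runs on the same input.

```python
from difflib import SequenceMatcher


def run_similarity(first: str, second: str) -> float:
    """Word-level similarity between two pipeline outputs, from 0.0 to 1.0.

    1.0 means the word sequences are identical; lower values mean the
    runs diverged more. Hypothetical helper, not part of the skill.
    """
    return SequenceMatcher(None, first.split(), second.split()).ratio()
```

Comparing this ratio across temperature settings gives a rough sense of how much a lower temperature tightens the output for a given model.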

Integration Limitations

PRESERVATION MODE relies on the LLM correctly identifying and tagging protected elements (SEO keywords, bias markers). There is no programmatic enforcement, so the LLM may miss protected elements or over-protect non-critical ones.
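
Since the skill itself does not enforce preservation, a caller can add a check after the rewrite. The sketch below is one possible post-check under stated assumptions: the caller supplies the list of protected terms, and a plain case-insensitive substring match is good enough. The name `missing_protected_terms` is hypothetical.

```python
from typing import Iterable, List


def missing_protected_terms(output: str, protected: Iterable[str]) -> List[str]:
    """Return the protected terms that no longer appear in the output.

    Matching is a case-insensitive substring test, so it will not notice
    a term that survived only in an inflected or reordered form.
    """
    lowered = output.casefold()
    return [term for term in protected if term.casefold() not in lowered]
```

An empty result means every listed term survived the rewrite. It says nothing about over-protection, which still needs a manual read.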

Conjunction and Fragment Frequency

The tone profiles specify qualitative spacing targets ("every 3-5 sentences") rather than exact per-100-word metrics. LLMs approximate these targets with reasonable accuracy in most cases, but exact adherence is not guaranteed.
