LLMs cannot reliably count exact words. The rhythm system uses clause-count and qualitative length categories (fragment/short/medium/long) instead of exact word counts. The approximate word references in rhythm-tables.md are rough guidance only.
Impact: Rhythm rules 2 and 3 are reliable because clause counting works. Rule 1 (no three consecutive sentences in the same length category) works well. Exact word-count constraints such as "25-word cap" are replaced with "3-clause cap."
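Rule 1 can be checked mechanically once each sentence has a length category. The following is a minimal sketch, not part of the skill itself: the function name `first_run_violation` and the `max_run` parameter are hypothetical, and the category labels are assumed to have been assigned already (for example by the LLM).

```python
from typing import List, Optional


def first_run_violation(categories: List[str], max_run: int = 2) -> Optional[int]:
    """Return the index of the first sentence that extends a same-category
    run beyond max_run, or None if no such run exists.

    Hypothetical helper: the skill describes the rule, not this function.
    """
    run = 0
    prev = None
    for i, cat in enumerate(categories):
        # Extend the current run if the category repeats, otherwise restart it.
        run = run + 1 if cat == prev else 1
        prev = cat
        if run > max_run:
            return i
    return None


labels = ["short", "medium", "medium", "medium", "long"]
print(first_run_violation(labels))  # index of the third consecutive "medium"
```

The check only needs category labels, so it avoids the exact word counting that the rhythm system deliberately does not rely on.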
English, Russian, Ukrainian, and German have the deepest coverage with native-level tone markers and cultural notes. French, Spanish, Portuguese, Italian, and Polish have been enhanced in v4 but may still have gaps compared to the primary four languages. Community contributions from native speakers are welcome.
- Poetry and creative writing: Apply with lightest possible touch. Use scenarios/creative-writing.md.
- Code-heavy text: Only humanize prose sections. Code, commands, and config examples are never modified.
- Medical, legal, safety-critical text: Skip the pipeline entirely. These require exact preservation.
- Mixed-language text: Detect primary language. Quoted foreign-language passages are preserved verbatim.
- Very short text (<30 words): Limited room for pipeline stages. Cleanup only in most cases.
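The special cases above amount to a routing decision made before the pipeline runs. A rough sketch of that decision follows, assuming the caller already knows the content kind. The function `choose_handling`, the kind labels, and the returned strings are all hypothetical names chosen for illustration.

```python
# Hypothetical labels for content that should bypass the pipeline entirely.
SKIP_KINDS = {"medical", "legal", "safety-critical"}


def choose_handling(kind: str, text: str) -> str:
    """Pick a handling mode for a piece of text (illustrative only)."""
    if kind in SKIP_KINDS:
        return "skip"  # exact preservation required
    if kind == "creative":
        return "light-touch"  # poetry and creative writing
    if len(text.split()) < 30:
        return "cleanup-only"  # very short text: little room for stages
    return "full-pipeline"


print(choose_handling("general", "A short note about release timing."))
```

Whitespace splitting is only an approximation of a word count, which is sufficient for a coarse threshold like this one.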
Results vary significantly between models:
- GPT-4, Claude 3.5+, Gemini 2.0: recommended, consistent results
- Claude 3, GPT-4o-mini, DeepSeek V3: adequate, may need parameter tuning
- Smaller models (<13B parameters): may struggle with the full 6-stage pipeline, produce inconsistent self-evaluation scores, or miss subtle AI markers
The self-evaluation score (Stage 5.6) is generated by the same LLM that did the rewriting. It is a self-assessment, not an independent verification. Always review output before publishing.
The skill provides no external validation: the Quality Score is an LLM self-assessment. The EVAL.md framework provides a protocol for independent evaluation, but it requires a separate LLM call with access to the evaluation prompt and reference files.
The pre-flight guard (AI Probability scoring) uses heuristic markers. Genuine human text that legitimately uses words like "moreover," "robust," or "leverage" may score higher than 20 and trigger the pipeline unnecessarily. If the pipeline is degrading human-written text, run "audit mode" first to verify.
Running the same input through different LLM instances, or even through the same instance twice, may produce different output. The skill guides the LLM but does not enforce deterministic behavior. For more reproducible results, use low-temperature settings (0.1-0.3) in the LLM API call; this reduces variation but does not eliminate it.
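One way to see how much output varies is to run the same input several times and compare the results. The sketch below assumes a caller-supplied `generate(text, temperature)` function that wraps whatever LLM API is in use. Both that callable and the helper `is_stable` are hypothetical and not provided by the skill.

```python
from typing import Callable


def is_stable(
    generate: Callable[[str, float], str],
    text: str,
    temperature: float = 0.2,
    runs: int = 3,
) -> bool:
    """Return True if every run produced identical output.

    `generate` is a placeholder for a real LLM call; it is an assumption
    of this sketch, not an API defined by the skill.
    """
    outputs = {generate(text, temperature) for _ in range(runs)}
    return len(outputs) == 1


# Stub generator standing in for a real model call.
def fake_generate(text: str, temperature: float) -> str:
    return text.strip().capitalize()


print(is_stable(fake_generate, "  the draft reads fine.  "))
```

A `False` result does not indicate a bug in the skill. It only confirms that the chosen model and settings are not deterministic for that input.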
PRESERVATION MODE relies on the LLM correctly identifying and tagging protected elements (SEO keywords, bias markers). There is no programmatic enforcement, so the LLM may miss protected elements or over-protect non-critical ones.
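Because the skill does not enforce preservation itself, a caller can add a simple external check after the rewrite. The following is a minimal sketch under the assumption that protected elements are available as a plain list of strings. The function `missing_protected` is a hypothetical helper, and case-insensitive substring matching is a simplification.

```python
from typing import Iterable, List


def missing_protected(original: str, rewritten: str, protected: Iterable[str]) -> List[str]:
    """List protected terms found in the original but absent from the rewrite.

    Illustrative only: matching is a case-insensitive substring test,
    which will not catch inflected or reordered forms.
    """
    orig_lower = original.lower()
    new_lower = rewritten.lower()
    return [
        term
        for term in protected
        if term.lower() in orig_lower and term.lower() not in new_lower
    ]


lost = missing_protected(
    "Buy the best running shoes today",
    "Get great sneakers now",
    ["running shoes", "today"],
)
print(lost)
```

A non-empty result is a signal to review the rewrite by hand. This check covers only missed elements and says nothing about over-protection.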
The tone profiles specify qualitative spacing targets ("every 3-5 sentences") rather than exact per-100-word metrics. LLMs approximate these targets with reasonable accuracy in most cases, but exact adherence is not guaranteed.