Summary: Large Language Models are probabilistic text generators, not knowledge databases. They predict likely word sequences based on training data -- they do not "know" things or retrieve facts from a database. Feeding relevant data into a prompt significantly reduces hallucination but does not eliminate it. Temperature=0 makes outputs more consistent, not more truthful. Every output must be reviewed by a qualified human before operational use. This document corrects common misconceptions and sets realistic expectations.
- What LLMs Actually Are
- Hallucination -- The Critical Reality
- Temperature Settings -- What They Actually Control
- Common Questions and Honest Answers
- Setting Realistic Expectations
- References
A Large Language Model is a statistical pattern-matching system trained on massive amounts of text data. Understanding what that means -- and what it does not mean -- is essential before deploying one in any professional environment.
An LLM generates text one word (technically, one "token") at a time. For each token, the model calculates the probability of every possible next token and selects one. This process repeats until the response is complete.
Input: "The suspect was last seen heading..."
Model's internal process (simplified):
"north" → 18% probability
"south" → 15% probability
"toward" → 12% probability
"east" → 10% probability
"westbound" → 8% probability
...thousands more options with decreasing probability
Output: "north" (selected as highest probability)
This is fundamentally different from a database query. A database retrieves a stored record. An LLM generates a statistically probable continuation of the input text.
| Common Misconception | Reality |
|---|---|
| "It's an AI that knows things" | It predicts probable text completions based on patterns in training data |
| "It retrieves facts from its training data" | It generates text that resembles patterns from training data -- it has no retrieval mechanism |
| "If I feed it data, it will only use that data" | It blends provided context with patterns from training data -- it cannot fully isolate one from the other |
| "It understands what I'm asking" | It processes statistical relationships between tokens -- the question of "understanding" remains an open research debate |
| "It's like a search engine" | Search engines retrieve existing documents; LLMs generate new text that may or may not be accurate |
LLMs are trained on text data up to a specific date. They have no awareness of events, laws, policies, or case developments after that cutoff. This means:
- A model trained through January 2024 has no knowledge of legal developments from February 2024 onward
- Policy changes, new case law, and updated procedures after the training cutoff are invisible to the model
- The model will not tell you its information is outdated -- it will generate responses as if its training data is current
This is one of the strongest arguments for providing relevant context documents directly in the prompt rather than relying on the model's internal knowledge. For more on how to feed context into prompts effectively, see the Prompting Guide.
"Hallucination" is the term for when an LLM generates text that is plausible-sounding but factually incorrect. This is not a bug that will be fixed in a future update -- it is a fundamental characteristic of how probabilistic text generation works.
A common claim is that feeding relevant data to a model "eliminates hallucination risks." The accurate statement is:
Providing relevant context in the prompt significantly reduces hallucination but does NOT eliminate it.
Even with perfect, complete context provided in the prompt, a model can still:
- Misattribute information -- Assign a statement to the wrong person or document
- Misquote -- Generate a close paraphrase and present it as a direct quote
- Incorrectly synthesize -- Combine facts from two separate sections in a way that creates a false conclusion
- Invent plausible details -- Fill gaps in the provided context with statistically likely but fabricated specifics
- Hallucinate structure -- Generate realistic-looking case numbers, dates, or identifiers that do not exist in the source material
Hallucination frequency correlates with model size. Smaller models hallucinate more frequently and more confidently than larger models.
| Model Size | Approximate Hallucination Risk | Practical Implication |
|---|---|---|
| 3B-7B parameters | Higher frequency | Requires careful review of every output; best used for simple, structured tasks |
| 8B-13B parameters | Moderate frequency | Suitable for drafting and analysis with thorough human review |
| 30B-70B parameters | Lower frequency | More reliable for complex tasks, but still requires review |
| 70B+ parameters | Lowest frequency | Closest to commercial model quality, but hallucination is never zero |
Models in the 7B-8B range -- which are what a standard workstation with 16GB VRAM can run (see Hardware Guide) -- fall in the moderate-risk category. They are useful tools, but the human reviewer is not optional.
These examples illustrate the types of errors that occur in law enforcement analytical contexts:
Example 1: Misattribution
- Source material: Report A states the vehicle was blue. Report B states the vehicle was black.
- Model output: "According to Report A, the vehicle was described as black."
- What happened: The model correctly identified both vehicle descriptions but attributed Report B's description to Report A.
Example 2: Fabricated Specificity
- Source material: "The incident occurred in the early morning hours."
- Model output: "The incident occurred at approximately 0315 hours."
- What happened: The model converted a vague time reference into a specific time that appears authoritative but was never in the source.
Example 3: Incorrect Synthesis
- Source material: Suspect A was seen at Location X on Monday. Suspect B was seen at Location X on Tuesday.
- Model output: "Suspects A and B were observed together at Location X."
- What happened: The model linked two true facts (same location) into a false conclusion (seen together) by omitting the different dates.
Every output from a local LLM must be reviewed by a qualified human before operational use. This is not a limitation that can be configured away. It is a fundamental characteristic of the technology.
The value of a local LLM is that it produces a first draft in minutes instead of hours. The human reviewer then validates, corrects, and approves. The total workflow -- AI draft plus human review -- is still significantly faster than manual drafting from scratch. But the human review step is non-negotiable.
Temperature is a parameter that controls how much randomness the model uses when selecting the next token. It is frequently misunderstood.
A common claim is that setting temperature to 0 produces "deterministic, factual" responses. The accurate statement is:
Temperature=0 makes outputs more consistent and repeatable, not more truthful. The model has no "factual mode."
| Temperature | Behavior | What This Means |
|---|---|---|
| 0.0 | Model selects the single highest-probability token every time | Most consistent output; running the same prompt twice tends to produce the same response |
| 0.1-0.3 | Very slight randomness | Nearly identical to 0 for most tasks; occasionally varies word choice |
| 0.4-0.7 | Moderate randomness | More varied phrasing; useful for creative or drafting tasks |
| 0.8-1.0 | High randomness | Wide variety of outputs; useful for brainstorming but unreliable for analysis |
| >1.0 | Very high randomness | Increasingly incoherent; not recommended for professional use |
Temperature controls which of the probable next tokens the model selects. It does not change what the model considers probable. If the model's highest-probability completion of a sentence is factually incorrect, temperature=0 will consistently produce that incorrect output.
Consider:
- At temperature=0.7, a model might hallucinate a different incorrect detail each time
- At temperature=0, the model hallucinate the same incorrect detail every time
Consistency is not accuracy. A model that reliably produces the same wrong answer is not more trustworthy than one that produces different wrong answers -- it is simply more predictable.
Even at temperature=0, some LLM implementations introduce minor randomness through:
- Top-k sampling -- Some runtimes use a small top-k window even at temp=0
- Floating-point non-determinism -- GPU computation can produce slightly different results across runs due to parallel processing order
- Batching effects -- The same prompt may produce slightly different outputs depending on what other requests the system is processing
For most practical purposes, temperature=0 produces highly consistent output. But it should not be treated as a guarantee of determinism or correctness.
For law enforcement analytical work:
| Use Case | Recommended Temperature | Rationale |
|---|---|---|
| Summarization | 0.0 - 0.1 | Consistency matters; want the same summary each time |
| Report drafting | 0.2 - 0.4 | Slight variation produces more natural-sounding prose |
| Brainstorming / hypotheses | 0.5 - 0.7 | Want diverse ideas and perspectives |
| All CJI-adjacent work | 0.0 | Consistency and reproducibility are paramount |
Regardless of temperature setting, human review of every output remains mandatory. Temperature is a tool for controlling output style, not output accuracy.
These questions reflect the concerns that arise when professionals evaluate LLM technology for operational use.
Honest answer: Probably not in obvious ways, but possibly in subtle ways.
When provided with a complete incident report in the prompt, the model will primarily draw from that text. The vast majority of its output will accurately reflect the source material. However, it may:
- Rephrase in a way that subtly changes meaning
- Add connecting language that implies causation where the source only described correlation
- Select certain details for emphasis based on statistical patterns, not investigative judgment
- Occasionally introduce a detail that sounds right but is not in the source
Best practice: Always verify the model's output against the source material. For critical details -- names, dates, times, locations, charges -- check every one against the original document.
Honest answer: There is no single error rate. It depends on the model, the task, the quality of the prompt, and the complexity of the source material.
To establish a meaningful error rate for a specific deployment:
- Create a test set -- 10-20 representative tasks with known correct answers
- Run each task through the model with the same prompt template
- Score the outputs against the known answers, tracking:
- Factual errors (wrong names, dates, details)
- Omissions (important details left out)
- Fabrications (details not in the source material)
- Misattributions (correct detail, wrong source or person)
- Document the results as a baseline
- Re-test periodically when models are updated or prompts are changed
This benchmarking process is the only honest way to establish an error rate for a specific use case. Vendor-published benchmarks measure different things (usually general knowledge tasks) and do not translate directly to law enforcement analytical workloads.
For more on building effective prompts that reduce errors, see the Prompting Guide. For specific task examples and their observed performance, see Use Cases.
Honest answer: For complex reasoning tasks, commercial models (GPT-4, Claude) at 70B+ parameters significantly outperform local models at the 7B-8B range. For straightforward summarization, drafting, and template-based generation, the gap is smaller than most people expect.
The comparison is not purely about capability:
| Factor | Local 8B Model | Commercial Cloud Model |
|---|---|---|
| Complex reasoning | Weaker | Significantly stronger |
| Simple summarization | Good | Better, but the gap is modest |
| Data privacy | Complete control | Governed by vendor terms and compliance certifications |
| CJI compatibility | Possible with hardening | Requires FedRAMP High + CJIS addendum |
| Cost at scale | Fixed hardware cost | Per-user or per-query recurring cost |
| Internet dependency | None after setup | Required for every query |
The decision is not "which is better" in absolute terms. It is "which fits the operational requirements." See Why Local LLMs for a detailed comparison of when local deployment makes sense versus when cloud alternatives may be more appropriate.
Honest answer: Trust the workflow, not the tool. An LLM produces a draft. A qualified human reviews, corrects, and approves that draft. The human -- not the LLM -- is the author of record for any official product.
This is no different from how spell-checkers, grammar tools, or database queries are used today. The tool assists; the professional is responsible for the final product.
For considerations on disclosure, retention, and evidentiary implications of AI-assisted analysis, see Security Considerations.
The following table summarizes what local LLMs can and cannot reliably do at the 7B-8B parameter range that runs on standard hardware (see Hardware Guide for specifications).
| Task | Why It Works | Typical Quality |
|---|---|---|
| Summarization of provided text | Model stays close to source material | 80-90% usable first draft |
| Report template generation | Structured output from structured prompts | High quality with good prompts |
| Draft writing (memos, briefs, BOLOs) | Template-driven generation is a strength | Good first draft, needs editing |
| Pattern identification in provided data | Can surface connections across multiple inputs | Useful leads, requires verification |
| Training scenario generation | Creative generation from parameters | Good variety, adjust for realism |
| Administrative document drafting | Low-stakes, template-based tasks | High quality |
| Task | Why It Struggles | Recommendation |
|---|---|---|
| Complex multi-step legal reasoning | Requires capabilities beyond 7B-8B scale | Use for initial research only; legal conclusions require human expertise |
| Mathematical calculations | LLMs are text predictors, not calculators | Never trust LLM math; verify independently |
| Broad world knowledge questions | Limited knowledge in smaller models | Provide all relevant context in the prompt; do not rely on the model's training data |
| Long document analysis (>15-20 pages) | Context window limitations at 7B-8B scale | Break documents into sections and process individually; see Knowledge Management for RAG approaches |
| Nuanced policy interpretation | Requires deep domain expertise at scale | Use as a drafting assistant only; policy interpretation requires qualified human judgment |
A practical way to think about local LLM output:
- 80% of the output will be usable, accurate, and save significant time compared to starting from scratch
- 20% of the output will need correction, restructuring, or removal by a qualified reviewer
- The total workflow (AI draft + human review) is still significantly faster than manual drafting from scratch
- The human reviewer is always the final authority and is accountable for the finished product
This 80/20 split varies by task complexity, prompt quality, and model selection. Simpler tasks with well-structured prompts will see higher usable percentages. Complex analytical tasks will see lower ones.
- Huang, L., et al. (2023). "A Survey on Hallucination in Large Language Models." arXiv:2311.05232
- Ji, Z., et al. (2023). "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys
- Ollama Documentation -- Temperature and Sampling Parameters
- NIST AI Risk Management Framework -- AI 100-1
This document is part of the Secure and Affordable In-House AI companion resource. It is an educational resource, not official guidance. Consult your agency's CJIS Systems Officer (CSO) for compliance decisions.