Summary: Not all models are equal, and not all tasks require the same model. This guide documents the models actively deployed in production at the Dallas Fusion Center / Real-Time Crime Center -- with exact pull commands, honest capability assessments, VRAM requirements, training data cutoff dates, and task-matching recommendations. The goal is to help practitioners choose the right tool for the right job, avoid wasting time on models that do not fit their hardware, and set realistic expectations about what these models can and cannot do.
- How to Read This Guide
- Deployed Models
- Quick Reference Table
- Model Profiles
- Model-Task Matching Guide
- What These Models Are Bad At
- How to Evaluate Models for Your Workflow
- Quantization Explained
- Running Multiple Models
- References
Every model listed below includes:
- Exact Ollama pull command -- copy-paste ready
- Parameter count -- how large the model is (bigger generally means more capable but slower)
- Approximate VRAM required -- how much GPU memory the model needs
- Fits on 16GB GPU? -- whether the model runs on the recommended RTX 4060 Ti 16GB from the Hardware Guide
- Training Data Cutoff -- the approximate date through which the model's training data was collected; events after this date are outside the model's knowledge
- Best use cases -- what the model does well
- Key strengths -- where the model outperforms peers
- Key weaknesses -- where the model falls short
- Honest assessment -- practical observations from real-world testing
VRAM estimates reflect the quantization used in this deployment: the default (Q4_K_M) unless a profile notes otherwise. Actual VRAM usage varies with context length and the number of concurrent sessions.
| Model | Developer | Parameters | VRAM (approx.) | Fits 16GB GPU? | Primary Use Case |
|---|---|---|---|---|---|
| Gemma 4 (E4B) | Google DeepMind | 4.5B effective (MoE) | ~10 GB | Yes | Multimodal general-purpose workhorse |
| DeepSeek-R1 14B | DeepSeek AI | 14B | ~9 GB | Yes | Chain-of-thought reasoning |
| GPT-OSS 20B | OpenAI | 20.9B (MoE) | ~13 GB | Yes | Agentic tool use and structured output |
| Phi-4 Mini | Microsoft | 3.8B | ~3 GB | Yes | Fast lightweight tasks |
| Granite 4 Micro | IBM | 3B | ~2 GB | Yes | Enterprise RAG and tool calling |
| Nemotron-Cascade-2 | NVIDIA | 30B (MoE) | ~24 GB | No | Advanced reasoning, 256K context |
| GLM-4.7-Flash | Z.ai / ZhipuAI | 30B (MoE) | ~31 GB | No | Bilingual Chinese-English reasoning |
Starting recommendation: Begin with Gemma 4 (E4B) for most analytical tasks. Add DeepSeek-R1 14B for pattern analysis and reasoning tasks. Use GPT-OSS 20B when structured output and tool calling are required.
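The "Fits 16GB GPU?" column can be recomputed for other cards. The following is a minimal sketch, assuming the approximate VRAM figures from the table above and a 2 GB safety margin for context and other processes. The margin and the function name `models_that_fit` are illustrative assumptions, not measured values or part of Ollama.

```shell
# Approximate VRAM needs (GB), copied from the quick reference table.
MODELS="gemma4:latest=10
deepseek-r1:14b=9
gpt-oss:20b=13
phi4-mini:latest=3
granite4:latest=2
nemotron-cascade-2:latest=24
glm-4.7-flash:q8_0=31"

# models_that_fit GPU_GB [HEADROOM_GB]
# Prints the tag of every model whose approximate VRAM need plus the
# headroom fits within GPU_GB. HEADROOM_GB defaults to 2 (an assumption).
models_that_fit() {
    gpu_gb=$1
    headroom_gb=${2:-2}
    printf '%s\n' "$MODELS" | awk -F= -v gpu="$gpu_gb" -v pad="$headroom_gb" \
        '$2 + pad <= gpu { print $1 }'
}

# Example: models_that_fit 16      (standard 16GB build)
# Example: models_that_fit 24 0    (24GB card, no safety margin)
```

With the default margin, a 16 GB card yields the five hardware-compatible models. Passing a margin of 0 reproduces the table's nominal cutoffs, which is why Nemotron-Cascade-2 appears only for `models_that_fit 24 0`.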
ollama pull gemma4:latest

| Attribute | Detail |
|---|---|
| Developer | Google DeepMind |
| Parameters | 4.5B effective (MoE architecture, E4B variant) |
| VRAM Required | ~10 GB |
| Fits 16GB GPU | Yes |
| Context Window | 128K tokens |
| License | Apache 2.0 (fully permissive) |
| Training Data Cutoff | January 2025 |
Best Use Cases:
- Document summarization and analytical writing
- Image-plus-text tasks -- analyzing charts, diagrams, or evidence photos alongside narrative context
- General drafting: briefings, BOLOs, memos
- Agentic workflows requiring a capable lightweight model
Key Strengths:
- Multimodal -- accepts both text and image input; no other model in this deployment shares this capability
- Most recent training cutoff (January 2025) in the hardware-compatible tier -- best knowledge currency for current events
- MoE architecture delivers efficient performance relative to VRAM usage
- Apache 2.0 license eliminates licensing questions for government deployment
- Google DeepMind's alignment work produces reliable instruction following
Key Weaknesses:
- E4B is the smallest Gemma 4 variant; larger variants (12B, 27B) provide greater capability but require more VRAM
- MoE architecture loads more total parameters than its 4.5B effective count suggests; VRAM usage (~10 GB) is higher than a comparable dense 4.5B model
- Image processing consumes additional context window capacity
Honest Assessment: Gemma 4 E4B is the default recommendation for most law enforcement analytical workflows in this deployment. Its multimodal capability opens use cases unavailable to text-only models: extracting text from evidence photos, reviewing surveillance stills alongside narrative reports, or summarizing image-heavy documents. The January 2025 training cutoff -- the most current among hardware-compatible models here -- reduces the risk of outdated responses on recent events. Start here; escalate to DeepSeek-R1 14B or GPT-OSS 20B when tasks require deeper reasoning or structured output.
ollama pull deepseek-r1:14b

| Attribute | Detail |
|---|---|
| Developer | DeepSeek AI (China) |
| Parameters | 14B (dense; distilled from DeepSeek-R1 671B) |
| VRAM Required | ~9 GB |
| Fits 16GB GPU | Yes |
| Context Window | 128K tokens |
| License | MIT License |
| Training Data Cutoff | ~July 2024 |
Best Use Cases:
- Pattern analysis requiring step-by-step logical reasoning
- Evidence synthesis and contradiction detection across multiple source documents
- Timeline reconstruction where the reasoning process needs to be auditable
- Complex analytical tasks where understanding how the model reached a conclusion matters
Key Strengths:
- Chain-of-thought reasoning -- outputs its thinking process before the final answer, making logic auditable
- Distilled from DeepSeek-R1 671B; at 14B it retains significantly stronger reasoning than 8B alternatives
- MIT license -- permissive for government deployment
- Fits on standard 16GB hardware while providing meaningfully better analytical depth than smaller models
- Transparent reasoning traces make it easier to identify where the model went wrong
Key Weaknesses:
- Developed by a Chinese organization (DeepSeek AI) -- some agencies may have procurement or policy concerns about model provenance; note that all inference runs locally and no data is transmitted to external servers, and model weights are open-source and independently auditable
- Reasoning trace increases response time and token usage compared to direct-answer models
- Training cutoff of July 2024 may produce outdated responses on post-cutoff events
- Less optimized for simple drafting tasks where chain-of-thought reasoning adds unnecessary overhead
Honest Assessment: DeepSeek-R1 14B is the strongest reasoning model in the hardware-compatible tier. For law enforcement analytical tasks -- pattern recognition, evidence synthesis, timeline analysis -- its chain-of-thought approach produces more transparent and auditable outputs. At 14B, it provides noticeably better analytical depth than the smaller 8B distilled variants. The reasoning trace also functions as a form of analytical provenance: reviewers can follow the model's logic rather than evaluating only the conclusion. The provenance concern noted under Key Weaknesses applies here as it does to other Chinese-origin models: evaluate against your agency's specific procurement policies before production deployment.
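To keep the reasoning trace as a separate review artifact, the raw output can be split into trace and answer. This is a sketch under a loud assumption: that the raw response wraps the trace in `<think>` and `</think>` tags, each on its own line. How the trace is emitted depends on the Ollama version and interface in use, so check real output first. The function names are hypothetical.

```shell
# Assumes the raw response wraps the reasoning trace in <think> ... </think>
# tags, each on its own line. Verify this against your own output first.

# extract_trace: print only the lines between the tags (the reasoning trace).
extract_trace() {
    sed -n '/<think>/,/<\/think>/p' | sed -e '/<think>/d' -e '/<\/think>/d'
}

# extract_answer: print everything outside the tags (the final answer).
extract_answer() {
    sed '/<think>/,/<\/think>/d'
}

# Example:
#   ollama run deepseek-r1:14b "Your test prompt here" > raw.txt
#   extract_trace  < raw.txt > trace.txt
#   extract_answer < raw.txt > answer.txt
```

Storing `trace.txt` next to the analytical product gives a reviewer the model's stated logic without cluttering the product itself.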
ollama pull gpt-oss:20b

| Attribute | Detail |
|---|---|
| Developer | OpenAI |
| Parameters | 20.9B total / 3.6B active (MoE architecture) |
| VRAM Required | ~13 GB |
| Fits 16GB GPU | Yes (with limited headroom) |
| Context Window | 128K tokens |
| License | Apache 2.0 (fully permissive) |
| Training Data Cutoff | June 2024 |
Best Use Cases:
- Agentic workflows requiring native function calling and tool use
- Structured data extraction with precise output formatting (JSON, tables, templated reports)
- Automated multi-step analytical pipelines
- Workflows where consistent, machine-readable output format is critical
Key Strengths:
- Designed for agentic use -- native function calling and tool-use capabilities are built into the model's training, not bolted on after the fact
- MoE architecture delivers 20B-level analytical capability at ~3.6B active parameter inference speed
- Apache 2.0 license -- fully permissive, no government restrictions
- OpenAI's alignment work produces reliable instruction following and structured output compliance
- Handles multi-turn, tool-calling workflows more reliably than instruction-tuned general models
Key Weaknesses:
- 13 GB VRAM usage leaves limited headroom on a 16GB GPU for very long context tasks -- keep prompts concise for stability
- MoE architecture requires all 20.9B parameters to reside in VRAM during inference, despite only 3.6B being active at any time
- Tool-calling capabilities require a compatible interface (Open WebUI or similar) to fully utilize
- Fewer community fine-tuned variants than models from the Llama ecosystem
Honest Assessment: GPT-OSS 20B is the best choice for structured analytical workflows and automation. Its native tool-use design makes it the strongest performer for function-calling pipelines -- for example, triggering lookups against local databases, formatting outputs in specific agency templates, or chaining tasks together automatically. For simple drafting and summarization, Gemma 4 E4B is lighter on VRAM and easier to work with. GPT-OSS 20B becomes the right choice when the output format and downstream automation matter as much as the content itself. The tight VRAM fit requires care: keep context length and the number of concurrent sessions low enough that total usage stays under 16 GB.
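For machine-readable extraction, the request can be wrapped in a small function. This is a minimal sketch, assuming the `--format json` flag of `ollama run` is available in the installed version. The function name `extract_entities` and the JSON key names are illustrative choices, not an agency standard.

```shell
# extract_entities FILE
# Asks the model for names, dates, and locations from FILE as JSON.
# The key names in the prompt are illustrative assumptions.
extract_entities() {
    report_file=$1
    prompt="Extract all names, dates, and locations from the report below.
Respond only with a JSON object with the keys \"names\", \"dates\", and \"locations\", each an array of strings.

REPORT:
$(cat "$report_file")"
    # --format json requests JSON-formatted output from Ollama.
    ollama run gpt-oss:20b --format json "$prompt"
}

# Example (non-CJI test material only):
#   extract_entities sample_narrative.txt > entities.json
```

Even with a requested format, validate the result before passing it to downstream automation, and check every extracted value against the source document.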
ollama pull phi4-mini:latest

| Attribute | Detail |
|---|---|
| Developer | Microsoft |
| Parameters | 3.8B (dense) |
| VRAM Required | ~3 GB |
| Fits 16GB GPU | Yes (significant headroom remaining) |
| Context Window | 128K tokens |
| License | MIT License |
| Training Data Cutoff | June 2024 |
Best Use Cases:
- Quick summaries of short documents
- Rapid question-answering for simple analytical queries
- First-pass drafts for short analytical products that will be reviewed and edited
- Environments with limited GPU resources (laptops, shared workstations)
Key Strengths:
- Extremely fast inference -- near-instant responses for most prompts
- ~3 GB VRAM usage allows running alongside other applications or models simultaneously
- MIT license -- maximum legal clarity for government use
- 128K context window despite its small parameter count
- Microsoft's "phi" training philosophy emphasizes reasoning capability per parameter, making it more capable than older 7B models for simple tasks
Key Weaknesses:
- Noticeably less capable than 14B or 20B models for complex analytical tasks
- More prone to hallucination than larger models in this deployment
- Struggles with multi-part instructions or nuanced analytical reasoning
- Output quality degrades for longer or more complex products
Honest Assessment: Phi-4 Mini is the right model when speed and resource efficiency are the priority. For quick document checks, short summaries, and situations where a fast first draft is sufficient, it outperforms models three times its size on speed and memory footprint. Do not use it for complex pattern analysis, evidence synthesis, or tasks where hallucination risk has downstream consequences. For most full analytical workflows in this deployment, Gemma 4 E4B produces better results with acceptable speed. Phi-4 Mini's value is the ~3 GB footprint: it can run persistently in the background or on constrained hardware where larger models cannot.
ollama pull granite4:latest

| Attribute | Detail |
|---|---|
| Developer | IBM |
| Parameters | 3B (dense; "Micro" variant) |
| VRAM Required | ~2 GB |
| Fits 16GB GPU | Yes (leaves >14 GB free) |
| Context Window | 128K tokens |
| License | Apache 2.0 (fully permissive) |
| Training Data Cutoff | Not publicly disclosed (estimated 2024–2025) |
Best Use Cases:
- RAG (Retrieval-Augmented Generation) workflows where the model grounds answers in retrieved documents
- Structured question-answering over provided context
- Tool calling and function execution in automated pipelines
- Multilingual document handling (12 supported languages including Spanish, French, Arabic, German)
Key Strengths:
- Purpose-built for enterprise workflows: RAG, tool use, summarization, and multilingual tasks
- ~2 GB VRAM footprint -- the lowest of any model in this deployment; can run as a persistent background service without disrupting primary workloads
- IBM's enterprise focus produces conservative, well-structured output behavior appropriate for formal analytical products
- Apache 2.0 license -- no restrictions for government deployment
- 12-language support is a genuine differentiator for fusion center work involving multilingual source material
Key Weaknesses:
- At 3B parameters, complex analytical depth is limited compared to 14B+ models in this deployment
- Training data cutoff is not publicly disclosed by IBM -- knowledge currency is uncertain; treat as approximate and verify time-sensitive outputs
- IBM Granite community documentation is less extensive than Google or Microsoft model ecosystems
- Not optimized for creative or nuanced prose writing
Honest Assessment: Granite 4 Micro is the most enterprise-focused model in this deployment, reflecting IBM's priority on structured business workflows over general-purpose generation. Its ~2 GB VRAM footprint means it can run persistently as a dedicated RAG or tool-calling service while larger models handle primary analytical tasks. For agencies building multi-model pipelines -- one model for drafting, one for reasoning, one for structured extraction -- Granite 4 Micro is a strong candidate for the extraction and retrieval role. The undisclosed training cutoff is a limitation: for tasks requiring current knowledge, prefer Gemma 4 E4B (January 2025 cutoff) over Granite for the context lookup step.
ollama pull nemotron-cascade-2:latest

| Attribute | Detail |
|---|---|
| Developer | NVIDIA |
| Parameters | 30B total / 3B active (MoE; hybrid Transformer + Mamba-2 architecture) |
| VRAM Required | ~24 GB |
| Fits 16GB GPU | No -- requires 24GB+ VRAM (e.g., RTX 3090, RTX 4090, A5000, or equivalent) |
| Context Window | 262,144 tokens (256K) |
| License | NVIDIA Open Model License |
| Training Data Cutoff | ~June 2025 (approximate) |
Best Use Cases:
- Complex analytical reasoning beyond the capability of 8B-14B models
- Very long document analysis leveraging the 256K context window -- case files, lengthy intelligence reports, multi-document batches in a single pass
- Advanced agentic tasks requiring multi-step planning and execution
- Agencies that have outgrown 14B models but cannot support 70B hardware
Key Strengths:
- 256K token context window -- by far the largest in this deployment; can process a full case file or multi-document intelligence package in a single pass without chunking
- Most recent training cutoff (~June 2025) of any model in this deployment -- best knowledge currency
- Hybrid Mamba-2 architecture processes long contexts efficiently without the quadratic cost of standard attention models
- Supports both "thinking" (chain-of-thought) and "instruct" (direct answer) modes -- configurable per task
- 30B parameter scale provides substantially better analytical depth than hardware-compatible 14B alternatives
Key Weaknesses:
- Requires 24GB+ VRAM -- does not run on the standard 16GB RTX 4060 Ti from the Hardware Guide; requires RTX 3090, RTX 4090, A5000, or workstation-class GPU
- NVIDIA Open Model License is more restrictive than Apache 2.0 or MIT -- review license terms before production deployment; consult your legal team, as the license includes restrictions on using the model to compete with NVIDIA products
- Hybrid Mamba-2 architecture has less community tooling and compatibility documentation than standard transformer models
- Hardware requirement is 2-3x the cost of the starter workstation build
Honest Assessment: Nemotron-Cascade-2 is the most capable model in this deployment and the right choice when analytical quality and context length matter more than hardware simplicity. Its 256K context window and ~June 2025 training cutoff represent genuine operational advantages for complex fusion center work. Agencies that have proved the value of local LLMs with 14B models and are ready to invest in 24GB+ GPU hardware should evaluate this model first. The license requires review before production use -- NVIDIA's Open Model License is permissive for most use cases but contains terms that require sign-off in some agency contexts. See the Hardware Guide for GPU options that support 24GB+ VRAM requirements.
ollama pull glm-4.7-flash:q8_0

| Attribute | Detail |
|---|---|
| Developer | Z.ai / ZhipuAI (China) |
| Parameters | 30B total / 3B active (MoE architecture) |
| VRAM Required | ~31 GB (Q8_0 quantization) |
| Fits 16GB GPU | No -- requires 32GB+ VRAM (workstation-class GPU: A100, RTX 6000 Ada, or equivalent) |
| Context Window | 198,000 tokens (~198K) |
| License | MIT License |
| Training Data Cutoff | ~Mid-to-late 2024 (approximate; not officially disclosed) |
Best Use Cases:
- Chinese-English bilingual document analysis and cross-language synthesis
- Reasoning-intensive tasks requiring MoE-level analytical capability
- Tool use and coding tasks in multi-language workflows
- Analytical work involving Chinese-language source material
Key Strengths:
- Native Chinese-English bilingual capability -- designed for bilingual workflows, not just translation; particularly valuable for intelligence work involving Chinese-language sources
- Interleaved Thinking mode provides chain-of-thought reasoning with transparent logic output alongside the final answer
- MIT license -- permissive for government deployment
- ~198K context window supports large document packages in a single pass
- MoE architecture delivers 30B-level reasoning at 3B active parameter inference speed
Key Weaknesses:
- Requires 32GB+ VRAM -- at ~31 GB, the Q8_0 quantization does not fit on a single consumer GPU with less than 32 GB; it requires workstation- or data-center-class hardware (A100, RTX 6000 Ada, or equivalent) or a dual-GPU consumer setup
- Developed by a Chinese organization (ZhipuAI) -- the same provenance considerations apply as with DeepSeek-R1; all inference runs locally with no data transmission, and weights are open-source, but agencies with policies governing Chinese-origin software should review before deployment
- Training data cutoff is not officially disclosed in English documentation -- treat as approximate and unverified
- Highest hardware requirement of any model in this deployment; may not be operable on current infrastructure without GPU upgrade
Honest Assessment: GLM-4.7-Flash requires more substantial hardware than the rest of this deployment. Its Q8_0 quantization at ~31 GB places it firmly in the enterprise GPU tier. For agencies with 32GB+ VRAM hardware, its bilingual Chinese-English capability and Interleaved Thinking mode provide genuine analytical value for intelligence work involving Chinese-language source material -- a purpose-built capability no other model in this deployment offers. For agencies without the required hardware, GLM-4.7-Flash should be treated as an aspirational addition pending GPU investment. For bilingual work on current hardware, Gemma 4 E4B's general multilingual capability provides a starting point while hardware is evaluated.
The following table maps common law enforcement analytical tasks to recommended models from this deployment. The "Primary" column is the first model to try. The "Alternative" column is a fallback if primary output quality is insufficient or hardware is unavailable.
| Task | Primary Model | Alternative | Notes |
|---|---|---|---|
| Document summarization | Gemma 4 (E4B) | GPT-OSS 20B | Gemma for multimodal + recent training; GPT-OSS for structured summary formats |
| Data extraction (names, dates, locations) | GPT-OSS 20B | Granite 4 Micro | GPT-OSS for complex documents; Granite for RAG pipeline extraction |
| Research synthesis (multi-source briefings) | DeepSeek-R1 14B | Nemotron-Cascade-2 | DeepSeek for transparent reasoning; Nemotron for very long document sets (256K) |
| Content drafting (reports, memos, BOLOs) | Gemma 4 (E4B) | GPT-OSS 20B | Gemma for general prose; GPT-OSS for templated/structured output |
| Analytical reasoning (pattern analysis, logic) | DeepSeek-R1 14B | Nemotron-Cascade-2 | DeepSeek shows reasoning steps; Nemotron for most complex multi-step analysis |
| Lightweight / fast tasks | Phi-4 Mini | Granite 4 Micro | Phi-4 Mini for quick answers; Granite for RAG-backed fast retrieval |
| Bilingual content (Chinese-English) | GLM-4.7-Flash | Gemma 4 (E4B) | GLM purpose-built for Chinese-English; Gemma as hardware-compatible fallback |
| RAG / document grounding | Granite 4 Micro | GPT-OSS 20B | Granite designed for enterprise RAG; GPT-OSS for tool-calling pipelines |
| Long document analysis (>50K tokens) | Nemotron-Cascade-2 | DeepSeek-R1 14B | Nemotron's 256K context handles full case files; DeepSeek handles up to 128K |
| Agentic / automated workflows | GPT-OSS 20B | Nemotron-Cascade-2 | GPT-OSS designed for tool-calling; Nemotron for advanced multi-step agentic tasks |
| Image + text analysis | Gemma 4 (E4B) | (none in this deployment) | Gemma is the only multimodal model deployed; no hardware-compatible alternative |
Recommendation: Start with Gemma 4 (E4B) as the default for most tasks. Pull DeepSeek-R1 14B as a secondary for analytical reasoning. Use GPT-OSS 20B for structured pipelines. Add Phi-4 Mini or Granite 4 Micro for background/lightweight service roles.
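The "Primary" column can be encoded as a lookup so that scripts select a model consistently. The sketch below is illustrative: the task keywords are hypothetical shorthand for the table rows, and the tags are the pull commands used in this guide.

```shell
# primary_model TASK
# Prints the Ollama tag of the "Primary" model for a task keyword.
# The keywords are hypothetical shorthand for the rows of the table above.
primary_model() {
    case "$1" in
        summarize|draft|image) echo "gemma4:latest" ;;
        extract|agentic)       echo "gpt-oss:20b" ;;
        synthesis|reasoning)   echo "deepseek-r1:14b" ;;
        fast)                  echo "phi4-mini:latest" ;;
        rag)                   echo "granite4:latest" ;;
        long-document)         echo "nemotron-cascade-2:latest" ;;
        bilingual-zh-en)       echo "glm-4.7-flash:q8_0" ;;
        *) echo "unknown task: $1" >&2; return 1 ;;
    esac
}

# Example:
#   ollama run "$(primary_model summarize)" "Summarize this document..."
```

Keeping the mapping in one place makes it easy to update when local testing shows a different model performs better for a given task.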
Honest expectations prevent wasted effort. These limitations apply broadly across all models in this deployment.
Models in the 3B-14B parameter range struggle with problems requiring five or more sequential logical steps. A task like "Compare the timelines from these three reports, identify contradictions, cross-reference with known suspect locations, and assess the likelihood of connection" exceeds reliable single-pass execution for most models. Break tasks into smaller sequential steps and run them separately. DeepSeek-R1 14B and Nemotron-Cascade-2 handle multi-step reasoning best, but even these models benefit from task decomposition.
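Task decomposition can be scripted as a chain of separate calls. The following is a sketch, not a tested pipeline: `run_steps` is a hypothetical helper, and the prompt layout is one possible choice.

```shell
# run_steps MODEL CONTEXT_FILE STEP...
# Runs each STEP as its own prompt. Every prompt carries the source material
# plus the output of the previous step, so no single call has to perform the
# whole chain of reasoning. Prints the output of the final step.
run_steps() {
    model=$1
    context_file=$2
    shift 2
    previous=""
    for step in "$@"; do
        prompt="SOURCE MATERIAL:
$(cat "$context_file")

RESULT OF PREVIOUS STEP:
${previous:-none}

TASK:
$step"
        previous=$(ollama run "$model" "$prompt")
    done
    printf '%s\n' "$previous"
}

# Example:
#   run_steps deepseek-r1:14b reports.txt \
#       "List every dated event in the reports." \
#       "Identify contradictions between the listed events." \
#       "Summarize the contradictions for analyst review."
```

In practice an analyst should review each intermediate result before it feeds the next step, since an early error otherwise carries through the chain.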
No model in this deployment was trained on comprehensive legal corpora at the depth required for authoritative legal analysis. "Does this evidence meet the probable cause standard?" is not a reliable question for any local LLM. Use legal research tools and consult legal counsel for these determinations. The models can assist with drafting legal language from provided templates, but cannot reliably interpret statutes or case law.
Despite the context windows listed above, practical quality degrades for all models at context lengths beyond their reliable range. A 200-page case file cannot simply be pasted into a prompt and reliably summarized in a single pass even for Nemotron-Cascade-2 (256K context). For long documents, chunking and sequential processing -- or a dedicated RAG pipeline -- produces more reliable results. See Knowledge Management for RAG architecture guidance.
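Chunking and sequential processing can be sketched with standard tools. This example assumes a plain-text file and uses line counts as a rough stand-in for tokens. The function names and the three-bullet prompt are illustrative.

```shell
# chunk_file FILE LINES_PER_CHUNK OUT_DIR
# Splits FILE into pieces of at most LINES_PER_CHUNK lines, written to
# OUT_DIR as chunk_aa, chunk_ab, ... Line counts only approximate token
# counts; tune the size against your own documents.
chunk_file() {
    mkdir -p "$3"
    split -l "$2" "$1" "$3/chunk_"
}

# summarize_chunks MODEL OUT_DIR
# Summarizes each chunk separately and prints the partial summaries in order.
summarize_chunks() {
    for chunk in "$2"/chunk_*; do
        ollama run "$1" "Summarize the following excerpt in 3 bullet points:
$(cat "$chunk")"
    done
}

# Example:
#   chunk_file case_file.txt 200 ./chunks
#   summarize_chunks gemma4:latest ./chunks > partial_summaries.txt
```

A final pass over `partial_summaries.txt` can then produce the overall summary from much shorter input than the original file.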
All models have training data cutoffs. Events after those dates are outside the model's knowledge. The most current cutoffs in this deployment belong to Nemotron-Cascade-2 (~June 2025) and Gemma 4 E4B (January 2025); the other stated cutoffs fall between June 2024 and late 2024, and Granite 4 Micro's cutoff is not publicly disclosed. For current-events-dependent analysis, always provide the relevant context within the prompt rather than assuming the model knows about recent developments.
All models in this deployment are unreliable for arithmetic, statistics, and mathematical reasoning. If the task involves calculations, use a calculator or spreadsheet and paste the results into the prompt for the model to incorporate into its output. Do not rely on any model here to compute.
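The "compute outside the model" pattern can be as simple as an `awk` one-liner whose result is pasted into the prompt. This is a minimal sketch, assuming one number per line on standard input. The function name `column_stats` is hypothetical.

```shell
# column_stats: reads one number per line on stdin and prints the count,
# sum, and mean, computed by awk rather than by the model.
# Blank lines are skipped; non-numeric lines are not detected.
column_stats() {
    awk 'NF { n++; sum += $1 }
         END { if (n > 0) printf "count=%d sum=%.2f mean=%.2f\n", n, sum, sum / n }'
}

# Example:
#   stats=$(column_stats < response_times.txt)
#   ollama run gemma4:latest "Using these precomputed figures ($stats), draft a two-sentence summary."
```

The model then only has to write around figures it was given, which is the task it is suited to.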
At the 3B-30B parameter range running locally, a meaningful quality gap exists compared to frontier commercial models (GPT-4, Claude, Gemini Ultra). Local models are not a replacement for commercial models in raw capability -- they are a replacement in security posture. The tradeoff is intentional: lower capability in exchange for complete data sovereignty.
| Dimension | Local Deployed Models | Commercial Frontier Models |
|---|---|---|
| Reasoning depth | Limited to moderate (Nemotron-Cascade-2 best) | Strong |
| Factual accuracy | Moderate (requires human verification) | Higher (still requires verification) |
| Instruction following | Good to very good | Excellent |
| Long document handling | Up to 256K with quality degradation | Handles 100K+ more reliably |
| Hallucination rate | Higher | Lower (but not zero) |
| Data sovereignty | Complete | Depends on vendor agreement |
| Cost per query | Near zero (electricity) | $0.01–0.10+ per query |
| Availability | 100% (no internet required) | Requires internet + vendor uptime |
Bottom line: Local models handle 70–80% of routine analytical tasks adequately. The remaining 20–30% requires larger models or different approaches (chunking, RAG), or falls outside the local capability range altogether. See Limitations and Tradeoffs for a deeper discussion.
Model benchmarks published online do not reflect real-world performance on law enforcement-specific tasks. The only reliable evaluation method is testing with representative tasks from actual work.
Compile 5–10 representative tasks that reflect actual analytical work. Use only non-CJI data (for example, training or sanitized material). Include:
- A document that needs summarizing
- A briefing that needs drafting from provided source material
- A data extraction task (pull names, dates, and locations from a narrative)
- A reasoning task (identify patterns or contradictions across multiple inputs)
- A formatting task (generate a structured report from unstructured input)
Run the same prompt against each candidate model. Use the prompting techniques from the Prompting Guide to ensure prompts are well-structured.
# Pull the hardware-compatible models
ollama pull gemma4:latest
ollama pull deepseek-r1:14b
ollama pull gpt-oss:20b
ollama pull phi4-mini:latest
ollama pull granite4:latest
# Test each against the same prompt
ollama run gemma4:latest "Your test prompt here"
ollama run deepseek-r1:14b "Your test prompt here"
ollama run gpt-oss:20b "Your test prompt here"
ollama run phi4-mini:latest "Your test prompt here"
ollama run granite4:latest "Your test prompt here"

Evaluate each output against these dimensions:
| Criterion | What to Look For |
|---|---|
| Accuracy | Are facts from the source material correctly represented? |
| Completeness | Did the model address all parts of the prompt? |
| Format compliance | Does the output match the requested format? |
| Hallucination | Did the model add information not present in the source? |
| Coherence | Is the output logically structured and readable? |
| Speed | How long did the response take? Is it practical for daily use? |
Record which model performs best for each task type. Expect results like:
- "Gemma 4 produces the best drafts for daily briefings and handles image evidence"
- "DeepSeek-R1 14B provides the most transparent reasoning for pattern analysis"
- "GPT-OSS 20B is most consistent for structured extraction and formatted output"
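Recording results is easier to repeat if reviewer scores go into a simple file. The sketch below assumes a hypothetical CSV layout of `task,model,score` with no header, where each score is a reviewer-assigned number such as 1 to 5. Neither the layout nor the scale comes from this guide.

```shell
# best_per_task: reads "task,model,score" lines on stdin (no header) and
# prints "task,model,average" for the highest-averaging model in each task.
# Ties are broken arbitrarily.
best_per_task() {
    awk -F, '
        NF == 3 { key = $1 FS $2; total[key] += $3; count[key]++ }
        END {
            for (key in total) {
                split(key, part, FS)
                avg = total[key] / count[key]
                if (!(part[1] in best) || avg > best[part[1]]) {
                    best[part[1]] = avg
                    winner[part[1]] = part[2]
                }
            }
            for (task in best)
                printf "%s,%s,%.2f\n", task, winner[task], best[task]
        }' | sort
}

# Example:
#   best_per_task < scores.csv
```

Averaging several reviewers' scores per task reduces the influence of any single judgment, though the tally is only as good as the test set behind it.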
Ollama supports pulling and running multiple models on the same machine. Switching between models takes seconds.
ollama run gemma4:latest # For drafting, summarization, and image analysis
ollama run deepseek-r1:14b # For analytical reasoning and pattern analysis
ollama run gpt-oss:20b # For structured output and tool-calling workflows

Quantization reduces a model's numerical precision (from 16-bit to 4-bit or 8-bit) to decrease memory footprint and increase speed. Models in this deployment use different quantization levels as noted in each profile.
| Quantization Level | Quality Impact | VRAM Savings | When to Use |
|---|---|---|---|
| Q8 (8-bit) | Minimal quality loss | ~50% vs. full precision | When VRAM is available and quality matters most -- GLM-4.7-Flash uses this |
| Q4_K_M (4-bit, default) | Small quality loss | ~75% vs. full precision | Default for most models -- best balance of quality and resource usage |
| Q4_K_S (4-bit, small) | Moderate quality loss | ~78% vs. full precision | When VRAM is tight |
| Q3_K (3-bit) | Noticeable quality loss | ~80% vs. full precision | Not recommended for analytical work |
Recommendation: Use the default quantization (Q4_K_M) for all models unless a specific quantization is required for hardware fit. GLM-4.7-Flash runs at Q8_0 in this deployment because Q8 is needed to preserve quality at the 30B scale; the hardware requirement reflects this choice.
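The nominal savings in the table follow from simple arithmetic: weight size is roughly parameters times bits per weight, divided by 8 bits per byte. The sketch below applies that formula. It is a back-of-the-envelope estimate with hypothetical function names, and it ignores the KV cache, runtime overhead, and the extra bits real quantization formats spend on scaling factors.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Rough size of the weights alone, in GB: parameters x bits / 8.
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# savings_pct BITS_PER_WEIGHT
# Nominal memory saved relative to 16-bit weights, as a percentage.
savings_pct() {
    awk -v b="$1" 'BEGIN { printf "%.0f\n", (1 - b / 16) * 100 }'
}

# Example: weights_gb 14 4    (a 14B model at 4-bit)
# Example: weights_gb 30 8    (a 30B model at 8-bit)
# Example: savings_pct 4      (4-bit versus 16-bit)
```

The estimates land somewhat below the VRAM figures in the profiles, as expected, because the profiles include overhead beyond the weights.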
Ollama stores models on disk and loads them into VRAM on demand. Having multiple models pulled does not consume additional VRAM -- only models currently loaded for inference occupy GPU memory. Running ollama ps shows which models are loaded.
# Pull the full hardware-compatible deployment stack
ollama pull gemma4:latest
ollama pull deepseek-r1:14b
ollama pull gpt-oss:20b
ollama pull phi4-mini:latest
ollama pull granite4:latest
# Check what's available
ollama list
# Switch between models as needed
ollama run gemma4:latest "Summarize this document..."
ollama run deepseek-r1:14b "Analyze the pattern across these incidents..."
ollama run gpt-oss:20b "Extract all names, dates, and locations in JSON format..."

Storage requirements: At Q4_K_M quantization, most 3B-14B models require 2–9 GB of disk space. GPT-OSS 20B at ~13 GB, Nemotron-Cascade-2 at ~24 GB, and GLM-4.7-Flash at ~31 GB (Q8_0) are larger. The recommended 1TB NVMe SSD from the Hardware Guide comfortably holds the full deployment stack plus working space.
Hardware-constrained deployments: Nemotron-Cascade-2 (24 GB) and GLM-4.7-Flash (31 GB) require hardware beyond the standard starter build. Pull these only on machines with confirmed 24GB+ and 32GB+ VRAM respectively. The remaining five models all fit on the standard 16GB RTX 4060 Ti build.
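Where a small model runs as a background service alongside a primary model, as suggested for Phi-4 Mini and Granite 4 Micro, the combined footprint has to fit on the card. This is a minimal sketch using the approximate whole-GB figures from the profiles. `can_coreside` is a hypothetical helper, and it leaves no margin for context growth.

```shell
# can_coreside GPU_GB MODEL_GB...
# Succeeds (exit status 0) if the summed approximate VRAM needs of the
# listed models fit within GPU_GB. Whole-number GB values only.
can_coreside() {
    gpu_gb=$1
    shift
    total=0
    for need in "$@"; do
        total=$((total + need))
    done
    [ "$total" -le "$gpu_gb" ]
}

# Example: Gemma 4 E4B (~10 GB) plus Phi-4 Mini (~3 GB) on a 16 GB card
#   can_coreside 16 10 3 && echo "fits"
# Example: GPT-OSS 20B (~13 GB) plus DeepSeek-R1 14B (~9 GB)
#   can_coreside 16 13 9 || echo "does not fit"
```

`ollama ps` lists the models currently loaded, and `ollama stop` followed by a model tag unloads one to free memory for another.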
- Ollama Model Library -- ollama.com/library
- Google Gemma 4 -- ai.google.dev/gemma
- DeepSeek-R1 -- github.com/deepseek-ai
- OpenAI GPT-OSS -- openai.com
- Microsoft Phi-4 -- azure.microsoft.com/products/phi-4
- IBM Granite 4 -- ibm.com/granite
- NVIDIA Nemotron-Cascade-2 -- research.nvidia.com
- ZhipuAI GLM-4.7-Flash -- github.com/THUDM
- CJIS Security Policy v6.0 -- FBI CJIS Resource Center
This document is part of the NFCA Open Source LLM companion resource. For prompting techniques to get the most from these models, see the Prompting Guide. For hardware requirements, see the Hardware Guide. This is an educational resource, not official guidance. Consult your agency's CJIS Systems Officer (CSO) for compliance decisions.