This log is organized by learning progress instead of assumed calendar days.
Why this format:
- one real-world day may include multiple chapters
- a chapter may span multiple sessions
- using fixed dates per chapter created inaccurate future-dated entries
Logging rule going forward:
- record by `Session` or `Chapter` milestone
- only use an explicit date when it is certain from the current session
- prefer `Last updated` over inventing a new study date
Last updated: 2026-04-07
- Introduction
- what prompt engineering is and why it matters
- the guide's experiment setup: `gpt-3.5-turbo`, `temperature=1`, `top_p=1`
- why prompt results vary across models and settings
- prompt engineering is not just writing better prompts; it is the systematic design of inputs for LLM tasks
- prompt quality should be evaluated with an experimental mindset because outputs change with model and sampling settings
- `temperature` controls randomness; `top_p` controls how much of the probability mass is considered during token sampling
- why this guide uses `temperature=1` and `top_p=1` as defaults
- when to change `temperature` versus when to change `top_p`
- Introduction review
- LLM Settings warm-up
- reviewed Introduction through self-test and explanation in my own words
- clarified the difference between `temperature` and `top_p`
- confirmed when to lower `temperature` for stable structured outputs
- started the LLM Settings chapter warm-up
- prompt engineering is about designing inputs for real tasks, not only polishing wording
- lower `temperature` is better for predictable outputs such as JSON
- `top_p` limits token choice to a probability-mass subset, while `temperature` reshapes randomness within sampling
- in production, changing multiple decoding knobs at once makes behavior harder to predict and debug
- build stronger intuition for `top_p` through examples instead of definitions alone
- learn how model choice, max tokens, and stop sequences affect output quality
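To build that intuition, here is a minimal sketch of how `temperature` and `top_p` act on a next-token distribution. The logits are made-up toy values for four hypothetical tokens, and the helper names `apply_temperature` and `top_p_filter` are illustrative, not part of any library. Real decoders may differ in detail.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

logits = [2.0, 1.0, 0.5, -1.0]          # toy scores, assumed for illustration
flat = apply_temperature(logits, 1.0)   # the guide's default setting
sharp = apply_temperature(logits, 0.5)  # lower temperature: top token dominates more
nucleus = top_p_filter(flat, 0.8)       # only the most likely tokens survive
```

With these toy values, lowering `temperature` raises the share of the top token, while `top_p=0.8` removes the two least likely tokens entirely and leaves the relative odds of the survivors unchanged.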
- Prompt Elements
- broke prompts into instruction, context, input data, and output indicator
- reviewed a sentiment classification prompt and identified which element each line belongs to
- passed two rounds of self-test on prompt element identification and failure diagnosis
- not every prompt needs all four elements
- the role of each element is different: task definition, steering information, task payload, and output constraint
- output indicators are small but powerful because they shape answer format and task completion
- a translation prompt can work with only instruction, input data, and output indicator when the task is simple and unambiguous
- audience framing like `for an engineering manager` usually acts as context because it steers style and relevance rather than defining the base task
- when should context be included versus omitted for simpler tasks?
- how much output formatting guidance is enough before it becomes over-specified?
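As a rough illustration of the four elements, the sketch below assembles a prompt from an instruction, optional context, input data, and an optional output indicator. The function name `build_prompt`, the `Text:` prefix, and the element ordering are assumptions made for this example, not a fixed convention.

```python
def build_prompt(instruction, input_data, context=None, output_indicator=None):
    """Join whichever prompt elements are present, one per line."""
    parts = []
    if context:
        parts.append(context)            # steering information (optional)
    parts.append(instruction)            # task definition
    parts.append(f"Text: {input_data}")  # task payload
    if output_indicator:
        parts.append(output_indicator)   # output constraint (optional)
    return "\n".join(parts)

# Sentiment classification without context: three of the four elements.
prompt = build_prompt(
    instruction="Classify the text into neutral, negative, or positive.",
    input_data="I think the food was okay.",
    output_indicator="Sentiment:",
)

# The same task with audience framing added as context.
framed = build_prompt(
    instruction="Classify the text into neutral, negative, or positive.",
    input_data="I think the food was okay.",
    context="You are summarizing customer feedback for an engineering manager.",
    output_indicator="Sentiment:",
)
```

Keeping context and the output indicator optional mirrors the note that not every prompt needs all four elements.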
- General Tips for Designing Prompts
- reviewed five design principles: start simple, write strong instructions, be specific, avoid imprecise wording, and prefer telling the model what to do
- compared weak and improved prompt examples
- passed one self-test on prompt design principles and one rewrite-focused review round
- prompt design is iterative, so a simple starting point is usually better than an overloaded first draft
- specificity improves reliability, but only when the details are relevant to the task
- direct instructions work better than vague style constraints
- negative-only instructions often fail; positive directions are easier for the model to follow
- large compound tasks should usually be split into smaller prompts so each step stays clear and testable
- vague phrases like `briefly`, `not too technical`, and `not too long` should be replaced by concrete constraints
- how specific is too specific before the prompt becomes noisy?
- when should a task be split into subtasks instead of adding more detail to one prompt?
- Zero-Shot Prompting
- learned that zero-shot prompting means asking the model to do a task directly without examples
- reviewed the sentiment classification example as a zero-shot task
- connected zero-shot capability to instruction tuning and instruction-following behavior
- confirmed the distinction between zero-shot and few-shot through self-check and terminology review
- zero-shot works when the model already understands the task pattern from pretraining and instruction tuning
- zero-shot prompts rely heavily on clear instructions because there are no demonstrations to steer formatting
- if zero-shot performance is weak, the next step is usually few-shot prompting rather than adding random wording
- adding output constraints like JSON format does not break zero-shot as long as no examples are provided
- how to build stronger intuition for the meaning and naming of zero-shot versus few-shot
- zero-shot pass
- ready to move on to few-shot
- Few-Shot Prompting
- how few-shot prompting uses in-context demonstrations to guide model behavior
- why format and label distribution in examples matter more than label correctness (Min et al. 2022)
- the 1-shot word-usage example (whatpu / farduddle) showing in-context learning
- why even randomly assigned labels still improve over no examples
- the failure case: few-shot cannot reliably solve multi-step reasoning (odd numbers sum)
- the connection to chain-of-thought prompting as the next technique
- few-shot works through in-context learning: no weight updates, just soft conditioning from examples
- label correctness matters less than label space and input distribution
- demonstrations primarily teach format and pattern, not facts
- the failure mode of few-shot is exactly where chain-of-thought helps: intermediate reasoning steps
- how many examples is enough before adding more stops helping or hurts?
- how to select representative examples when the input distribution is varied?
- few-shot pass on 2026-04-02
- key insight: "the model learns the pattern, not the answer"; it extracts pattern and format, not label correctness
- understood why reasoning fails: missing intermediate steps, not missing problem description
- ready to move on to Chain-of-Thought prompting
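A minimal sketch of how a few-shot prompt could be assembled from demonstrations that share one format. The helper `build_few_shot_prompt`, the `Text:`/`Sentiment:` labels, and the example sentences are illustrative assumptions. With an empty demonstration list the same helper produces a zero-shot prompt.

```python
def build_few_shot_prompt(demonstrations, query,
                          input_label="Text", output_label="Sentiment"):
    """Render (input, label) demonstrations in one consistent format, then the query."""
    lines = []
    for text, label in demonstrations:
        lines.append(f"{input_label}: {text}")
        lines.append(f"{output_label}: {label}")
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")  # left open for the model to complete
    return "\n".join(lines)

demos = [
    ("This is awesome!", "Positive"),
    ("This is bad!", "Negative"),
]
few_shot = build_few_shot_prompt(demos, "What a horrible show!")
zero_shot = build_few_shot_prompt([], "What a horrible show!")
```

Every demonstration goes through the same template, so the format and the label space stay consistent, which is what the notes above identify as the main thing demonstrations teach.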
- Chain-of-Thought (CoT) Prompting
- how CoT adds intermediate reasoning steps to fix the failure mode of few-shot on reasoning tasks
- three variants: Few-shot CoT, Zero-shot CoT ("Let's think step by step"), and Auto-CoT
- why Zero-shot CoT works: explicit instruction + activation of reasoning patterns from pretraining
- why CoT is an emergent ability that appears mainly in sufficiently large models
- how Auto-CoT uses question clustering + Zero-shot CoT to auto-generate diverse demonstrations
- CoT teaches the model how to solve problems, not just what format to answer in
- "Let's think step by step" activates pretraining patterns, not just issuing an instruction
- Zero-shot CoT vs Few-shot CoT is a trade-off between diversity/autonomy and control/convergence
- Auto-CoT's clustering step is critical: diversity prevents error propagation across demonstrations
- how to evaluate whether a CoT reasoning chain is actually correct vs. plausible-sounding but wrong?
- CoT pass on 2026-04-02
- ready to move on to Self-Consistency prompting
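To make "intermediate reasoning steps" concrete, the sketch below writes out the steps a chain of thought would state for an odd-numbers-sum question: pick out the odd numbers, add them, then check the parity of the sum. The input list and the function name `odd_sum_reasoning` are assumptions for illustration. The code computes the steps directly and does not call a model.

```python
def odd_sum_reasoning(numbers):
    """Return the written-out intermediate steps and the parity of the odd-number sum."""
    odds = [n for n in numbers if n % 2 == 1]
    total = sum(odds)
    verdict = "even" if total % 2 == 0 else "odd"
    steps = [
        f"Odd numbers: {odds}",
        f"Their sum: {total}",
        f"The sum is {verdict}.",
    ]
    return steps, verdict

steps, verdict = odd_sum_reasoning([15, 32, 5, 13, 82, 7, 1])
for step in steps:
    print(step)
```

A few-shot prompt that shows only question and answer pairs skips the first two steps. A CoT demonstration includes them, so the model conditions on the procedure and not just the answer format.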
- Meta Prompting
- how Meta Prompting focuses on structure and syntax rather than specific content examples
- the difference between content-driven (few-shot) and structure-driven (meta) approaches
- why Meta Prompting is token-efficient: cognitive work of extracting structure is shifted to the prompt writer
- why it can be seen as a zero-shot variant: no concrete content examples
- its failure condition: relies on model's prior knowledge of the task domain
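- Meta Prompting shifts the work of inducing structure from examples onto the human; the model only receives structural instructions
- its failure boundary is similar to zero-shot: when the model lacks prior knowledge, it cannot fill in the correct content
- for mathematical derivations, choose CoT over Meta Prompting: CoT spells out each computation step so it can be checked, while Meta Prompting only constrains form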
- Meta Prompting pass on 2026-04-04
- ready to move on to Self-Consistency prompting
- Self-Consistency
- how Self-Consistency improves on CoT by sampling multiple reasoning paths and using majority voting
- why single-path CoT decoding has no error correction mechanism
- why majority voting works: error paths are diverse (divergent), correct paths converge to the same answer
- why Self-Consistency is inapplicable to open-ended tasks: no objective correct answer to vote on
- the engineering trade-off: token cost multiplies with number of samples
- "there is no undisputed first place in writing": open-ended tasks have no ground truth, so voting has no meaning
- Self-Consistency is not a default technique; the high cost means it's reserved for high-stakes reasoning tasks
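- reasoning paths that lose the vote consume compute but contribute nothing to the final answer, which is an engineering trade-off to weigh
- more precisely, it is an enhancement to how CoT is sampled and decoded, rather than a definition of CoT itself as greedy decoding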
- Self-Consistency pass on 2026-04-07
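The voting step can be sketched in a few lines. The list of sampled answers below is hypothetical; in practice each entry would be the final answer extracted from one independently sampled CoT completion, which is stubbed out here. The helper name `majority_vote` is an assumption for this example.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and the fraction of samples that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers from five sampled reasoning paths.
sampled_answers = ["67", "67", "35", "67", "41"]
answer, agreement = majority_vote(sampled_answers)
```

The wrong answers disagree with each other while the correct one repeats, which is why the vote works. The cost is visible too: five completions were paid for to produce one answer.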
- Tree of Thoughts (ToT)
- how ToT generalizes CoT from a single chain into a search tree over intermediate thoughts
- why ToT is different from Self-Consistency: path selection happens during reasoning, not only at the end
- the four key components: thought decomposition, thought generation, state evaluation, and search algorithm
- how BFS and DFS can be used to expand, prune, and backtrack over candidate reasoning states
- why ToT is useful for planning/search-heavy tasks such as Game of 24, creative writing planning, and mini crosswords
- CoT is a chain; ToT is a tree
- Self-Consistency compares complete chains after generation, while ToT evaluates branches during generation
- ToT is best understood as a reasoning-plus-search framework, not just a prompt wording trick
- the power of ToT comes from branching, evaluation, and backtracking, but that also creates its main cost
- ToT should be reserved for tasks where planning or search materially matters
- Tree of Thoughts pass on 2026-04-15
- ready to move on to RAG
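A minimal sketch of the breadth-first variant, assuming a toy task in place of a real model. `generate` and `evaluate` stand in for the thought-generation and state-evaluation steps that would normally be LLM calls; the task, the names, and the scoring rule are all invented for illustration.

```python
def tree_of_thoughts_bfs(initial_state, generate, evaluate, depth, breadth):
    """Expand every state in the frontier, score the children, keep the best `breadth`."""
    frontier = [initial_state]
    for _ in range(depth):
        candidates = [child for state in frontier for child in generate(state)]
        if not candidates:
            break
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:breadth]  # pruning happens during the search
    return max(frontier, key=evaluate)

# Toy task (assumed): pick three numbers from CHOICES whose sum equals TARGET.
CHOICES = [1, 4, 6, 9]
TARGET = 16

def generate(state):
    """Thought generation: extend a partial solution by one choice."""
    return [state + (c,) for c in CHOICES]

def evaluate(state):
    """State evaluation: closer to the target is better (0 is a perfect score)."""
    return -abs(TARGET - sum(state))

best = tree_of_thoughts_bfs((), generate, evaluate, depth=3, breadth=2)
```

Branches are scored and pruned at every level, before any chain is complete, which is the contrast with Self-Consistency noted above. This beam-style BFS never revisits a pruned branch; a DFS variant would add explicit backtracking. Each level costs one generation and one evaluation per candidate, which is where the main cost of ToT comes from.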