Learning Log

This log is organized by learning progress instead of assumed calendar days.

Why this format:

  • one real-world day may include multiple chapters
  • a chapter may span multiple sessions
  • using fixed dates per chapter created inaccurate future-dated entries

Logging rule going forward:

  • record by Session or Chapter milestone
  • only use an explicit date when it is certain from the current session
  • prefer Last updated over inventing a new study date

Last updated: 2026-04-07

Session 01

Focus

  • Introduction

What I studied

  • what prompt engineering is and why it matters
  • the guide's experiment setup: gpt-3.5-turbo, temperature=1, top_p=1
  • why prompt results vary across models and settings

Key insights

  • prompt engineering is not just writing better prompts; it is the systematic design of inputs for LLM tasks
  • prompt quality should be evaluated with an experimental mindset because outputs change with model and sampling settings
  • temperature controls randomness; top_p controls how much of the probability mass is considered during token sampling
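
To make the temperature insight concrete, here is a minimal sketch of temperature-scaled softmax. The logits are made-up scores for three candidate tokens, and the function name is my own; real APIs apply this step internally.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens (assumption, not real model output).
logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperature concentrates probability on the top token; higher temperature flattens the distribution, so sampling becomes more varied.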

Confusions / open questions

  • why this guide uses temperature=1 and top_p=1 as defaults
  • when to change temperature versus when to change top_p

Session 02

Focus

  • Introduction review
  • LLM Settings warm-up

What I studied

  • reviewed Introduction through self-test and explanation in my own words
  • clarified the difference between temperature and top_p
  • confirmed when to lower temperature for stable structured outputs
  • started the LLM Settings chapter warm-up

Key insights

  • prompt engineering is about designing inputs for real tasks, not only polishing wording
  • lower temperature is better for predictable outputs such as JSON
  • top_p restricts sampling to the smallest set of tokens whose cumulative probability reaches p, while temperature rescales the whole distribution before sampling (lower sharpens it, higher flattens it)
  • in production, changing multiple decoding knobs at once makes behavior harder to predict and debug
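
A small sketch of the top_p (nucleus) cutoff, for intuition only. The token probabilities are invented, and `top_p_filter` is a hypothetical helper, not a library function.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p,
    then renormalize the survivors so they sum to 1."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Invented next-token distribution (assumption for illustration).
probs = {"yes": 0.5, "no": 0.3, "maybe": 0.15, "banana": 0.05}
print(top_p_filter(probs, 0.9))  # the unlikely tail token "banana" is dropped
print(top_p_filter(probs, 0.4))  # only "yes" survives
```

With a small top_p the model can only pick from a few high-probability tokens; with top_p=1 nothing is cut off.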

Confusions / open questions

  • build stronger intuition for top_p through examples instead of definitions alone
  • learn how model choice, max tokens, and stop sequences affect output quality

Session 03

Focus

  • Prompt Elements

What I studied

  • broke prompts into instruction, context, input data, and output indicator
  • reviewed a sentiment classification prompt and identified which element each line belongs to
  • passed two rounds of self-test on prompt element identification and failure diagnosis

Key insights

  • not every prompt needs all four elements
  • the role of each element is different: task definition, steering information, task payload, and output constraint
  • output indicators are small but powerful because they shape answer format and task completion
  • a translation prompt can work with only instruction, input data, and output indicator when the task is simple and unambiguous
  • audience framing like "for an engineering manager" usually acts as context because it steers style and relevance rather than defining the base task
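
The four elements can be sketched as a tiny prompt builder. The function and parameter names are my own, the "Text:" prefix is just one possible convention, and the sentiment prompt mirrors the kind of example reviewed in this session.

```python
def build_prompt(instruction, input_data, output_indicator, context=None):
    """Assemble a prompt from its elements; context is optional."""
    parts = []
    if context:
        parts.append(context)            # steering information
    parts.append(instruction)            # task definition
    parts.append(f"Text: {input_data}")  # task payload
    parts.append(output_indicator)       # output constraint
    return "\n".join(parts)

prompt = build_prompt(
    instruction="Classify the text into neutral, negative, or positive.",
    input_data="I think the food was okay.",
    output_indicator="Sentiment:",
)
print(prompt)
```

Leaving `context` out shows that a prompt can work with only three of the four elements when the task is simple.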

Confusions / open questions

  • when should context be included versus omitted for simpler tasks?
  • how much output formatting guidance is enough before it becomes over-specified?

Session 04

Focus

  • General Tips for Designing Prompts

What I studied

  • reviewed five design principles: start simple, write strong instructions, be specific, avoid imprecise wording, and prefer telling the model what to do
  • compared weak and improved prompt examples
  • passed one self-test on prompt design principles and one rewrite-focused review round

Key insights

  • prompt design is iterative, so a simple starting point is usually better than an overloaded first draft
  • specificity improves reliability, but only when the details are relevant to the task
  • direct instructions work better than vague style constraints
  • negative-only instructions often fail; positive directions are easier for the model to follow
  • large compound tasks should usually be split into smaller prompts so each step stays clear and testable
  • vague phrases like "briefly", "not too technical", and "not too long" should be replaced by concrete constraints

Confusions / open questions

  • how specific is too specific before the prompt becomes noisy?
  • when should a task be split into subtasks instead of adding more detail to one prompt?

Session 05

Focus

  • Zero-Shot Prompting

What I studied

  • learned that zero-shot prompting means asking the model to do a task directly without examples
  • reviewed the sentiment classification example as a zero-shot task
  • connected zero-shot capability to instruction tuning and instruction-following behavior
  • confirmed the distinction between zero-shot and few-shot through self-check and terminology review

Key insights

  • zero-shot works when the model already understands the task pattern from pretraining and instruction tuning
  • zero-shot prompts rely heavily on clear instructions because there are no demonstrations to steer formatting
  • if zero-shot performance is weak, the next step is usually few-shot prompting rather than adding random wording
  • adding output constraints like JSON format does not break zero-shot as long as no examples are provided

Confusions / open questions

  • how to build stronger intuition for the meaning and naming of zero-shot versus few-shot

Result

  • zero-shot pass
  • ready to move on to few-shot

Session 06

Focus

  • Few-Shot Prompting

What I studied

  • how few-shot prompting uses in-context demonstrations to guide model behavior
  • why format and label distribution in examples matter more than label correctness (Min et al. 2022)
  • the 1-shot word-usage example (whatpu / farduddle) showing in-context learning
  • why even randomly assigned labels still improve over no examples
  • the failure case: few-shot cannot reliably solve multi-step reasoning (the example of summing the odd numbers in a list)
  • the connection to chain-of-thought prompting as the next technique

Key insights

  • few-shot works through in-context learning: no weight updates, just soft conditioning from examples
  • label correctness matters less than label space and input distribution
  • demonstrations primarily teach format and pattern, not facts
  • the failure mode of few-shot is exactly where chain-of-thought helps: intermediate reasoning steps
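
A minimal sketch of how demonstrations are laid out in a few-shot prompt. The "Text:"/"Sentiment:" labels and the demonstration pairs are assumptions for illustration; passing an empty list gives the zero-shot case.

```python
def build_few_shot_prompt(demonstrations, query,
                          input_label="Text", output_label="Sentiment"):
    """Format (input, label) pairs as demonstrations, then append the unanswered query."""
    blocks = [
        f"{input_label}: {text}\n{output_label}: {label}"
        for text, label in demonstrations
    ]
    # The final block leaves the label empty so the model completes it.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)

# Invented demonstrations (assumption); the consistent format is what the model picks up on.
demos = [
    ("This is awesome!", "Positive"),
    ("This is bad!", "Negative"),
]
print(build_few_shot_prompt(demos, "What a horrible show!"))
```

Every demonstration uses the same layout as the final query, which is how the prompt conveys format and label space without any weight updates.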

Confusions / open questions

  • how many examples are enough before adding more stops helping or starts hurting?
  • how to select representative examples when the input distribution is varied?

Result

  • few-shot pass on 2026-04-02
  • key insight: "the model learns the pattern, not the answer": it extracts pattern and format, not label correctness
  • understood why reasoning fails: missing intermediate steps, not missing problem description
  • ready to move on to Chain-of-Thought prompting

Session 07

Focus

  • Chain-of-Thought (CoT) Prompting

What I studied

  • how CoT adds intermediate reasoning steps to fix the failure mode of few-shot on reasoning tasks
  • three variants: Few-shot CoT, Zero-shot CoT ("Let's think step by step"), and Auto-CoT
  • why Zero-shot CoT works: explicit instruction + activation of reasoning patterns from pretraining
  • why CoT is described as an emergent ability that shows up only in sufficiently large models
  • how Auto-CoT uses question clustering + Zero-shot CoT to auto-generate diverse demonstrations

Key insights

  • CoT teaches the model how to solve problems, not just what format to answer in
  • "Let's think step by step" does more than issue an instruction: it activates reasoning patterns learned during pretraining
  • Zero-shot CoT vs Few-shot CoT is a trade-off between diversity/autonomy and control/convergence
  • Auto-CoT's clustering step is critical — diversity prevents error propagation across demonstrations
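
A rough sketch of Zero-shot CoT as a prompt pattern: append the trigger phrase, then pull the final number out of the reasoning text. The helper names and the sample completion are hypothetical, and a real pipeline would get the completion from a model call.

```python
import re

COT_TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question):
    """Append the Zero-shot CoT trigger so the model writes out its reasoning first."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

def extract_final_number(completion):
    """Return the last number mentioned in the completion, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

print(zero_shot_cot_prompt("How many apples remain?"))

# Hand-written stand-in for a model completion (assumption, not real output).
sample = "You started with 10 apples. You gave away 4, leaving 6. Buying 5 more gives 11. After eating 1 you have 10."
print(extract_final_number(sample))  # 10
```

Taking the last number is a crude heuristic; it does not check whether the reasoning chain itself is correct, which is the open question noted below.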

Confusions / open questions

  • how to evaluate whether a CoT reasoning chain is actually correct vs. plausible-sounding but wrong?

Result

  • CoT pass on 2026-04-02
  • ready to move on to Self-Consistency prompting

Session 08

Focus

  • Meta Prompting

What I studied

  • how Meta Prompting focuses on structure and syntax rather than specific content examples
  • the difference between content-driven (few-shot) and structure-driven (meta) approaches
  • why Meta Prompting is token-efficient: cognitive work of extracting structure is shifted to the prompt writer
  • why it can be seen as a zero-shot variant: no concrete content examples
  • its failure condition: relies on model's prior knowledge of the task domain

Key insights

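  • Meta Prompting shifts the work of inducing structure from examples onto the human; the model receives only structural instructions
  • its failure boundary resembles zero-shot: when the model lacks prior knowledge, it cannot fill the structure with correct content
  • for mathematical derivations, choose CoT over Meta Prompting: CoT works through each calculation step explicitly, whereas Meta Prompting only constrains the form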

Result

  • Meta Prompting pass on 2026-04-04
  • ready to move on to Self-Consistency prompting

Session 09

Focus

  • Self-Consistency

What I studied

  • how Self-Consistency improves on CoT by sampling multiple reasoning paths and using majority voting
  • why single-path CoT decoding has no error correction mechanism
  • why majority voting works: wrong paths tend to diverge toward different answers, while correct paths tend to converge on the same answer
  • why Self-Consistency is inapplicable to open-ended tasks: no objective correct answer to vote on
  • the engineering trade-off: token cost multiplies with number of samples

Key insights

  • "there is no first place in writing": open-ended tasks have no ground truth, so voting has no meaning
  • Self-Consistency is not a default technique; the high cost means it's reserved for high-stakes reasoning tasks
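  • discarded reasoning paths consume compute yet contribute nothing to the final answer, so this needs to be weighed as an engineering trade-off
  • more precisely, it is an enhancement to how CoT is sampled and decoded, not a claim that CoT itself is defined as greedy decoding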
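
The voting step of Self-Consistency can be sketched in a few lines. The sampled answers below are invented stand-ins for the final answers extracted from several CoT samples; `majority_vote` is my own helper name.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the share of all samples that agree with it.
    None entries (samples with no parsable answer) count as samples but never win."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None, 0.0
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers from five sampled reasoning paths (assumption).
sampled = ["67", "67", "35", "67", None]
print(majority_vote(sampled))  # ('67', 0.6)
```

The agreement share doubles as a rough confidence signal, and the five samples make the token cost visible: every discarded path was still paid for.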

Result

  • Self-Consistency pass on 2026-04-07

Session 10

Focus

  • Tree of Thoughts (ToT)

What I studied

  • how ToT generalizes CoT from a single chain into a search tree over intermediate thoughts
  • why ToT is different from Self-Consistency: path selection happens during reasoning, not only at the end
  • the four key components: thought decomposition, thought generation, state evaluation, and search algorithm
  • how BFS and DFS can be used to expand, prune, and backtrack over candidate reasoning states
  • why ToT is useful for planning/search-heavy tasks such as Game of 24, creative writing planning, and mini crosswords

Key insights

  • CoT is a chain; ToT is a tree
  • Self-Consistency compares complete chains after generation, while ToT evaluates branches during generation
  • ToT is best understood as a reasoning-plus-search framework, not just a prompt wording trick
  • the power of ToT comes from branching, evaluation, and backtracking, but that also creates its main cost
  • ToT should be reserved for tasks where planning or search materially matters
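
A toy sketch of the ToT loop with breadth-first search and pruning. In real ToT the `generate` and `evaluate` steps would be model calls; here they are deterministic stand-ins on an invented number puzzle (start at 1, reach 10 using "+3" or "*2"), so every name below is an assumption for illustration.

```python
def tree_of_thoughts_bfs(root, generate, evaluate, is_goal,
                         beam_width=2, max_depth=4):
    """Expand every state in the frontier, return the first goal state found,
    otherwise keep only the best `beam_width` candidates and go one level deeper."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in generate(state)]
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        if not candidates:
            return None
        candidates.sort(key=evaluate, reverse=True)  # state evaluation
        frontier = candidates[:beam_width]           # pruning
    return None

TARGET = 10

def generate(state):
    """Thought generation stand-in: apply each allowed operation once."""
    value, path = state
    return [(value + 3, path + ["+3"]), (value * 2, path + ["*2"])]

def evaluate(state):
    """State evaluation stand-in: closer to the target scores higher."""
    return -abs(TARGET - state[0])

def is_goal(state):
    return state[0] == TARGET

print(tree_of_thoughts_bfs((1, []), generate, evaluate, is_goal, beam_width=2))
# (10, ['+3', '+3', '+3'])
print(tree_of_thoughts_bfs((1, []), generate, evaluate, is_goal, beam_width=1))
# None
```

With a beam of 1 the search commits to the locally best branch and never reaches 10, which is the single-chain failure mode; keeping two branches lets the second-best one succeed, at the cost of evaluating more states.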

Result

  • Tree of Thoughts pass on 2026-04-15
  • ready to move on to RAG