[TOC]
↗ Linguistics ↗ Ordinary Language Philosophy ↗ Philosophy of Language
↗ Formal System, Formal Logics, and Its Semantics ↗ Automata Theory and (Formal) Language Theory ↗ Formal Syntax & Metasyntax (and Metalanguage)
↗ Algebraic Structure & Abstract Algebra & Modern Algebra
↗ Computer Languages & Programming Methodology ↗ Programming Language Processing & Program Execution
↗ LLM (Large Language Model) ↗ Knowledge Graph (KG)
[!links] ↗ LLM (Large Language Model)
Daniel Jurafsky and James H. Martin. 2025. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd edition. Online manuscript released August 24, 2025. https://web.stanford.edu/~jurafsky/slp3.
- Second Edition: https://home.cs.colorado.edu/~martin/slp.html
🏫 CS224N Natural Language Processing with Deep Learning 🔗 https://web.stanford.edu/class/cs224n/
The following texts are useful, but none are required. All of them can be read free online.
- Dan Jurafsky and James H. Martin. Speech and Language Processing (2024 pre-release)
- Jacob Eisenstein. Natural Language Processing
- Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning
- Delip Rao and Brian McMahan. Natural Language Processing with PyTorch (requires Stanford login).
- Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Natural Language Processing with Transformers
If you have no background in neural networks but would like to take the course anyway, you might well find one of these books helpful to give you more background:
- Michael A. Nielsen. Neural Networks and Deep Learning
- Eugene Charniak. Introduction to Deep Learning
CMU 11-711: Advanced Natural Language Processing (ANLP)
- https://www.phontron.com/class/anlp-fall2024/
- This is a graduate-level NLP course spanning introductory through advanced material: word representations and sequence modeling, attention mechanisms and the Transformer architecture, and on to large-scale language model pre-training, instruction fine-tuning, complex reasoning, multimodality, and safety. Compared with similar courses, it offers:
	- Comprehensive, up-to-date content: beyond the classical algorithms, it covers recent large-model methods in depth (e.g., LLaMA, GPT-4).
	- A strong hands-on emphasis: each lecture pairs code demos with online quizzes, and the end-of-term project requires reproducing and improving a recent paper.
	- Good interactivity: Piazza discussions, Canvas quizzes, and live Q&A make for an immersive, well-paced learning experience.
- Self-study suggestions:
	- Read the recommended pre-lecture materials, working through them in order.
	- Set up a Python environment and get comfortable with PyTorch/Hugging Face, since most of the hands-on code examples build on them.
	- Complete the course assignments thoroughly.
↗ Logic (and Critical Thinking) ↗ Mathematics
↗ Formal Semantics and Programming Language ↗ Formal Syntax & Metasyntax (and Metalanguage)
↗ Computer Languages & Programming Methodology
[!links] ↗ Learning English the Right Way
🔗 https://home.cs.colorado.edu/~martin/slp.html Speech and Language Processing, 2nd ed, 2009
To summarize, engaging in complex language behavior requires various kinds of knowledge of language:
- Phonetics and Phonology — knowledge about linguistic sounds
- Morphology — knowledge of the meaningful components of words
- Syntax — knowledge of the structural relationships between words
- Semantics — knowledge of meaning
- Pragmatics — knowledge of the relationship of meaning to the goals and intentions of the speaker
- Discourse — knowledge about linguistic units larger than a single utterance
🔗 https://home.cs.colorado.edu/~martin/slp.html Speech and Language Processing, 2nd ed, 2009
A perhaps surprising fact about these categories of linguistic knowledge is that most tasks in speech and language processing can be viewed as resolving ambiguity at one of these levels. We say some input is ambiguous if multiple, alternative linguistic structures can be built for it. Consider the spoken sentence I made her duck. Here are five different meanings this sentence could have (see if you can think of some more), each of which exemplifies an ambiguity at some level:
- (1.5) I cooked waterfowl for her.
- (1.6) I cooked waterfowl belonging to her.
- (1.7) I created the (plaster?) duck she owns.
- (1.8) I caused her to quickly lower her head or body.
- (1.9) I waved my magic wand and turned her into undifferentiated waterfowl.
We often introduce the models and algorithms we present throughout the book as ways to resolve or disambiguate these ambiguities. For example, deciding whether duck is a verb or a noun can be solved by part-of-speech tagging. Deciding whether make means “create” or “cook” can be solved by word sense disambiguation. Resolution of part-of-speech and word sense ambiguities are two important kinds of lexical disambiguation. A wide variety of tasks can be framed as lexical disambiguation problems. For example, a text-to-speech synthesis system reading the word lead needs to decide whether it should be pronounced as in lead pipe or as in lead me on. By contrast, deciding whether her and duck are part of the same entity (as in (1.5) or (1.8)) or are different entities (as in (1.6)) is an example of syntactic disambiguation and can be addressed by probabilistic parsing. We also consider ambiguities that don’t arise in this particular example, such as determining whether a sentence is a statement or a question (which can be resolved by speech act interpretation).
[!links] ↗ Mathematical Modeling & Abstraction ↗ Models of Computation & Abstract Machines ↗ (Formal) Model Checking
↗ Automata Theory and (Formal) Language Theory ↗ Classical Logic (Standard Formal Logic) ↗ Probabilistic Models (Distributions) & Stochastic Process
See below "📜 A Brief History of The Technical Evolution Of Language Models"
🔗 https://home.cs.colorado.edu/~martin/slp.html Speech and Language Processing, 2nd ed, 2009
One of the key insights of the last 50 years of research in language processing is that the various kinds of knowledge described in the last sections can be captured through the use of a small number of formal models or theories. Fortunately, these models and theories are all drawn from the standard toolkits of computer science, mathematics, and linguistics and should be generally familiar to those trained in those fields. Among the most important models are state machines, rule systems, logic, probabilistic models, and vector-space models. These models, in turn, lend themselves to a small number of algorithms, among the most important of which are state space search algorithms, such as dynamic programming, and machine learning algorithms, such as classifiers and Expectation Maximization (EM) and other learning algorithms.
- In their simplest formulation, ==state machines== are formal models that consist of states, transitions among states, and an input representation. Some of the variations of this basic model that we will consider are deterministic and non-deterministic finite-state automata and finite-state transducers.
- Closely related to these models are their declarative counterparts: ==formal rule systems==. Among the more important ones we consider (in both probabilistic and non-probabilistic formulations) are regular grammars and regular relations, context-free grammars, and feature-augmented grammars. State machines and formal rule systems are the main tools used when dealing with knowledge of phonology, morphology, and syntax.
- A third class of models that plays a critical role in capturing knowledge of language are models based on ==logic==. We discuss first-order logic, also known as the predicate calculus, as well as such related formalisms as lambda-calculus, feature structures, and semantic primitives. These logical representations have traditionally been used for modeling semantics and pragmatics, although more recent work has tended to focus on potentially more robust techniques drawn from non-logical lexical semantics.
- ==Probabilistic models== are crucial for capturing every kind of linguistic knowledge. Each of the other models (state machines, formal rule systems, and logic) can be augmented with probabilities. For example, the state machine can be augmented with probabilities to become the weighted automaton, or Markov model. We spend a significant amount of time on hidden Markov models or HMMs, which are used everywhere in the field, in part-of-speech tagging, speech recognition, dialogue understanding, text-to-speech, and machine translation. The key advantage of probabilistic models is their ability to solve the many kinds of ambiguity problems that we discussed earlier; almost any speech and language processing problem can be recast as “given N choices for some ambiguous input, choose the most probable one”.
- Finally, ==vector-space models==, based on linear algebra, underlie information retrieval and many treatments of word meanings. (Neural Networks & Language Models)
Processing language with any of these models typically involves a search through a space of states representing hypotheses about an input.
- In speech recognition, we search through a space of phone sequences for the correct word. In parsing, we search through a space of trees for the syntactic parse of an input sentence.
- In machine translation, we search through a space of translation hypotheses for the correct translation of a sentence into another language.
- For non-probabilistic tasks, such as tasks involving state machines, we use well-known graph algorithms such as depth-first search.
- For probabilistic tasks, we use heuristic variants such as best-first and A* search and rely on dynamic programming algorithms for computational tractability.
- Machine learning tools such as classifiers and sequence models play a significant role in many language processing tasks. Based on attributes describing each object, a classifier attempts to assign a single object to a single class while a sequence model attempts to jointly classify a sequence of objects into a sequence of classes.
For example, in the task of deciding whether a word is spelled correctly, classifiers such as decision trees, support vector machines, Gaussian mixture models, and logistic regression could be used to make a binary decision (correct or incorrect) for one word at a time. Sequence models such as hidden Markov models, maximum entropy Markov models, and conditional random fields could be used to assign correct/incorrect labels to all the words in a sentence at once.
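As a concrete sketch of a sequence model, the Viterbi algorithm uses dynamic programming to find the most probable state (tag) sequence under an HMM. The states, transition, and emission probabilities below are invented for illustration, not taken from the text:

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most probable HMM state sequence for `obs`, via log-space dynamic programming."""
    # Initialize with start probabilities times first emission.
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            # Best predecessor state for landing in s at this step.
            best = max(states, key=lambda r: V[-1][r] + math.log(trans[r][s]))
            scores[s] = V[-1][best] + math.log(trans[best][s]) + math.log(emit[s][o])
            ptr[s] = best
        V.append(scores)
        back.append(ptr)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy POS-tagging instance: tag "the duck".
states = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.2, "VERB": 0.2}
trans = {
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.3,  "NOUN": 0.2, "VERB": 0.5},
    "VERB": {"DET": 0.5,  "NOUN": 0.3, "VERB": 0.2},
}
emit = {
    "DET":  {"the": 0.9,  "duck": 0.01},
    "NOUN": {"the": 0.01, "duck": 0.6},
    "VERB": {"the": 0.01, "duck": 0.4},
}
print(viterbi(["the", "duck"], states, start, trans, emit))  # ['DET', 'NOUN']
```

Because the scores are joint over the whole sequence, the tag chosen for duck depends on the tag chosen for the, which is exactly what distinguishes a sequence model from a per-word classifier.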
Finally, researchers in language processing use many of the same methodological tools that are used in machine learning research—the use of distinct training and test sets, statistical techniques like cross-validation, and careful evaluation of trained systems.
🔗 https://home.cs.colorado.edu/~martin/slp.html Speech and Language Processing, 2nd ed, 2009
Intelligence and Turing Test To many, the ability of computers to process language as skillfully as we humans do will signal the arrival of truly intelligent machines. The basis of this belief is the fact that the effective use of language is intertwined with our general cognitive abilities. Among the first to consider the computational implications of this intimate connection was Alan Turing (1950). In this famous paper, Turing introduced what has come to be known as the Turing test. Turing began with the thesis that the question of what it would mean for a machine to think was essentially unanswerable because of the inherent imprecision in the terms machine and think. Instead, he suggested an empirical test, a game, in which a computer’s use of language would form the basis for determining if the machine could think. If the machine could win the game, it would be judged intelligent.
In Turing’s game, there are three participants: two people and a computer. One of the people is a contestant who plays the role of an interrogator. To win, the interrogator must determine which of the other two participants is the machine by asking a series of questions via a teletype. The task of the machine is to fool the interrogator into believing it is a person by responding as a person would to the interrogator’s questions. The task of the second human participant is to convince the interrogator that the other participant is the machine and that she is human.
==The classic definition of a language model (LM) is a probability distribution over sequences of tokens.== Suppose we have a vocabulary $V$ of tokens; a language model $p$ assigns each sequence of tokens $x_1, \dots, x_L \in V$ a probability (a number between 0 and 1).
For example, the LM should assign $(𝗆𝗈𝗎𝗌𝖾, 𝗍𝗁𝖾, 𝗍𝗁𝖾, 𝖼𝗁𝖾𝖾𝗌𝖾, 𝖺𝗍𝖾)$ a very low probability implicitly because it’s ungrammatical (syntactic knowledge). The LM should assign $(𝗍𝗁𝖾, 𝗆𝗈𝗎𝗌𝖾, 𝖺𝗍𝖾, 𝗍𝗁𝖾, 𝖼𝗁𝖾𝖾𝗌𝖾)$ higher probability than $(𝗍𝗁𝖾, 𝖼𝗁𝖾𝖾𝗌𝖾, 𝖺𝗍𝖾, 𝗍𝗁𝖾, 𝗆𝗈𝗎𝗌𝖾)$ implicitly because of world knowledge: both sentences are the same syntactically, but they differ in semantic plausibility.
Generation. As defined, a language model $p$ takes a sequence and returns a probability to assess its goodness. We can also generate a sequence given a language model. The purest way to do this is to sample a sequence $x_{1:L}$ from the language model $p$ with probability equal to $p(x_{1:L})$, denoted $x_{1:L} \sim p$.
Autoregressive language models
A common way to write the joint distribution $p(x_{1:L})$ of a sequence $x_{1:L}$ is using the chain rule of probability: $$p(x_{1:L}) = p(x_1)\,p(x_2 \mid x_1)\cdots p(x_L \mid x_{1:L-1}) = \prod_{i=1}^{L} p(x_i \mid x_{1:i-1}).$$
Of course, any joint probability distribution can be written this way mathematically, but an autoregressive language model is one where each conditional distribution $p(x_i \mid x_{1:i-1})$ can be computed efficiently (e.g., using a feedforward neural network).
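As a toy illustration of the chain rule, a bigram model (where each conditional depends only on the previous token) scores a sequence as a product of conditionals. The vocabulary and probabilities below are made up for illustration, not from the text:

```python
# Toy autoregressive bigram LM: p(x_{1:L}) = prod_i p(x_i | x_{i-1}),
# with "<s>" as a start-of-sequence marker. All numbers are invented.
bigram = {
    ("<s>", "the"): 0.9, ("<s>", "mouse"): 0.1,
    ("the", "mouse"): 0.4, ("the", "cheese"): 0.6,
    ("mouse", "ate"): 0.8, ("mouse", "the"): 0.2,
    ("cheese", "ate"): 0.05, ("cheese", "the"): 0.95,
    ("ate", "the"): 1.0,
}

def p(tokens):
    """Score a sequence with the chain rule, using bigram conditionals."""
    prob = 1.0
    for prev, cur in zip(["<s>"] + tokens, tokens):
        prob *= bigram.get((prev, cur), 1e-6)  # tiny floor for unseen pairs
    return prob

likely = p(["the", "mouse", "ate", "the", "cheese"])
unlikely = p(["the", "cheese", "ate", "the", "mouse"])
assert likely > unlikely  # world knowledge baked into the conditionals
```

Even this tiny model reproduces the preference discussed above: the semantically plausible sentence gets higher probability because its individual conditionals are larger.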
Generation. Now to generate an entire sequence $x_{1:L}$ from an autoregressive language model $p$, we sample one token at a time given the tokens generated so far: $$\text{for } i = 1, \dots, L: \quad x_i \sim p(x_i \mid x_{1:i-1})^{1/T},$$ where $T \geq 0$ is a temperature parameter that controls how much randomness we want from the language model:
- $T=0$: deterministically choose the most probable token $x_i$ at each position $i$
- $T=1$: sample “normally” from the pure language model
- $T=\infty$: sample from a uniform distribution over the entire vocabulary
However, if we just raise the probabilities to the power $1/T$, the probability distribution may not sum to 1. We can fix this by re-normalizing the distribution. We call the normalized version $p_T(x_i∣x_{1:i−1})∝p(x_i∣x_{1:i−1})^{1/T}$ the annealed conditional probability distribution. For example: $$\begin{align} & p(𝖼𝗁𝖾𝖾𝗌𝖾)=0.4, & p(𝗆𝗈𝗎𝗌𝖾)=0.6 \\ & p_{T=0.5}(𝖼𝗁𝖾𝖾𝗌𝖾)=0.31, & p_{T=0.5}(𝗆𝗈𝗎𝗌𝖾)=0.69 \\ & p_{T=0.2}(𝖼𝗁𝖾𝖾𝗌𝖾)=0.12, & p_{T=0.2}(𝗆𝗈𝗎𝗌𝖾)=0.88 \\ & p_{T=0}(𝖼𝗁𝖾𝖾𝗌𝖾)=0, & p_{T=0}(𝗆𝗈𝗎𝗌𝖾)=1 \end{align}$$
Aside: Annealing is a reference to metallurgy, where hot materials are cooled gradually, and shows up in sampling and optimization algorithms such as simulated annealing.
Technical note: sampling iteratively with a temperature $T$ parameter applied to each conditional distribution $p(x_i∣x_{1:i−1})^{1/T}$ is not equivalent (except when $T=1$) to sampling from the annealed distribution over length $L$ sequences.
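The annealed distributions in the worked example above can be checked numerically. A minimal sketch (the `anneal` helper is ours, not from the notes):

```python
def anneal(p, T):
    """Renormalize p(x)^(1/T); as T -> 0, mass concentrates on the argmax."""
    if T == 0:  # limiting case: all probability on the most probable token
        best = max(p, key=p.get)
        return {x: float(x == best) for x in p}
    weights = {x: q ** (1 / T) for x, q in p.items()}
    z = sum(weights.values())  # normalizing constant
    return {x: w / z for x, w in weights.items()}

p = {"cheese": 0.4, "mouse": 0.6}
for T in (0.5, 0.2, 0):
    print(T, anneal(p, T))  # matches 0.31/0.69, 0.12/0.88, 0/1 after rounding
```

Note that `T > 1` flattens the distribution toward uniform, while `T < 1` sharpens it, which is exactly the variability knob described above.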
Conditional generation. More generally, we can perform conditional generation by specifying some prefix sequence $x_{1:i}$ (called a prompt) and sampling the rest $x_{i+1:L}$ (called the completion). For example, generating with $T=0$ produces (demo): $$\underbrace{\text{the, mouse, ate}}_{\text{prompt}} \;\; \xrightarrow[T=0]{} \;\; \underbrace{\text{the, cheese}}_{\text{completion}}$$ If we change the temperature to $T=1$, we can get more variety (demo), for example, ${𝗂𝗍𝗌, 𝗁𝗈𝗎𝗌𝖾, 𝖺𝗇𝖽, 𝗆𝗒, 𝗁𝗈𝗆𝖾𝗐𝗈𝗋𝗄}$.
As we’ll see shortly, conditional generation unlocks the ability for language models to solve a variety of tasks by simply changing the prompt.
Summary
- A language model is a probability distribution $p$ over sequences $x_{1:L}$.
- Intuitively, a good language model should have linguistic capabilities and world knowledge.
- An autoregressive language model allows for efficient generation of a completion $x_{i+1:L}$ given a prompt $x_{1:i}$.
- The temperature can be used to control the amount of variability in generation.
[!links] ↗ Pre-Training ↗ Post-Training & Fine Tuning ↗ LLM Utilization & Prompt, Context, and Harness Engineering
🔗 https://stanford-cs324.github.io/winter2022/lectures/capabilities/
Recall that a language model $p$ is a distribution over sequences of tokens $x_{1:L}$ and thus can be used to score sequences.
==A task is a mapping from inputs to outputs.== For example, for question answering, we might have:
Input: What school did burne hogarth establish?
Output: School of Visual Arts
We use the term adaptation to refer to the process of taking a language model and turning it into a task model, given:
- a natural language description of the task, and
- a set of training instances (input-output pairs).
There are two primary ways to perform adaptation:
- Training (standard supervised learning): train a new model that maps inputs to outputs, either by
- creating a new model that uses the language model as features (probing), or
- starting with the language model and updating it based on the training instances (fine-tuning), or
- something in between (lightweight fine-tuning).
- Prompting (in-context learning): Construct a prompt (a string based on the description and training instances) or a set of prompts, feed those into a language model to obtain completions.
- Zero-shot learning: number of training examples is 0
- One-shot learning: number of training examples is 1
- Few-shot learning: number of training examples is few
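A minimal sketch of how a few-shot prompt might be assembled from training instances. The plain-text `Input:`/`Output:` layout is illustrative (GPT-3 prompts used similar textual formats, but the exact template is our assumption):

```python
def build_prompt(description, examples, query):
    """Pack a task description and k in-context examples into one prompt string."""
    lines = [description, ""]
    for x, y in examples:  # k = 0, 1, or a few (zero-/one-/few-shot)
        lines.append(f"Input: {x}")
        lines.append(f"Output: {y}")
        lines.append("")
    # The model is asked to continue from the trailing "Output:".
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_prompt(
    "Answer the question.",
    [("What school did burne hogarth establish?", "School of Visual Arts")],
    "Who founded information theory?",
)
print(prompt)
```

The completion sampled after the final `Output:` is then taken as the task prediction; no model parameters are updated.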
Which adaptation procedure should we go with?
- Training can be challenging due to overfitting (just imagine fine-tuning a 175 billion parameter model based on 5 examples). How to do this effectively will be the topic of the adaptation lecture.
- For now, we will be content with adaptation of GPT-3 using prompting. Note that the limitation of prompting is that we can only leverage a small number of training instances (as many as can fit into a prompt). This is due to a limitation of Transformers, where the prompt and the completion must fit into 2048 tokens.
The GPT-3 paper evaluated GPT-3 on a large set of tasks. We will consider a subset of these, and for each task, discuss the following:
- Definition: What is the task and its motivation?
- Adaptation: How do we reduce the task to language modeling (via prompting)?
- Results: What are the quantitative numbers compared to task-specific state-of-the-art models?
Size and number of examples matter. By default, the results will be based on
- the full GPT-3 model (davinci), which has 175 billion parameters
- using in-context learning with as many training instances as you can stuff into the prompt.
Along the way, we will do ablations to see if model size and number of in-context training instances matter. Spoiler: they do, and more is better.
The tasks are grouped as follows:
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., … Wen, J.-R. (2025). A Survey of Large Language Models (arXiv:2303.18223). arXiv. https://doi.org/10.48550/arXiv.2303.18223
🔗 https://stanford-cs324.github.io/winter2022/lectures/introduction/#a-brief-history
Information theory. Language models date back to Claude Shannon, who founded information theory in 1948 with his seminal paper, A Mathematical Theory of Communication. In this paper, he introduced the entropy of a distribution as $$H(p) = \sum_x p(x) \log \frac{1}{p(x)}.$$
- The lower the entropy, the more “structured” the sequence is, and the shorter the code length.
- Intuitively, $\log \frac{1}{p(x)}$ is the length of the code used to represent an element $x$ that occurs with probability $p(x)$.
- If $p(x)=\frac{1}{8}$, we should allocate $\log_2 8 = 3$ bits (equivalently, $\log 8 = 2.08$ nats).
Aside: actually achieving the Shannon limit is non-trivial (e.g., LDPC codes) and is the topic of coding theory.
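The bits-vs-nats arithmetic above is easy to verify. A small sketch (the `entropy` helper is ours, for illustration):

```python
import math

def entropy(p, base=2):
    """H(p) = sum_x p(x) * log(1/p(x)); base 2 gives bits, base e gives nats."""
    return sum(q * math.log(1 / q, base) for q in p.values() if q > 0)

# A uniform distribution over 8 outcomes costs log2(8) = 3 bits per sample,
# or equivalently ln(8) ~= 2.08 nats.
uniform8 = {x: 1 / 8 for x in range(8)}
print(entropy(uniform8))          # ~ 3.0 bits
print(entropy(uniform8, math.e))  # ~ 2.08 nats
```

A skewed distribution over the same 8 outcomes has strictly lower entropy, matching the intuition that more “structured” sequences admit shorter codes.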
Entropy of English. Shannon was particularly interested in measuring the entropy of English, represented as a sequence of letters. This means we imagine that there is a “true” distribution $p$ out there (the existence of this is questionable, but it’s still a useful mathematical abstraction) that can spout out samples of English text $x∼p$.
Shannon also defined cross entropy, $$H(p, q) = \sum_x p(x) \log \frac{1}{q(x)},$$ the expected number of bits (nats) needed to encode a sample $x∼p$ using the compression scheme given by the model $q$ (representing $x$ with a code of length $\log \frac{1}{q(x)}$).
Estimating entropy via language modeling. A crucial property is that the cross entropy $H(p,q)$ upper bounds the entropy $H(p)$: $$H(p, q) \geq H(p).$$
So we can get better estimates of the entropy $H(p)$ by constructing better models $q$, as measured by $H(p,q)$.
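The upper-bound property can be demonstrated on a toy distribution. A minimal sketch (the `cross_entropy` helper and the two distributions are ours):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = sum_x p(x) * log(1/q(x)), in nats."""
    return sum(pp * math.log(1 / q[x]) for x, pp in p.items() if pp > 0)

p = {"a": 0.5, "b": 0.5}   # the "true" distribution
q = {"a": 0.9, "b": 0.1}   # a mis-calibrated model of p

# H(p, p) = H(p), and any other model q can only do worse.
assert cross_entropy(p, q) >= cross_entropy(p, p)
print(cross_entropy(p, p), cross_entropy(p, q))
```

The gap between the two numbers is exactly the KL divergence from $p$ to $q$, so driving $H(p,q)$ down means making $q$ a better model of $p$.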
Shannon game (human language model). Shannon first used n-gram models as $q$ in 1948, but in his 1951 paper Prediction and Entropy of Printed English, he introduced a clever scheme (known as the Shannon game) where $q$ was provided by a human: people were shown a prefix of English text and asked to guess the next letter, and the statistics of their guesses yield an estimate of the entropy of English.
Language models were first used in practical applications that required generation of text:
- speech recognition in the 1970s (input: acoustic signal, output: text), and
- machine translation in the 1990s (input: text in a source language, output: text in a target language).
Noisy channel model. The dominant paradigm for solving these tasks then was the noisy channel model. Taking speech recognition as an example:
- We posit that there is some text sampled from some distribution $p$.
- This text becomes realized to speech (acoustic signals).
- Then given the speech, we wish to recover the (most likely) text. This can be done via Bayes rule: $$p(\text{text} \mid \text{speech}) \propto \underbrace{p(\text{text})}_{\text{Language Model}} \; \underbrace{p(\text{speech} \mid \text{text})}_{\text{Acoustic Model}}$$ Speech recognition and machine translation systems used n-gram language models over words (first introduced by Shannon, but for characters).
N-gram models. In an n-gram model, the prediction of a token $x_i$ depends only on the last $n-1$ tokens $x_{i-(n-1):i-1}$ rather than the full history: $$p(x_i \mid x_{1:i-1}) = p(x_i \mid x_{i-(n-1):i-1}).$$
Fitting n-gram models to data is extremely computationally cheap and scalable. As a result, n-gram models were trained on massive amounts of text. For example, Brants et al. (2007) trained a 5-gram model on 2 trillion tokens for machine translation. In comparison, GPT-3 was trained on only 300 billion tokens. However, n-gram models were fundamentally limited: if the word that matters for a prediction occurred more than $n-1$ tokens earlier in the prefix, the model simply cannot use it.
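Fitting an n-gram model really is just counting. A minimal sketch with a made-up toy corpus (maximum-likelihood estimates, no smoothing):

```python
from collections import Counter, defaultdict

def fit_ngram(tokens, n=2):
    """Estimate p(x_i | x_{i-(n-1):i-1}) from raw counts (no smoothing)."""
    ctx_counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        ctx = tuple(tokens[i - (n - 1):i])  # the n-1 preceding tokens
        ctx_counts[ctx][tokens[i]] += 1
    # Normalize counts into conditional distributions per context.
    return {ctx: {w: c / sum(cs.values()) for w, c in cs.items()}
            for ctx, cs in ctx_counts.items()}

corpus = "the mouse ate the cheese and the mouse slept".split()
model = fit_ngram(corpus, n=2)
print(model[("the",)])  # distribution over words that follow "the"
```

Because it is one pass of counting, this scales to trillions of tokens, but the fixed-size context tuple is also exactly where the fundamental limitation above comes from.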
[!links] ↗ Artificial Neural Networks (ANN) & Deep Learning Methods ↗ Neural Network Models
🔗 https://stanford-cs324.github.io/winter2022/lectures/introduction/#neural-language-models
An important step forward for language models was the introduction of neural networks. Bengio et al., 2003 pioneered neural language models, in which the conditional distribution $p(x_i \mid x_{1:i-1})$ is computed by a neural network over the preceding context.
Now, the main challenge was that training neural networks was much more computationally expensive. They trained a model on only 14 million words and showed that it outperformed n-gram models trained on the same amount of data. But since n-gram models were more scalable and data was not a bottleneck, n-gram models continued to dominate for at least another decade.
Since 2003, two other key developments in neural language modeling include:
- Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks, allowed the conditional distribution of a token $x_i$ to depend on the entire context $x_{1:i−1}$ (effectively $n=\infty$), but these were hard to train.
- Transformers are a more recent architecture (developed for machine translation in 2017) that again returned to having a fixed context length $n$, but were much easier to train (and exploited the parallelism of GPUs). Also, $n$ could be made “large enough” for many applications (GPT-3 used $n=2048$).
↗ LLM (Large Language Model) /The Technical Evolution of LLM & Future Directions ↗ Transformers