
Natural Language Processing (NLP) & Computational Linguistics

[TOC]

Res

Related Topics

Linguistics · Ordinary Language Philosophy · Philosophy of Language

Formal System, Formal Logics, and Its Semantics · Automata Theory and (Formal) Language Theory · Formal Syntax & Metasyntax (and Metalanguage)

Algebraic Structure & Abstract Algebra & Modern Algebra

Computer Languages & Programming Methodology · Programming Language Processing & Program Execution

LLM (Large Language Model) · Knowledge Graph (KG)

Learning Resources

[!links] ↗ LLM (Large Language Model)

Daniel Jurafsky and James H. Martin. 2025. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd edition. Online manuscript released August 24, 2025. https://web.stanford.edu/~jurafsky/slp3.

Volume I: Large Language Models

| Chapter | Slides |
| --- | --- |
| 1: Introduction | |
| 2: Words and Tokens | 2: Words and Tokens [pptx] [pdf] · 2: Edit Distance [pptx] [pdf] |
| 3: N-gram Language Models | 3: [pptx] [pdf] |
| 4: Logistic Regression and Text Classification | 4: [pptx] [pdf] |
| 5: Embeddings | 5: [pptx] [pdf] |
| 6: Neural Networks | 6: [pptx] [pdf] |
| 7: Large Language Models | 7: [pptx] [pdf] |
| 8: Transformers | 8: [pptx] [pdf] |
| 9: Post-training: Instruction Tuning, Alignment, and Test-Time Compute | |
| 10: Masked Language Models | 10: [pptx] [pdf] |
| 11: Information Retrieval and Retrieval-Augmented Generation | |
| 12: Machine Translation | |
| 13: RNNs and LSTMs | 13: [pptx] [pdf] |
| 14: Phonetics and Speech Feature Extraction | |
| 15: Automatic Speech Recognition | |
| 16: Text-to-Speech | |

Volume II: Annotating Linguistic Structure

| Chapter | Slides |
| --- | --- |
| 17: Sequence Labeling for Parts of Speech and Named Entities | 17: (Intro only) [pptx] [pdf] |
| 18: Context-Free Grammars and Constituency Parsing | |
| 19: Dependency Parsing | |
| 20: Information Extraction: Relations, Events, and Time | |
| 21: Semantic Role Labeling and Argument Structure | |
| 22: Lexicons for Sentiment, Affect, and Connotation | |
| 23: Coreference Resolution and Entity Linking | |
| 24: Discourse Coherence | |
| 25: Conversation and its Structure | |
Appendix (will be just on the web)
A: Hidden Markov Models
B: Naive Bayes Classification [pptx] [pdf]
C: Kneser-Ney Smoothing
D: Spelling Correction and the Noisy Channel
E: Statistical Constituency Parsing
F: Context-Free Grammars
G: Combinatory Categorial Grammar
H: Logical Representations of Sentence Meaning
I: Word Senses and WordNet
J: PPMI
K: Frame-based Dialogue Systems

🏫 CS224N Natural Language Processing with Deep Learning 🔗 https://web.stanford.edu/class/cs224n/ The following texts are useful, but none are required. All of them can be read free online.

CMU 11-711: Advanced Natural Language Processing (ANLP)

  • https://www.phontron.com/class/anlp-fall2024/
  • A graduate-level NLP course covering both fundamentals and advanced material: from word representations and sequence modeling, to attention and the Transformer architecture, through to frontier topics such as large-scale language model pretraining, instruction tuning and complex reasoning, multimodality, and safety. Compared with similar courses, this one is:
    • Comprehensive and current: beyond classical algorithms, it covers recent large-model methods in depth (e.g., LLaMA, GPT-4).
    • Hands-on: each lecture comes with code demos and online quizzes, and the end-of-term project requires reproducing and improving a recent paper.
    • Interactive: Piazza discussions, Canvas quizzes, and live Q&A make for an immersive, well-paced learning experience.
  • Self-study tips:
    • Read the recommended pre-lecture materials ahead of time, following the reading order step by step.
    • Set up a Python environment and get comfortable with PyTorch/Hugging Face, since most of the hands-on code examples rely on them.
    • Complete the course assignments thoroughly.

Other Resources

Intro

Language and Language Processing

Philosophy of Language

Logic (and Critical Thinking) · Mathematics

Formal Semantics and Programming Language · Formal Syntax & Metasyntax (and Metalanguage)

Computer Languages & Programming Methodology

NLP Background & Classical Approaches

Knowledge in Speech and Language Processing

[!links] ↗ Learning English the Right Way

🔗 https://home.cs.colorado.edu/~martin/slp.html Speech and Language Processing, 2nd ed, 2009

To summarize, engaging in complex language behavior requires various kinds of knowledge of language:

  • Phonetics and Phonology — knowledge about linguistic sounds
  • Morphology — knowledge of the meaningful components of words
  • Syntax — knowledge of the structural relationships between words
  • Semantics — knowledge of meaning
  • Pragmatics — knowledge of the relationship of meaning to the goals and intentions of the speaker
  • Discourse — knowledge about linguistic units larger than a single utterance

Ambiguity (?🤔)

🔗 https://home.cs.colorado.edu/~martin/slp.html Speech and Language Processing, 2nd ed, 2009

A perhaps surprising fact about these categories of linguistic knowledge is that most tasks in speech and language processing can be viewed as resolving ambiguity at one of these levels. We say some input is ambiguous if multiple, alternative linguistic structures can be built for it. Consider the spoken sentence I made her duck. Here are five different meanings this sentence could have (see if you can think of some more), each of which exemplifies an ambiguity at some level:

  • (1.5) I cooked waterfowl for her.
  • (1.6) I cooked waterfowl belonging to her.
  • (1.7) I created the (plaster?) duck she owns.
  • (1.8) I caused her to quickly lower her head or body.
  • (1.9) I waved my magic wand and turned her into undifferentiated waterfowl.

We often introduce the models and algorithms we present throughout the book as ways to resolve or disambiguate these ambiguities. For example, deciding whether duck is a verb or a noun can be solved by part-of-speech tagging. Deciding whether make means “create” or “cook” can be solved by word sense disambiguation. Resolution of part-of-speech and word sense ambiguities are two important kinds of lexical disambiguation. A wide variety of tasks can be framed as lexical disambiguation problems. For example, a text-to-speech synthesis system reading the word lead needs to decide whether it should be pronounced as in lead pipe or as in lead me on. By contrast, deciding whether her and duck are part of the same entity (as in (1.5) or (1.8)) or are different entities (as in (1.6)) is an example of syntactic disambiguation and can be addressed by probabilistic parsing. We also consider ambiguities that don’t arise in this particular example, such as determining whether a sentence is a statement or a question (which can be resolved by speech act interpretation).

Models and Algorithms

[!links] ↗ Mathematical Modeling & Abstraction · Models of Computation & Abstract Machines · (Formal) Model Checking

Automata Theory and (Formal) Language Theory · Classical Logic (Standard Formal Logic) · Probabilistic Models (Distributions) & Stochastic Process

See below "📜 A Brief History of The Technical Evolution Of Language Models"

🔗 https://home.cs.colorado.edu/~martin/slp.html Speech and Language Processing, 2nd ed, 2009

One of the key insights of the last 50 years of research in language processing is that the various kinds of knowledge described in the last sections can be captured through the use of a small number of formal models or theories. Fortunately, these models and theories are all drawn from the standard toolkits of computer science, mathematics, and linguistics and should be generally familiar to those trained in those fields. Among the most important models are state machines, rule systems, logic, probabilistic models, and vector-space models. These models, in turn, lend themselves to a small number of algorithms, among the most important of which are state space search algorithms, such as dynamic programming, and machine learning algorithms, such as classifiers and Expectation Maximization (EM) and other learning algorithms.

  • In their simplest formulation, ==state machines== are formal models that consist of states, transitions among states, and an input representation. Some of the variations of this basic model that we will consider are deterministic and non-deterministic finite-state automata and finite-state transducers.
  • Closely related to these models are their declarative counterparts: ==formal rule systems==. Among the more important ones we consider (in both probabilistic and non-probabilistic formulations) are regular grammars and regular relations, context-free grammars, and feature-augmented grammars. State machines and formal rule systems are the main tools used when dealing with knowledge of phonology, morphology, and syntax.
  • A third class of models that plays a critical role in capturing knowledge of language are models based on ==logic==. We discuss first-order logic, also known as the predicate calculus, as well as such related formalisms as lambda-calculus, feature structures, and semantic primitives. These logical representations have traditionally been used for modeling semantics and pragmatics, although more recent work has tended to focus on potentially more robust techniques drawn from non-logical lexical semantics.
  • ==Probabilistic models== are crucial for capturing every kind of linguistic knowledge. Each of the other models (state machines, formal rule systems, and logic) can be augmented with probabilities. For example, the state machine can be augmented with probabilities to become the weighted automaton, or Markov model. We spend a significant amount of time on hidden Markov models or HMMs, which are used everywhere in the field, in part-of-speech tagging, speech recognition, dialogue understanding, text-to-speech, and machine translation. The key advantage of probabilistic models is their ability to solve the many kinds of ambiguity problems that we discussed earlier; almost any speech and language processing problem can be recast as “given N choices for some ambiguous input, choose the most probable one”.
  • Finally, ==vector-space models==, based on linear algebra, underlie information retrieval and many treatments of word meanings. (Neural Networks & Language Models)

Processing language with any of these models typically involves a search through a space of states representing hypotheses about an input.

  • In speech recognition, we search through a space of phone sequences for the correct word. In parsing, we search through a space of trees for the syntactic parse of an input sentence.
  • In machine translation, we search through a space of translation hypotheses for the correct translation of a sentence into another language.
  • For non-probabilistic tasks, such as tasks involving state machines, we use well-known graph algorithms such as depth-first search.
  • For probabilistic tasks, we use heuristic variants such as best-first and A* search and rely on dynamic programming algorithms for computational tractability.
  • Machine learning tools such as classifiers and sequence models play a significant role in many language processing tasks. Based on attributes describing each object, a classifier attempts to assign a single object to a single class while a sequence model attempts to jointly classify a sequence of objects into a sequence of classes.

For example, in the task of deciding whether a word is spelled correctly, classifiers such as decision trees, support vector machines, Gaussian mixture models, and logistic regression could be used to make a binary decision (correct or incorrect) for one word at a time. Sequence models such as hidden Markov models, maximum entropy Markov models, and conditional random fields could be used to assign correct/incorrect labels to all the words in a sentence at once.

Finally, researchers in language processing use many of the same methodological tools that are used in machine learning research—the use of distinct training and test sets, statistical techniques like cross-validation, and careful evaluation of trained systems.

Language, Thought, and Understanding

🔗 https://home.cs.colorado.edu/~martin/slp.html Speech and Language Processing, 2nd ed, 2009

Intelligence and Turing Test To many, the ability of computers to process language as skillfully as we humans do will signal the arrival of truly intelligent machines. The basis of this belief is the fact that the effective use of language is intertwined with our general cognitive abilities. Among the first to consider the computational implications of this intimate connection was Alan Turing (1950). In this famous paper, Turing introduced what has come to be known as the Turing test. Turing began with the thesis that the question of what it would mean for a machine to think was essentially unanswerable because of the inherent imprecision in the terms machine and think. Instead, he suggested an empirical test, a game, in which a computer’s use of language would form the basis for determining if the machine could think. If the machine could win the game, it would be judged intelligent.

In Turing’s game, there are three participants: two people and a computer. One of the people is a contestant who plays the role of an interrogator. To win, the interrogator must determine which of the other two participants is the machine by asking a series of questions via a teletype. The task of the machine is to fool the interrogator into believing it is a person by responding as a person would to the interrogator’s questions. The task of the second human participant is to convince the interrogator that the other participant is the machine and that she is human.

Language Model (LM)

🔗 What is a language model?

==The classic definition of a language model (LM) is a probability distribution over sequences of tokens.== Suppose we have a vocabulary $\mathcal{V}$, a set of tokens. A language model $p$ assigns each sequence of tokens $x_1,\dots,x_L\in\mathcal{V}$ a probability (a number between 0 and 1): $$p(x_1,\dots,x_L).$$ The probability intuitively tells us how “good” a sequence of tokens is. For example, if the vocabulary is $\mathcal{V}=\{\text{ate}, \text{ball}, \text{cheese}, \text{mouse}, \text{the}\}$, the language model might assign (demo): $$\begin{align} p(𝗍𝗁𝖾,𝗆𝗈𝗎𝗌𝖾,𝖺𝗍𝖾,𝗍𝗁𝖾,𝖼𝗁𝖾𝖾𝗌𝖾) &= 0.02, \\ p(𝗍𝗁𝖾,𝖼𝗁𝖾𝖾𝗌𝖾,𝖺𝗍𝖾,𝗍𝗁𝖾,𝗆𝗈𝗎𝗌𝖾) &= 0.01, \\ p(𝗆𝗈𝗎𝗌𝖾,𝗍𝗁𝖾,𝗍𝗁𝖾,𝖼𝗁𝖾𝖾𝗌𝖾,𝖺𝗍𝖾) &= 0.0001 \end{align}$$ Mathematically, a language model is a very simple and beautiful object. But the simplicity is deceiving: ==the ability to assign (meaningful) probabilities to all sequences requires extraordinary (but implicit) linguistic abilities and world knowledge.==

For example, the LM should assign $(𝗆𝗈𝗎𝗌𝖾, 𝗍𝗁𝖾, 𝗍𝗁𝖾, 𝖼𝗁𝖾𝖾𝗌𝖾, 𝖺𝗍𝖾)$ a very low probability implicitly because it’s ungrammatical (syntactic knowledge). The LM should assign $(𝗍𝗁𝖾, 𝗆𝗈𝗎𝗌𝖾, 𝖺𝗍𝖾, 𝗍𝗁𝖾, 𝖼𝗁𝖾𝖾𝗌𝖾)$ higher probability than $(𝗍𝗁𝖾, 𝖼𝗁𝖾𝖾𝗌𝖾, 𝖺𝗍𝖾, 𝗍𝗁𝖾, 𝗆𝗈𝗎𝗌𝖾)$ implicitly because of world knowledge: both sentences are the same syntactically, but they differ in semantic plausibility.

Generation As defined, a language model $p$ takes a sequence and returns a probability to assess its goodness. We can also generate a sequence given a language model. The purest way to do this is to sample a sequence $x_{1:L}$ from the language model $p$ with probability equal to $p(x_{1:L})$, denoted: $$x_{1:L}∼p.$$ How to do this computationally efficiently depends on the form of the language model $p$. In practice, we do not generally sample directly from a language model both because of limitations of real language models and because we sometimes wish to obtain not an “average” sequence but something closer to the “best” sequence.

Autoregressive language models A common way to write the joint distribution $p(x_{1:L})$ of a sequence $x_{1:L}$ is using the chain rule of probability: $$p(x_{1:L})=p(x_1)p(x_2∣x_1)p(x_3∣x_1,x_2)⋯p(x_L∣x_{1:L−1})=\prod_{i=1}^{L}p(x_i∣x_{1:i−1}).$$ For example (demo): $$\begin{align} p(𝗍𝗁𝖾,𝗆𝗈𝗎𝗌𝖾,𝖺𝗍𝖾,𝗍𝗁𝖾,𝖼𝗁𝖾𝖾𝗌𝖾) = \; & p(𝗍𝗁𝖾) \\ & p(𝗆𝗈𝗎𝗌𝖾∣𝗍𝗁𝖾) \\ & p(𝖺𝗍𝖾∣𝗍𝗁𝖾,𝗆𝗈𝗎𝗌𝖾) \\ & p(𝗍𝗁𝖾∣𝗍𝗁𝖾,𝗆𝗈𝗎𝗌𝖾,𝖺𝗍𝖾) \\ & p(𝖼𝗁𝖾𝖾𝗌𝖾∣𝗍𝗁𝖾,𝗆𝗈𝗎𝗌𝖾,𝖺𝗍𝖾,𝗍𝗁𝖾) \end{align}$$ In particular, $p(x_i∣x_{1:i−1})$ is a conditional probability distribution of the next token $x_i$ given the previous tokens $x_{1:i−1}$.

Of course, any joint probability distribution can be written this way mathematically, but an autoregressive language model is one where each conditional distribution $p(x_i∣x_{1:i−1})$ can be computed efficiently (e.g., using a feedforward neural network).
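The chain-rule factorization above is easy to make concrete. Below is a minimal sketch with hypothetical conditional probability tables; the numbers are invented for illustration and are not the demo values from the text:

```python
# Toy conditional distributions p(x_i | x_{1:i-1}); all values are made up.
conditionals = {
    (): {"the": 0.5, "mouse": 0.1},
    ("the",): {"mouse": 0.4, "cheese": 0.3},
    ("the", "mouse"): {"ate": 0.6},
    ("the", "mouse", "ate"): {"the": 0.7},
    ("the", "mouse", "ate", "the"): {"cheese": 0.5, "mouse": 0.1},
}

def joint_probability(tokens):
    """Chain rule: p(x_{1:L}) = prod_i p(x_i | x_{1:i-1})."""
    prob = 1.0
    for i, token in enumerate(tokens):
        prob *= conditionals[tuple(tokens[:i])][token]
    return prob

p = joint_probability(["the", "mouse", "ate", "the", "cheese"])
# 0.5 * 0.4 * 0.6 * 0.7 * 0.5 = 0.042
```

An autoregressive language model has exactly this structure, with each lookup table replaced by a learned function of the prefix.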

Generation. Now to generate an entire sequence $x_{1:L}$ from an autoregressive language model $p$, we sample one token at a time given the tokens generated so far: $$\text{for } i=1,\dots,L: \quad x_i \sim p(x_i∣x_{1:i−1})^{1/T},$$ where $T \geq 0$ is a temperature parameter that controls how much randomness we want from the language model:

  • $T=0$: deterministically choose the most probable token $x_i$ at each position $i$
  • $T=1$: sample “normally” from the pure language model
  • $T=\infty$: sample from a uniform distribution over the entire vocabulary $\mathcal{V}$

However, if we just raise the probabilities to the power $1/T$, the probability distribution may not sum to 1. We can fix this by re-normalizing the distribution. We call the normalized version $p_T(x_i∣x_{1:i−1})∝p(x_i∣x_{1:i−1})^{1/T}$ the annealed conditional probability distribution. For example: $$\begin{align} p(𝖼𝗁𝖾𝖾𝗌𝖾) &= 0.4, & p(𝗆𝗈𝗎𝗌𝖾) &= 0.6 \\ p_{T=0.5}(𝖼𝗁𝖾𝖾𝗌𝖾) &= 0.31, & p_{T=0.5}(𝗆𝗈𝗎𝗌𝖾) &= 0.69 \\ p_{T=0.2}(𝖼𝗁𝖾𝖾𝗌𝖾) &= 0.12, & p_{T=0.2}(𝗆𝗈𝗎𝗌𝖾) &= 0.88 \\ p_{T=0}(𝖼𝗁𝖾𝖾𝗌𝖾) &= 0, & p_{T=0}(𝗆𝗈𝗎𝗌𝖾) &= 1 \end{align}$$

Aside: Annealing is a reference to metallurgy, where hot materials are cooled gradually, and shows up in sampling and optimization algorithms such as simulated annealing.

Technical note: sampling iteratively with a temperature $T$ parameter applied to each conditional distribution $p(x_i∣x_{1:i−1})^{1/T}$ is not equivalent (except when $T=1$) to sampling from the annealed distribution over length $L$ sequences.
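The annealing step can be sketched in a few lines of Python. The two-token distribution below is a toy stand-in for a model's conditional distribution; the code reproduces the $p_T$ values from the example above:

```python
import random

def anneal(dist, T):
    """Annealed distribution p_T(x) ∝ p(x)^(1/T); T=0 means argmax."""
    if T == 0:
        best = max(dist, key=dist.get)
        return {x: (1.0 if x == best else 0.0) for x in dist}
    weights = {x: p ** (1.0 / T) for x, p in dist.items()}  # raise to power 1/T
    z = sum(weights.values())                               # re-normalize to sum to 1
    return {x: w / z for x, w in weights.items()}

dist = {"cheese": 0.4, "mouse": 0.6}
p_half = anneal(dist, 0.5)   # cheese ≈ 0.31, mouse ≈ 0.69
p_zero = anneal(dist, 0)     # cheese = 0.0, mouse = 1.0

# Sampling one token from the annealed conditional distribution:
token = random.choices(list(p_half), weights=list(p_half.values()))[0]
```

Per the technical note, applying `anneal` to each conditional distribution in turn is not the same as annealing the joint distribution over whole sequences (except at $T=1$).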

Conditional generation. More generally, we can perform conditional generation by specifying some prefix sequence $x_{1:i}$ (called a prompt) and sampling the rest $x_{i+1:L}$ (called the completion). For example, generating with $T=0$ produces (demo): $$\underbrace{\text{the, mouse, ate}}_{\text{prompt}} \;\xrightarrow[T=0]{}\; \underbrace{\text{the, cheese}}_{\text{completion}}$$ If we change the temperature to $T=1$, we can get more variety (demo), for example, $\{𝗂𝗍𝗌, 𝗁𝗈𝗎𝗌𝖾, 𝖺𝗇𝖽, 𝗆𝗒, 𝗁𝗈𝗆𝖾𝗐𝗈𝗋𝗄\}$.

As we’ll see shortly, conditional generation unlocks the ability for language models to solve a variety of tasks by simply changing the prompt.

Summary

  • A language model is a probability distribution $p$ over sequences $x_{1:L}$.
  • Intuitively, a good language model should have linguistic capabilities and world knowledge.
  • An autoregressive language model allows for efficient generation of a completion $x_{i+1:L}$ given a prompt $x_{1:i}$.
  • The temperature can be used to control the amount of variability in generation.

From Language Model to (General) Task Model

[!links] ↗ Pre-TrainingPost-Training & Fine TuningLLM Utilization & Prompt, Context, and Harness Engineering

🔗 https://stanford-cs324.github.io/winter2022/lectures/capabilities/

Recall that a language model $p$ is a distribution over sequences of tokens $x_{1:L}$ and thus can be used to score sequences: $$p(𝗍𝗁𝖾,𝗆𝗈𝗎𝗌𝖾,𝖺𝗍𝖾,𝗍𝗁𝖾,𝖼𝗁𝖾𝖾𝗌𝖾).$$ It can also be used to perform conditional generation of a completion given a prompt: $$𝗍𝗁𝖾 \ 𝗆𝗈𝗎𝗌𝖾 \ 𝖺𝗍𝖾 ⇝ 𝗍𝗁𝖾 \ 𝖼𝗁𝖾𝖾𝗌𝖾.$$

==A task is a mapping from inputs to outputs.== For example, for question answering, we might have:

Input: What school did burne hogarth establish?
Output: School of Visual Arts

We use the term adaptation to refer to the process of taking a language model and turning it into a task model, given:

  • a natural language description of the task, and
  • a set of training instances (input-output pairs).

There are two primary ways to perform adaptation:

  1. Training (standard supervised learning): train a new model that maps inputs to outputs, either by
    1. creating a new model that uses the language model as features (probing), or
    2. starting with the language model and updating it based on the training instances (fine-tuning), or
    3. something in between (lightweight fine-tuning).
  2. Prompting (in-context learning): Construct a prompt (a string based on the description and training instances) or a set of prompts, feed those into a language model to obtain completions.
    1. Zero-shot learning: number of training examples is 0
    2. One-shot learning: number of training examples is 1
    3. Few-shot learning: number of training examples is few
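Prompting in this sense is just string construction. The sketch below builds a few-shot prompt using a common "Input:/Output:" template; the template and the helper name are illustrative choices, not a format fixed by the GPT-3 paper:

```python
def make_prompt(description, examples, test_input):
    """Assemble a k-shot prompt: task description, k demonstrations, test input."""
    lines = [description, ""]
    for inp, out in examples:                      # the in-context training instances
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines += [f"Input: {test_input}", "Output:"]   # the model completes after "Output:"
    return "\n".join(lines)

prompt = make_prompt(
    "Answer the question.",
    [("What school did burne hogarth establish?", "School of Visual Arts")],
    "What is the capital of France?",
)
```

With zero demonstrations this is zero-shot, with one it is one-shot, and so on; the only hard limit is that the whole string must fit into the model's context window.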

Which adaptation procedure should we go with?

  • Training can be challenging due to overfitting (just imagine fine-tuning a 175 billion parameter model based on 5 examples). How to do this effectively will be the topic of the adaptation lecture.
  • For now, we will be content with adaptation of GPT-3 using prompting. Note that the limitation of prompting is that we can only leverage a small number of training instances (as many as can fit into a prompt). This is due to a limitation of Transformers, where the prompt and the completion together must fit into 2048 tokens.

The GPT-3 paper evaluated GPT-3 on a large set of tasks. We will consider a subset of these, and for each task, discuss the following:

  1. Definition: What is the task and its motivation?
  2. Adaptation: How do we reduce the task to language modeling (via prompting)?
  3. Results: What are the quantitative numbers compared to task-specific state-of-the-art models?

Size and number of examples matter. By default, the results will be based on

  • the full GPT-3 model (davinci), which has 175 billion parameters
  • using in-context learning with as many training instances as you can stuff into the prompt.

Along the way, we will do ablations to see if model size and the number of in-context training instances matter. Spoiler: they do, and more is better.

The tasks are grouped as follows:

  1. Language modeling
  2. Question answering
  3. Translation
  4. Arithmetic
  5. News article generation
  6. Novel tasks

📜 A Brief History of The Technical Evolution Of Language Models

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., … Wen, J.-R. (2025). A Survey of Large Language Models (arXiv:2303.18223). arXiv.
https://doi.org/10.48550/arXiv.2303.18223

Information Theory, Entropy of English, N-Gram Models

Information Theory

🔗 https://stanford-cs324.github.io/winter2022/lectures/introduction/#a-brief-history

Information theory. Language models date back to Claude Shannon, who founded information theory in 1948 with his seminal paper, A Mathematical Theory of Communication. In this paper, he introduced the entropy of a distribution as $$H(p)=\sum_{x}p(x)\log\frac{1}{p(x)}.$$ The entropy measures the expected number of bits any algorithm needs to encode (compress) a sample $x∼p$ into a bitstring: $$\text{𝗍𝗁𝖾 𝗆𝗈𝗎𝗌𝖾 𝖺𝗍𝖾 𝗍𝗁𝖾 𝖼𝗁𝖾𝖾𝗌𝖾} \implies 0001110101.$$

  • The lower the entropy, the more “structured” the sequence is, and the shorter the code length.
  • Intuitively, $\log\frac{1}{p(x)}$ is the length of the code used to represent an element $x$ that occurs with probability $p(x)$.
  • If $p(x)=\frac{1}{8}$, we should allocate $\log_2 8=3$ bits (equivalently, $\ln 8=2.08$ nats).

Aside: actually achieving the Shannon limit is non-trivial (e.g., LDPC codes) and is the topic of coding theory.

Entropy of English. Shannon was particularly interested in measuring the entropy of English, represented as a sequence of letters. This means we imagine that there is a “true” distribution $p$ out there (the existence of this is questionable, but it’s still a useful mathematical abstraction) that can spout out samples of English text $x∼p$.

Shannon also defined cross entropy: $$H(p,q)=\sum_x p(x)\log \frac{1}{q(x)},$$ which measures the expected number of bits (nats) needed to encode a sample $x∼p$ using the compression scheme given by the model $q$ (representing $x$ with a code of length $\log\frac{1}{q(x)}$).

Estimating entropy via language modeling. A crucial property is that the cross entropy $H(p,q)$ upper bounds the entropy $H(p)$, $$H(p,q) \geq H(p),$$ which means that we can estimate $H(p,q)$ by constructing a (language) model $q$ with only samples from the true data distribution $p$, whereas $H(p)$ is generally inaccessible if $p$ is English.

So we can get better estimates of the entropy $H(p)$ by constructing better models $q$, as measured by $H(p,q)$.
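These definitions are easy to check numerically. The distributions below are made up for illustration; the final assertion is the bound $H(p,q) \geq H(p)$:

```python
import math

def entropy(p):
    """H(p) = sum_x p(x) * log2(1/p(x)), in bits."""
    return sum(px * math.log2(1.0 / px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    """H(p, q) = sum_x p(x) * log2(1/q(x)): cost of encoding x ~ p with model q."""
    return sum(px * math.log2(1.0 / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}  # "true" distribution (toy)
q = {"a": 0.25, "b": 0.5, "c": 0.25}  # model distribution (toy)

# H(p) = 1.5 bits, H(p, q) = 1.75 bits: the model's mismatch costs extra bits.
assert cross_entropy(p, q) >= entropy(p)
```

A better model $q$ drives $H(p,q)$ down toward the generally inaccessible $H(p)$.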

Shannon game (human language model). Shannon first used n-gram models as $q$ in 1948, but in his 1951 paper Prediction and Entropy of Printed English, he introduced a clever scheme (known as the Shannon game) where $q$ was provided by a human: $$\text{𝗍𝗁𝖾 𝗆𝗈𝗎𝗌𝖾 𝖺𝗍𝖾 𝗆𝗒 𝗁𝗈_}$$ Humans aren’t good at providing calibrated probabilities of arbitrary text, so in the Shannon game, the human language model would repeatedly try to guess the next letter, and one would record the number of guesses.

N-gram models for downstream applications

Language models were first used in practical applications that required the generation of text:

  • speech recognition in the 1970s (input: acoustic signal, output: text), and
  • machine translation in the 1990s (input: text in a source language, output: text in a target language).

Noisy channel model. The dominant paradigm for solving these tasks then was the noisy channel model. Taking speech recognition as an example:

  • We posit that there is some text sampled from some distribution $p$.
  • This text becomes realized to speech (acoustic signals).
  • Then, given the speech, we wish to recover the (most likely) text. This can be done via Bayes rule: $$p(\text{text} \mid \text{speech}) ∝ \underbrace{p(\text{text})}_{\text{Language Model}}\, \underbrace{p(\text{speech} \mid \text{text})}_{\text{Acoustic Model}}$$ Speech recognition and machine translation systems used n-gram language models over words (first introduced by Shannon, but for characters).

N-gram models. In an n-gram model, the prediction of a token $x_i$ depends only on the last $n-1$ tokens $x_{i−(n−1):i−1}$ rather than the full history: $$p(x_i∣x_{1:i−1})=p(x_i∣x_{i−(n−1):i−1}).$$ For example, a trigram ($n=3$) model would define: $$p(𝖼𝗁𝖾𝖾𝗌𝖾∣𝗍𝗁𝖾,𝗆𝗈𝗎𝗌𝖾,𝖺𝗍𝖾,𝗍𝗁𝖾)=p(𝖼𝗁𝖾𝖾𝗌𝖾∣𝖺𝗍𝖾,𝗍𝗁𝖾).$$ These probabilities are computed from the number of times various n-grams (e.g., $\text{𝖺𝗍𝖾 𝗍𝗁𝖾 𝗆𝗈𝗎𝗌𝖾}$ and $\text{𝖺𝗍𝖾 𝗍𝗁𝖾 𝖼𝗁𝖾𝖾𝗌𝖾}$) occur in a large corpus of text, appropriately smoothed to avoid overfitting (e.g., Kneser-Ney smoothing).

Fitting n-gram models to data is extremely computationally cheap and scalable. As a result, n-gram models were trained on massive amounts of text. For example, Brants et al. (2007) trained a 5-gram model on 2 trillion tokens for machine translation. In comparison, GPT-3 was trained on only 300 billion tokens. However, an n-gram model was fundamentally limited. Imagine the prefix: $$\text{𝖲𝗍𝖺𝗇𝖿𝗈𝗋𝖽 𝗁𝖺𝗌 𝖺 𝗇𝖾𝗐 𝖼𝗈𝗎𝗋𝗌𝖾 𝗈𝗇 𝗅𝖺𝗋𝗀𝖾 𝗅𝖺𝗇𝗀𝗎𝖺𝗀𝖾 𝗆𝗈𝖽𝖾𝗅𝗌. 𝖨𝗍 𝗐𝗂𝗅𝗅 𝖻𝖾 𝗍𝖺𝗎𝗀𝗁𝗍 𝖻𝗒 ___}$$ If $n$ is too small, the model will be incapable of capturing long-range dependencies, and the next word will not be able to depend on 𝖲𝗍𝖺𝗇𝖿𝗈𝗋𝖽. However, if $n$ is too big, it will be statistically infeasible to get good estimates of the probabilities (almost all reasonable long sequences show up 0 times even in “huge” corpora): $$count(𝖲𝗍𝖺𝗇𝖿𝗈𝗋𝖽,𝗁𝖺𝗌,𝖺,𝗇𝖾𝗐,𝖼𝗈𝗎𝗋𝗌𝖾,𝗈𝗇,𝗅𝖺𝗋𝗀𝖾,𝗅𝖺𝗇𝗀𝗎𝖺𝗀𝖾,𝗆𝗈𝖽𝖾𝗅𝗌)=0.$$ As a result, language models were limited to tasks such as speech recognition and machine translation, where the acoustic signal or source text provided enough information that capturing only local dependencies wasn't a huge problem.
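A count-based trigram model is only a few lines of code. This is an unsmoothed maximum-likelihood sketch on a toy corpus; a real system would train on vastly more text and apply smoothing such as Kneser-Ney:

```python
from collections import Counter

# Toy corpus; real n-gram models were trained on up to trillions of tokens.
corpus = "the mouse ate the cheese and the cat ate the mouse".split()
n = 3

ngrams = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
contexts = Counter(tuple(corpus[i:i + n - 1]) for i in range(len(corpus) - n + 1))

def trigram_prob(history, token):
    """MLE estimate p(token | history) = count(context + token) / count(context),
    where only the last n-1 tokens of the history are used."""
    context = tuple(history[-(n - 1):])
    if contexts[context] == 0:
        return 0.0  # unseen context; smoothing would correct this
    return ngrams[context + (token,)] / contexts[context]

p = trigram_prob(["the", "mouse", "ate", "the"], "cheese")  # uses only ("ate", "the")
```

The sparsity problem described above shows up immediately: any n-gram absent from the corpus gets probability zero, which is exactly what smoothing methods exist to fix.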

Neural Networks

[!links] ↗ Artificial Neural Networks (ANN) & Deep Learning MethodsNeural Network Models

🔗 https://stanford-cs324.github.io/winter2022/lectures/introduction/#neural-language-models

An important step forward for language models was the introduction of neural networks. Bengio et al., 2003 pioneered neural language models, where $p(x_i∣x_{i−(n−1):i−1})$ is given by a neural network: $$p(𝖼𝗁𝖾𝖾𝗌𝖾∣𝖺𝗍𝖾,𝗍𝗁𝖾)=\text{some-neural-network}(𝖺𝗍𝖾,𝗍𝗁𝖾,𝖼𝗁𝖾𝖾𝗌𝖾).$$ Note that the context length is still bounded by $n$, but it is now statistically feasible to estimate neural language models for much larger values of $n$.
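The shape of such a model can be sketched in pure Python. The weights below are random and untrained (stand-ins for parameters a real model would learn by gradient descent), so the probabilities are meaningless, but the softmax output is a proper distribution over the vocabulary by construction:

```python
import math
import random

random.seed(0)
vocab = ["ate", "ball", "cheese", "mouse", "the"]
V, n, d = len(vocab), 3, 4  # vocab size, n-gram order, embedding dimension

# Untrained parameters: token embeddings and an output layer.
embed = {w: [random.gauss(0, 1) for _ in range(d)] for w in vocab}
W = [[random.gauss(0, 1) for _ in range((n - 1) * d)] for _ in range(V)]

def neural_ngram_dist(history):
    """p(. | last n-1 tokens) = softmax(W @ concat(embeddings of context))."""
    x = [v for w in history[-(n - 1):] for v in embed[w]]   # concatenate embeddings
    logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in W]
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return {w: e / z for w, e in zip(vocab, exps)}

dist = neural_ngram_dist(["ate", "the"])  # a valid distribution over the vocab
```

Unlike the count-based model, nothing here degenerates as $n$ grows: the parameter count grows only linearly in $n$, which is what made larger contexts statistically feasible.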

Now, the main challenge was that training neural networks was much more computationally expensive. They trained a model on only 14 million words and showed that it outperformed n-gram models trained on the same amount of data. But since n-gram models were more scalable and data was not a bottleneck, n-gram models continued to dominate for at least another decade.

Since 2003, two other key developments in neural language modeling include:

  • Recurrent Neural Networks (RNNs), including Long Short-Term Memory networks (LSTMs), allowed the conditional distribution of a token $x_i$ to depend on the entire context $x_{1:i−1}$ (effectively $n=\infty$), but these were hard to train.
  • Transformers are a more recent architecture (developed for machine translation in 2017) that again returned to having a fixed context length $n$, but were much easier to train (and exploited the parallelism of GPUs). Also, $n$ could be made “large enough” for many applications (GPT-3 used $n=2048$).

Large Language Models ⭐

LLM (Large Language Model) / The Technical Evolution of LLM & Future Directions · Transformers

Ref