Learning-Theory-Two-layer-Neural-Networks/exercise.tex at main · ArthurDanjou/Learning-Theory-Two-layer-Neural-Networks · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
\documentclass[11pt,a4paper]{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage[english]{babel}

\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{mathtools}

\usepackage[margin=1in]{geometry}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{microtype}
\usepackage{enumitem}
\usepackage{hyperref}
\usepackage{tocloft}
\setlength{\cftsecnumwidth}{4.5em}
\setlength{\cftsubsecnumwidth}{5.5em}
\setlength{\parindent}{0pt}
\setlength{\parskip}{0.5em}

\hypersetup{
    colorlinks=false,
    linkcolor=black,
    filecolor=magenta,
    urlcolor=black,
    citecolor=black,
}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{corollary}[theorem]{Corollary}

\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}

\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}

\renewcommand{\thesection}{Part \Alph{section}:}
\renewcommand{\thesubsection}{Question \arabic{subsection}.}

\title{Learning Theory - Exam - M2ISF}
\author{Arthur DANJOU}
\date{12 April 2026}

\begin{document}

{
\setlength{\topmargin}{2cm}
\begin{titlepage}
    \begin{center}
        \vspace{1.5cm}

        { \sc {{\large Université Paris-Dauphine -- PSL } \\
        Department of MIDO}}

        \vspace{1.5cm}
        \centerline{\hbox to 17cm{\hrulefill}}
        \vspace{0.3cm}

        {\sc \bf \Large \uppercase{{Learning Theory: \\ [0.3cm]  Exam -- Two--Layer Neural Networks}}}
        \centerline{\hbox to 17cm{\hrulefill}}
        \vspace{1.5cm}

        {\sc \Large {Arthur Danjou} \\ [1.8cm]
        \large Course Supervisor: \\ [0.3cm]
        \large Katia MEZIANI}

        \vspace{2cm}
        \centerline{\includegraphics[height=20mm]{logo dauphine.jpg}}
        \vspace{0.5cm}

        \hbox to \textwidth{\hrulefill}
        \vspace{0.2cm}
        {\sc Master 2 ISF (Initial Track) \\[0.2cm] ACADEMIC YEAR 2025/2026}
    \end{center}
\end{titlepage}
}

\newpage

\begin{abstract}
This document examines the mathematical foundations of generalization bounds for two-layer neural networks with ReLU activation. It first derives a naive generalization bound based on \hyperref[def:rademacher]{\textit{empirical Rademacher complexity}} and highlights why this bound fails to explain the behavior of overparameterized models. To address this limitation, we establish a symmetrization inequality tailored to ReLU networks. Exploiting the \hyperref[def:relu]{\textit{positive homogeneity of the ReLU activation}}, we then introduce a scale-invariant complexity measure. This reparameterization yields a tighter, width-independent generalization bound and therefore provides a more faithful assessment of the network's true capacity.
\end{abstract}

\vspace{1cm}

\tableofcontents

\newpage

\section*{Introduction}
\addcontentsline{toc}{section}{Introduction}

This document presents a detailed resolution of the Learning Theory evaluation on two-layer neural networks. The central theme of this exercise is to demonstrate how scale-invariant parameterizations can yield width-independent generalization bounds, overcoming the limitations of naive weight-norm constraints in overparameterized regimes. The study is structured in three main parts. Part A explores a standard but suboptimal width-dependent bound using \hyperref[def:rademacher]{\textit{empirical Rademacher complexity}}. Part B establishes a specific symmetrization inequality for ReLU networks. Finally, Part C leverages the \hyperref[def:relu]{\textit{positive homogeneity of the ReLU activation}} to introduce a scale-invariant complexity measure, culminating in a tighter generalization bound that better reflects the true capacity of modern neural networks.

Throughout this document, we assume that for some constant $C > 0$,
the input vectors satisfy $\|X\|_2^2 \le C^2$ almost surely.

\section{A naive width-dependent bound}

\subsection{}

By definition, the \hyperref[def:rademacher]{\textit{empirical Rademacher complexity}} of the hypothesis class $\mathcal{H}$ on the sample $S_n$ is given by:
\begin{align*}
    \mathcal{R}_{S_n}(\mathcal{H}) &= \mathbb{E}_{\sigma} \left[ \sup_{f_\theta \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i f_\theta(X_i) \right] \\
    &= \frac{1}{n} \mathbb{E}_{\sigma} \left[ \sup_{\substack{\|w\|_2 \le B_w \\ \|u_j\|_2 \le B_u \,\forall j}} \sum_{i=1}^n \sigma_i \sum_{j=1}^m w_j \phi(\langle u_j, X_i \rangle) \right]
\end{align*}

By linearity, we can swap the sums to isolate the outer weights $w_j$:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}) = \frac{1}{n} \mathbb{E}_{\sigma} \left[ \sup_{\substack{\|w\|_2 \le B_w, \\ \|u_j\|_2 \le B_u \,\forall j}} \sum_{j=1}^m w_j \left( \sum_{i=1}^n \sigma_i \phi(\langle u_j, X_i \rangle) \right) \right]
\end{equation*}

Let $Z \in \mathbb{R}^m$ be the vector with components $Z_j = \sum_{i=1}^n \sigma_i \phi(\langle u_j, X_i \rangle)$. The inner product between $w$ and $Z$ can be bounded using the \hyperref[thm:cauchy]{\textit{Cauchy-Schwarz inequality}}:
\begin{equation*}
    \sum_{j=1}^m w_j Z_j \le \|w\|_2 \sqrt{ \sum_{j=1}^m Z_j^2 }
\end{equation*}

Taking the supremum over $w$ subject to the constraint $\|w\|_2 \le B_w$, the maximum is achieved when the vector $w$ is collinear with $Z$:
\begin{equation*}
    \sup_{\|w\|_2 \le B_w} \sum_{j=1}^m w_j Z_j \le B_w \sqrt{ \sum_{j=1}^m \left( \sum_{i=1}^n \sigma_i \phi(\langle u_j, X_i \rangle) \right)^2 }
\end{equation*}

Substituting this back into the expression for the Rademacher complexity, we obtain:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}) \le \frac{B_w}{n} \mathbb{E}_{\sigma} \left[ \sup_{\|u_j\|_2 \le B_u \,\forall j} \sqrt{ \sum_{j=1}^m \left( \sum_{i=1}^n \sigma_i \phi(\langle u_j, X_i \rangle) \right)^2 } \right]
\end{equation*}

The original supremum is taken over the hidden-layer weight matrix $U \in \mathbb{R}^{m \times d}$. Because the constraint $\|u_j\|_2 \le B_u$ applies independently to each row $u_j$ of $U$, the feasible set for the matrix is the Cartesian product of the feasible sets for its individual rows. Therefore, maximizing the sum over the matrix $U$ is strictly equivalent to maximizing each term independently over the vectors $u_j \in \mathbb{R}^d$. The square root function being strictly increasing, we can move the supremum inside:
\begin{align*}
    \mathcal{R}_{S_n}(\mathcal{H}) &\le \frac{B_w}{n} \mathbb{E}_{\sigma} \left[ \sqrt{ \sum_{j=1}^m \sup_{\|u_j\|_2 \le B_u} \left( \sum_{i=1}^n \sigma_i \phi(\langle u_j, X_i \rangle) \right)^2 } \right] \\
    &= \frac{B_w \sqrt{m}}{n} \mathbb{E}_{\sigma} \left[ \sup_{\|u\|_2 \le B_u} \left| \sum_{i=1}^n \sigma_i \phi(\langle u, X_i \rangle) \right| \right]
\end{align*}

We observe that for any real-valued functional $A$, we have $\sup_u |A(u)| \le \sup_u A(u) + \sup_u (-A(u))$. Taking the expectation over $\sigma$ and leveraging the fact that the Rademacher vector $\sigma$ and its negation $-\sigma$ are identically distributed, we obtain $\mathbb{E}_\sigma[\sup_u (-A(u))] = \mathbb{E}_\sigma[\sup_u A(u)]$. Summing these two identical expectations yields a factor of 2:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}) \le \frac{2 B_w \sqrt{m}}{n} \mathbb{E}_{\sigma} \left[ \sup_{\|u\|_2 \le B_u} \sum_{i=1}^n \sigma_i \phi(\langle u, X_i \rangle) \right]
\end{equation*}

The ReLU activation function $\phi(z) = \max(0, z)$ is 1-Lipschitz and satisfies $\phi(0) = 0$. By \hyperref[lem:talagrand]{\textit{Talagrand's contraction lemma}}, we can remove the activation function without increasing the complexity beyond its Lipschitz constant:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}) \le 1 \times \frac{2 B_w \sqrt{m}}{n} \mathbb{E}_{\sigma} \left[ \sup_{\|u\|_2 \le B_u} \sum_{i=1}^n \sigma_i \langle u, X_i \rangle \right]
\end{equation*}

By linearity of the inner product and applying \hyperref[thm:cauchy]{\textit{Cauchy-Schwarz inequality}} once more:
\begin{equation*}
    \mathbb{E}_{\sigma} \left[ \sup_{\|u\|_2 \le B_u} \sum_{i=1}^n \sigma_i \langle u, X_i \rangle \right] = \mathbb{E}_{\sigma} \left[ \sup_{\|u\|_2 \le B_u} \left\langle u, \sum_{i=1}^n \sigma_i X_i \right\rangle \right] = B_u \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^n \sigma_i X_i \right\|_2 \right]
\end{equation*}

To bound this expectation, we apply \hyperref[thm:jensen]{\textit{Jensen's inequality}} using the concavity of the square root function:
\begin{equation*}
    \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^n \sigma_i X_i \right\|_2 \right] \le \sqrt{ \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^n \sigma_i X_i \right\|_2^2 \right] }
\end{equation*}

Expanding the squared $L_2$ norm and using the fact that the Rademacher variables are independent, centered, and have unit variance ($\mathbb{E}[\sigma_i \sigma_k] = \delta_{ik}$):
\begin{equation*}
    \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^n \sigma_i X_i \right\|_2^2 \right] = \sum_{i=1}^n \sum_{k=1}^n \mathbb{E}_{\sigma}[\sigma_i \sigma_k] \langle X_i, X_k \rangle = \sum_{i=1}^n \|X_i\|_2^2
\end{equation*}

Given the assumption that $\|X_i\|_2^2 \le C^2$ almost surely, we obtain:
\begin{equation*}
    \sqrt{ \sum_{i=1}^n \|X_i\|_2^2 } \le \sqrt{n C^2} = C \sqrt{n}
\end{equation*}

Combining all the partial bounds together yields the final result:
\begin{align*}
    \mathcal{R}_{S_n}(\mathcal{H}) &\le \frac{2 B_w \sqrt{m}}{n} B_u C \sqrt{n} \\
    \mathcal{R}_{S_n}(\mathcal{H}) &\le 2 B_w B_u C \sqrt{\frac{m}{n}}
\end{align*}

This demonstrates that the inequality holds with the absolute constant $C_0 = 2$.\qed%

\subsection{}

The generalization bound derived in the previous question scales with the factor $\sqrt{m}$, where $m$ represents the width of the hidden layer. In the regime of modern deep learning, neural networks are heavily overparameterized, meaning that the number of parameters vastly exceeds the number of available training samples ($m \gg n$).

Under this overparameterized regime, as the width $m$ grows, the theoretical upper bound on the Rademacher complexity diverges and becomes vacuous. A bound that is arbitrarily large provides no meaningful mathematical guarantee regarding the generalization gap.

However, empirical evidence consistently demonstrates that highly overparameterized neural networks generalize exceptionally well to unseen data, contrary to what this $\sqrt{m}$ dependence suggests. This fundamental discrepancy highlights that the capacity of a neural network is not merely a function of its size or a naive parameter count. Therefore, the hypothesis class defined strictly by independent norm constraints $B_w$ and $B_u$ fails to capture the true inductive bias of the model. Explaining this generalization requires a more refined, width-independent complexity measure.

\section{A symmetrization inequality for ReLU networks}

To simplify the notation, let us denote the inner product for a fixed $\sigma$ and $\theta$ as
$Z(\sigma, \theta) = \langle \sigma, f_\theta(\mathcal{S}_n) \rangle$.
We aim to show that
$\mathbb{E}_\sigma \left[ \sup_\theta |Z(\sigma, \theta)| \right] \le 2\, \mathbb{E}_\sigma \left[ \sup_\theta Z(\sigma, \theta) \right]$.

Using the provided hint, the absolute value of any real number $z$ can be decomposed as:
\begin{equation*}
    |z| = \phi(z) + \phi(-z),
\end{equation*}
where $\phi(z) = \max(0,z)$ is the \hyperref[def:relu]{\textit{ReLU activation function}}.
Applying this identity to $Z(\sigma, \theta)$:
\begin{equation*}
    |Z(\sigma, \theta)| = \phi(Z(\sigma, \theta)) + \phi(-Z(\sigma, \theta)).
\end{equation*}

Taking the supremum over $\theta$ on both sides and using the sub-additivity of the
supremum ($\sup(A + B) \le \sup A + \sup B$):
\begin{equation*}
    \sup_\theta |Z(\sigma, \theta)| \le \sup_\theta \phi(Z(\sigma, \theta))
    + \sup_\theta \phi(-Z(\sigma, \theta)).
\end{equation*}

Taking the expectation with respect to $\sigma$:
\begin{equation*}
    \mathbb{E}_\sigma \left[ \sup_\theta |Z(\sigma, \theta)| \right]
    \le \mathbb{E}_\sigma \left[ \sup_\theta \phi(Z(\sigma, \theta)) \right]
    + \mathbb{E}_\sigma \left[ \sup_\theta \phi(-Z(\sigma, \theta)) \right].
\end{equation*}

Notice that $-Z(\sigma, \theta) = - \langle \sigma, f_\theta(\mathcal{S}_n) \rangle = \langle -\sigma, f_\theta(\mathcal{S}_n) \rangle = Z(-\sigma, \theta)$.
Since the Rademacher vector $\sigma$ and its negation $-\sigma$ are identically distributed,
the two expectations on the right-hand side are equal:
\begin{equation*}
    \mathbb{E}_\sigma \left[ \sup_\theta \phi(-Z(\sigma, \theta)) \right]
    = \mathbb{E}_\sigma \left[ \sup_\theta \phi(Z(-\sigma, \theta)) \right]
    = \mathbb{E}_\sigma \left[ \sup_\theta \phi(Z(\sigma, \theta)) \right].
\end{equation*}

Summing these two identical expectations gives:
\begin{equation*}
    \mathbb{E}_\sigma \left[ \sup_\theta |Z(\sigma, \theta)| \right]
    \le 2\, \mathbb{E}_\sigma \left[ \sup_\theta \phi(Z(\sigma, \theta)) \right].
\end{equation*}

It remains to show that $\mathbb{E}_\sigma\!\left[\sup_\theta \phi(Z(\sigma,\theta))\right]
\le \mathbb{E}_\sigma\!\left[\sup_\theta Z(\sigma,\theta)\right]$.
Since $\phi$ is non-decreasing, for any fixed $\theta$ we have
$Z(\sigma,\theta) \le \sup_{\theta'} Z(\sigma,\theta')$, and therefore:
\begin{equation*}
    \phi(Z(\sigma,\theta)) \le \phi\!\left(\sup_{\theta'} Z(\sigma,\theta')\right).
\end{equation*}

Taking the supremum over $\theta$ on the left-hand side:
\begin{equation*}
    \sup_\theta\, \phi(Z(\sigma, \theta)) \leq \phi\!\left(\sup_\theta Z(\sigma, \theta)\right).
\end{equation*}

By the assumption of the problem, $\sup_\theta Z(\sigma,\theta) \ge 0$ for all
$\sigma \in \{-1,+1\}^n$. On any non-negative argument, the ReLU acts as the
identity, so:
\begin{equation*}
    \phi\!\left(\sup_\theta Z(\sigma, \theta)\right) = \sup_\theta Z(\sigma, \theta).
\end{equation*}

Combining these two steps:
\begin{equation*}
    \sup_\theta\, \phi(Z(\sigma, \theta)) \leq \sup_\theta\, Z(\sigma, \theta).
\end{equation*}

Taking the expectation over $\sigma$ and substituting into our earlier bound, we conclude:
\begin{equation*}
    \mathbb{E}_\sigma \left[ \sup_\theta \left| \langle \sigma, f_\theta(\mathcal{S}_n) \rangle \right| \right]
    \le 2\, \mathbb{E}_\sigma \left[ \sup_\theta \langle \sigma, f_\theta(\mathcal{S}_n) \rangle \right].
\end{equation*}

This completes the proof of the symmetrization inequality.\qed%

\section{A scale-invariant complexity measure}

\subsection{}

Let $X \in \mathbb{R}^d$ be an arbitrary input vector. The output of the neural network parameterized by the new set of weights $\theta^{\prime} = \{(\lambda_j w_j, u_j / \lambda_j)\}_{j=1}^m$ is given by:
\begin{equation*}
    f_{\theta^{\prime}}(X) = \sum_{j=1}^{m} (\lambda_j w_j) \phi \left( \left\langle \frac{u_j}{\lambda_j}, X \right\rangle \right)
\end{equation*}

By the linearity of the inner product, we can extract the scalar $\frac{1}{\lambda_j}$:
\begin{equation*}
    \left\langle \frac{u_j}{\lambda_j}, X \right\rangle = \frac{1}{\lambda_j} \langle u_j, X \rangle
\end{equation*}

Since we are given that $\lambda_j > 0$ for all $j=1, \dots, m$, it follows that $\frac{1}{\lambda_j} > 0$. We can therefore apply the positive homogeneity property of the \hyperref[def:relu]{\textit{ReLU activation function}}, $\phi(\alpha z) = \alpha \phi(z)$ for any $\alpha > 0$:
\begin{equation*}
    \phi \left( \frac{1}{\lambda_j} \langle u_j, X \rangle \right) = \frac{1}{\lambda_j} \phi ( \langle u_j, X \rangle )
\end{equation*}

Substituting this back into the expression for the network output, we obtain:
\begin{align*}
    f_{\theta^{\prime}}(X) &= \sum_{j=1}^{m} \lambda_j w_j \left( \frac{1}{\lambda_j} \phi ( \langle u_j, X \rangle ) \right) \\
    &= \sum_{j=1}^{m} \left( \lambda_j \frac{1}{\lambda_j} \right) w_j \phi ( \langle u_j, X \rangle ) \\
    &= \sum_{j=1}^{m} w_j \phi ( \langle u_j, X \rangle )
\end{align*}

This final expression is exactly the definition of the network output under the original parameterization $\theta$. Thus, we have shown that $f_{\theta^{\prime}}(X) = f_{\theta}(X)$ for any input $X$, meaning the function computed by the network remains perfectly unchanged under this reparameterization.

\subsection{}

We evaluate the complexity measure $\mathcal{C}$ on the reparameterized weights $\theta^{\prime} = \{(\lambda_j w_j, u_j / \lambda_j)\}_{j=1}^m$. By definition, we have:
\begin{equation*}
    \mathcal{C}(\theta^{\prime}) = \sum_{j=1}^{m} |\lambda_j w_j| \left\| \frac{u_j}{\lambda_j} \right\|_2
\end{equation*}

Using the properties of the absolute value for the product of scalars and the property of the $L_2$ norm regarding scalar multiplication ($\|\alpha v\|_2 = |\alpha| \|v\|_2$ for any scalar $\alpha$ and vector $v$), we can rewrite the terms inside the sum as:
\begin{equation*}
    \mathcal{C}(\theta^{\prime}) = \sum_{j=1}^{m} |\lambda_j| |w_j| \frac{1}{|\lambda_j|} \| u_j \|_2
\end{equation*}

Since we are given that $\lambda_j > 0$ for all $j=1, \dots, m$, we have $|\lambda_j| = \lambda_j$. The expression simplifies to:
\begin{align*}
    \mathcal{C}(\theta^{\prime}) &= \sum_{j=1}^{m} \lambda_j |w_j| \frac{1}{\lambda_j} \| u_j \|_2 \\
    &= \sum_{j=1}^{m} \left( \lambda_j \frac{1}{\lambda_j} \right) |w_j| \| u_j \|_2 \\
    &= \sum_{j=1}^{m} |w_j| \| u_j \|_2
\end{align*}

This final expression is exactly the definition of the complexity measure for the original parameters $\theta$. We have thus demonstrated that $\mathcal{C}(\theta^{\prime}) = \mathcal{C}(\theta)$, proving that the complexity measure is strictly scale-invariant.

\subsection{}

We want to bound the empirical Rademacher complexity of the hypothesis class $\mathcal{H}'$. By definition of \hyperref[def:rademacher]{\textit{empirical Rademacher complexity}}, we have:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}') = \frac{1}{n} \mathbb{E}_{\sigma} \left[ \sup_{\mathcal{C}(\theta) \le B} \sum_{i=1}^n \sigma_i \sum_{j=1}^m w_j \phi(\langle u_j, X_i \rangle) \right]
\end{equation*}

Using the scale-invariance properties established in the previous questions, we can reparameterize the network without changing its output or its complexity measure. For $u_j \neq 0$, define $\alpha_j = w_j \|u_j\|_2 \in \mathbb{R}$ and $v_j = \frac{u_j}{\|u_j\|_2} \in \mathbb{R}^d$. If $u_j = 0$, set $\alpha_j = 0$ and $v_j = 0$. By the positive homogeneity of \hyperref[def:relu]{\textit{ReLU activation function}}, the network output can be rewritten as:
\begin{equation*}
    f_\theta(X_i) = \sum_{j=1}^m w_j \|u_j\|_2 \phi \left( \left\langle \frac{u_j}{\|u_j\|_2}, X_i \right\rangle \right) = \sum_{j=1}^m \alpha_j \phi(\langle v_j, X_i \rangle)
\end{equation*}

The complexity measure constraint becomes $\sum_{j=1}^m |\alpha_j| = \sum_{j=1}^m |w_j| \|u_j\|_2 = \mathcal{C}(\theta) \le B$, and by definition, the new weight vectors satisfy $\|v_j\|_2 \le 1$. Substituting this into the Rademacher complexity and swapping the sums yields:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}') = \frac{1}{n} \mathbb{E}_{\sigma} \left[ \sup_{\substack{\sum_{j=1}^m |\alpha_j| \le B,\\ \|v_j\|_2 \le 1}} \sum_{j=1}^m \alpha_j \left( \sum_{i=1}^n \sigma_i \phi(\langle v_j, X_i \rangle) \right) \right]
\end{equation*}

By applying \hyperref[thm:holder]{\textit{Hölder's inequality}}, the inner product over the $m$ hidden units is bounded by the $L_1$ norm of $\alpha$ multiplied by the maximum absolute value of the inner sums:
\begin{equation*}
    \sum_{j=1}^m \alpha_j \left( \sum_{i=1}^n \sigma_i \phi(\langle v_j, X_i \rangle) \right) \le \left( \sum_{j=1}^m |\alpha_j| \right) \sup_{\|v\|_2 \le 1} \left| \sum_{i=1}^n \sigma_i \phi(\langle v, X_i \rangle) \right|
\end{equation*}

Using the constraint $\sum_{j=1}^m |\alpha_j| \le B$, we can bound the Rademacher complexity by taking the scalar $B$ out of the supremum:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}') \le \frac{B}{n} \mathbb{E}_{\sigma} \left[ \sup_{\|v\|_2 \le 1} \left| \sum_{i=1}^n \sigma_i \phi(\langle v, X_i \rangle) \right| \right]
\end{equation*}

Now, we must apply the symmetrization inequality derived in Part B. It allows us to remove the absolute value inside the supremum at the cost of a factor of 2:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}') \le \frac{2B}{n} \mathbb{E}_{\sigma} \left[ \sup_{\|v\|_2 \le 1} \sum_{i=1}^n \sigma_i \phi(\langle v, X_i \rangle) \right]
\end{equation*}

From this point, the mathematical steps perfectly mirror Part A. By applying \hyperref[lem:talagrand]{\textit{Talagrand's contraction lemma}} for the 1-Lipschitz ReLU function, we can remove the activation without increasing the bound:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}') \le \frac{2B}{n} \mathbb{E}_{\sigma} \left[ \sup_{\|v\|_2 \le 1} \left\langle v, \sum_{i=1}^n \sigma_i X_i \right\rangle \right]
\end{equation*}

Applying the \hyperref[thm:cauchy]{\textit{Cauchy-Schwarz inequality}}, the supremum over $v$ subject to $\|v\|_2 \le 1$ is simply the $L_2$ norm of the Rademacher sum:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}') \le \frac{2B}{n} \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^n \sigma_i X_i \right\|_2 \right]
\end{equation*}

Finally, using \hyperref[thm:jensen]{\textit{Jensen's inequality}} and the independence of the Rademacher variables as done in Part A, we bound the expected $L_2$ norm using the explicit assumption $\|X_i\|_2 \le C$:
\begin{equation*}
    \mathbb{E}_{\sigma} \left[ \left\| \sum_{i=1}^n \sigma_i X_i \right\|_2 \right] \le \sqrt{ \sum_{i=1}^n \|X_i\|_2^2 } \le \sqrt{n C^2} = C\sqrt{n}
\end{equation*}

Injecting this back gives the final scale-invariant generalization bound:
\begin{align*}
    \mathcal{R}_{S_n}(\mathcal{H}') &\le \frac{2B}{n} C\sqrt{n} \\
    \mathcal{R}_{S_n}(\mathcal{H}') &\le \frac{2BC}{\sqrt{n}}
\end{align*}

This concludes the proof, demonstrating that the inequality holds with the absolute constant $C_1 = 2$.\qed%

\subsection{}

The generalization bound obtained in Part A is strictly width-dependent, scaling as $\mathcal{O}(\sqrt{\frac{m}{n}})$. As discussed previously, this bound becomes vacuous in the overparameterized regime where the number of hidden units $m$ is extremely large, thereby failing to explain why very wide neural networks can still generalize well in practice.

Conversely, the bound derived in Part C using the complexity measure $\mathcal{C}(\theta)$ scales as $\mathcal{O}(\frac{1}{\sqrt{n}})$. This new bound is entirely independent of the network's width $m$, effectively decoupling the statistical guarantee from the raw number of parameters.

The measure $\mathcal{C}(\theta)$ provides a more appropriate notion of capacity because it accounts for the algebraic symmetries of the architecture, specifically the positive homogeneity of the ReLU activation function. In Part A, the hypothesis class was defined by independent constraints on the norms of the weights. However, due to scale-invariance, one could artificially inflate the norm of the hidden layer while proportionally deflating the output layer. This transformation results in the exact same mathematical function but drastically inflates the naive complexity bound.

The scale-invariant measure $\mathcal{C}(\theta)$ resolves this ambiguity by coupling the norms of the incoming and outgoing weights for each hidden unit. Consequently, it bounds the true functional capacity of the network rather than the arbitrary scale of its parameters. This proves that overparameterized neural networks can generalize perfectly well, provided that their scale-invariant complexity remains bounded.

\section*{Conclusion}
\addcontentsline{toc}{section}{Conclusion}

This exercise highlights a foundational concept in the modern statistical learning theory of neural networks: the necessity of scale-invariant capacity measures.

In Part A, we demonstrated that a naive approach relying strictly on weight norms yields a generalization bound that scales with $\sqrt{m}$. In the context of modern deep learning, where networks are massively overparameterized ($m \gg n$), this width-dependent bound becomes vacuous and fails to explain the strong empirical generalization of such models.

By exploiting the positive homogeneity of the ReLU activation function in Parts B and C, we established that the network's function remains invariant under specific rescaling operations. This allowed us to define a new complexity measure, $\mathcal{C}(\theta)$, which captures the true capacity of the two-layer network independently of its width $m$. Consequently, the resulting generalization bound depends solely on the sample size $\mathcal{O}(\frac{1}{\sqrt{n}})$ and the scale-invariant complexity.

Ultimately, this illustrates that overparameterized neural networks can generalize effectively, provided that their scale-invariant complexity is bounded, a property often achieved in practice through explicit regularization or the implicit regularization of gradient-based optimization methods.

\newpage

\section*{Appendix: Mathematical Tools}
\addcontentsline{toc}{section}{Appendix: Mathematical Tools}

This appendix outlines the fundamental mathematical definitions and theorems used in the resolution of this exercise. As these are standard results in empirical process theory and probability, they are stated here without proof and assumed to be true.

\begin{definition}[Empirical Rademacher Complexity]\label{def:rademacher}
Let $\mathcal{H}$ be a family of functions mapping from $\mathcal{X}$ to $\mathbb{R}$, and let $S_n = (X_1, \dots, X_n)$ be a fixed sample of size $n$ with elements in $\mathcal{X}$. The empirical Rademacher complexity of $\mathcal{H}$ with respect to the sample $S_n$ is defined as:
\begin{equation*}
    \mathcal{R}_{S_n}(\mathcal{H}) = \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i) \right]
\end{equation*}
where $\sigma_1, \dots, \sigma_n$ are independent Rademacher random variables taking values in $\{-1, +1\}$ with equal probability.
\end{definition}

\begin{definition}[ReLU Activation and Positive Homogeneity]\label{def:relu}
The Rectified Linear Unit (ReLU) activation function is defined as $\phi(z) = \max(0, z)$. It satisfies the positive homogeneity property, meaning that for any strictly positive scalar $\alpha > 0$ and any real number $z$:
\begin{equation*}
    \phi(\alpha z) = \alpha \phi(z)
\end{equation*}
Furthermore, the absolute value of any real number can be decomposed using the ReLU function as $|z| = \phi(z) + \phi(-z)$.
\end{definition}

\begin{theorem}[Cauchy-Schwarz Inequality]\label{thm:cauchy}
For all vectors $u, v$ in an inner product space over $\mathbb{R}$, the Cauchy-Schwarz inequality states that:
\begin{equation*}
    |\langle u, v \rangle| \le \|u\|_2 \|v\|_2
\end{equation*}
where $\langle \cdot, \cdot \rangle$ is the inner product and $\|\cdot\|_2$ is the induced $L_2$ norm.
\end{theorem}

\begin{theorem}[Hölder's Inequality]\label{thm:holder}
For any vectors $u, v \in \mathbb{R}^m$, the absolute value of their inner product is bounded by the product of their $L_1$ and $L_\infty$ norms:
\begin{equation*}
    |\langle u, v \rangle| \le \|u\|_1 \|v\|_\infty
\end{equation*}
where $\|u\|_1 = \sum_{j=1}^m |u_j|$ and $\|v\|_\infty = \max_{1 \le j \le m} |v_j|$.
\end{theorem}

\begin{theorem}[Jensen's Inequality]\label{thm:jensen}
Let $X$ be a random variable and $\varphi$ be a measurable function. If $\varphi$ is convex, then:
\begin{equation*}
    \varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]
\end{equation*}
Conversely, if $\varphi$ is a concave function, the inequality is reversed:
\begin{equation*}
    \varphi(\mathbb{E}[X]) \ge \mathbb{E}[\varphi(X)]
\end{equation*}
\end{theorem}

\begin{lemma}[Talagrand's Contraction Lemma]\label{lem:talagrand}
Let $\phi : \mathbb{R} \to \mathbb{R}$ be an $L$-Lipschitz function such that $\phi(0) = 0$. For any hypothesis class $\mathcal{H}$ of real-valued functions and any sample $S_n = (X_1, \dots, X_n)$, the empirical Rademacher complexity of the composition satisfies:
\begin{equation*}
    \frac{1}{n} \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{H}} \sum_{i=1}^n \sigma_i \phi(f(X_i)) \right] \le \frac{L}{n} \mathbb{E}_{\sigma} \left[ \sup_{f \in \mathcal{H}} \sum_{i=1}^n \sigma_i f(X_i) \right]
\end{equation*}
\end{lemma}

\end{document}