<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<title>cs486</title>
<style type="text/css">
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
</style>
<link rel="stylesheet" href="../pandoc.css" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.9.0/katex.min.js"></script>
<script>document.addEventListener("DOMContentLoaded", function () {
var mathElements = document.getElementsByClassName("math");
for (var i = 0; i < mathElements.length; i++) {
var texText = mathElements[i].firstChild;
if (mathElements[i].tagName == "SPAN") { katex.render(texText.data, mathElements[i], { displayMode: mathElements[i].classList.contains("display"), throwOnError: false } );
}}});</script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.9.0/katex.min.css" />
<!--[if lt IE 9]>
<script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
<![endif]-->
</head>
<body>
<h1 id="cs-486">CS 486</h1>
<h2 id="rational-agent-paradigm">Rational Agent Paradigm</h2>
<blockquote>
<p>An entity that perceives and acts.</p>
</blockquote>
<ul>
<li>Function from percepts to actions.</li>
<li>Performance measures.
<ul>
<li>Goal achievement, resource consumption.</li>
</ul></li>
<li><strong>Caveat</strong>: Computational limitations and environmental constraints mean we do not have perfect rationality.</li>
</ul>
<h3 id="task-environments">Task Environments</h3>
<p>To design a rational agent the task environment must be defined.</p>
<ul>
<li>Performance measures.</li>
<li>Environment.</li>
<li>Actuators.</li>
<li>Sensors.</li>
</ul>
<h4 id="properties-of-task-environment">Properties of Task Environment</h4>
<ul>
<li><strong>Fully Observable</strong> vs. <strong>Partially Observable</strong>.</li>
<li><strong>Deterministic</strong> vs. <strong>Stochastic</strong>.
<ul>
<li>Is the next state completely determined by the current state and action executed?</li>
</ul></li>
<li><strong>Episodic</strong> vs. <strong>Sequential</strong>.</li>
<li><strong>Discrete</strong> vs. <strong>Continuous</strong>.</li>
<li><strong>Static</strong> vs. <strong>Dynamic</strong>.</li>
<li><strong>Single Agent</strong> vs. <strong>Multiagent</strong>.</li>
</ul>
<h1 id="search">Search</h1>
<blockquote>
<p>A <strong>search problem</strong> consists of a <strong>state space</strong>, a <strong>successor function</strong>, a <strong>start state</strong>, and a <strong>goal test</strong>.</p>
</blockquote>
<ul>
<li><strong>Solution</strong> is a sequence of actions (plan) from the start state to some goal state.</li>
</ul>
<p><strong>Example</strong>: Sliding Tiles Problem.</p>
<blockquote>
<ul>
<li><strong>State</strong>: Board configuration.</li>
<li><strong>Start</strong>: Any state.</li>
<li><strong>Actions</strong>: Slide the blank tile into an adjacent space.</li>
<li><strong>Goal</strong>: Does the board match the goal picture?</li>
</ul>
</blockquote>
<p><strong>Example</strong>: N Queens Problem.</p>
<blockquote>
<ul>
<li><strong>State</strong>: <span class="math inline">0</span> to <span class="math inline">N</span> queens.</li>
<li><strong>Start</strong>: <span class="math inline">0</span> queens.</li>
<li><strong>Actions</strong>: Add a queen to an empty space.</li>
<li><strong>Goal</strong>: <span class="math inline">N</span> queens none attacking.</li>
</ul>
</blockquote>
<p>An alternate representation is more complicated but has a smaller search space.</p>
<blockquote>
<ul>
<li><strong>State</strong>: <span class="math inline">0</span> to <span class="math inline">N</span> queens, first <span class="math inline">n</span> columns not attacking each other.</li>
<li><strong>Start</strong>: <span class="math inline">0</span> queens.</li>
<li><strong>Actions</strong>: Add a queen to the first empty column such that no two queens attack each other.</li>
<li><strong>Goal</strong>: <span class="math inline">N</span> queens, none attacking.</li>
</ul>
</blockquote>
<h2 id="state-space">State Space</h2>
<ul>
<li>The <strong>world space</strong> includes every last detail in the environment.</li>
<li>A <strong>search space</strong> keeps only the details needed for planning (abstraction).</li>
</ul>
<h2 id="representing-state">Representing State</h2>
<ul>
<li><strong>State space graph</strong>.
<ul>
<li>Vertices correspond to states, with one vertex for each state.</li>
<li>Edges correspond to successors.</li>
<li>Goal test is a set of goal nodes.</li>
</ul></li>
<li>We search for a solution by building a <strong>search tree</strong> and traversing it to find a goal state.</li>
</ul>
<h3 id="search-tree">Search Tree</h3>
<ul>
<li>The start state is the root of the tree.</li>
<li>Children are the successors.</li>
<li>Plan is a path in the tree. A solution is a path from the root to a goal node.</li>
</ul>
<blockquote>
<p>For most problems we do not actually generate the entire tree.</p>
</blockquote>
<ul>
<li>We expand a node by applying all legal actions on it and adding the new states to the tree.</li>
</ul>
<h2 id="generic-search-algorithm">Generic Search Algorithm</h2>
<ul>
<li>Initialize with initial state of the problem.</li>
<li><strong>Repeat</strong>.
<ul>
<li>If no candidate nodes, <strong>failure</strong>.</li>
<li>Choose leaf node for expansion according to <strong>search strategy</strong>.</li>
<li>If node contains goal state, return <strong>solution</strong>.</li>
<li>Otherwise, expand the node. Add resulting nodes to the tree.</li>
</ul></li>
<li>Nodes can be classified as <strong>start</strong> node, <strong>explored</strong> nodes, <strong>frontier</strong>, <strong>unexplored</strong> nodes.</li>
</ul>
<h3 id="key-properties">Key Properties</h3>
<ul>
<li><strong>Completeness</strong>: Is the algorithm guaranteed to find a solution if one exists?</li>
<li><strong>Optimality</strong>: Does the algorithm find the optimal solution?</li>
<li><strong>Time complexity</strong>.</li>
<li><strong>Space complexity</strong>: Size of the fringe.</li>
<li><span class="math inline">b</span>: Branching factor.</li>
<li><span class="math inline">m</span>: Maximum depth.</li>
<li><span class="math inline">d</span>: Depth of the nearest goal node.</li>
</ul>
<p><strong>Example</strong>: DFS.</p>
<blockquote>
<ul>
<li><strong>Complete</strong>: No; it can get stuck in an infinite loop. If <span class="math inline">m</span> is finite, then it is complete.</li>
<li><strong>Optimal</strong>: No. Finds the first goal, not necessarily the optimal.</li>
<li><strong>Time complexity</strong>: Whole tree, <span class="math inline">O(b^m)</span>.</li>
<li><strong>Space complexity</strong>: Fringe and related path information. <span class="math inline">O(m \cdot b)</span>.</li>
</ul>
</blockquote>
<p><strong>Example</strong>: BFS.</p>
<blockquote>
<ul>
<li><strong>Complete</strong>: Yes.</li>
<li><strong>Optimal</strong>: Depends on whether the shallowest goal node is the one with the least cost.</li>
<li><strong>Time complexity</strong>: Whole tree, <span class="math inline">O(b^{d + 1})</span>.</li>
<li><strong>Space complexity</strong>: <span class="math inline">O(b^d)</span>.</li>
</ul>
</blockquote>
<h3 id="iterative-deepened-search">Iterative Deepened Search</h3>
<blockquote>
<p>Combine search methods to get the space complexity of DFS together with the completeness and shallow-solution advantage of BFS.</p>
</blockquote>
<ul>
<li><strong>Complete</strong>: Yes.</li>
<li><strong>Optimal</strong>: Depends on whether the shallowest goal node is the one with the least cost.</li>
<li><strong>Time complexity</strong>: Whole tree, <span class="math inline">O(b^d)</span>.</li>
<li><strong>Space complexity</strong>: <span class="math inline">O(m \cdot b)</span>.</li>
</ul>
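<p>A minimal sketch of iterative deepening in Python (the toy graph and helper names are made up for illustration):</p>

```python
def depth_limited_search(state, goal_test, successors, limit, path=None):
    """DFS that gives up below a fixed depth; returns a path or None."""
    if path is None:
        path = [state]
    if goal_test(state):
        return path
    if limit == 0:
        return None
    for child in successors(state):
        result = depth_limited_search(child, goal_test, successors,
                                      limit - 1, path + [child])
        if result is not None:
            return result
    return None

def iterative_deepening_search(start, goal_test, successors, max_depth=50):
    """Run depth-limited DFS with increasing limits:
    DFS-like memory use, BFS-like completeness and shallow solutions."""
    for limit in range(max_depth + 1):
        result = depth_limited_search(start, goal_test, successors, limit)
        if result is not None:
            return result
    return None

# Toy example graph: G is reachable from A in two steps.
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['G'], 'D': [], 'G': []}
print(iterative_deepening_search('A', lambda s: s == 'G', lambda s: graph[s]))
# ['A', 'C', 'G']
```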
<h2 id="cost-sensitive-search">Cost-Sensitive Search</h2>
<h3 id="uniform-cost-search">Uniform Cost Search</h3>
<ul>
<li><strong>Strategy</strong>: Expand cheapest node first.</li>
<li><strong>Implementation</strong>: Priority queue.</li>
<li><strong>Complete</strong>: Yes.</li>
<li><strong>Optimal</strong>: Yes, if every step cost is at least some <span class="math inline">\epsilon > 0</span>.</li>
<li><strong>Time Complexity</strong>: <span class="math inline">O(b^{1 + \frac{C^*}{\epsilon}})</span>, where <span class="math inline">C^*</span> is the optimal cost.</li>
<li><strong>Space Complexity</strong>: Same as BFS.</li>
</ul>
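<p>A uniform cost search sketch using Python's <code>heapq</code> as the priority queue (the example graph is made up for illustration):</p>

```python
import heapq

def uniform_cost_search(start, goal_test, successors):
    """Expand the cheapest frontier node first.
    successors(state) yields (next_state, step_cost) pairs."""
    frontier = [(0, start, [start])]          # (path cost, state, path)
    best_cost = {start: 0}
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if goal_test(state):                  # goal test at expansion time
            return cost, path
        if cost > best_cost.get(state, float('inf')):
            continue                          # stale queue entry
        for nxt, step in successors(state):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float('inf')):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt, path + [nxt]))
    return None

# The cheapest S-to-G path goes through both A and B.
edges = {'S': [('A', 1), ('B', 4)], 'A': [('B', 1), ('G', 6)],
         'B': [('G', 2)], 'G': []}
print(uniform_cost_search('S', lambda s: s == 'G', lambda s: edges[s]))
# (4, ['S', 'A', 'B', 'G'])
```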
<h1 id="informed-search">Informed Search</h1>
<p>Uninformed search expands nodes based on the distance from the start node. Why not expand based on the distance to the goal?</p>
<h2 id="heuristics">Heuristics</h2>
<blockquote>
<p>A function that <strong>estimates</strong> the cost of reaching a goal from a given state.</p>
</blockquote>
<ul>
<li>If <span class="math inline">h(n_1) < h(n_2)</span> we guess that it is cheaper to reach the goal from <span class="math inline">n_1</span> than <span class="math inline">n_2</span>.</li>
<li>We require <span class="math inline">h(goal) = 0</span>.</li>
</ul>
<p><strong>Example</strong>: Best First Search.</p>
<blockquote>
<p><strong>Search strategy</strong>: Expand the most promising node according to the heuristic. Not complete (it can expand forever); time and space complexity are <span class="math inline">O(b^m)</span>.</p>
</blockquote>
<ul>
<li><strong><span class="math inline">A^\star</span> Search</strong>: Expand according to the cost of path and heuristic. <span class="math inline">f(n) = g(n) + h(n)</span>. <em>Note: Goal test must be done while expanding the node, not when it is first discovered.</em> Time complexity is <span class="math inline">O(b^{\delta d})</span>, where <span class="math inline">\delta</span> is the relative error of the heuristic. We are required to store all expanded nodes in memory.</li>
<li><strong>Admissible Heuristic</strong>: <span class="math inline">0 \le h(n) \le h^\star(n)</span>, where <span class="math inline">h^\star(n)</span> is the true cost.</li>
<li>Heuristic is <strong>consistent</strong> if <span class="math inline">h(a) \le cost(a, b) + h(b)</span>. Required if we have a graph (multiple paths to a goal).</li>
</ul>
<blockquote>
<p>Let <span class="math inline">G</span> be the optimal goal. Let <span class="math inline">G^\prime</span> be a sub-optimal goal. So <span class="math inline">f(G) < f(G^\prime)</span>. Assume by way of contradiction that <span class="math inline">G^\prime</span> is selected over some <span class="math inline">n</span>. So <span class="math inline">f(G^\prime = g(G^\prime) > g(G) = cost(s, n) + cost(n, G) \ge g(n) + h(n)</span>. This is a contradiction since <span class="math inline">n</span> would have been selected first.</p>
</blockquote>
<ul>
<li><strong>Dominated Heuristic</strong>: <span class="math inline">h_1(n)</span> dominates <span class="math inline">h_2(n)</span> if for all <span class="math inline">n</span>, <span class="math inline">h_1(n) \ge h_2(n)</span> and there exists one strict inequality.</li>
<li><strong>Theorem</strong>: If <span class="math inline">h_1(n)</span> dominates <span class="math inline">h_2(n)</span>, then <span class="math inline">A^\star</span> with <span class="math inline">h_1</span> will never expand more nodes than <span class="math inline">A^\star</span> with <span class="math inline">h_2</span>.</li>
</ul>
<blockquote>
<p>Let <span class="math inline">c^\star</span> be the cost of the optimal solution. <span class="math inline">A^\star</span> expands all nodes such that <span class="math inline">f(n) < c^\star</span>, so <span class="math inline">h(n) \le c^\star - g(n)</span>. <span class="math inline">g(n)</span> is constant across all heuristics, so since <span class="math inline">h_2(n) \le h_1(n)</span>, a node expanded in <span class="math inline">h_1</span> must also have been expanded in <span class="math inline">h_2</span>.</p>
</blockquote>
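<p>A compact <span class="math inline">A^\star</span> sketch; note that the goal test happens at expansion time, as the note above requires. The toy graph and heuristic table are made up, with the heuristic chosen to be admissible:</p>

```python
import heapq

def astar(start, goal_test, successors, h):
    """A* search: order the frontier by f(n) = g(n) + h(n)."""
    frontier = [(h(start), 0, start, [start])]   # (f, g, state, path)
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if goal_test(state):                     # test when expanding
            return g, path
        if g > best_g.get(state, float('inf')):
            continue                             # stale frontier entry
        for nxt, step in successors(state):
            new_g = g + step
            if new_g < best_g.get(nxt, float('inf')):
                best_g[nxt] = new_g
                heapq.heappush(frontier,
                               (new_g + h(nxt), new_g, nxt, path + [nxt]))
    return None

# Made-up graph; h never overestimates the true cost to G (admissible).
edges = {'S': [('A', 1), ('B', 2)], 'A': [('G', 4)], 'B': [('G', 1)], 'G': []}
h = {'S': 2, 'A': 4, 'B': 1, 'G': 0}
print(astar('S', lambda s: s == 'G', lambda s: edges[s], lambda s: h[s]))
# (3, ['S', 'B', 'G'])
```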
<h1 id="constraint-satisfaction">Constraint Satisfaction</h1>
<blockquote>
<p>Special subset of search problems.</p>
</blockquote>
<ul>
<li><strong>States</strong> are defined by <strong>variables</strong> <span class="math inline">X_i</span> with values from <strong>domains</strong> <span class="math inline">D_i</span>.</li>
<li><strong>Goal test</strong> is a <strong>set of constraints</strong> specifying allowable combinations of values for subsets of variables.</li>
</ul>
<h2 id="types-of-cpss">Types of CPSs</h2>
<ul>
<li><strong>Discrete variables</strong>.
<ul>
<li><strong>Finite domains</strong>: If domain has size <span class="math inline">d</span>, there are <span class="math inline">O(d^n)</span> complete assignments.</li>
<li><strong>Infinite domains</strong>: Linear constraints are solvable but non-linear are undecidable.</li>
</ul></li>
<li><strong>Continuous variables</strong>: Linear constraints are solvable in polynomial time via linear programming.</li>
<li><strong>Unary constraints</strong>.</li>
<li><strong>Binary constraints</strong>: Representable with a constraint graph.</li>
<li><strong>Higher-order constraints</strong>.</li>
<li><strong>Soft constraints</strong>: Constrained optimization problem.</li>
</ul>
<h2 id="commutativity">Commutativity</h2>
<blockquote>
<p><strong>Key insight</strong> is that CSPs are commutative.</p>
</blockquote>
<ul>
<li>The order of actions does not affect the outcome.</li>
<li>The search algorithm takes advantage of this.</li>
</ul>
<h2 id="backtracking">Backtracking</h2>
<blockquote>
<p>Basic search algorithm for CSPs.</p>
</blockquote>
<pre><code>Select unassigned variable X
For every value {x_1, ..., x_n} in domain of X
If value satisfies constraints, assign X = x_i and exit loop
If an assignment is found
Move to next variable
If no assignment is found
Back up to preceding variable and try a different assignment</code></pre>
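<p>The pseudocode above can be sketched in Python; the all-different toy CSP is made up for illustration:</p>

```python
def backtrack(assignment, variables, domains, consistent):
    """Depth-first assignment with backtracking.
    `consistent` checks a (possibly partial) assignment against the constraints."""
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)  # naive ordering
    for value in domains[var]:
        assignment[var] = value
        if consistent(assignment):
            result = backtrack(assignment, variables, domains, consistent)
            if result is not None:
                return result
        del assignment[var]          # undo and try the next value
    return None

# Toy CSP: three variables, all pairwise different.
variables = ['X', 'Y', 'Z']
domains = {'X': [1, 2], 'Y': [1, 2], 'Z': [1, 2, 3]}
all_diff = lambda a: len(set(a.values())) == len(a)
print(backtrack({}, variables, domains, all_diff))
# {'X': 1, 'Y': 2, 'Z': 3}
```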
<h3 id="backtracking-efficiency">Backtracking Efficiency</h3>
<h4 id="ordering">Ordering</h4>
<blockquote>
<p>Which variables should be tried first? In what order should the variable’s values be tried?</p>
</blockquote>
<ul>
<li><strong>Most constrained variable</strong>: Try the variable with the fewest remaining <em>legal</em> values. Also known as <strong>minimum remaining values</strong>.</li>
<li><strong>Most constraining variable</strong>: Try the variable with the most constraints on the remaining variables.</li>
<li><strong>Least constraining value</strong>: Try the value which rules out the fewest values in the remaining variables.</li>
</ul>
<h4 id="filtering">Filtering</h4>
<blockquote>
<p>How do we detect failure early?</p>
</blockquote>
<ul>
<li><strong>Forward checking</strong>: Keep track of the remaining legal values for unassigned variables. Terminate search when any variable has no legal values.
<ul>
<li>Does not detect all future failures early.</li>
</ul></li>
<li><strong>Arc consistency</strong>: Given domains <span class="math inline">D_1, D_2</span>, an arc is consistent if for all <span class="math inline">x \in D_1</span>, there exists <span class="math inline">y \in D_2</span> such that <span class="math inline">x, y</span> are consistent.</li>
<li><strong>K-consistency</strong>: For all sets of <span class="math inline">K - 1</span> variables and consistent assignment of values, a consistent value is always assignable to any <span class="math inline">K</span>th variable.</li>
</ul>
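<p>Arc consistency can be enforced with an AC-3-style queue of arcs; this is a sketch, with a made-up two-variable "less than" constraint for illustration:</p>

```python
from collections import deque

def ac3(domains, constraints):
    """Enforce arc consistency. `constraints` maps an arc (X, Y) to a
    predicate on (x, y); `domains` is pruned in place.
    Returns False if some domain is wiped out."""
    queue = deque(constraints)
    while queue:
        x, y = queue.popleft()
        pred = constraints[(x, y)]
        # Values of x with no consistent partner in y's domain.
        dead = [vx for vx in domains[x]
                if not any(pred(vx, vy) for vy in domains[y])]
        if dead:
            domains[x] = [vx for vx in domains[x] if vx not in dead]
            if not domains[x]:
                return False
            # x shrank, so re-check every arc pointing into x.
            for (a, b) in constraints:
                if b == x:
                    queue.append((a, b))
    return True

# X < Y with both domains {1, 2, 3}: consistency prunes X=3 and Y=1.
domains = {'X': [1, 2, 3], 'Y': [1, 2, 3]}
constraints = {('X', 'Y'): lambda x, y: x < y,
               ('Y', 'X'): lambda y, x: y > x}
ac3(domains, constraints)
print(domains)
# {'X': [1, 2], 'Y': [2, 3]}
```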
<h4 id="structure">Structure</h4>
<blockquote>
<p>Is it possible to exploit the problem structure? <strong>Idea</strong>: Break down the graph into connected components and solve each component separately.</p>
</blockquote>
<ul>
<li><strong>Independent Subproblems</strong>: Solve each connected component separately.</li>
<li><strong>Tree Structures</strong>: If the constraint graph is a tree, root it, order the variables with a topological sort, and solve in <span class="math inline">O(nd^2)</span>.</li>
<li><strong>Cutsets</strong>: For a general graph, define a subset of variables <span class="math inline">S</span> such that when removed, the remaining graph is a tree. Try all assignments of the subset, <span class="math inline">O(d^{|S|})</span>, and check whether there is a solution in the tree, <span class="math inline">O((n - |S|)d^2)</span>; the total runtime is <span class="math inline">O(d^{|S| + 2}(n-|S|))</span>.</li>
<li><strong>Tree Decomposition</strong>: Group nodes into mega-nodes forming a tree. Every variable must occur in at least one mega-node. Variables connected by constraints must appear together in some mega-node. If a variable is in multiple mega-nodes, it must appear in every mega-node on the path between them. If <span class="math inline">w</span> is the width of the tree (1 less than the size of the largest sub-problem), the runtime is <span class="math inline">O(nd^{w + 1})</span>. Finding the minimum-width decomposition is NP-hard, but we can use heuristics.</li>
</ul>
<h1 id="constraints-and-local-search">Constraints and Local Search</h1>
<blockquote>
<p>For many problems, the path is unimportant.</p>
</blockquote>
<h2 id="iterative-improvement-methods">Iterative Improvement Methods</h2>
<ul>
<li>Start at some potential solution. Generate all possible points to move to. If stuck then reset, otherwise move to one of the points.</li>
</ul>
<h2 id="hill-climbing-gradient-descent">Hill Climbing (Gradient Descent):</h2>
<ul>
<li>Take a step in the direction which improves the current solution value the most.</li>
<li>Not necessarily complete (flat optima), not optimal; gets stuck at local optima and plateaus. Random restarts fix the local-optima issue.</li>
</ul>
<h2 id="simulated-annealing">Simulated Annealing</h2>
<blockquote>
<p>Escape local optima by allowing “downhill” movements.</p>
</blockquote>
<ul>
<li>Take selected move if it improves the solution, otherwise take it with probability <span class="math inline">p</span>.</li>
</ul>
<h3 id="boltzman-distribution">Boltzman Distribution</h3>
<ul>
<li><span class="math inline">e^{\frac{\Delta V}{T}}</span>, <span class="math inline">T > 0</span> is the temperature parameter.</li>
<li>When <span class="math inline">T</span> is high, bad moves have a chance, <strong>exploration phase</strong>.</li>
<li>When <span class="math inline">T</span> is low, bad moves have low probability of being selected, <strong>exploitation phase</strong>.</li>
<li>If <span class="math inline">T</span> decreases slowly enough, then we are <em>theoretically</em> guaranteeed to reach optimum.</li>
</ul>
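<p>A simulated annealing sketch with a geometric cooling schedule; the objective, schedule, and all constants are made up for illustration:</p>

```python
import math
import random

def simulated_annealing(state, value, neighbor, t0=10.0, cooling=0.95, steps=500):
    """Maximize `value`. Uphill moves are always taken; downhill moves are
    taken with probability e^(dV / T), which shrinks as T cools."""
    t = t0
    best = state
    for _ in range(steps):
        candidate = neighbor(state)
        dv = value(candidate) - value(state)
        if dv > 0 or random.random() < math.exp(dv / t):
            state = candidate
        if value(state) > value(best):
            best = state                 # remember the best state seen
        t *= cooling                     # exploration at high T, exploitation at low T
    return best

# Toy objective (made up): maximize -(x - 3)^2 over the integers.
random.seed(0)
value = lambda x: -(x - 3) ** 2
neighbor = lambda x: x + random.choice([-1, 1])
best = simulated_annealing(0, value, neighbor)
print(best)   # settles at (or very near) the maximum x = 3
```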
<h2 id="genetic-algorithms">Genetic Algorithms</h2>
<ul>
<li>Encoded candidate solution is an <strong>individual</strong>.</li>
<li>Individuals have <strong>fitness</strong> which corresponds to the quality of the solution.</li>
<li>Populations change over generations by applying operators.</li>
</ul>
<ol type="1">
<li><strong>Selection</strong>: Fitness-proportional selection could lead to overcrowding. Ranking by percentile of fitness gets around this. Softmax (Boltzmann) selection, similar to simulated annealing, also works.</li>
<li><strong>Crossover</strong>: Select random crossover point. Implemented with bitmasks.</li>
<li><strong>Mutate</strong>: With small probability, modify a feature.</li>
</ol>
<ul>
<li>In a new generation, for every child, select parents, crossover, then mutate with low probability.</li>
</ul>
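<p>The three operators can be sketched on bit strings; the ONE-MAX fitness function and all constants here are made up for illustration:</p>

```python
import random

def genetic_algorithm(fitness, length, pop_size=30, generations=80, p_mutate=0.02):
    """Bit-string GA: fitness-proportional selection, one-point crossover,
    and per-bit mutation with small probability."""
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        weights = [fitness(ind) + 1e-9 for ind in pop]   # avoid all-zero weights
        new_pop = []
        for _ in range(pop_size):
            mom, dad = random.choices(pop, weights=weights, k=2)  # selection
            cut = random.randrange(1, length)                     # crossover point
            child = mom[:cut] + dad[cut:]
            child = [b ^ 1 if random.random() < p_mutate else b
                     for b in child]                              # mutation
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# ONE-MAX: fitness is the number of 1 bits, so the optimum is all ones.
random.seed(1)
best = genetic_algorithm(fitness=sum, length=12)
print(sum(best))   # at or near the maximum of 12
```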
<h1 id="adversarial-search">Adversarial Search</h1>
<ul>
<li>Focused on zero-sum games.</li>
<li><strong>MAX</strong> player maximizes utility, <strong>MIN</strong> player minimizes utility.</li>
<li><strong>Optimal strategy</strong> leads to outcomes at least as good as any other strategy, given that MIN is playing optimally.</li>
<li><strong>Nash Equilibrium</strong>: <span class="math inline">s^\star \in S</span> is a Nash Equilibrium if for all <span class="math inline">i</span>, <span class="math inline">u_i(s^\star) \ge u_i(s_i, s_{-i}^\star)</span>, for all <span class="math inline">s_i \in S_i</span>.</li>
<li><strong>Theorem</strong> (Kuhn): Every finite extensive-form game has a subgame perfect equilibrium.</li>
</ul>
<h2 id="minimax">Minimax</h2>
<p><span class="math inline">M(n) = \begin{cases}u(n),\ &\text{$n$ is terminal} \\ \max_{c \in succ(n)} M(c),\ &\text{$n$ is MAX node} \\ \min_{c \in succ(n)} M(c),\ &\text{$n$ is MIN node} \end{cases}</span>.</p>
<ul>
<li>Complete for finite games; time complexity is <span class="math inline">O(b^m)</span>, space complexity <span class="math inline">O(bm)</span>, where <span class="math inline">m</span> is the maximum number of moves in the game. Optimal.</li>
</ul>
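<p>The recurrence translates directly into code; the depth-2 toy game tree below is made up for illustration:</p>

```python
def minimax(node, is_max, successors, utility):
    """Exhaustive minimax on a game tree; `successors` returns [] at terminals."""
    children = successors(node)
    if not children:
        return utility(node)             # terminal: return its utility
    values = [minimax(c, not is_max, successors, utility) for c in children]
    return max(values) if is_max else min(values)

# Toy game: MAX picks a branch, then MIN picks the worst leaf in it.
tree = {'root': ['L', 'R'], 'L': [3, 5], 'R': [2, 9]}
succ = lambda n: tree.get(n, [])
print(minimax('root', True, succ, lambda n: n))
# 3  (MIN holds MAX to min(3,5)=3 on the left, min(2,9)=2 on the right)
```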
<h3 id="alpha-beta-pruning">Alpha-Beta Pruning</h3>
<blockquote>
<p>Compute minimax without searching the entire tree.</p>
</blockquote>
<ul>
<li><strong>Alpha</strong>: Value of the best choice so far on path for MAX.</li>
<li><strong>Beta</strong>: Value of the best (lowest) choice so far on path for MIN.</li>
<li>We update <strong>alpha</strong> and <strong>beta</strong> as we compute <span class="math inline">M</span>. Prune as soon as the current node is known to be worse than our current best result.</li>
<li>Results in the same outcome as full minimax. In the worst-case it offers no improvement.</li>
</ul>
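<p>A sketch of the pruned version on the same kind of made-up toy tree; it returns the same value as full minimax while skipping provably irrelevant branches:</p>

```python
def alphabeta(node, is_max, successors, utility,
              alpha=float('-inf'), beta=float('inf')):
    """Minimax with alpha-beta pruning."""
    children = successors(node)
    if not children:
        return utility(node)
    if is_max:
        value = float('-inf')
        for c in children:
            value = max(value, alphabeta(c, False, successors, utility, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break        # MIN will never let play reach here
        return value
    value = float('inf')
    for c in children:
        value = min(value, alphabeta(c, True, successors, utility, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break            # MAX already has a better option elsewhere
    return value

tree = {'root': ['L', 'R'], 'L': [3, 5], 'R': [2, 9]}
succ = lambda n: tree.get(n, [])
print(alphabeta('root', True, succ, lambda n: n))
# 3 — after seeing leaf 2, the rest of the R subtree is pruned
```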
<h2 id="optimizations">Optimizations</h2>
<blockquote>
<p>We might not have enough resources to compute full results.</p>
</blockquote>
<ul>
<li><strong>Evaluation Functions</strong>: Return an estimate of the expected utility. Need to be fast to compute. Often a weighted function of the state features.</li>
<li><strong>Cutting Off Search</strong>: Instead of searching to terminal states, we search to a limited depth, and then use evaluation functions.
<ul>
<li>Typically, we search deeper when a node looks clearly good. This helps avoid the <strong>horizon effect</strong>, where a bad outcome lurks just beyond the cutoff depth.</li>
</ul></li>
</ul>
<h3 id="monte-carlo-tree-search-mcts">Monte-Carlo Tree Search (MCTS)</h3>
<ul>
<li>Build a search tree according to outcomes of simulated plays.</li>
</ul>
<ol type="1">
<li><strong>Selection</strong>: Traverse the tree following a policy using Upper Confidence Bounds until you run out of information required to progress.
<ul>
<li><span class="math inline">v_i + c\sqrt{\frac{\ln N}{n_i}}</span>, where <span class="math inline">n_i</span> is the number of times we have visited the node, <span class="math inline">N</span> is the total number of runs, and <span class="math inline">v_i</span> is the expected value of the node given the information we have.</li>
</ul></li>
<li><strong>Expansion</strong>: Eventually we run out of information to progress, so expand a random child.</li>
<li><strong>Simulate</strong>: Quickly simulate a game from the node.</li>
<li><strong>Back-propagation</strong>: Based on the simulations, update the expected values along the path.</li>
</ol>
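<p>The selection formula can be written directly; the node statistics below are made-up numbers showing exploration winning out over a slightly higher average value:</p>

```python
import math

def ucb1(value, visits, total_runs, c=math.sqrt(2)):
    """Upper confidence bound used during MCTS selection.
    Unvisited nodes get infinite priority so every child is tried once."""
    if visits == 0:
        return float('inf')
    return value + c * math.sqrt(math.log(total_runs) / visits)

# A rarely tried child can outrank a better-scoring but well-explored one.
scores = {'explored': ucb1(0.6, visits=90, total_runs=100),
          'fresh': ucb1(0.5, visits=10, total_runs=100)}
print(max(scores, key=scores.get))
# 'fresh'
```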
<h2 id="stochastic-games">Stochastic Games</h2>
<blockquote>
<p>Element of randomness.</p>
</blockquote>
<ul>
<li>Modelled by adding <strong>chance</strong> nodes between MIN and MAX layers, with weights equal to the probability of every option.</li>
<li>Compute <strong>expected values</strong> for minimax.</li>
</ul>
<h1 id="machine-learning">Machine Learning</h1>
<ul>
<li><strong>Learning</strong>: Improving behaviour based on experience. Range of behaviour, accuracy on tasks, or speed of execution are considered improvements.</li>
<li><strong>Supervised Classification</strong>: Given a set of pre-classified examples, classify a new instance.
<ul>
<li><strong>Feedback</strong>: Supervised learning is explicitly told what must be learned by each example.</li>
</ul></li>
<li><strong>Unsupervised Learning</strong>: Find natural classes for examples.
<ul>
<li><strong>Feedback</strong>: No feedback is given; the learner has to find patterns itself.</li>
</ul></li>
<li><strong>Reinforcement Learning</strong>: Determine what to do based on rewards and punishments.
<ul>
<li><strong>Feedback</strong>: Feedback is only given after a sequence of actions.</li>
</ul></li>
<li><strong>Transfer Learning</strong>: Learning from an expert.</li>
<li><strong>Active Learning</strong>: Actively seeking to learn.</li>
<li><strong>Representation</strong>: Richer representations are more useful for subsequent problem solving but are harder to learn.</li>
<li>Measured against how well the agent performs with <strong>new examples</strong>.</li>
<li>Tendency to prefer one hypothesis over another is a <strong>bias</strong>.</li>
</ul>
<p>Given representations, data, and bias, we have a <strong>search problem</strong>. We search through the space of possible representations to best fit the data. The search space is typically too large for systematic search, so we use iterative improvement. So a <strong>learning problem</strong> is made up of a search space, an evaluation function, and a search method.</p>
<h2 id="supervised-learning">Supervised Learning</h2>
<blockquote>
<p>Given a set of input features <span class="math inline">X_1, ..., X_n</span>, a set of target features <span class="math inline">f(x)</span>, and a set of training examples, predict the target features for a set of test examples. This is done by returning a function that approximates <span class="math inline">f</span>.</p>
</blockquote>
<ul>
<li><strong>Classification</strong>: <span class="math inline">f</span> is discrete.</li>
<li><strong>Regression</strong>: <span class="math inline">f</span> is continuous.</li>
<li><strong>Inductive Learning Hypothesis</strong>: We hope that the approximation of <span class="math inline">f</span> which performs well over a sufficiently large set of training examples will also perform well over any unobserved examples.</li>
<li><strong>Evaluating Performance</strong>: Suppose <span class="math inline">y</span> is a feature and <span class="math inline">e</span> is an example. <span class="math inline">y(e)</span> is the true value of <span class="math inline">y</span> for <span class="math inline">e</span>. <span class="math inline">y^\star(e)</span> is the predicted value. The <strong>error</strong> is a measure of how close <span class="math inline">y^\star(e)</span> is to <span class="math inline">y(e)</span>.
<ul>
<li><strong>Receiver Operator Curve</strong>: <span class="math inline">Recall = Sensitivity = \frac{TP}{TP + FN}</span>. <span class="math inline">Specificity = \frac{TN}{TN + FP}</span>. <span class="math inline">Precision = \frac{TP}{TP + FP}</span>. <span class="math inline">F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}</span>.</li>
</ul></li>
</ul>
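<p>The four measures computed from a confusion matrix (the counts below are made up for illustration):</p>

```python
def classification_metrics(tp, fp, tn, fn):
    """Recall/sensitivity, specificity, precision, and F-measure
    from confusion-matrix counts."""
    recall = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {'recall': recall, 'specificity': specificity,
            'precision': precision, 'f': f_measure}

print(classification_metrics(tp=8, fp=2, tn=85, fn=5))
# precision 0.8, recall about 0.615, F-measure about 0.696
```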
<h3 id="decision-trees">Decision Trees</h3>
<blockquote>
<p>Classify instances by sorting them down the tree from root to leaf.</p>
</blockquote>
<ul>
<li>Leaves are the classifications.</li>
<li>Any boolean function is representable using decision trees.</li>
<li>Data needs to be discrete.</li>
</ul>
<h4 id="inducing-decision-tree"><strong>Inducing Decision Tree</strong></h4>
<blockquote>
<p>Recursively choose the <em>most significant</em> attribute as the root of the subtree.</p>
</blockquote>
<ul>
<li><strong>Entropy</strong>: Measure of unpredictability, uncertainty of random variables.</li>
<li><span class="math inline">I(P(v_1), ..., P(v_n)) = \sum_{i=1}^n -P(v_i) \log_2(P(v_i))</span> from Information Theory. Assume <span class="math inline">0\log(0) = 0</span>.</li>
<li>High entropy implies that we gain information by observing the variable's value.</li>
<li><strong>Information Gain</strong>: The gain on attribute <span class="math inline">A</span> is the expected reduction in entropy. Given that attribute <span class="math inline">A</span> divides the training set <span class="math inline">E</span> into subsets <span class="math inline">E_1, ..., E_v</span>, <span class="math inline">remainder(A)</span> is the weighted sum of their entropies: <span class="math inline">remainder(A) = \sum_{i=1}^v \frac{p_i + n_i}{p + n} I(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i})</span>.
<ul>
<li>So <span class="math inline">IG(A) = I(\frac{p}{p + n}, \frac{n}{p + n}) - remainder(A)</span>. We choose the attribute <span class="math inline">A</span> which maximizes the information given, so the minimum remainder.</li>
</ul></li>
</ul>
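<p>A minimal sketch of entropy and information gain as defined above; the example split counts are invented.</p>

```python
from math import log2

def entropy(probs):
    # I(P(v_1), ..., P(v_n)); define 0 * log2(0) = 0 by skipping zeros.
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(p, n, splits):
    # splits: list of (p_i, n_i) counts after dividing E on attribute A.
    before = entropy([p / (p + n), n / (p + n)])
    remainder = sum((pi + ni) / (p + n) *
                    entropy([pi / (pi + ni), ni / (pi + ni)])
                    for pi, ni in splits)
    return before - remainder

# Hypothetical attribute splitting 6 positive / 6 negative examples
# into one pure subset and one mixed subset.
print(round(information_gain(6, 6, [(4, 0), (2, 6)]), 3))
```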
<h2 id="assessing-performance">Assessing Performance</h2>
<ul>
<li>Divide a large set of examples into disjoint sets. Apply the learning algorithm to the training set and measure performance against the test set.</li>
<li><strong>Overfitting</strong>: Hypothesis <span class="math inline">h \in H</span> overfits training data if there exists <span class="math inline">h^\prime</span>, <span class="math inline">h \neq h^\prime</span> such that <span class="math inline">h</span> has a smaller error on the training examples but <span class="math inline">h^\prime</span> has a smaller error on the entire distribution of instances.
<ul>
<li>Reduces performance of decision trees by 10%-25%.</li>
<li>Errors are caused by bias, variance in the data, and noise in the data.</li>
</ul></li>
<li><strong>Bias-Variance Trade-off</strong>: Complicated models will not have enough data (low bias, high variance). Simple models will have lots of data (high bias, low variance).</li>
</ul>
<h3 id="avoiding-overfitting">Avoiding Overfitting</h3>
<ol type="1">
<li><strong>Regularization</strong>: Prefer small decision trees over large ones. Add complexity penalty to stopping criteria.</li>
<li><strong>Pseudocounts</strong>: Add data based on previous knowledge.</li>
<li><strong>Cross Validation</strong>: Split the training set into training and validation. Use validation as a pretend test set. Optimize the hypothesis to perform well against the validation set.
<ul>
<li><strong>K-fold Validation</strong>: Divide training set into <span class="math inline">K</span> subsets, pull one out as validation and train on the rest.</li>
</ul></li>
</ol>
<h2 id="linear-classifiers">Linear Classifiers</h2>
<blockquote>
<p>Data is of the form <span class="math inline">(x, f(x))</span>, <span class="math inline">x \in \mathbb{R}^n</span>, <span class="math inline">f(x) \in \{0, 1\}</span>.</p>
</blockquote>
<ul>
<li>Want <span class="math inline">w</span> with <span class="math inline">h_w(x) = \begin{cases}1,\ &w \cdot x \ge 0\\0,\ &w \cdot x < 0\end{cases}</span>. Find <span class="math inline">w</span> which minimizes the loss function <span class="math inline">L(h_w) = \sum_{i=1}^n (y_i - h_w(x_i))^2</span>.</li>
<li><strong>Gradient Descent</strong>: <span class="math inline">\frac{\partial}{\partial w_i} L(h_w) = 2(y-h_w(x))(-x_i)</span>. Iteratively until convergence, we update <span class="math inline">w_i = w_i - \alpha \frac{\partial}{\partial w_i} L(h_w)</span>, where <span class="math inline">\alpha</span> is the step size or learning rate.</li>
</ul>
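<p>The update rule above can be sketched as follows; since the threshold <span class="math inline">h_w</span> is not differentiable, the update acts perceptron-style, and the toy dataset (a bias feature plus a boolean AND) is made up for illustration.</p>

```python
# Update from the notes: w_i <- w_i - alpha * 2(y - h_w(x))(-x_i),
# i.e. w <- w + alpha * (y - h_w(x)) * x.

def h(w, x):
    # Threshold classifier: 1 if w . x >= 0 else 0.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def train(data, alpha=1, epochs=100):
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            err = y - h(w, x)
            w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
    return w

# x[0] = 1 is a folded-in bias term; label is the AND of x[1], x[2].
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train(data)
print([h(w, x) for x, _ in data])  # → [0, 0, 0, 1]
```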
<h2 id="ensembles">Ensembles</h2>
<blockquote>
<p>Use many hypotheses and combine their predictions.</p>
</blockquote>
<h3 id="bagging">Bagging</h3>
<blockquote>
<p>Majority vote.</p>
</blockquote>
<ul>
<li>Assume each hypothesis makes error with probability <span class="math inline">p</span>.</li>
<li>Probability that the majority is wrong is <span class="math inline">\sum_{k = \lceil \frac{n}{2} \rceil}^n {n \choose k}p^k (1-p)^{n-k}</span>.</li>
<li>Example: Random Forests.</li>
</ul>
<blockquote>
<p>Randomly sample subsets of training data and features, learn decision trees off the subsets. Classify using majority vote of the forest.</p>
</blockquote>
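<p>The majority-vote error formula above can be evaluated directly; the counts below are illustrative.</p>

```python
from math import comb, ceil

def majority_error(n, p):
    # Probability that a majority of n independent hypotheses,
    # each wrong with probability p, is wrong.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(ceil(n / 2), n + 1))

# 21 hypotheses, each wrong 30% of the time: the ensemble error
# is far below the individual error.
print(round(majority_error(21, 0.3), 4))
```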
<h3 id="boosting">Boosting</h3>
<ul>
<li>In reality, hypotheses are not equally correct and independent.</li>
<li>Want to increase the weight of good hypotheses and decrease the weight of bad hypotheses, then use a weighted majority.</li>
<li>Similarly, we want to increase the weight of misclassified examples.</li>
<li>Example: AdaBoost.</li>
</ul>
<blockquote>
<p>Compute the weighted error <span class="math inline">p</span> for every hypothesis. Multiply the weights of correctly classified examples by <span class="math inline">\frac{p}{1 - p}</span>. Set the weight of the hypothesis to <span class="math inline">\log(\frac{1-p}{p})</span>.</p>
</blockquote>
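<p>A sketch of one AdaBoost-style reweighting round following the rule above, where <span class="math inline">p</span> is the weighted error; the example weights are made up, and the sketch assumes <span class="math inline">0 < p < 1</span>.</p>

```python
from math import log

def adaboost_round(weights, correct):
    # weights: one weight per example; correct: whether the hypothesis
    # classified that example correctly.
    error = sum(w for w, c in zip(weights, correct) if not c) / sum(weights)
    # Downweight correctly classified examples by error / (1 - error).
    new = [w * (error / (1 - error)) if c else w
           for w, c in zip(weights, correct)]
    total = sum(new)
    new = [w / total for w in new]            # renormalize
    hyp_weight = log((1 - error) / error)     # weight of this hypothesis
    return new, hyp_weight

# Four equally weighted examples, one misclassified.
w, z = adaboost_round([0.25] * 4, [True, True, True, False])
print([round(x, 3) for x in w], round(z, 3))
```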
<h1 id="neural-networks">Neural Networks</h1>
<blockquote>
<p>Relationships between input and output may be complicated.</p>
</blockquote>
<ul>
<li>Want methods for learning arbitrary relationships.</li>
<li><strong>Neurons</strong>: Dendrites are inputs, the soma is activity, the axon is output, synapses are links. Learning changes how efficiently signals transfer.</li>
<li><strong>Artificial Neuron</strong>: <span class="math inline">in(i) = \sum_{j}w_{i, j}a(j)</span>, <span class="math inline">a(i) = g(in(i))</span>. Activation functions should be non-linear and mimic real neurons, so output close to 0 or 1.</li>
<li><strong>Common Activation Functions</strong>.
<ul>
<li><strong>Rectified Linear Unit</strong>: <span class="math inline">g(x) = \max\{0, x\}</span>.</li>
<li><strong>Sigmoid Function</strong>: <span class="math inline">g(x) = \frac{1}{1 + e^{-x}}</span>.</li>
<li><strong>Hyperbolic Tangent</strong>: <span class="math inline">g(x) = \tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}</span>.</li>
<li><strong>Threshold</strong>: <span class="math inline">g(x) = \begin{cases}1,\ &x \ge b\\0,\ &x < b\end{cases}</span>.</li>
</ul></li>
</ul>
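<p>The four activation functions above, written out directly (the threshold bias <span class="math inline">b</span> defaults to 0 here for illustration):</p>

```python
from math import exp, tanh

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1 / (1 + exp(-x))

def hyp_tan(x):
    return tanh(x)

def threshold(x, b=0.0):
    return 1 if x >= b else 0

print(relu(-2), sigmoid(0), hyp_tan(0), threshold(0.5))
```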
<h2 id="network-structure">Network Structure</h2>
<ul>
<li><strong>Feed-forward ANN</strong>: DAG, no internal state, maps input to outputs.</li>
<li><strong>Recurrent ANN</strong>: Directed cyclic graph, dynamic system with internal state.</li>
<li><strong>Perceptrons</strong>: Single layered feed-forward network. Can only learn linear separators. Learning means adjusting the weights.
<ul>
<li><span class="math inline">E = \sum_{k} \frac{1}{2}(y_k - (h_w(x))_k)^2</span>. Gradient descent.</li>
</ul></li>
<li><strong>Stochastic Gradient Descent</strong>: Shuffle the data at random and update the weights one example at a time.</li>
<li><strong>Multilayer Networks</strong>: Any continuous function is learnable with just one hidden layer.
<ul>
<li>The last layer can be trained with gradient descent. For other layers, we do not know <span class="math inline">y</span>, so how do we compute <span class="math inline">E</span>?</li>
<li><strong>Back Propagation</strong>: Hidden layer nodes contributed to some error at the output. The amount of error is proportional to the weight. Chain rule.</li>
</ul></li>
<li><strong>Deep Learning</strong>: Neural networks with more than one hidden layer.</li>
</ul>
<h1 id="bayesian-networks">Bayesian Networks</h1>
<ul>
<li><span class="math inline">P(A) = \sum_i P(A | B_i) P(B_i)</span>.</li>
<li><span class="math inline">P(A | B) P(B) = P(B | A) P(A)</span>.</li>
<li><strong>Probabilistic Inference</strong>: Query to get probability.
<ul>
<li>Representing a full joint distribution over a set of random variables <span class="math inline">X_1, ..., X_n</span> is very expensive.</li>
<li>Answering queries is very slow.</li>
</ul></li>
</ul>
<h2 id="variable-independence">Variable Independence</h2>
<p>Two <strong>variables</strong> <span class="math inline">X, Y</span> are conditionally independent given variable <span class="math inline">Z</span> if given that you know the value of <span class="math inline">Z</span>, nothing you learn about <span class="math inline">Y</span> will influence your beliefs about <span class="math inline">X</span>.</p>
<ul>
<li><span class="math inline">P(x | y, z) = P(x | z)</span>, for all <span class="math inline">x \in X, y \in Y, z \in Z</span>.</li>
<li>Represent links using a directed acyclic graph. Calculate values with marginalization.</li>
<li><strong>Bayesian Network</strong>: Graphical representation of direct dependencies over a set of variables and a set of <strong>conditional probability tables</strong> <span class="math inline">P(X | Parents(X))</span> quantifying the strength of influences.
<ul>
<li>So every variable is conditionally independent of all non-descendants given its parents.</li>
<li>Assume every random variable is directly influenced by at most <span class="math inline">k</span> others, so every table is of size at most <span class="math inline">O(2^k)</span>, so the entire network is specified in <span class="math inline">O(n2^k)</span>.</li>
</ul></li>
</ul>
<h2 id="testing-independence-d-separation">Testing Independence (D-Separation)</h2>
<ul>
<li><strong>D-Separation</strong>: Set of variables <span class="math inline">E</span> (evidence) separates <span class="math inline">X, Y</span> if it blocks every undirected path between <span class="math inline">X</span> and <span class="math inline">Y</span>. So <span class="math inline">X, Y</span> are conditionally independent given <span class="math inline">E</span>.</li>
</ul>
<p>Let <span class="math inline">P</span> be an undirected path between <span class="math inline">X, Y</span>. <span class="math inline">E</span> blocks path <span class="math inline">P</span> if there exists a node <span class="math inline">Z \in P</span> such that:</p>
<ol type="1">
<li>An arc on <span class="math inline">P</span> goes into <span class="math inline">Z</span> and an arc leaves <span class="math inline">Z</span>, and <span class="math inline">Z \in E</span>.
<ul>
<li><span class="math inline">X ... \to Z \to ... Y</span>.</li>
</ul></li>
<li>Both arcs on <span class="math inline">P</span> leave <span class="math inline">Z</span> and <span class="math inline">Z \in E</span>.
<ul>
<li><span class="math inline">X ... \gets Z \to ... Y</span>.</li>
</ul></li>
<li>Both arcs on <span class="math inline">P</span> enter <span class="math inline">Z</span> and neither <span class="math inline">Z</span>, nor any of its descendants, are in <span class="math inline">E</span>.
<ul>
<li><span class="math inline">X ... \to Z \cup Descendants(Z) \gets ... Y</span>.</li>
</ul></li>
</ol>
<ul>
<li><strong>Markov Blanket</strong>: Given parents and children, a node is conditionally independent of all other nodes in the network.</li>
</ul>
<p>Example:</p>
<blockquote>
<p><span class="math inline">\begin{aligned} P(J) &= \sum P(J, TS, Fl, Fv, Th, M, ET) \\ &= \sum_{M} P(J, M) \\ &= \sum_{M} P(J \mid M) P(M) \\ &= \sum_{M} P(J \mid M) \sum_{ET} P(M, ET) \\ &= \sum_{M} P(J \mid M) \sum_{ET} P(M \mid ET) P(ET) \end{aligned}</span></p>
</blockquote>
<p>Example:</p>
<blockquote>
<p><span class="math inline">\begin{aligned} P(J \mid ET) &= \frac{P(J, ET)}{P(ET)} \\ &= \frac{1}{P(ET)} \sum_{M} P(J, M, ET) \\ &= \frac{1}{P(ET)} \sum_{M} P(J | M, ET) P(M, ET) \\ &= \frac{1}{P(ET)} \sum_{M} P(J | M, ET) P(M \mid ET) P(ET) \\ &= \frac{1}{P(ET)} \sum_{M} P(J | M) P(M \mid ET) P(ET) \\ \end{aligned}</span></p>
</blockquote>
<p>Example: Backwards Inference.</p>
<blockquote>
<p><span class="math inline">\begin{aligned} P(ET \mid J) &= \frac{P(J \mid ET)P(ET)}{P(J)} \\ &= \frac{1}{P(J)} P(J, ET) \\ \end{aligned}</span></p>
</blockquote>
<h2 id="algorithm">Algorithm</h2>
<p>Given query variable <span class="math inline">Q</span>, evidence set <span class="math inline">E</span>, remaining variables <span class="math inline">Z</span>, set of factors <span class="math inline">F</span> corresponding to <span class="math inline">\{Q\} \cup Z</span>.</p>
<ol type="1">
<li>Replace factors <span class="math inline">f \in F</span> that mention a variable in <span class="math inline">E</span> with <span class="math inline">f_{E = e}</span>.</li>
<li>Choose elimination ordering of <span class="math inline">Z</span>.</li>
<li>For <span class="math inline">Z_j \in Z</span>, eliminate as follows.
<ul>
<li>Compute factor <span class="math inline">g_j = \sum_{Z_j} f_1 \times ... \times f_k</span>, where <span class="math inline">f_i</span> are factors that include <span class="math inline">Z_j</span>.</li>
<li>Remove the factors <span class="math inline">f_i</span> from <span class="math inline">F</span> and add new factor <span class="math inline">g_j</span> to <span class="math inline">F</span>.</li>
</ul></li>
<li>Remaining factors refer to <span class="math inline">Q</span> only. Take their product and normalize for <span class="math inline">P(Q)</span>.</li>
</ol>
<h1 id="reasoning-under-uncertainty">Reasoning Under Uncertainty</h1>
<h2 id="dynamic-inference">Dynamic Inference</h2>
<ul>
<li>Allow the world to evolve.</li>
<li>Set of states representing all possible worlds.</li>
<li>Time-slices representing snapshots of the world.</li>
<li>Different probability distributions over states at different time-slices.</li>
<li>Dynamic encoding of how distributions change over time.</li>
</ul>
<h2 id="stochastic-process">Stochastic Process</h2>
<ul>
<li>Set of states <span class="math inline">S</span>, <span class="math inline">P(s_t | s_{t-1}, ..., s_0)</span>.</li>
<li>Bayes Net with one random variable per time slice.</li>
<li>Problem since there are infinitely many variables with infinitely large CPTs.
<ul>
<li><strong>Stationary Process</strong>: Dynamics do not change over time.</li>
<li><strong>Markov Assumption</strong>: Current state depends on a finite history of past states.</li>
</ul></li>
</ul>
<h3 id="k-order-markov-process">k-Order Markov Process</h3>
<ul>
<li>Assume the last <span class="math inline">k</span> states are sufficient.</li>
<li>Problem is that uncertainty increases over time.</li>
</ul>
<h1 id="hidden-markov-model">Hidden Markov Model</h1>
<blockquote>
<p>First Order Hidden Markov Model</p>
</blockquote>
<ul>
<li>Set of states <span class="math inline">S</span>, set of observations <span class="math inline">O</span>, transition model <span class="math inline">P(s_t | s_{t-1})</span>, observation model <span class="math inline">P(o_t | s_t)</span>.</li>
</ul>
<ol type="1">
<li><strong>Monitoring</strong>: <span class="math inline">P(s_t | o_t, ..., o_1)</span>.</li>
<li><strong>Prediction</strong>: <span class="math inline">P(s_{t+k} | o_t, ..., o_1)</span>.</li>
<li><strong>Hindsight</strong>: <span class="math inline">P(s_k | o_t, ..., o_1)</span>.</li>
<li><strong>Most Likely Explanation</strong>: <span class="math inline">argmax_{s_t, ..., s_1} P(s_t, ..., s_1 | o_t, ..., o_1)</span>.
<ul>
<li>The <strong>Viterbi</strong> algorithm is a DP algorithm to compute the most likely sequence of states.</li>
</ul></li>
</ol>
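<p>A minimal Viterbi sketch for the first-order HMM above; the rain/umbrella-style states, transition probabilities, and observations are all invented for illustration.</p>

```python
def viterbi(states, start, trans, obs_model, observations):
    # prob[s]: probability of the best state sequence ending in s;
    # paths[s]: that sequence.
    prob = {s: start[s] * obs_model[s][observations[0]] for s in states}
    paths = {s: [s] for s in states}
    for o in observations[1:]:
        new_prob, new_paths = {}, {}
        for s in states:
            best = max(states, key=lambda prev: prob[prev] * trans[prev][s])
            new_prob[s] = prob[best] * trans[best][s] * obs_model[s][o]
            new_paths[s] = paths[best] + [s]
        prob, paths = new_prob, new_paths
    return paths[max(states, key=lambda s: prob[s])]

states = ['rain', 'sun']
start = {'rain': 0.5, 'sun': 0.5}
trans = {'rain': {'rain': 0.7, 'sun': 0.3}, 'sun': {'rain': 0.3, 'sun': 0.7}}
obs_model = {'rain': {'umbrella': 0.9, 'none': 0.1},
             'sun': {'umbrella': 0.2, 'none': 0.8}}
print(viterbi(states, start, trans, obs_model, ['umbrella', 'umbrella', 'none']))
```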
<h1 id="dynamic-bayes-net">Dynamic Bayes Net</h1>
<blockquote>
<p>What if the number of states or observations are exponential?</p>
</blockquote>
<ul>
<li>Encode states and observations with several random variables.</li>
<li>Exploit conditional independence to save time and space.</li>
</ul>
<blockquote>
<p>HMMs are DBN with one state variable and one observation variable.</p>
</blockquote>
<ul>
<li>Set of states <span class="math inline">S</span>, with observations for every state.</li>
<li><strong>Non-Stationary Processes</strong>: When process is not stationary, we must add new state components until dynamics are stationary.</li>
<li><strong>Non-Markovian Processes</strong>: Add new state components until dynamics are Markovian.</li>
</ul>
<h1 id="statistical-learning">Statistical Learning</h1>
<blockquote>
<p>Where do the probabilities for Bayes Nets come from?</p>
</blockquote>
<ul>
<li>Experts are scarce and expensive. Data is cheap and plentiful.</li>
<li>Build models of the world directly from data.</li>
</ul>
<p>Example: Candy is sold in two flavours, lime and cherry, with the same wrapper for both flavours. Bags have different ratios, (e.g. 100% cherry, 50% cherry 50% lime, 100% lime). After eating <span class="math inline">k</span> candies, what is the flavour ratio of the bag?</p>
<ul>
<li>Hypothesis <span class="math inline">H</span> is a probabilistic theory about the world.</li>
<li>Data <span class="math inline">D</span> is evidence.</li>
</ul>
<p><span class="math display">\begin{aligned}P(x | d) &= \sum_i P(x | d, h_i) P(h_i | d) \\ &= \sum_i P(x | h_i) P(h_i | d) \end{aligned}</span></p>
<blockquote>
<p>Predictions are weighted averages of the predictions of individual hypotheses.</p>
</blockquote>
<ul>
<li><strong>Optimal</strong>: Given prior, no other prediction is correct more often than Bayesian one.</li>
<li><strong>No Overfitting</strong>: Use prior to penalize complex hypotheses.</li>
<li><strong>Intractable</strong>: When hypothesis space is large.</li>
<li><strong>Approximations</strong>: Maximum a posteriori (MAP).</li>
</ul>
<h2 id="maximum-a-posteriori-map">Maximum a Posteriori (MAP)</h2>
<ul>
<li>Make predictions using the most probable hypothesis <span class="math inline">h_{MAP}</span>.</li>
<li><span class="math inline">h_{MAP} = argmax_{h_i} P(h_i | d)</span>, <span class="math inline">P(x | d) = P(x | h_{MAP})</span>.</li>
<li>Less accurate than Bayesian prediction since it only relies on one hypothesis, but converges as data increases.</li>
</ul>
<p><span class="math display">\begin{aligned}h_{MAP} &= arg max_h P(h | d) \\
&= arg max_h P(h) P(d | h) \\
&= arg max_h P(h) \Pi_i P(d_i | h)\end{aligned}</span></p>
<ul>
<li>This is a nonlinear optimization; take the log to turn the product into a sum. This does not change the optimal hypothesis.</li>
</ul>
<p><span class="math display"> h_{MAP} = arg max_h \left[\log P(h) + \sum_i \log P(d_i | h)\right]</span></p>
<h2 id="maximum-likelihood-ml">Maximum Likelihood (ML)</h2>
<ul>
<li>Simplify MAP by assuming uniform prior, <span class="math inline">P(h_i) = P(h_j)</span>.</li>
</ul>
<p><span class="math display">\begin{aligned}h_{MAP} &= arg max_h P(h) P(d | h) \\
h_{ML} &= arg max_h P(d | h)
\end{aligned}</span></p>
<ul>
<li>Less accurate than Bayesian and MAP, still converges as data increases.</li>
<li>No prior, so subject to overfitting.</li>
</ul>
<p>Example:</p>
<blockquote>
<p>Hypothesis <span class="math inline">h_\theta</span>: <span class="math inline">P(cherry) = \theta</span>, <span class="math inline">P(lime) = 1 - \theta</span>. Data <span class="math inline">d</span>: <span class="math inline">N</span> candies with <span class="math inline">c</span> cherry and <span class="math inline">N - c</span> lime.</p>
</blockquote>
<p><span class="math display">\begin{aligned}P(d | h_\theta) &= \theta^c (1 - \theta)^{N - c} \\
L(d | h_\theta) &= \log P(d | h_\theta) \\
&= c \log \theta + (N - c) \log(1 - \theta)
\end{aligned}</span></p>
<p>Find the <span class="math inline">\theta</span> that maximizes log likelihood.</p>
<p><span class="math display">\begin{aligned}
\frac{\partial L(d | h_\theta)}{\partial \theta} &= \frac{c}{\theta} - \frac{N - c}{1 - \theta} = 0 \\
\theta &= \frac{c}{N}
\end{aligned}</span></p>
<ul>
<li>With multiple parameters, take partial derivatives.</li>
<li>The approach extends to any Bayes network with complete data.</li>
</ul>
<h2 id="laplace-smoothing">Laplace Smoothing</h2>
<ul>
<li>What happens if we have not observed any cherry candies?</li>
<li><span class="math inline">\theta = 0</span> is not a good prediction.</li>
</ul>
<p>Given observations <span class="math inline">x = (x_1, ..., x_d)</span> from <span class="math inline">N</span> trials, we estimate the parameters <span class="math inline">\theta</span>.</p>
<p><span class="math display">\begin{aligned}\theta &= (\theta_1, ..., \theta_d) \\
\theta_i &= \frac{x_i + \alpha}{N + \alpha d},\ \alpha > 0\end{aligned}</span></p>
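<p>A direct sketch of the smoothed estimate, using the candy counts as an example:</p>

```python
def laplace(counts, alpha=1):
    # theta_i = (x_i + alpha) / (N + alpha * d) for observed counts x.
    n, d = sum(counts), len(counts)
    return [(x + alpha) / (n + alpha * d) for x in counts]

# 0 cherry and 10 lime candies observed: smoothing avoids theta_cherry = 0.
print([round(t, 3) for t in laplace([0, 10])])
```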
<h1 id="naive-bayes">Naive Bayes</h1>
<ul>
<li>Want to predict a class <span class="math inline">C</span> based on attributes <span class="math inline">A_i</span>.</li>
<li>Parameters.
<ul>
<li><span class="math inline">\theta = P(C = True)</span>.</li>
<li><span class="math inline">\theta_{j, 1} = P(A_j = True | C = True)</span>.</li>
<li>…</li>
</ul></li>
<li>Assumption is that <span class="math inline">A_i</span> are independent given <span class="math inline">C</span>.</li>
</ul>
<p>With observed attribute values <span class="math inline">x_1, ..., x_n</span>.</p>
<p><span class="math display">P(C | x_1, ..., x_n) = \alpha P(C) \Pi_i P(x_i | C)</span></p>
<p>From ML we know that the parameters are just observed frequencies with possible Laplace smoothing, we just need to choose the most likely class <span class="math inline">C</span>.</p>
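<p>A minimal sketch of this prediction rule for two boolean attributes; all parameter values are made up for illustration.</p>

```python
from math import prod

def predict(theta_c, theta_attr, x):
    # theta_c = P(C = True); theta_attr[j][c] = P(A_j = True | C = c);
    # x: observed boolean attribute values.
    score = {}
    for c, prior in [(True, theta_c), (False, 1 - theta_c)]:
        likes = [theta_attr[j][c] if xj else 1 - theta_attr[j][c]
                 for j, xj in enumerate(x)]
        score[c] = prior * prod(likes)
    z = sum(score.values())                # normalization constant alpha
    return {c: s / z for c, s in score.items()}

theta_attr = [{True: 0.9, False: 0.2}, {True: 0.7, False: 0.4}]
post = predict(0.5, theta_attr, [True, True])
print(round(post[True], 3))
```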
<h2 id="text-classifiation">Text Classification</h2>
<blockquote>
<p>Example application of Naive Bayes.</p>
</blockquote>
<ul>
<li><strong>Given</strong>: Collection of documents, classified as <em>interesting</em> or <em>not interesting</em>.</li>
<li><strong>Goal</strong>: Learn a classifier to look at a text of new documents and provide a label.</li>
</ul>
<h3 id="data-representation">Data Representation</h3>
<ul>
<li>Consider possible significant words that can occur in documents.</li>
<li>Stem words, map words to their root.</li>
<li>For every root, introduce common binary feature for whether the word is present in the document.</li>
<li>Naive Bayes assumption that words are independent of each other, so <span class="math inline">P(y | document) \propto P(y) \Pi_{i=1}^{|Vocab|} P(w_i | y)</span>.
<ul>
<li>Use ML parameter estimation <span class="math inline">P(w_i | y)</span> as the percentage of documents of class <span class="math inline">y</span> which contain word <span class="math inline">w_i</span>.</li>
</ul></li>
<li>When <span class="math inline">\theta</span> cannot be found analytically, gradient search to find a good value of <span class="math inline">\theta</span>. <span class="math inline">\theta \gets \theta + \alpha \frac{\partial L(\theta | d)}{\partial \theta}</span>.</li>
</ul>
<h1 id="expectation-maximization-em">Expectation Maximization (EM)</h1>
<p>In previous problems, the values of all attributes are known and learning is relatively easy. Many real-world problems have hidden variables, incomplete data, and missing attribute values.</p>
<h2 id="maximum-likelihood-learning">Maximum Likelihood Learning</h2>
<ul>
<li><span class="math inline">\theta_{V = True, Parent(V) = x}</span> is the proportion of instances with <span class="math inline">Parent(V) = x</span> in which <span class="math inline">V = True</span>.</li>
<li>What if some values are missing?</li>
<li><strong>Naive Solutions</strong>.
<ul>
<li>Ignore examples with missing attribute values. What if all examples have missing attribute values?</li>
<li>Ignore hidden variables; the model might become more complex.</li>
</ul></li>
</ul>
<h2 id="direct-ml"><em>Direct</em> ML</h2>
<p>Maximize likelihood directly where <span class="math inline">E</span> are the evidence variables and <span class="math inline">Z</span> are the hidden variables.</p>
<p><span class="math display">\begin{aligned}
h_{ML} &= arg max_h P(E | h) \\
&= arg max_h \sum_Z P(E, Z | h) \\
&= arg max_h \sum_Z \Pi_i CPT(V_i) \\
&= arg max_h \log \sum_Z \Pi_i CPT(V_i)
\end{aligned}</span></p>
<h2 id="expectation-maximization-em-1">Expectation-Maximization (EM)</h2>
<ul>
<li>If we knew the missing values then we can compute <span class="math inline">h_{ML}</span>.</li>
</ul>
<ol type="1">
<li>Guess <span class="math inline">h_{ML}</span>.</li>
<li>Iterate.
<ul>
<li><strong>Expectation</strong>: Based on <span class="math inline">h_{ML}</span> compute expectation of (missing) values.</li>
<li><strong>Maximization</strong>: Based on expected (missing) values, compute new <span class="math inline">h_{ML}</span>.</li>
</ul></li>
</ol>
<p><span class="math display">\begin{aligned}
h_{i+1} &= arg max_h \sum_Z P(Z | h_i, e) \log P(e, Z | h) \\
&= arg max_h \sum_Z P(Z | h_i, e) \log \Pi_j CPT_j \\
&= arg max_h \sum_Z P(Z | h_i, e) \sum_j \log CPT_j
\end{aligned}</span></p>
<p>Monotonic improvement of likelihood, <span class="math inline">P(e | h_{i+1}) \ge P(e | h_i)</span>.</p>
<p>Example: Two coins <span class="math inline">A, B</span>. Probability of heads with <span class="math inline">A</span> is <span class="math inline">\theta_A</span>, probability of heads with <span class="math inline">B</span> is <span class="math inline">\theta_B</span>. Want to find <span class="math inline">\theta_A, \theta_B</span> by performing a number of trials.</p>
<ul>
<li>Set of trials where we know which coin was used gives us <span class="math inline">\theta_A = 0.8</span>, <span class="math inline">\theta_B = 0.45</span>.</li>
<li>Now, given a set of trials where we do not know which coin was used (hidden variable).</li>
</ul>
<ol type="1">
<li>HTTTHHTHTH</li>
<li>HHHHTHHHHH</li>
<li>HTHHHHHTHH</li>
<li>HTHTTTHHTT</li>
<li>THHHTHHHTH</li>
</ol>
<p><strong>Initialization</strong>: Let’s say <span class="math inline">\theta_A^0 = 0.6</span>, <span class="math inline">\theta_B^0 = 0.5</span>. Assume that the probability of <span class="math inline">A, B</span> being used is the same, so <span class="math inline">P(A) = P(B) = 0.5</span>.</p>
<ul>
<li><strong>Expectation</strong>: Compute expected counts of heads and tails. <span class="math inline">P(A | T_i) = \frac{P(T_i | A) P(A)}{\sum_{j \in \{A, B\}} P(T_i | j) P(j)}</span>.</li>
</ul>
<ol type="1">
<li><span class="math inline">P(T_1 | A) = {10 \choose 5} 0.6^5 0.4^5 \approx 0.2</span>. <span class="math inline">P(T_1 | B) = {10 \choose 5} 0.5^{10} \approx 0.246</span>. So <span class="math inline">P(A | T_1) = \frac{0.2}{0.2 + 0.246} \approx 0.45</span>, <span class="math inline">P(B | T_1) = 1 - P(A | T_1) \approx 0.55</span>. Give both coins a proportional number of heads and tails:</li>
</ol>
<table>
<thead>
<tr class="header">
<th>Coin A</th>
<th>Coin B</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>2.2, 2.2</td>
<td>2.8, 2.8</td>
</tr>
</tbody>
</table>
<ol start="2" type="1">
<li><span class="math inline">P(T_2 | A) = {10 \choose 9} 0.6^9 0.4^1 \approx 0.0403</span>. <span class="math inline">P(T_2 | B) = {10 \choose 9} 0.5^{10} \approx 0.01</span>. So <span class="math inline">P(A | T_2) \approx 0.8</span>, <span class="math inline">P(B | T_2) \approx 0.2</span>.</li>
</ol>
<table>
<thead>
<tr class="header">
<th>Coin A</th>
<th>Coin B</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>2.2, 2.2</td>
<td>2.8, 2.8</td>
</tr>
<tr class="even">
<td>7.2, 0.8</td>
<td>1.8, 0.2</td>
</tr>
</tbody>
</table>
<p>…</p>
<table>
<thead>
<tr class="header">
<th>Coin A</th>
<th>Coin B</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>2.2, 2.2</td>
<td>2.8, 2.8</td>
</tr>
<tr class="even">
<td>7.2, 0.8</td>
<td>1.8, 0.2</td>
</tr>
<tr class="odd">
<td>5.9, 1.5</td>
<td>2.1, 0.5</td>
</tr>
<tr class="even">
<td>1.4, 2.1</td>
<td>2.6, 3.9</td>
</tr>
<tr class="odd">
<td>4.5, 1.9</td>
<td>2.5, 1.1</td>
</tr>
<tr class="even">
<td><strong>21.3, 8.6</strong></td>
<td><strong>11.7, 8.4</strong></td>
</tr>
</tbody>
</table>
<ul>
<li><strong>Maximization</strong>: Compute parameters based on expected counts and repeat. <span class="math inline">\theta^1_A = \frac{21.3}{21.3 + 8.6} = 0.71</span>, <span class="math inline">\theta^1_B = \frac{11.7}{11.7 + 8.4} = 0.58</span>.</li>
</ul>
<p>…</p>
<p>Eventually we will see similar probabilities to <span class="math inline">0.8, 0.45</span>.</p>
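<p>The whole EM loop for the two-coin example can be sketched as follows, with each trial summarized by its head count out of 10 flips and a uniform prior over which coin was used; the iteration count is an arbitrary choice.</p>

```python
from math import comb

def em(trials, theta_a, theta_b, iters=10):
    n = 10
    for _ in range(iters):
        ha = ta = hb = tb = 0.0
        for h in trials:
            # Expectation: P(A | trial) via binomial likelihoods.
            la = comb(n, h) * theta_a**h * (1 - theta_a)**(n - h)
            lb = comb(n, h) * theta_b**h * (1 - theta_b)**(n - h)
            pa = la / (la + lb)
            ha += pa * h
            ta += pa * (n - h)
            hb += (1 - pa) * h
            tb += (1 - pa) * (n - h)
        # Maximization: new parameters from expected counts.
        theta_a = ha / (ha + ta)
        theta_b = hb / (hb + tb)
    return theta_a, theta_b

# Head counts of the five trials listed above.
print(tuple(round(t, 2) for t in em([5, 9, 8, 4, 7], 0.6, 0.5)))
```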
<h2 id="k-means-algorithm">K-Means Algorithm</h2>
<ul>
<li><strong>Input</strong>.
<ul>
<li>Set of examples <span class="math inline">E</span>.</li>
<li>Input features <span class="math inline">X_1, ..., X_n</span>.</li>
<li><span class="math inline">val(e, X_j)</span>, the value of feature <span class="math inline">X_j</span> for example <span class="math inline">e</span>.</li>
<li><span class="math inline">k</span> classes.</li>
</ul></li>
<li><strong>Output</strong>.
<ul>
<li>Function <span class="math inline">class: E \to \{1, ..., k\}</span> which classifies examples.</li>
<li>Function <span class="math inline">pval</span>, where <span class="math inline">pval(i, X_j)</span> is the predicted value of feature <span class="math inline">X_j</span> for every example in class <span class="math inline">i</span>.</li>
</ul></li>
<li>Use sum-of-squares error for class <span class="math inline">i</span> and pval.</li>
</ul>
<p><span class="math display">\sum_{e \in E} \sum_{j=1}^n (pval(class(e), X_j) - val(e, X_j))^2</span></p>
<ul>
<li>Given class, the pval that minimizes the error is the mean value for that class. Prove by taking derivative.</li>
<li>Given pval, every example assigned to the class that minimizes the error for that example.</li>
</ul>
<ol type="1">
<li>Randomly assign examples to classes.</li>
<li>Repeat following EM steps until assignment does not change.
<ul>
<li><strong>M</strong>: For every class <span class="math inline">i</span> and feature <span class="math inline">X_j</span>, <span class="math inline">pval(i, X_j) = \dfrac{\sum_{e: class(e) = i} val(e, X_j)}{|\{e: class(e) = i\}|}</span></li>
<li><strong>E</strong>: For every example <span class="math inline">e</span>, assign <span class="math inline">e</span> to class that minimizes <span class="math inline">\sum_{j=1}^n (pval(class(e), X_j) - val(e, X_j))^2</span>.</li>
</ul></li>
</ol>
<ul>
<li>Will eventually converge to a stable local minimum.</li>
<li>Increasing <span class="math inline">k</span> can always decrease error but is not always meaningful.</li>
</ul>
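<p>The two EM steps above, sketched on made-up one-dimensional data with <span class="math inline">k = 2</span>:</p>

```python
def kmeans(examples, k, assign):
    while True:
        # M step: pval(i) is the mean of the examples assigned to class i.
        pval = [sum(x for x, c in zip(examples, assign) if c == i) /
                max(1, sum(1 for c in assign if c == i)) for i in range(k)]
        # E step: reassign each example to the class minimizing squared error.
        new = [min(range(k), key=lambda i: (pval[i] - x) ** 2)
               for x in examples]
        if new == assign:
            return pval, assign
        assign = new

# Two obvious 1-D clusters, deliberately scrambled initial assignment.
examples = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
pval, assign = kmeans(examples, 2, [0, 1, 0, 1, 0, 1])
print([round(p, 2) for p in pval], assign)
```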
<h1 id="markov-decision-processes">Markov Decision Processes</h1>
<h1 id="markov-chains">Markov Chains</h1>
<ul>
<li>Agent in a state <span class="math inline">s \in S</span>.</li>
<li>Initial state <span class="math inline">s_0</span>.</li>
<li>The state the agent transitions to from <span class="math inline">s_t</span> at time <span class="math inline">t</span> is determined only by the probability distribution of the current state.</li>
<li>Represent transition as probability matrix <span class="math inline">P</span>.</li>
</ul>
<h2 id="discounted-rewards">Discounted Rewards</h2>
<ul>
<li>Reward in future is not worth as much as reward now.</li>
<li><strong>Discounted sum of future rewards</strong>: Using discount factor <span class="math inline">\gamma</span>, <span class="math inline">\sum_{i \ge 0} \gamma^i R_i</span>, where <span class="math inline">R_i</span> is the reward after <span class="math inline">i</span> time steps.</li>
</ul>
<h2 id="solving-a-markov-process">Solving a Markov Process</h2>
<ul>
<li><strong>Input</strong>.
<ul>
<li>Set of states <span class="math inline">S = \{s_1, ..., s_n\}</span> with corresponding rewards <span class="math inline">\{r_1, ..., r_n\}</span>.</li>
<li>Discount factor <span class="math inline">\gamma</span>.</li>
<li>Transition probability matrix <span class="math inline">P</span>.</li>
</ul></li>
<li>Let <span class="math inline">U^*(s_i)</span> be the expected discount sum of future rewards starting at state <span class="math inline">s_i</span>.</li>
<li><span class="math inline">U^*(s_i) = r_i + \gamma \sum_{j=1}^n P_{ij} U^*(s_j)</span>.</li>
<li>Closed form solution <span class="math inline">U = (I - \gamma P)^{-1} R</span>.</li>
</ul>
<h3 id="value-iteration">Value Iteration</h3>
<p>Iteratively compute <span class="math inline">U^k(s_i)</span>, the expected discounted sum of rewards over the next <span class="math inline">k</span> time steps.</p>
<p><span class="math display">\begin{aligned}
U^1(s_i) &= r_i \\
U^k(s_i) &= r_i + \gamma \sum_{j=1}^n P_{ij} U^{k-1}(s_j)
\end{aligned}</span></p>
<ul>
<li>Repeat until convergence to desired <span class="math inline">\epsilon</span>.</li>
</ul>
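<p>The iteration above can be sketched directly; the two-state rewards and transition matrix are invented, and the result can be checked against the closed form <span class="math inline">U = (I - \gamma P)^{-1} R</span>.</p>

```python
def value_iteration(R, P, gamma, eps=1e-10):
    # U^1(s_i) = r_i; then U^k(s_i) = r_i + gamma * sum_j P_ij U^{k-1}(s_j),
    # repeated until successive iterates differ by less than eps.
    U = R[:]
    while True:
        new = [r + gamma * sum(p * u for p, u in zip(row, U))
               for r, row in zip(R, P)]
        if max(abs(a - b) for a, b in zip(new, U)) < eps:
            return new
        U = new

R = [1.0, 0.0]
P = [[0.9, 0.1], [0.5, 0.5]]
U = value_iteration(R, P, gamma=0.9)
print([round(u, 3) for u in U])
```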
<h2 id="policies">Policies</h2>
<ul>
<li>Mapping from states to actions.</li>
<li>For every MDP there exists an optimal policy. For every possible start state, there is no better option than to follow the policy.</li>
<li>Run through every possible policy and select the best.</li>
</ul>
<p>Let <span class="math inline">P^k</span> represent the transition matrix for policy <span class="math inline">k</span>.</p>
<p><strong>Bellman’s Equation</strong>: <span class="math inline">V^{t+1}(s_i) = \max_k\left[r_i + \gamma \sum_{j=1}^n P^k_{ij}V^t(s_j)\right]</span></p>
<ul>
<li>Find optimal policy by taking another step and seeing which policy gives the best result.</li>
</ul>
<h3 id="policy-iteration">Policy Iteration</h3>
<ul>
<li><strong>Policy evaluation</strong>: Given <span class="math inline">\pi</span>, compute <span class="math inline">V_i = V^\pi</span>.</li>
<li><strong>Policy improvement</strong>: Calculate new <span class="math inline">\pi_{i+1}</span> using 1-step lookahead.</li>
</ul>
<h2 id="partially-observable-mdps-pomdps">Partially Observable MDPs (POMDPs)</h2>
<ul>
<li>Agent maintains belief state <span class="math inline">b</span>, probability distribution over all states.</li>
<li>Optimal action depends only on agent’s current belief state.</li>
<li>Representable as an MDP by summing over possible states <span class="math inline">s^\prime</span> that an agent might reach.
<ul>
<li>Given current <span class="math inline">b</span>, execute action <span class="math inline">a = \pi^*(b)</span>.</li>
<li>Receive observation <span class="math inline">o</span>.</li>
</ul></li>
</ul>
<p><span class="math display">P(b^\prime | a, b) = \sum_o P(b^\prime | o, a, b) \sum_{s^\prime} O(o | s^\prime) \sum_s P(s^\prime | a, s) b(s)</span></p>
<h1 id="reinforcement-learning">Reinforcement Learning</h1>
<ul>
<li>Learning what to do to maximize a numerical reward signal.</li>
<li>Not told what actions to take.</li>
<li>Discover value of actions by trying them out and seeing what reward is.</li>
</ul>
<p><strong>Characteristics</strong>.</p>
<ul>
<li>Set of states <span class="math inline">S</span>.</li>
<li>Set of actions <span class="math inline">A</span>.</li>
<li>Set of reinforcement signals (rewards). Delayed reward possible, credit assignment problem.</li>
<li>Exploration and exploitation.</li>
<li>States may be only partially observable.</li>
<li>Continuous learning.</li>
</ul>
<h2 id="multi-armed-bandit">Multi-Armed Bandit</h2>
<blockquote>
<p>Simplest reinforcement learning problem.</p>
</blockquote>
<p>At every time step <span class="math inline">t</span>, choose action <span class="math inline">A_t</span> from <span class="math inline">k</span> possible actions, receive reward <span class="math inline">R_t</span>.</p>
<ul>
<li>Reward only depends on the action taken.</li>
<li>True values are unknown, distribution is unknown.</li>
<li><strong>Goal</strong>: Maximize total reward.</li>
</ul>
<h2 id="exploration-vs.exploitation">Exploration vs. Exploitation</h2>