klir/tutorial.Rmd at master · kingsleyiyawe/klir · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
---
title: "Using KLI in R"
author: "Sean Carver"
output: html_document
---

To use KLI-R you must first source the code:

```{r sourcing}
source('kli.R')
```

The next step is to code (and source) the models you need.  However, you can skip this step if it has already been done for you.  For the moment, KLI-R comes only with code to study models of the t-distribution with different degress of freedom (df) and the standard Normal distribution with df = Inf.  We will work with those models first, then I'll show you how to add your own models.

To use KLI-R you must have data.  You can gather data in an experiment or you can simulate data.  To simulate data, you must decide what model to use for the simulations.  If you want 10 samples from the standard normal distribution type:

```{r normaldata}
sim.t(n=10, df=Inf)
```

The first parameter n=10 is a required parameter that specifies the number of samples returned in the data, in this case 10.  The result is a vector of 10 numbers, one for each sample.

The second parameter df=Inf is the number of degrees of freedom.
The standard normal distribution is the limit of the t-distribution as its parameter "df," degrees of freedom, grows to infinity.  Currently there is no way to collect data from a non-standard normal distribution using the code from KLI-R, however there is no reason why you couldn't use the R function rnorm() to produce such data.  When we discuss writing our own models, I'll show you how to write a wrapper for rnorm() that will make its use more convenient.  The function sim.t() is such a wrapper for R's t-distribution simulator rt().

On the other hand, using KLI-R code, if you want to simulate data from a t-distribution, other than standard normal, set the parameter "df" to something other than Inf.

```{r t5data}
sim.t(n=10, df=5)
```

If you do not specify df, it defaults to Inf.

```{r normaldata2}
sim.t(n=10)
```

Notice the numbers obtained the second time (without explicitly setting df) differed from the numbers we obtained the first time (explicitly setting df=Inf).  Different data arise because we are using a dynamic seed.  Even if we reexcecute commands identical to those shown above, we would get different data, each time.  To avoid this behavior, we need to fix the seed, either by using the set.seed() function (not shown), or by passing in the parameter "seed" (shown below).  With the same seed, you get the same data:

```{r normaldataseed6}
sim.t(n=10, df=Inf, seed=6)
```

```{r normaldata2seed6}
sim.t(n=10, seed=6)
```

The next step evaluates the log-likelihood that a model (the same model or another misspecified model) produces the data.  For models with distributions other than the t-distribution or the standard normal distribution, you will need to code the log-likelihood function and run it on your data.  However, for the t-distribution or standard normal, I have done this step for you with the function likes.t(x, df=Inf).  Here x is your data and is a required parameter.  The other parameter df is an optional parameter which defaults to Inf (standard normal), if you omit a value for df.

First let's collect some more data and save it to the variable data.df5:

```{r datadf5}
data.df5 <- sim.t(n=100, df=5, seed=7)
head(data.df5)
```

The log-likelihood at a given data point is the value of the log of the probability density function at that data point.

The probability density function of the model in question (t with 5 degrees of freedom) is shaped roughly like a standard Normal density curve (i.e. a bell curve), except that there is more probability mass in the tails, and less probability mass in the center.

Here is a picture of the density curves (I use the function "dt" here instead of "likes.t" to avoid taking the log) for the t distribution with 5 degrees (red) and t with infinite degrees of freedom (blue, standard normal).  The overlap is in purple.

```{r df5infdcurve}
library(ggplot2)
densitycurve5 <- function(x) {return(dt(x,df=5))}
densitycurveinf <- function(x) {return(dt(x,df=Inf))}
ggplot(NULL,aes(c(-3,3))) +
  geom_area(stat="function",fun=densitycurve5,fill="red",alpha=I(.4),xlim = c(-3,3), show.legend=TRUE) +
  geom_area(stat="function",fun=densitycurveinf,fill="blue",alpha=I(.4),xlim=c(-3,3), show.legend=TRUE) +
  xlab("Data Point") + ylab("Density") + scale_fill_manual(name="df",values=c("red","blue"),labels=factor(c(5,Inf))) +
  theme(legend.position="right")
```

The blue band shows where the standard normal density curve lies above the t(5) density curve (in the center).  And the red bands where the heavy tails of the t-distribution lie above the normal density curve.

Now let's evaluate the log-likelihoods of 5-degree of freedom data points, with the model correctly specified:

```{r likesdf5}
likes.df5 <- likes.t(data.df5, df=5)
head(likes.df5)
```

Now let's evaluate the log-likelihood of the same data, with the model misspecified as infinite-degrees of freedom:

```{r likesinf}
likes.dfinf <- likes.t(data.df5, df=Inf)
head(likes.df5)
```
Now let's evaluate the log of the likelihood ratio (difference of log-likelihoods):
```{r likerat}
likes.rat<-like.ratios(likes.df5, likes.dfinf)
head(likes.rat)
```