Skip to content

Latest commit

 

History

History
 
 

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
title author date
非参估计作业
Lijun Wang
June 9, 2017

第一题

${Y_t}\sim U[-3,3]$,进行核密度估计

  • 选择不同的核函数进行估计,$K_1(x)$为高斯核,$K_2(x)=0.5I(\vert x\vert\le 1)$
  • 通过MSE和MISE选择窗宽
  • 画图$\sqrt{nh}(\hat f(x)-f(x))$

估计

首先生成$n$个$U(-3,3)$的随机数据。

n = 1000
set.seed(12345)
y = runif(n, -3, 3)

自己编写核函数估计的函数

mykde(y, kernel, h, reflect, y.range)

其中参数"y"是需要进行核密度估计的数据,"kernel"可以选择高斯核函数("gaussian")或者均匀核函数("uniform"),"h"为窗宽,"reflect"表示是否通过反射处理边缘数据,如果需要进行反射处理,则需要输入参数"y.range"来确定对称轴。具体实现细节如下:

mykde <- function(y, kernel = "gaussian", h = 0.1, reflect = FALSE, y.range=NULL )
{
  n = 1000
  x = seq(-3, 3, length.out = 1000)
  x.left = x[1:250]
  x.middle = x[251:750]
  x.right = x[751:1000]
  if (kernel == "gaussian")
  {
    kernel.gaussian <- function(x)
    {
      return(exp(-x^2/2)/sqrt(2*pi))
    }
    if (reflect)
    {
      fhat.left = sapply(x.left,
        function(x) sum(kernel.gaussian((y-x)/h) +
                          kernel.gaussian((2*y.range[1]-y-x)/h))/(n*h))
      fhat.middle = sapply(x.middle,
        function(x) sum(kernel.gaussian((y-x)/h))/(n*h))
      fhat.right = sapply(x.right,
        function(x) sum(kernel.gaussian((y-x)/h) +
                          kernel.gaussian((2*y.range[2]-y-x)/h))/(n*h))
      fhat = c(fhat.left, fhat.middle, fhat.right)
    }  
    else
      fhat = sapply(x, function(x) sum(kernel.gaussian((y-x)/h))/(n*h))
  }
  else if (kernel == "uniform")
  {
    kernel.uniform <- function(x)
    {
      x[abs(x) < 1] = 1
      x[abs(x) > 1] = 0
      return(0.5*x)
    }
    if (reflect)
    {
      fhat.left = sapply(x.left,
        function(x) sum(kernel.uniform((y-x)/h) +
                          kernel.uniform((2*y.range[1]-y-x)/h))/(n*h))
      fhat.middle = sapply(x.middle,
        function(x) sum(kernel.uniform((y-x)/h))/(n*h))
      fhat.right = sapply(x.right,
        function(x) sum(kernel.uniform((y-x)/h) +
                          kernel.uniform((2*y.range[2]-y-x)/h))/(n*h))
      fhat = c(fhat.left, fhat.middle, fhat.right)
    }
    else
      fhat = sapply(x, function(x) sum(kernel.uniform((y-x)/h))/(n*h))
  }
  else
    cast(paste0("Can not support ", kernel,
                "kernel, please change to gaussian kernel or uniform kernel.."))
  res = list(x = x,
             fhat = fhat)
  return(res)
}

取窗宽$h=0.395$,在考虑边缘效应和不考虑边缘效应时进行估计。对于反射的具体处理如下,若$y\in [a,b]$。则对较小的$y$(从小到大排序,位于前25%)取关于$y=a$的对称点,对较大的$y$(从小到大排序,位于后25%)取关于$y=b$的对称点,即 $$ \begin{array}{ll} f^(x)&=\frac 1{nh}\sum\limits_{t=1}^n[K(\frac{y_t-x}h)+K(\frac{2a-y_t-x}h)], \qquad x\in [a, a+\frac{1}4(b-a)]\ f^(x)&=\frac 1{nh}\sum\limits_{t=1}^n[K(\frac{y_t-x}h)+K(\frac{2b-y_t-x}h)], \qquad x\in [a+\frac{3}4(b-a), b] \end{array} $$

## 高斯核
mykde.gaussian = mykde(y, kernel = "gaussian", h = 0.395)
mykde.gaussian2 = mykde(y, kernel = "gaussian", h = 0.395,
                        reflect = TRUE, y.range = c(-3,3))
par(mfrow = c(1, 2))
plot(mykde.gaussian$x, mykde.gaussian$fhat, ylim = c(0, 1/5),
     cex.axis = .7, cex = 0.5)
segments(-3,1/6,3,1/6,col = "red")
plot(mykde.gaussian2$x, mykde.gaussian2$fhat, ylim = c(0, 1/5),
     cex.axis = .7, cex = 0.5)
segments(-3,1/6,3,1/6,col = "red")

左图是不考虑边缘效应的核密度估计,而右图是通过反射法来考虑边缘效应进行核密度估计。从高斯核密度估计的这两种图象可以看出,反射处理能够有效减轻边缘效应带来的核密度估计的影响。

## 均匀核
mykde.uniform = mykde(y, kernel = "uniform", h = 0.395)
mykde.uniform2 = mykde(y, kernel = "uniform", h = 0.395,
                       reflect = TRUE, y.range = c(-3,3))
par(mfrow = c(1, 2))
plot(mykde.uniform$x, mykde.uniform$fhat, ylim = c(0, 1/5),
     cex.axis = .7, cex = 0.5)
segments(-3,1/6,3,1/6,col = "red")
plot(mykde.uniform2$x, mykde.uniform2$fhat, ylim = c(0, 1/5),
     cex.axis = .7, cex = 0.5)
segments(-3,1/6,3,1/6,col = "red")

从这两张图象可以看出,反射处理能够有效减轻边缘效应带来的核密度估计的影响。从均匀核密度估计的这两张图象可以看出,反射处理能够有效减轻边缘效应带来的核密度估计的影响。

确定窗宽

下面选择高斯核函数,并采用反射方法处理边缘效应。

MSE

h = seq(0.01, 1, length.out = 100)
fhat = sapply(h, function(x) {res = mykde(y, h = x,
                              reflect = TRUE, y.range = c(-3,3)); res$fhat})
fhat.mse = apply(fhat, 2, function(x) (x-1/6)^2)
fhat.mse.min = apply(fhat.mse, 1, function(x) which.min(x))

以$y_{123}, y_{523}$为例,做出$E(\hat f(x)-f(x))^2$关于$h$的图象

par(mfrow = c(1,2))
plot(h, fhat.mse[123, ], ylab = "MSE", xlab = "h")
plot(h, fhat.mse[523, ], ylab = "MSE", xlab = "h")

从图象可以看出MSE的变化趋势,当然对于不同的点其变化趋势不同,对于每个点我们选择使得MSE最小的那个$h$,这样选出来的$h$如下(只显示了前20个x对应的$h$)

h[fhat.mse.min[1:20]]

MISE

fhat.mise = colSums(fhat.mse)/ncol(fhat.mse)
plot(h, fhat.mise, ylab = "MISE", xlab = "h")

从图象看起来,MISE是随着$h$的增大而降低的(在给定的$h$的范围内),这与上课老师讲的最优$h$在$O(h^{-1/5})$处取得似乎存在矛盾。但注意到当$f(x)=1/6$时 $$ Bias=\frac {h^2}2f''(x)\int_{-A}^Az^2K(z)dz+o(h^2) = o(h^2) $$ 则MSE主要依赖方差项, $$ Var(\hat f(x))\approx\frac 1{nh} \int_{-A}^AK^2(z)dz \cdot f(x) $$ 对于方差项,$h$越大,则方差越小,从而MSE(MISE)随着h的增大而降低,如果从这个角度看,图象中表现出的MISE随着$h$的增大而降低是合理的。

渐近分布

取$h=0.1$,采用高斯核函数进行核密度估计,考虑 $$ \sqrt{nh}(\hat f(x)-f(x)) $$ 的渐进分布

n = 10000
set.seed(12345)
y = runif(n, -3, 3)
hh = 0.1
res = mykde(y, kernel = "gaussian", h = hh, reflect = TRUE, y.range = c(-3,3))
fhat = res$fhat
plot(density(sqrt(n*hh)*(fhat-mean(fhat))), main = "limiting distribution")
# normal
x = seq(-5, 5, length.out = 1000)
sigmahat = 1.2
p = 1/sqrt(2*pi*sigmahat^2)*exp(-x^2/(sigmahat*2))
lines(x, p, col = "red")

图中黑线是$\sqrt{nh}(\hat f(x)-f(x))$在$n=10000, h=0.1$情形下的密度曲线,红色曲线是$\sigma=1.2, \mu=0$的正态分布密度曲线,可以看出两者近似程度还是相当高的,与极限分布为正态的结论是一致的。

第二题

$$ Y_t=m(X_t)+\varepsilon_t $$ 其中 $$ m(x)=2sin\pi x \qquad x\in[0,2] $$ $\varepsilon_t$任取,模拟时取$X_t\sim U[0,2]$

对$\hat m(x)$进行局部线性估计,计算$\hat m(x)$的MSE和MISE.

理论推导

由泰勒展开有 $$ m(x)=m(x_0)+(x-x_0)m'(x_0)+o(\vert x-x_0\vert), \qquad x\rightarrow x_0 $$

则有 $$ Y_t=m(x_t)=\alpha+\beta(x_t-x_0)+\varepsilon_t $$ 则局部最小二乘估计为 $$ (\hat \alpha(x_0), \hat\beta(x_0))=\underset{\alpha(x_0),\beta(x_0)}{arg; min};\sum\limits_{t=1}^n(y_t-\alpha-\beta(x_t-x_0))^2K(\frac{x_t-x_0}h) $$ 记

$$ X = \left( \begin{array}{ll} 1 & x_1-x_0\ 1 & x_2-x_0\ \cdot &\cdots\ 1 & x_n-x_0 \end{array} \right) \quad W = \left( \begin{array}{ccc} K(\frac{x_1-x_0}{h}) & &O\ & \ddots& \ O&&K(\frac{x_n-x_0}{h}) \end{array} \right) \qquad Y= \left( \begin{array}{c} y_1\ y_2\ \vdots\ y_n \end{array} \right) $$ 并记$\bm{\beta} =(\alpha,\beta)'$

则局部最小二乘估计写成矩阵形式为

$$ \hat {\bm\beta(x_0)} = \underset{\bm\beta(x_0)}{arg;min}; (Y-X\bm\beta)^TW(Y-X\bm\beta) $$ 即为加权最小二乘估计, $$ \hat{\bm\beta}(x_0)=(X'WX)^{-1}(XWY) $$ 且 $$ \hat m(x_0) = \alpha(x_0) $$ 在"lm()"函数中指定权重"weights=W"便可以求解。

编程求解

选择高斯噪声,产生$n$个数据

n = 1000
set.seed(1234)
xt = runif(n, 0, 2)
set.seed(4321)
et = rnorm(n)
yt = 2*sin(pi*xt) + et
#yt = xt + et

编写myloess函数得到局部最小二乘的估计结果

myloess(xt, yt, h, ngrid)

其中,"xt" 和"yt" 分别是数据点的横纵坐标,"h" 为窗宽,"n.grid"是网格剖分的格数,具体函数实现细节如下

myloess <- function(xt, yt, h, ngrid)
{
  x.grid = seq(0, 2, length.out = ngrid)
  kernel.gaussian <- function(x)
  {
      return(exp(-x^2/2)/sqrt(2*pi))
  }
  kernel.uniform <- function(x)
  {
      x[abs(x) < 1] = 1
      x[abs(x) > 1] = 0
      return(0.5*x)
  }
  weight <- function(x)
  {
    return(sapply(xt, function(xx) kernel.gaussian((xx-x)/h)))
  }
  w.grid = sapply(x.grid, weight)
  alpha.grid = numeric(ngrid)
  beta.grid = numeric(ngrid)
  y.grid = numeric(ngrid)
  for(i in 1:ngrid)
  {
    lm.x = xt - x.grid[i]
    lm.fit = lm(yt ~ lm.x, w = w.grid[,i])
    lm.coef = coef(lm.fit)
    alpha.grid[i] = lm.coef[[1]]
    beta.grid[i] = lm.coef[[2]]
    y.grid[i] = alpha.grid[i]
    #y.grid[i] = alpha.grid[i] + beta.grid[i]*(x.grid[i]-x.grid[i])
    #y.grid[i] = predict(lm.fit, data.frame(lm.x = x.grid[i]))
  }
  res = list(alpha = alpha.grid,
             beta = beta.grid,
             x = x.grid,
             y = y.grid)
  return(res)
}

选取窗宽$h=0.15, ngrid=512$进行模拟,得到下面结果

res = myloess(xt, yt, h = 0.15, ngrid = 512)

作出核密度估计拟合后的结果,与原始结果进行比较,可以看出局部线性估计可以很好地得到$y$与$x$原始的关系。

par(mfrow=c(1,2))
plot(xt, yt)
plot(res$x, res$y)
lines(res$x, 2*sin(pi*res$x), col = "red")

右图中红色线条为$y=2sin\pi x$的曲线,可以看出局部线性估计在极值处的估计偏小,即$x=0.5,1.5$处,其余地方与真实的差距不是很大。

MSE和MISE

$$ MSE(\hat m(x))=E(\hat m(x)-m(x))^2 $$ $$ MISE(\hat m(x)) = E\int R (\hat m(x)-m(x))^2dx = \frac{1}{N}\sum\limits{i=1}^N(\hat m(x_i)-m(x_i)) $$ 编写下列代码计算MSE和MISE

# MSE
mx.mse = (res$y - 2*sin(pi*res$x))^2
summary(mx.mse)
# MISE
mx.mise = mean(mx.mse)
mx.mise

计算出所有点的MSE,其四分位数分布结果如上所示,其均值即为MISE,即MISE=0.0172489。