📈統計的問題を回避するためのデータ解析のプロトコル (Zuur et al. 2010): 4 データの中にゼロがたくさんあるか？

この記事では統計的問題を回避するためのデータ解析のプロトコル (Zuur et al. 2010)で扱われているゼロ過剰問題を取り扱っている。

uribo.hatenablog.com

離散値の整数かならるカウントデータの多くはポアソン分布に従うことが一般的である。しかし、ある生息地における生物の観察数やスポーツにおける試合の得点など、０を多く含むデータが存在する。

そうしたデータについて統計モデルを適用する場合、ポアソン分布や負の二項分布を仮定した一般化線形モデル GLMなどを行うと、ポアソン分布で期待されるよりも過剰（あるいは過少）にデータが観測されることがあり推定がうまくいかないことがある。そのデータのように０の割合が多いデータに対して有効なモデルがゼロ過剰なポアソン分布モデル Zero-inflated Poisson Distribution: ZIPモデルである。

📉 カウントデータにポアソン分布を用いる

"Analysis of Categorical Data with R (Christopher R. Bilder and Thomas M. Loughin 2014)"から例題として出されているデータセット(Tauber et al. 1996)を利用させてもらう。このデータセットはGalerucella nymphaeaeという甲虫のオスとメスのペアを異なる温度条件で飼育し、産卵数を観察したものである。

このデータから温度に対する産卵数の効果を検討してみたい。

library(data.table)
library(dplyr)

beetle_egg <- fread(input = "BeetleEggCrowding.txt",
                        data.table = FALSE) %>% 
  dplyr::filter(TRT == "I")

beetle_egg$NumEggs %>% {
  print(class(.))
  print(head(.))
}

## [1] "integer"
## [1] 8 0 3 5 1 3

まずはGLMでポアソン分布を当てはめてみる

poi_mod <- beetle_egg %>% glm(NumEggs ~ Temp, 
                              family = poisson(),
                              data   = .)
summary(poi_mod)

## 
## Call:
## glm(formula = NumEggs ~ Temp, family = poisson(), data = .)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.947  -2.662  -1.259   1.569   5.174  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16026    0.91570  -0.175   0.8611
## Temp         0.06787    0.04034   1.682   0.0925
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 418.28  on 69  degrees of freedom
## Residual deviance: 415.43  on 68  degrees of freedom
## AIC: 566.09
## 
## Number of Fisher Scoring iterations: 6

mu_poi <- poi_mod %>% predict() %>% exp() 
zero_poi <- -mu_poi %>% exp() %>% sum()
round(zero_poi, digits = 2)

## [1] 1.47

実際の０の観察値（27）に対し、０の期待値が過少に評価されてしまった。

📈 ゼロ過剰なポアソン分布

Rのパッケージでは

Which is the best R package for zero-Inflated count data? - ResearchGate

という話題がでるほどたくさんあるのだが、ここでは最尤推定法を利用する{pscl}とMCMCサンプリングによるベイズ推定を行う{rstan} とそのラッパーパッケージ{brms}と{glmmstan}の４つのパッケージを試してみる。

library(pscl)
library(brms)
library(glmmstan)

📦 {pscl}

zip_mod <- beetle_egg %>% zeroinfl(NumEggs ~ Temp | 1, dist = "poisson", data = .)
#  distを明示しない場合と一緒
zip_mod %>% {
  print(summary(.))
  zero_zip <<- sum(predict(object = ., type = "prob")[,1]) %>% round(digits = 2)
    # type... prob, count, zero
}

## 
## Call:
## zeroinfl(formula = NumEggs ~ Temp | 1, data = ., dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0998 -1.0318 -0.6531  0.8632  3.2015 
## 
## Count model coefficients (poisson with log link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.44898    0.92249  -1.571 0.116245
## Temp         0.14700    0.04061   3.620 0.000295
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.4720     0.2466  -1.914   0.0556
## 
## Number of iterations in BFGS optimization: 8 
## Log-likelihood:  -188 on 3 Df

zero_zip

## [1] 27.02

{pscl}のzeroinfl()で真の０の度数に近い値が得られた。

先ほどのポアソン分布を当てはめたモデルと比べてみても

vuong(zip_mod, poi_mod)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A      p-value
## Raw                    4.620874 model1 > model2 0.0000019106
## AIC-corrected          4.571228 model1 > model2 0.0000024244
## BIC-corrected          4.515414 model1 > model2 0.0000031597

ZIPモデルのほうが妥当であるという結果が得られた。

📦 {rstan}

Korner-Nievergeltらの本（"Bayesian Data Analysis in Ecology Using Linear Models with R (2015)"）にZIPモデルの内容が書かれていたので、それを参考に書いたStanコードとその実行結果をしれっと追記しておく（2015-11-26） 📝

stan_code <- '
data {
  int<lower=0> N;
  int<lower=0> NumEggs[N];
  vector[N] Temp;
}
parameters {
  vector[2] a;
  vector[2] b;
}
model {
  // transformed parameters
  vector[N] theta;
  vector[N] lambda;

  for(i in 1:N){
    theta[i] <- inv_logit(a[1] + a[2] * Temp[i]);
    lambda[i] <- exp(b[1] + b[2] * Temp[i]); 
  }

  // priors
  a[1] ~ normal(0,5);
  a[2] ~ normal(0,5);
  b[1] ~ normal(0,5);
  b[2] ~ normal(0,5);

  // likelihood
  for (i in 1:N) {
    if(NumEggs[i] == 0) {
      increment_log_prob(log_sum_exp(bernoulli_log(1, theta[i]), bernoulli_log(0, theta[i]) + poisson_log(NumEggs[i], lambda[i])));
    } else {
      increment_log_prob(bernoulli_log(0, theta[i]) + poisson_log(NumEggs[i], lambda[i]));
    }
  }
}
'
list_beetle_egg <- list(N       = nrow(beetle_egg),
                        NumEggs = beetle_egg$NumEggs,
                        Temp    = beetle_egg$Temp)

stan_code_c <- stan_model(model_code = stan_code)
stan_mod <- sampling(object = stan_code_c,
                data = list_beetle_egg,
                seed = 71,
                chains = 3,
                iter   = 10000,
                warmup = 8000)
stan_mod

## Inference for Stan model: a38fdd33df8e5dc6fefa3bbb31161a8f.
## 3 chains, each with iter=10000; warmup=8000; thin=1; 
## post-warmup draws per chain=2000, total post-warmup draws=6000.
## 
##         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
## a[1]   -3.36    0.09 3.10   -9.56   -5.39   -3.36   -1.21    2.67  1183
## a[2]    0.13    0.00 0.14   -0.14    0.03    0.13    0.22    0.40  1181
## b[1]   -1.41    0.03 0.91   -3.26   -1.99   -1.41   -0.81    0.41  1264
## b[2]    0.15    0.00 0.04    0.07    0.12    0.15    0.17    0.23  1273
## lp__ -189.58    0.05 1.42 -193.29 -190.27 -189.26 -188.52 -187.79   942
##      Rhat
## a[1] 1.00
## a[2] 1.00
## b[1] 1.01
## b[2] 1.00
## lp__ 1.00
## 
## Samples were drawn using NUTS(diag_e) at Thu Nov 26 06:26:45 2015.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

📦 {brms}

続いて{brms}でZIPモデルを適用する。

zip_mod_brms <- beetle_egg %>% 
  brm(NumEggs ~ Temp, 
      data     = ., 
      seed     = 71,
      n.chains = 3,
      n.iter   = 10000,
      n.warmup = 8000,
      family   = "zero_inflated_poisson")

summary(zip_mod_brms)

##  Family: zero_inflated_poisson (log) 
## Formula: NumEggs ~ Temp 
##    Data: . (Number of observations: 70) 
## Samples: 3 chains, each with n.iter = 10000; n.warmup = 8000; n.thin = 1; 
##          total post-warmup samples = 6000
##    WAIC: 466.74
##  
## Fixed Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept    -2.01      1.03    -3.90     0.03        528    1
## Temp          0.17      0.05     0.08     0.25        526    1
## 
## Samples were drawn using NUTS(diag_e). For each parameter, Eff.Sample is a 
## crude measure of effective sample size, and Rhat is the potential scale 
## reduction factor on split chains (at convergence, Rhat = 1).

📦 {glmmstan}

最後に{glmmstan}でZIPモデルを適用する。{glmmstan}は最近のバージョンでZIPモデルを始め、より多くの分布に対応したとのこと。

zip_mod_glmmstan <- beetle_egg %>% 
  glmmstan(NumEggs ~ Temp, 
           data   = ., 
           family = "zipoisson",
           chains = 3,
           iter   = 10000,
           warmup = 8000)

# print(zip_mod_glmmstan, digits = 2, pars = "beta")

{glmmstan}では、output_result()を使って回帰モデルの係数を取得できる、というのを開発者の\@simizu706 さんに教えてもらったので追記（2015-11-26） 📝

output_result(zip_mod_glmmstan) %>% {
  .$beta %>% data.frame() %>% kable(format = "markdown", digit = 3) %>% print()
  .$beta_zero %>% data.frame() %>% kable(format = "markdown", digit = 3)
}

	coefficient	stdev	X95.lower	X95.upper
Intercept	-1.447	0.963	-3.295	0.432
Temp	0.147	0.042	0.063	0.228

	coefficient	stdev	X95.lower	X95.upper
Intercept_zero	-5.4688688	3.8265784	-13.5053596	2.105523
Temp_zero	0.2207903	0.1686405	-0.1099017	0.577800

Stanのモデルコードについてはもう少し検討する必要があるみたいだが、これらのパッケージを利用してゼロが過剰なポアソン分布のデータに対応することができた。

☕ 雑

RでZIPモデルを行うパッケージ。怪しいものもある。上のものから気に入っている。下の３つは検証していない。

{pscl}... 良い。
{glmmstan}... 良い。{rstan}のラッパー
{brms}... Stanを利用した線形モデル
{blme}... {lme4}のベイズ拡張。ランダム効果を取り組んだモデルに使える
{MCMCglmm}... 引数の与え方がちょっと変わっている。事前分布などの細かい指定ができるっぽい。vignettesが充実しているので後で読む。ランダム効果を取り込まないのならば{pscl}で十分か (ref. http://grokbase.com/t/r/r-help/112epvax7t/r-mcmc-glmm)
{glmmADMB} CRANにもGitHubにもないどこか怪しいパッケージ
{COZIGAM} アーカイブされている
{gamlss.dist}... 使えなさそう

おまけ。

## 負の二項分布
negbim_mod <- beetle_egg %>% glm.nb(NumEggs ~ Temp, 
                      data = ., 
                      init.theta = 0.61, 
                      link = log)
summary(negbim_mod)

## 
## Call:
## glm.nb(formula = NumEggs ~ Temp, data = ., init.theta = 0.4684948827, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4773  -1.4185  -0.4396   0.4459   1.2875  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.16026    2.78026  -0.058    0.954
## Temp         0.06787    0.12320   0.551    0.582
## 
## (Dispersion parameter for Negative Binomial(0.4685) family taken to be 1)
## 
##     Null deviance: 74.169  on 69  degrees of freedom
## Residual deviance: 73.865  on 68  degrees of freedom
## AIC: 342.43
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.468 
##           Std. Err.:  0.106 
## 
##  2 x log-likelihood:  -336.429

📚 参考

Zuur et al. (2010). A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1: 3--14.
岩崎学. (2010). カウントデータの統計学. 朝倉書店.
Christopher R. Bilder and Thomas M. Loughin. (2014). Analysis of Categorical Data with R. CRC Press.
Zuur et al. (2009). Mixed Effects Models and Extensions in Ecology with R. Springer.
Martin et al. (2005). Zero tolerance ecology: improving ecological inference by modelling the source of zero observations. Ecology Letters 8:1235--1246.
Fränzi Korner-Nievergelt et al. (2015). Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS, and Stan. Elsevier
R Data Analysis Examples: Zero-Inflated Poisson Regression
StanでZero-inflated Poissonモデル：Taglibro de H：So-netブログ
Count data and GLMs: choosing among Poisson, negative binomial, and zero-inflated models | Datavore Consulting