В целях сравнения с расчетом доли доверительных интервалов (CI) на основе опроса я здесь выкладываю код для вычисления CI, используя Центральную предельную теорему (CLT) для неидентично распределенных переменных (согласно запросу оригинала). Плакат (ОП) в комментариях):
1) Определение вспомогательных функций
#' Compute survey-based Confidence Intervals
#'
#' @param df data frame with at least one column: "Category".
#' @param design survey design, normally created with \code{svydesign()} of the survey package.
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using the \code{svymean()} function of the survey package on the given \code{design}.
#' @return data frame containing estimated proportions for each "Category" and respective
#' 95% confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_survey <- function(df, design) {
design.mean = svymean(~Category, design, deff="replace")
CI = confint( design.mean )
# Add the estimated proportions and Delta = Semi-size CI for comparison with the CLT-based results below
CI = cbind( p=design.mean, Delta=( CI[,"97.5 %"] - CI[,"2.5 %"] ) / 2, CI )
return(as.data.frame(CI))
}
#' Compute CLT-based Confidence Intervals
#'
#' @param df data frame with at least two columns: "Category", "Weight" (sampling weights)
#' @details The confidence interval for the proportion of each "Category" present in \code{df}
#' is computed using Lyapunov's or Lindenberg's version of the Central Limit Theorem (CLT) which
#' applies to independent but NOT identically distributed random variables
#' Ref: https://en.wikipedia.org/wiki/Central_limit_theorem#Lack_of_identical_distribution
#' @return data frame containing estimated proportions for each "Category" and respective
#' \code{ci_level} confidence intervals.
#'
#' ASSUMPTION: the data do not have any missing values
ci_clt <- function(df, ci_level) {
### The following calculations are based on the following setup for each category given in 'df':
### 1) Let X(i) be the measurement of variable X for sampled case i, i = 1 ... n (n=500 in this case)
### where X is a 0/1 variable indicating absence or presence of a selected category.
### From the X(i) samples we would like to estimate the
### true proportion p of the presence of the category in the population.
### Therefore X(i) are iid random variables with Binomial(1,p) distribution
###
### 2) Let Y(i) = w(i)*X(i)
### where w(i) is the sampling weight applied to variable X(i).
###
### We apply the CLT to the sum of the Y(i)'s, using:
### - E(Y(i)) = mu(i) = w(i) * E(X(i)) = w(i) * p (since w(i) is a constant and the X(i) are identically distributed)
### - Var(Y(i)) = sigma2(i) = w(i)^2 * Var(X(i)) = w(i)^2 * p*(1-p) (since the X(i) iid)
###
### Hence, by CLT:
### Sum{Y(i) - mu(i)} / sigma -> N(0,1)
### where:
### sigma = sqrt( Sum{ sigma2(i) } ) = sqrt( Sum{ w(i)^2 } ) * sqrt( p*(1-p) )
### and note that:
### Sum{ mu(i) } = Sum{ w(i) } * p = n*p
### since the sampling weights are assumed to sum up to the sample size.
###
### Note: all the Sums are from i = 1, ..., n
###
### 3) Compute the approximate confidence interval for p based on the N(0,1) distribution
### in the usual way, by first estimating sigma replacing p for the estimated p.
###
alpha = 1 - ci_level # area outside the confidence band
z = qnorm(1 - alpha/2) # critical z-quantile from Normal(0,1)
n = nrow(df) # Sample size (assuming no missing values)
ws = df$Weight / sum(df$Weight) * n # Weights scaled to sum the sample size (assumed for sampling weights)
S = aggregate(ws ~ Category, sum, data=df) # Weighted-base estimate of the total by category (Sum{ Y(i) })
sigma2 = sum( ws^2 ) # Sum of squared weights (note that we must NOT sum by category)
S[,"p"] = S[,"ws"] / n # Estimated proportion by category
S[,"Delta"] = z * sqrt( sigma2 ) *
sqrt( S$p * (1 - S$p) ) / n # Semi-size of the CI by category
LB_name = paste(formatC(alpha/2*100, format="g"), "%") # Name for the CI's Lower Bound column
UB_name = paste(formatC((1 - alpha/2)*100, format="g"), "%") # Name for the CI's Upper Bound column
S[,LB_name] = S[,"p"] - S[,"Delta"] # CI's Lower Bound
S[,UB_name] = S[,"p"] + S[,"Delta"] # CI's Upper Bound
return(S)
}
#' Show the CI with the specified number of significant digits
show_values <- function(values, digits=3) {
op = options(digits=digits)
print(values)
options(op)
}
2) Имитация данных
### Simulated data
set.seed(100)
df <- data.frame(
Category = rep(c("A", "B", "C", "D"), times = seq(50, 200, length.out = 4)),
Weight = sample(c(1, 1/2, 1/3, 1/4, 1/5), 500, prob = c(0.1, 0.2, 0.4, 0.2, 0.1), replace = TRUE)
)
3) CI на основе обследования (для справки)
# Computation of CI using survey sampling theory (implemented in the survey package)
library(survey)
design <- svydesign(ids = ~1, weights = ~Weight, data = df)
CI_survey = ci_survey(df, design)
show_values(CI_survey)
, что дает:
p Delta 2.5 % 97.5 %
CategoryA 0.0896 0.0268 0.0628 0.116
CategoryB 0.2119 0.0417 0.1702 0.254
CategoryC 0.2824 0.0445 0.2379 0.327
CategoryD 0.4161 0.0497 0.3664 0.466
4) CI на основе CLT
Описание используемого метода включено в комментарии в верхней части функции ci_clt()
, определенной выше.
# Computation of CI using the Central Limit Theorem for non-identically distributed variables
CI_clt = ci_clt(df, ci_level=0.95)
show_values(CI_clt)
, которая дает:
Category ws p Delta 2.5 % 97.5 %
1 A 44.8 0.0896 0.0286 0.061 0.118
2 B 106.0 0.2119 0.0410 0.171 0.253
3 C 141.2 0.2824 0.0451 0.237 0.328
4 D 208.0 0.4161 0.0494 0.367 0.465
5) Сравнение размеров CI
Здесь мы вычисляем соотношение между CI на основе CLT и CI на основе обследования.
# Comparison of CI size
show_values(
data.frame(Category=CI_clt[,"Category"],
ratio_DeltaCI_clt2survey=CI_clt[,"Delta"] / CI_survey[,"Delta"])
)
, что дает :
Category ratio_DeltaCI_clt2survey
1 A 1.067
2 B 0.982
3 C 1.013
4 D 0.994
Так как все крысы ios близки к 1, мы заключаем, что размеры CI от двух методов очень milar!
6) Убедитесь, что реализация для CI на основе CLT кажется правильной
Удобная проверка для расчета на основе CLT КИ должен выполнить расчет по случаю простой случайной выборки (SRS) и убедиться, что результаты совпадают с результатами, полученными при вычислении svymean()
по схеме SRS.
# CLT-based calculation
df_noweights = df
df_noweights$Weight = 1 # SRS: weights equal to 1
show_values( ci_clt(df_noweights, ci_level=0.95) )
, что дает:
Category p Delta 2.5 % 97.5 %
1 A 0.1 0.0263 0.0737 0.126
2 B 0.2 0.0351 0.1649 0.235
3 C 0.3 0.0402 0.2598 0.340
4 D 0.4 0.0429 0.3571 0.443
, который мы сравниваем с расчетом на основе опроса:
# Survey-based calculation
design <- svydesign(ids=~1, probs=NULL, data=df)
show_values( ci_survey(df, design) )
, который дает:
p Delta 2.5 % 97.5 %
CategoryA 0.1 0.0263 0.0737 0.126
CategoryB 0.2 0.0351 0.1649 0.235
CategoryC 0.3 0.0402 0.2598 0.340
CategoryD 0.4 0.0430 0.3570 0.443
Мы видим, что результаты совпадают, предполагая, что реализация на основе CLT расчет CI в общем случае с неравными весами является правильным.