Non-Parametric Statistics

Statistical inference often relies on parametric assumptions, specifically that the population from which the sample is drawn follows a known probability distribution, typically the normal distribution, characterized by a set of parameters (e.g., mean $\mu$ and variance $\sigma^2$). Non-parametric statistics, in contrast, provide procedures for inferring properties of populations that do not rely on restrictive assumptions regarding the underlying parameterized probability distributions.

These methods are essential when sample sizes are small, data are ordinal or nominal, or severe departures from normality are evident. While non-parametric tests are more robust to distributional violations, they generally possess less statistical power compared to their parametric counterparts when the parametric assumptions are actually met.

The Sign Test

The sign test is one of the simplest non-parametric tests, used to assess whether the median of a continuous distribution equals a hypothesized value $M_0$. It is a non-parametric alternative to the one-sample t-test.

Let $X_1, X_2, \dots, X_n$ be a random sample from a continuous distribution with median $M$. We wish to test the null hypothesis $H_0: M = M_0$.

The test statistic $S$ is defined as the number of sample observations strictly greater than $M_0$. Under $H_0$, each observation has probability 0.5 of exceeding $M_0$, assuming continuity. Thus, $S$ follows a binomial distribution, $S \sim \text{Binomial}(N, p = 0.5)$, where $N$ is the effective sample size after discarding any ties where $X_i = M_0$.

For large $N$ (typically $N > 20$), a normal approximation can be used: $Z = \frac{S - \frac{N}{2}}{\sqrt{\frac{N}{4}}} \sim \mathcal{N}(0, 1)$. A continuity correction of $0.5$ is often applied to $S$ for greater accuracy.
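The procedure above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not a reference implementation: the function name `sign_test`, the two-sided p-value convention (doubling the smaller binomial tail), and the sample data are assumptions made for the example.

```python
from math import comb, erf, sqrt


def sign_test(sample, m0):
    """Two-sided sign test of H0: median == m0 (illustrative sketch)."""
    # Discard observations equal to m0; n is the effective sample size N.
    kept = [x for x in sample if x != m0]
    n = len(kept)
    if n == 0:
        raise ValueError("every observation equals m0; the test is undefined")
    # S counts the observations strictly greater than m0.
    s = sum(1 for x in kept if x > m0)

    # Exact p-value from Binomial(n, 0.5): double the smaller tail, cap at 1.
    lower_tail = sum(comb(n, k) for k in range(s + 1)) / 2**n
    upper_tail = sum(comb(n, k) for k in range(s, n + 1)) / 2**n
    p_exact = min(1.0, 2 * min(lower_tail, upper_tail))

    # Normal approximation, moving S by 0.5 toward n/2 (continuity correction).
    z = max(abs(s - n / 2) - 0.5, 0.0) / sqrt(n / 4)
    p_normal = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    return s, n, p_exact, p_normal


# Hypothetical measurements; two of them tie with m0 = 3.0 and are dropped.
data = [1.2, 3.4, 2.8, 5.1, 4.4, 3.0, 6.2, 4.9, 3.0, 5.5]
s, n, p_exact, p_normal = sign_test(data, m0=3.0)
print(f"S = {s}, N = {n}, exact p = {p_exact:.4f}, approx p = {p_normal:.4f}")
```

With only eight effective observations, the exact binomial p-value is the one to trust; the normal approximation is computed here only to show how the two compare.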

Why might the sign test discard observations equal to the hypothesized median $M_0$?

Wilcoxon Signed-Rank Test

The sign test ignores the magnitude of the differences between the observations and the hypothesized median. The Wilcoxon signed-rank test incorporates this magnitude, at the cost of assuming that the underlying continuous distribution is symmetric about its median. It is typically more powerful than the sign test and serves as a non-parametric alternative to the paired Student’s t-test or the one-sample t-test.

Given pairs of observations $(X_i, Y_i)$ for $i = 1, \dots, n$, compute the differences $D_i = X_i - Y_i$.

Discard pairs where $D_i = 0$. Let $N$ be the reduced sample size.
Rank the absolute differences $|D_i|$ from smallest to largest. Ties are assigned the average of the ranks they would have received. Let $R_i$ be the rank of $|D_i|$.
Calculate the test statistic $W$, which is the sum of the signed ranks: $W = \sum_{i=1}^{N} \text{sgn}(D_i) R_i$. Alternatively, calculate the sum of ranks for positive differences ($T^+$) and negative differences ($T^-$). The test statistic is often defined as $T = \min(T^+, T^-)$.

Under $H_0$ (symmetric distribution about 0), the expected value and variance of $W$ are $\mathbb{E}[W] = 0$ and $\text{Var}(W) = \frac{N(N+1)(2N+1)}{6}$. For large $N$, $W$ is approximately normally distributed, permitting the use of a $Z$-test.
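The ranking steps and the large-sample $Z$ statistic can be sketched as follows. The helper `average_ranks`, the function name `wilcoxon_signed_rank`, and the paired data are hypothetical and chosen for illustration. The variance used is the untied formula given above, so the $Z$ value is only a rough guide when many absolute differences are tied or when $N$ is small.

```python
from math import erf, sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Signed-rank statistics and normal-approximation p-value (sketch)."""
    diffs = [a - b for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]  # discard zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all differences are zero; the test is undefined")
    ranks = average_ranks([abs(d) for d in diffs])
    t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = t_plus - t_minus  # sum of signed ranks
    z = w / sqrt(n * (n + 1) * (2 * n + 1) / 6)
    p_normal = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return w, t_plus, t_minus, z, p_normal


# Hypothetical before/after measurements on eight subjects.
before = [125, 130, 118, 140, 135, 128, 122, 131]
after = [120, 126, 119, 131, 135, 121, 124, 125]
w, t_plus, t_minus, z, p = wilcoxon_signed_rank(before, after)
print(f"W = {w}, T+ = {t_plus}, T- = {t_minus}, z = {z:.3f}, p = {p:.3f}")
```

Note that $T^+ + T^- = N(N+1)/2$, which provides a quick check on the ranking step.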

Mann-Whitney U Test (Wilcoxon Rank-Sum Test)

When comparing two independent samples to determine if they originate from the same population, the Mann-Whitney U test (or Wilcoxon rank-sum test) offers a non-parametric alternative to the independent two-sample t-test. It assumes the two distributions are identical in shape but potentially shifted in location.

Let $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$ be independent samples.

Combine all $m+n$ observations and rank them from $1$ to $m+n$.
Compute the sum of the ranks for sample 1 ($R_1$) and sample 2 ($R_2$).
The $U$ statistics are calculated as $U_1 = R_1 - \frac{m(m+1)}{2}$ and $U_2 = R_2 - \frac{n(n+1)}{2}$. Note that $U_1 + U_2 = mn$. The test statistic is $U = \min(U_1, U_2)$.

Under the null hypothesis that $X$ and $Y$ have the same distribution, the expectation and variance of $U$ (taken as either $U_1$ or $U_2$) are $\mathbb{E}[U] = \frac{mn}{2}$ and $\text{Var}(U) = \frac{mn(m+n+1)}{12}$. Ties in the data require an adjustment to the variance formula: $\text{Var}(U) = \frac{mn}{12} \left( (m+n+1) - \sum_{i=1}^k \frac{t_i^3 - t_i}{(m+n)(m+n-1)} \right)$, where $k$ is the number of tied groups and $t_i$ is the number of observations in the $i$-th tied group.
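A minimal sketch of these calculations, including the tie-corrected variance, is given below. The function names and the two samples are assumptions made for the example, and the $Z$ statistic is computed from $U_1$ without a continuity correction.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistics, tie-corrected variance, and z computed from U1."""
    m, n = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    r1, r2 = sum(ranks[:m]), sum(ranks[m:])  # rank sums of each sample
    u1 = r1 - m * (m + 1) / 2
    u2 = r2 - n * (n + 1) / 2
    total = m + n
    # Each group of t tied observations contributes t^3 - t (zero when t = 1).
    tie_sum = sum(t**3 - t for t in Counter(pooled).values())
    var_u = m * n / 12 * ((total + 1) - tie_sum / (total * (total - 1)))
    z = (u1 - m * n / 2) / sqrt(var_u)
    return u1, u2, min(u1, u2), var_u, z


# Hypothetical samples; the value 15 appears in both, creating one tie.
x = [12, 15, 11, 18, 14]
y = [16, 19, 15, 21, 17, 20]
u1, u2, u, var_u, z = mann_whitney_u(x, y)
print(f"U1 = {u1}, U2 = {u2}, U = {u}, Var(U) = {var_u:.3f}, z = {z:.3f}")
```

For samples this small, exact tables or a permutation distribution would normally be preferred over the normal approximation.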

What condition reduces the power of the Mann-Whitney U test relative to an independent two-sample t-test?

Kruskal-Wallis One-Way Analysis of Variance

The Kruskal-Wallis H test extends the Mann-Whitney U test to more than two independent groups. It is the non-parametric equivalent of the one-way ANOVA, testing whether $k$ independent samples originate from the same distribution.

Given $k$ groups with sample sizes $n_1, n_2, \dots, n_k$ and total observations $N = \sum_{i=1}^k n_i$:

Rank all $N$ observations jointly from $1$ to $N$.
Compute the sum of ranks $R_i$ for each group $i$.
The test statistic $H$ is: $H = \frac{12}{N(N+1)} \sum_{i=1}^k \frac{R_i^2}{n_i} - 3(N+1)$

If the null hypothesis is true (all samples come from the same population) and the sample sizes are sufficiently large (typically $n_i \geq 5$), $H$ approximately follows a chi-square distribution with $k-1$ degrees of freedom: $H \sim \chi^2_{k-1}$. If the null hypothesis is rejected, post-hoc procedures such as Dunn’s test are used for pairwise comparisons to identify which groups differ.
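The computation of $H$ can be sketched as below. The function names and the three groups are hypothetical, and no tie correction is applied to $H$. The p-value line relies on the fact that a chi-square variable with 2 degrees of freedom has survival function $e^{-H/2}$, so it applies only to the three-group case shown.

```python
from math import exp


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic without tie correction (sketch)."""
    pooled = [value for group in groups for value in group]
    ranks = average_ranks(pooled)  # joint ranking of all N observations
    n_total = len(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])  # R_i for this group
        weighted += rank_sum**2 / len(group)
        start += len(group)
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Hypothetical scores for three independent groups of five.
groups = [
    [27, 31, 29, 35, 30],
    [38, 41, 36, 44, 40],
    [33, 37, 32, 39, 34],
]
h = kruskal_wallis_h(groups)
# With k = 3 groups there are 2 degrees of freedom, so P(chi2 > h) = exp(-h/2).
p_value = exp(-h / 2)
print(f"H = {h:.2f}, approximate p = {p_value:.4f}")
```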

Spearman’s Rank Correlation Coefficient

Spearman’s rank correlation coefficient ($\rho$ or $r_s$) measures the strength and direction of association between two continuous or ordinal variables without assuming linearity. It evaluates the monotonic relationship between two variables, in contrast to Pearson’s correlation, which evaluates linear relationships.

For $n$ pairs of observations $(X_i, Y_i)$, convert the raw scores to ranks $R(X_i)$ and $R(Y_i)$. Spearman’s $\rho$ is Pearson’s correlation coefficient applied to the ranks. When all ranks are distinct, it simplifies to $\rho = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2 - 1)}$, where $d_i = R(X_i) - R(Y_i)$ is the difference between the ranks of corresponding variables.

If there are identical values (ties), the simplified formula utilizing $d_i^2$ becomes inaccurate, and the standard Pearson correlation formula must be applied directly to the ranked variables.

$\rho = \frac{\sum_i (R(X_i) - \bar{R}(X))(R(Y_i) - \bar{R}(Y))}{\sqrt{\sum_i (R(X_i) - \bar{R}(X))^2 \sum_i (R(Y_i) - \bar{R}(Y))^2}}$

Values of $\rho$ vary from $-1$ to $+1$, indicating perfect negative or positive monotonic associations, respectively.
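The two formulas can be compared directly in code. The sketch below (function names and data are assumptions for illustration) computes $\rho$ both ways: the results agree when there are no ties and diverge slightly when ties are present.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation applied to the ranks (valid with or without ties)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_rx, mean_ry = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_rx) * (b - mean_ry) for a, b in zip(rx, ry))
    ss_x = sum((a - mean_rx) ** 2 for a in rx)
    ss_y = sum((b - mean_ry) ** 2 for b in ry)
    return cov / sqrt(ss_x * ss_y)


def spearman_rho_shortcut(x, y):
    """Simplified d_i^2 formula; exact only when there are no ties."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


# No ties: both formulas give the same value.
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]),
      spearman_rho_shortcut([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
# One tie in x: the shortcut no longer matches the rank-based Pearson formula.
print(spearman_rho([1, 2, 2, 4], [1, 3, 2, 4]),
      spearman_rho_shortcut([1, 2, 2, 4], [1, 3, 2, 4]))
```

A strictly increasing but nonlinear relationship, such as $y = x^3$, yields $\rho = 1$ because only the ordering of the values matters.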

Bootstrap and Resampling Methods

Modern computational power enables simulation-based non-parametric approaches, most notably bootstrapping. Introduced by Bradley Efron, bootstrapping relies on random sampling with replacement from the original dataset.

If we possess a sample $X = \{x_1, \dots, x_n\}$ drawn from an unknown distribution $F$, we construct an empirical distribution function $\hat{F}$. By drawing repeated samples of size $n$, with replacement, from $X$ (equivalently, sampling from $\hat{F}$), we generate $B$ bootstrap samples $X^{*1}, X^{*2}, \dots, X^{*B}$.

For a sample statistic $\hat{\theta} = s(X)$ estimating a parameter $\theta$, we compute the statistic for each bootstrap sample: $\hat{\theta}^{*b} = s(X^{*b})$. The distribution of $\hat{\theta}^{*b}$ approximates the sampling distribution of $\hat{\theta}$, enabling the construction of confidence intervals and hypothesis tests without assuming a parametric form.

The bootstrap standard error is the standard deviation of the bootstrap replicates: $\widehat{\text{SE}}(\hat{\theta}) = \sqrt{\frac{1}{B-1} \sum_{b=1}^B \left( \hat{\theta}^{*b} - \bar{\hat{\theta}}^* \right)^2 }$, where $\bar{\hat{\theta}}^*$ is the mean of the bootstrap estimates. Resampling procedures reduce reliance on asymptotic normality assumptions, providing inferences that are particularly useful for complex estimators or when small sample sizes limit conventional asymptotic theory.
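A minimal sketch of the bootstrap standard error follows, using only Python's standard library. The function name `bootstrap_se`, the default of 2000 replicates, the fixed seed, and the sample values are assumptions chosen for the example.

```python
import random
from statistics import mean, median, stdev


def bootstrap_se(sample, statistic, n_boot=2000, seed=0):
    """Bootstrap standard error of `statistic`, plus the replicates."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(sample)
    replicates = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the original sample.
        resample = rng.choices(sample, k=n)
        replicates.append(statistic(resample))
    # statistics.stdev divides by B - 1, matching the formula above.
    return stdev(replicates), replicates


# Hypothetical sample of ten measurements.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9]
se_mean, _ = bootstrap_se(sample, mean)
se_median, _ = bootstrap_se(sample, median)
print(f"bootstrap SE of the mean:   {se_mean:.3f}")
print(f"bootstrap SE of the median: {se_median:.3f}")
```

The same function works for any statistic that maps a list of observations to a number, which is what makes the approach attractive for estimators such as the median that lack a simple standard-error formula.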

Kernel Density Estimation

Kernel density estimation (KDE) is a non-parametric method for estimating the probability density function of a continuous random variable. Whereas parametric estimation fits a predetermined family (e.g., normal, gamma) indexed by a small number of parameters, KDE estimates the density directly from the data.

Let $(x_1, x_2, \dots, x_n)$ be independent and identically distributed samples drawn from some distribution with an unknown density $f$. The kernel density estimator is $\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)$, where $K$ is the kernel (a non-negative function integrating to one) and $h > 0$ is a smoothing parameter known as the bandwidth. The bandwidth heavily influences the estimator. Small $h$ induces undersmoothing, yielding high variance (spurious fluctuations), whereas large $h$ induces oversmoothing, yielding high bias (obscuring structural features of the distribution). Standard choices for $K$ include the Gaussian, Epanechnikov, and uniform kernels.
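The estimator and the effect of the bandwidth can be illustrated with a short sketch using a Gaussian kernel. The function names, the two-cluster data, and the bandwidth values are assumptions for the example; no bandwidth selection rule is implemented.

```python
from math import exp, pi, sqrt


def gaussian_kernel(u):
    """Standard normal density, a common choice for the kernel K."""
    return exp(-0.5 * u * u) / sqrt(2 * pi)


def kde_estimate(x, data, h):
    """Evaluate the kernel density estimate at x with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)


# Hypothetical data with two clusters, around 1.2 and around 5.0.
data = [1.0, 1.2, 1.3, 4.8, 5.0, 5.1, 5.3]

for h in (0.5, 3.0):
    at_cluster = kde_estimate(1.2, data, h)
    at_gap = kde_estimate(3.0, data, h)
    print(f"h = {h}: density at 1.2 = {at_cluster:.4f}, at 3.0 = {at_gap:.4f}")
```

With the smaller bandwidth the estimate dips sharply in the gap between the clusters, while with the larger bandwidth the gap is smoothed over and the estimate is higher at 3.0 than at 1.2, hiding the bimodal structure.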

KDE vs. Histogram

Histograms and KDEs both attempt to model data density non-parametrically. Consider a dataset of highly clustered continuous physical measurements. A histogram forces boundaries at arbitrary bin edges. A KDE smooths out data without fixed bins.
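The sensitivity of a histogram to its bin edges can be seen in a small sketch. The function name `histogram_counts`, the bin width, the two origins, and the data are hypothetical choices made for illustration.

```python
from collections import Counter
from math import floor


def histogram_counts(data, width, origin):
    """Map each bin's left edge to the number of observations in that bin."""
    # Observation x falls in bin k when origin + k*width <= x < origin + (k+1)*width.
    bins = Counter(floor((x - origin) / width) for x in data)
    return {origin + k * width: count for k, count in sorted(bins.items())}


# Hypothetical tightly clustered measurements near 2.0.
data = [1.9, 2.0, 2.1, 2.1, 2.2]

print(histogram_counts(data, width=1.0, origin=0.0))
print(histogram_counts(data, width=1.0, origin=0.5))
```

With the origin at 0.0 the cluster is split across two bins at the edge 2.0, whereas shifting the origin to 0.5 places all five observations in one bin. The data are unchanged; only the arbitrary bin placement differs.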

Why is standard continuous kernel density estimation often superior to a histogram for continuous distributions?