Title: | Statistical Inference and Sure Independence Screening via Ball Statistics |
---|---|
Description: | Hypothesis tests and sure independence screening (SIS) procedures based on ball statistics, including ball divergence <doi:10.1214/17-AOS1579>, ball covariance <doi:10.1080/01621459.2018.1543600>, and ball correlation <doi:10.1080/01621459.2018.1462709>, are developed to analyze complex data in metric spaces, e.g., shape, directional, compositional, and symmetric positive definite matrix data. The distribution-free tests based on ball divergence and ball covariance are implemented to detect distribution differences and association in metric spaces <doi:10.18637/jss.v097.i06>. Furthermore, several generic non-parametric feature selection procedures based on ball correlation, BCor-SIS and all of its variants, are implemented to tackle the challenges of ultrahigh-dimensional data. A fast implementation for large-scale multiple K-sample testing with ball divergence <doi:10.1002/gepi.22423> is supported, which is particularly helpful for genome-wide association studies. |
Authors: | Jin Zhu [aut, cre] |
Maintainer: | Jin Zhu <[email protected]> |
License: | GPL-3 |
Version: | 1.3.13 |
Built: | 2025-02-16 03:46:11 UTC |
Source: | https://github.com/cran/Ball |
Sand, silt and clay compositions of 39 sediment samples of different water depth in an Arctic lake.
ArcticLake$depth: water depth (in meters).
ArcticLake$x: compositions of three covariates: sand, silt, and clay.
Sand, silt and clay compositions of 39 sediment samples at different water depths (in meters) in an Arctic lake. The additional feature is a concomitant variable or covariate, water depth, which may account for some of the variation in the compositions. In statistical terminology, we have a multivariate regression problem with sediment composition as predictors and water depth as a response. All row percentages sum to 100, except for rounding errors.
Courtesy of J. Aitchison
Aitchison: CODA microcomputer statistical package, 1986, the file name ARCTIC.DAT, here included under the GNU Public Library Licence Version 2 or newer.
Aitchison: The Statistical Analysis of Compositional Data, 1986, Data 5, pp5.
Computes Ball Covariance and Ball Correlation statistics, which are generic dependence measures in Banach spaces.
bcor(x, y, distance = FALSE, weight = FALSE)
bcov(x, y, distance = FALSE, weight = FALSE)
x | a numeric vector, matrix, data.frame, or a list containing at least two numeric vectors, matrices, or data.frames. |
y | a numeric vector, matrix, or data.frame. |
distance | if |
weight | a logical or character string used to choose the weight form of the Ball Covariance statistic. If input is a character string, it must be one of |
The sample sizes of the two variables must agree, and samples must not contain missing or infinite values.
If we set distance = TRUE, arguments x, y can be a dist object or a symmetric numeric matrix recording distance between samples; otherwise, these arguments are treated as data.
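As a minimal sketch of the two accepted distance inputs, the base-R snippet below builds a dist object and the equivalent symmetric matrix from a small data set. The data values are made up for illustration, and the commented-out calls only mirror the usage shown above; they assume the Ball package is attached.

```r
# Toy data: 5 observations in R^2 and 5 scalar responses (illustrative only).
x <- matrix(c(0, 0,
              1, 0,
              0, 1,
              1, 1,
              2, 2), ncol = 2, byrow = TRUE)
y <- c(0.1, 0.9, 1.1, 2.0, 4.2)

# Form 1: a "dist" object, as returned by stats::dist().
dx <- dist(x)
dy <- dist(y)

# Form 2: the same information as a symmetric numeric matrix.
dx_mat <- as.matrix(dx)
dy_mat <- as.matrix(dy)

# Sanity checks: one row per observation, symmetric, zero diagonal.
stopifnot(nrow(dx_mat) == nrow(x),
          isSymmetric(dx_mat),
          all(diag(dx_mat) == 0))

# With the Ball package attached, either form could then be passed as:
# bcov(dx, dy, distance = TRUE)
# bcov(dx_mat, dy_mat, distance = TRUE)
```

Precomputing distances this way is what allows non-Euclidean metrics to be plugged in: any function that returns a valid distance matrix can replace dist().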
bcov and bcor compute Ball Covariance and Ball Correlation statistics.
The Ball Covariance statistic is a generic dependence measure in Banach spaces. It enjoys the following properties:
It is nonnegative, and it equals zero if and only if the variables are unassociated;
It is highly robust;
It is distribution-free and model-free;
Interestingly, the HHG statistic is a special case of the Ball Covariance statistic.
The Ball Correlation statistic, a normalized version of the Ball Covariance statistic, generalizes Pearson correlation in two fundamental ways:
It is well-defined for random variables of arbitrary dimension in Banach spaces;
A BCor of zero implies that the random variables are unassociated.
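The definitions of the Ball Covariance and Ball Correlation statistics between two random variables are as follows.
Suppose we are given pairs of independent observations $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ and $y_i$ can be of any dimension and the dimensionality of $x_i$ and $y_i$ need not be the same.
Then, we define the sample version of Ball Covariance as:
$$\mathbf{BCov}_{n}^{2}(X, Y) = \frac{1}{n^{2}} \sum_{i,j=1}^{n} \left( \Delta_{ij,n}^{X,Y} - \Delta_{ij,n}^{X} \Delta_{ij,n}^{Y} \right)^{2}$$
where:
$$\Delta_{ij,n}^{X,Y} = \frac{1}{n} \sum_{k=1}^{n} \delta_{ij,k}^{X} \delta_{ij,k}^{Y}, \quad \Delta_{ij,n}^{X} = \frac{1}{n} \sum_{k=1}^{n} \delta_{ij,k}^{X}, \quad \Delta_{ij,n}^{Y} = \frac{1}{n} \sum_{k=1}^{n} \delta_{ij,k}^{Y},$$
$$\delta_{ij,k}^{X} = I\left(x_k \in \bar{B}(x_i, \rho(x_i, x_j))\right), \quad \delta_{ij,k}^{Y} = I\left(y_k \in \bar{B}(y_i, \rho(y_i, y_j))\right).$$
Among them, $\bar{B}(x_i, \rho(x_i, x_j))$ is a closed ball with center $x_i$ and radius $\rho(x_i, x_j)$.
Similarly, we can define $\mathbf{BCov}_{n}^{2}(X, X)$ and $\mathbf{BCov}_{n}^{2}(Y, Y)$.
We define the Ball Correlation statistic as follows.
$$\mathbf{BCor}_{n}^{2}(X, Y) = \frac{\mathbf{BCov}_{n}^{2}(X, Y)}{\sqrt{\mathbf{BCov}_{n}^{2}(X, X) \, \mathbf{BCov}_{n}^{2}(Y, Y)}}$$
We can extend $\mathbf{BCov}_{n}$ to measure the mutual independence between $K$ random variables:
$$\frac{1}{n^{2}} \sum_{i,j=1}^{n} \left( \Delta_{ij,n}^{X_1, \ldots, X_K} - \prod_{k=1}^{K} \Delta_{ij,n}^{X_k} \right)^{2}$$
where $X_k$, $k = 1, \ldots, K$, are random variables and $X_{ki}$ is the $i$-th observation of $X_k$.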
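To make the definition concrete, here is a naive base-R sketch that evaluates the constant-weight sample statistic directly from pairwise distances: for every pair (i, j) it compares the joint proportion of observations falling in both closed balls with the product of the two marginal proportions. The names ball_cov2_naive and ball_cor2_naive are hypothetical, the Euclidean metric is assumed, and the cubic-time loop is written for readability only. It is not the package's implementation, and its output scale may differ from what bcov and bcor report.

```r
# Squared sample Ball Covariance (constant weight) from two distance matrices.
ball_cov2_naive <- function(dx, dy) {
  n <- nrow(dx)
  total <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # Indicator that observation k lies in the closed ball centred at
      # observation i with radius d(i, j), for k = 1, ..., n.
      in_x <- dx[i, ] <= dx[i, j]
      in_y <- dy[i, ] <= dy[i, j]
      joint <- mean(in_x & in_y)
      total <- total + (joint - mean(in_x) * mean(in_y))^2
    }
  }
  total / n^2
}

# Normalized version: covariance of (X, Y) divided by the geometric mean of
# the self-covariances of X and Y.
ball_cor2_naive <- function(x, y) {
  dx <- as.matrix(dist(x))
  dy <- as.matrix(dist(y))
  ball_cov2_naive(dx, dy) /
    sqrt(ball_cov2_naive(dx, dx) * ball_cov2_naive(dy, dy))
}

x <- c(0.3, 1.2, 2.5, 3.1, 4.8, 5.0)
ball_cor2_naive(x, x)        # a variable against itself
ball_cor2_naive(x, 2 * x)    # ball membership is unchanged by rescaling
```

Because the statistic depends on the data only through which points fall inside which balls, any transformation that preserves the ordering of pairwise distances leaves it unchanged.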
See bcov.test for a test of independence based on the Ball Covariance statistic.
bcor | Ball Correlation statistic. |
bcov | Ball Covariance statistic. |
Wenliang Pan, Xueqin Wang, Heping Zhang, Hongtu Zhu & Jin Zhu (2019) Ball Covariance: A Generic Measure of Dependence in Banach Space, Journal of the American Statistical Association, DOI: 10.1080/01621459.2018.1543600
Wenliang Pan, Xueqin Wang, Weinan Xiao & Hongtu Zhu (2018) A Generic Sure Independence Screening Procedure, Journal of the American Statistical Association, DOI: 10.1080/01621459.2018.1462709
Jin Zhu, Wenliang Pan, Wei Zheng, and Xueqin Wang (2021). Ball: An R Package for Detecting Distribution Difference and Association in Metric Spaces, Journal of Statistical Software, Vol.97(6), doi: 10.18637/jss.v097.i06.
############# Ball Correlation #############
num <- 50
x <- 1:num
y <- 1:num
bcor(x, y)
bcor(x, y, weight = "prob")
bcor(x, y, weight = "chisq")
############# Ball Covariance #############
num <- 50
x <- rnorm(num)
y <- rnorm(num)
bcov(x, y)
bcov(x, y, weight = "prob")
bcov(x, y, weight = "chisq")
Generic non-parametric sure independence screening (SIS) procedure based on Ball Correlation. Ball correlation is a generic measure of dependence in Banach spaces.
bcorsis(
  x,
  y,
  d = "small",
  weight = c("constant", "probability", "chisquare"),
  method = "standard",
  distance = FALSE,
  category = FALSE,
  parms = list(d1 = 5, d2 = 5, df = 3),
  num.threads = 0
)
x | a numeric matrix or data.frame included |
y | a numeric vector, matrix, or data.frame. |
d | the hard cutoff rule suggests selecting |
weight | a logical or character string used to choose the weight form of the Ball Covariance statistic. If input is a character string, it must be one of |
method | specific method for the BCor-SIS procedure. It must be one of |
distance | if |
category | a logical value or integer vector indicating columns to be selected as categorical variables. If |
parms | parameters list only available when |
num.threads | number of threads. If |
bcorsis performs a model-free generic sure independence screening procedure, BCor-SIS, to pick out variables from x which are potentially associated with y.
BCor-SIS relies on Ball correlation, a universal dependence measure in Banach spaces.
Ball correlation (BCor) ranges from 0 to 1. A larger BCor implies that the variables are more likely to be associated, while a BCor of 0 implies that they are unassociated. (See bcor for details.)
Consequently, BCor-SIS picks out the variables with the largest BCor values with y.
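The ranking idea can be sketched in a few lines of base R: score every column of x by its Ball correlation with y and keep the d highest-scoring columns. The helper names bcov2 and screen_by_bcor are hypothetical, the statistic is a naive constant-weight evaluation under the Euclidean metric, and d is taken to be a plain integer here. This illustrates the screening principle only and should not be read as how bcorsis is implemented.

```r
# Naive squared Ball Covariance (constant weight) from distance matrices.
bcov2 <- function(dx, dy) {
  n <- nrow(dx)
  total <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      in_x <- dx[i, ] <= dx[i, j]
      in_y <- dy[i, ] <= dy[i, j]
      total <- total + (mean(in_x & in_y) - mean(in_x) * mean(in_y))^2
    }
  }
  total / n^2
}

# Score every column of x against y and keep the d highest-scoring columns.
screen_by_bcor <- function(x, y, d) {
  dy <- as.matrix(dist(y))
  self_y <- bcov2(dy, dy)
  score <- apply(x, 2, function(column) {
    dx <- as.matrix(dist(column))
    bcov2(dx, dy) / sqrt(bcov2(dx, dx) * self_y)
  })
  list(ix = order(score, decreasing = TRUE)[seq_len(d)], score = score)
}

set.seed(1)
n <- 50
p <- 8
x <- matrix(rnorm(n * p), nrow = n)
y <- x[, 3]^2            # y depends on column 3 only, and nonlinearly
res <- screen_by_bcor(x, y, d = 2)
res$ix                   # indices of the retained columns
```

The simulated response is a noise-free quadratic function of one column, a relationship that a Pearson-correlation ranking would tend to miss because the linear correlation is close to zero.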
Theory and numerical results indicate that BCor-SIS has the following advantages:
BCor-SIS can retain the relevant variables even when the dimensionality (i.e., ncol(x)) is of an exponential order of the sample size (i.e., exp(nrow(x)));
It is distribution-free and model-free;
It is very robust;
It works well for complex data, such as shape and survival data.
If x is a matrix, the sample sizes of x and y must agree.
If x is a list object, each element of this list must have the same sample size.
x and y must not contain missing or infinite values.
When method = "survival", the matrix or data.frame passed to y must have exactly two columns, where the first column is the event (failure) time and the second column is a dichotomous censoring status.
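A small sketch of how such a two-column response might be assembled in base R follows. The follow-up values are invented, and the coding of the status column (1 for an observed event, 0 for censoring) is an assumption made for illustration; the text above only requires that the column be dichotomous.

```r
# Hypothetical follow-up data for six subjects.
time   <- c(12.5, 3.0, 8.2, 20.1, 5.7, 14.0)  # event or censoring time
status <- c(1, 0, 1, 1, 0, 1)                 # assumed coding: 1 = event, 0 = censored

# Two-column response: time first, status second.
y_surv <- cbind(time = time, status = status)

# Basic checks mirroring the stated requirements.
stopifnot(ncol(y_surv) == 2,
          all(y_surv[, "status"] %in% c(0, 1)),
          !anyNA(y_surv),
          all(is.finite(y_surv)))

# With the Ball package attached and a covariate matrix x with six rows:
# bcorsis(x = x, y = y_surv, method = "survival")
```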
ix | the indices vector corresponding to variables selected by BCor-SIS. |
method | the method used. |
weight | the weight used. |
complete.info | a |
bcorsis simultaneously computes Ball Correlation statistics with "constant", "probability", and "chisquare" weights.
Users can get the Ball Correlation statistics for the other weights from the complete.info element of the output.
We give a quick example below to illustrate.
Wenliang Pan, Weinan Xiao, Xueqin Wang, Hongtu Zhu, Jin Zhu
Wenliang Pan, Xueqin Wang, Weinan Xiao & Hongtu Zhu (2018) A Generic Sure Independence Screening Procedure, Journal of the American Statistical Association, DOI: 10.1080/01621459.2018.1462709
## Not run: 
############### Quick Start for bcorsis function ###############
set.seed(1)
n <- 150
p <- 3000
x <- matrix(rnorm(n * p), nrow = n)
eps <- rnorm(n)
y <- 3 * x[, 1] + 5 * (x[, 3])^2 + eps
res <- bcorsis(y = y, x = x)
head(res[["ix"]])
head(res[["complete.info"]][["statistic"]])

############### BCor-SIS: Censored Data Example ###############
data("genlung")
result <- bcorsis(x = genlung[["covariate"]], y = genlung[["survival"]],
                  method = "survival")
index <- result[["ix"]]
top_gene <- colnames(genlung[["covariate"]])[index]
head(top_gene, n = 1)

############### BCor-SIS: Interaction Pursuing ###############
set.seed(1)
n <- 150
p <- 3000
x <- matrix(rnorm(n * p), nrow = n)
eps <- rnorm(n)
y <- 3 * x[, 1] * x[, 5] * x[, 10] + eps
res <- bcorsis(y = y, x = x, method = "interaction")
head(res[["ix"]])

############### BCor-SIS: Iterative Method ###############
library(mvtnorm)
set.seed(1)
n <- 150
p <- 3000
sigma_mat <- matrix(0.5, nrow = p, ncol = p)
diag(sigma_mat) <- 1
x <- rmvnorm(n = n, sigma = sigma_mat)
eps <- rnorm(n)
rm(sigma_mat); gc(reset = TRUE)
y <- 3 * (x[, 1])^2 + 5 * (x[, 2])^2 + 5 * x[, 8] - 8 * x[, 16] + eps
res <- bcorsis(y = y, x = x, method = "lm", d = 15)
res <- bcorsis(y = y, x = x, method = "gam", d = 15)
res[["ix"]]

############### Weighted BCor-SIS: Probability weight ###############
set.seed(1)
n <- 150
p <- 3000
x <- matrix(rnorm(n * p), nrow = n)
eps <- rnorm(n)
y <- 3 * x[, 1] + 5 * (x[, 3])^2 + eps
res <- bcorsis(y = y, x = x, weight = "prob")
head(res[["ix"]])
# Alternative, chisq weight:
res <- bcorsis(y = y, x = x, weight = "chisq")
head(res[["ix"]])

############### BCor-SIS: GWAS data ###############
set.seed(1)
n <- 150
p <- 3000
x <- sapply(1:p, function(i) {
  sample(0:2, size = n, replace = TRUE)
})
eps <- rnorm(n)
y <- 6 * x[, 1] - 7 * x[, 2] + 5 * x[, 3] + eps
res <- bcorsis(x = x, y = y, category = TRUE)
head(res[["ix"]])
head(res[["complete.info"]][["statistic"]])

x <- cbind(matrix(rnorm(n * 2), ncol = 2), x)
# remove the first two columns:
res <- bcorsis(x = x, y = y, category = c(-1, -2))
head(res[["ix"]])

x <- cbind(x[, 3:5], matrix(rnorm(n * p), ncol = p))
res <- bcorsis(x = x, y = y, category = 1:3)
head(res[["ix"]], n = 10)
## End(Not run)
Ball Covariance test of independence. Ball covariance is a generic dependence measure in Banach spaces.
bcov.test(x, ...)
## Default S3 method:
bcov.test(
  x,
  y = NULL,
  num.permutations = 99,
  method = c("permutation", "limit"),
  distance = FALSE,
  weight = FALSE,
  seed = 1,
  num.threads = 0,
  ...
)
## S3 method for class 'formula'
bcov.test(formula, data, subset, na.action, ...)
x | a numeric vector, matrix, data.frame, or a list containing at least two numeric vectors, matrices, or data.frames. |
... | further arguments to be passed to or from methods. |
y | a numeric vector, matrix, or data.frame. |
num.permutations | the number of permutation replications. When |
method | if |
distance | if |
weight | a logical or character string used to choose the weight form of the Ball Covariance statistic. If input is a character string, it must be one of |
seed | the random seed. Default |
num.threads | number of threads. If |
formula | a formula of the form |
data | an optional matrix or data frame (or similar: see |
subset | an optional vector specifying a subset of observations to be used. |
na.action | a function which indicates what should happen when the data contain |
bcov.test is a nonparametric test of independence in Banach spaces.
It can detect the dependence between two random objects (variables) and the mutual dependence among at least three random objects (variables).
If two samples are passed to arguments x and y, the sample sizes (i.e., number of rows or length of the vector) of the two variables must agree. If a list object is passed to x, this list must contain at least two numeric vectors, matrices, or data.frames, and each element of this list must have the same sample size. Moreover, data passed to x or y must not contain missing or infinite values.
If distance = TRUE, x is considered as a distance matrix or a list containing distance matrices, and y is considered as a distance matrix; otherwise, these arguments are treated as data.
bcov.test utilizes the Ball Covariance statistic (see bcov) to measure dependence and derives a p-value by replicating the random permutation num.permutations times.
See Pan et al. (2019) for theoretical properties of the test, including statistical consistency.
If num.permutations > 0, bcov.test returns an htest class object containing the following components:
statistic | Ball Covariance statistic. |
p.value | the p-value for the test. |
replicates | permutation replications of the test statistic. |
size | sample size. |
complete.info | a |
alternative | a character string describing the alternative hypothesis. |
method | a character string indicating what type of test was performed. |
data.name | description of data. |
If num.permutations = 0, bcov.test returns a statistic value.
bcov.test simultaneously computes Ball Covariance statistics with "constant", "probability", and "chisquare" weights.
Users can get the Ball Covariance statistics for the other weights and their corresponding p-values from the complete.info element of the output. We give a quick example below to illustrate.
Wenliang Pan, Xueqin Wang, Heping Zhang, Hongtu Zhu, Jin Zhu
Wenliang Pan, Xueqin Wang, Heping Zhang, Hongtu Zhu & Jin Zhu (2019) Ball Covariance: A Generic Measure of Dependence in Banach Space, Journal of the American Statistical Association, DOI: 10.1080/01621459.2018.1543600
Jin Zhu, Wenliang Pan, Wei Zheng, and Xueqin Wang (2021). Ball: An R Package for Detecting Distribution Difference and Association in Metric Spaces, Journal of Statistical Software, Vol.97(6), doi: 10.18637/jss.v097.i06.
set.seed(1)
################# Quick Start #################
noise <- runif(50, min = -0.3, max = 0.3)
x <- runif(50, 0, 4*pi)
y <- cos(x) + noise
# plot(x, y)
res <- bcov.test(x, y)
res
## get all Ball Covariance statistics:
res[["complete.info"]][["statistic"]]
## get all test result:
res[["complete.info"]][["p.value"]]

################# Quick Start #################
x <- matrix(runif(50 * 2, -pi, pi), nrow = 50, ncol = 2)
noise <- runif(50, min = -0.1, max = 0.1)
y <- sin(x[,1] + x[,2]) + noise
bcov.test(x = x, y = y, weight = "prob")

################# Ball Covariance Test for Non-Hilbert Data #################
# load data:
data("ArcticLake")
# Distance matrix between y:
Dy <- nhdist(ArcticLake[["x"]], method = "compositional")
# Distance matrix between x:
Dx <- dist(ArcticLake[["depth"]])
# hypothesis test with BCov:
bcov.test(x = Dx, y = Dy, distance = TRUE)

################ Weighted Ball Covariance Test #################
data("ArcticLake")
Dy <- nhdist(ArcticLake[["x"]], method = "compositional")
Dx <- dist(ArcticLake[["depth"]])
# hypothesis test with weighted BCov:
bcov.test(x = Dx, y = Dy, distance = TRUE, weight = "prob")

################# Mutual Independence Test #################
x <- rnorm(50)
y <- (x > 0) * x + rnorm(50)
z <- (x <= 0) * x + rnorm(50)
data_list <- list(x, y, z)
bcov.test(data_list)
data_list <- lapply(data_list, function(x) {
  as.matrix(dist(x))
})
bcov.test(data_list, distance = TRUE)
bcov.test(data_list, distance = FALSE, weight = "chi")

################# Mutual Independence Test for Meteorology data #################
data("meteorology")
bcov.test(meteorology)

################ Testing via approximate limit distribution #################
## Not run: 
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rnorm(n)
bcov.test(x, y, method = "limit")
bcov.test(x, y)
## End(Not run)

################ Formula interface ################
## independence test:
bcov.test(~ CONT + INTG, data = USJudgeRatings)
## independence test with chisquare weight:
bcov.test(~ CONT + INTG, data = USJudgeRatings, weight = "chi")
## mutual independence test:
bcov.test(~ CONT + INTG + DMNR, data = USJudgeRatings)
Computes the Ball Divergence statistic, a generic dispersion measure in Banach spaces.
bd(
  x,
  y = NULL,
  distance = FALSE,
  size = NULL,
  num.threads = 1,
  kbd.type = c("sum", "maxsum", "max")
)
x | a numeric vector, matrix, data.frame, or a list containing at least two numeric vectors, matrices, or data.frames. |
y | a numeric vector, matrix, or data.frame. |
distance | if |
size | a vector recording sample size of each group. |
num.threads | number of threads. If |
kbd.type | a character string specifying the |
Given samples that do not contain missing values, bd returns the Ball Divergence statistic.
If we set distance = TRUE, arguments x, y can be a dist object or a symmetric numeric matrix recording distance between samples; otherwise, these arguments are treated as data.
The Ball Divergence statistic measures the distribution difference of two datasets in Banach spaces. The Ball Divergence statistic is proven to be zero if and only if the two datasets are identical.
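The definition of the Ball Divergence statistic is as follows.
Suppose we are given two independent samples, $\{x_1, \ldots, x_n\}$ with the associated probability measure $\mu$ and $\{y_1, \ldots, y_m\}$ with $\nu$, where the observations in each sample are i.i.d.
Let $\delta(x, y, z) = I\left(z \in \bar{B}(x, \rho(x, y))\right)$, where $\delta(x, y, z)$ indicates whether $z$ is located in the closed ball $\bar{B}(x, \rho(x, y))$ with center $x$ and radius $\rho(x, y)$.
We denote:
$$A_{ij}^{X} = \frac{1}{n} \sum_{u=1}^{n} \delta(x_i, x_j, x_u), \quad A_{ij}^{Y} = \frac{1}{m} \sum_{v=1}^{m} \delta(x_i, x_j, y_v),$$
$$C_{kl}^{X} = \frac{1}{n} \sum_{u=1}^{n} \delta(y_k, y_l, x_u), \quad C_{kl}^{Y} = \frac{1}{m} \sum_{v=1}^{m} \delta(y_k, y_l, y_v).$$
$A_{ij}^{X}$ represents the proportion of samples $\{x_1, \ldots, x_n\}$ located in the ball $\bar{B}(x_i, \rho(x_i, x_j))$ and $A_{ij}^{Y}$ represents the proportion of samples $\{y_1, \ldots, y_m\}$ located in the ball $\bar{B}(x_i, \rho(x_i, x_j))$.
Meanwhile, $C_{kl}^{X}$ and $C_{kl}^{Y}$ represent the corresponding proportions located in the ball $\bar{B}(y_k, \rho(y_k, y_l))$.
The Ball Divergence statistic is defined as:
$$D_{n,m} = A_{n,m} + C_{n,m}, \quad A_{n,m} = \frac{1}{n^{2}} \sum_{i,j=1}^{n} \left(A_{ij}^{X} - A_{ij}^{Y}\right)^{2}, \quad C_{n,m} = \frac{1}{m^{2}} \sum_{k,l=1}^{m} \left(C_{kl}^{X} - C_{kl}^{Y}\right)^{2}.$$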
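The two-sample definition can be checked by hand on tiny inputs with a naive base-R sketch such as the one below. The function name ball_div_naive is hypothetical, the Euclidean metric is assumed, and the cubic-time loops favour readability over speed; this is not the routine used by bd, whose output scale may differ.

```r
# Naive two-sample Ball Divergence for numeric vectors or matrices
# (rows are observations), using the Euclidean metric.
ball_div_naive <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n <- nrow(x)
  m <- nrow(y)
  d <- as.matrix(dist(rbind(x, y)))   # pooled pairwise distances
  ix <- seq_len(n)                    # rows belonging to the first sample
  iy <- n + seq_len(m)                # rows belonging to the second sample

  # Average squared gap between the two samples' proportions inside every
  # closed ball whose centre and radius-defining point come from `centers`.
  gap <- function(centers) {
    total <- 0
    for (i in centers) {
      for (j in centers) {
        r <- d[i, j]
        total <- total + (mean(d[i, ix] <= r) - mean(d[i, iy] <= r))^2
      }
    }
    total / length(centers)^2
  }

  gap(ix) + gap(iy)
}

ball_div_naive(c(0, 1), c(0, 1))     # identical samples: 0
ball_div_naive(c(0, 1), c(10, 11))   # well-separated samples: 1.25
```

In the separated case every ball centred in one sample contains no point of the other, so each of the two terms averages the squares 0.25, 1, 1, 0.25 and contributes 0.625.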
Ball Divergence can be generalized to the K-sample test problem. Suppose we have $K$ groups of samples, where the $k$-th group includes $n_k$ samples.
The $K$-sample Ball Divergence statistic can be defined by directly summing up the two-sample Ball Divergence statistics of all sample pairs (kbd.type = "sum"):
$$\sum_{1 \leq k < l \leq K} D_{n_k, n_l},$$
or by finding the one sample with the largest difference to the others (kbd.type = "maxsum"):
$$\max_{t} \sum_{s = 1, s \neq t}^{K} D_{n_s, n_t},$$
or by aggregating the $K - 1$ most significantly different two-sample Ball Divergence statistics (kbd.type = "max"):
$$\sum_{k=1}^{K-1} D_{(k)},$$
where $D_{(1)}, \ldots, D_{(K-1)}$ are the largest $K - 1$ two-sample Ball Divergence statistics among $\{D_{n_s, n_t} \mid 1 \leq s < t \leq K\}$. When $K = 2$, the three types of Ball Divergence statistics degenerate into the two-sample Ball Divergence statistic.
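The three aggregation rules are easy to compare once the pairwise two-sample statistics are arranged in a symmetric matrix. The sketch below does exactly that in base R; kbd_aggregate is a hypothetical helper and the pairwise values are made up so that the three rules give different answers.

```r
# Combine a symmetric K x K matrix of pairwise two-sample statistics.
kbd_aggregate <- function(pairwise, type = c("sum", "maxsum", "max")) {
  type <- match.arg(type)
  K <- nrow(pairwise)
  upper <- pairwise[upper.tri(pairwise)]   # each unordered pair once
  switch(type,
    sum = sum(upper),
    maxsum = max(rowSums(pairwise) - diag(pairwise)),
    max = sum(sort(upper, decreasing = TRUE)[seq_len(K - 1)])
  )
}

# Made-up pairwise values for K = 4 groups.
pairwise <- matrix(0, 4, 4)
pairwise[1, 2] <- 5
pairwise[3, 4] <- 4
pairwise[1, 3] <- pairwise[1, 4] <- pairwise[2, 3] <- pairwise[2, 4] <- 1
pairwise <- pairwise + t(pairwise)

kbd_aggregate(pairwise, "sum")     # 5 + 4 + 1 + 1 + 1 + 1 = 13
kbd_aggregate(pairwise, "maxsum")  # group 1 (or 2): 5 + 1 + 1 = 7
kbd_aggregate(pairwise, "max")     # three largest: 5 + 4 + 1 = 10
```

With only two groups the matrix has a single off-diagonal value, and all three rules return it.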
See bd.test for a test of distribution equality based on the Ball Divergence.
bd | Ball Divergence statistic |
Wenliang Pan, Yuan Tian, Xueqin Wang, Heping Zhang
Wenliang Pan, Yuan Tian, Xueqin Wang, Heping Zhang. Ball Divergence: Nonparametric two sample test. Ann. Statist. 46 (2018), no. 3, 1109–1137. doi:10.1214/17-AOS1579. https://projecteuclid.org/euclid.aos/1525313077
############# Ball Divergence #############
x <- rnorm(50)
y <- rnorm(50)
bd(x, y)
Fast K-sample Ball Divergence Test for GWAS Data
bd.gwas.test(
  x,
  snp,
  screening.method = c("permute", "spectrum"),
  refine = TRUE,
  num.permutations,
  distance = FALSE,
  alpha,
  screening.result = NULL,
  verbose = TRUE,
  seed = 1,
  num.threads = 0,
  ...
)
x | a numeric vector, matrix, data.frame, or dist object. |
snp | a numeric matrix recording the values of single nucleotide polymorphism (SNP). Each column must be an integer vector. |
screening.method | if |
refine | a logical value. If |
num.permutations | the number of permutation replications. When |
distance | if |
alpha | the significance level. Default: |
screening.result | An object returned by |
verbose | Show computation status and estimated runtimes. Default: |
seed | the random seed. Default |
num.threads | number of threads. If |
... | further arguments to be passed to or from methods. |
bd.gwas.test returns a list containing the following components:
statistic | ball divergence statistics vector. |
permuted.statistic | a data.frame containing permuted ball divergence statistic for pre-screening SNPs. If |
eigenvalue | the eigenvalue of spectrum decomposition. If |
p.value | the p-values of ball divergence test. |
refined.snp | the SNPs that have been refined. |
refined.p.value | the refined |
refined.permuted.statistic | a data.frame containing permuted ball divergence statistics for refining |
screening.result | a list containing the result of screening. |
Jin Zhu
Yue Hu, Haizhu Tan, Cai Li, and Heping Zhang. (2021). Identifying genetic risk variants associated with brain volumetric phenotypes via K-sample Ball Divergence method. Genetic Epidemiology, 1–11. https://doi.org/10.1002/gepi.22423
library(Ball)
set.seed(1234)
num <- 200
snp_num <- 500
p <- 5
x <- matrix(rnorm(num * p), nrow = num)
snp <- sapply(1:snp_num, function(i) {
  sample(0:2, size = num, replace = TRUE)
})
snp1 <- sapply(1:snp_num, function(i) {
  sample(1:2, size = num, replace = TRUE)
})
snp <- cbind(snp, snp1)
res <- Ball::bd.gwas.test(x = x, snp = snp)
mean(res[["p.value"]] < 0.05)
mean(res[["p.value"]] < 0.005)

## only return the test statistics:
res <- Ball::bd.gwas.test(x = x, snp = snp, num.permutations = 0)

## save pre-screening process results:
x <- matrix(rnorm(num * p), nrow = num)
snp <- sapply(1:snp_num, function(i) {
  sample(0:2, size = num, replace = TRUE, prob = c(1/2, 1/4, 1/4))
})
snp_screening <- Ball::bd.gwas.test(x = x, snp = snp,
                                    alpha = 5*10^-4,
                                    num.permutations = 19999)
mean(snp_screening[["p.value"]] < 0.05)
mean(snp_screening[["p.value"]] < 0.005)
mean(snp_screening[["p.value"]] < 0.0005)

## refine p-value according to the pre-screening process result:
res <- Ball::bd.gwas.test(x = x, snp = snp, alpha = 5*10^-4,
                          num.permutations = 19999,
                          screening.result = snp_screening[["screening.result"]])
Performs the nonparametric two-sample or K-sample Ball Divergence test for equality of multivariate distributions.
bd.test(x, ...)
## Default S3 method:
bd.test(
  x,
  y = NULL,
  num.permutations = 99,
  method = c("permutation", "limit"),
  distance = FALSE,
  size = NULL,
  seed = 1,
  num.threads = 0,
  kbd.type = c("sum", "maxsum", "max"),
  weight = c("constant", "variance"),
  ...
)
## S3 method for class 'formula'
bd.test(formula, data, subset, na.action, ...)
x | a numeric vector, matrix, data.frame, or a list containing at least two numeric vectors, matrices, or data.frames. |
... | further arguments to be passed to or from methods. |
y | a numeric vector, matrix, or data.frame. |
num.permutations | the number of permutation replications. When |
method | if |
distance | if |
size | a vector recording sample size of each group. |
seed | the random seed. Default |
num.threads | number of threads. If |
kbd.type | a character string specifying the |
weight | a character string specifying the weight form of Ball Divergence statistic. It must be one of |
formula | a formula of the form |
data | an optional matrix or data frame (or similar: see |
subset | an optional vector specifying a subset of observations to be used. |
na.action | a function which indicates what should happen when the data contain |
bd.test is a nonparametric test for the two-sample or K-sample problem.
It can detect distribution differences between $K$ ($K \geq 2$) samples even when the sample sizes are imbalanced.
This test copes well with multivariate or complex datasets.
If only x is given, the statistic is computed from the original pooled samples, stacked in a matrix where each row is a multivariate observation, or from the distance matrix when distance = TRUE. The first size[1] rows of x are the first sample, the next size[2] rows of x are the second sample, etc.
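The stacking convention can be illustrated in base R as below: a vector of group sizes is expanded into one label per row, and the pooled matrix is split back into its samples. The data and the object names are invented for the example, and the commented-out calls merely restate the two input styles described in this section.

```r
# Pooled sample: 9 observations in R^2, stacked group after group.
pooled <- matrix(seq_len(18), ncol = 2)
size <- c(4, 3, 2)                      # group sizes, in stacking order
stopifnot(sum(size) == nrow(pooled))

# Group label of every row: 1 1 1 1 2 2 2 3 3
group <- rep(seq_along(size), times = size)

# Recover the individual samples as a list of matrices.
samples <- lapply(split(seq_len(nrow(pooled)), group),
                  function(rows) pooled[rows, , drop = FALSE])
sapply(samples, nrow)

# With the Ball package attached, the same groups could be supplied either as
# bd.test(pooled, size = size)
# or as
# bd.test(samples)
```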
If x is a list, its elements are taken as the samples to be compared, and hence, this list must contain at least two numeric data vectors, matrices or data.frames.
bd.test utilizes the Ball Divergence statistic (see bd) to measure dispersion and derives a p-value by replicating the random permutation num.permutations times.
The function simply returns the test statistic when num.permutations = 0.
The time complexity of bd.test is around $O(R \times n^2)$, where $R$ = num.permutations and $n$ is the sample size.
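The permutation procedure described above can be sketched in base R as follows. The pooled distance matrix is computed once, group membership is reshuffled R times, and the observed statistic is compared with the reshuffled ones. The function names are hypothetical, the inputs are assumed to be numeric vectors, the statistic is a naive evaluation of the two-sample definition, and the p-value formula is a common permutation convention assumed here; the package may differ in any of these details.

```r
# Ball Divergence between the rows indexed by ix and iy of a pooled
# distance matrix d (naive evaluation of the definition).
bd_from_dist <- function(d, ix, iy) {
  gap <- function(centers) {
    total <- 0
    for (i in centers) {
      for (j in centers) {
        r <- d[i, j]
        total <- total + (mean(d[i, ix] <= r) - mean(d[i, iy] <= r))^2
      }
    }
    total / length(centers)^2
  }
  gap(ix) + gap(iy)
}

# Permutation test: shuffle group membership R times and compare.
bd_perm_test <- function(x, y, R = 99) {
  n <- length(x)
  m <- length(y)
  d <- as.matrix(dist(c(x, y)))          # computed once, reused throughout
  observed <- bd_from_dist(d, seq_len(n), n + seq_len(m))
  replicates <- replicate(R, {
    shuffled <- sample(n + m)
    bd_from_dist(d, shuffled[seq_len(n)], shuffled[n + seq_len(m)])
  })
  # Assumed p-value convention: (1 + #{replicates >= observed}) / (1 + R).
  p_value <- (1 + sum(replicates >= observed)) / (1 + R)
  list(statistic = observed, replicates = replicates, p.value = p_value)
}

set.seed(1)
res <- bd_perm_test(x = 1:10, y = 101:110, R = 99)
res$p.value
```

Reusing one distance matrix across permutations is the reason a permutation only has to reindex rows rather than recompute distances.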
If num.permutations > 0, bd.test returns an htest class object containing the following components:
statistic | Ball Divergence statistic. |
p.value | the |
replicates | permutation replications of the test statistic. |
size | sample sizes. |
complete.info | a |
alternative | a character string describing the alternative hypothesis. |
method | a character string indicating what type of test was performed. |
data.name | description of data. |
If num.permutations = 0, bd.test returns a statistic value.
bd.test simultaneously computes the "sum", "maxsum", and "max" Ball Divergence statistics when $K \geq 3$.
Users can get the other Ball Divergence statistics and their corresponding p-values from the complete.info element of the output. We give a quick example below to illustrate.
Wenliang Pan, Yuan Tian, Xueqin Wang, Heping Zhang, Jin Zhu
Wenliang Pan, Yuan Tian, Xueqin Wang, Heping Zhang. Ball Divergence: Nonparametric two sample test. Annals of Statistics. 46 (2018), no. 3, 1109–1137. doi:10.1214/17-AOS1579. https://projecteuclid.org/euclid.aos/1525313077
Jin Zhu, Wenliang Pan, Wei Zheng, and Xueqin Wang (2021). Ball: An R Package for Detecting Distribution Difference and Association in Metric Spaces, Journal of Statistical Software, Vol.97(6), doi: 10.18637/jss.v097.i06.
library(Ball)

################# Quick Start #################
set.seed(1)
x <- rnorm(50)
y <- rnorm(50, mean = 1)
# plot(density(x))
# lines(density(y), col = "red")
bd.test(x = x, y = y)

################# Quick Start #################
x <- matrix(rnorm(100), nrow = 50, ncol = 2)
y <- matrix(rnorm(100, mean = 3), nrow = 50, ncol = 2)
# Hypothesis test with Standard Ball Divergence:
bd.test(x = x, y = y)

################# Simulated Non-Hilbert data #################
data("bdvmf")
## Not run:
library(scatterplot3d)
scatterplot3d(bdvmf[["x"]], color = bdvmf[["group"]],
              xlab = "X1", ylab = "X2", zlab = "X3")
## End(Not run)
# calculate geodesic distance between samples:
Dmat <- nhdist(bdvmf[["x"]], method = "geodesic")
# hypothesis test with BD:
bd.test(x = Dmat, size = c(150, 150), num.permutations = 99, distance = TRUE)

################# Non-Hilbert Real Data #################
# load data:
data("macaques")
# number of female and male Macaca fascicularis:
table(macaques[["group"]])
# calculate Riemannian shape distance matrix:
Dmat <- nhdist(macaques[["x"]], method = "riemann")
# hypothesis test with BD:
bd.test(x = Dmat, num.permutations = 99, size = c(9, 9), distance = TRUE)

################ K-sample Test #################
n <- 150
bd.test(rnorm(n), size = c(40, 50, 60))
# alternative input method:
x <- lapply(c(40, 50, 60), rnorm)
res <- bd.test(x)
res
## get all Ball Divergence statistics:
res[["complete.info"]][["statistic"]]
## get all test results:
res[["complete.info"]][["p.value"]]

################ Testing via approximate limit distribution #################
## Not run:
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rnorm(n)
res <- bd.test(x, y, method = "limit")
bd.test(x, y)
## End(Not run)

################ Formula interface ################
## Two-sample test
bd.test(extra ~ group, data = sleep)
## K-sample test
bd.test(Sepal.Width ~ Species, data = iris)
bd.test(Sepal.Width ~ Species, data = iris, kbd.type = "max")
Simulated random vectors following the von Mises-Fisher distribution, drawn with two mean directions (one per group) and a fixed concentration parameter.
bdvmf$x: A numeric matrix containing simulated von Mises-Fisher data.
bdvmf$group: A group index vector.
In directional statistics, the von Mises–Fisher distribution (named after Ronald Fisher and Richard von Mises) is a probability distribution on the $(p-1)$-dimensional sphere in $\mathbb{R}^{p}$. The parameters $\mu$ and $\kappa \geq 0$ are called the mean direction and concentration parameter, respectively. The greater the value of $\kappa$, the higher the concentration of the distribution around the mean direction $\mu$. The distribution is unimodal for $\kappa > 0$, and is uniform on the sphere for $\kappa = 0$.
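For reference, the standard form of the von Mises–Fisher density is written out below. It is not part of the original help page; the symbols $C_p$ (normalizing constant) and $I_v$ (modified Bessel function of the first kind of order $v$) are introduced here for illustration, with $x$ and $\mu$ unit vectors in $\mathbb{R}^{p}$.

```latex
f_p(x; \mu, \kappa) = C_p(\kappa)\, \exp\!\left(\kappa\, \mu^{\top} x\right),
\qquad
C_p(\kappa) = \frac{\kappa^{p/2 - 1}}{(2\pi)^{p/2}\, I_{p/2 - 1}(\kappa)},
\qquad \kappa > 0 .
```

As $\kappa \to 0$ the exponential factor tends to one for every $x$, so the density approaches a constant, which is the uniform distribution on the sphere.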
Fisher, N. I., Lewis, T. and Embleton, B. J. J. (1993). Statistical analysis of spherical data (1st pbk. ed.). Cambridge: Cambridge University Press. pp. 115–116. ISBN 0-521-45699-1.
Publicly available lung cancer genomic data from the Chemores Cohort Study, containing the expression levels of mRNA, miRNA, artificial noise variables as well as clinical variables and response.
genlung$survival: A data.frame containing complete observations. The first column is disease-free survival time and the second column is censoring status.
genlung$covariate: A data.frame containing covariates.
Tissue samples were analysed from a cohort of 123 patients who underwent complete surgical resection at the Institut Mutualiste Montsouris (Paris, France) between 30 January 2002 and 26 June 2006. The studied outcome was the "Disease-Free Survival Time". Patients were followed until the first relapse occurred or administrative censoring. In this genomic dataset, the expression levels of Agilent miRNA probes were included from the cohort samples. The miRNA data contains normalized expression levels. See the paper by Lazar et al. (2013) below and the ArrayExpress data repository for the complete description of the samples, tissue preparation, Agilent array technology, and data normalization.
In addition to the genomic data, five clinical variables, also evaluated on the cohort samples, are included: one continuous variable ('Age') and four nominal variables ('Type', 'KRAS.status', 'EGFR.status', 'P53.status'). See Lazar et al. (2013) for more details. Moreover, we add 1056 standard Gaussian variables, independent of the censored response, as noise covariates. This dataset represents a situation where the number of covariates dominates the number of complete observations, i.e., the $p \gg n$ case.
Lazar V. et al. (2013). Integrated molecular portrait of non-small cell lung cancers. BMC Medical Genomics 6:53-65.
Male and female macaque skull data: 7 landmarks in 3 dimensions for 18 individuals (9 males, 9 females).
macaques$x: An array of dimension 7 × 3 × 18 (landmarks × dimensions × individuals).
macaques$group: A factor indicating the sex ('m' for male and 'f' for female).
In an investigation into sex differences in the crania of a species of Macaca fascicularis (a type of monkey), random samples of 9 male and 9 female skulls were obtained by Paul O’Higgins (Hull-York Medical School) (Dryden and Mardia 1993). A subset of seven anatomical landmarks was located on each cranium and the three-dimensional (3D) coordinates of each point were recorded.
Dryden, I.L. and Mardia, K.V. (1998). Statistical Shape Analysis, Wiley, Chichester.
Dryden, I. L. and Mardia, K. V. (1993). Multivariate shape analysis. Sankhya Series A, 55, 460-480.
A meteorological dataset containing 46 records on air, soil, humidity, wind and evaporation.
meteorology$air: A data.frame containing 3 variables: maximum, minimum and average daily air temperature.
meteorology$soil: A data.frame containing 3 variables: maximum, minimum and average daily soil temperature.
meteorology$humidity: A data.frame containing 3 variables: maximum, minimum and average daily relative humidity.
meteorology$wind: A vector recording total wind, measured in miles per day.
meteorology$evaporation: A vector recording evaporation.
This meteorological dataset contains 46 observations on five groups of variables: air temperature, soil temperature, relative humidity, wind speed and evaporation. The maximum, minimum and average values are recorded for air temperature, soil temperature, and relative humidity. Wind speed and evaporation are univariate numerical variables. We wish to test the independence of these five groups of variables.
This function computes and returns the numeric distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
nhdist(x, method = "geodesic")
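x | a numeric matrix, data frame or numeric array of dimension $k \times m \times n$ containing $n$ samples, each of dimension $k \times m$. |
method | the distance measure to be used. This must be one of "geodesic", "compositional", or "riemann". Any unambiguous substring can be given. |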
Available distance measures are geodesic, compositional and riemann. Denoting any two samples in the dataset as $x$ and $y$, we define the distance measures as follows.
geodesic: The shortest route between two points on the Earth's surface, namely, a segment of a great circle.
compositional: First, we apply a scale transformation to each sample, i.e., divide its components by their sum so that they add up to one. Then, we apply the square root transformation to the data and calculate the geodesic distance between samples.
riemann: The input is a $k \times m \times n$ array where $k$ = number of landmarks, $m$ = number of dimensions and $n$ = sample size. Details of the Riemannian shape distance are given in Kendall, D. G. (1984).
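The geodesic and compositional definitions above can be sketched in base R. This is a rough illustration rather than the code nhdist actually runs: geodesic_dist and compositional_dist are hypothetical helper names, and the geodesic version assumes every row is already a unit vector, so the great-circle distance is the arc cosine of the inner product.

```r
# Great-circle distance between the rows of a matrix of unit vectors
# (assumption: each row has Euclidean norm 1).
geodesic_dist <- function(x) {
  x <- as.matrix(x)
  inner <- tcrossprod(x)              # pairwise inner products
  inner <- pmin(pmax(inner, -1), 1)   # guard against rounding outside [-1, 1]
  d <- acos(inner)
  diag(d) <- 0                        # a point is at distance 0 from itself
  d
}

# Compositional distance: rescale rows to sum to one, take square roots
# (which places each row on the unit sphere), then use the geodesic distance.
compositional_dist <- function(x) {
  x <- as.matrix(x)
  x <- x / rowSums(x)
  geodesic_dist(sqrt(x))
}

pts <- rbind(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1))
geodesic_dist(pts)   # off-diagonal entries equal pi / 2

comp <- rbind(c(50, 30, 20), c(10, 60, 30))   # made-up percentages
compositional_dist(comp)
```

The square-root step works because a composition whose parts sum to one has square roots whose squares sum to one, so the transformed rows satisfy the unit-norm assumption of the geodesic helper.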
numeric distance matrix
Kendall, D. G. (1984). Shape manifolds, Procrustean metrics and complex projective spaces, Bulletin of the London Mathematical Society, 16, 81-121.
data('bdvmf')
Dmat <- nhdist(bdvmf[['x']], method = "geodesic")

data("ArcticLake")
Dmat <- nhdist(ArcticLake[['x']], method = "compositional")

data("macaques")
Dmat <- nhdist(macaques[["x"]], method = "riemann")

# unambiguous substring also available:
Dmat <- nhdist(macaques[["x"]], method = "rie")