R2BGLiMS.Rd
Calls BGLiMS - a Java package for fitting GLMs under Bayesian model selection. NOTE: The predictors to explore with model selection are specified via the model.space.priors argument - see examples. By default a common, and unknown, prior SD is used for all predictors under model selection, which in turn is assigned an Inverse-Gamma hyper-prior. Fixed normal priors may be user specified for all predictor (and confounder) coefficients via the beta.priors argument.
R2BGLiMS(
likelihood = NULL,
data = NULL,
outcome.var = NULL,
times.var = NULL,
confounders = NULL,
model.selection = TRUE,
model.space.priors = NULL,
beta.priors = NULL,
beta.prior.partitions = NULL,
standardise.covariates = TRUE,
empirical.intercept.prior.mean.and.initial.value = NULL,
g.prior = TRUE,
tau = NULL,
xtx.ridge.term = 0,
enumerate.up.to.dim = 0,
X.ref = NULL,
cor.ref = NULL,
mafs.ref = NULL,
ns.each.ethnicity = NULL,
marginal.betas = NULL,
n = NULL,
n.iter = 1e+06,
n.mil.iter = NULL,
thinning.interval = NULL,
seed = NULL,
extra.arguments = NULL,
initial.model = NULL,
max.model.dim = -1,
save.path = NULL,
results.label = NULL,
burnin.fraction = 0.5,
trait.variance = NULL,
logistic.likelihood.weights = NULL,
mrloss.w = 0,
mrloss.function = "variance",
mrloss.marginal.by = NULL,
mrloss.marginal.sy = NULL,
mafs.if.independent = NULL,
extra.java.arguments = NULL,
debug = FALSE
)
Type of model to fit. Current options are "Logistic" (for binary data), "CLogLog" (complementary log-log link for binary data), "Weibull" (for survival data), "Linear" (for linear regression), "LinearConj" (linear regression exploiting conjugate results), "JAM" (for conjugate linear regression using summary statistics, integrating out parameters) and "JAM_MCMC" (for linear regression using summary statistics, with full MCMC of all parameters).
Matrix or dataframe containing the data to analyse. Rows are individuals, and columns contain the variables and outcome. If modelling summary statistics specify X.ref, marginal.betas, and n instead (see below).
Name of outcome variable in data. For survival data see times.var below. If modelling summary statistics with JAM this can be left NULL but you must specify X.ref, marginal.betas and n instead (see below).
SURVIVAL DATA ONLY: Name of column in data which contains the event times.
Optional vector of confounders to fix in the model at all times, i.e. exclude from model selection.
Whether to use model selection (default is TRUE). NB: Even if set to FALSE, please provide a dummy model.space.priors argument (see below). This will not be used mathematically, but the package requires taking the list of variable names from the "Variables" element therein.
Must be specified if model.selection is set to TRUE. Two options are available. 1) A fixed prior is placed on the proportion of causal covariates, and all models with the same number of covariates are equally likely. This is effectively a Poisson prior over the different possible model sizes. A list must be supplied for model.space.priors with an element "Rate", specifying the prior proportion of causal covariates, and an element "Variables" containing the list of covariates included in the model search. 2) The prior proportion of causal covariates is treated as unknown and given a Beta(a, b) hyper-prior, in which case elements "a" and "b" must be included in the model.space.priors list rather than "Rate". Higher values of "b" relative to "a" will encourage sparsity.
NOTE: It is easy to specify different model space priors for different collections of covariates by providing a list of lists, each element of which is itself a model.space.priors list as described above for a particular subset of the covariates.
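The two options and the list-of-lists form can be sketched as below. This is a minimal illustration, not taken from the package examples: the covariate names and the numeric values are hypothetical, and the "Variables" element is assumed to accept a character vector of names.

```r
# Hypothetical covariate names; replace with column names from your data.
predictors <- c("age", "bmi", "snp1", "snp2")

# Option 1: fixed prior proportion of causal covariates.
msp.fixed <- list("Rate" = 0.1, "Variables" = predictors)

# Option 2: Beta(a, b) hyper-prior on the proportion; b > a encourages sparsity.
msp.beta <- list("a" = 1, "b" = 4, "Variables" = predictors)

# Different priors for different collections of covariates: a list of lists.
msp.split <- list(
  list("a" = 1, "b" = 1,  "Variables" = c("age", "bmi")),
  list("a" = 1, "b" = 10, "Variables" = c("snp1", "snp2"))
)

# The prior mean of the causal proportion under Beta(a, b) is a / (a + b).
prior.mean.proportion <- msp.beta$a / (msp.beta$a + msp.beta$b)
```

Any of these objects would then be passed as the model.space.priors argument.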
This allows specifying fixed (potentially informative) priors for the covariate effects. A matrix must be passed, with named rows corresponding to parameters, and columns corresponding to the prior mean and variance in that order. When using this option priors must be specified for either just the confounders, which are otherwise given fixed N(0,1e6) priors, or for all covariates.
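A matrix of this shape might be built as follows. This is a sketch under assumptions: the confounder names and prior values are made up, and the column labels "Mean" and "Variance" are only descriptive (the text above specifies the column order, not the column names).

```r
# Hypothetical confounders; row names must match the parameter names.
confounders <- c("age", "sex")

# One row per parameter; first column prior mean, second column prior variance.
beta.priors <- matrix(
  c(0.0, 1.00,
    0.5, 0.25),
  nrow = 2, byrow = TRUE,
  dimnames = list(confounders, c("Mean", "Variance"))
)
```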
Covariate effects under variable selection are ascribed, by default, a common Normal prior, the standard deviation of which is treated as unknown, with a Unif(0.05,2) hyper-prior. This option can be used to partition the covariate effects into different prior groups, each with a separate hierarchical normal prior. beta.prior.partitions must be a list with as many elements as desired covariate groups. The element for a particular group must in turn be a list containing the following named elements: "Variables" - a list of covariates in the prior group, and "UniformA" and "UniformB" - the Uniform hyper-parameters for the standard deviation of the normal prior across their effects.
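For instance, two prior groups could be laid out as below. The group memberships and the second group's Uniform limits are hypothetical, and "Variables" is assumed to accept a character vector of names.

```r
beta.prior.partitions <- list(
  # Group 1: genetic variants, using the default Unif(0.05, 2) limits.
  list("Variables" = c("snp1", "snp2"), "UniformA" = 0.05, "UniformB" = 2),
  # Group 2: clinical covariates, with illustrative narrower limits.
  list("Variables" = c("age", "bmi"), "UniformA" = 0.01, "UniformB" = 1)
)

# Sanity check: each covariate should belong to exactly one group.
all.vars <- unlist(lapply(beta.prior.partitions, function(g) g$Variables))
no.overlap <- !any(duplicated(all.vars))
```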
Standardise covariates prior to RJMCMC such that they have a common (unit) standard deviation and mean zero. This is particularly recommended when covariates have substantially different variances, since it makes the assumption of exchangeable effects more plausible; this assumption is required under the (default) common effect prior with unknown variance used in the RJMCMC. The standardisation is done invisibly to the user: covariate effects in the summary results table contained in the R2BGLiMS results object are re-scaled before posterior summaries are produced, so they remain interpretable as effects corresponding to unit increases on the original covariate scale. This is currently available for Logistic, Linear and Weibull regression. Default is TRUE.
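The re-scaling principle can be illustrated with ordinary least squares in base R. This is not the package's internal code, only a sketch of why dividing a standardised-scale effect by the covariate's standard deviation recovers the effect per unit increase on the original scale; the simulated data are arbitrary.

```r
set.seed(1)
n <- 200
x <- rnorm(n, mean = 50, sd = 10)
y <- 2 + 0.3 * x + rnorm(n)

# Standardise to mean zero and unit standard deviation.
x.std <- (x - mean(x)) / sd(x)

# Effect per one-SD increase in x.
beta.std <- coef(lm(y ~ x.std))[["x.std"]]

# Re-scale back to the effect per unit increase on the original scale.
beta.original <- beta.std / sd(x)

# Matches the effect estimated directly on the original scale.
beta.direct <- coef(lm(y ~ x))[["x"]]
```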
Empirically set the intercept initial value and prior mean according to the mean outcome value, i.e. the expected intercept value when covariates are mean-centred. In some settings this can markedly improve mixing but EXPERIMENTATION is encouraged since in other settings it can do more harm than good. For Weibull regression this option leads to both the intercept AND the scale parameter having their prior means and initial values set according to a simple NULL Weibull model fit using the survreg function. In our experience this always improved mixing under Weibull regression, so is enabled by default, but can be overridden by setting this option to FALSE. For logistic regression, setting this option to TRUE leads to the intercept prior mean and initial value being set to the logit(case fraction), and for linear regression to the mean outcome. For linear and logistic regression the usefulness of this option is less clear, so it is off by default.
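The logistic and linear values described above can be computed in base R as follows. This is an illustrative sketch with made-up outcome data, not the package's own code; it also shows that logit(case fraction) coincides with the intercept of an intercept-only logistic fit.

```r
# Hypothetical binary outcome with 3 cases out of 10.
y.binary <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
case.fraction <- mean(y.binary)

# Logistic regression: intercept set to logit(case fraction).
intercept.logistic <- qlogis(case.fraction)

# The same value is the intercept of a null (intercept-only) logistic model.
null.fit <- glm(y.binary ~ 1, family = binomial)
intercept.null.fit <- coef(null.fit)[["(Intercept)"]]

# Linear regression: intercept set to the mean outcome.
y.continuous <- c(4.1, 5.3, 3.8, 6.0)
intercept.linear <- mean(y.continuous)
```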
Whether to use a g-prior for the betas, i.e. a multivariate normal with covariance structure proportional to sigma^2*(X'X)^-1, which is thought to aid variable selection in the presence of strong correlation. By default this is enabled.
Value to use for sparsity parameter tau (under the tau*sigma^2 parameterisation). When using the g-prior, a recommended default is max(n, P^2) where n is the number of individuals, and P is the number of predictors.
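The recommended default can be computed directly. The sample size and number of predictors here are hypothetical.

```r
n <- 5000  # number of individuals (hypothetical)
P <- 100   # number of predictors (hypothetical)

# Recommended default for tau when using the g-prior.
tau <- max(n, P^2)
```

With these values P^2 exceeds n, so tau is set by the number of predictors.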
Constant to add to the diagonal of X'X before JAM takes the Cholesky decomposition.
Whether to make posterior inference by exhaustively calculating the posterior support for every possible model up to this dimension. Leave at 0 to disable and use RJMCMC instead. The current maximum allowed value is 5.
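To judge whether enumeration is feasible, it can help to count the models involved. The helper below is hypothetical (it is not part of the package) and simply sums the number of covariate subsets of each size, including the null model.

```r
# Number of models with at most max.dim of P covariates included.
count.models <- function(P, max.dim) {
  sum(choose(P, 0:max.dim))
}

n.models.20 <- count.models(P = 20, max.dim = 5)
n.models.100 <- count.models(P = 100, max.dim = 5)
```

For 20 covariates this gives 21700 models; the count grows quickly with P, which is one reason a cap on the dimension is useful.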
Reference genotype matrix used by JAM to impute the SNP-SNP correlations. If multiple regions are to be analysed this should be a list containing reference genotype matrices for each region. Individuals' genotypes must be coded as a numeric risk allele count 0/1/2. Non-integer values reflecting imputation may be given. NB: The risk allele coding MUST correspond to that used in marginal.betas. These matrices must each be positive definite and the column names must correspond to the names of the marginal.betas vector.
As an alternative to a reference genotype matrix, a reference correlation matrix AND mafs may be supplied to JAM. NB: The risk allele coding MUST correspond to that used in marginal.betas. These matrices must each be positive definite and the column and row names must correspond to the names of the marginal.betas vector.
As an alternative to a reference genotype matrix, a reference correlation matrix AND mafs may be supplied to JAM. NB: The risk allele coding MUST correspond to that used in marginal.betas. This must be a named vector whose names correspond to the names of the marginal.betas vector.
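The naming requirements for X.ref, cor.ref and mafs.ref can be checked before calling JAM, along the lines sketched below. The SNP names, effect estimates and simulated genotypes are all made up, and deriving the allele frequencies as half the mean allele count is an assumption of this sketch (it gives the frequency of the counted allele).

```r
set.seed(42)
snps <- c("rs1", "rs2", "rs3")  # hypothetical SNP names

# Simulated reference genotypes: 200 individuals, allele counts 0/1/2.
X.ref <- matrix(
  rbinom(200 * length(snps), size = 2, prob = 0.3),
  nrow = 200, dimnames = list(NULL, snps)
)

# Made-up marginal effects, named by SNP.
marginal.betas <- c(rs1 = 0.12, rs2 = -0.05, rs3 = 0.30)

# Column names of X.ref must correspond to names of marginal.betas.
names.match <- identical(colnames(X.ref), names(marginal.betas))

# Alternative inputs: correlation matrix and allele frequencies.
cor.ref <- cor(X.ref)
mafs.ref <- colMeans(X.ref) / 2

# Positive definiteness check via the eigenvalues.
is.pos.def <- all(eigen(cor.ref, symmetric = TRUE, only.values = TRUE)$values > 0)
```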
For mJAM: A vector of the sizes of each ethnicity dataset in which the summary statistics were calculated.
Vector of (named) marginal effect estimates to re-analyse with JAM under multivariate models. For multi-ethnic "mJAM" please provide a list of vectors, each element of which is a vector of marginal effects for a specific ethnicity over the same variants.
The size of the dataset in which the summary statistics (marginal.betas) were calculated.
Number of iterations to run (default is 1e6).
Number of million iterations to run. Can optionally be used instead of n.iter for convenience, which it will override if specified.
Every nth iteration to store (i.e. for the Java algorithm to write to a file and then read into R). By default this is the number of iterations divided by 1e4 (so for 1 million iterations every 100th is stored.) Higher values (so smaller posterior sample) can lead to faster runtimes for large numbers of covariates.
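The default thinning arithmetic, and the way n.mil.iter translates into a number of iterations, can be written out as below. This only mirrors the description above; it is not the package's internal code.

```r
# Default: thin so that 1e4 samples are stored.
n.iter <- 1e6
thinning.interval <- n.iter / 1e4
n.saved <- n.iter / thinning.interval

# Using n.mil.iter instead: 5 million iterations.
n.mil.iter <- 5
n.iter.from.mil <- n.mil.iter * 1e6
thinning.from.mil <- n.iter.from.mil / 1e4
```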
Which random number seed to use in the RJMCMC sampler. If none is provided, a random seed is picked between 1 and 2^16.
A named list of additional arguments for which there are currently no dedicated options. This can be used to modify various "under the hood" settings, including all prior hyper-parameters, MCMC mixing parameters such as the probabilities of add/delete/swap moves as well as the adaption settings. A list of all settings available for modification can be seen by typing "data(DefaultArguments)" and then "default.arguments", which lists their names and default values.
An initial model for the covariates under selection can be specified as a vector of 0s and 1s. If left un-specified the null (empty) model is used.
The maximum model dimension can be specified, thereby truncating the model space. We do not recommend using this option but it might sometimes be useful for robustness checks. When left unspecified there is no restriction on the model size.
By default R2BGLiMS writes the posterior samples to a temporary file that is deleted after they have been read in to R. By specifying a file path this can be kept, along with the temporary files used for the data and arguments, which can sometimes be useful for debugging.
When using the save.path option, this allows the user to specify a handle with which to name the files.
Initial fraction of the iterations to throw away, e.g. setting to 0 would mean no burn-in. The default of 0.5 corresponds to the first half of iterations being discarded.
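The effect of the burn-in fraction on a matrix of stored samples can be sketched as follows. The sample matrix is simulated, and applying the fraction to the rows of the stored (thinned) output is an assumption of this illustration.

```r
set.seed(7)
burnin.fraction <- 0.5

# Simulated stand-in for stored MCMC output: 1000 saved samples, 3 parameters.
saved.samples <- matrix(rnorm(1000 * 3), ncol = 3)

# Discard the initial fraction of samples.
n.discard <- floor(burnin.fraction * nrow(saved.samples))
post.burnin <- saved.samples[(n.discard + 1):nrow(saved.samples), , drop = FALSE]
```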
Check NULL values.
An optional vector of likelihood weights for logistic regression. These weights multiply the log-likelihood contribution of each individual. The order should match the order of rows in the data matrix.
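How the weights enter the likelihood can be written out in a few lines. The function below is a hypothetical illustration of a weighted Bernoulli log-likelihood, not the Java implementation, and the outcome, fitted probabilities and weights are made up.

```r
# Weighted logistic log-likelihood: each individual's contribution
# is multiplied by their weight.
weighted.loglik <- function(y, p, w) {
  sum(w * (y * log(p) + (1 - y) * log(1 - p)))
}

y <- c(1, 0, 1)        # outcomes, in row order of the data
p <- c(0.8, 0.3, 0.6)  # fitted probabilities
w <- c(1, 1, 1)        # unit weights recover the usual log-likelihood

ll.unweighted <- weighted.loglik(y, p, w)
ll.doubled <- weighted.loglik(y, p, 2 * w)
```

Doubling every weight doubles the log-likelihood, which is equivalent to observing each individual twice.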
The relative weight of the MR log loss function for pleiotropy vs the log likelihood. Default 0.
Choice of pleiotropic loss function: "steve" or "variance" (default is "variance").
Marginal associations between SNPs and outcome for the MR loss function model.
Standard errors of marginal associations between SNPs and outcome for the MR loss function model (not required for mrloss.function "variance").
If the SNPs are independent then a reference genotype matrix is not required. However, it is still necessary to provide SNP MAFs here as a named vector. Doing so will lead to X.ref being ignored and the SNPs being modelled as if they were independent. Note that this option does not work with enumeration.
A character string to be passed through to the Java command line. For example, passing "-Djava.io.tmpdir=/Temp" specifies a different temporary directory.
Whether to output extra information (such as final adaption proposal SDs) which might help with debugging (default is FALSE).
An R2BGLiMS_Results class object is returned. See the slot 'posterior.summary.table' for a posterior summary of all parameters. See the slot 'mcmc.output' for a matrix containing the raw MCMC output from the saved posterior samples (0 indicates a covariate is excluded from the model in a particular sample). Some functions for summarising results are listed under "see also".
Summary results are stored in the slot posterior.summary.table. See ManhattanPlot for a visual summary of covariate selection probabilities. For posterior model space summaries see TopModels. For convergence checks see TracePlots and AutocorrelationPlot.