Introduction

MFM is a package to simultaneously fine-map (select most likely set of causal variants) multiple related diseases with the same set of controls and share information between them. It relies on output from the package GUESSFM, which fine-maps a single disease via stochastic search in a Bayesian framework using GUESS.

This vignette introduces MFM, and gives an illustration with simulated data for two diseases with shared controls. Custom code for our analyses is available at https://github.com/chr1swallace/MFM-analysis.

MFM Overview

The computational efficiency of MFM is greatly increased by making use of our approximation to the joint Bayes’ factor BF that depends only the individual disease BFs (from a logistic regression model) and the sample and model sizes. In the setting of \(d\) diseases, the multinomial BF, \(B^M\) is approximated by

\[B^M \approx \eta + \sum_d B_d^L, \]

where \(B_d^L\) is the Bayes’ factor for disease \(d\) and \(\eta\) is a term that depends on the sample and model sizes.

Rather than reporting posterior probabilities for each SNP model, we focus on SNP groups that we construct such that SNPs in the same group are in LD, have a similar effect on disease, and rarely appear together in a disease model.

Simulation Example

Genotype and phenotype data of two diseases with shared controls may be simulated using MFMextra, together with hapgen2 and reference panel data (e.g.  CEU of 1000 Genomes.

    library(devtools)
    install_github("jennasimit/MFMextra")

First, we need to simulate some null data from which we will sample to generate the two sets of cases with shared controls. The entire genetic region of interest is simulated to maintain the linkage disequilibrium (LD) structure. In running simulations, there are typically some models of interest that are selected based on analysis of the data. For example, in a previous fine-mapping of \(IL2RA\) in a large international sample, several SNP groups were identified as having the majority of the association signals with the autoimmune diseases multiple sclerosis (MS) and type 1 diabetes (T1D). These groups, together with previously identified lead SNPs for other autoimmune diseases (autoimmune thyroid disease (ATD; rs706799), alopecia areata (AA; rs3118470), rheumatoid arthritis (RA; rs10795791), and ulcerative colitis (UC; rs4147359)) will compose models that contribute to the non-negligible posterior probabilities. Therefore, for computational efficiency, we extract these SNPs from the generated data and focus on these in the fine-mapping simulation analysis.

Below is an example of hapgen2 code to simulate the region based on the CEU of 1000 Genomes reference panel, where keep-snps.txt would be the list of snp positions that are to be retained.

    ./hapgen2
    -m ./genetic_map_chr10_combined_b37.txt \
    -l ./IL2RA.impute.legend \
    -h ./IL2RA.impute.hap \
    -n 100000 0 # 100,000 \
    -no_haps_output -no_gens_output \
    -t ./keep-snps.txt \
    -Ne 11418 \
    -o ./null_100k \

Here is an example where diseases 1 and 2 each have two causal variants of which one is shared: both have causal variant rs61839660 in group A; disease 1 has additional causal variant rs56382813 in group D; disease 2 has additional causal variant rs11594656 in group C. The file “null_100k.controls.gen” is output from the above hapgen code and needed below. An example file is provided with this vignette, but for efficiency the convert.fn step has been run and the resulting matrix has been transposed for quicker reading into R (transposed matrix has SNPs as columns and individuals as rows).

All steps of simulation and analyses are included here. However, if GUESSFM results are available for each disease then one may skip down to the MFM step. We provide an example of GUESSFM output for simulated data from the steps below, which may input directly into MFM.

Simulate 2 diseases with shared controls

    library(MFMextra)
    #g0 <- read.table(null_100k.controls.gen,header=FALSE,as.is=TRUE)
    #Nn <- (dim(g0)[2]-5)/3
    #snpG <- convert.fn(g0) # convert to a genotype matrix (snp rows, indiv cols)
    
    
    dis <- c("AD","AC")
    c12 <- grep("rs61839660",rownames(snpG)) # A SNP for both diseases
    c1 <- grep("rs56382813",rownames(snpG))  # D SNP for disease 1
    c2 <- grep("rs11594656",rownames(snpG))  # C SNP for disease 2
    
    causals1.ind <- c(c12,c1)
    causals2.ind <- c(c12,c2)
    prev <- 0.1 # prevalence for purpose of method evaluation
    
    N0 <- 3000 # disease 1 size
    N1 <- 3000 # disease 2 size
    N2 <- 3000 # controls size
    ND=vector("list",2) # vector of sizes for cases
    names(ND) <- dis
    ND[[1]]<-N1
    ND[[2]]<-N2
    
    
    OR1a <- 1.4 # OR for A, disease 1
    OR2a <- 1.25  # OR for D, disease 1
    OR1b <- 1.4 # OR for A, disease 2
    OR2b <- 1.25  # OR for C, disease 2
    
# snpG included in MFMextra package
    sim <- phen.gen.fn(beta1=c(log(prev),log(OR1a),log(OR2a)),beta2=c(log(prev),log(OR1b),log(OR2b)),snpG=snpG,N0=N0,N1=N1,N2=N2,causals1.ind,causals2.ind)
    Gm <- new("SnpMatrix",(sim$G+1)) # snp cols, indivs rows # convert to SnpMatrix format, needed for GUESSFM
    Gm

GUESSFM on each disease

Next, we run GUESSFM on each disease with controls and store in SM2 list; this is our input to MFM. Note that input to GUESSFM (run.bvs is first step) requires a snpMatrix object and a phenotype vector.

    c0 <- grep("control.",rownames(Gm))
    c1 <- grep("case1.",rownames(Gm))
    c2 <- grep("case2.",rownames(Gm))
    
    G1 <- Gm[c(c0,c1),] # SnpMatrix for disease 1 and controls
    G2 <- Gm[c(c0,c2),] # SnpMatrix for disease 2 and   controls
    t1pheno <- c(rep(0,N0),rep(1,N1)) # phenotype vector for disease 1 and  controls
    t2pheno <- c(rep(0,N0),rep(1,N2)) # phenotype vector for disease 2 and  controls
    
    DIRin <- "tmpdirectory" # path to a directory where to save GUESSFM results
    mydir <- paste(DIRin,"/GFMbvs",sep="") # store GUESSFM results in a previously created directory DIRin
    SM2 <- vector("list",2) # collect the GUESSFM output from each disease

      library(GUESSFM)      
      run.bvs(X=G1,Y=t1pheno,tag.r2=.95,nexp=3,nsave=2000,gdir=mydir,wait=TRUE) # run with expected number of causal variants 3 for better mixing
      d <- read.snpmod(mydir)
      load(file.path(mydir,"tags.RData"))
      dx <- expand.tags(d,tags)
      best <- best.models(dx,pp.thr=0.0001)
      abf <- abf.calc(y=t1pheno,x=G1,models=best$str,family="binomial")
      SM2[[1]] <- abf2snpmod(abf,expected=2,nsnps=854) # find approximate Bayes' factors (ABFs) using expected number of causal variants 2 and the number of SNPs in the re$
    
      run.bvs(X=G2,Y=t2pheno,tag.r2=.95,nexp=3,nsave=2000,gdir=mydir,wait=TRUE)
      d <- read.snpmod(mydir)
      load(file.path(mydir,"tags.RData"))
      dx <- expand.tags(d,tags)
      best <- best.models(dx,pp.thr=0.0001)
      abf <- abf.calc(y=t2pheno,x=G2,models=best$str,family="binomial")
      SM2[[2]] <- abf2snpmod(abf,expected=2,nsnps=854)
    
    names(SM2) <- dis

MFM on both diseases

# SM2 is a list of GUESSFM output for each disease and an example is included in MFM
    library(MFM)
#> Warning: replacing previous import 'data.table::melt' by 'reshape::melt'
#> when loading 'GUESSFM'
    SM2
#> $AD
#> snpmod object, containing information on 270 models / 26 SNPs.
#> PP ranges from 0.000-0.396 (sum: 1.000).
#> 
#> $AC
#> snpmod object, containing information on 207 models / 26 SNPs.
#> PP ranges from 0.000-0.239 (sum: 1.000).
    dis <- names(SM2)
    N0 <- 3000 # disease 1 size
    ND=list(AD=N0,AC=N0) # vector of sizes for cases, which here, are same size as controls

    target.odds <- 1 # could also provide a vector here; this setting corresponds to a 50:50 odds of non-sharing to sharing of causal variants between diseases
    # snpGroups included in MFMextra package
    PP <- PPmarginal.multiple.fn(SM2,dis,thr=0.999,target.odds,N0,ND,nsnps=854)
#> 
#> 
#> CPP threshold = 0.999
#>  n.each (AD/AC) = 89/16

The PP object is a list with components:

  • PP, containing the SNP model posterior probabilities for each disease and each target odds

  • MPP, containing the SNP marginal posterior probabilities of inclusion and each target odds

It is easier to interpret in terms of SNP groups, so next run the following, where the snpGroups object is available in the MFM package:

    mpp.pp <- MPP.PP.groups.fn(PP$MPP,PP$PP,dis,c("null",target.odds),snpGroups)
mpp.pp
#> $mppGS
#>                A            B            C        D E F            J
#> AD.null 1.000000 0.000000e+00 0.0000000000 1.000000 0 0 0.000000e+00
#> AD.1    1.000007 3.890584e-06 0.0001494211 1.000197 0 0 4.842034e-05
#> AC.null 1.000000 0.000000e+00 1.0000000000 0.000000 0 0 0.000000e+00
#> AC.1    1.000000 0.000000e+00 1.0000000000 0.000000 0 0 0.000000e+00
#>                  sAA          sUC          sRA
#> AD.null 0.0000000000 0.0000000000 0.0000000000
#> AD.1    0.0002426082 0.0001735028 0.0001159202
#> AC.null 0.0000000000 0.0000000000 0.0000000000
#> AC.1    0.0000000000 0.0000000000 0.0000000000
#> 
#> $gPP
#> $gPP[[1]]
#>         null            1
#> null       0 1.411368e-13
#> A.D        1 9.990578e-01
#> B.D        0 3.890584e-06
#> A.D.sUC    0 1.735028e-04
#> A.D.sAA    0 2.426082e-04
#> A.D.sRA    0 1.159202e-04
#> A.D.D      0 1.971661e-04
#> A.C.D      0 1.494211e-04
#> A.D.J      0 4.842034e-05
#> A.A.D      0 1.122915e-05
#> 
#> $gPP[[2]]
#>      null            1
#> null    0 6.837469e-24
#> A.C     1 1.000000e+00

The mpp.pp object is a list with componets:

  • mppGS: matrix of SNP group MPP

  • gPP: list of disease SNP group PP matrices

The entires headed by “null” give the results for independent analyses of the diseases and other entries are headed by the target odds used in MFM.