Flashfm (flexible and shared information fine-mapping) is a package to simultaneously fine-map genetic associations for multiple quantitative traits that are measured in the same study(s) by sharing information between the traits. It is flexible to the inclusion of related individuals in the sample and to missing trait measurements.
Flashfm makes use of summary-level data and requires as input:
This vignette introduces flashfm and gives an illustration of its use - with both forms of input for the reference genotype information - on simulated data of two traits with a shared causal variant. A simulated data set is provided in flashfm.
Flashfm shares information between traits by up-weighting joint models that have a shared causal variant between traits, in a similar Bayesian framework to Multinomial Fine-Mapping (MFM) for joint fine-mapping of multiple diseases with shared controls.
To generate posterior support for fine-mapping models, flashfm needs to calculate the Bayes’ factor (BF) for all possible model combinations across SNPs and traits. To make this computationally feasible, we derive an expression for the the joint BF that depends on the individual (marginal) trait BFs and terms that depend on the GWAS summary statistics, covariance matrix of the traits, and sample sizes; either a genotype matrix or both a SNP covariance matrix and relative allele frequencies (RAFs) are needed from a reference panel or in-study sample.
As in MFM, rather than reporting posterior probabilities (PPs) for each SNP model, we focus on SNP groups that we construct such that SNPs in the same group are in LD, have a similar effect on traits, and rarely appear together in a trait model. We constuct SNP groups based on the model PPs from single-trait fine-mapping and a separate set of SNP groups based on the model PPs from flashfm. In general, the groups constructed by flashfm tend to be a subset of those from single-trait fine-mapping, indicating that flashfm gives finer resolution that single-trait fine-mapping.
Details of how to run flashfm on your own data are explained through detailed examples of simulated data. We provide four ways to run flashfm, depending on whether or not single-trait fine-mapping results are already available or not.
Single-trait fine-mapping has not yet been run - need GWAS summary statistics for each trait, SNP correlation matrix, reference allele frequency vector, and trait covariance matrix. In one command, run single-trait fine-mapping (using JAM) and flashfm, as well as construct SNP groups for each approach and summarise results by SNPs and by SNP groups.
Single-trait fine-mapping has not yet been run - need GWAS summary statistics for each trait, SNP correlation matrix, reference allele frequency vector, and trait covariance matrix. In one command, run single-trait fine-mapping (using FINEMAP) and flashfm, as well as construct SNP groups for each approach and summarise results by SNPs and by SNP groups.
Single-trait fine-mapping has not yet been run - need GWAS summary statistics for each trait, SNP matrix, and trait covariance matrix. See second example.
Single-trait fine-mapping has been run using FINEMAP - need *config files from FINEMAP for each trait, GWAS summary statistics for each trait, SNP correlation matrix, reference allele frequency vector, and trait covariance matrix. See third (last) example.
In this simulated data example, we simulate two traits that each have two causal variants, of which one (rs61839660) is shared between traits; trait 1 has second causal variant rs62626317 and trait 2 has second causal variant rs11594656. The trait correlatin is 0.4.
For a region of 345 SNPs in chromosome 10p-6030000-6220000 (GRCh37/hg19), containing IL2RA, we generated a population of 100,000 individuals based on the CEU 1000 Genomes Phase 3 data using HapGen2. We selected a random sample of 2000 from this population and only retained the 334 variants with MAF > 0.005 in this sample. This genotype matrix is available in flashfm as the object X.
Measurements of two traits were generated using the mvrnorm function in R and genotype matrix X. In this example there are no missing measurements, so N=2000 for both traits.
library(flashfm)
head(snpinfo)
#> ID pos allele0 allele1
#> 1 rs12722563 6069561 G A
#> 2 rs12572054 6069853 T C
#> 3 rs12722561 6069893 C T
#> 4 rs75578321 6070158 C T
#> 5 rs12722559 6070273 C A
#> 6 rs12722558 6070276 A T
covY#> AD AC
#> AD 1.0222726 0.4259619
#> AC 0.4259619 1.0652991
ybar#> AD AC
#> 0.1861202 0.2435154
lapply(beta,head)
#> $AD
#> rs12722563 rs12572054 rs12722561 rs12722559 rs12722558 rs12722557
#> 0.1976915 0.0879946 0.1918139 0.1918139 0.1797842 0.1593434
#>
#> $AC
#> rs12722563 rs12572054 rs12722561 rs12722559 rs12722558 rs12722557
#> 0.2132664 -0.1483218 -0.1272511 -0.1272511 0.2219491 -0.3573181
1:5,1:5]
X[#> rs12722563 rs12572054 rs12722561 rs12722559 rs12722558
#> indiv7452 0 0 0 0 0
#> indiv8016 0 0 0 0 0
#> indiv7162 0 0 0 0 0
#> indiv8086 0 0 0 0 0
#> indiv7269 1 0 0 0 1
N#> [1] 2000 2000
head(raf)
#> rs12722563 rs12572054 rs12722561 rs12722559 rs12722558 rs12722557
#> 0.09450 0.04100 0.15200 0.15200 0.08925 0.02600
This example takes seconds to run since the output is saved as an object, fm, in the flashfm package - details follow.
In this example, we make use of a SNP correlation matrix and reference allele frequency (RAF) vector as input to flashfm and use the internal flashfm function for single-trait fine-mappping, JAMexpandedCor.multi. This function is an expanded version of JAM. The function FLASHFMwithJAM provides a wrapper to run single-trait fine-mapping on each trait and then uses these results as input to flashfm. The SNP groups are also constructed within this wrapper function, as well as output from flashfm summarised at both the SNP-level and SNP group level.
NOTE: SNP names must not contain “:” in FLASHFMwithJAM, JAMexpanded.multi, and JAMexpandedCor.multi. So, if SNP names are in the form “19:45301088_T_C” change to “chr19_45301088_T_C”. This could be done using the command
paste0("chr", gsub(":", "_", snpnames)
We first need the correlation matrix, which we compute from our genotype matrix X, retaining only the SNPs that are in the raf vector; the raf vector was previously filtered at MAF 0.005.
<- names(raf) # filtered on maf 0.005
msnps = cor(X[,msnps]) corX
The command to run is as below, and for convenience the output from this function is saved in the object fm. In order to use JAM for single-trait fine-mapping, the R2BGLiMS library must be loaded, in addition to flashfm.
library(R2BGLiMS)
<- FLASHFMwithJAM(beta, corX, raf, ybar, N, save.path="DIRtmp",TOdds=1,covY,cpp=0.99,NCORES=2) fm
where arguments are described in detail here, though each object gives an example of format:
beta is a list of named vectors for each trait to be included in fine-mapping; see example format of object beta in this package; create this object using b <- vector(“list”,2) and then assign beta vectors to each b[[i]], i=1,2. At most 5 traits can be used with this function. There is a fast approximation version called FLASHFMwithJAMhat that is used in the exact same way as FLASHFMwithJAM, and allows at most 6 traits. Only include traits that have a signal in the region, e.g. traits that have min p < 1E-6 in the region.
corX is a SNP correlation matrix from an in-sample or reference panel; corX may have column and row names as in this example, though this is not necessary. Otherwise, column names are taken from the raf vector; ; SNPs in corX MUST be in the same order as SNPs in raf. The allele coding for corX and raf MUST be based on the effect allele used for beta.
raf is a reference allele frequency named vector with names given by snp IDs; snps must appear in same order as in corX; the reference allele must match the effect allele in the GWAS that generated the beta vectors
ybar is a vector of trait means; if traits transformed to normal(0,1), set ybar to a vector of 0.
N is a vector of sample sizes for traits, in same order as trait effect vectors given in beta. This can be based on data, if unrelated samples, as here. Especially for samples with related individuals, we recommend finding the effective sample size for each trait using the built-in function:
Neff(raf, seB, Vy = 1),
where raf is the reference allele frequency vector, seB is the vector of standard errors of the effect estimates in the same order of SNPs in raf, and Vy is the trait variance, which is 1 by default, assuming traits are transformed to normal(0,1).
save.path lists a path to a directory to save the intermediate files generated by JAM. This directory can be created within R using the command dir.create(save.path).
TOdds=1 is the default for flashfm which gives a 50/50 odds that the traits share at least one causal variant
cpp=0.99 is the default for flashfm which prioritises single-trait fine-mapping models with cumulative posterior probability 0.99
NCORES is set to the number of traits to allow parallel processing in flashfm, though if on a Windows machine use NCORES=1
Our top group models are:
$mpp.pp$PPg
fm#> $AD
#> null 178
#> C%J 0.78953919 0.862765722
#> C%G%J 0.06563398 0.056235616
#> C%E%J 0.05310548 0.043432271
#> C%D%J 0.02213726 0.023118026
#> A%I 0.02059711 0.000000000
#> B 0.01875517 0.001065602
#> C 0.01090246 0.000619439
#>
#> $AC
#> null 178
#> A%J 1 1
Let’s check if our causal variants (in cvs vector) are included in the SNP groups from single and multi-trait fine-mapping and the sizes of the SNP groups
:::groupIDs.fn(fm$snpGroups[[1]],cvs) # based on FINEMAP SNP groups
flashfm#> [1] "J" "C" "A"
:::groupIDs.fn(fm$snpGroups[[2]],cvs) # based on flashfm SNP groups
flashfm#> [1] "J" "C" "A"
$snpGroups$group.sizes # first set of SNP groups is from FINEMAP, second set of SNP groups is from flashfm
fm#> A B C D E F G H I J
#> null 5 1 7 1 2 1 7 2 9 20
#> 178 3 1 7 1 2 0 6 2 0 6
Notice that the groups for the causal variants that were not shared (cvs[c(2,3)]) are the same size for both methods and that for the shared variant (cvs[1]), flashfm constructs a smaller group than FINEMAP. This illustrates how sharing information between the traits can refine fine-mapping resolution.
This example takes seconds to run since the output is saved as an object, fm2, in the flashfm package - details follow.
In this example, we make use of a SNP correlation matrix and reference allele frequency (RAF) vector as input to flashfm and use the FINEMAP software (Benner et al. 2016). The function FLASHFMwithFINEMAP provides a wrapper to run single-trait fine-mapping on each trait and then uses these results as input to flashfm. The SNP groups are also constructed within this wrapper function, as well as output from flashfm summarised at both the SNP-level and SNP group level.
We first need the correlation matrix, which we compute from our genotype matrix X, retaining only the SNPs that are in the raf vector; the raf vector was previously filtered at MAF 0.005.
<- names(raf) # filtered on maf 0.005
msnps = cor(X[,msnps]) corX
The command to run is as below, and for convenience the output from this function is saved in the object fm2.
<- FLASHFMwithFINEMAP(gwas.list,corX,raf,ybar,N,fstub="DIRresults/eg",TOdds=1,covY,cpp=.99,NCORES=2,FMpath="/software/finemap_v1.4_x86_64/finemap_v1.4_x86_64") fm2
where arguments are described in detail here, though each object gives an example of format:
gwas.list is a list of two data.frames - one for each trait. It has columns “rsid”, “chromosome”, “position”, “allele1”, “allele2”, “maf”, “beta”, “se”, which are, respectively, the SNP id, base-pair position, effect allele, non-effect allele, minor allele frequency, effect estimate, standard error of effect estimate; this is the same format as the .z file of FINEMAP. At most 5 traits can be used with this function. Only include traits that have a signal in the region, e.g. traits that have min p < 1E-6 in the region.
corX is a SNP correlation matrix from an in-sample or reference panel; corX may have column and row names as in this example, though this is not necessary. Otherwise, column names are taken from the raf vector; SNPs in corX MUST be in the same order as SNPs in raf. The allele coding for corX and raf MUST be based on the effect allele used for beta.
raf is a reference allele frequency named vector with names given by snp IDs; snps must appear in same order as in corX; the reference allele must match the effect allele in the GWAS that generated the beta vectors
ybar is a vector of trait means; if traits transformed to normal(0,1), set ybar to a vector of 0.
N is a vector of sample sizes for traits, in same order as trait effect vectors given in beta. This can be based on data, if unrelated samples, as here. Especially for samples with related individuals, we recommend finding the effective sample size for each trait using the built-in function:
Neff(raf, seB, Vy = 1),
where raf is the reference allele frequency vector, seB is the vector of standard errors of the effect estimates in the same order of SNPs in raf, and Vy is the trait variance, which is 1 by default, assuming traits are transformed to normal(0,1).
fstub is the file stub for input/output files of FINEMAP, e.g. if fstub=“DIRresults/region1”, FINEMAP files of the form “DIRresults/region1.z” will be created
TOdds=1 is the default for flashfm which gives a 50/50 odds that the traits share at least one causal variant
cpp=0.99 is the default for flashfm which prioritises single-trait fine-mapping models with cumulative posterior probability 0.99
NCORES is set to the number of traits to allow parallel processing in flashfm, though if on a Windows machine use NCORES=1
FMpath is the file pathway to the FINEMAP software e.g. “/software/finemap_v1.4_x86_64/finemap_v1.4_x86_64”
Our top group models are:
lapply(fm2$mpp.pp$PPg,head)
#> $AD
#> null 178
#> F%G 0.51730036 0.582679505
#> D%F%G 0.14201391 0.119526442
#> A%F%G 0.13593911 0.126942848
#> C%F%G 0.08873371 0.074572455
#> B%F%G 0.02782625 0.025925704
#> A%D%F%G 0.00968424 0.007232427
#>
#> $AC
#> null 178
#> E%F 0.845159439 0.878740258
#> E%F%rs184023299 0.009990423 0.007476973
#> E%F%rs12722557 0.009574463 0.007445177
#> E%F%rs11256464 0.006689188 0.005138940
#> E%F%F 0.005154478 0.002898676
#> E%F%rs79496268 0.004633628 0.003615495
Let’s check if our causal variants (in cvs vector) are included in the SNP groups from single and multi-trait fine-mapping and the sizes of the SNP groups
:::groupIDs.fn(fm2$snpGroups[[1]],cvs)
flashfm#> [1] "F" "G" "E"
:::groupIDs.fn(fm2$snpGroups[[2]],cvs)
flashfm#> [1] "F" "G" "E"
$snpGroups$group.sizes # first set of SNP groups is from FINEMAP, second set of SNP groups is from flashfm
fm2#> A B C D E F G
#> null 7 7 1 2 3 20 7
#> 178 7 7 1 2 3 11 7
Notice that the groups for the causal variants that were not shared (cvs[c(2,3)]) are the same size for both methods and that for the shared variant (cvs[1]), flashfm (fm2\(snpGroups[[2]]) constructs a smaller group than FINEMAP (fm2\)snpGroups[[1]]). This illustrates how sharing information between the traits can refine fine-mapping resolution.
This example takes seconds to run.
In this example, we make use of a reference genotype matrix as input to flashfm and use the internal flashfm function for single-trait fine-mappping, JAMexpanded.multi. This function is an expanded version of JAM, and requires a pathway to the software PLINK.
NOTE: SNP names must not contain “:” in FLASHFMwithJAM, JAMexpanded.multi, and JAMexpandedCor.multi. So, if SNP names are in the form “19:45301088_T_C” change to “chr19_45301088_T_C”. This could be done using the command
paste0("chr", gsub(":", "_", snpnames)
For efficiency, we provide the output from JAMexpanded.multi in the object JAMmain.input. The command used to generate it is:
<- JAMexpanded.multi(beta,X,snpinfo,ybar,diag(covY),N,chr=10,fstub=fstub,mafthr=0.005,path2plink="/software/plink",r2=0.99,save.path=fstubJ,related=FALSE,y=NULL) JAMmain.input
In the above command, ftsub and fstubJ are file prefixes, including file path, to save intermediate results. The related logical flag indicates whether or not there are related individuals in the sample. In the simulated data there is no relatedness, so relate=FALSE. If relatedness, then effective sample sizes need to be input for N; the built-in function Neff estimates effective sample size for each trait from the GWAS summary statistics.
Flashfm is then run with the following lines:
<- summaryStats(Xmat=TRUE,ybar.all=ybar,main.input=JAMmain.input)
ss.stats <- flashfm(JAMmain.input, TOdds=1,covY,ss.stats,cpp=0.99,maxmod=NULL,fastapprox=FALSE)
fm.multi #> Loading required package: GUESSFM
#> Warning: replacing previous import 'data.table::melt' by 'reshape::melt' when
#> loading 'GUESSFM'
#>
#> Attaching package: 'GUESSFM'
#> The following objects are masked from 'package:flashfm':
#>
#> calc.maxmin, tag
#> Trait 1 (AD) has cpp before adjustment: 0.99
#> Trait 2 (AC) has cpp before adjustment: 0.99
and SNP groups for both single-trait fine-mapping and flashfm are constructed by
<- makeSNPgroups2(JAMmain.input,fm.multi,is.snpmat=TRUE,r2.minmerge=0.7)
snpGroups #> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
#> SNP group sizes based on single-trait results are:
#> Group Size
#> A 1
#> B 2
#> C 2
#> D 5
#> E 3
#> F 1
#> G 6
#> H 15
#> SNP group sizes based on flashfm results are:
#> Group Size
#> A 1
#> B 2
#> C 1
#> D 1
#> E 3
#> G 6
#> H 5
where the is.snpmat flag=TRUE indicates that the reference genotype matrix was provided and r2.minmerge is the minimum LD between SNPs in different groups for them to merge; r2.minmerge=0.7 means that if a SNP in 1 group has r2>0.7 with a SNP in another group, these 2 groups are merged. The first list of SNP groups were based on single-trait fine-mapping and the second list is based on flashfm.
The next function summarises the fine-mapping results using the SNP groups:
<- PPsummarise(fm.multi, snpGroups, minPP=0.05) mpp.pp
This returns a list of 4 objects: MPP lists trait-specific PP of SNP inclusion in a model; MPPg lists trait-specific PP of SNP group inclusion in a model; PP lists trait-specific model PP; PPg lists trait-specific model PP in terms of SNP group Setting minPP=0.05 restricts output for group models to only those with PP>0.05 for at least one trait.
Our top group models are:
$PPg
mpp.pp#> $AD
#> null 182
#> G%H 0.78639873 0.84899405
#> B%G%H 0.06359364 0.05141176
#>
#> $AC
#> null 182
#> E%H 0.9929384 0.9967118
Let’s check if our causal variants (in cvs vector) are included in the SNP groups from single and multi-trait fine-mapping and the sizes of the SNP groups
:::groupIDs.fn(snpGroups[[1]],cvs)
flashfm#> [1] "H" "G" "E"
:::groupIDs.fn(snpGroups[[2]],cvs)
flashfm#> [1] "H" "G" "E"
$group.sizes
snpGroups#> A B C D E F G H
#> null 1 2 2 5 3 1 6 15
#> 182 1 2 1 1 3 0 6 5
Notice that the groups for the causal variants that were not shared (cvs[c(2,3)]) are the same size for both methods and that for the shared variant (cvs[1]), flashfm constructs a smaller group than JAMexpanded.multi. This illustrates how sharing information between the traits can refine fine-mapping resolution.
This example takes around one minute to run.
Here, we use single-trait fine-mapping results from FINEMAP and show how to prepare it for use in flashfm.
The *.config files from FINEMAP are processed by the lines
<- vector("list",2)
FMconfig for(i in 1:2) FMconfig[[i]] <- read.table(paste0("finemap",i,".config"), header = TRUE, as.is = TRUE, sep = " ")
and saved in the flashfm object FMconfig, which is prepared for flashfm by creating a list of the FINEMAP results for each trait:
<- vector("list",2)
modPP.list for(i in 1:2) {modPP.list[[i]] <- data.frame(FMconfig[[i]][,c("config","prob")]); colnames(modPP.list[[i]]) <- c("snps","PP")}
names(modPP.list) <- names(beta)
lapply(modPP.list,head)
#> $AD
#> snps PP
#> 1 rs12722496,rs62626317 0.0667081
#> 2 rs12722495,rs62626317 0.0492379
#> 3 rs7909519,rs62626317 0.0355550
#> 4 rs12722508,rs62626317 0.0340655
#> 5 rs12722522,rs62626317 0.0269347
#> 6 rs12722496,rs41295055 0.0189747
#>
#> $AC
#> snps PP
#> 1 rs7909519,rs11594656 0.0882367
#> 2 rs7909519,rs35285258 0.0757998
#> 3 rs61839660,rs11594656 0.0732223
#> 4 rs12722496,rs11594656 0.0717890
#> 5 rs12722508,rs11594656 0.0653850
#> 6 rs61839660,rs35285258 0.0629490
We provide the covariance matrix of the genotypes from X:
<- names(raf) # filtered on maf 0.005
msnps = cov(X[,msnps]) covX
However, if the SNP correlation matrix is available, the covariance matrix may be found using the correlation matrix and reference allele frequencies in the function cor2cov. This function is from the MBESS package and has been imported to flashfm; cor2cov is authored by Ken Kelley (University of Notre Dame; () and Keke Lai (the package), with modifications by Dustin Fife .
<- cor2cov(corX,sd=sqrt(2*raf*(1-raf))) covX
N is a vector of sample sizes for traits, in same order as trait effect vectors given in beta. This can be based on data, if unrelated samples, as here. Especially for samples with related individuals, we recommend finding the effective sample size for each trait using the built-in function:
Neff(raf, seB, Vy = 1),
where raf is the reference allele frequency vector, seB is the vector of standard errors of the effect estimates in the same order of SNPs in raf, and Vy is the trait variance, which is 1 by default, assuming traits are transformed to normal(0,1).
Similar lines to the previous example are run, but flashfm.input is needed to prepare input to flashfm, and its output has the same form as JAMexpanded.multi.
<- flashfm.input(modPP.list,beta1.list=beta,Gmat=covX,Nall=N,ybar.all=ybar,is.snpmat=FALSE,raf=raf)
FMmain.input #> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
<- summaryStats(Xmat=FALSE,ybar.all=ybar,main.input=FMmain.input)
ss.stats <- flashfm(FMmain.input, TOdds=1,covY,ss.stats,cpp=0.99,maxmod=NULL,fastapprox=FALSE)
fm.multi #> Trait 1 (AD) has cpp before adjustment: 0.99
#> Trait 2 (AC) has cpp before adjustment: 0.99
<- makeSNPgroups2(FMmain.input,fm.multi,is.snpmat=FALSE,r2.minmerge=0.6)
snpGroups #> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
#> calculating marginal SNP inclusion probabilities
#> SNP group sizes based on single-trait results are:
#> Group Size
#> A 3
#> B 20
#> C 7
#> D 1
#> E 2
#> F 7
#> G 7
#> SNP group sizes based on flashfm results are:
#> Group Size
#> A 3
#> B 11
#> C 7
#> D 1
#> E 2
#> F 7
#> G 7
<- PPsummarise(fm.multi, snpGroups, minPP=0.05) mpp.pp
Our top group models are:
$PPg
mpp.pp#> $AD
#> null 178
#> B%C 0.52565016 0.59025606
#> B%C%E 0.13577230 0.11475159
#> B%C%F 0.12812612 0.11900824
#> B%C%D 0.08820581 0.07504605
#>
#> $AC
#> null 178
#> A%B 0.8460312 0.8790166
Let’s check if our causal variants (in cvs vector) are included in the SNP groups from single and multi-trait fine-mapping and the sizes of the SNP groups
:::groupIDs.fn(snpGroups[[1]],cvs)
flashfm#> [1] "B" "C" "A"
:::groupIDs.fn(snpGroups[[2]],cvs)
flashfm#> [1] "B" "C" "A"
$group.sizes
snpGroups#> A B C D E F G
#> null 3 20 7 1 2 7 7
#> 178 3 11 7 1 2 7 7
Notice that the groups for the causal variants that were not shared (cvs[c(2,3)]) are the same size for both methods and that for the shared variant (cvs[1]), flashfm constructs a smaller group than FINEMAP. This illustrates how sharing information between the traits can refine fine-mapping resolution.