Intro
GENOSCORES is a platform which encapsulates an R package, and a web platform which are linked by a public API. This provides flexibility of using GENOSCORES, as one is able to:
- Use R package with the example script directly (recommended).
- Write own R code to interact with the R package and/or the API.
- Write custom calls to the API in any language of choice.
The R package is by far the easiest way to use GENOSCORES to compute genotypic scores. One can simply change the parameters in the provided example script and run the analysis. The script will then call relevant package functions which will make API calls to retrieve the data automatically. Hence, no knowledge of the package, or the API is needed.
Score computation workflow
Genotypic scores are computed as a sum of genotypes, g1, weighted by the effect size estimate, β, and further multiplied by the inverse of the SNP-SNP correlation matrix2, R, to adjust the score for LD structure.
The score for an individual
i
for traitt
is given by a product of β for traitt
by the inverse of R by the vector of genotypes g for individuali
.
Genotypes g are provided by the user in one of the supported formats: PLINK, dosage or bgendosage (bgen file transformed into dosage format with qctools). However, we only support input file preprocessing for files in PLINK format.
The weights (effect sizes) β are stored in
GENOSCORES3 database.
Effect sizes β for selected studies and/or trait types
are downloaded automatically during analysis. Summary statistics (weights) are
filtered at a p-value of 1E-5
. This can be changed in
example script depending on the analysis needs.
We use a reference panel to compute the correlation matrix R. User is expected to provide a reference panel file in PLINK format. For example, we often use European samples in the 1000Genomes project as a reference panel. If the reference panel is not provided, LD adjustment will be omitted.
Details
We compute genotypic scores separately for each trait-associated region of the genome. This means that for each GWAS requested from our database you may get more than 1 score computed.
Example: If the GWAS of gene expression for a given gene has detected a cis-eQTL and a trans-eQTL, then two scores would be computed corresponding to the cis locus and the trans locus.
Trait-associated regions are defined as genomic regions containing at least one
SNP with p-value < 1E-7
. You can change this threshold in
example script if needed (e.g. use genome-wide significance).
The boundaries of each genomic region are defined by positions at which there
is a gap of at least one megabase from any other SNP in the filtered set of SNPs
for that GWAS. SNPs in the filtered set that are not assigned to trait-associated
regions are assigned to a global “background” score for the GWAS. The
"background" score has a score id with suffix _0
.
To compute a genome-wide score, you can simply add all the regional scores and the background score for a given GWAS.
Getting started
You may need to run GENOSCORES on high performance computers depending on your use case4.
Create user account
Navigate to the registration page and create your user account.
This is needed to authenticate the requests to the web platform that R package makes on your behalf.
Download and install the package
Download the R package, and the example script.
To install the R package navigate to the containing directory and execute the following command:
R CMD INSTALL genoscores_xxx.tar.gz
here xxx
stands for the package version.
Install the dependencies
GENOSCORES R package requires the fallowing R packages to be installed:
"RMariaDB" "Rcpp" "bigmemory" "corrplot" "data.table"
"doParallel" "foreach" "ggplot2" "gtools" "httr"
"jsonlite" "Matrix" "RColorBrewer" "yaml" "BH"
"RcppEigen"
To install the above package in a single command, simply run inside R session:
install.packages(c("RMariaDB", "Rcpp", "bigmemory", "corrplot", "data.table",
"doParallel", "foreach", "ggplot2", "gtools", "httr", "jsonlite", "Matrix",
"RColorBrewer", "yaml", "BH", "RcppEigen"))
You also need to install bedtools, liftOver and PLINK (if your data is in PLINK format).
Notes on installing bedtools
Pull the package with
wget
:wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz
Extract the package with
tar
:tar xzf bedtools-2.30.0.tar.gz
Compile and build bedtools:
Navigate to the source directory and run:
make install
By default, bedtools will be installed in
/usr/local/bin
. If this directory is not accessible, use a custom path:make DESTDIR=/custom/dir install
do not forget to add that custom directory to your system path
Initialise the package
To use GENOSCORES you need to authorize the package to make web requests to the web application on your behalf. During initialization the package will also check that all the dependencies have been installed correctly.
Initialisation is interactive. In R session load GENOSCORES library and run initialisation like this:
library("genoscores")
initialise.genoscores()
This will start an interactive initialisation session. Follow the on-screen instructions.
Upon successful initialisation, a hidden file .genoscores.yml
will be
created in your HOME
directory. It contains credentials for authenticating
the package
requests to the web platform. Please do not move,
copy or otherwise expose this information to any third parties.
At this point everything should be ready to run the analysis!
Inputs
All the analysis input parameters are described in detail in the example script, and you are encouraged to examine this script carefully. Here we give a generic overview of all the configuration that the example script provides.
Specify which studies and/or trait types to use
You can select which GWAS to be used in the analysis by providing a vector of study ids and/or a vector of trait types. Check Studies page for details.
In example script this is controlled by the parameters:
studyids
and trait.types
.
Both studyids
and trait.types
or either of them
can be specified.
Configure system resources
This section of example script controls resource allocation and configures execution of the package functions in parallel mode.
Important parameters to tweak are:
num.cores
and num.batches
.
Also consider parameter memory.plink
is the input file format is PLINK.
Set parameters for input data processing
The input file should contain genotype data from the patients for which the regional genotypic scores are to be computed. We currently support PLINK, dosage, bgendosage (bgen file transformed into dosage format with qctools) as input file formats. Note, that we currently support the input file pre-processing only for PLINK files.
For input data specified in PLINK format we offer two stages of data pre-processing:
- Update the SNP ids in the input file to the most recent build.
Checkout
prepare.plinkfile
. - Filter SNP ids in the input file to keep only those present in
the GENOSCORES database. Checkout
filter.snpids
.
Updating of the genomic positions requires a liftOver chain file to lift positions to build hg38. If your data is already in build hg38 you can skip this step.
If you want to find out the build of your data, you can use
guess.genome.build(input.file)
function shipped with the package.
Set parameters for score computation and adjustment for LD
This section of example script contains quite a few parameters which control various features of the analysis such as:
- splitting the genome into genomic regions
- filtering SNPs by
p-value
- LD adjustment reference panel
- LD adjustment methods
On linkage disequilibrium
The presence of linkage disequilibrium (LD) can affect the quality of the scores generated from the target input file, as the contribution of SNPs in high LD may overwhelm that of other SNPs. To avoid a potential distortion of the scores, the effect sizes (betas) of SNPs in high LD need to be adjusted before the computation of scores.
GENOSCORES is able to perform such LD adjustment automatically if a reference
population genome is provided in the ld.refplinkfile
option of get.scores()
.
Therefore, it is strongly recommended to provide reference PLINK file for the
same population as in the target file.
As an example, below we present the list of steps required to generate such file using EUR ancestry in 1000 Genomes Project (other reference projects such as UK10K can be used as well):
- Download the reference genotypes from 1000 Genomes Project data portal. Make sure that the genotypes are in PLINK format and only samples of European ancestry are present, if your target population is of European ancestry.
- Update RSids to build hg38. You can use
prepare.plinkfile()
from GENOSCORES to do this.
Now you can use this file as ld.refplinkfile
for LD adjustment in GENOSCORES.
Outputs
When you execute the example script, GENOSCORES will compute a genotypic score for each sample in the target cohort (provided genotypes g) and each trait-associated region for the selected GWAS. See Score computation workflow for details about the score computation.
GENOSCORES will produce the following files and store them in the specified output directory.
genotypicscore.Rdata.gz
Contains the (number_of_samples X number_of_scores
) matrix with the computed
genotypic scores.
For example:
sample/score | 1_0 | 1_2 | 1_3 | 2_0 | 2_2 | 2_3 |
---|---|---|---|---|---|---|
f1_1 | 0.00 | -0.34 | -0.26 | 0.00 | -0.29 | -0.23 |
f1_2 | 0.17 | 0.00 | -0.23 | 0.15 | 0.00 | -0.21 |
f1_3 | 0.00 | -0.69 | 0.00 | 0.00 | -0.58 | 0.00 |
The rows of the matrix contain IDs of the individuals. If the input file format is PLINK, then a composite ID5 is used.
The columns of the matrix contain score IDs. Consider score 1_0
as example.
The first digit 1
stands for gwasid
: identifier of a particular GWAS from
which β (weights) were used to compute the score.
The second digit 0
specifies that the score is a background score.
If the second digit is not 0
, than the score is a regional score.
catalog.Rdata.gz
Is R data.table
containing aggregated information on scores, GWAS,
β (weights) used to compute these scores and studies
from which these weights come.
Column | - | Description |
---|---|---|
gwasid |
ID of a GWAS used to compute a score. | |
region |
ID of a genomic region of a score. If region=0 , the score is a background score. |
|
chrom |
Chromosome that carries the top SNP of a score. | |
pos |
Genomic position of the top SNP of a score. | |
studyid |
ID of a study containing the GWAS used in computing a score. | |
traitid |
ID of a trait for which GWAS was performed. | |
dbsnpid |
ID of the top SNP of a score in DbSNP table. | |
trait_name |
Name of the trait for which GWAS was performed. For example rheumatoid arthritis . |
|
trait_type |
Type of the trait. For example clinical . |
|
tissue |
Tissue in which the trait was measured. For example whole blood . |
|
allele |
The reference allele we have considered. | |
beta |
Effect size (weight) taken from the GWAS for a given trait which was used in computing a score. | |
pvalue |
p-value of the top SNP of a score. |
|
dbsnpref |
Reference allele as reported by DbSNP | |
dbsnpalt |
Alt allele as reported by DbSNP | |
pubmedid |
PubMed identifier of a publication containing the GWAS used to compute a score. | |
n_samples |
Number of samples used in the study. | |
n_cases |
Number of cases if this is a case-control study. | |
n_controls |
Number of controls if this is a case-control study. | |
ancestry |
Genetic ancestry of the samples in the study. | |
chrom_int |
The chromosome as an integer (with X=23, Y=24, etc). | |
snp |
SNP as provided by the user. | |
startpos |
Genomic position of the beginning of the score. | |
endpos |
Genomic position of the end of the score. | |
minpvalue |
Minimum p-value recorded within a region. | |
scoreid |
Score identifier. |
Combination of gwasid
and region
makes a score ID.
scoresinfo.Rdata.gz
Is R data.table
containing information for each score. Most columns here are
identical to catalog.Rdata.gz
except:
Column | - | Description |
---|---|---|
n_snps |
Specifies how many SNPs contribute to the score. | |
nearby_genes |
Lists genes which are located close to the region of the score. |
Note that identical information is also saved to scoresinfo.csv
.
For traits of type QTLs
, user may call function annotate.qtls
and pass
the scoresinfo
table loaded from scoresinfo.Rdata.gz
(or scoresinfo.csv
) to
Annotate the table of scores information with gene positions of QTLs.
Currently, the trait types supported by this function are "transcript" and
"splicing". This procedure appends additional columns to scoresinfo
:
Column | - | Description |
---|---|---|
gene_symbol |
Short name of a gene as specified in Ensembl genome browser | |
gene_distance |
Is a distance from a gene to the score | |
qtl_type |
cis (gene < 50kb from the score), cis-x (gene < 5Mb from the score), trans (gene is far from score). |
Column scoreid
contains the score ID and can be
used to match each column in genotypicscore.Rdata.gz
matrix with scoresinfo.Rdata.gz
.
settings.Rdata.gz
This file contains all the settings specified in the example script and used to compute scores (p-value thresholds, LD-adjustment parameters, etc). Full specification is available in example script. The key parameters are listed in Inputs. This is useful for reproducibility.
For example, if the input file format is PLINK, the genotypes are the scored (as 0, 1, 2)copies of the reference allele. ↩
We have implemented a pseudo-inverse solution to invert singular correlation matrices occurring in regions of high LD. ↩
We search for, download and process publicly available summary statistics from GWASs on a number of traits including: disease states (especially autoimmune diseases), clinical traits, gene expression in various tissues, metabolites and proteins, and others. You can check the Studies page for a complete list of studies and trait types currently available. ↩
GENOSCORES R package can be installed and run on LINUX and macOS. We do not currently support Windows. GENOSCORES performs expensive matrix operations on user computers to calculate scores. If large input datasets are used, it may be beneficial to use high performance multiple cores Linux machines or an HPC. ↩
For example, if sample has ID
f1_2
,1
stands for the family ID and2
represents a within-family ID. Refer to PLINK .fam format specification for more details. ↩