Quickstart guide to GENOSCORES

Intro

GENOSCORES is a platform which encapsulates an R package, and a web platform which are linked by a public API. This provides flexibility of using GENOSCORES, as one is able to:

  • Use R package with the example script directly (recommended).
  • Write own R code to interact with the R package and/or the API.
  • Write custom calls to the API in any language of choice.

The R package is by far the easiest way to use GENOSCORES to compute genotypic scores. One can simply change the parameters in the provided example script and run the analysis. The script will then call relevant package functions which will make API calls to retrieve the data automatically. Hence, no knowledge of the package, or the API is needed.

Score computation workflow

Genotypic scores are computed as a sum of genotypes, g1, weighted by the effect size estimate, β, and further multiplied by the inverse of the SNP-SNP correlation matrix2, R, to adjust the score for LD structure.

The score for an individual i for trait t is given by a product of β for trait t by the inverse of R by the vector of genotypes g for individual i.

Genotypes g are provided by the user in one of the supported formats: PLINK, dosage or bgendosage (bgen file transformed into dosage format with qctools). However, we only support input file preprocessing for files in PLINK format.

The weights (effect sizes) β are stored in GENOSCORES3 database. Effect sizes β for selected studies and/or trait types are downloaded automatically during analysis. Summary statistics (weights) are filtered at a p-value of 1E-5. This can be changed in example script depending on the analysis needs.

We use a reference panel to compute the correlation matrix R. User is expected to provide a reference panel file in PLINK format. For example, we often use European samples in the 1000Genomes project as a reference panel. If the reference panel is not provided, LD adjustment will be omitted.

Details

We compute genotypic scores separately for each trait-associated region of the genome. This means that for each GWAS requested from our database you may get more than 1 score computed.

Example: If the GWAS of gene expression for a given gene has detected a cis-eQTL and a trans-eQTL, then two scores would be computed corresponding to the cis locus and the trans locus.

Trait-associated regions are defined as genomic regions containing at least one SNP with p-value < 1E-7. You can change this threshold in example script if needed (e.g. use genome-wide significance). The boundaries of each genomic region are defined by positions at which there is a gap of at least one megabase from any other SNP in the filtered set of SNPs for that GWAS. SNPs in the filtered set that are not assigned to trait-associated regions are assigned to a global “background” score for the GWAS. The "background" score has a score id with suffix _0.

To compute a genome-wide score, you can simply add all the regional scores and the background score for a given GWAS.

Getting started

You may need to run GENOSCORES on high performance computers depending on your use case4.

Create user account

Navigate to the registration page and create your user account.

This is needed to authenticate the requests to the web platform that R package makes on your behalf.

Download and install the package

Download the R package, and the example script.

To install the R package navigate to the containing directory and execute the following command:

R CMD INSTALL genoscores_xxx.tar.gz

here xxx stands for the package version.

Install the dependencies

GENOSCORES R package requires the fallowing R packages to be installed:

"RMariaDB"     "Rcpp"         "bigmemory"    "corrplot"     "data.table"
"doParallel"   "foreach"      "ggplot2"      "gtools"       "httr"
"jsonlite"     "Matrix"       "RColorBrewer" "yaml"         "BH"
"RcppEigen"

To install the above package in a single command, simply run inside R session:

install.packages(c("RMariaDB", "Rcpp", "bigmemory", "corrplot", "data.table",
"doParallel", "foreach", "ggplot2", "gtools", "httr", "jsonlite", "Matrix",
"RColorBrewer", "yaml", "BH", "RcppEigen"))

You also need to install bedtools, liftOver and PLINK (if your data is in PLINK format).

Notes on installing bedtools

Pull the package with wget:

wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools-2.30.0.tar.gz

Extract the package with tar:

tar xzf bedtools-2.30.0.tar.gz

Compile and build bedtools:

Navigate to the source directory and run:

make install

By default, bedtools will be installed in /usr/local/bin. If this directory is not accessible, use a custom path:

make DESTDIR=/custom/dir install

do not forget to add that custom directory to your system path

Initialise the package

To use GENOSCORES you need to authorize the package to make web requests to the web application on your behalf. During initialization the package will also check that all the dependencies have been installed correctly.

Initialisation is interactive. In R session load GENOSCORES library and run initialisation like this:

library("genoscores")
initialise.genoscores()

This will start an interactive initialisation session. Follow the on-screen instructions.

Upon successful initialisation, a hidden file .genoscores.yml will be created in your HOME directory. It contains credentials for authenticating the package requests to the web platform. Please do not move, copy or otherwise expose this information to any third parties.

At this point everything should be ready to run the analysis!

Inputs

All the analysis input parameters are described in detail in the example script, and you are encouraged to examine this script carefully. Here we give a generic overview of all the configuration that the example script provides.

Specify which studies and/or trait types to use

You can select which GWAS to be used in the analysis by providing a vector of study ids and/or a vector of trait types. Check Studies page for details.

In example script this is controlled by the parameters: studyids and trait.types. Both studyids and trait.types or either of them can be specified.

Configure system resources

This section of example script controls resource allocation and configures execution of the package functions in parallel mode.

Important parameters to tweak are: num.cores and num.batches.

Also consider parameter memory.plink is the input file format is PLINK.

Set parameters for input data processing

The input file should contain genotype data from the patients for which the regional genotypic scores are to be computed. We currently support PLINK, dosage, bgendosage (bgen file transformed into dosage format with qctools) as input file formats. Note, that we currently support the input file pre-processing only for PLINK files.

For input data specified in PLINK format we offer two stages of data pre-processing:

  • Update the SNP ids in the input file to the most recent build. Checkout prepare.plinkfile.
  • Filter SNP ids in the input file to keep only those present in the GENOSCORES database. Checkout filter.snpids.

Updating of the genomic positions requires a liftOver chain file to lift positions to build hg38. If your data is already in build hg38 you can skip this step.

If you want to find out the build of your data, you can use guess.genome.build(input.file) function shipped with the package.

Set parameters for score computation and adjustment for LD

This section of example script contains quite a few parameters which control various features of the analysis such as:

  • splitting the genome into genomic regions
  • filtering SNPs by p-value
  • LD adjustment reference panel
  • LD adjustment methods

On linkage disequilibrium

The presence of linkage disequilibrium (LD) can affect the quality of the scores generated from the target input file, as the contribution of SNPs in high LD may overwhelm that of other SNPs. To avoid a potential distortion of the scores, the effect sizes (betas) of SNPs in high LD need to be adjusted before the computation of scores.

GENOSCORES is able to perform such LD adjustment automatically if a reference population genome is provided in the ld.refplinkfile option of get.scores(). Therefore, it is strongly recommended to provide reference PLINK file for the same population as in the target file.

As an example, below we present the list of steps required to generate such file using EUR ancestry in 1000 Genomes Project (other reference projects such as UK10K can be used as well):

  1. Download the reference genotypes from 1000 Genomes Project data portal. Make sure that the genotypes are in PLINK format and only samples of European ancestry are present, if your target population is of European ancestry.
  2. Update RSids to build hg38. You can use prepare.plinkfile() from GENOSCORES to do this.

Now you can use this file as ld.refplinkfile for LD adjustment in GENOSCORES.

Outputs

When you execute the example script, GENOSCORES will compute a genotypic score for each sample in the target cohort (provided genotypes g) and each trait-associated region for the selected GWAS. See Score computation workflow for details about the score computation.

GENOSCORES will produce the following files and store them in the specified output directory.

genotypicscore.Rdata.gz

Contains the (number_of_samples X number_of_scores) matrix with the computed genotypic scores.

For example:

sample/score 1_0 1_2 1_3 2_0 2_2 2_3
f1_1 0.00 -0.34 -0.26 0.00 -0.29 -0.23
f1_2 0.17 0.00 -0.23 0.15 0.00 -0.21
f1_3 0.00 -0.69 0.00 0.00 -0.58 0.00

The rows of the matrix contain IDs of the individuals. If the input file format is PLINK, then a composite ID5 is used.

The columns of the matrix contain score IDs. Consider score 1_0 as example. The first digit 1 stands for gwasid: identifier of a particular GWAS from which β (weights) were used to compute the score. The second digit 0 specifies that the score is a background score. If the second digit is not 0, than the score is a regional score.

catalog.Rdata.gz

Is R data.table containing aggregated information on scores, GWAS, β (weights) used to compute these scores and studies from which these weights come.

Column - Description
gwasid ID of a GWAS used to compute a score.
region ID of a genomic region of a score. If region=0, the score is a background score.
chrom Chromosome that carries the top SNP of a score.
pos Genomic position of the top SNP of a score.
studyid ID of a study containing the GWAS used in computing a score.
traitid ID of a trait for which GWAS was performed.
dbsnpid ID of the top SNP of a score in DbSNP table.
trait_name Name of the trait for which GWAS was performed. For example rheumatoid arthritis.
trait_type Type of the trait. For example clinical.
tissue Tissue in which the trait was measured. For example whole blood.
allele The reference allele we have considered.
beta Effect size (weight) taken from the GWAS for a given trait which was used in computing a score.
pvalue p-value of the top SNP of a score.
dbsnpref Reference allele as reported by DbSNP
dbsnpalt Alt allele as reported by DbSNP
pubmedid PubMed identifier of a publication containing the GWAS used to compute a score.
n_samples Number of samples used in the study.
n_cases Number of cases if this is a case-control study.
n_controls Number of controls if this is a case-control study.
ancestry Genetic ancestry of the samples in the study.
chrom_int The chromosome as an integer (with X=23, Y=24, etc).
snp SNP as provided by the user.
startpos Genomic position of the beginning of the score.
endpos Genomic position of the end of the score.
minpvalue Minimum p-value recorded within a region.
scoreid Score identifier.

Combination of gwasid and region makes a score ID.

scoresinfo.Rdata.gz

Is R data.table containing information for each score. Most columns here are identical to catalog.Rdata.gz except:

Column - Description
n_snps Specifies how many SNPs contribute to the score.
nearby_genes Lists genes which are located close to the region of the score.

Note that identical information is also saved to scoresinfo.csv.

For traits of type QTLs, user may call function annotate.qtls and pass the scoresinfo table loaded from scoresinfo.Rdata.gz (or scoresinfo.csv) to Annotate the table of scores information with gene positions of QTLs. Currently, the trait types supported by this function are "transcript" and "splicing". This procedure appends additional columns to scoresinfo:

Column - Description
gene_symbol Short name of a gene as specified in Ensembl genome browser
gene_distance Is a distance from a gene to the score
qtl_type cis (gene < 50kb from the score), cis-x (gene < 5Mb from the score), trans (gene is far from score).

Column scoreid contains the score ID and can be used to match each column in genotypicscore.Rdata.gz matrix with scoresinfo.Rdata.gz.

settings.Rdata.gz

This file contains all the settings specified in the example script and used to compute scores (p-value thresholds, LD-adjustment parameters, etc). Full specification is available in example script. The key parameters are listed in Inputs. This is useful for reproducibility.


  1. For example, if the input file format is PLINK, the genotypes are the scored (as 0, 1, 2)copies of the reference allele. 

  2. We have implemented a pseudo-inverse solution to invert singular correlation matrices occurring in regions of high LD. 

  3. We search for, download and process publicly available summary statistics from GWASs on a number of traits including: disease states (especially autoimmune diseases), clinical traits, gene expression in various tissues, metabolites and proteins, and others. You can check the Studies page for a complete list of studies and trait types currently available. 

  4. GENOSCORES R package can be installed and run on LINUX and macOS. We do not currently support Windows. GENOSCORES performs expensive matrix operations on user computers to calculate scores. If large input datasets are used, it may be beneficial to use high performance multiple cores Linux machines or an HPC

  5. For example, if sample has ID f1_2, 1 stands for the family ID and 2 represents a within-family ID. Refer to PLINK .fam format specification for more details.