ic, a post-imputation data checking program

Background

ic is a set of programs designed to produce a single-page HTML visual summary of one or more imputed data sets from the most common imputation programs. The poster from the ASHG 2016 meeting that describes this program, and the pre-imputation checking, can be downloaded here (3.8MB).

Download:

Two programs are currently required: one to parse the VCF files and the other to produce the plots. Both are available to download here:

v.1.0.2: ic.v1.0.2.zip
v.1.0.3: ic.v1.0.3.zip
v.1.0.4: ic.v1.0.4.zip
v.1.0.5: ic.v1.0.5.zip
vcfparse: vcfparse.zip

Update History:

v.1.0.1 to v.1.0.2 Updated path to Java executable
v.1.0.2 to v.1.0.3 Updated to run using the info files from Michigan Imputation server
v.1.0.3 to v.1.0.4 Fixed bug with the 1000G parsing
v.1.0.4 to v.1.0.5 Added ability to calculate AF from AC and AN for Umich data

Installation

No installation is required; simply extract all the files in the zip file to a directory. There are, however, a number of dependencies; see Requirements below for how to install them.

Requirements:

The program requires the GD libraries to be installed to create the plots.

Install libGD

Installation will vary by system. On Ubuntu, libGD can be installed using the following commands:

Before 16.04:
sudo apt-get -y install libgd2-xpm-dev build-essential
16.04 onwards:
sudo apt-get -y install libgd-dev build-essential

Install Perl GD::Graph

Install cpanm and then GD::Graph:
sudo cpan App::cpanminus
sudo cpanm GD::Graph

Download the reference panel:

Download either the 1000G phase 3 summary (from here, 1.44GB) or the tab delimited HRC summary (HRC release 1 or 1.1, from the HRC web site).

Usage

The current version requires only the first 8 columns from the VCF output file; use the vcfparse.pl script in downloads to extract them.

vcfparse usage:

vcfparse.pl -d <directory of VCFs> -o <outputname> [-g]
where

-d
The path to the directory containing imputed VCF files.
-o
The output directory name; it will be created if it doesn't exist.
-g
Flag to specify that the output files are gzipped.

The program will not overwrite existing files of the same name. This process must be repeated for each imputed data set.
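
To illustrate what "the first 8 columns" means, here is a minimal Python sketch that keeps the fixed VCF fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and drops FORMAT and the per-sample genotype columns. It is not the vcfparse.pl implementation: the function name first_eight_columns is hypothetical, the sample lines are made up, and passing the "##" header lines through unchanged is an assumption.

```python
def first_eight_columns(lines):
    """Yield VCF lines cut down to the eight fixed columns.

    Illustrative only: this is not how vcfparse.pl is implemented.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged (assumption).
            yield line
            continue
        # The #CHROM header line and data lines are tab-delimited; keep
        # CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO only.
        yield "\t".join(line.split("\t")[:8])


# Made-up example input: one meta line, the column header, one variant.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t10177\trs1\tA\tAC\t.\tPASS\tAF=0.42;R2=0.91\tGT:DS\t0|1:1.0\t0|0:0.1\n",
]
trimmed = list(first_eight_columns(example))
```

Dropping the genotype columns is what makes the summarised files small, since the per-sample data accounts for almost all of the size of an imputed VCF.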
Once the VCFs are converted, the main program can be run with the following options.

ic Usage:

ic -d <directory> -r <Reference panel> [-h | (-g -p <population>)] [-f <mappings file>] [-o <output directory>]

Options:

-d --directory <Directory>
Top level directory containing either one set of per chromosome files, or multiple directories each containing a set of per chromosome files.
This directory will be searched recursively for files matching the required formats.
Files may be gzipped or uncompressed.
-f --file <Mapping file>
Mapping file of directory name to cohort name, optional but recommended when using multiple data sets.
-r --ref <Reference panel>
Reference panel summary file, either 1000G or the tab delimited HRC (r1 or r1.1).
-h --hrc
Flag to indicate Reference panel file is HRC, defaults to HRC if no option is given.
-g --1000g
Flag to indicate Reference panel file is 1000G.
-p --pop <Population>
Population to check allele frequency against.
Applies to 1000G only, defaults to ALL if not supplied.
Options available: ALL, EUR, AFR, AMR, SAS, EAS.
-o <Output directory>
Top level directory to contain all the output folders.
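
The option rules above (either -h, or -g with an optional -p, plus the optional -f and -o) can be sketched as a small Python helper that assembles the argument list. This is only an illustration: the function build_ic_command and the example paths are hypothetical, and it assumes the program is invoked as "ic", as written in the usage line.

```python
POPULATIONS = {"ALL", "EUR", "AFR", "AMR", "SAS", "EAS"}


def build_ic_command(directory, ref, panel="hrc", population="ALL",
                     mapping=None, output=None):
    """Return the ic argument list for the documented options."""
    if panel not in ("hrc", "1000g"):
        raise ValueError("panel must be 'hrc' or '1000g'")
    cmd = ["ic", "-d", directory, "-r", ref]
    if panel == "hrc":
        # -h and -g are alternatives; -p is only meaningful with -g.
        cmd.append("-h")
    else:
        if population not in POPULATIONS:
            raise ValueError("unknown population: %s" % population)
        cmd += ["-g", "-p", population]
    if mapping:
        cmd += ["-f", mapping]
    if output:
        cmd += ["-o", output]
    return cmd


# Hypothetical paths, for illustration only.
hrc_cmd = build_ic_command("parsed_vcfs", "hrc_summary.tab")
kg_cmd = build_ic_command("parsed_vcfs", "1000g_summary.txt", panel="1000g",
                          population="EUR", mapping="cohorts.txt",
                          output="ic_reports")
```

The resulting list could be passed to subprocess.run(cmd, check=True) once the program and its dependencies are installed.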

Mapping file

The mapping file should consist of two columns:

The directory name (optionally including the path)
The name you wish to use for the output files

Example mapping file:

/full/path/to/folder/1 MyStudy1
./path/to/folder/2 MyStudy2
folder3 MyStudy3
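
As a rough illustration of the two-column layout, the following Python sketch reads mapping text into a dictionary of directory to cohort name. It is not the parser used by ic: the function name parse_mapping is hypothetical, and it assumes the columns are separated by whitespace and that cohort names contain no spaces.

```python
def parse_mapping(text):
    """Return {directory: cohort_name} from two-column mapping text."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace so that a directory path
        # containing spaces stays intact (assumption: names have none).
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError("expected two columns, got: %r" % line)
        directory, name = parts
        mapping[directory] = name
    return mapping


example = """/full/path/to/folder/1 MyStudy1
./path/to/folder/2 MyStudy2
folder3 MyStudy3
"""
cohorts = parse_mapping(example)
```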



If no mapping file is supplied, the program will attempt to determine a unique set of names from the top level directory and/or sub-directories supplied with the -d option. This may or may not produce a unique folder for each output; if it does not, the program will auto-increment the file names within the directory (these will be consistent across each data set).
One advantage of using a mapping file is that the data sets provided need not all be in the same base path.


Formats Supported

Currently, imputed files from the University of Michigan and the Sanger Institute are supported. Impute is also supported but requires a different reformatter; contact me in this case.


Output

An example of the html output can be found here: Sample QC Report (5.2MB).

Known issues

Gzip in Perl does not support bgzip chunks, hence the requirement to unzip the reference panel and summarise the VCF files.
If there are errors on the allele frequency plots, this may be because earlier versions of the University of Michigan imputation output did not contain the AF in the VCF. Contact me for an earlier version that can support this.
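
For context, v.1.0.5 added the ability to calculate AF from AC and AN (see Update History). The relationship is AF = AC / AN, and a minimal Python sketch of that fallback is shown below. It is not the code used by ic: the function name allele_frequency is hypothetical, and it assumes a biallelic site with a single AC value in the INFO column.

```python
def allele_frequency(info):
    """Return AF from a VCF INFO string, deriving it from AC/AN if absent.

    Assumes a biallelic record (one AC value); returns None when the
    frequency cannot be determined.
    """
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:  # ignore flag-only entries that have no "=value" part
            fields[key] = value
    if "AF" in fields:
        return float(fields["AF"])
    if "AC" in fields and "AN" in fields:
        an = int(fields["AN"])
        # AF = AC / AN; undefined when no alleles were called.
        return int(fields["AC"]) / an if an else None
    return None


af_present = allele_frequency("AF=0.125;R2=0.9")
af_derived = allele_frequency("AC=50;AN=200")
af_missing = allele_frequency("R2=0.9")
```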

Future plans

As mentioned in the poster presented at the ASHG 2016 meeting, we are planning a web based version to avoid any issues that the requirement to install GD might cause.
Future download versions will also be able to read the information from the VCF directly, removing the need for the vcfparse program (this will still be needed for the online version to reduce file sizes and therefore upload time).