ic is a set of programs that produces a single-page HTML visual summary of one or more imputed data sets from the most common imputation programs. The poster from the ASHG 2016 meeting, which describes this program and the pre-imputation checking, can be downloaded here (3.8MB).
Two programs are currently required: one to parse the VCF files and the other to produce the plots. The programs are available to download here:
No installation is required; simply extract all the files in the zip file to a directory. There are, however, a number of dependencies; see Requirements below for how to install these.
The program requires the GD libraries to be installed to create the plots.
Installation will vary by system. On Ubuntu, libGD can be installed using the following commands:
Before 16.04:
sudo apt-get -y install libgd2-xpm-dev build-essential
16.04 onwards:
sudo apt-get -y install libgd-dev build-essential
Install cpanm and then GD::Graph:
sudo cpan App::cpanminus
sudo cpanm GD::Graph
A reference panel summary file is also required: either the 1000G phase 3 summary (from here, 1.44GB) or the tab-delimited HRC summary (HRC release 1 or 1.1, from the HRC web site).
vcfparse.pl -d <directory of VCFs> -o <outputname> [-g]
where
| -d | The path to the directory containing imputed VCF files. |
| -o | Specifies the output directory name, will be created if it doesn't exist. |
| -g | Flag to specify the output files are gzipped. |
The program will not overwrite existing files of the same name. This
step must be run once for each imputed data set.
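Because the conversion has to be repeated for every imputed data set, a small loop can save some typing. The sketch below only prints the commands (a dry run) so they can be inspected first. It assumes a hypothetical layout in which each cohort's VCFs sit in their own sub-directory of `imputed/`, that paths contain no spaces, and that `vcfparse.pl` is invoked exactly as shown in the usage line above. The `_parsed` suffix and the function name are illustrative only.

```shell
# Print the vcfparse.pl command for one imputed data set.
#   $1 = directory of imputed VCFs, $2 = output directory name
vcfparse_cmd() {
    printf 'vcfparse.pl -d %s -o %s -g\n' "$1" "$2"
}

# Hypothetical layout: imputed/<cohort>/ holds the VCFs for each cohort.
for vcf_dir in imputed/*/; do
    [ -d "$vcf_dir" ] || continue      # the glob matched nothing
    cohort=$(basename "$vcf_dir")
    vcfparse_cmd "${vcf_dir%/}" "${cohort}_parsed"
done
```

Once the printed commands look right, they can be run by piping the output of the loop to `sh`.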
Once the VCFs are converted, the main program can be run with the
following options.
ic -d <directory> -r <Reference panel> [-h | (-g -p <population>)] [-f <mappings file>] [-o <output directory>]
| -d --directory | Directory | Top level directory containing either one set of per-chromosome files, or multiple directories each containing a set of per-chromosome files. This directory will be searched recursively for files matching the required formats. Files may be gzipped or uncompressed. |
| -f --file | Mapping file | Mapping file of directory name to cohort name; optional but recommended when using multiple data sets. |
| -r --ref | Reference panel | Reference panel summary file, either 1000G or the tab-delimited HRC (r1 or r1.1). |
| -h --hrc | | Flag to indicate the reference panel file is HRC; defaults to HRC if no option is given. |
| -g --1000g | | Flag to indicate the reference panel file is 1000G. |
| -p --pop | Population | Population to check allele frequency against. Applies to 1000G only; defaults to ALL if not supplied. Options available: ALL, EUR, AFR, AMR, SAS, EAS. |
| -o | Output directory | Top level directory to contain all the output folders. |
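As an illustration of how these options combine for a 1000G reference panel, the sketch below assembles a command line and rejects any population that is not in the list above. It only prints the command. The function name, directory names and reference file name are hypothetical, and the program is assumed to be invoked as `ic`, as in the usage line.

```shell
# Print an ic command line for a 1000G reference panel.
#   $1 = directory of parsed files, $2 = reference summary file,
#   $3 = population, $4 = output directory
ic_1000g_cmd() {
    case "$3" in
        ALL|EUR|AFR|AMR|SAS|EAS) ;;
        *) echo "unsupported population: $3" >&2; return 1 ;;
    esac
    printf 'ic -d %s -r %s -g -p %s -o %s\n' "$1" "$2" "$3" "$4"
}

# Hypothetical paths, for illustration only.
ic_1000g_cmd parsed 1000G_summary.txt EUR ic_output
```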
The mapping file should consist of two columns: the directory, followed by the cohort name to use for it.
Example mapping file:
| /full/path/to/folder/1 | MyStudy1 |
| ./path/to/folder/2 | MyStudy2 |
| folder3 | MyStudy3 |
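A mapping file like the example above could be generated with a short helper such as the one sketched here. Note that the column separator is not stated on this page; the tab used below is an assumption, so check it against a working mapping file before relying on it. The function name `make_mapping` is made up for this example.

```shell
# Print one "directory<TAB>cohort" line for each pair of arguments.
# The tab separator is an assumption, not confirmed by the ic documentation.
make_mapping() {
    while [ "$#" -ge 2 ]; do
        printf '%s\t%s\n' "$1" "$2"
        shift 2
    done
}

make_mapping /full/path/to/folder/1 MyStudy1 \
             ./path/to/folder/2 MyStudy2 \
             folder3 MyStudy3
```

Redirecting the output to a file gives something that can be passed to the -f option.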
If no mapping file is supplied, the program will attempt to
determine a unique set of names from the top level directory
and/or sub-directories supplied with the -d option. This may not
yield a unique folder for each output; if it does not, the
program will auto-increment the file names within the
directory (these will be consistent across each data set).
One advantage of using a mapping file is that the data sets provided
need not all be under the same base path.
Currently, imputed files from the University of Michigan and the Sanger Institute are supported. Impute is also supported but requires a different reformatter; contact me in this case.
Gzip in Perl does not support bgzip chunks, hence the requirement to unzip
the reference panel and summarise the VCF files.
If there are errors on the allele frequency
plots, this may be because earlier versions of the University of
Michigan imputation output did not contain the AF in the VCF.
Contact me for an earlier version that can support this.
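One way to tell whether a given VCF is affected is to look for an AF declaration in its header. The sketch below does this. It assumes the AF field is declared with a standard `##INFO=<ID=AF,...>` header line, and it relies on GNU gzip, whose -f option passes uncompressed input through unchanged when writing to standard output. The function name is hypothetical, and a declared header does not by itself guarantee that every record carries a value.

```shell
# Succeeds if the VCF header declares an AF INFO field.
# Accepts gzipped, bgzipped or plain-text VCFs (assumes GNU gzip).
vcf_has_af() {
    gzip -cdf < "$1" | grep -q '^##INFO=<ID=AF,'
}

# Example use (the file name is a placeholder):
#   vcf_has_af chr20.dose.vcf.gz && echo "AF present" || echo "AF missing"
```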
As mentioned in the poster presented at the ASHG 2016 meeting, we are
planning a web-based version to avoid any issues that the
requirement to install GD might cause.
Future download versions will also be able to
read the information from the VCF directly, removing the need for
the vcfparse program (this will still be needed for the online
version to reduce file sizes and therefore upload time).