This page lists all the tools available for working with strand updates and pre- and post-imputation checking, plus a few others.

If you have any issues with these programs please contact me, Will Rayner
william dot rayner at helmholtz-munich dot de
and/or
will dot rayner at strand dot org dot uk

ScatterShot

A simple-to-use cluster plotting program.
The ASHG 2013 poster describing the use of this program can be downloaded here:

WR-ASHG2013posterPP-portrait.pdf

Detailed instructions and usage examples can be found in the zip download.

Download (13 MB)

ScatterShot.2014JAN17.zip

Previous Versions

ScatterShot.2013DEC09.zip (13 MB)
ScatterShot.2013NOV25.r1.zip (12.9 MB)
ScatterShot_2013OCT25_r1.zip (16.4 MB)
ScatterShot_2013OCT24_r2.zip (15.5 MB)
ScatterShot.zip (12.9 MB)

Physical Activity Program

Program to derive kcal values from questionnaire data.
See the accompanying paper here for details.

Download here: PhysicalActivityProgram.zip

HRC or 1000G Imputation preparation and checking

Program to check a QC'd plink .bim file against the HRC, 1000G or CAAPA reference SNP list in advance of imputation

Downloads

This tool is also available in the Docker and Singularity containers listed as part of the post-QC checking program (IC). See the downloads section of that program for the links

Version 4.2.1 HRC-1000G-check-bim-v4.2.zip
Version 4.2.2 HRC-1000G-check-bim.v4.2.2.zip
Version 4.2.3 HRC-1000G-check-bim.v4.2.3.zip
Version 4.2.4 HRC-1000G-check-bim.v4.2.4.zip
Version 4.2.5 HRC-1000G-check-bim.v4.2.5.zip
Version 4.2.6 HRC-1000G-check-bim-v4.2.6.zip
Version 4.2.7 HRC-1000G-check-bim-v4.2.7.zip
Version 4.2.8 HRC-1000G-check-bim-v4.2.8.zip
Version 4.2.9 HRC-1000G-check-bim-v4.2.9.zip
Version 4.2.10 HRC-1000G-check-bim-v4.2.10.zip
Version 4.2.11 HRC-1000G-check-bim-v4.2.11.zip
Version 4.2.13 HRC-1000G-check-bim-v4.2.13.zip
Version 4.3.0 HRC-1000G-check-bim-v4.3.0.zip
Version 4.3.1 HRC-1000G-check-bim-v4.3.1.zip
Version 4.3.2 HRC-1000G-check-bim-v4.3.2.zip
Version 4.3.3 HRC-1000G-check-bim-v4.3.3.zip

All versions below 4.3.0 are designed for use in an interactive session, which does not work well on a high-performance cluster. The NoReadKey version below is designed to run non-interactively; alternatively, and recommended, use version 4.3.0 or above, which runs both interactively and on a cluster.

Download here: HRC-1000G-check-bim-v4.2.11-NoReadKey.zip


Summary of Version Changes

V4.2.2
Added two new options for allele frequency thresholds (-t <difference>, -n)
-t <difference> sets the allele frequency difference threshold used to exclude SNPs from the final files; the default if not set is 0.2. The range is 0-1, and the larger the threshold, the fewer SNPs will be excluded (for example, -t 0.3 sets the threshold to 0.3; see also the example after this list).
-n specifies that no SNPs should be excluded on the basis of allele frequency difference; if -n is used, -t has no effect.
V4.2.3
Fixed bug whereby SNPs that were incorrectly mapped in the bim file were updated only for position and not chromosome.
V4.2.4
Added support for X chromosome in HRC release v1.1
V4.2.5
Fixed bug with Chromosome X being left out of plink command file
V4.2.6
Added the ability to use gzipped reference panels
Implemented a check to ensure the same number of variants are present in the .bim and .frq files
V4.2.7
Minor update to the usage information display
V4.2.8
Added new flag -c to specify checking individual chromosome(s) rather than assuming genome wide
V4.2.9
Update to allow reading of bim files with allele codes 1,2,3,4
Fixed a minor bug that resulted in a warning of a null entry if there were no differences between the bim file and the reference
V4.2.10
Update to check path and name of bim and frequency file are correct and report meaningfully if not
V4.2.11
Changed the plink commands to update and retain the Ref/Alt alleles in the plink conversion commands
V4.2.12
Added -a flag to disable the automatic removal of palindromic SNPs with MAF > 0.4
V4.2.13
Added ability to read the bgzipped reference panels, as well as plain gzipped files
Added -l flag to set path to preferred plink executable
Added better support for the paths to the plink files in the shell script
Added -o flag to allow the final output path for all the plink and VCF files to be specified
V4.3.0
Added TOPMed and rebuilt code to use less memory
Removed the interactive terminal size check, this version will work both interactively and on a cluster
V4.3.1
Added flag to allow variants not found in the reference panel to be kept in the output, if used in conjunction with -n it will keep all the input variants
V4.3.2
Changed the help messages to be clearer and work on more systems
V4.3.3
Added a new plink 2 run script with updated options; this is now created alongside the plink v1.9 run script, and both work with the latest development versions of plink
Tested with plink 1.9 (beta 7.11) and plink 2 (v2.0.0-a.7LM)
Use the -l flag to specify the plink executable, with or without a path; otherwise the commands plink and plink2 are assumed to be on the path and are used by default.
If the plink and/or plink2 executable is in the same directory as the bim file, it will be used in the output.
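
As an illustration of some of the optional flags described above, the two commands below set a threshold of 0.3 and disable frequency-based exclusion entirely (a sketch only; mydata stands in for your own plink file stem):

perl HRC-1000G-check-bim.pl -b mydata.bim -f mydata.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h -t 0.3
perl HRC-1000G-check-bim.pl -b mydata.bim -f mydata.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h -n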

Summary of checks performed and outputs:

Checks:
Strand, alleles, position, Ref/Alt assignments and frequency differences. In addition to the reference file, v4 and above require the plink .bim and .frq files (the latter produced by the plink --freq command).
Produces:
A set of plink (and, from v4.3.3, plink2) commands to update or remove SNPs based on the checks against the specified reference file, and optionally a file (FreqPlot) of cohort allele frequency vs reference panel allele frequency.
Updates:
Strand, position, ref/alt assignment
Removes:
A/T & G/C SNPs if MAF > 0.4, SNPs with differing alleles, SNPs with > 0.2 allele frequency difference, SNPs not in reference panel.
All of these removal steps can be adapted or turned off in the latest versions >v4.3.1

Usage with HRC reference panel:

Requires the tab delimited HRC reference, unzipped (or, from V4.2.13 onwards, gzipped); currently v1.1, HRC.r1-1.GRCh37.wgs.mac5.sites.tab. This appears to no longer be available from the Haplotype Reference Consortium website but is available on this site, link here: HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz

Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -h
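
For example, a complete run against HRC might look like the following (a sketch; mydata stands in for your QC'd plink file stem, and the name of the generated script may vary by version):

plink --bfile mydata --freq --out mydata
perl HRC-1000G-check-bim.pl -b mydata.bim -f mydata.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h
sh Run-plink.sh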

Usage with 1000G reference panel

Requires the unzipped 1000G legend file (instructions to create this are below) or, recommended, download it here (1.4GB in size): 1000GP_Phase3_combined.legend.gz for autosomes only, or here with the X chromosome included: 1000GP_Phase3_combined.legend.incX.gz

Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -g -p <population>
The 1000G population defaults to ALL if not specified.
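
For example, to check against the European subset (mydata again standing in for your own files):

perl HRC-1000G-check-bim.pl -b mydata.bim -f mydata.frq -r 1000GP_Phase3_combined.legend -g -p EUR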

Usage with CAAPA reference panel

The CAAPA reference panel can be downloaded here (513MB): all.caapa.sorted.zip
Many thanks to Kathleen Barnes and Michelle Daya of CAAPA for sharing this, and to Margaret Parker and Michael Cho for the initial reformatting.

Usage is the same as for the HRC panel, namely using the flags -h -r all.caapa.sorted.txt

Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r all.caapa.sorted.txt -h

Creating the 1000G reference panel file

This section is included for completeness but is no longer needed, as the 1000G reference panel file, with and without the X chromosome, can be downloaded via the links above and below.

The reference file for 1000G can be created from the legend files on the impute website or downloaded from the link below.
To create the file you will need to extract the legend files from the 1000GP_phase3.tgz file: https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html

Download the .tgz, extract all the .legend files and place them in a directory together with the following script: concatenate-1000GP.zip; then run the script (perl concatenate-1000GP.pl). This will create a file (1000GP_Phase3_combined.legend) suitable for use with the checking program.
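
In outline, the steps might look like this (a sketch, assuming the archive unpacks into a 1000GP_Phase3 directory and the legend files are not further compressed):

tar -xzf 1000GP_phase3.tgz
mv 1000GP_Phase3/*.legend .
perl concatenate-1000GP.pl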

Alternatively the file can be downloaded from this site here: 1000GP_Phase3_combined.legend, or with the X chromosome: 1000GP_Phase3_combined.legend.incX.gz

Post Imputation Checking (IC)

Background

IC is a program designed to produce a single page visual summary of one or more imputed data sets from the most common imputation programs. The poster from the ASHG 2016 meeting that describes this program, and the pre-imputation checking, can be downloaded here (3.8MB)

Downloads

For older versions two programs were required: one to parse the VCF files (vcfparse) and one to do the plots. This is no longer necessary with versions >= v1.0.8.

Version 1.0.2 ic.v1.0.2.zip
Version 1.0.3 ic.v1.0.3.zip
Version 1.0.4 ic.v1.0.4.zip
Version 1.0.5 ic.v1.0.5.zip
Version 1.0.6 ic.v1.0.6.zip
Version 1.0.7 ic.v1.0.7.zip
Version 1.0.8 ic.v1.0.8.zip
Version 1.0.9 ic.v1.0.9.zip
Version 1.0.10 ic.v1.0.10.zip
Version 1.0.11 ic.v1.0.11.zip
Docker container: download locally or from the Docker site
Singularity container: download locally or from the Sylabs site
VCF Parse VCFparse.zip

Update History:


v1.0.2 Updated path to Java executable
v1.0.3 Updated to run using the info files from Michigan Imputation server
v1.0.4 Fixed bug with the 1000G parsing
v1.0.5 Added ability to calculate AF from AC and AN for Umich data
v1.0.6 Added function to read summary level data
v1.0.7 Changes to speed up processing of summary level data, addition of a function to create blank plots if a chromosome is missing, bug fix on the info score summary plot
v1.0.8 Updated file reading to cope with bgzipped files, removing need for vcfparse. Added better handling of paths to ensure the Java executable is found at run time
v1.0.9 Added code to read gzipped reference panels directly
v1.0.10 Fixed bug in reading directly from VCFs
v1.0.11 Code tidying; updated help page formatting and error messages

Installation

No installation is required; for the easiest usage the containerised versions are now recommended. To run locally, extract all the files in the zip file to a directory. There are, however, a number of library dependencies; see Requirements below for how to install these (root access required). If running on a server/cluster, we recommend the containerised version, as it includes all the libraries; instructions on its use are below.

Installation of the libraries required to run IC needs sudo (root) access; to avoid this being an issue we have created Docker and Singularity containers for v1.0.11 (and above), which contain all the required dependencies.

Containerised versions

Docker and Singularity containers of all the tools, including IC, are available.

Docker

To run the Docker image there are two options:

1) Pull the image from Docker hub:
docker pull itganalytics/strand_tools

2) Download the image from this site on the link above and import into Docker using:
docker load < strand_tools.tar.gz
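
Either way, the container then needs to be able to see your data; for example (a sketch, assuming the image drops you into a shell, with /data as an arbitrary mount point):

docker run -it -v "$(pwd)":/data itganalytics/strand_tools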

Singularity

To run the Singularity image

1) Pull the image from Singularity hub:
singularity pull --arch amd64 library://hmgu-itg/default/strand_tools:latest

2) Download the image from this site on the link above

Use the command:
singularity shell hmgu-itg_default_strand_tools.sif
to open a shell in the container, then run ic.pl with the commands as normal. It is best to invoke the container shell from the directory containing the data to be processed, or a directory immediately above it, as by default the shell has access only to the user's home directory and the current directory. Should other directories be needed, they can be added with the -B, --bind option.
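
For example, to make an additional data directory visible inside the container (/path/to/data is a placeholder):

singularity shell -B /path/to/data hmgu-itg_default_strand_tools.sif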

Requirements To Run Locally

The program requires the GD libraries to be installed to create the plots. This can be done on the local machine if root privileges are available; if not, see the section on running the containerised versions, which are recommended over any local installation.

Install libGD

Installation will vary by system; on Ubuntu, libGD can be installed using the following commands, if it is not already present and permissions allow. Should this not be possible, the latest containerised version can be used instead, as it requires no installation.

Before 16.04:
sudo apt-get -y install libgd2-xpm-dev build-essential
16.04 onwards:
sudo apt -y install libgd-dev build-essential

Install Perl GD::Graph

Install cpanminus locally (letting it configure automatically, using local::lib) and then install GD::Graph:
cpan App::cpanminus
cpanm GD::Graph
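
To confirm the module is available afterwards, a quick check such as this should print OK:

perl -MGD::Graph -e 'print "OK\n"'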

Download the reference panel:

1000G phase 3 summary (from here, 1.44GB) or the tab delimited HRC summary (HRC release 1.1, from here).

Usage


The current version does not require the vcfparse.pl script. Older versions (<1.0.8) require the first 8 columns of the VCF output files to be extracted with the vcfparse.pl script before running the main program.

vcfparse usage:

vcfparse.pl -d <directory of VCFs> -o <outputname> [-g]
where

-d
The path to the directory containing imputed VCF files.
-o
Specifies the output directory name, will be created if it doesn't exist.
-g
Flag to specify the output files are gzipped.
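
A typical invocation might look like this (paths and names are placeholders):

perl vcfparse.pl -d /path/to/imputed/vcfs -o mystudy-parsed -g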

The program will not overwrite files of the same name, and this process is required for each imputed data set.
Once the VCFs are converted the main program can be run with the following options.

ic Usage:

ic.pl -d <directory> -r <Reference panel> [-h | (-g -p <population>)] [-f <mappings file>] [-o <output directory>]

Options:

-d --directory Directory Top level directory containing either one set of per chromosome files, or multiple directories each containing a set of per chromosome files
This directory will be searched recursively for files matching the required formats
Files may be gzipped or uncompressed
-f --file Mapping file Mapping file of directory name to cohort name, optional but recommended when using multiple data sets
-r --ref Reference panel Reference panel summary file, either 1000G or the tab delimited HRC (r1 or r1.1)
-h --hrc
Flag to indicate Reference panel file is HRC, defaults to HRC if no option is given
-g --1000g
Flag to indicate Reference panel file is 1000G
-p --pop Population Population to check allele frequency against
Applies to 1000G only, defaults to ALL if not supplied
Options available ALL, EUR, AFR, AMR, SAS, EAS
-o
Output Directory
Top level directory to contain all the output folders
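
For example, a single-cohort run against HRC might look like this (paths are placeholders):

perl ic.pl -d /path/to/imputed -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h -o /path/to/qc-output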

Mapping file

The mapping file should consist of two columns:

The directory name (optionally including the full or relative path)
The name you wish to use for the output files

Example mapping file:

/full/path/to/folder/1 MyStudy1
./path/to/folder/2 MyStudy2
folder3 MyStudy3


If no mapping file is supplied, the program will attempt to determine a unique set of names from the top level directory and/or sub-directories supplied with the -d option. This may or may not result in unique folders for each output; if not, the program will auto-increment the file names within the directory (consistently across each data set).
One advantage of using a mapping file is the data sets provided need not be all in the same base path.

Should the run fail at the final step, the command to rerun the Java code, with the correct paths, is printed at the end; this can be copied and pasted to rerun this section.

Formats Supported

Currently VCF files imputed on the University of Michigan, Sanger Institute and Helmholtz Munich servers are supported. Other VCFs may also be supported, and older versions of IMPUTE are supported but require a different reformatter; contact me in this case.


Output

An example of the html output can be found here: Sample QC Report (5.2MB).

Known issues

Gzip in Perl does not support bgzip chunks, hence the requirement in early versions to unzip the reference panel and summarise the VCF files.
If there are errors on the allele frequency plots, this could be because earlier versions of the University of Michigan imputation output did not contain the AF in the VCF. Contact me for an earlier version of IC that can support this.

Future plans

We are still planning a web based version of this code to simplify usage. In the meantime, for easiest use, please use the Docker or Singularity container versions.

UK Biobank Phenotype Extraction Script

Perl script to extract a subset of data from the UK Biobank phenotype data download.

Requires the Stata .dct and R .tab files to run. Run it without any arguments to get a description of usage.

Downloads:

Version 1.0 extract.zip

Version 1.2.2 extract-v1.2.2.zip

Phenotype File Summary and Fingerprinting

This program was developed to give an overview of the phenotypes (columns) in a tab delimited text file. To do this, it attempts to determine the type of each column (numeric, text, mixed text/numeric or coded variables) and summarise it appropriately.

Downloads

v3.1
phenotype_qc.zip
v4.1
Updated to better accommodate missing column names in the dct file.
phenotype_qc.v4.1.zip

Usage:

Version 3.1

Usage: perl phenotype_qc.pl -f <phenotypeFile> [-m <missingValueIdentifier> -i <sampleIDcolumn> -c <column> -s <columnFile> -l <headerLookup> -h]

Version 4.1

Usage: perl phenotype_qc.pl -f <phenotypeFile> [-m <missingValueIdentifier> -i <sampleIDcolumn> -c <column> -s <columnFile> -l <headerLookup> -h -p -k <#columns per chunk>]

Version 4.1 introduces a new method for extracting the data from the phenotype file; this is necessary because v3.1 would perform poorly, or not at all, on large data sets (>5GB). This version works on data sets of any size, but only in a Unix/Linux environment (Cygwin is possible but not tested) as it requires the Unix cut command to extract the data. The -k option specifies how many columns to extract at a time using cut; the default is 25, a trade-off between the number of calls to cut and the amount of memory required. If you have problems with memory usage, set this to a smaller number. Higher numbers are possible, but on test sets to date the time to complete a run increases, probably due to memory allocation.

The second new option (-p) plots all the numeric phenotypes, allowing a quick visual check of the distribution of values. Running this version, with or without plotting, still requires GD::Graph to be installed; for instructions on how to do this see here: GD::Graph

Phenotype QC program command line options:

-f <phenotypeFile> A tab delimited phenotype file, all rows should contain the same number of columns.
-i <sampleIDcolumn> Set the column that contains the sample ids, column numbers start from 0, default is 0.
-m <missingValueIdentifier> Set the missing value identifier, default is NA.
-c <column> Specify a single column to check; if using a header lookup file this can be either the human readable or the original header.
-s <columnFile> Provide a file containing a list of columns to extract, one per row. As with single column checking, either header can be used; takes precedence over -c.
-l <headerLookup> Provide an optional header lookup file for the data set, for use where the headers within the file are not human readable. Requires two tab delimited columns: the first should contain the IDs that match the headers in the phenotype file, the second the human readable names to use.
-h
Show the help message.

Version 4.1 Only
-k <#columns per chunk> Specify the number of columns to extract in each chunk for processing, default if not specified is 25.
-p
Specify to plot every column identified as numeric, or numeric coded, default is not to plot.
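
For example, a v4.1 run that plots the numeric columns and uses smaller chunks to limit memory use might look like this (phenotypes.txt is a placeholder):

perl phenotype_qc.pl -f phenotypes.txt -m NA -i 0 -k 10 -p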

The categories reported by the program are:

Numeric: this category reports:
- Total number of non-missing variables.
- Total number of missing entries.
- Mean.
- Median.
- Standard deviation.
- Number of entries that are either greater or less than three standard deviations from the mean.
Text, Mixed, and Coded
These categories all report:
- Total number of non-missing variables.
- Total number of missing entries.
- Total number of unique entries, and if the number of unique entries is below 50 these are displayed with counts for each entry.
Coded:
Entries in this category can be text coded, numeric coded or mixed coded.
A column is described as containing a coded set of variables if the total number of unique values is less than 0.1% of the total number of entries, or 20, whichever is the greater. For example, with 100,000 entries the threshold is 100 unique values; with 10,000 entries 0.1% would only be 10, so the threshold of 20 applies.

The latest versions (v3.1 & v4.1) also produce a "fingerprint" summary of each row (assumed to be a sample) and column (assumed to be a phenotype); these are designed to verify whether a sample or phenotype has changed between data releases.
The fingerprint files (prefixed FP for fingerprint phenotype and FPS for fingerprint sample) both contain the following columns:

Sample ID or column header.
Total numeric values.
Total non-numeric values.
Total non missing values.
Total unique values.
Total missing values.
Determined type.
These files can be used to determine whether a phenotype has changed between data releases; a script to do this comparison is under development and will be posted as soon as it is finalised.


Excel to Tab

A Perl wrapper using Spreadsheet::Read to convert various spreadsheet formats (xls, xlsx, csv, ods) to tab delimited text files.


Where the format supports sheets the output file name will include the sheet name and there will be one file per sheet.
Depending on your system you may need to add one or more of the following modules to your Perl install:
Spreadsheet::Read
Spreadsheet::ReadSXC
Spreadsheet::ParseExcel
Spreadsheet::ParseXLSX
Text::CSV_XS

Download

xls2tab.zip

Usage:

xls2tab -i <inputfile> [-o <outputfilestem> -f]

-i
inputfile
Path and name of the input file
-o
outputfilestem
Optional; if supplied, it will form the stem for all the output. For example, using MyFile will lead to MyFile.Sheet1.txt and MyFile.Sheet2.txt (assuming there are two sheets named Sheet1 and Sheet2). If not supplied, the name will be the same as the input, with the sheet name, if it exists, and the file extension .txt.
-f

Optional; tells the program to select the formatted values from the cells rather than the raw values. The default is raw values.
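
For example (survey.xlsx is a placeholder, and this assumes the wrapper is invoked as in the usage line above):

xls2tab -i survey.xlsx -o survey -f

With two sheets named Sheet1 and Sheet2, this would produce survey.Sheet1.txt and survey.Sheet2.txt containing the formatted cell values.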


Pre-Imputation Data Checking

The Chipendium tool to identify the array and strand orientation from the SNP list in a plink format bim file is currently unavailable but will be returning soon.