This page lists all the tools available for working with strand updates and pre- and post-imputation checking, plus a few others.
If you have any issues with these programs, please contact me, Will Rayner:
william dot rayner at helmholtz-munich dot de
and/or
will dot rayner at strand dot org dot uk
A simple-to-use cluster plotting program.
The ASHG 2013 poster describing the use of this program can be downloaded here:
Detailed instructions and usage examples can be found in the zip download.
ScatterShot.2013DEC09.zip (13Mb)
ScatterShot.2013NOV25.r1.zip (12.9Mb)
ScatterShot_2013OCT25_r1.zip (16.4Mb)
ScatterShot_2013OCT24_r2.zip (15.5Mb)
ScatterShot.zip (12.9Mb)
Program to derive kcal values from questionnaire data.
See the accompanying paper here for details.
Download here: PhysicalActivityProgram.zip
Program to check a QC'd plink .bim file against the HRC, 1000G or CAAPA reference SNP list in advance of imputation
This tool is also available in the Docker and Singularity containers listed as part of the post-QC checking program (IC). See the downloads section of that program for the links
| Version 4.2.1 | HRC-1000G-check-bim-v4.2.zip |
| Version 4.2.2 | HRC-1000G-check-bim.v4.2.2.zip |
| Version 4.2.3 | HRC-1000G-check-bim.v4.2.3.zip |
| Version 4.2.4 | HRC-1000G-check-bim.v4.2.4.zip |
| Version 4.2.5 | HRC-1000G-check-bim.v4.2.5.zip |
| Version 4.2.6 | HRC-1000G-check-bim-v4.2.6.zip |
| Version 4.2.7 | HRC-1000G-check-bim-v4.2.7.zip |
| Version 4.2.8 | HRC-1000G-check-bim-v4.2.8.zip |
| Version 4.2.9 | HRC-1000G-check-bim-v4.2.9.zip |
| Version 4.2.10 | HRC-1000G-check-bim-v4.2.10.zip |
| Version 4.2.11 | HRC-1000G-check-bim-v4.2.11.zip |
| Version 4.2.13 | HRC-1000G-check-bim-v4.2.13.zip |
| Version 4.3.0 | HRC-1000G-check-bim-v4.3.0.zip |
| Version 4.3.1 | HRC-1000G-check-bim-v4.3.1.zip |
| Version 4.3.2 | HRC-1000G-check-bim-v4.3.2.zip |
| Version 4.3.3 | HRC-1000G-check-bim-v4.3.3.zip |
All versions below 4.3.0 are designed for use in an interactive session, which does not work well on a high-performance cluster. The version linked below is designed to run non-interactively; alternatively, use version 4.3.0 or above (recommended).
Download here: HRC-1000G-check-bim-v4.2.11-NoReadKey.zip
Requires the unzipped (or, from v4.2.13 onwards, gzipped) tab-delimited HRC reference (currently v1.1, HRC.r1-1.GRCh37.wgs.mac5.sites.tab). This appears to no longer be available from the Haplotype Reference Consortium website, but it can be downloaded from this site, link here: HRC.r1-1.GRCh37.wgs.mac5.sites.tab.gz
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -h
Requires the unzipped 1000G legend file (instructions to create this are below) or, recommended, download it here (1.4GB in size): 1000GP_Phase3_combined.legend.gz for autosomes only, or here with the X chromosome included: 1000GP_Phase3_combined.legend.incX.gz
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r <Reference panel> -g -p <population>
1000G population will default to ALL if not specified.
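Conceptually, for each variant the check compares the .bim allele pair against the reference panel pair both directly and as strand complements. The Python sketch below illustrates that idea only; it is not the script's actual code:

```python
# Illustrative sketch of an allele/strand check against a reference
# panel; NOT the actual logic of HRC-1000G-check-bim.pl.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def classify_alleles(bim_a1, bim_a2, ref_a1, ref_a2):
    """Classify a bim-file allele pair against the reference pair."""
    bim = {bim_a1, bim_a2}
    ref = {ref_a1, ref_a2}
    if bim == ref:
        return "match"
    flipped = {COMPLEMENT[a] for a in bim}
    if flipped == ref:
        return "strand_flip"
    return "mismatch"
```

Note that palindromic (A/T and G/C) SNPs are ambiguous under a test like this, which is one reason the script also takes a frequency file (-f) so allele frequencies can be compared against the panel.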
The CAAPA reference panel can be downloaded here (513MB): all.caapa.sorted.zip
Many thanks to Kathleen Barnes and Michelle Daya of CAAPA for sharing this, and also to Margaret Parker and Michael Cho for the initial reformatting.
Usage is the same as for the HRC panel, namely using the flags -h -r all.caapa.sorted.txt
Usage: perl HRC-1000G-check-bim.pl -b <bim file> -f <Frequency file> -r all.caapa.sorted.txt -h
This section is included for the sake of completeness but is no longer needed as the 1000G reference panel file, with and without X chromosome, can be downloaded on the links above or below.
The reference file for 1000G can be created from the legend files on the impute website or downloaded from the link below.
To create the file you will need to extract the legend files from the 1000GP_phase3.tgz file:
https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html
Download the .tgz and extract all the .legend files; place them in a directory together with the following script: concatenate-1000GP.zip, and run the script (perl concatenate-1000GP.pl). This will create a file (1000GP_Phase3_combined.legend) suitable for use with the checking program.
Alternatively, the file can be downloaded from this site here: 1000GP_Phase3_combined.legend, or with the X chromosome: 1000GP_Phase3_combined.legend.incX.gz
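For illustration, the concatenation step amounts to joining the per-chromosome legend files while keeping a single header line. The Python sketch below shows the idea only; the file-naming convention assumed here is hypothetical, and the real work (including column handling) is done by concatenate-1000GP.pl:

```python
import glob
import os
import re

def combine_legends(legend_dir, out_path):
    """Concatenate per-chromosome .legend files into one combined file,
    keeping one header line. Illustrative sketch only: assumes file
    names contain 'chr<N>' so chromosomes can be ordered numerically."""
    def chrom_key(path):
        m = re.search(r"chr(\d+)", os.path.basename(path))
        return int(m.group(1)) if m else 0

    files = sorted(glob.glob(os.path.join(legend_dir, "*.legend")),
                   key=chrom_key)
    with open(out_path, "w") as out:
        for i, path in enumerate(files):
            with open(path) as fh:
                header = fh.readline()
                if i == 0:
                    out.write(header)  # keep the header only once
                out.write(fh.read())
```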
IC is a program designed to produce a single page visual summary of one or more imputed data sets from the most common imputation programs. The poster from the ASHG 2016 meeting that describes this program, and the pre-imputation checking, can be downloaded here (3.8MB)
For older versions two programs were required: one to parse the VCF file (vcfparse), the other to do the plots. This is no longer necessary with versions >= v1.0.8.
| Version 1.0.2 | ic.v1.0.2.zip | ||
| Version 1.0.3 | ic.v1.0.3.zip | ||
| Version 1.0.4 | ic.v1.0.4.zip | ||
| Version 1.0.5 | ic.v1.0.5.zip | ||
| Version 1.0.6 | ic.v1.0.6.zip | ||
| Version 1.0.7 | ic.v1.0.7.zip | ||
| Version 1.0.8 | ic.v1.0.8.zip | ||
| Version 1.0.9 | ic.v1.0.9.zip | ||
| Version 1.0.10 | ic.v1.0.10.zip | ||
| Version 1.0.11 | ic.v1.0.11.zip | Docker Container locally, Docker Site | Singularity Container locally, Syslabs Site |
| VCF Parse | VCFparse.zip |
| v1.0.2 | Updated path to Java executable |
| v1.0.3 | Updated to run using the info files from Michigan Imputation server |
| v1.0.4 | Fixed bug with the 1000G parsing |
| v1.0.5 | Added ability to calculate AF from AC and AN for Umich data |
| v1.0.6 | Added function to read summary level data |
| v1.0.7 | Changes to speed up processing of summary level data, addition of a function to create blank plots if a chromosome is missing, bug fix on the info score summary plot |
| v1.0.8 | Updated file reading to cope with bgzipped files, removing need for vcfparse. Added better handling of paths to ensure the Java executable is found at run time |
| v1.0.9 | Added code to read gzipped reference panels directly |
| v1.0.10 | Fixed bug in reading directly from VCFs |
| v1.0.11 | Code tidying; updated help page formatting and error messages |
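The v1.0.5 change concerns VCFs whose INFO field carries allele count (AC) and allele number (AN) but no allele frequency (AF); the frequency is then simply AC/AN. An illustrative Python fragment (simplified to a single alternate allele; not the program's actual code):

```python
def af_from_info(info):
    """Derive the alternate allele frequency from a VCF INFO string.
    Uses AF directly when present, otherwise computes AC/AN.
    Simplified sketch: assumes a single alternate allele."""
    fields = dict(item.split("=", 1)
                  for item in info.split(";") if "=" in item)
    if "AF" in fields:
        return float(fields["AF"])
    return int(fields["AC"]) / int(fields["AN"])
```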
No installation is required: extract all the files in the zip file to a directory. There are, however, a number of library dependencies; see Requirements below for how to install these if you are running the program locally (requires root access). If running on a server/cluster, the containerised version is recommended as it includes all the libraries; instructions on use are below. For the easiest usage the containerised versions are now recommended in general.
Installation of the libraries required to run IC requires sudo (root) access. To avoid this being an issue, we have created Docker and Singularity containers for v1.0.11 (and above) that contain all the required dependencies.
There are Docker and Singularity containers of all the tools, including IC, available.
To run the Docker image there are two options:
1) Pull the image from Docker hub:
docker pull itganalytics/strand_tools
2) Download the image from this site on the link above and import into Docker using:
docker load < strand_tools.tar.gz
To run the Singularity image
1) Pull the image from Singularity hub:
singularity pull --arch amd64 library://hmgu-itg/default/strand_tools:latest
2) Download the image from this site on the link above
Use the command:

The program requires the GD libraries to be installed to create the plots. This can be done on the local machine if root privileges are available; if not, see the section on running the containerised versions, which are recommended over any local installation.
Installation will vary by system. On Ubuntu, libGD can be installed using the following commands if it is not already present and permissions allow. Should this not be possible, the latest containerised version can be used instead, as it requires no installation.
Before 16.04:
sudo apt-get -y install libgd2-xpm-dev build-essential
16.04 onwards:
sudo apt -y install libgd-dev build-essential
To install locally, cpanm will configure itself automatically, install to local::lib, and then install GD::Graph:
cpan App::cpanminus
cpanm GD::Graph
Requires the 1000G phase 3 summary (from here, 1.44GB) or the tab-delimited HRC summary (HRC release 1.1, from here).
vcfparse.pl -d <directory of VCFs> -o <outputname> [-g]
where
| -d | The path to the directory containing imputed VCF files. |
| -o | Specifies the output directory name; will be created if it doesn't exist. |
| -g | Flag to specify the output files are gzipped. |
The program will not overwrite files of the same name, and this process will be required for each imputed data set.
Once the VCFs are converted the main program can be run with the following options.
ic -d <directory> -r <Reference panel> [-h | (-g -p <population>)] [-f <mappings file>] [-o <output directory>]
| -d --directory | Directory | Top level directory containing either one set of per chromosome files, or multiple directories each containing a set of per chromosome files. This directory will be searched recursively for files matching the required formats. Files may be gzipped or uncompressed. |
| -f --file | Mapping file | Mapping file of directory name to cohort name; optional but recommended when using multiple data sets. |
| -r --ref | Reference panel | Reference panel summary file, either 1000G or the tab delimited HRC (r1 or r1.1). |
| -h --hrc | | Flag to indicate the reference panel file is HRC; defaults to HRC if no option is given. |
| -g --1000g | | Flag to indicate the reference panel file is 1000G. |
| -p --pop | Population | Population to check allele frequency against. Applies to 1000G only; defaults to ALL if not supplied. Options available: ALL, EUR, AFR, AMR, SAS, EAS. |
| -o | Output directory | Top level directory to contain all the output folders. |
The mapping file should consist of two tab-delimited columns: the directory path and the cohort name.
Example mapping file:
| /full/path/to/folder/1 | MyStudy1 |
| ./path/to/folder/2 | MyStudy2 |
| folder3 | MyStudy3 |
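Such a mapping file can be written by hand or generated; for example, in Python (the file name mappings.txt is arbitrary):

```python
# Write an IC mapping file: tab-delimited, directory path then cohort
# name, one pair per line. Paths and names here are examples only.
cohorts = {
    "/full/path/to/folder/1": "MyStudy1",
    "./path/to/folder/2": "MyStudy2",
    "folder3": "MyStudy3",
}

with open("mappings.txt", "w") as out:
    for path, name in cohorts.items():
        out.write(f"{path}\t{name}\n")
```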
If no mapping file is supplied, the program will attempt to determine a unique set of names from the top level directory and/or sub-directories supplied with the -d option. This may or may not end up with unique folders for each output; if not, the program will start an auto-increment on the file names within the directory (these will be consistent across each data set).
One advantage of using a mapping file is that the data sets provided need not all be in the same base path.
Should the run fail at the final step, the command to rerun the Java code, with the correct paths, is printed at the end; this can be copied and pasted to rerun this section.
Currently, VCF files imputed on the University of Michigan, Sanger Institute and Helmholtz Munich servers are supported. Other VCFs may be supported, and older versions of IMPUTE are supported but require a different reformatter; contact me in this case.
Gzip in Perl does not support bgzip chunks, hence the requirement in early versions to unzip the reference panel and summarise the VCF files.
If there are errors on the allele frequency plots, this could be because earlier versions of the University of Michigan imputation output did not contain the AF in the VCF. Contact me for an earlier version that can support this.
We are still planning a web-based version of this code to simplify usage. In the meantime, for easiest use, please use the Docker or Singularity container versions.
Perl script to extract a subset of data from the UK Biobank phenotype data download.
Requires the Stata .dct and R .tab files to run. Run without any arguments to get a description of usage.
Downloads:
Version 1.0 extract.zip
Version 1.2.2 extract-v1.2.2.zip
This program was developed to give an overview of phenotypes (columns) in a tab-delimited text file. To do this it attempts to determine the type of each column (numeric, text, mixed text/numeric or coded variables) and summarise it appropriately.
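As an illustration of that type detection, the Python sketch below classifies a column of string values; the rules and thresholds here are assumptions for illustration, not the script's actual logic:

```python
def classify_column(values, missing="NA", coded_max_levels=10):
    """Guess a phenotype column's type from its string values.
    Illustrative only; the level threshold is an assumption."""
    observed = [v for v in values if v != missing]
    if not observed:
        return "empty"

    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    # A small set of repeated values suggests a coded variable.
    if len(set(observed)) <= coded_max_levels and len(observed) > coded_max_levels:
        return "coded"
    numeric = sum(is_number(v) for v in observed)
    if numeric == len(observed):
        return "numeric"
    if numeric == 0:
        return "text"
    return "mixed"
```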
Usage: perl phenotype_qc.pl -f <phenotypeFile> [-m <missingValueIdentifier> -i <sampleIDcolumn> -c <column> -s <columnFile> -l <headerLookup> -h]
Usage: perl phenotype_qc.pl -f <phenotypeFile> [-m <missingValueIdentifier> -i <sampleIDcolumn> -c <column> -s <columnFile> -l <headerLookup> -h -p -k <#columns per chunk>]
Version 4.1 introduces a new method for extracting the data from the phenotype file. This is necessary because on large data sets (>5GB) v3.1 would not perform well, or at all. This version works on data sets of any size; however, it will only work in a Unix/Linux environment (Cygwin possible but not tested) as it requires the UNIX cut command to extract the data. The option -k specifies how many columns at a time to extract using cut; the default is 25, which is a trade-off between the number of calls to cut and the amount of memory required. If you have problems with memory usage, set this to a smaller number. Higher numbers are possible, but on test sets to date the time to complete a run increases, probably due to memory allocation.
The second new option (-p) plots all the numeric phenotypes, allowing a quick visual check of the distribution of values. Running this version, with or without plotting, will still require GD::Graph to be installed; for instructions on how to do this see here: GD::Graph
| -f | <phenotypeFile> | A tab delimited phenotype file, all rows should contain the same number of columns. |
| -i | <sampleIDcolumn> | Set the column that contains the sample ids, column numbers start from 0, default is 0. |
| -m | <missingValueIdentifier> | Set the missing value identifier, default is NA. |
| -c | <column> | Specify a single column to check, if using a header file this can be either the human readable or original header. |
| -s | <columnFile> | Provide a file containing a list of columns to extract, one per row. As with the single column checking can be either header, takes precedence over -c. |
| -l | <headerLookup> | Provide an optional header file for the data set, for use where the headers within the file are not human readable; requires two tab delimited columns, the first of which should contain the ids that match the header in phenotypeFile. |
| -h | Show the help message. | |
| Version 4.1 Only | ||
| -k | <#columns per chunk> | Specify the number of columns to extract in each chunk for processing, default if not specified is 25. |
| -p | | Specify to plot every column identified as numeric, or numeric coded; default is not to plot. |
The latest versions (v3.1 & v4.1) also produce a "fingerprint" summary of each row (assumed sample) and column (assumed phenotype); these are designed to verify whether a sample or phenotype has changed between data releases.
The fingerprint files (prefixed with FP, fingerprint phenotype, and FPS, fingerprint sample) both contain the following columns:
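Conceptually, a fingerprint is a summary that is identical when the underlying values are identical and differs when they change, so fingerprints from two releases can be compared directly. The sketch below uses a hash for illustration; it is not the algorithm the script actually uses:

```python
import hashlib

def fingerprint(values):
    """Return a short, stable fingerprint of a sequence of values.
    Identical input gives an identical fingerprint, so differing
    fingerprints between releases flag a changed column or sample.
    Illustrative only."""
    h = hashlib.md5()
    for v in values:
        h.update(str(v).encode())
        h.update(b"\t")  # separator so ("ab", "c") != ("a", "bc")
    return h.hexdigest()[:12]
```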
A Perl wrapper using Spreadsheet::Read to convert various spreadsheet formats (xls, xlsx, csv, ods) to tab delimited text files.
| -i | inputfile | Path and name of the input file. |
| -o | outputfilestem | Optional; if supplied, will form the stem for all the output. For example, using MyFile will lead to MyFile.Sheet1.txt, MyFile.Sheet2.txt (assuming there are 2 sheets named Sheet1 and Sheet2). If not supplied, the name will be the same as the input, with the sheet name, if it exists, and the file extension .txt. |
| -f | | Optional; tells the program to select the formatted values from the cells, not the raw values. Default is for raw values. |
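For the simplest input format, csv, the conversion amounts to rewriting each row tab-delimited. A minimal Python illustration of that one case (the Perl wrapper itself handles xls, xlsx and ods via Spreadsheet::Read):

```python
import csv

def csv_to_tab(in_path, out_path):
    """Rewrite a csv file as tab-delimited text. Illustrative sketch:
    quoting is not re-applied, so cells containing tabs would need
    extra handling."""
    with open(in_path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.reader(src):
            dst.write("\t".join(row) + "\n")
```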
The Chipendium tool to identify the array and strand orientation from the SNP list in a plink format bim file is currently unavailable but will be returning soon.