An Introduction to Illumina Genotyping Array VCF

The Illumina Genotyping Array Pipeline v1.11.0 generates a Variant Call Format (VCF) output with genotype information and data processing details. The VCF adheres to the VCF 4.2 specification and includes additional unique fields and attributes specific to the Arrays pipeline.

This document outlines the distinctive VCF fields and attributes of the Array pipeline that are not found in the standard VCF specification. For more information on the pipeline, refer to the Illumina Genotyping Array Pipeline Overview.

Table of Contents

Meta information fields

Each VCF has meta information fields with attributes that generally describe the processing of the sample in the array.

arrayType – Name of the genotyping array that was processed.
autocallDate – Date that the genotyping array was processed by ‘autocall’ (aka gencall), the Illumina genotype calling software
autocallGender – Gender (sex) that autocall determined for the sample processed
autocallVersion – Version of the autocall/gencall software used
chipWellBarcode – Chip well barcode (a unique identifier for sample as processed on a specific location on the Illumina genotyping array)
clusterFile – Cluster file used
extendedIlluminaManifestVersion – Version of the ‘extended Illumina manifest’ used by the VCF – generation software.
extendedManifestFile – File name of the ‘extended Illumina manifest’ used by the VCF generation software
fingerprintGender – Gender (sex) determined using an orthogonal fingerprinting technology, populated by an optional parameter used by the VCF generation software
gtcCallRate – GTC call rate of the sample processed that is generated by the autocall/gencall software and represents the fraction of callable loci that had valid calls
imagingDate – Creation date for the chip well barcode IDATs (raw image scans)
manifestFile – Name of the Illumina manifest (.bpm) file used by the VCF generation software
sampleAlias – Sample name

Header line columns

Following the meta-information, there are 8 standard header line columns:

#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO

These columns are standard to each VCF, but the Illumina Genotyping Array Pipeline has an additional FORMAT column. Pipeline-specific column attributes are described below:

FILTER

DUPE – Filter applied if there are multiple rows in the VCF for the same loci and alleles. For example, if there are two or more rows that share the same chromosome, position, ref allele and alternate alleles, all but one of them will have the ‘DUPE’ filter set
TRIALLELIC – Filter applied if there is a site at which there are two alternate alleles and neither of them is the same as the reference allele
ZEROED_OUT_ASSAY – Filter applied if the variant at the site was ‘zeroed out’ in the Illumina cluster file; typically applied when the calls at the site are intentionally marked as unusual. Genotypes called for sites that are ‘zeroed out’ will always be no-calls

FORMAT (genotype)

GT – GENOTYPE; standard field described in the VCF specification
IGC – Illumina GenCall Confidence Score. Measure of the call confidence
X – Raw X intensity as scanned from the original genotyping array
Y – Raw Y intensity as scanned from the original genotyping array
NORMX – Normalized X intensity
NORMY – Normalized Y intensity
R – Normalized R Value (one of the polar coordinates after the transformation of NORMX and NORMY)
THETA – Normalized Theta value (one of the polar coordinates after the transformation of NORMX and NORMY)
LRR – Log R Ratio
BAF – B Allele Frequency

INFO

This column contains attributes specific to the array’s probe.

AC – Allele Count in genotypes for each ALT allele; standard field in the VCF specification
AF – Allele Frequency; standard field in the VCF specification
AN – Allele Number; standard field in the VCF specification
ALLELE_A – Allele A as annotated in the Illumina manifest (a *suffix indicates this is the reference allele)
ALLELE_B – Allele B as annotated in the Illumina manifest (a *suffix indicates this is the reference allele)
BEADSET_ID – BeadSet ID; an Illumina identifier used for normalization
GC_SCORE – Illumina GenTrain Score; a quality score describing the probe design
ILLUMINA_BUILD – Genome Build for the design probe sequence, as annotated in the Illumina manifest
ILLUMINA_CHR – Chromosome of the design probe sequence as annotated in the Illumina manifest
ILLUMINA_POS – Position of the design probe sequence (on ILLUMINA_CHR) as annotated in the Illumina manifest
ILLUMINA_STRAND – Strand for the design probe sequence as annotated in the Illumina manifest
PROBE_A – Allele A probe sequence as annotated in the Illumina manifest
PROBE_B – Allele B probe sequence as annotated in the Illumina manifest; only present on strand ambiguous SNPs
SOURCE – Probe source as annotated in the Illumina manifest
refSNP – dbSNP rsId for this probe