File Formats
BAM format
BAM is the compressed binary version of the SAM (Sequence Alignment/Map) format. A BAM file is accompanied by an index file, which gives fast access to small sections of the data.
Please note that if you are attaching a BAM file from a URL, you will need to have an index file with the extension .bam.bai in the same directory as your data file, with the same name.
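As a minimal sketch of the naming rule above, the following Python helper derives the index URL expected for a given BAM URL. The function name and the example URL are hypothetical, and the sketch assumes the URL ends in .bam with no query string.

```python
def expected_bam_index_url(data_url):
    """Return the index URL expected next to a BAM file URL.

    Assumption: the URL ends in '.bam' and carries no query string.
    """
    if not data_url.endswith(".bam"):
        raise ValueError("expected a URL ending in .bam: %r" % data_url)
    # The index has the same name as the data file, plus '.bai'.
    return data_url + ".bai"


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(expected_bam_index_url("http://example.org/data/my_file.bam"))
```

Both URLs must be reachable in the same directory for the file to be attached.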
Additional information about SAM/BAM is available at the SAMtools development site.
BigBED format
BigBED is an indexed binary format created from standard BED files using the bedToBigBed utility program, which is part of the UCSC Genome Browser utilities. It allows large datasets to be accessed much faster than with a conventional BED file.
BigWig format
The BigWig format is designed for dense, continuous data that is intended to be displayed as a graph. Files can be created from WIG or BedGraph files using the appropriate UCSC utility program (wigToBigWig or bedGraphToBigWig).
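The choice of conversion utility can be sketched in Python as follows. This is an illustration rather than a definitive recipe: it assumes the UCSC programs wigToBigWig and bedGraphToBigWig are installed, that the input type can be guessed from the file extension, and that a chromosome sizes file is available. The function name and all file names are hypothetical.

```python
def bigwig_command(input_path, chrom_sizes, output_path):
    """Build the command line that converts a WIG or BedGraph file to BigWig.

    Assumption: the file extension identifies the input format.
    """
    lower = input_path.lower()
    if lower.endswith(".wig"):
        tool = "wigToBigWig"
    elif lower.endswith((".bedgraph", ".bg")):
        tool = "bedGraphToBigWig"
    else:
        raise ValueError("unrecognised input extension: %r" % input_path)
    # Both utilities take: input file, chromosome sizes file, output file.
    return [tool, input_path, chrom_sizes, output_path]


if __name__ == "__main__":
    print(" ".join(bigwig_command("signal.bedGraph", "chrom.sizes", "signal.bw")))
```

The returned list can be passed to subprocess.run to perform the conversion.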
VCF format
The VCF format is a tab-delimited format for storing variant calls and individual genotypes. It is able to store all variant calls, from single nucleotide variants to large-scale insertions and deletions.
More information on this format can be obtained from the SAMtools specifications site.
Please note that if you are attaching a large VCF file from a URL, you will need to have an index file with the extension .vcf.gz.tbi in the same directory as your data file, with the same name.
To produce the indexed VCF file with the .gz.tbi extension, follow these steps:
- Compress your VCF file using bgzip
- Index the vcf.gz file using tabix. You will need to pass the option -p vcf to tabix, for example "/usr/bin/tabix -p vcf my_file.vcf.gz"
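The two steps above can be sketched in Python. This assumes bgzip and tabix are installed and on the PATH; the function names and my_file.vcf are hypothetical and used for illustration only.

```python
import subprocess


def index_commands(vcf_path):
    """Build the bgzip and tabix command lines for an uncompressed VCF file."""
    gz_path = vcf_path + ".gz"
    return [
        ["bgzip", vcf_path],              # replaces my_file.vcf with my_file.vcf.gz
        ["tabix", "-p", "vcf", gz_path],  # writes my_file.vcf.gz.tbi
    ]


def compress_and_index(vcf_path):
    """Run both steps and return the name of the resulting index file."""
    for command in index_commands(vcf_path):
        # check=True raises CalledProcessError if a step fails.
        subprocess.run(command, check=True)
    return vcf_path + ".gz.tbi"


if __name__ == "__main__":
    for command in index_commands("my_file.vcf"):
        print(" ".join(command))
```

Note that bgzip, by default, removes the uncompressed input file once the .gz file has been written.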
VEP format
VEP (Variant Effect Predictor) format is a simple whitespace-separated format (columns may be separated by space or tab characters), containing five required columns plus an optional identifier column:
- chromosome - just the name or number, with no 'chr' prefix
- start
- end
- allele - pair of alleles separated by a '/', with the reference allele first
- strand - defined as + (forward) or - (reverse).
- identifier - this identifier will be used in the VEP's output. If not provided, the VEP will construct an identifier from the given coordinates and alleles.
1 881907 881906 -/C +
5 140532 140532 T/C +
12 1017956 1017956 T/A +
2 946507 946507 G/C +
14 19584687 19584687 C/T -
19 66520 66520 G/A + var1
8 150029 150029 A/T + var2
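The column layout described above can be checked with a small parser. The following Python sketch is not part of the VEP itself; the names VepRecord and parse_vep_line are hypothetical, and the validation is limited to what the format description states.

```python
from typing import NamedTuple, Optional


class VepRecord(NamedTuple):
    chromosome: str
    start: int
    end: int
    allele: str
    strand: str
    identifier: Optional[str]


def parse_vep_line(line):
    """Parse one whitespace-separated line of VEP input."""
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) not in (5, 6):
        raise ValueError("expected 5 or 6 columns, got %d" % len(fields))
    chromosome, start, end, allele, strand = fields[:5]
    if "/" not in allele:
        raise ValueError("alleles must be separated by '/': %r" % allele)
    if strand not in ("+", "-"):
        raise ValueError("strand must be + or -: %r" % strand)
    # The sixth column is optional; None means no identifier was supplied.
    identifier = fields[5] if len(fields) == 6 else None
    return VepRecord(chromosome, int(start), int(end), allele, strand, identifier)


if __name__ == "__main__":
    print(parse_vep_line("19 66520 66520 G/A + var1"))
```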
An insertion (of any size) is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand of chromosome 8 is indicated as follows:
8 12601 12600 -/C +
A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 of the reverse strand of chromosome 8 will be:
8 12600 12602 CGT/- -
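The coordinate conventions for insertions and deletions can be expressed as a short Python sketch. The function name and the label "substitution" for all remaining cases are assumptions made for illustration, not terms defined by the format.

```python
def classify_variant(start, end, allele):
    """Classify a VEP input record from its coordinates and allele string."""
    reference, alternative = allele.split("/", 1)
    # Insertion: start coordinate = end coordinate + 1, reference allele '-'.
    if start == end + 1 and reference == "-":
        return "insertion"
    # Deletion: exact coordinates of the deleted bases, alternative allele '-'.
    if alternative == "-":
        return "deletion"
    # Assumption: everything else is treated as a substitution here.
    return "substitution"


if __name__ == "__main__":
    print(classify_variant(12601, 12600, "-/C"))
    print(classify_variant(12600, 12602, "CGT/-"))
```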