Hi,
In a nutshell: a recipe to show SOLID3 reads in IGV
We wish to visualize SOLiD3 reads on the human genome, but we've been unable to use ABI's SAB (performance issues). To instead use IGV (https://www.broadinstitute.org/igv/) which - in our hands - is much more scalable, we needed to convert the ABI v2 flavor gff to SAM/BAM (the alignment optimized BAM format might by itself explain some of the performance superiority of IGV over SAB).
We therefore tried to install the ABI "GffToSam" utility of the "SOLiD BaseQV Tool" package.
First, to compile GffToSam on debian based Linux, one has to add the following include line to the c++ source code:
Second, when run against SOLiD3 gff alignment files, the GffToSam utility crashes with:
After much time debugging, it appears that the SOLID3 gff files contain color_seq and color_qual definitions that are mostly longer than the sequence definition for each read. It seems the GffToSam utility never anticipated this, leading to "out_of_range" errors. Maybe the gff file specs have changed in SOLID3?
Anyhow, here's a code hack of GffToSam.cpp around line 1220 that has allowed me to successfully produce SAM format files for SOLID3. Has anyone encountered this issue, as I'd like to make sure that at least the approach is valid (ie shortening color_seq and color_qual to seq length):
Nitty gritty details of gff to bam conversion:
-install SOLiD_BaseQV_Tool_1.0.0 from http://solidsoftwaretools.com/gf/project/sam/frs/
-edit GffToSam.cpp as described above, compile
-convert SOLID gff:
-correct indexed chromosome names, eg with a fix.pl perl script:
-install "samtools" from http://samtools.sourceforge.net/samtools.shtml
-define chromosome reference sequence (here chr17) from FASTA file:
-convert SAM file to BAM:
-sort BAM file:
-index BAM file:
-load in IGV with File->Load "alignement_chr17_ASG05_fixed17.sam.sorted.bam"
There must be a better way, but this works:
In a nutshell: a recipe to show SOLID3 reads in IGV
We wish to visualize SOLiD3 reads on the human genome, but we've been unable to use ABI's SAB (performance issues). To instead use IGV (https://www.broadinstitute.org/igv/) which - in our hands - is much more scalable, we needed to convert the ABI v2 flavor gff to SAM/BAM (the alignment optimized BAM format might by itself explain some of the performance superiority of IGV over SAB).
We therefore tried to install the ABI "GffToSam" utility of the "SOLiD BaseQV Tool" package.
First, to compile GffToSam on debian based Linux, one has to add the following include line to the c++ source code:
Code:
#include <cstdlib>
Code:
GffToSam -i alignement_chr17_ASG05_200000.v2.gff -id S1 -sm ASG05bis -o test.sam Total number of gff records: 199920 Starting frag gff to sam conversion.. Total number of chromosomes: 1 Chr Indexes: 0 terminate called after throwing an instance of 'std::out_of_range' what(): basic_string::at Abandon
After much time debugging, it appears that the SOLID3 gff files contain color_seq and color_qual definitions that are mostly longer than the sequence definition for each read. It seems the GffToSam utility never anticipated this, leading to "out_of_range" errors. Maybe the gff file specs have changed in SOLID3?
Anyhow, here's a code hack of GffToSam.cpp around line 1220 that has allowed me to successfully produce SAM format files for SOLID3. Has anyone encountered this issue, as I'd like to make sure that at least the approach is valid (ie shortening color_seq and color_qual to seq length):
Code:
/* Read Gff attributes */ while (getline(linestream, token, ';')) // Attributes { if (token.substr(0,2).compare("b=")==0) { seq = token.substr(2); seq = toUpperCase (seq); bvalid = true; } else if (token.substr(0,2).compare("g=")==0) { color_seq = token.substr(2); //hack start, force color_seq same length as seq if (strand.compare("-")==0) { start_pos += color_seq.length() - seq.length(); } color_seq = color_seq.substr(0,seq.size());//hack end gvalid = true; } else if (token.substr(0,2).compare("q=")==0) { //hack start, force color_qual same length as seq color_qual = token.substr(2); istringstream linestream2 (color_qual); string token2; int nbqv = 0; string color_qual2 = ""; while (getline(linestream2, token2, ',')) // QV's { nbqv++; if (nbqv <= seq.size()) color_qual2 += ',' + token2; } color_qual = color_qual2.substr(1);//hack end //color_qual = token.substr(2); qvalid = true; } else if (token.substr(0,2).compare("s=")==0) { annotations = token.substr(2); } else if (token.substr(0,2).compare("r=")==0) { ref_changes = token.substr(2); } else if (token.substr(0,2).compare("c=")==0) { mates_att = token.substr(2); cvalid = true; } else if (token.substr(0,2).compare("i=")==0) { ref_index = token.substr(2); rvalid = true; } }
-install SOLiD_BaseQV_Tool_1.0.0 from http://solidsoftwaretools.com/gf/project/sam/frs/
-edit GffToSam.cpp as described above, compile
-convert SOLID gff:
Code:
GffToSam -i alignement_chr17_ASG05.v2.gff -o alignement_chr17_ASG05.sam -id S1 -sm ASG05
Code:
#!/usr/bin/perl while ($line = <>){ $line =~ s/^(\S+\s\d+)\s\S+/$1\tchr17/; print $line; }
Code:
./fix.pl alignement_chr17_ASG05.sam > alignement_chr17_ASG05_fixed17.sam
-define chromosome reference sequence (here chr17) from FASTA file:
Code:
samtools faidx chr17.fa
Code:
samtools import chr17.fa.fai alignement_chr17_ASG05_fixed17.sam alignement_chr17_ASG05_fixed17.sam.bam
Code:
samtools sort alignement_chr17_ASG05_fixed17.sam.bam alignement_chr17_ASG05_fixed.sorted
Code:
samtools index alignement_chr17_ASG05_fixed17.sam.sorted.bam
There must be a better way, but this works:
Comment