Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Bowtie, Color-space reads, and confusing base qualities at variable sites

    I'm aligning color-space reads to the NBCI36 prebuilt index using bowtie. I'll get to my question at the end, but I ask you put up with me setting up the problem initially. Here's one read I'll use as an example for something strange I'm seeing globally in my assembly:

    from the .csfasta:
    Code:
    >611_1320_1660
    T02223020230322302111212131103031121202322002111112
    from the .qual:
    Code:
    >611_1320_1660_F3
    31 31 29 31 28 29 33 22 32 28 33 32 28 30 32 30 29 23 29 20 20 22 20 24 26 29 33 30 23 4 16 30 25 29 6 25 14 29 17 22 27 32 25 22 6 30 28 27 20 11
    After conversion to fastq using the BFast script solid2fastq 0.6.3d, I have the following:
    Code:
    @611_1320_1660
    T02223020230322302111212131103031121202322002111112
    +
    @@>@=>B7A=BA=?A?>8>55759;>B?8%1?:>':/>27<A:7'?=<5,
    So far, so good. Now, using bowtie with the following parameters:

    Code:
    bowtie_0.12.2/bowtie -t -C -S --chunkmbs 256 --nomaqround --best -n 2 -e 90 -l 28 -p 8 h_sapiens_asm_c reads_91.fastq bowtie_run3.sam
    this read aligns uniquely to chromosome 5. Here is the SAM line:

    Code:
    611_1320_1660	0	gi|51511721|ref|NC_000005.8|NC_000005	177633203	255	48M	*	0	0	CTCGGAAGCCGAGCCTGTGACTGCACCGGCACTGAAGCTCCCTGTGTG	]\Z_XW]^b][__\UURI!!%SX_`V<5OXWD@HLOHR\ZP=E[XP@,	XA:i:2	MD:Z:19C28	NM:i:1	CM:i:2
    Wonderful! Now, getting to the point, looking a little more closely at the line-up of the original color-space read sequence, the base sequence, and the two quality strings:
    Code:
       C T C G G A A G C C G A G C C T G T G A C T G C A C C G G C A C T G A A G C T C C C T G T G T G
       ] \ Z _ X W ] ^ b ] [ _ _ \ U U R I ! ! % S X _ ` V < 5 O X W D @ H L O H R \ Z P = E [ X P @ ,
      @ @ > @ = > B 7 A = B A = ? A ? > 8 > 5 5 7 5 9 ; > B ? 8 % 1 ? : > ' : / > 2 7 < A : 7 ' ? = < 5 ,
    T 0 2 2 2 3 0 2 0 2 3 0 3 2 2 3 0 2 1 1 1 2 1 2 1 3 1 1 0 3 0 3 1 1 2 1 2 0 2 3 2 2 0 0 2 1 1 1 1 1 2
    I'm having difficulty seeing how these four rows should line up. It seems clear that I've misaligned the color sequence with the base sequence, but I can't seem to find the alignment. The reason I"m looking at this is so carefully is that 4th 'A' and it's base quality,'!' (0). If my alignment is correct, the color qualities flanking this base are 20 and 20, well within the norm for the rest of the read, but only the preceding G and this A have base-qualities of 0.

    Now I"ll point out that every base in this read matches the reference, save this A. Blasting the read confirms this mapping location. Actually, there are several lines of evidence indicating the individual is heterozygous at this site with the C being the reference allele. What is notable is there are 37 other reads calling an 'A' at this site, and each of them similarly has a phred quality of 0 only at this alignment position, occasionally along with a position adjacent to it. Furthermore this seems to be a data-set wide phenomena.. every base that mismatches the reference is showing up as having 0 quality. The result of this of course is in my consensus sequence, I am calling no snps since all of the non-reference nucleotides are only supported by 0-quality bases.

    Any suggestions about what might be going on here? I don't think "all of the variable sites really are sequencing errors" is a satisfying answer. Why would the phred base quality not rise and fall with the color quality?

  • #2
    Hi keebs,

    Bowtie is not expecting to receive FASTQ files formatted the way BFAST formats them (though I should probably fix it so that it handles BFAST-style too since you're the second person who's reported this problem). I have been working from Galaxy's and BWA's conversion tools, both of which chop off the primer base (which doesn't have a quality in BFAST's output) and the first color (which does), and prints only the qualities that correspond to the remaining colors.

    Hope that helps - I will try to make Bowtie do the right thing with the BFAST tool's output in the future though.

    Ben

    Comment


    • #3
      To recap, the problem is that for every read base that doesn't match the reference, Bowtie is reporting a 0 base quality, regardless of the original color-space quality of the read. Bowtie shouldn't be altering the base-quality of a read regardless of the reference sequence, correct? The problem still isn't fixed by using the BWA csfasta to fastq conversion script. I've used PerM to successfully align the read, maintaining the base quality. Both aligners mapped the read to the same genomic position (NCBI36 is the reference).

      Here is the seqeunce & base quality pulled from the SAM file created by Bowtie (using BWA solid2fastq) :

      Code:
      TCGGAAGCCGAGCCTGTGACTGCACCGGCACTGAAGCTCCCTGTGTG	
      \Z_XW]^b][__\UURI!!%SX_`V<5OXWD@HLOHR\ZP=E[XP@,
      Here is the same read pulled from the SAM file created by PerM (operates directly on csfasta files, so no fastq conversion is necessary):

      Code:
      TCGGAAGCCGAGCCTGTGACTGCACCGGCACTGAAGCTCCCTGTGTG     
      \Z_XW]^b][__\UURIKKMSX_`V<5OXWD@HLOHR\ZP=E[XP@,
      Note the difference in the qualities at the three bases, GAC. The 'A' is the heterozygous position. It's clear Bowtie is reporting a different base quality string than PerM only at these sites (!!% vs KKM). Given that the original quality string of the color-space read (in the _QV.qual file) is more similar to the PerM version, I wonder if this is a bug in Bowtie.

      Again this is a data-set wide issue.. not just happening with one read.

      Comment


      • #4
        Originally posted by keebs42 View Post
        Note the difference in the qualities at the three bases, GAC. The 'A' is the heterozygous position. It's clear Bowtie is reporting a different base quality string than PerM only at these sites (!!% vs KKM). Given that the original quality string of the color-space read (in the _QV.qual file) is more similar to the PerM version, I wonder if this is a bug in Bowtie.
        OK, sorry for not having fully understood the question. There are two parts to the answer, one of which *does* involve a bug in Bowtie, so many thanks for pointing this out.

        1. Bowtie reports alignments in nucleotide space, and thus employs a decoding scheme to decide which nucleotides correspond to the colors in the read. The decoding scheme also calculates quality scores for the decoded nucleotides. The decoded quality score for a given nucleotide is a function of the two covering colors from the original and whether they "match" the decoded nucleotides. If one or both colors don't "match" (i.e. if the SOLiD software called the color incorrectly), then we intentionally assign the nucleotide a low quality score (often close to 0). This is sensible because that nucleotide call is actually *refuted* by one or both of the covering color calls, so we can't have much confidence in it even if it is corroborated by the other color. The exact calculation is described in the manual and is borrows from is borrowed directly from the BWA paper.

        2. There is a bug in Bowtie whereby some colors are incorrectly penalized when, in fact, the covering colors are *correct* with respect to the decoded nucleotides. This happens when the decoded nucleotide is a SNP. Your example may be one of those instances (I can't tell without seeing the colors), in which case you're 100% right that it's due to a bug. I will have a fix for this very soon; sorry for the inconvenience,

        Ben

        Comment


        • #5
          This is an example of where it's better to do Perms strategy of mapping in color space...

          Comment


          • #6
            Just wanted to report that the latest release of Bowtie, 12.3, fixes the bug I mentioned in this thread. In addition I'm getting a much higher % of color space reads to align now with Bowtie than with PerM ( 30-60% vs 20-40%), even with the --best flag.

            Also the -Q option combined with -f removes the need to convert to fastq from csfasta, making it as easy to use as PerM with SOLiD reads.

            Thanks for the work, Ben!

            Comment


            • #7
              Hello, this is Yang-Ho from USC, PerM's author. Sorry for not visiting the site for a long long long time.

              May I ask what parameter you use for Bowtie and PerM. Because With --seed F4 -v (many) you can be full sensitive to four and partial sensitive to (many) mismatches, as you want, ex: 6 ~ 10 although those mapping may not be very trustable?

              Note the full, means from end-to-end, ALL alignments within the mismatches threshold should be found. If you don't want the low quality end, you can "trim" the end.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Essential Discoveries and Tools in Epitranscriptomics
                by seqadmin




                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                Yesterday, 07:01 AM
              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              57 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              53 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              45 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              55 views
              0 likes
              Last Post seqadmin  
              Working...
              X