  • chrisbala
    Member
    • Jan 2010
    • 82

    quality scores, low mapped%?

    Hi,

    I'm trying to figure out why I am getting such a low % of mapped reads (using tophat/bowtie). I'm still experimenting with parameters in bowtie, but thus far I can't get much above 30%. I'm working with a new genome with plenty of gaps and things, so that might explain part of it. But I also don't fully understand the quality scores. Do these look funny to any of you? It seems the higher quality scores are at the ends of these reads (the Bs)? Any thoughts?

    @HWI-EAS385_0044:2:1:4:1884#0/1
    CAGCTGGNAGGCTCCACGGCGGGCGTGCGCCAAGTGCCGGGGCTGCACAACGGGAGCCAAGCCTTCCTCTTCTCA
    +HWI-EAS385_0044:2:1:4:1884#0/1
    \\X[\LTDTT_Vb__X_Z_V`XUceZcfcc_PTPKVb__\]bee]]X_BBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWI-EAS385_0044:2:1:4:1477#0/1
    GGGCCATNGCATCTGTGGGCACGGGAGGGGCCAGCACAGCCGCAGGACTACTGGCCGAGGCCCCCGCCGCGGCAG
    +HWI-EAS385_0044:2:1:4:1477#0/1
    ecdce[bE]`TTTSS\Wb\bTW^XNRMURVO\PX]Q`N^^R]SK\\\MVM\P\V^M[LPX`^BBBBBBBBBBBBB
    @HWI-EAS385_0044:2:1:4:849#0/1
    GTCGTACTCCTAGGGCTCGTGGTCGGCTGCGCCGGCTTGTCGTTTCGCTTCGCCTGCGGGCTGGGCTCCGTCGTG
    +HWI-EAS385_0044:2:1:4:849#0/1
    bXb_[`c_cc\U`BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
  • lh3
    Senior Member
    • Feb 2008
    • 686

    #2
    If you map human reads to human, good aligners can map 95% of them. If you align these reads to chimpanzee, which is 1.2-1.3% different from human, about 90% can be mapped. If you are talking about a 5-10% mismatch rate, most short-read aligners would not work well. Perhaps ssaha2 is less affected. In addition, bowtie does not do gapped alignment. Tuning bowtie's "-e" option may also help. Alternatively, you may consider assembling your reads de novo first and then aligning the contigs.
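
A rough way to see why divergence hurts so much is a simple binomial model. This is only a sketch under loud assumptions: mismatches are treated as independent, sequencing errors are ignored, and the read length of 75 and the limit of 3 mismatches are hypothetical values chosen for illustration, not bowtie's actual alignment policy.

```python
from math import comb

def fraction_within_mismatch_limit(read_length, divergence, max_mismatches):
    """Binomial probability that a read has at most max_mismatches
    differences from the reference, given a per-base divergence."""
    return sum(
        comb(read_length, m)
        * divergence ** m
        * (1 - divergence) ** (read_length - m)
        for m in range(max_mismatches + 1)
    )

# Hypothetical settings: 75 bp reads, aligner tolerates up to 3 mismatches.
for divergence in (0.0125, 0.05, 0.10):
    frac = fraction_within_mismatch_limit(75, divergence, 3)
    print(f"{divergence:.2%} divergence: {frac:.1%} of reads within limit")
```

Under this toy model the fraction of alignable reads falls steeply as divergence rises, which is in line with the point above about 5-10% mismatch rates.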

    • strob
      Member
      • Nov 2008
      • 84

      #3
      maybe handy to first read this paper in order to know what is what:

      The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

      Cock et al, 2009

      Published in NAR

      • Zigster
        Jeremy Leipzig
        • May 2009
        • 117

        #4
        I assume the spaces I see in your sequences are an artifact of copy-paste?
        --
        Jeremy Leipzig
        Bioinformatics Programmer

        • chrisbala
          Member
          • Jan 2010
          • 82

          #5
          spaces

          thanks for the responses

          yes the space is a copy-paste thing

          also, I thought I did understand the quality scores ... just checking that I am correct in my understanding ... but I think I got it ... I had the quality scores backwards ... so the Bs are actually quite bad scores? Much of my data looks like what I posted. Is this about what is expected?

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            Originally posted by Zigster:
            I assume the spaces I see in your sequences are an artifact of copy-paste?
            You can avoid this by putting [ code ] and [ /code ] tags round the example. There is a little icon with a # symbol on it on the edit box to make this easy.

            • chrisbala
              Member
              • Jan 2010
              • 82

              #7
              thanks. that is good to know...

              any thoughts about the Bs???

              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                Originally posted by chrisbala:
                thanks. that is good to know...

                any thoughts about the Bs???
                The ASCII code for B is 66, which, with the offset of 64 used in Solexa/Illumina FASTQ files, gives a quality score of 2 (very poor).

                At the start of the reads you have things like X, which is ASCII 88, thus a score of 24, which is OK.

                i.e. the starts of your reads have OK scores, but these rapidly trail off, and the middles and ends of your reads have poor scores.

                So yes, you did have the score interpretation backwards in your earlier posts.

                [I'm assuming you have Solexa or Illumina style FASTQ files here]
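
The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a full FASTQ parser: it assumes the offset of 64 described in this post, and the conversion to an error probability further assumes Phred-scaled scores (older Solexa-scaled files use a different formula). The function names are made up for this example.

```python
def decode_quality(quality_string, offset=64):
    """Turn a FASTQ quality string into integer scores (ASCII code - offset)."""
    return [ord(ch) - offset for ch in quality_string]

def phred_error_probability(score):
    """Error probability for a Phred-scaled score: 10^(-Q/10).
    Assumption: scores are Phred-scaled, not Solexa-scaled."""
    return 10 ** (-score / 10)

print(decode_quality("X"))      # [24]
print(decode_quality("B"))      # [2]
print(decode_quality("bXb_B"))  # [34, 24, 34, 31, 2]

# A score of 2 means the base call is more likely wrong than right.
print(round(phred_error_probability(2), 2))   # 0.63
print(round(phred_error_probability(24), 4))  # 0.004
```

If the file were Sanger-style FASTQ instead, the offset would be 33 and the same letters would decode to much higher scores, so it is worth confirming the variant before drawing conclusions.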

                • chrisbala
                  Member
                  • Jan 2010
                  • 82

                  #9
                  uuggh

                  that is what I feared. and I assume this is, in general, worse than what people usually get in their Illumina data?

                  • maubp
                    Peter (Biopython etc)
                    • Jul 2009
                    • 1544

                    #10
                    Having the quality scores drop off along the read length is normal. I haven't seen enough data to say for sure, but scores like yours do look worse than normal. Maybe ask your sequencing centre to have a look at it? Perhaps they had a bad run.

                    • kmcarr
                      Senior Member
                      • May 2008
                      • 1181

                      #11
                      Originally posted by maubp:
                      Having the quality scores drop off along the read length is normal. I haven't seen enough data to say for sure, but scores like yours do look worse than normal. Maybe ask your sequencing centre to have a look at it? Perhaps they had a bad run.
                      Due to the way the Illumina pipeline (or RTA) sorts the output to FASTQ files, the reads at the beginning of the file always look bad. A FASTQ file for a lane of GAII data will have the reads sorted first by tile # and then by x-coordinate. Thus at the start of the file (or really at the start of every block of reads for each tile) you will have reads from the extreme edge of the tile. Reads at the edge are inherently poorer quality. You can't make any assessment about the overall quality of the run by looking at a few non-randomly selected reads.

                      The reads at the top of my FASTQ files always have Q-scores like the ones shown here.
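
One way to avoid judging a run by the first reads in the file is to look at a random sample instead. The sketch below uses reservoir sampling so the whole file never has to be held in memory. It assumes plain four-line FASTQ records and an offset of 64; the function names and the in-memory demo data are invented for this example.

```python
import io
import random

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ."""
    while True:
        name = handle.readline().rstrip("\n")
        if not name:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield name, seq, qual

def sample_reads(handle, k, seed=0):
    """Reservoir-sample k records uniformly from the whole stream."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(read_fastq(handle)):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

def mean_quality(qual, offset=64):
    """Mean score of one quality string (assumes offset 64)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

# Demo on made-up in-memory data; with a real file, pass open(path) instead.
demo = "".join(f"@read{i}\nACGT\n+\nXXBB\n" for i in range(100))
for name, seq, qual in sample_reads(io.StringIO(demo), 5):
    print(name, mean_quality(qual))
```

A proper QC tool that plots per-cycle quality over the whole lane would be the more thorough check; this only shows the sampling idea.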

                      • maubp
                        Peter (Biopython etc)
                        • Jul 2009
                        • 1544

                        #12
                        Nice tip kmcarr

                        • chrisbala
                          Member
                          • Jan 2010
                          • 82

                          #13
                          yeah, thanks for that. the sequencing group here also pointed that out to me (I should have posted a followup). so now I am doing some real QC... (but I still think the data quality might be a bit low)
