
  • quality scores, low mapped%?

    Hi,

    I'm trying to figure out why I am getting such a low % of mapped reads (using tophat/bowtie). I'm still experimenting with parameters in bowtie, but thus far I can't get much above 30%. I'm working with a new genome with plenty of gaps and things, so that might explain part of it. But I also don't fully understand the quality scores. Do these look funny to any of you? It seems that the higher quality scores are at the end of these reads (the Bs)? Any thoughts?

    @HWI-EAS385_0044:2:1:4:1884#0/1
    CAGCTGGNAGGCTCCACGGCGGGCGTGCGCCAAGTGCCGGGGCTGCACAACGGGAGCCAAGCCTTCCTCTTCTCA
    +HWI-EAS385_0044:2:1:4:1884#0/1
    \\X[\LTDTT_Vb__X_Z_V`XUceZcfcc_PTPKVb__\]bee]]X_BBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWI-EAS385_0044:2:1:4:1477#0/1
    GGGCCATNGCATCTGTGGGCACGGGAGGGGCCAGCACAGCCGCAGGACTACTGGCCGAGGCCCCCGCCGCGGCAG
    +HWI-EAS385_0044:2:1:4:1477#0/1
    ecdce[bE]`TTTSS\Wb\bTW^XNRMURVO\PX]Q`N^^R]SK\\\MVM\P\V^M[LPX`^BBBBBBBBBBBBB
    @HWI-EAS385_0044:2:1:4:849#0/1
    GTCGTACTCCTAGGGCTCGTGGTCGGCTGCGCCGGCTTGTCGTTTCGCTTCGCCTGCGGGCTGGGCTCCGTCGTG
    +HWI-EAS385_0044:2:1:4:849#0/1
    bXb_[`c_cc\U`BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

  • #2
    If you map human reads to human, good aligners can map 95% of them. If you align these reads to chimpanzee, which is 1.2-1.3% different from human, about 90% can be mapped. If you are talking about a 5-10% mismatch rate, most short-read aligners would not work well. Perhaps ssaha2 is less affected. In addition, bowtie does not do gapped alignment. Tuning bowtie's "-e" option may also help. Alternatively, you may consider assembling your reads de novo first and then aligning the contigs.



    • #3
      It may be handy to read this paper first, to get the different FASTQ quality encodings straight:

      The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

      Cock et al, 2009

      Published in NAR



      • #4
        i assume the spaces i see in your sequences are an artifact of copy-paste?
        --
        Jeremy Leipzig
        Bioinformatics Programmer



        • #5
          spaces

          thanks for the responses

          yes the space is a copy-paste thing

          also, i thought I did understand the quality scores ... just checking that I am correct in my understanding ... but I think I got it ... I had the quality scores backwards ... so Bs are actually quite bad scores? Much of my data looks like what I posted. Is this about what is expected?



          • #6
            Originally posted by Zigster
            i assume the spaces i see in your sequences are an artifact of copy-paste?
            You can avoid this by putting [ code ] and [ /code ] tags round the example. There is a little icon with a # symbol on it on the edit box to make this easy.



            • #7
              thanks. that is good to know...

              any thoughts about the Bs???



              • #8
                Originally posted by chrisbala
                thanks. that is good to know...

                any thoughts about the Bs???
                The ASCII code for B is 66, which, with the offset of 64 used in Solexa/Illumina FASTQ files, gives a quality score of 2 (very poor).

                At the start of the reads you have things like X, which is ASCII 88, thus a score of 24, which is OK.

                i.e. the start of each read has OK scores, but this rapidly trails off, and the middle and end of your reads have poor scores.

                So yes, you did have the score interpretation backwards in your earlier posts.

                [I'm assuming you have Solexa or Illumina style FASTQ files here]
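The arithmetic above can be sketched in a few lines of Python. This is only an illustration: `decode_phred64` is a made-up helper name, and it assumes the scores are Phred-style with an ASCII offset of 64 (the older Solexa scale shares that offset but maps to error probabilities differently, so this would not apply as-is).

```python
def decode_phred64(quality_string, offset=64):
    """Turn a FASTQ quality string into a list of integer scores.

    Assumes an ASCII offset of 64 (Illumina-style); pass offset=33
    for Sanger-style files.
    """
    return [ord(ch) - offset for ch in quality_string]


# Start of the quality line of the third read posted above, plus a few Bs.
# A raw string is used so the backslash is kept literally.
scores = decode_phred64(r"bXb_[`c_cc\U`BBB")
print(scores)
```

The first 13 values come out in the 20s and 30s, and every B decodes to 2, which matches the "OK start, poor tail" pattern described above.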



                • #9
                  uuggh

                  that is what i feared. and I assume this is, in general, worse than what people usually get in their Illumina data?



                  • #10
                    Having the quality scores drop off with the read length is normal. I haven't seen enough data to say for sure, but scores like yours do look worse than normal. Maybe ask your sequencing centre to have a look at it? Perhaps they had a bad run.



                    • #11
                      Originally posted by maubp
                      Having the quality scores drop off with the read length is normal. I haven't seen enough data to say for sure, but scores like yours do look worse than normal. Maybe ask your sequencing centre to have a look at it? Perhaps they had a bad run.
                      Due to the way the Illumina pipeline (or RTA) sorts the output to FASTQ files, the reads at the beginning of the file always look bad. A FASTQ file for a lane of GAII data will have the reads sorted first by tile # and then by x-coordinate. Thus at the start of the file (or really at the start of every block of reads for each tile) you will have reads from the extreme edge of the tile. Reads at the edge are inherently poorer quality. You can't make any assessment of the overall quality of the run by looking at a few non-randomly selected reads.

                      The reads at the top of my FASTQ files always have Q-scores like the ones shown here.
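One way to look at a less biased subset than the top of the file is to draw reads at random from the whole file. The following is a rough sketch only, not part of any Illumina tool: the function names are made up, it assumes plain four-line FASTQ records and an ASCII offset of 64, and it uses reservoir sampling so the file is read once without being held in memory.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        header, seq, _plus, qual = (line.rstrip("\n") for line in record)
        yield header, seq, qual


def sample_reads(records, k, seed=0):
    """Reservoir-sample up to k records uniformly from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir


def mean_quality(qual, offset=64):
    """Mean score of one quality string, assuming the given ASCII offset."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)
```

For example, passing `read_fastq(open("lane.fastq"))` (a placeholder file name) to `sample_reads(..., 10000)` and averaging `mean_quality` over the result gives a quick, if crude, impression of the run that is not dominated by tile-edge reads.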



                      • #12
                        Nice tip kmcarr



                        • #13
                          yeah, thanks for that. the sequencing group here also pointed that out to me (I should have posted a followup). so now i am doing some real QC... (but i still think the data quality might be a bit low)

