Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BWA mapping for SOLiD single-end reads

    Dear NGS community,

    although I this is my first post here I would like to acknowledge you all for the helping in all the tough NGS analysis.

    I have SOLiD single-end reads 50 bp long with good quality that I want to map into a reference genome of approximately 12.6 Mb. The reads are all in fasqcssanger following the Galaxy pipeline. I tried several settings in BWA and all of them gave different results and I'm really struggling to find the most accurate/specific of them (if this is possible??) in order to proceed to SNP calling. That's the reason of this post.

    Setting the default parameters (-n 0.04; -l -1; -k2) only ~51% of the reads were mapped.

    By tuning the parameters I was able to improve the fraction of mapped reads:

    allowing for more mismatches and setting -k to 3, more reads were mapped (-n 10; -l -1; -k 3): ~71% of mapped reads.

    tuning the seed to a small number, also increased this number (-n 10; -l 25; -k3): ~77% of reads mapped.

    I'm now waiting for the results of disabling the seed (-n 10; -l 1000; -k3): ???

    However, I'm not really sure with what option to follow with. Can please someone advise me about this issue. What parameters are more appropriate for downstream SNP calling?

    Thanks to all. This forum has been a real help for my NGS adventure.

    PAL

  • #2
    Well, of course allowing more mismatches will have more reads aligning. But if only 50% of your reads aligned on default, that doesn't sound like good quality to me. That, or your reference genome is not a very good match to your sample. You might have better luck making a de novo assembly, and aligning to that.

    Comment


    • #3
      I think 50% is "okay".
      I think the problem boils down the the Solid system making calls based on transition. Once there's a bad transition, the rest of the read is junk. With Illumina, a bad quality read doesn't ruin the rest of the read. So with solid, almost of the read needs to be good to align well. I am open to being schooled better on this.

      "Forcing" more alignments by allowing more mismatches is going to mean more false positives. Requiring fewer mismatches will mean missing some real calls. It's a trade-off.

      Alignment is not perfect. Genomic references are not perfect. SNP calling is not perfect. De-novo assembly is not perfect.

      Someday we'll look back and laugh at the current state of affairs. Right now it's fustrating and focusing our cameras to get the true picture remains very tricky.

      Comment


      • #4
        @swbarnes2 the reference genome is finished and is from the same species, but making a de novo assembly at first with 50 bp SOLiD reads don't look much pretty for me...

        Richard is right, 50% could not be so bad if the alignment is of good quality. From a review that I have read from Li and Homer it seems true that following a sequencial error in the SOLiD machine all the bases can be wrong if they are converted to base calls prior of mapping (but I think BWA maps in color code mode and this problem might not occur).

        I'm now testing for only 7 and 5 mismatches in the whole read to make results comparable.

        Although I'm still confused about seeding, to do or not to do the seeding? what's the best for accuracy? For what I understand in BWA manual the algorithm can look for mismatches in the seed and if this number is below the threshold it will place the read in that position. However, quality scores start to decrease around from the middle of the read (whatever technology they are). And if from this region there are more mismatches than the allowed by the -n setting, the read is rejected from the position where it was aligned by the seed? (sorry if I'm interpreting things wrongly)

        "Alignment is not perfect. Genomic references are not perfect. SNP calling is not perfect. De-novo assembly is not perfect.
        Someday we'll look back and laugh at the current state of affairs. Right now it's fustrating and focusing our cameras to get the true picture remains very tricky."

        I completely agree with you, and hopefully that day will come, however, doesn't seem to be so soon as desirable...

        Comment


        • #5
          Originally posted by Richard Finney View Post
          I think 50% is "okay".
          I think the problem boils down the the Solid system making calls based on transition. Once there's a bad transition, the rest of the read is junk. With Illumina, a bad quality read doesn't ruin the rest of the read. So with solid, almost of the read needs to be good to align well. I am open to being schooled better on this.
          As Pedro said bwa does the mapping in colorspace so this is not an issue. Ther are two issues with seed-based aligners for SOLiD though. Since raw error rates are much higher than for other competing instruments, it is not uncommon that a read has more than two mismatches in the seed. Also, since a SNP gives two cs differences the alternative allele gets underrepresented if the SNP is in the seed.

          Bwa defaults are for Illumina and needs to be changed for SOLiD, I found something like -l 23 -n 7 to give good results (shortening the seed too much or allowing more mismatches in the seed makes it much slower). For SNP-calling you may want to trim the reads first though.

          Comment


          • #6
            Yes, that's what I also experienced. I had to trim both the 5' end (2 bases) and the 3' end (3 bases) because there were many reads with N bases in these positions.

            The results that I've got for this data were the following (and hope that this could help others also):

            defaults: 52% mapped reads
            -n5 -l-1 -k3: 63.3% mapped reads
            -n5 -l25 -k3: 64.2% mapped reads (no big difference for the previous)
            -n7 -l-1 -k3: 68.8% mapped reads
            -n7 -l25 -k3: 71.9% mapped reads
            -n10 -l-1 -k3: 71.5% mapped reads
            -n10 -l25 -k3: 77.8% mapped reads

            But at a first sight it seems to me that the quality of the mapping decreases considerably for -n10, and so I will not base my further analysis on these.

            Comment


            • #7
              Thanks for a useful thread. To get back to your initial question on how the settings affect SNP calling, why not call SNPs on three or four of the "better quality" alignment settings, and compare the VCFs.

              I think Heng Li stated once bwa will work ok with colour space data but more mismatches should be allowed. Iterative read trimming from the 3' end seems to be what allows Lifescope and NovoalignCS to align more reads than BWA using default settings.

              Lastly : did you try SAET ? I am getting 1-5% more alignments for bacteria after processing the reads using SAET.

              Comment


              • #8
                Hi PedroA,

                And everyone on this great thread!

                What did you use to trim your SOLiD reads? Is there anything you recommend for assessing the quality of my reads, such as FastQC for Illumina? I am aware that I can use this application for the SOLiD reads, but I've read elsewhere that the chemistry of SOLiD gives for a very weird-looking per-base quality boxblot -- so what is the standard for a good quality run?

                Thanks!!

                Carmen

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Today, 08:47 AM
                0 responses
                12 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                60 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                59 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                54 views
                0 likes
                Last Post seqadmin  
                Working...
                X