Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat low percent alignment (maybe) and a few other questions

    I'm analyzing my first set of RNA-seq data, ~33.7 million 100bp PE reads.

    Here's the quality from FastQC: reads1 reads2. I haven't done any trimming for adapters or low quality sequence, I was just letting Tophat deal with that, although I'll admit I don't know exactly how Tophat deals with it...I guess adapter sequences and sequences below a certain quality threshold just wouldn't align? From FastQC and looking at some of the quality scores the data is Illumina 1.5, although I see lots of "i's" so I guess it has scores up to 41, I used the "--solexa1.3-quals" option during alignment.

    I was thinking of using cutadapt to filter both adapters and low quality sequence, maybe also the first ~10bp, would this be a good idea?

    Without any trimming I aligned with Tophat to the mouse genome, here are the percent alignments I'm getting from the log files:
    Code:
    left_kept_reads.fixmap = 47.18%
    left_kept_reads_seg1.fixmap = 25.65%
    left_kept_reads_seg2.fixmap = 26.01%
    left_kept_reads_seg3.fixmap = 25.57%
    left_kept_reads_seg4.fixmap = 14.40%
    right_kept_reads.fixmap = 41.37%
    right_kept_reads_seg1.fixmap = 22.03%
    right_kept_reads_seg2.fixmap = 23.46%
    right_kept_reads_seg3.fixmap = 25.51%
    right_kept_reads_seg4.fixmap = 15.52%
    If I use the instructions here to determine the number of reads in the original fastq file (33716355) and the number of unique reads in the accepted_hits.bam file (21284339) I get 63.1% aligned.

    Do these percent alignments seem low?

    I also get tons of "malformed closure" and "multiple closures" warnings in the long_spanning_reads.log file, about 225,000 of them. What do these warnings mean?

    Here also is the output from samtools flagstat, in case it helps:

    Code:
    4245668 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    44245668 + 0 mapped (100.00%:-nan%)
    44245668 + 0 paired in sequencing
    23568872 + 0 read1
    20676796 + 0 read2
    31410058 + 0 properly paired (70.99%:-nan%)
    32757724 + 0 with itself and mate mapped
    11487944 + 0 singletons (25.96%:-nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    Any advice is appreciated, thank you!

  • #2
    With 100x100 mers that alignment rate is not uncommon in my experience. I personally wonder if 100x100 is a benefit or detriment to RNAseq data. If you aligne the same data trimed to 75x75 and 50x50 and compare the metrics with the 100x100 you will find that the percent aligned increases significantly as you reduce the size to 75x75 (~10-15% higher) and the number of unique mapping events only drops by 1%, dropping to 50x50 increases mapping by another ~5% and drops the unique mapping by 1%. I often wonder that as the read length increases we are overloading tophat. remember the average exon size in human, assuming mouse to be similar, is around 110bp with a mode of 96bp. So aligner function will always be fixed to those genome features and different read lengths may effect the outcome.

    Comment


    • #3
      From my understanding, Tophat takes all the 100bp sequences it can't align and splits them in to 25bp fragments then tries aligning them again. Wouldn't this help with the problem of trying to align 100bp sequences to exons that are not much bigger than this?

      I will try trimming the reads in a few different ways and see how the results compare.

      Do you what all the "malformed closure" and "multiple closures" warnings mean?

      Comment


      • #4
        How were your libraries prepared? Do you suspect that there may be any adapter sequence in your reads? If there is this can greatly reduce the amount of reads that align. Trimming for quality can also improve alignments.

        Comment


        • #5
          There are adapter sequences in the reads, I'm going to trim adapters and low quality sequence and align again, and see how that compares.

          Comment


          • #6
            Its unsurprising then that only half your reads aligned if there is adapter sequence present. Is the adapter sequence at the 5' or 3' end of your reads?

            Comment


            • #7
              Originally posted by Jon_Keats View Post
              With 100x100 mers that alignment rate is not uncommon in my experience. I personally wonder if 100x100 is a benefit or detriment to RNAseq data. If you aligne the same data trimed to 75x75 and 50x50 and compare the metrics with the 100x100 you will find that the percent aligned increases significantly as you reduce the size to 75x75 (~10-15% higher) and the number of unique mapping events only drops by 1%, dropping to 50x50 increases mapping by another ~5% and drops the unique mapping by 1%. I often wonder that as the read length increases we are overloading tophat. remember the average exon size in human, assuming mouse to be similar, is around 110bp with a mode of 96bp. So aligner function will always be fixed to those genome features and different read lengths may effect the outcome.
              Good idea - how well does TopHat cope if it needs to put two+ introns into an alignment - and thus align over 3 (or more) exons?

              Chris

              Comment


              • #8
                I'm not sure exactly how the libraries were prepared...someone else in my lab dissected mouse tissue and sent it frozen to the sequencing facility and the facility did the rest.

                Originally posted by chadn737 View Post
                Its unsurprising then that only half your reads aligned if there is adapter sequence present. Is the adapter sequence at the 5' or 3' end of your reads?
                They are in the middle or towards the 3' ends...is this what it looks like when you have short fragments? These are the most common adapters in the data, sequences are from the FastQC contaminant_list.txt file, the numbers are how many times they appeared in the first 100,000 sequences, there are some other "contaminant" sequences that appear less often. Here's what some of them look like, adapter sequences in red:

                Reads1, Illumina Multiplexing Adapter 1 11640
                Reads1, Illumina Gex Adapter 2.01 689
                Reads2, Illumina Paired End Adapter 1/Multiplexing Adapter 2 5563

                Should I use these adapter sequences with Cutadapt (or another trimming program) and trim everything from the adapter to the 3' end of each read? (And also trim for quality.)

                Comment


                • #9
                  Originally posted by biznatch View Post
                  They are in the middle or towards the 3' ends...is this what it looks like when you have short fragments?
                  Yes.

                  I have encountered the exact same problem before. 50-40% of my reads failed to align due to varying lengths of adapter sequence at the 3' end. However after quality trimming and adapter trimming I was able to get closer to 89-90% of my reads to align.

                  I tried a variety of approaches, but had the best results when I first used fastx quality trimmer to trim reads from the 3' end based on quality and then using cutadapt to trim adapter sequence from the 3' end. Cutadapt can trim based on quality, but for some reason this resulted in fewer aligned reads than when I first when over it with fastx quality trimmer.

                  I assume this arises when you have a size selection problem during the library prep. This is going to create eve greater headaches for you since your reads are paired end. If you look at the paired reads for those that have adapter sequence at the 3', the two pairs probably overlap significantly. Which if I understand it correctly, means that there is no insert between the two paired ends. That will probably screw things up in Tophat by throwing off the insert size.

                  Comment


                  • #10
                    Thank you so much for your help! I will try the trimming as you suggest.

                    I will also take a look at the paired reads and see if they overlap...I wonder if I could combine the paired end reads that overlap each other into non-paired end reads then align those ones separately...I think I saw a post asking something similar on here recently and someone mentioned a program that could do that, I will have to go look.

                    [Edit]: Looking at the aligned data from my first post, here are the insert size metrics from Picard. Is this normal or should it be more bell curve shaped, and or should the peak be higher? I aligned using Tophat with -r 150 --mate-std-dev 40 (insert size 150, standard deviation 40). I got these numbers by aligning 1 million of the paired reads with Bowtie then using Picard InsertSizeMetrics.
                    Last edited by biznatch; 11-30-2011, 03:44 PM.

                    Comment


                    • #11
                      Also try adding --closure-search to tophat for reads with a small insert size:

                      --closure-search Enables the mate pair closure-based search for junctions. Closure-based search should only be used when the expected inner distance between mates is small (<= 50bp)

                      Chris

                      Comment


                      • #12
                        Trying different alignment and trimming settings

                        I've been trying different alignment and trimming options, these were all tested with only 100,000 reads otherwise it would take forever.

                        Different alignment options in Tophat using non-trimmed, non-filtered reads, all of these were almost identical percent alignment (~63%):
                        • My data is Illumina 1.5 but it didn't make a difference whether I used --solexa1.3-quals or not (why??).
                        • -r (inner distance between mate pairs) ranging from 50 to 300 were all the same (based on Picard CollectInsertSizeMetrics, ~150 is the correct value).
                        • --coverage-search, --microexon-search, --butterfly-search, --closure-search all about the same.
                        • --library-type fr-unstranded (which I think is the correct one) was 63% while both fr-firststrand and fr-secondstrand were about 1.5% lower.


                        Trimming:
                        • Trimming off the last 25bp of each read using Fastx = 77.5% alignment. (This is the highest percent alignment I got)
                        The following all aligned ~50%:
                        • Trimming the "Illumina Multiplexing Adapter 1" from reads1 and "Illumina Paired End Adapter 1/Multiplexing Adapter 2" from reads2 using Cutadapt. (These adapters appear in about 11.5% and 5.5% of the left and right sequences, respectively.)
                        • Trimming the above adapters AND trimming for quality (10) using Cutadapt = 50%.
                        • Trimming 3' end just based on quality using Fastx, threshold set to either 10, 20, or 30 were all the same.


                        I have no idea why trimming based on quality or trimming off adapters would lower percent alignment, while just removing the last 25bp would increase it...unless I did something wrong in the trimming step. When I opened the files of quality trimmed reads I saw that basically all the 3' "B" quality positions were removed so it seems to have worked correctly.

                        I still get lots of "malformed closure" warnings.

                        So I'm not sure where to go from here...just trim off 25bp and use that data? We're getting another set of RNA-seq data this week so I'll see how it compares.
                        Last edited by biznatch; 12-01-2011, 02:39 PM.

                        Comment


                        • #13
                          trimming might not work as tophat may expect uniform read lengths so any read pair with a deviation might be tossed out of the analysis. Percent alignment will increase a bit if you increase to 1,000,000 reads as more junction reads will be defined unless you are using a GTF reference.

                          Comment


                          • #14
                            Tophat does not require uniform read lengths (at least, not anymore).
                            Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                            I got ~63% alignment when I aligned the entire set (~33 million reads) and with my test sets of 100,000 I get 63% when using the same Tophat settings so I don't know if using more will necessarily improve percent alignment, I'm not using a GTF reference.

                            Comment


                            • #15
                              Percent alignment vs. number of junctions

                              I tried some more conditions and this time looked at how many junctions were found, not just percent alignment, as I think this might be a better metric. Any sequences less than 25 bp after trimming were discarded. Alignment command was:

                              Code:
                              tophat -o ./1 -p 2 -r 150 --mate-std-dev 40 --solexa1.3-quals ~/chip/bowtie/indexes/mm9all+ ./reads1.fq ./reads2.fq
                              1. Unmodified.
                              2. Trimmed off adapters.
                              3. Trimmed 3' end for quality=10.
                              4. Trimmed off adapters then 3' end for quality.
                              5. Trimmed 3' end for quality then trimmed off adapters.
                              6. Removed 10 bp from 3' end.
                              7. Removed 20 bp from 3' end.
                              8. Removed 30 bp from 3' end.
                              9. Removed 40 bp from 3' end.
                              10. Removed 50 bp from 3' end.
                              11. Removed 10 bp from 5' end.
                              12. Removed 20 bp from 5' end.


                              Removing a fixed number of bp was done using FASTX, adapter and quality trimming was done using cutadapt. The following were removed sequentially, because with cutadapt if you have more than one adapter sequence it only uses the one with the best match, and some of the sequences had multiple adapters:
                              • MINT adapters (from making cDNA) from 5' end of left and right reads.
                              • Illumina Multiplexing Adapter 1 from 3' end of left reads.
                              • llumina Paired End Adapter 1/Multiplexing Adapter 2 form 3' end of right reads.
                              • I also had some longs strings of A's and T's, I'm not really sure what they're from. I trimmed any reads which were more than half (50bp) A or T.


                              Overall, trimming based on quality or adapters decreased the number of sequences that aligned. Even though a higher percentage of reads aligned as I trimmed a fixed number of bases off the 3' end (removing 50bp from 3' end had 87% alignment), the most junctions were found using the unmodified reads. In fact, the 50bp removal 87% alignment identified the lowest number of junctions. Not exactly what I expected. Maybe more of the trimmed reads are aligning but they're aligning in the same spot so it's not helping to identify more junctions?

                              So I guess from here I'll just use unmodified reads? Oh well, it'll save some time I suppose. But I'm a bit confused because trimming off adapters and/or for quality seems to be a common practice. I'm not sure what is different with my data that causes this trimming to actually give poorer results.

                              This is an example at two genes, the percent alignment is shown and then the number of junctions identified.
                              Last edited by biznatch; 12-02-2011, 12:49 PM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X