  • Tophat paired-end reads and minimum length

    Hello,

    I have a few questions about running my paired-end reads through Tophat (using Galaxy).

    1) From what I've read on this forum, it sounds like paired-end reads have to be properly mate-matched (that is, each read must have a mate, and the reads must be in the same order in the R1 and R2 files) in order for Tophat to map the mates properly. My question is, if I save any remaining unpaired mates after QC in a separate file, and run them through Tophat separately from the paired-end reads, how can I then join the single-end and paired-end data together for analysis in Cufflinks?

    2) How do I determine the standard deviation of distances between my mate-pairs? All I've got to work off of is a graph of the size distributions, which range from about 200 to about 1000 (average ~300). I want to ensure that Tophat is still able to successfully map those larger fragments.

    3) What is the shortest read length that it is reasonable to try to map? I noticed that the default Tophat setting on Galaxy is a minimum read segment length of 25, so I'm wondering whether this is a good cutoff for the minimum length of read to keep after QC.

    Any thoughts or suggestions are greatly appreciated.

  • #2
    1) Personally I wouldn’t bother with this. There might be reasons those reads didn’t map in the right orientation that would mean you’d rather just ignore them anyway. Are you hurting for coverage? Because you probably won’t really recover much this way anyway. If you didn’t have gross problems with library prep or sequencing, hopefully tophat should be aligning 80%+ of the reads correctly.

    2) You can use "bamtools stats -insert -in aligned_reads.bam" to figure this out. But remember, tophat can handle pairs that have introns between them, so you really shouldn’t worry about 1000bp. Some pairs will have >100,000bp between them. For instance, in one of my RNA-seq data sets the median insert size is just 169bp, while the average is 49kbp. Obviously introns are pushing that average way up.
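To illustrate why the median is the more useful number here, below is a minimal Python sketch. The insert sizes are made up for illustration, and `insert_size_summary` and its `max_size` cutoff are hypothetical helpers, not part of bamtools or Tophat. It shows how a single intron-spanning pair inflates the mean and standard deviation, while the median barely moves.

```python
import statistics

def insert_size_summary(sizes, max_size=None):
    """Summarise insert sizes (bp); optionally ignore pairs larger than max_size."""
    if max_size is not None:
        sizes = [s for s in sizes if s <= max_size]
    return {
        "n": len(sizes),
        "median": statistics.median(sizes),
        "mean": statistics.mean(sizes),
        "stdev": statistics.stdev(sizes),  # sample standard deviation
    }

# Hypothetical data: five pairs within an exon, one pair spanning a long intron.
sizes = [150, 160, 169, 175, 180, 100_000]

all_pairs = insert_size_summary(sizes)
near_pairs = insert_size_summary(sizes, max_size=1000)  # cutoff is an assumption

print("all pairs: ", all_pairs)
print("<= 1000 bp:", near_pairs)
```

If you want a standard deviation that reflects the library fragments rather than the introns, computing it only over pairs below some cutoff, as in the second call, is one reasonable approach. The cutoff itself is a judgment call.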

    3) You really shouldn’t have it shorter than about 50bp for detecting splicing. Tophat runs by trying to align the whole read first, then breaking it up into pieces (the default is 4 segments of 25bp for 100bp reads). If you only have one segment's worth (i.e. leaving it at 25bp and having <50bp read length), the splice mapping is basically worthless. You can set that 25bp to be 20 or something, and get down to a total read length of 40bp, but remember that the shorter the read length, the fewer unique mappings you’re going to have. So your alignments will get progressively worse as you drop that down. The option is set with "--segment-length". Personally, I wouldn’t do much trimming though. You can get rid of adapters, but in my experience quality trimming really doesn’t help when you’re aligning to a genome. These aligners are already quality aware, so mismatches in poor quality regions don’t hurt you much.
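The arithmetic in point 3 can be sketched in a few lines of Python. This is a simplified model of the rule of thumb described above (whole segments only, at least two segments needed to straddle a junction), not a description of Tophat's internals. The function names are hypothetical.

```python
def count_segments(read_length, segment_length=25):
    """Number of whole segments a read of read_length can be split into."""
    return read_length // segment_length

def can_detect_splicing(read_length, segment_length=25):
    """Rule of thumb: a read needs at least two segments to span a junction."""
    return count_segments(read_length, segment_length) >= 2

for read_length, segment_length in [(100, 25), (50, 25), (40, 25), (40, 20)]:
    print(
        f"read={read_length}bp segment={segment_length}bp "
        f"segments={count_segments(read_length, segment_length)} "
        f"splice-capable={can_detect_splicing(read_length, segment_length)}"
    )
```

Under this model, a 40bp read is only useful for splice detection once the segment length is lowered to 20, which matches the trade-off described above: shorter segments rescue shorter reads at the cost of fewer unique mappings.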

    • #3
      Originally posted by Wallysb01
      1) Personally I wouldn’t bother with this. There might be reasons those reads didn’t map in the right orientation that would mean you’d rather just ignore them anyway. Are you hurting for coverage? Because you probably won’t really recover much this way anyway. If you didn’t have gross problems with library prep or sequencing, hopefully tophat should be aligning 80%+ of the reads correctly.
      I think you misunderstood the OP. He meant having pairs of reads which have become mate-less after QC, not after mapping.

      So, one mate will pass QC but the other will not, leaving you with a set of single-end reads that lost their partner for sequencing quality reasons but could still align on their own.

      What to do in those cases?
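For the bookkeeping side of this, here is a minimal Python sketch of separating QC survivors into mate-matched pairs and orphans. It assumes the reads have already been loaded into dicts keyed by a read ID shared between mates. The function name and the toy records are hypothetical, and the sketch says nothing about whether the orphans are worth aligning.

```python
def split_pairs_and_orphans(r1_reads, r2_reads):
    """Split QC survivors into mate-matched pairs and orphaned reads.

    r1_reads, r2_reads: dicts mapping a shared read ID to a read record.
    Returns (paired_r1, paired_r2, orphans); the two paired lists are in
    the same order, so they can be written out as matching R1 and R2 files.
    """
    paired_r1, paired_r2, orphans = [], [], []
    for read_id, record in r1_reads.items():
        if read_id in r2_reads:
            paired_r1.append(record)
            paired_r2.append(r2_reads[read_id])
        else:
            orphans.append(record)  # R1 survived QC, its mate did not
    for read_id, record in r2_reads.items():
        if read_id not in r1_reads:
            orphans.append(record)  # R2 survived QC, its mate did not
    return paired_r1, paired_r2, orphans

# Toy example: read "b" lost its R2 mate, read "c" lost its R1 mate.
r1 = {"a": "a/1", "b": "b/1", "d": "d/1"}
r2 = {"a": "a/2", "c": "c/2", "d": "d/2"}

paired_r1, paired_r2, orphans = split_pairs_and_orphans(r1, r2)
print(paired_r1, paired_r2, orphans)
```

The two paired lists would go to Tophat as a paired-end run and the orphans as a separate single-end run, which is the setup the original question describes.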
