  • tophat/bowtie find no alignment on long (> 250 bp) reads

    For the first time, I am dealing with 300bp data generated from Illumina's MiSeq. It's paired end data, but I'm currently treating it as unpaired data for simplicity's sake. I've dealt with 75bp and 150bp data from similar experiments before without much difficulty. I prepared the fastq files as I usually do for TopHat and ran as usual, but I am finding only around 25% alignment (for read 1; for read 2, which has lower quality scores, alignments are only around 10%).

    Incidentally, I checked and the problem is indeed with the alignment (thus Bowtie2), not "tophat" per se.

    I checked the unmapped.bam file and sorted the reads by frequency. All the top sequences gave very good alignments when I threw them into a standard nucleotide BLAST (human), so these are not junk that shouldn't be expected to align. The rejected sequences all looked long to me, so I looked at the length distribution of reads in the unmapped vs. the accepted_hits files; sure enough, 60% of unmapped reads were in the 250-300bp range, while only 10% of accepted hits were in the same size range. So, clearly TopHat is having issues with long reads.
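
The length comparison described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the BAM files have first been converted to SAM text (for example with `samtools view`); the function names `read_lengths` and `fraction_in_range` are hypothetical, and the 250-300 bp window is simply the range mentioned in the post.

```python
from typing import Iterable, List


def read_lengths(sam_lines: Iterable[str]) -> List[int]:
    """Return the read length of every alignment record in SAM text."""
    lengths = []
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        # A SAM record has 11 mandatory fields; SEQ is the 10th.
        if len(fields) < 11:
            continue
        seq = fields[9]
        if seq != "*":  # '*' means the sequence is not stored
            lengths.append(len(seq))
    return lengths


def fraction_in_range(lengths: List[int], low: int = 250, high: int = 300) -> float:
    """Fraction of reads whose length falls in [low, high] (inclusive)."""
    if not lengths:
        return 0.0
    hits = sum(1 for n in lengths if low <= n <= high)
    return hits / len(lengths)
```

Calling `fraction_in_range(read_lengths(open("unmapped.sam")))` and the same for the accepted hits (file names here are placeholders) would give the two percentages being compared.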

    I found that the default --read-edit-dist is set to 2, which seems a little silly, as you'd need it higher or lower depending on read length. In any event, I tried bumping this up to 4, but this only got me an additional 0.1% of reads in the accepted_hits.bam.
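
To illustrate why a fixed edit distance behaves differently across read lengths, the arithmetic can be sketched as follows. This is illustrative arithmetic only, not a description of how TopHat scores alignments; `allowed_error_rate` and `scaled_edit_dist` are hypothetical helpers, and the 75 bp baseline is just one of the read lengths mentioned earlier in the post.

```python
def allowed_error_rate(edit_dist: int, read_len: int) -> float:
    """Fraction of read positions that may differ under a fixed edit distance."""
    return edit_dist / read_len


def scaled_edit_dist(read_len: int, base_len: int = 75, base_edits: int = 2) -> int:
    """Edit distance that keeps the same per-base tolerance as the baseline.

    Uses integer ceiling division so the result is never below the
    proportional value.
    """
    return (base_edits * read_len + base_len - 1) // base_len
```

Under these assumptions, an edit distance of 4 on a 300 bp read tolerates about 1.3% differences, half of what an edit distance of 2 tolerates on a 75 bp read (about 2.7%); keeping the same tolerance at 300 bp would need `scaled_edit_dist(300)`, which is 8.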

    In case anyone is interested, here is the tophat command I ran (through a perl script) to do the alignment:
    "~/software/tophat/tophat -p 32 -o $outfolder --read-edit-dist 4 ~/software/hg19/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome $folder/cutadapted-full/$filename"

    So, any suggestions for options/flags that can help Bowtie2/TopHat properly find alignments for long reads?

  • #2
    Some of tophat's settings aren't modifiable without tweaking the code itself. Give STAR a try, it'll likely give better results.


    • #3
      What kind of QC have you done on this data before doing Tophat analysis? With MiSeq you are likely to get read-through into adapters if the inserts are small. Trimming of the data would be needed in that case.


      • #4
        Thanks for the idea, GenoMax. However, I have already trimmed adapters from both ends. I used sabre to cut off the 5' adapters and cutadapt to trim any 3' adapters, with the requirement that all sequences be at least 50bp long after trimming. So, adapters shouldn't be an issue.


        • #5
          That eliminates that factor then.

          You probably have a workflow set up and that may be the reason you are using TopHat. But for diagnostic purposes try STAR (or BBMap) as Devon suggested with a sample or two.


          • #6
            Hm. Well, I've never looked into STAR before, and I'd rather not go through the hassle of setting up a new program and preparing index files, etc. But if that's the only real choice...


            • #7
              BBMap is super-easy to use, and can index on the fly:

              bbmap.sh -Xmx24g ref=hg19.fasta in=reads.fq out=mapped.sam maxindel=200000 local nodisk

              The "local" flag is optional and I only included it because there may be a problem with your long reads, in that the trailing part is probably junk. The "nodisk" flag will prevent writing an index to disk.
              Last edited by Brian Bushnell; 01-05-2015, 10:48 AM.
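
One way to check whether the trailing part of the reads is low quality is to look at mean base quality by read position. This is a minimal sketch, assuming a standard four-line-per-record FASTQ with Phred+33 quality encoding (an assumption about the data, not something stated in the thread); `mean_quality_by_position` is a hypothetical helper.

```python
from typing import Iterable, List


def mean_quality_by_position(fastq_lines: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred quality at each read position across all records.

    Assumes exactly four lines per record, with the quality string on
    the fourth line, and Phred+33 encoding by default.
    """
    totals: List[int] = []
    counts: List[int] = []
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # only the 4th line of each record holds qualities
            continue
        qual = line.rstrip("\r\n")
        for pos, ch in enumerate(qual):
            # Grow the accumulators when a longer read is encountered.
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]
```

A steady drop in the returned values toward the end of the list would be consistent with the idea that the tails of the 300 bp reads are what prevents end-to-end alignment.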
