  • tophat/bowtie find no alignment on long (> 250 bp) reads

    For the first time, I am dealing with 300bp data generated from Illumina's MiSeq. It's paired end data, but I'm currently treating it as unpaired data for simplicity's sake. I've dealt with 75bp and 150bp data from similar experiments before without much difficulty. I prepared the fastq files as I usually do for TopHat and ran as usual, but I am finding only around 25% alignment (for read 1; for read 2, which has lower quality scores, alignments are only around 10%).

    Incidentally, I checked and the problem is indeed with the alignment (thus Bowtie2), not "tophat" per se.

    I checked the unmapped.bam file and sorted the reads by frequency. I found that all the top sequences gave very good alignments when I threw them into a standard nucleotide BLAST (human). So it's not like these are junk that shouldn't be expected to align. The rejected sequences all looked long to me, so I looked at the length distribution of reads in the unmapped vs. the accepted_hits files; sure enough, 60% of unmapped reads were in the 250-300bp range, while only 10% of accepted hits were in the same size range. So, clearly tophat is having issues with long reads.
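
The length comparison described above can be reproduced in a few lines of Python. This is a minimal sketch, not the poster's actual script: the function name `fraction_in_range` and the toy sequences are hypothetical, and it assumes the read sequences have already been extracted one per item (for example from the SEQ column printed by `samtools view unmapped.bam | cut -f 10`).

```python
from typing import Iterable


def fraction_in_range(seqs: Iterable[str], lo: int = 250, hi: int = 300) -> float:
    """Return the fraction of sequences whose length falls within [lo, hi]."""
    total = 0
    in_range = 0
    for seq in seqs:
        seq = seq.strip()
        if not seq:
            continue  # ignore blank lines
        total += 1
        if lo <= len(seq) <= hi:
            in_range += 1
    # Avoid dividing by zero when no sequences were supplied.
    return in_range / total if total else 0.0


# Hypothetical toy data: three long reads and one short read.
demo = ["A" * 300, "C" * 260, "G" * 250, "T" * 75]
demo_fraction = fraction_in_range(demo)
print(f"{demo_fraction:.0%} of reads are 250-300 bp")
```

Running the same function on the sequences from unmapped.bam and from accepted_hits.bam gives the two percentages being compared.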

    I found that the default --read-edit-dist is set to 2, which seems a little silly, as you'd need it higher or lower depending on read length. In any event, I tried bumping this up to 4, but this only got me an additional 0.1% of reads in the accepted_hits.bam.
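
To make the read-length point concrete: an edit budget of 2 is about 2.7% of a 75bp read but only about 0.7% of a 300bp read. The sketch below works through that arithmetic, assuming one wants to keep the per-base tolerance constant across read lengths. That assumption is purely illustrative, not a TopHat recommendation, and `scaled_edit_dist` is a hypothetical helper.

```python
def scaled_edit_dist(read_len: int, ref_len: int = 75, ref_edits: int = 2) -> int:
    """Edit budget for read_len that preserves the rate ref_edits / ref_len.

    Uses integer ceiling division so the budget is never rounded down.
    """
    return (ref_edits * read_len + ref_len - 1) // ref_len


# Print the proportional budget for the read lengths mentioned in this thread.
for length in (75, 150, 300):
    print(length, scaled_edit_dist(length))
```

Under that assumption a 300bp read would get a budget of 8, so a value of 4 is still proportionally stricter than the default is for 75bp reads. TopHat also has separate `--read-mismatches` and `--read-gap-length` options; whether raising them would help here is untested.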

    In case anyone is interested, here is the tophat command I ran (through a perl script) to do the alignment:
    "~/software/tophat/tophat -p 32 -o $outfolder --read-edit-dist 4 ~/software/hg19/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome $folder/cutadapted-full/$filename"

    So, any suggestions for options or flags that can help Bowtie2/TopHat properly find alignments for long reads?

  • #2
    Some of tophat's settings aren't modifiable without tweaking the code itself. Give STAR a try; it'll likely give better results.

    • #3
      What kind of QC have you done on this data before doing Tophat analysis? With MiSeq you are likely to get read-through into adapters if the inserts are small. Trimming of the data would be needed in that case.

      • #4
        Thanks for the idea, GenoMax. However, I have already trimmed adapters from both ends. I used sabre to cut off the 5' adapters and cutadapt to trim any 3' adapters, with the requirement that all sequences be at least 50bp long after trimming. So, adapters shouldn't be an issue.

        • #5
          That eliminates that factor then.

          You probably have a workflow set up and that may be the reason you are using TopHat. But for diagnostic purposes try STAR (or BBMap) as Devon suggested with a sample or two.

          • #6
            Hm. Well, I've never looked into STAR before, and I'd rather not go through the hassle of setting up a new program and preparing index files, etc. But if that's the only real choice...

            • #7
              BBMap is super-easy to use, and can index on the fly:

              bbmap.sh -Xmx24g ref=hg19.fasta in=reads.fq out=mapped.sam maxindel=200000 local nodisk

              The "local" flag is optional and I only included it because there may be a problem with your long reads, in that the trailing part is probably junk. The "nodisk" flag will prevent writing an index to disk.
              Last edited by Brian Bushnell; 01-05-2015, 10:48 AM.
