Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • STAR vs tophat2 - mapping reads to long exons

    Hi,

    I have some RNA-Seq from the fruit fly, where we test the differential gene expression of several knock-outs and developmental stages.
    As I heard a lot of good things about the STAR algorithm, I tested it in comparison to bbmap and tophat2.

    These are the commands I used in the three algorithms:
    Code:
    	~/software/STAR-STAR_2.4.1c/STAR --runThreadN 15 --genomeDir genomes/Drosophila_melanogaster/STARindex/Dmel/ --readFilesIn $file --readFilesCommand zcat  --sjdbGTFfile genes.gtf --outFilterType BySJout --outFilterMultimapNmax 1 --alignSJoverhangMin 8 --outFileNamePrefix $NEW_FILE.STAR. --outSAMtype BAM Unsorted --outReadsUnmapped Fastx --outFilterMismatchNoverLmax 0.05 --outFilterScoreMinOverLread 0  --outFilterMatchNminOverLread 0 --alignIntronMax 1
    
     	bbmap.sh in=$file outu=$NEW_FILE.bbmap.unmapped.bam outm=$NEW_FILE.bbmap.bam bamscript=$NEW_FILE.bbmap.sh minidentity=0.9 ambiguous=toss kfilter=85 bhist=bhist.$NEW_FILE.bbmap.txt 2>$NEW_FILE.bbmap_stat
    
     	tophat2 -p 15 -g 1 -G genes.gtf -o $NEW_FILE.tophat.out  genomes/Drosophila_melanogaster/Ensembl/BDGP6.80/bowtie2index/genome $file
    In all my runs ( I testes 12 different samples), STAR performed better than topohat2, while bbmap was way back with the number of mapped reads.
    When I took a closer look at the mapped bam files and compared tophat2 and STAR I have noticed the STAR has some problems with long exons. Even though it shows almost everywhere a higher mapping rate, in long exons it seems not to be able to map the reads correctly, while tophat2 can map there quite a lot of them.

    What might be the reasons (if any), that STAR doesn't map these regions?

    I have here two examples of such exons.

    and


    another difference I can see here is the behaviour around splice junctions, tophat can identify splice junction over a much longer region. Some of them we know to be true, but STAR can't see them. But this is another topic

  • #2
    Keep in mind that tophat2 is mapping to the transcriptome first, so if these exons contain repeat regions or are similar to pseudo genes then that's the reason (i.e., tophat2 might only be reporting alignments there due to not looking yet at the whole genome).

    Comment


    • #3
      When you're mapping RNA-seq data to the genome with BBMap, you should not set the flag "minidentity=0.9" - that will eliminate most of the spliced alignments! And kfilter=85 will also require an 85bp exact match to the reference, which you will not get in virtually any read that spans an intron. I recommend eliminating those flags and adding 'maxindel=100k', like this:

      bbmap.sh in=$file outu=$NEW_FILE.bbmap.unmapped.bam outm=$NEW_FILE.bbmap.bam bamscript=$NEW_FILE.bbmap.sh ambiguous=toss maxindel=100k bhist=bhist.$NEW_FILE.bbmap.txt 2>$NEW_FILE.bbmap_stat

      Comment


      • #4
        Originally posted by Brian Bushnell View Post
        When you're mapping RNA-seq data to the genome with BBMap, you should not set the flag "minidentity=0.9" - that will eliminate most of the spliced alignments! And kfilter=85 will also require an 85bp exact match to the reference, which you will not get in virtually any read that spans an intron. I recommend eliminating those flags and adding 'maxindel=100k'...
        Hi Brian,
        thanks for the remarks. I am trying now to run the suggested corrections. I was though wondering, how I can control the number of allowed mismatches in the mapping process or the quality of reads if using the kfilter and minidentity parameters is not recommended.

        thanks,
        Assa

        Comment


        • #5
          If you are using the latest version, the "subfilter=X" flag will be useful. It removes alignments with more than X mismatches. kfilter and minidentity are nice for DNA mapping, but not for RNA to genome.

          Comment


          • #6
            yes I have the newest version. I'll give it a go with this parameter.

            what parameter can I use to control the overall read quality?

            and

            Is there a way to collect splice junction/ as tophat is doing?

            Comment


            • #7
              Originally posted by frymor View Post
              When I took a closer look at the mapped bam files and compared tophat2 and STAR I have noticed the STAR has some problems with long exons. Even though it shows almost everywhere a higher mapping rate, in long exons it seems not to be able to map the reads correctly, while tophat2 can map there quite a lot of them.

              What might be the reasons (if any), that STAR doesn't map these regions?
              You are using --outFilterMultimapNmax 1, which only allows uniquely mapped reads to be output in the SAM/BAM. No alignments will be reported for multi-mapping reads.
              My guess would be that these reads in the long exons are multi-mappers, and thus STAR does not output them.
              Please try mapping with multi-mappers allowed and check these regions again.

              As far as I can tell, with -g 1 option TopHat will report one random alignment out of all multi-mappers with the same highest score. Also, as @dpryan pointed out, TopHat maps to the transcriptome first, and may consider the alignments unique because of that.

              Originally posted by frymor View Post
              another difference I can see here is the behaviour around splice junctions, tophat can identify splice junction over a much longer region. Some of them we know to be true, but STAR can't see them. But this is another topic
              This caused by --alignIntronMax 1, I replied in more detail at the link above.

              In general, when making the first runs and comparisons, I would recommend running STAR with default parameters.

              Cheers
              Alex

              Comment


              • #8
                Originally posted by frymor View Post
                what parameter can I use to control the overall read quality?
                Well, you could use the "subfilter=X" flag to reject mapping sites with more than X substitutions, but normally for RNA-seq, I only set maxindel to an appropriate value for the organism's intron lengths, and leave everything else at default.

                Is there a way to collect splice junction/ as tophat is doing?
                BBMap does not do that, but you can find the introns by pileup with Samtools (they will be listed as deletions), and you could probably find the exon locations by using Bedtools to list the locations with coverage above a certain threshold, if you treat bases covered by deletion events as not covered. It's very obvious where the introns and exons are when you look at the bam file in IGV, but I've never actually tried to generate annotation from mapping; there's probably a better way than what I'm suggesting.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Advancing Precision Medicine for Rare Diseases in Children
                  by seqadmin




                  Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                  12-16-2024, 07:57 AM
                • seqadmin
                  Recent Advances in Sequencing Technologies
                  by seqadmin



                  Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                  Long-Read Sequencing
                  Long-read sequencing has seen remarkable advancements,...
                  12-02-2024, 01:49 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 12-17-2024, 10:28 AM
                0 responses
                33 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-13-2024, 08:24 AM
                0 responses
                48 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-12-2024, 07:41 AM
                0 responses
                34 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-11-2024, 07:45 AM
                0 responses
                46 views
                0 likes
                Last Post seqadmin  
                Working...
                X