Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • STAR vs tophat2 - mapping reads to long exons

    Hi,

    I have some RNA-Seq from the fruit fly, where we test the differential gene expression of several knock-outs and developmental stages.
    As I heard a lot of good things about the STAR algorithm, I tested it in comparison to bbmap and tophat2.

    These are the commands I used in the three algorithms:
    Code:
    	~/software/STAR-STAR_2.4.1c/STAR --runThreadN 15 --genomeDir genomes/Drosophila_melanogaster/STARindex/Dmel/ --readFilesIn $file --readFilesCommand zcat  --sjdbGTFfile genes.gtf --outFilterType BySJout --outFilterMultimapNmax 1 --alignSJoverhangMin 8 --outFileNamePrefix $NEW_FILE.STAR. --outSAMtype BAM Unsorted --outReadsUnmapped Fastx --outFilterMismatchNoverLmax 0.05 --outFilterScoreMinOverLread 0  --outFilterMatchNminOverLread 0 --alignIntronMax 1
    
     	bbmap.sh in=$file outu=$NEW_FILE.bbmap.unmapped.bam outm=$NEW_FILE.bbmap.bam bamscript=$NEW_FILE.bbmap.sh minidentity=0.9 ambiguous=toss kfilter=85 bhist=bhist.$NEW_FILE.bbmap.txt 2>$NEW_FILE.bbmap_stat
    
     	tophat2 -p 15 -g 1 -G genes.gtf -o $NEW_FILE.tophat.out  genomes/Drosophila_melanogaster/Ensembl/BDGP6.80/bowtie2index/genome $file
    In all my runs ( I testes 12 different samples), STAR performed better than topohat2, while bbmap was way back with the number of mapped reads.
    When I took a closer look at the mapped bam files and compared tophat2 and STAR I have noticed the STAR has some problems with long exons. Even though it shows almost everywhere a higher mapping rate, in long exons it seems not to be able to map the reads correctly, while tophat2 can map there quite a lot of them.

    What might be the reasons (if any), that STAR doesn't map these regions?

    I have here two examples of such exons.

    and


    another difference I can see here is the behaviour around splice junctions, tophat can identify splice junction over a much longer region. Some of them we know to be true, but STAR can't see them. But this is another topic

  • #2
    Keep in mind that tophat2 is mapping to the transcriptome first, so if these exons contain repeat regions or are similar to pseudo genes then that's the reason (i.e., tophat2 might only be reporting alignments there due to not looking yet at the whole genome).

    Comment


    • #3
      When you're mapping RNA-seq data to the genome with BBMap, you should not set the flag "minidentity=0.9" - that will eliminate most of the spliced alignments! And kfilter=85 will also require an 85bp exact match to the reference, which you will not get in virtually any read that spans an intron. I recommend eliminating those flags and adding 'maxindel=100k', like this:

      bbmap.sh in=$file outu=$NEW_FILE.bbmap.unmapped.bam outm=$NEW_FILE.bbmap.bam bamscript=$NEW_FILE.bbmap.sh ambiguous=toss maxindel=100k bhist=bhist.$NEW_FILE.bbmap.txt 2>$NEW_FILE.bbmap_stat

      Comment


      • #4
        Originally posted by Brian Bushnell View Post
        When you're mapping RNA-seq data to the genome with BBMap, you should not set the flag "minidentity=0.9" - that will eliminate most of the spliced alignments! And kfilter=85 will also require an 85bp exact match to the reference, which you will not get in virtually any read that spans an intron. I recommend eliminating those flags and adding 'maxindel=100k'...
        Hi Brian,
        thanks for the remarks. I am trying now to run the suggested corrections. I was though wondering, how I can control the number of allowed mismatches in the mapping process or the quality of reads if using the kfilter and minidentity parameters is not recommended.

        thanks,
        Assa

        Comment


        • #5
          If you are using the latest version, the "subfilter=X" flag will be useful. It removes alignments with more than X mismatches. kfilter and minidentity are nice for DNA mapping, but not for RNA to genome.

          Comment


          • #6
            yes I have the newest version. I'll give it a go with this parameter.

            what parameter can I use to control the overall read quality?

            and

            Is there a way to collect splice junction/ as tophat is doing?

            Comment


            • #7
              Originally posted by frymor View Post
              When I took a closer look at the mapped bam files and compared tophat2 and STAR I have noticed the STAR has some problems with long exons. Even though it shows almost everywhere a higher mapping rate, in long exons it seems not to be able to map the reads correctly, while tophat2 can map there quite a lot of them.

              What might be the reasons (if any), that STAR doesn't map these regions?
              You are using --outFilterMultimapNmax 1, which only allows uniquely mapped reads to be output in the SAM/BAM. No alignments will be reported for multi-mapping reads.
              My guess would be that these reads in the long exons are multi-mappers, and thus STAR does not output them.
              Please try mapping with multi-mappers allowed and check these regions again.

              As far as I can tell, with -g 1 option TopHat will report one random alignment out of all multi-mappers with the same highest score. Also, as @dpryan pointed out, TopHat maps to the transcriptome first, and may consider the alignments unique because of that.

              Originally posted by frymor View Post
              another difference I can see here is the behaviour around splice junctions, tophat can identify splice junction over a much longer region. Some of them we know to be true, but STAR can't see them. But this is another topic
              This caused by --alignIntronMax 1, I replied in more detail at the link above.

              In general, when making the first runs and comparisons, I would recommend running STAR with default parameters.

              Cheers
              Alex

              Comment


              • #8
                Originally posted by frymor View Post
                what parameter can I use to control the overall read quality?
                Well, you could use the "subfilter=X" flag to reject mapping sites with more than X substitutions, but normally for RNA-seq, I only set maxindel to an appropriate value for the organism's intron lengths, and leave everything else at default.

                Is there a way to collect splice junction/ as tophat is doing?
                BBMap does not do that, but you can find the introns by pileup with Samtools (they will be listed as deletions), and you could probably find the exon locations by using Bedtools to list the locations with coverage above a certain threshold, if you treat bases covered by deletion events as not covered. It's very obvious where the introns and exons are when you look at the bam file in IGV, but I've never actually tried to generate annotation from mapping; there's probably a better way than what I'm suggesting.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM
                • seqadmin
                  Techniques and Challenges in Conservation Genomics
                  by seqadmin



                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                  Avian Conservation
                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                  03-08-2024, 10:41 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Yesterday, 06:37 PM
                0 responses
                10 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, Yesterday, 06:07 PM
                0 responses
                9 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-22-2024, 10:03 AM
                0 responses
                49 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 03-21-2024, 07:32 AM
                0 responses
                67 views
                0 likes
                Last Post seqadmin  
                Working...
                X