Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat 2.0.4 -T and -G versus -G --no-novel-juncs

    Hello everyone,

    I've run into some ambiguity in the TopHat manual and was looking to see if anyone could help with my question.

    I wanted to align my sequenced reads against the genome with a known gtf file so that reads align to junctions but without tophat identifying novel junctions.

    I originally did this by running the following params:

    tophat2 \
    --no-novel-juncs \
    -G ensembl.gtf \
    --solexa1.3-quals \
    --library-type fr-unstranded \
    -p 8 \
    -r 50 \
    --mate-std-dev 100 \
    -o ./s_1_tophat_out \
    s_1_1_sequence.txt \
    s_1_2_sequence.txt

    I used --no-novel-juncs with a supplied gtf file with -G to align my reads to the genome and get back a junctions.bed file with the number of reads that overlap the known junctions found by tophat, etc. That makes sense.

    --no-novel-juncs
    Only look for reads across junctions indicated in the supplied GFF or junctions file. (ignored without -G/-j)
    Here is the log of that run:

    [2012-10-24 00:19:16] Beginning TopHat run (v2.0.4)
    -----------------------------------------------
    [2012-10-24 00:19:16] Checking for Bowtie
    Bowtie version: 2.0.0.7
    [2012-10-24 00:19:16] Checking for Samtools
    Samtools version: 0.1.18.0
    [2012-10-24 00:19:16] Checking for Bowtie index files
    [2012-10-24 00:19:16] Checking for reference FASTA file
    [2012-10-24 00:19:16] Generating SAM header for /data/share/pulm/Deep_Seq/TopHat2/hg19
    format: fastq
    quality scale: phred64 (reads generated with GA pipeline version >= 1.3)
    [2012-10-24 00:19:56] Reading known junctions from GTF file
    [2012-10-24 00:20:22] Preparing reads
    left reads: min. length=75, max. length=75, 27052490 kept reads (12433 discarded)
    right reads: min. length=75, max. length=75, 27041987 kept reads (22936 discarded)
    [2012-10-24 00:45:48] Creating transcriptome data files..
    [2012-10-24 00:47:25] Building Bowtie index from Homo_sapiens.GRCh37.64.mod.fa
    [2012-10-24 01:27:40] Mapping left_kept_reads to transcriptome Homo_sapiens.GRCh37.64.mod with Bowtie2
    [2012-10-24 03:38:37] Mapping right_kept_reads to transcriptome Homo_sapiens.GRCh37.64.mod with Bowtie2
    [2012-10-24 05:20:48] Resuming TopHat pipeline with unmapped reads
    [2012-10-24 05:20:49] Mapping left_kept_reads.m2g_um to genome hg19 with Bowtie2
    [2012-10-24 05:59:56] Mapping left_kept_reads.m2g_um_seg1 to genome hg19 with Bowtie2 (1/3)
    [2012-10-24 06:06:18] Mapping left_kept_reads.m2g_um_seg2 to genome hg19 with Bowtie2 (2/3)
    [2012-10-24 06:14:03] Mapping left_kept_reads.m2g_um_seg3 to genome hg19 with Bowtie2 (3/3)
    [2012-10-24 06:20:19] Mapping right_kept_reads.m2g_um to genome hg19 with Bowtie2
    [2012-10-24 07:17:53] Mapping right_kept_reads.m2g_um_seg1 to genome hg19 with Bowtie2 (1/3)
    [2012-10-24 07:27:27] Mapping right_kept_reads.m2g_um_seg2 to genome hg19 with Bowtie2 (2/3)
    [2012-10-24 07:37:15] Mapping right_kept_reads.m2g_um_seg3 to genome hg19 with Bowtie2 (3/3)
    [2012-10-24 07:44:00] Retrieving sequences for splices
    [2012-10-24 07:47:26] Indexing splices
    [2012-10-24 07:48:27] Mapping left_kept_reads.m2g_um_seg1 to genome segment_juncs with Bowtie2 (1/3)
    [2012-10-24 07:50:03] Mapping left_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie2 (2/3)
    [2012-10-24 07:51:58] Mapping left_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie2 (3/3)
    [2012-10-24 07:53:29] Joining segment hits
    [2012-10-24 08:28:49] Mapping right_kept_reads.m2g_um_seg1 to genome segment_juncs with Bowtie2 (1/3)
    [2012-10-24 08:30:46] Mapping right_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie2 (2/3)
    [2012-10-24 08:32:47] Mapping right_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie2 (3/3)
    [2012-10-24 08:34:46] Joining segment hits
    [2012-10-24 09:08:06] Reporting output tracks
    Warning: couldn't remove all temporary files in ./s_1_tophat_out/tmp/
    -----------------------------------------------
    [2012-10-24 10:42:54] Run complete: 10:23:38 elapsed

    I then noticed that there is also a -T function.

    -T/--transcriptome-only
    Only align the reads to the transcriptome and report only those mappings as genomic mappings.
    I then wasn't sure if this was what I wanted or if I wanted the --no-novel-juncs function. I also wasn't sure how --no-novel-juncs and -T interacted together, so I ran the following two runs with and with --no-novel-juncs:

    tophat2 \
    --no-novel-juncs \ # In one run this was included in the other it was not
    -G ensembl.gtf \
    -T
    --solexa1.3-quals \
    --library-type fr-unstranded \
    -p 8 \
    -r 50 \
    --mate-std-dev 100 \
    -o ./s_1_tophat_out \
    s_1_1_sequence.txt \
    s_1_2_sequence.txt

    It appears like whether or not I included --no-novel-juncs that I got a similar size output (the final bam files were only different by about 10 bytes). Though, interestingly, I did not get any returned junctions.bed file or indel bed files. Here is the log file:

    EDIT: I did actually get the junction and indel files back, I just had an error in where I thought they were being output, so my bad on that statement above. I'm still going to look into what the differences might be.

    [2013-01-03 12:17:14] Beginning TopHat run (v2.0.6)
    -----------------------------------------------
    [2013-01-03 12:17:14] Checking for Bowtie
    Bowtie version: 2.0.0.7
    [2013-01-03 12:17:14] Checking for Samtools
    Samtools version: 0.1.18.0
    [2013-01-03 12:17:14] Checking for Bowtie index files
    [2013-01-03 12:17:14] Checking for reference FASTA file
    [2013-01-03 12:17:14] Generating SAM header for /data/share/pulm/Deep_Seq/TopHat2/hg19
    format: fastq
    quality scale: phred64 (reads generated with GA pipeline version >= 1.3)
    [2013-01-03 12:18:20] Reading known junctions from GTF file
    [2013-01-03 12:18:39] Preparing reads
    left reads: min. length=75, max. length=75, 27052490 kept reads (12433 discarded)
    right reads: min. length=75, max. length=75, 27041987 kept reads (22936 discarded)
    [2013-01-03 12:38:54] Creating transcriptome data files..
    [2013-01-03 12:40:32] Building Bowtie index from Homo_sapiens.GRCh37.64.mod.fa
    [2013-01-03 13:08:25] Mapping left_kept_reads to transcriptome Homo_sapiens.GRCh37.64.mod with Bowtie2
    [2013-01-03 13:39:57] Mapping right_kept_reads to transcriptome Homo_sapiens.GRCh37.64.mod with Bowtie2
    [2013-01-03 14:11:03] Reporting output tracks
    -----------------------------------------------
    [2013-01-03 14:35:47] Run complete: 02:18:33 elapsed
    It looks like it only maps to the known transcriptome (what I want), but doesn't do the longer set of steps that returns the junctions.bed file (something I want, but am not getting with -T option). Any idea what the differences are here and why one returns a bed file of junctions but the other does not?

    EDIT: Again, my bad, I did actually get those files back, I just had an error in where I thought they were being output. I'm still going to look into what the differences might be between the -T and --no-novel-juncs runs in output.
    Last edited by jb2; 01-07-2013, 03:16 PM. Reason: Error in original post, updating.

  • #2
    Hi,

    I'm no expert with tophat2 yet but here's what I think:

    -using both -G and -no-novel-juncs, tophat goes through the first 2 of its 3 possible steps: 1) mapping to transcriptome and 2) mapping reads to the genome but only if they can be aligned without splicing.

    -using -G and -T (I think -no-novel-juncs is not necessary/meaningless in that case), it only performs step 1.

    Now as to why it outputs a junctions file in the first case and not in the second case, I'm not sure.
    My guess is that when -T is specified, since the junctions that would be contained in the output junctions files are those specified by the GTF file, it is just unnecessary to output a junctions file.
    It is also unnecessary I think when -no-novel-juncs is specified, but it still seems to output one. You could check if all the junctions output with the -G and -no-novel-juncs options are the same as those defined by your GTF file maybe.
    Anyway, the GTF file is equivalent to a junctions file with no novel junctions, so you don't really need an output junctions file do you?

    Hope that helps,

    Amelie

    Comment


    • #3
      Hi Amelie,

      Thanks for the response. What you stated does appear to be the case. I have also checked that the splice junctions in the junctions.bed file returned by tophat2 with -G and --no-novel-juncs are in fact all correlated with junctions in the gtf file. I actually like having the junctions.bed file output because it gives me the opportunity to look at junction specific behavior and how many and which junctions seem to have evidence of expression.

      Tophat run with -T is definitely much faster, but since I desire the junctions.bed file, I'm going to stick to to the method -G --no-novel-juncs method I've used so far. I'd also point out that I don't think indels are returned either, so that is something missing too, which I think most users may want regardless of whether they are using Tophat to find novel splice junctions or not.

      Cheers,

      jb2

      Comment


      • #4
        The fact that indels are not returned with -T makes me wonder if they are allowed in step 1) when -T is not specified and tophat goes through all 3 steps. That would be desirable! I'll email tophat to check. Thanks for pointing that out!

        Comment


        • #5
          Oops, my bad! I just noticed that it does return these things, just not in the place I originally checked. Doh! So now I'm going to go and explore what the difference might be between the junctions.bed file returned from -T and -G and the one returned from -G --no-novel-juncs. Sorry for the confusion and mistake on my part!

          Comment


          • #6
            Did you find indels files in both cases too then?

            Comment


            • #7
              Yep, both insertions.bed and deletions.bed are there in both cases.

              One thing I do notice, with -G --no-novel-juncs and -G -T, the resulting bam files do differ in size:

              -G --no-novel-juncs
              -rw-r--r-- 1 jb2 lgrc 2.4G Oct 24 10:41 s_1pair_accepted_hits.bam
              -T -G
              -rw-r--r-- 1 jb2 lgrc 2.0G Jan 3 14:33 s_1pair_accepted_hits.bam
              Interesting to see what the differences might be.

              Comment


              • #8
                The difference should be the reads mapped in step 2) (mapping to genome splicing)

                Comment


                • #9
                  For the most part they find evidence of expression for a large union of the splice junctions (both -G -T and -G --no-novel-juncs show that there are reads aligned to 175988 splice junctions).

                  With the -G and -T options, there are 322 splice junctions that have aligned reads not found by -G and --no-novel-juncs.

                  However, with -G --no-novel-juncs, there are 1007 splice junctions that have aligned reads that don't have reads aligning to them with -G -T function.

                  Overall, they find a majority of the same things, I'm just curious about what causes the differences. I would have expected that -G --no-novel-juncs to have found all the the junctions that -G -T found plus a few more junctions because of the extra alignment steps it performs compared to -G -T, but in fact the -G -T tophat run finds some junctions not found by -G -T options.

                  Maybe I'll hear back from the Tophat folks about this to see what their thoughts are on that?

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  18 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  22 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  16 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  47 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X