  • how to run large inputs using Tophat quicker?

    Hi,

    I'm analysing some paired-end RNA-seq data from 20 healthy individuals using TopHat and Cufflinks. However, the novel isoforms detected in each individual are quite different from each other when I compare them across individuals. So now I'm thinking of merging these 20 samples into one mega file and then using TopHat and Cufflinks to detect isoforms. This brings a new problem: the mega file is very big, about 74G for read1 and another 74G for read2. When I ran TopHat, it took about three weeks, which is very risky because if my computer shuts down for any reason, the job is terminated and I need to run it again. Does anyone know how to make TopHat run faster on input files this large?

    Many thanks

  • #2
    Hello
    You could use the iPlant Collaborative infrastructure. Set up an account, upload your data to the data store, and then use their cloud infrastructure to run your alignment on their supercomputers.



    • #3
      If I might ask, how large was the fabric/infrastructure you were using for the analysis of your data using Tophat/Cufflinks? I'm curious to know this.

      Thanks.



      • #4
        Originally posted by xy6699
        Hi,

        I'm analysing some paired-end RNA-seq data from 20 healthy individuals using TopHat and Cufflinks. However, the novel isoforms detected in each individual are quite different from each other when I compare them across individuals. So now I'm thinking of merging these 20 samples into one mega file and then using TopHat and Cufflinks to detect isoforms. This brings a new problem: the mega file is very big, about 74G for read1 and another 74G for read2. When I ran TopHat, it took about three weeks, which is very risky because if my computer shuts down for any reason, the job is terminated and I need to run it again. Does anyone know how to make TopHat run faster on input files this large?

        Many thanks
        TopHat just aligns the reads to a reference. The alignment of any one read to that reference sequence is independent of all other reads so alignments will not be impacted by submitting the reads as 20 independent files or one single file. In other words if you already have the output of 20 TopHat runs there is no point in re-running TopHat on these reads.

        Novel isoforms are identified by Cufflinks. There are cufflinks parameters which define a coverage minimum to call a novel isoform. Have you run cuffmerge to combine the results of the individual cufflinks output into one unified set of isoform calls?
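
The cuffmerge step mentioned above can be sketched roughly as follows. This is only an illustration, not the poster's actual setup: the directory names, `genes.gtf`, `genome.fa`, and the helper `plan_cuffmerge` are hypothetical. It assumes each per-sample Cufflinks run wrote its assembly to `transcripts.gtf`, and that cuffmerge is given a text file listing those GTFs, one per line.

```python
def plan_cuffmerge(cufflinks_dirs, ref_gtf, genome_fa,
                   manifest_name="assemblies.txt",
                   out_dir="merged_asm", threads=4):
    """Return (manifest_text, argv) for a cuffmerge run.

    cufflinks_dirs: per-sample Cufflinks output directories (hypothetical names).
    ref_gtf / genome_fa: reference annotation and genome FASTA (assumed paths).
    """
    # Cufflinks writes its assembled transcripts to transcripts.gtf
    gtfs = [f"{d}/transcripts.gtf" for d in cufflinks_dirs]
    manifest_text = "\n".join(gtfs) + "\n"
    argv = [
        "cuffmerge",
        "-g", ref_gtf,        # reference annotation to merge in
        "-s", genome_fa,      # reference sequence
        "-p", str(threads),   # number of threads
        "-o", out_dir,        # output directory
        manifest_name,        # list of per-sample GTFs
    ]
    return manifest_text, argv


if __name__ == "__main__":
    dirs = [f"cufflinks_out/sample_{i:02d}" for i in range(1, 21)]
    manifest, cmd = plan_cuffmerge(dirs, "genes.gtf", "genome.fa")
    print(manifest, end="")
    print(" ".join(cmd))
```

The function only builds the manifest text and the argument list; writing the manifest to disk and launching the command (for example with `subprocess.run(cmd, check=True)`) is left to the caller.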



        • #5
          Hi kmcarr,

          I have run cuffmerge, but the result is not very satisfactory. For example, in the individual assemblies I found the reference transcript in each of my samples, but after running cuffmerge the reference transcript was lost and the merged isoforms tend to be longer.

          TopHat identifies junctions using reads, so I'm wondering: if I merge all the reads together at the first step, will it improve the accuracy of junction detection?



          • #6
            Originally posted by xy6699
            TopHat identifies junctions using reads, so I'm wondering: if I merge all the reads together at the first step, will it improve the accuracy of junction detection?
            No, TopHat does not identify junctions. This is the point I was trying to get across in my first post. TopHat aligns each read independently of all other reads, so it does not matter if you have more reads in your input. TopHat WILL NOT change how it aligns reads based on how many reads you give it.

            Cufflinks identifies junctions. There are some Cufflinks parameters which set a minimum number of reads supporting a junction. If you want to see what might happen, you could merge the BAM output from the 20 individual TopHat runs and run Cufflinks on that. But you really have to ask yourself: if the junction is sooooo rare that you need to go to these lengths to detect it, are you sure it's real?
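
A minimal sketch of the "merge the BAMs, then run Cufflinks" experiment suggested above. The directory names, output names, and the helper `plan_pooled_assembly` are hypothetical; the sketch assumes each TopHat run left its alignments in `accepted_hits.bam` and that `samtools` and `cufflinks` are on the PATH.

```python
def plan_pooled_assembly(tophat_dirs, merged_bam="pooled.bam",
                         out_dir="cufflinks_pooled", threads=4):
    """Return (merge_argv, cufflinks_argv) for a pooled assembly.

    tophat_dirs: per-sample TopHat output directories (hypothetical names).
    """
    # TopHat writes its alignments to accepted_hits.bam in each output directory
    bams = [f"{d}/accepted_hits.bam" for d in tophat_dirs]
    # samtools merge <out.bam> <in1.bam> <in2.bam> ...
    merge_argv = ["samtools", "merge", merged_bam, *bams]
    # Assemble transcripts from the pooled alignments
    cufflinks_argv = ["cufflinks", "-p", str(threads), "-o", out_dir, merged_bam]
    return merge_argv, cufflinks_argv


if __name__ == "__main__":
    dirs = [f"tophat_out/sample_{i:02d}" for i in range(1, 21)]
    merge_cmd, cufflinks_cmd = plan_pooled_assembly(dirs)
    print(" ".join(merge_cmd))
    print(" ".join(cufflinks_cmd))
```

Because the existing alignments are reused, this avoids the three-week re-alignment of the combined FASTQ files; only the merge and the assembly need to run.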



            • #7
              Ah, I see. Thanks a lot for your information.

              So the problem I have now is that after cuffmerge I lost the reference annotated transcripts for some genes as I mentioned above. I attached an example here. Do you have any suggestions about how to assemble the isoforms more correctly?

              Many thanks


              • #8
                TopHat 2nd iteration with merged-junctions file as input to -j option?

                Originally posted by kmcarr
                TopHat just aligns the reads to a reference. The alignment of any one read to that reference sequence is independent of all other reads so alignments will not be impacted by submitting the reads as 20 independent files or one single file. In other words if you already have the output of 20 TopHat runs there is no point in re-running TopHat on these reads.

                Novel isoforms are identified by Cufflinks. There are cufflinks parameters which define a coverage minimum to call a novel isoform. Have you run cuffmerge to combine the results of the individual cufflinks output into one unified set of isoform calls?
                Hi kmcarr, does your comment still apply if I am running a 2nd TopHat iteration with the merged-junctions.bed file as input to the -j option? I ran the 1st iteration of TopHat with just the -G option and the 2nd iteration with -j and -G.

                Does this help detect junctions that would otherwise go missing with a single tophat iteration?
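
For readers trying the same two-pass setup, pooling the first-pass junctions might look like the sketch below. It rests on assumptions not stated in the thread: that each first-pass `junctions.bed` uses the two-block BED12 layout, and that `-j` expects tab-separated `chrom, left, right, strand` records with zero-based coordinates (last base of the left exon, first base of the right exon), which is the conversion TopHat's `bed_to_juncs` utility performs. The function names are hypothetical.

```python
def bed_line_to_junction(line):
    """Convert one two-block BED12 line to (chrom, left, right, strand), or None."""
    cols = line.split()
    # Skip the "track ..." header and malformed lines
    if len(cols) < 12 or cols[0] == "track":
        return None
    chrom, start, strand = cols[0], int(cols[1]), cols[5]
    block_sizes = [int(x) for x in cols[10].split(",") if x]
    block_starts = [int(x) for x in cols[11].split(",") if x]
    if len(block_sizes) < 2 or len(block_starts) < 2:
        return None
    left = start + block_sizes[0] - 1   # last base of the left exon
    right = start + block_starts[1]     # first base of the right exon
    return chrom, left, right, strand


def pool_junctions(bed_line_groups):
    """Union of junctions over several samples, sorted for stable output."""
    pooled = set()
    for lines in bed_line_groups:
        for line in lines:
            junction = bed_line_to_junction(line)
            if junction is not None:
                pooled.add(junction)
    return sorted(pooled)


def format_juncs(junctions):
    """Render junctions as tab-separated text, one junction per line."""
    return "".join(f"{c}\t{l}\t{r}\t{s}\n" for c, l, r, s in junctions)
```

Each element of `bed_line_groups` can be an open file handle for one sample's `junctions.bed`. The text returned by `format_juncs` would be written to a file and passed to the second TopHat pass through `-j`, alongside `-G` as described in the post.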
