**Richard Barker** · 10-09-2012, 08:51 AM

Hello
You could use the iPlant Collaborative infrastructure. Set up an account, upload your data to the data store, and then use their cloud infrastructure to run your alignment on their supercomputers.

http://www.iplantcollaborative.org/

**Geneus** · 10-09-2012, 09:50 AM

If I might ask, how large was the infrastructure you used to analyse your data with TopHat/Cufflinks? I'm curious to know.

Thanks.

**kmcarr** · 10-09-2012, 09:53 AM

Originally posted by xy6699

Hi,

I'm analysing some paired-end RNA-seq data from 20 healthy individuals using TopHat and Cufflinks. However, the novel isoforms detected in each individual are quite different when I compare them across individuals. So now I'm thinking of merging these 20 samples into one mega file and then using TopHat and Cufflinks to detect isoforms. This brings a new problem. The mega file is very big, about 74G for read1 and another 74G for read2. When I ran TopHat, it took about three weeks, and this is very risky because if my computer shuts down for any reason, the job is terminated and I need to run it again. Does anyone know how to make TopHat run quicker on this kind of large input?

Many thanks

TopHat just aligns the reads to a reference. The alignment of any one read to that reference sequence is independent of all other reads so alignments will not be impacted by submitting the reads as 20 independent files or one single file. In other words if you already have the output of 20 TopHat runs there is no point in re-running TopHat on these reads.

Novel isoforms are identified by Cufflinks. There are cufflinks parameters which define a coverage minimum to call a novel isoform. Have you run cuffmerge to combine the results of the individual cufflinks output into one unified set of isoform calls?
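A minimal sketch of the cuffmerge step mentioned above, written as a Python helper that only builds the manifest and the command line rather than running anything. The directory names, `assemblies.txt`, `merged_asm`, and the helper name are hypothetical; the sketch assumes each per-sample Cufflinks output directory holds the usual `transcripts.gtf`, and that a reference annotation and genome FASTA are passed with `-g` and `-s`.

```python
import posixpath


def build_cuffmerge_command(cufflinks_dirs, reference_gtf, genome_fasta,
                            manifest="assemblies.txt", out_dir="merged_asm",
                            threads=4):
    """Return (manifest_text, argv) for one cuffmerge run.

    cufflinks_dirs: hypothetical per-sample Cufflinks output directories.
    The manifest lists one transcripts.gtf path per line.
    """
    gtfs = [posixpath.join(d, "transcripts.gtf") for d in cufflinks_dirs]
    manifest_text = "\n".join(gtfs) + "\n"
    argv = [
        "cuffmerge",
        "-g", reference_gtf,   # reference annotation to merge against
        "-s", genome_fasta,    # genome sequence
        "-p", str(threads),
        "-o", out_dir,
        manifest,              # text file listing the per-sample GTFs
    ]
    return manifest_text, argv
```

To use it, write `manifest_text` to the manifest path and pass `argv` to `subprocess.run(argv, check=True)`.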

**xy6699** · 10-12-2012, 01:40 AM

Hi kmcarr,

I have run cuffmerge, but the result is not very satisfactory. For example, for individual assembly, I found the reference transcript in each of my samples, but after using cuffmerge, the reference transcript was lost and the merged isoforms tend to be longer.

Tophat identifies junctions using reads, so I'm thinking if I merge all the reads together at the first step, will it improve the accuracy in junction detection?

**kmcarr** · 10-12-2012, 04:51 AM

Originally posted by xy6699

Tophat identifies junctions using reads, so I'm thinking if I merge all the reads together at the first step, will it improve the accuracy in junction detection?

No, TopHat does not identify junctions. This is the point I was trying to get across in my first post. TopHat aligns each read independently of all other reads; it does not matter if you have more reads in your input. TopHat WILL NOT change how it aligns reads based on how many reads you give it.

Cufflinks identifies junctions. There are some Cufflinks parameters which set a minimum number of reads supporting a junction. If you want to see what might happen, you could merge the BAM output from the 20 individual TopHat runs and run Cufflinks on that. But you really have to ask yourself: if the junction is so rare that you need to go to these lengths to detect it, are you sure it's real?
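The pooling suggested above can be sketched as two commands: `samtools merge` over the per-sample alignments, then a single Cufflinks run on the merged BAM. This is an illustration only. The directory names, output names, and the helper itself are hypothetical, and it assumes each TopHat output directory contains the usual coordinate-sorted `accepted_hits.bam`.

```python
import posixpath


def build_pooled_cufflinks_commands(tophat_dirs, merged_bam="merged.bam",
                                    out_dir="cufflinks_pooled", threads=4):
    """Return (merge_cmd, cufflinks_cmd) as argv lists; nothing is executed.

    tophat_dirs: hypothetical per-sample TopHat output directories.
    """
    bams = [posixpath.join(d, "accepted_hits.bam") for d in tophat_dirs]
    # samtools merge <out.bam> <in1.bam> <in2.bam> ...
    merge_cmd = ["samtools", "merge", merged_bam] + bams
    # One Cufflinks assembly over the pooled alignments.
    cufflinks_cmd = ["cufflinks", "-p", str(threads), "-o", out_dir, merged_bam]
    return merge_cmd, cufflinks_cmd
```

Each list can be handed to `subprocess.run(cmd, check=True)` in order. No read is realigned, so this avoids the multi-week TopHat run on the combined FASTQ files.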

**xy6699** · 10-12-2012, 05:11 AM

Ah, I see. Thanks a lot for your information.

So the problem I have now is that after cuffmerge I lost the reference annotated transcripts for some genes as I mentioned above. I attached an example here. Do you have any suggestions about how to assemble the isoforms more correctly?

Many thanks

Attached Files

il2ra.pdf (182.2 KB, 27 views)

**hrajasim** · 02-20-2013, 03:50 PM

TopHat 2nd iteration with merged-junctions file as input to -j option?

Originally posted by kmcarr

TopHat just aligns the reads to a reference. The alignment of any one read to that reference sequence is independent of all other reads so alignments will not be impacted by submitting the reads as 20 independent files or one single file. In other words if you already have the output of 20 TopHat runs there is no point in re-running TopHat on these reads.

Novel isoforms are identified by Cufflinks. There are cufflinks parameters which define a coverage minimum to call a novel isoform. Have you run cuffmerge to combine the results of the individual cufflinks output into one unified set of isoform calls?

Hi kmcarr, does your comment still apply if I am running a 2nd iteration of TopHat with a merged-junctions.bed file as input to the -j option? I ran the 1st iteration of TopHat with just the -G option and the 2nd iteration with -j and -G.

Does this help detect junctions that would otherwise go missing with a single tophat iteration?
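For concreteness, here is a rough sketch of the two-pass setup described in this post: pool the `junctions.bed` files from the first-pass runs into one junction list, then build a second-pass command with `-j` and `-G`. It is an illustration under assumptions, not a tested pipeline. All function and file names are hypothetical. The conversion assumes the `-j` file takes tab-separated `chrom, left, right, strand` records with zero-based coordinates of the last base of the left exon and the first base of the right exon, and that `junctions.bed` is standard 12-column BED; TopHat ships a `bed_to_juncs` script for this conversion, which is the safer choice in practice.

```python
def bed_line_to_junction(line):
    """Convert one 12-column junctions.bed line to (chrom, left, right, strand)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        return None  # header or malformed line
    sizes = [int(x) for x in cols[10].split(",") if x]
    starts = [int(x) for x in cols[11].split(",") if x]
    if len(sizes) < 1 or len(starts) < 2:
        return None
    start = int(cols[1])
    left = start + starts[0] + sizes[0] - 1   # last base of the left exon
    right = start + starts[1]                 # first base of the right exon
    return (cols[0], left, right, cols[5])


def pool_junctions(bed_texts):
    """Merge several junctions.bed contents into unique, sorted junction lines."""
    pooled = set()
    for text in bed_texts:
        for line in text.splitlines():
            junction = bed_line_to_junction(line)
            if junction is not None:
                pooled.add(junction)
    return ["%s\t%d\t%d\t%s" % j for j in sorted(pooled)]


def build_second_pass_command(juncs_file, reference_gtf, index_base,
                              reads_1, reads_2, out_dir, threads=4):
    """Return the argv for a second TopHat pass using pooled junctions."""
    return ["tophat", "-p", str(threads), "-G", reference_gtf,
            "-j", juncs_file, "-o", out_dir, index_base, reads_1, reads_2]
```

The second pass is still run once per sample, so the 20 jobs stay independent and can run in parallel or be restarted individually.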

how to run large inputs using Tophat quicker?