Seqanswers Leaderboard Ad

**jiwu2573** · 01-18-2010, 04:36 PM

Another check:

My hg18.fa constructed from hg18 UCSC from TopHat website has the size of 3131776827 bytes, same as yours?

**Xi Wang** · 01-18-2010, 10:24 PM

Originally posted by jiwu2573 View Post

I only know my fastq file for 1 sample is around 3 GB after converting and joining all the 120 qseq.txt files, not sure how to find out how many reads in total? How do you know?

I guess there are more 10 million reads. You can count the number of lines of FASEQ file, and the number of reads equals to 1/4 of the number of lines.

The read length is 76 bp. I am running tophat with the default configuration without any argument except --solexa1.3-quals. I guess you are designating the number of mismatches, number of multi-aligned loci by the argument. If that's the case, what number do you use?

I usually discard all the multi-reads, by specifying "-g 1". I think it is normal to take several hours to finish your job.

PS. I am running TopHat through univ connection to the Linux server. Is it supposed to be faster than running on my local computer? How many processors do you have in your computer? Is a normal PC enough?

I think so. The computational server will be more efficient than PC.

**Xi Wang** · 01-18-2010, 10:25 PM

Originally posted by jiwu2573 View Post

Another question:

Is there any need to run Bowtie alone as TopHat will call Bowtie anyway?

You can just run Tophat.

**Xi Wang** · 01-18-2010, 10:28 PM

Originally posted by jiwu2573 View Post

Another check:

My hg18.fa constructed from hg18 UCSC from TopHat website has the size of 3131776827 bytes, same as yours?

It should be ok. The size of mine is 3169831337. How about anybody else?

**jiwu2573** · 01-19-2010, 09:17 PM

Hi Xi,

Thanks a lot for your help.

Last night I managed to run 1 sample by tophat successfully (it took 17 hours).

I tried to visualize the output, coverage.wig and junctions.bed in UCSC genome browser.

When I load coverage.wig, it shows Error File 'coverage.wig' - Error line 3771557 of custom track: chromEnd less than 1 (0)

When I load junctions.bed, it only shows chromosome 20?
Name Description Type Doc Items Pos
junctions TopHat junctions bed 80207 chr20:

By the way, my junctions file is 6430K and my coverage file is around 173M, are these normal?

How do you do your visualization?

Do you use Cufflinks to quantify the expression after TopHat?

**Xi Wang** · 01-19-2010, 09:29 PM

1. for 'coverage.wig', you can use a text editor to see what happened in line 3771557.
2. for "junctions.bed", you can switch the chr by typing other chromosome ID in the blank following "position/search" . E.g: chr1
3. the sizes of the file look fine.

You can alternatively use some local viewers to visualize the results. But UCSC genome browser is recommended. And Cufflink is useful for identifying transcripts with expression.

**jiwu2573** · 01-20-2010, 09:06 PM

Hi Xi,

After I run Cufflinks and Cuffcompare, any output files can be used directly to feed into DEGseq to get differential expression analysis? I have 2 groups with each group 2 replicates.

If conversion of files are needed, could you please tell me the easiest and fastest way to get DEGseq running for my analysis done by TopHat and then Cufflinks and Cuffcompare?

**jiwu2573** · 01-20-2010, 09:10 PM

Another question is about identification of novel transcripts.

I understand that junctions.bed gives out all the possible splices, but how can we dig out the novel form from all these data? Certainly not by eyeballing through UCSC genome browser?

**jiwu2573** · 01-20-2010, 09:13 PM

I also get huge size of tophat_out folder (~22 GB for 1 human RNA seq).

Do you recommend to keep all these folders untouched or just keep sam, wig, and bed files instead? The inside folder log and tmp not useful at all?

**Xi Wang** · 01-21-2010, 05:56 PM

1. DEGseq: you can use my perl script for the conversion from SAM to BED (attached). I didn't test the script much, if any problem, let me know.
2. New transcript/junction: there should be a tool for doing such thing. Anyone knows such tools?
3. Tophat_out folder: the newest version of tophat will delete the tmp files automaticly, and I don't think the tmp files much help.

Attached Files

sam2bed12.pl (2.6 KB, 16 views)

**jiwu2573** · 01-21-2010, 07:19 PM

Thanks!

I've converted all my samples. No error message!

I know there are paper and manual explaining how to use DEGSeq, I am just a bit lazy, well, it does take long for me to digest everything.

Could you please just tell me the command line I put in for my 4 bed files (2 groups with 2 samples in each group) for DE gene analysis?

**Xi Wang** · 01-21-2010, 08:22 PM

I put an example below. Before you run DEGseq, you should first download a gene annotation file in refFlat format, maybe from UCSC genome browser. Some parameters for DEGseq should also be tuned according to your analysis purpose.

Code:

  mapResultBatch1 <- c("group1_file1", "group1_file2")  
  mapResultBatch2 <- c("group2_file2", "group2_file2")
  outputDir <- "DEGseq_out_dir"
  refFlat <- "gene annotation in refFlat format"
  DEGseq(mapResultBatch1, mapResultBatch2, fileFormat="bed", refFlat=refFlat,
         outputDir=outputDir, method="MARS")

**jiwu2573** · 01-24-2010, 04:33 PM

Hi Xi,

I've run the DEGseq, it works!

But I doubt actually I should use samWrapper instead?
2 groups are case and control, 2 samples in each group are from different individuals.
DEGseq seems to deal with the same sample with technology replicates, but not biology replicates.

Is my understanding correct? If use samWrapper, could you please tell me the command line as I can hardly copy your examples in the manual? Thanks

**Xi Wang** · 01-26-2010, 02:16 AM

Hi,

Yes, for biological replicates, it is better to use samWrapper. But I am afraid that your sample size (2 vs 2) is a little bit small.
A piece of code example for samWrapper:

Code:

# get gene express
  mapResultBatch <- c("group1_file1", "group1_file2", "group2_file2", "group2_file2")  
  output <- "gene.exp"
  refFlat <- "gene annotation in refFlat format"
  exp <- getGeneExp(mapResultBatch, refFlat = refFlat, output = output)

# apply samWrapper
geneExpFile <- "gene.exp"
set.seed(100)
geneExpFile1 <- geneExpFile
geneExpFile2 <- geneExpFile
output <- "samWrapperOut.txt"
expCol1 = c(2, 4)
expCol2 = c(6, 8)
measure1 = c(-1, -2)
measure2 = c(1, 2)
samWrapper(geneExpFile1 = geneExpFile1, geneCol1 = 1, expCol1 = expCol1, measure1 = measure1, 
geneExpFile2 = geneExpFile2, geneCol2 = 1, expCol2 = expCol2, measure2 = measure2, 
nperms = 100, min.foldchange = 2, max.qValue = 1e-04, output = output, paired = TRUE)

**jiwu2573** · 01-27-2010, 08:39 PM

Hi Xi,

The samWrapper also works well. But I am a bit confused about the column designation.

When I looked at the output "gene.exp", 1st col gives the gene ID, 2nd col for raw counts, 3rd col for RPKM, and 4th for all reads, then the pattern keeps on repeating for each sample. In this order, expCol1 should be col 2 and 5 (raw counts columns), but how comes 2 and 4 as you wrote?

Also I don't understand measure1 or measure2. What's the use of them? By the way, my samples are unpaired.

Topics	Statistics	Last Post
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, Today, 07:17 AM	0 responses 11 views 0 likes	Last Post by seqadmin Today, 07:17 AM
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 19 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM
Expanded Genetic Insights into Blood Pressure Regulation by seqadmin Started by seqadmin, 04-30-2024, 12:17 PM	0 responses 20 views 0 likes	Last Post by seqadmin 04-30-2024, 12:17 PM
The Role of Enhancers in Defining Cell Fate by seqadmin Started by seqadmin, 04-29-2024, 10:49 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-29-2024, 10:49 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News