SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Error with GTF file when using htseq-count MDonlin Bioinformatics 13 01-13-2015 08:29 AM
DEXSeq vs htseq-count/DESeq counting model jdsv Bioinformatics 2 11-20-2011 07:48 PM
HTSeq Script from DEXSeq Reports Assertion Fail in SAM file FuzzyCoder Bioinformatics 5 09-27-2011 08:52 AM
htseq-count with warning for every read to represent all of zero counts in output hibachings2013 RNA Sequencing 10 07-15-2011 10:19 AM
understanding HTSeq counts nimmi Bioinformatics 3 11-27-2010 07:24 PM

Reply
 
Thread Tools
Old 10-10-2011, 08:59 PM   #1
FuzzyCoder
Member
 
Location: Minnesota

Join Date: Aug 2011
Posts: 13
Default DEXSeq Using Counts File From htseq-count

I am attempting to run DEXSeq in R using counts files generated by the htseq-count python script rather than the dexseq-count script provided with the DEXSeq module for R 2.14.

Here is the R script:

Code:
library(DEXSeq)
annotationfile = file.path("/media/myLab/DEXSeq/Mus_musculus.NCBIM37.64.gff")
samples = data.frame(
  condition = c("Condition1", "Condition2"),
  replicate = c(1:1,1:1),
  type = c("paired-end", "paired-end"),
  row.names = c("30D21712WKS", "B0541412WKS"),
  stringsAsFactors = TRUE,
  check.names = FALSE
)

ecs = read.HTSeqCounts(countfiles = file.path("/media/myLab/DEXSeq/Counts", paste(paste(rownames(samples),"Counts", sep="_"), "txt", sep=".")),
                       design = samples,
                       flattenedfile = annotationfile
)
Here is the error I receive:

Code:
Error in read.HTSeqCounts(countfiles = file.path("/media/myLab/DEXSeq/Counts",  : 
  Count files do not correspond to the flattened annotation file
In addition: Warning message:
In aggregates$gene_id == genesrle :
  longer object length is not a multiple of shorter object length
Here is my question:

How do I properly format/use the htseq-count counts file to create the ExonCountSet object?

Here is what I know:

1) GFF file is a flattened GTF using dexseq_prepare_annotation.py
2) The GTF file is Mus_musculus.NCBIM37.64.gtf
3) The same GTF file was used to generate the SAM file using GSNAP
4) The SAM produced by GSNAP was formatted in the following ways to be compatible with DEXSeq/HTSeq:
A) Replaced "." with "N" in sequences in SAM
B) Replaced spaces with tabs in SAM
C) Removed 'chr' from chromosome name in SAM
D) Dropped alignments with '*' for QUAL string in SAM
E) Sorted alignments for selecting best pairs in SAM
F) Selected best alignment pairs to reduce ambiguous alignment count from SAM
G) Sorted selected alignments for counting in SAM
H) Created HTSeq Counts File
NOTES:

Re: 4C): 'chr' was not present in chromosome field of GFF, just the name (i.e., '1', not 'chr1')

Re: 4E) sorted on name (field 1) and QUAL (field 5)

Re: 4F) applied the following script to choose the best scoring pair among multiple alignments (based on a post from Simon):

Code:
import sys, re, optparse
import HTSeq
	
def choose_best( samFile ):
	insam = HTSeq.SAM_Reader( samFile )

	# Go through all reads, with their alignments bundled up:
	for bundle in HTSeq.bundle_multiple_alignments( insam ):
	   fBestAlmt = None
	   rBestAlmt = None
	   # Go through all alignments of a given read, looking
	   # for the one with the best alignment score
	   for almt in bundle:
		  if almt.pe_which == "first": 
			 if fBestAlmt is None:
				fBestAlmt = almt
			 elif almt.aQual > fBestAlmt.aQual:
				fBestAlmt = almt
			 elif almt.aQual == fBestAlmt:
				# If there are more than one best alignment, 
				# better skip the read
				fBestAlmt = None
		  elif almt.pe_which == "second": 
			 if rBestAlmt is None:
				rBestAlmt = almt
			 elif almt.aQual > rBestAlmt.aQual:
				rBestAlmt = almt
			 elif almt.aQual == rBestAlmt:
				# If there are more than one best alignment, 
				# better skip the read
				rBestAlmt = None

	   if fBestAlmt is not None and rBestAlmt is not None:
		   if fBestAlmt.aQual > rBestAlmt.aQual:
			  if fBestAlmt.mate_start is not None:
				 for mAlmt in bundle:
					if mAlmt.iv is not None:
					   if mAlmt.iv.start is not None:
						  if fBestAlmt.mate_start.pos == mAlmt.iv.start:
							 rBestAlmt = mAlmt
		   elif rBestAlmt.aQual > fBestAlmt.aQual:
			  if rBestAlmt.mate_start is not None:
				 for mAlmt in bundle:
					if mAlmt.iv is not None:
					   if mAlmt.iv.start is not None:
						  if rBestAlmt.mate_start.pos == mAlmt.iv.start:
							 fBestAlmt = mAlmt

	   if fBestAlmt is not None:
		  # Change the NH field to 1 and print the line
		  print re.sub( "NH:i:\d+", "NH:i:1", fBestAlmt.original_sam_line ),
		  
	   if rBestAlmt is not None:
		  # Change the NH field to 1 and print the line
		  print re.sub( "NH:i:\d+", "NH:i:1", rBestAlmt.original_sam_line ),

def main():
	optParser = optparse.OptionParser()
	(opts, args)  = optParser.parse_args()
	
	choose_best(args[0])
Re: 4G): Sorted on fields 1 and 2 in SAM

Re: 4H): used htseq-count instead of dexseq-count because GFF contained attribute "exonic_part" not "exon" and dexseq_count was expecting "exon" whereas htseq-count alloowed me to specify the attribute name. Here is the htseq-count command:


Code:
htseq-count -m intersection-nonempty -s no -t exonic_part -i gene_id -o mysam_XF.sam mysam_Final.sam Mus_musculus.NCBIM37.64.gff > mysam_Counts.txt
Re: 4H): dexseq_count results in no counts for any features. The call to htseq-count results in many matches to features and acceptable rejects.

Re 4H): The counts file from htseq is notably different from that dexseq_count:

From dexseq_count:

Code:
ENSMUSG00000093219+ENSMUSG00000025261:001	0
ENSMUSG00000093219+ENSMUSG00000025261:002	0
ENSMUSG00000093219+ENSMUSG00000025261:003	0
ENSMUSG00000093219+ENSMUSG00000025261:004	0
ENSMUSG00000093219+ENSMUSG00000025261:005	0
ENSMUSG00000093219+ENSMUSG00000025261:006	0
ENSMUSG00000093219+ENSMUSG00000025261:007	0
ENSMUSG00000093219+ENSMUSG00000025261:008	0
ENSMUSG00000093219+ENSMUSG00000025261:009	0
ENSMUSG00000093219+ENSMUSG00000025261:010	0
.
.
.
whereas from htseq-count:

Code:
ENSMUSG00000093219+ENSMUSG00000025261	4532
Note 1 record in htseq-count for this feature but 136 records in dexseq_count.

From what I can tell, dexseq_count includes the exonic_part_number in the records of the counts file whereas htseq-count does not.

How can I modify htseq-count to produce the same file as dexseq_count?
OR
How can I modify DEXSeq to accept the counts file produced by htseq-count?
OR
How can I use the counts file from htseq-count combined with other data to properly create the ExonCountSet object in R?

or, humbly, what do I not know that I should know to make it the last step?

NOTE: I had no luck directly calling dexseq_count with sams produced by TopHat nor GSNAP. I had to do the format changes in step 4 to make the process work for htseq-count. I have not yet succeeded in running dexseq_count.
__________________
Best Regards,

Paul Bergmann

Last edited by FuzzyCoder; 10-10-2011 at 09:03 PM. Reason: Cleaned code snippets
FuzzyCoder is offline   Reply With Quote
Old 10-10-2011, 11:54 PM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

It should be much easier to get dexseq_count.py to work than to try to use htseq_count.py. The reason that I have written two scripts is, after all, that the tasks are not exactly the same.

dexseq_count.py should not have any problems with a GFF file produced with dexseq_prepare.py.

Quote:
I used htseq-count instead of dexseq-count because GFF contained attribute "exonic_part" not "exon" and dexseq_count was expecting "exon" whereas htseq-count alloowed me to specify the attribute name.
dexseq_prepare.py has split and renamed all your "exon" lines to "exonic_part", and this is what dexseq_count.py expects.

We have tested it with TopHat SAM files but not yet with GSNAP SAM files. With TopHat, it works fine.

Maybe you can send me by email an excerpt from your TopHat and GSNAP SAM files and then, we can investigate.
Simon Anders is offline   Reply With Quote
Old 10-11-2011, 06:21 PM   #3
FuzzyCoder
Member
 
Location: Minnesota

Join Date: Aug 2011
Posts: 13
Default

Thank you Simon.

After reading your reply, I decided to rerun dexseq_count.py and it indeed worked . It seems that I did not actually follow all the pre-processing steps necessary to make the SAM compatible when I last tried dexseq_count .

I have been able to generate count files and create an ExonCountSet object per the 10/2/2011 pasilla vignette (using my data) through estimateDispersions.

However, I now get an error when attempting to run fitDispersionFunction:

Code:
Error in glmgam.fit(mm, disps[good], start = coefs) : 
  More columns than rows in X
In addition: Warning message:
In is.na(rows) : is.na() applied to non-(list or vector) of type 'NULL'
Error in fitDispersionFunction(ecs) : 
  Failed to fit the dispersion function
Any thoughts?

I will attempt to email the RData containing the ecs. However, it is 7MB, so please let me know if that does not make it to you. I can give you access to it via DropBox or FTP at your preference.
__________________
Best Regards,

Paul Bergmann
FuzzyCoder is offline   Reply With Quote
Old 10-12-2011, 12:27 AM   #4
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Also... you don't have the latest version of DEXSeq (0.99.0). You could also try updating it.
areyes is offline   Reply With Quote
Old 10-12-2011, 07:26 AM   #5
FuzzyCoder
Member
 
Location: Minnesota

Join Date: Aug 2011
Posts: 13
Default

Alejandro-

I updated to 0.1.29 with biocLite. Same error.

I emailed the data file to you and Simon (~7MB). I also sent you an invitation to my DropBox folder.

Thanks for your assistance.
__________________
Best Regards,

Paul Bergmann
FuzzyCoder is offline   Reply With Quote
Old 10-12-2011, 08:15 AM   #6
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

I found the reason why the function is breaking. DEXSeq follows the motivation of DESeq package to use biological replicates to estimate the variance between samples to distinguish real effects from your treatments from just technical and biological variation, in this case you don't have biological replicates and the individual exon dispersion estimations give values that are basically 0. (Try doing the Figure 1 of the vignette) Then, at the time of estimating the dispersion function it just breaks, because the data behaves differently of what we are expecting.
areyes is offline   Reply With Quote
Old 10-12-2011, 08:26 AM   #7
FuzzyCoder
Member
 
Location: Minnesota

Join Date: Aug 2011
Posts: 13
Default

Thank you!

I will try it with additional replicates. I only used one per condition because I wanted to work my way through the GSNAP -> DEXSeq workflow successfully before I ran all the replicates through GSNAP (~12 hours per replicate). I will let you know how it goes tomorrow.
__________________
Best Regards,

Paul Bergmann
FuzzyCoder is offline   Reply With Quote
Old 10-18-2012, 12:39 AM   #8
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Default

Dear all,
I try to use DExseq but got the following error when I call the function

ecs<- fitDispersionFunction(ecs)

Warning message:
In glmgam.fit(mm, disps[good], coef.start = coefs) :
Too much damping - convergence tolerance not achievable


Here is the version in R I use
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] DEXSeq_1.4.0 Biobase_2.18.0 BiocGenerics_0.4.0
loaded via a namespace (and not attached):
[1] RCurl_1.95-1.1 XML_3.95-0.1 biomaRt_2.14.0 hwriter_1.3 plyr_1.7.1
[6] statmod_1.4.16 stringr_0.6.1

Any suggestion on what is going on?

Cheers
Olivier
oliviera is offline   Reply With Quote
Old 10-18-2012, 12:51 AM   #9
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Hi oliviera,

This warning sometimes happens when your data is sparsed. Have you done the "fit diagnostics" as indicated in the vignette? As it is just a warning, you can go on with your analysis, maybe you will loose some power if the fit is "above" most of the points...

Alejandro
areyes is offline   Reply With Quote
Old 10-18-2012, 02:00 AM   #10
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Default

Hi Alejandro,

Here is the graph to check dispersion estimate. What do you think?

meanvalues<- rowMeans(counts(ecs))
plot(meanvalues, fData(ecs)$dispBeforeSharing, log="xy", main="mean vs CR dispersion")
x<- 0.01:max(meanvalues)
y<- ecs@dispFitCoefs[1] + ecs@dispFitCoefs[2] / x
lines(x, y, col="red")
Attached Images
File Type: jpg Screen shot 2012-10-18 at 11.55.12 AM.jpg (79.7 KB, 53 views)
oliviera is offline   Reply With Quote
Old 10-18-2012, 02:02 AM   #11
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

I think it should be OK, how many hits do you get?
areyes is offline   Reply With Quote
Old 10-20-2012, 12:53 AM   #12
oliviera
Member
 
Location: germany

Join Date: Apr 2010
Posts: 31
Default

Only 92 with adjp < 0.01.
With DEseq I get ~2000 genes at this cut off. I think I made something wrong
oliviera is offline   Reply With Quote
Old 04-17-2013, 12:45 PM   #13
metheuse
Member
 
Location: US

Join Date: Jan 2013
Posts: 77
Default

I got the same problem. It's all zeros in the second column of the output of dexseq_count.py
I saw you said you realized later you didn't follow the manual exactly. Could you tell me what's the problem? Thanks.
metheuse is offline   Reply With Quote
Old 04-17-2013, 01:56 PM   #14
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by oliviera View Post
Only 92 with adjp < 0.01.
With DEseq I get ~2000 genes at this cut off. I think I made something wrong
Why would you use such as stringent cut-off?

Do you really need to make sure that your list of differentially used exons do not contain more than 1% false positives? In most use cases, 10% are considered acceptable.

And: Of course, you get much more genes than exons. Detecting differential expression is needs to see much less information in the data than detecting differential exon usage.
Simon Anders is offline   Reply With Quote
Old 04-17-2013, 01:58 PM   #15
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by metheuse View Post
I got the same problem. It's all zeros in the second column of the output of dexseq_count.py
I saw you said you realized later you didn't follow the manual exactly. Could you tell me what's the problem? Thanks.
Hav you checked your alignments with a genome browser? Load the SAM file and the GFF file produced by dexseq_prepare in, e.g., IGV, and look at one of the loci with zero counts. If there really are no reads, you experiment has failed (or you are using a wrong annotation file).
Simon Anders is offline   Reply With Quote
Old 04-17-2013, 05:10 PM   #16
metheuse
Member
 
Location: US

Join Date: Jan 2013
Posts: 77
Default

Quote:
Originally Posted by Simon Anders View Post
Hav you checked your alignments with a genome browser? Load the SAM file and the GFF file produced by dexseq_prepare in, e.g., IGV, and look at one of the loci with zero counts. If there really are no reads, you experiment has failed (or you are using a wrong annotation file).
I converted the bam file to bed and intersected it with "chr1 29385323 29385364 ENSG00000159023:023" (which has zero count in the dexseq_count.py output)
This resulted in 71 intersecting reads. Here are the first 10 of them:
Code:
chr1    29379725        29391495        HWI-ST1235:101:C1WW9ACXX:6:1312:1770:71045/1    50      +
chr1    29379725        29391495        HWI-ST1235:101:C1WW9ACXX:6:2116:17025:56846/2   50      -
chr1    29379731        29391501        HWI-ST1235:101:C1WW9ACXX:6:2307:18153:8715/1    50      +
chr1    29379734        29391504        HWI-ST1235:101:C1WW9ACXX:6:2314:2163:52123/2    50      +
chr1    29379736        29391506        HWI-ST1235:101:C1WW9ACXX:6:2314:2163:52123/1    50      -
chr1    29379741        29391511        HWI-ST1235:101:C1WW9ACXX:6:2313:2799:76858/1    50      +
chr1    29379742        29391512        HWI-ST1235:101:C1WW9ACXX:6:2210:9177:71165/1    50      +
chr1    29379742        29391512        HWI-ST1235:101:C1WW9ACXX:6:1101:15865:15952/2   50      -
chr1    29379742        29391512        HWI-ST1235:101:C1WW9ACXX:6:1307:5858:86759/2    50      -
chr1    29379742        29391512        HWI-ST1235:101:C1WW9ACXX:6:1308:20389:32047/1   50      -
This should mean both my reads and the annotation file has no problem.

Last edited by metheuse; 04-17-2013 at 05:12 PM.
metheuse is offline   Reply With Quote
Old 04-17-2013, 05:23 PM   #17
metheuse
Member
 
Location: US

Join Date: Jan 2013
Posts: 77
Default

By the way, these are the commands I used:
Code:
samtools index 21722_mapped_hg19/accepted_hits.bam
samtools view 21722_mapped_hg19/accepted_hits.bam >21722_accepted_hits.sam
sort -k1,1 -k2,2n 21722_accepted_hits.sam >21722_accepted_hits_sorted.sam
python ~/scripts_64/dexseq_count.py -p yes -s no ~/scripts_64/Homo_sapiens.GRCh37.71.DEXSeq.chr.gff 21722_accepted_hits_sorted.sam KARPAS299_CEP.txt

Last edited by metheuse; 04-17-2013 at 08:54 PM.
metheuse is offline   Reply With Quote
Old 04-17-2013, 09:42 PM   #18
metheuse
Member
 
Location: US

Join Date: Jan 2013
Posts: 77
Default

I just noticed the chromosome name of the gff file doesn't contain "chr":
Code:
1    Homo_sapiens.GRCh37.71.gtf      exonic_part     11869   11871   .       +       .       transcripts "ENST00000456328"; exonic_part_number "001"; gene_id "ENSG00000223972"
This is probably the reason. I've added chr to each line and see if it works.
metheuse is offline   Reply With Quote
Old 02-09-2014, 03:00 AM   #19
sindrle
Senior Member
 
Location: Norway

Join Date: Aug 2013
Posts: 266
Default

I also have a question about warnings and dispersion:

ecs <- estimateSizeFactors( ecs )
the matrix is either rank-deficient or indefinite

ecs <- fitDispersionFunction( ecs )
Too much damping - convergence tolerance not achievable

Screen Shot 2014-02-09 at 12.58.40.jpg

And does this look ok?

Thanks! First time DEXSeq..
sindrle is offline   Reply With Quote
Old 12-28-2015, 07:54 AM   #20
czelin
Junior Member
 
Location: Switzerland

Join Date: Feb 2013
Posts: 1
Default

Dear all,

I have recently started with exon-wise analysis and would appreciate your help.

I have paired 100bp reads. I have prepared the annotation file with DEXSeq python scripts. What I have realized is that when I have a shorter exon (e.g. 150bp) the number of tags is 0. In IGV I can see lots (>500) of reads spanning this exon. Somehow few of the reads were counted for the control samples and none for comparison and this exon was reported as significantly DE.
I suspect that is because the two read pairs are longer than this exon and they overlapped adjacent exon[s]. This caused the reads be considered as "_ambiguous_readpair_position". If my understanding is correct, is there any way to solve the short-exon issue? If my understanding is wrong, could you please correct me?
czelin is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:26 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO