SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count paolo.kunder Bioinformatics 10 10-22-2014 05:45 AM
htseq-count performance dglemay Bioinformatics 8 10-23-2012 08:08 PM
multiBamCov or htseq-count to count read per feature ? NicoBxl Bioinformatics 1 07-03-2012 03:05 AM
htseq-count gets more reads? deepsea Bioinformatics 3 03-29-2012 12:27 PM
htseq-count error sissi Bioinformatics 0 03-21-2012 12:40 AM

Reply
 
Thread Tools
Old 10-08-2012, 02:44 PM   #1
cpleis
Junior Member
 
Location: Urbana, IL

Join Date: Oct 2012
Posts: 8
Default Issue with htseq-count

Hello,
I am working on an RNA-Seq project in soybean. I used Illumina HiSeq for single end reads on my soybean samples and am working on my alignment. I have run into a problem when trying to get my read counts from HTSeq. I used Tophat2 to do my alignment. When I use htseq-count to get read counts from my accepted_hits file against the soybean GFF3 file I get the error:

Warning: Skipping read 'DBRHHJN1:263:C0REVACXX:5:1101:2413:101528', because chromosome 'Glyma07g15800.1|pacid:16267415', to which it has been aligned, did not appear in the GFF file.

This occurs for each read and the job stops running.

Here is my command line for htseq-count:
module load python/2.7.3
htseq-count -m union -s no -t gene -i ID /home/a-m/leisner1/accepted_hits.bam/accepted_hits_Blk1Con.sam /home/a-m/leisner1/accepted_hits.bam/Gmax_109_gene.gff3 > Blk1Con_counts_out

I realized that the accepted_hits.bam file (that I sorted by read name and converted to a sam file) has a different gene name than the GFF3 file I am using. Since I have multiple isoforms of my transcripts my gene names end in '.1', '.2' while the gene names in my GFF file do not. I am not sure how to solve this problem. When the bioinformatics core originally ran my alignment for me they said that to avoid this problem they extracted sequences for CDS, 5 UTR and 3 UTRs for each gene based on the coordinates in the GFF3 file provided by phytozome (version 8) and then concatenated them to form a single reference transcript sequence for each gene (.fna file). If I use this new file to do my alignment in TopHat I could get around the naming issue. However, they said that I may have to modify the GFF3 file with gene names and coordinates matching the Gmax_genes.fna file.

My overall question is 1) is there an easier/simpler way to avoid the different gene names so I can get htseq-count to work, and 2) if not, how to I modify the GFF3 file in order to get my read counts?

I realize I could simply 'trim' the names of the files (with some simple code I'm sure) but then I am not sure if that would mess up my count data.

Thanks!
cpleis is offline   Reply With Quote
Old 10-09-2012, 12:43 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

There are good reasons to align against the genome, not the transcriptome (see various previous threads here). Please don't deviate from the accepted best practise unless you know what you are doing. Especially, using TopHat and htseq-count when aligning against a transcriptome does not make any sense. (Read up on what these tools do precisely to se why.)
Simon Anders is offline   Reply With Quote
Old 10-10-2012, 10:50 AM   #3
cpleis
Junior Member
 
Location: Urbana, IL

Join Date: Oct 2012
Posts: 8
Default

Simon,
Thank you for your response. I re-did the alignment using the complete nucleotide sequence (Gmax_109.fa) instead of the transcriptome using TopHat. I then sorted the accepted_hits.bam file by read name (-n) using samtools and then converted this sorted file into a SAM file. When I loaded this file into htseq-count I am getting this error repeatedly:
Warning: Skipping read 'DBRHHJN1:263:C0REVACXX:5:1101:1886:109580', because chromosome 'scaffold_785', to which it has been aligned, did not appear in the GFF file.

I am still using the same htseq-count code as before the same GFF3 file. I am not really sure how to tackle this error.

Thanks!
cpleis is offline   Reply With Quote
Old 10-10-2012, 12:11 PM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

So, does the scaffold appear in your GFF file?

Have you checked whether the chromosome names in the genome fasta file and in the gff file match?
Simon Anders is offline   Reply With Quote
Old 10-10-2012, 01:45 PM   #5
cpleis
Junior Member
 
Location: Urbana, IL

Join Date: Oct 2012
Posts: 8
Default

Simon,
It looks like my gff file does not contain information about scaffold_785 and that is why reads mapping to scaffold_785 are skipped for counting. I'm assuming then that there is most likely there is no annotation information available for scaffold_785 and so is not present in gff file. There are only a handful of scaffolds that are being skipped. My chromosome names are the same in both files.

I re-ran the analysis and got an output file with:
no_feature 2421689
ambiguous 345554
too_low_aQual 0
not_aligned 0
alignment_not_unique 19096489

I started with 44,613,416 reads so I'm assuming this output is "normal" (?).

Thanks for the help!
cpleis is offline   Reply With Quote
Old 10-14-2012, 05:26 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Should be okay. Only the fraction of reads with ambiguous alignment is quite high. Transcribed sequence is usually not that repetitive that you would have so many reads mapping to more than one location. Maybe somethig went wrong in your assembly, and some contigs appear in several scaffolds.
Simon Anders is offline   Reply With Quote
Old 10-15-2012, 10:06 AM   #7
cpleis
Junior Member
 
Location: Urbana, IL

Join Date: Oct 2012
Posts: 8
Default

Would a large number of isoforms not account for the high number of ambiguous alignments? I was also going to try running my htseq-count alignment with a different mode. Currently I am using union, but I was also going to try running it with intersection_nonempty to see what kind of output I get.

If it is a problem with my alignment how would I troubleshoot that? Thanks!
cpleis is offline   Reply With Quote
Old 10-15-2012, 10:25 AM   #8
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Did you align against the genome or the transcriptome? If the latter, using htseq-count is pointless, of course.
Simon Anders is offline   Reply With Quote
Old 10-15-2012, 10:31 AM   #9
cpleis
Junior Member
 
Location: Urbana, IL

Join Date: Oct 2012
Posts: 8
Default

I re-did the alignment against the genome.
cpleis is offline   Reply With Quote
Reply

Tags
htseq, read counts, tophat

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:59 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO