![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Read counts from SAM file mapped to de novo assembled transcripts using HTSeq-count | alan_sm | RNA Sequencing | 2 | 06-12-2015 09:54 PM |
htseq read counts different in separate runs | h_manoj | Bioinformatics | 0 | 02-06-2014 11:46 AM |
Can CLC genomics read mapping files be used in Bioconductor/R and HTSeq-counts? | tdelaney | Bioinformatics | 1 | 02-20-2013 10:07 PM |
htseq-count with warning for every read to represent all of zero counts in output | hibachings2013 | RNA Sequencing | 10 | 07-15-2011 11:19 AM |
De novo assembly of human genomes with massively parallel short read sequencing | dan | Literature Watch | 0 | 12-21-2009 05:40 AM |
![]() |
|
Thread Tools |
![]() |
#1 | ||||
Member
Location: AATGCATCTCATGTTATCCTTCTCGAG Join Date: Mar 2010
Posts: 14
|
![]()
This is my basic workflow:
DATA and ALIGNMENT: 50SE Illumina reads mapped with tophat2 (v 2.0.11; default setting) to a bowtie2 index of hg38 (all chromosomes plus mitochondria, but no alternate chromosomal assemblies) outfile reports as follows: Quote:
1) Generated a "corrected" gtf file from UCSC (i.e., geneID and transcript IDs are different) using the method described here. 2) converted tophat alignment bam to sam (samtools) 3) counted reads with HTSeq using the general command: Quote:
Quote:
Quote:
Can anyone offer a simple explanation of what is happening here? The hand-wavy explanation that I have at the moment is that HTSeq count is very conservative, and if a read cannot be assigned unambiguously, and exclusively to a gene, it is not counted. However, the HTSeq output is secretly making me question the integrity of my pipeline and/or my input files. Thank you for any advice or reassurance. David EDIT: and it occurs to me that it might be relevant that the library prep was RiboMinus. Since I'm used to working with poly-A purified samples, my expectations might be too high in this case. Last edited by ExMachina; 07-30-2014 at 11:42 AM. |
||||
![]() |
![]() |
![]() |
#2 |
Junior Member
Location: Rochester mn Join Date: Jul 2014
Posts: 1
|
![]()
what is sureselect adapter sequences? indexing strand of Sureselect is different from indexing strand of Truseq LT, can anyone give some information?
|
![]() |
![]() |
![]() |
#3 |
Member
Location: Virginia Join Date: Mar 2011
Posts: 72
|
![]()
Love the location. You should sign your name the same way.
I have not done the 'correction' method you have referenced. I normally download the reference and annotation file from igenomes in tophat ready format. Can you post a snippet from your corrected gtf file? I believe HTSeq will now output a samfile with a field indicating what the read contributed to , have you looked at that and looked to see where they map? -o <samout>, --samout=<samout> write out all SAM alignment records into an output SAM file called <samout>, annotating each line with its assignment to a feature or a special counter (as an optional field with tag ‘XF’) |
![]() |
![]() |
![]() |
#4 |
Member
Location: AATGCATCTCATGTTATCCTTCTCGAG Join Date: Mar 2010
Posts: 14
|
![]()
One other change I just tried was the HTSeq option "--mode=intersection-nonempty "
I was hoping to get more information, but this option only decreased my "no_feature" count by about 5000. So I'm still left with the question: is it reasonable to have only ~15% of my mapped reads associate with annotated genes? |
![]() |
![]() |
![]() |
#5 |
Member
Location: Virginia Join Date: Mar 2011
Posts: 72
|
![]()
No, that is not reasonable. Can you post a few lines of your corrected gtf?
|
![]() |
![]() |
![]() |
#6 | |||
Member
Location: AATGCATCTCATGTTATCCTTCTCGAG Join Date: Mar 2010
Posts: 14
|
![]() Quote:
Quote:
Currently I am waiting on the imminent ENSEMBL release of 38 to run the same pipeline on an compare results to the UCSC-based results. Quote:
|
|||
![]() |
![]() |
![]() |
#7 |
Member
Location: Virginia Join Date: Mar 2011
Posts: 72
|
![]()
If it were me, I would align and count to the previous release just to see what the count assignments look like in terms of the number of reads in no_feature etc.
Another thing, are you sure the chromosome id's are all identical in the fasta and gtf? In the build I have, the chromosomes are labeled as 1, 2 etc instead of chr1, chr2 etc. |
![]() |
![]() |
![]() |
#8 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
|
![]()
Give "featureCounts" a try, while you are at it.
As bioBio pointed out, check to make sure your chrom ID's match in your ref/BAM/GTF files. |
![]() |
![]() |
![]() |
#9 |
Member
Location: Virginia Join Date: Mar 2011
Posts: 72
|
![]()
Oh, and make sure the reads are sorted in the bam file. HTSeq now has a flag for which method the sorting was performed, ie name or position.
|
![]() |
![]() |
![]() |
#10 | ||
Member
Location: AATGCATCTCATGTTATCCTTCTCGAG Join Date: Mar 2010
Posts: 14
|
![]() Quote:
![]() Same general results ![]() Quote:
And good point on the sorting bioBob. I checked and all my bam files are sorted ( SO:coordinate) |
||
![]() |
![]() |
![]() |
#11 |
Member
Location: Virginia Join Date: Mar 2011
Posts: 72
|
![]()
Right, so you need to set that flag as the default is name. You probably had a lot of messages in the console about mate not found.
-r <order>, --order=<order>¶ For paired-end data, the alignment have to be sorted either by read name or by alignment position. If your data is not sorted, use the samtools sort function of samtools to sort it. Use this option, with name or pos for <order> to indicate how the input data has been sorted. The default is name. |
![]() |
![]() |
![]() |
#12 | |
Member
Location: AATGCATCTCATGTTATCCTTCTCGAG Join Date: Mar 2010
Posts: 14
|
![]() Quote:
EDIT: actually, I don't think this is a problem for me here, since all I have are SE reads. Last edited by ExMachina; 07-31-2014 at 12:22 PM. |
|
![]() |
![]() |
![]() |
#13 |
Member
Location: AATGCATCTCATGTTATCCTTCTCGAG Join Date: Mar 2010
Posts: 14
|
![]()
Thanks for the input on this. As an update, here's what I've figured out:
The UCSC gtf file should not be made with the "knownGenes" table but should instead be made from the "refGene" table--the "knownGenes" table has too many overlapping coordinates. Using the refGene annotations, I now get ~33% of my mapped reads mapping (uniquely) to known genes. I feel a little more confident with this result, especially given the RiboMinus library prep. More comments are always welcome and I will report back here once the new ENSEMBL annotation is released. Thanks for the help! -David |
![]() |
![]() |
![]() |
Thread Tools | |
|
|