Seqanswers Leaderboard Ad

**Simon Anders** · 07-03-2011, 04:55 AM

In principle, you could follow the 'Tour' in the HTSeq documentation to learn how to do this yourself.

The htseq-count script falls into two parts: the main part doing the actual counting, and surrounding it a thin wrapper to get the intervals out of the SAM file. This is all that needs to be changed, and once I separate these into two modular parts, it will be even easier. Might take a while until I find the time to so, though.

S

**joro** · 07-28-2011, 06:09 AM

Hi,

I have some reads in the original SAM file that don't seem to be accounted for in the output of htseq-count. To look in more detail I created a new SAM file using --samout. The reads aren't in this SAM file. Presumably they should be in the SAM file but assigned something like "no_feature"? Any suggestions are welcome, thanks.

**Simon Anders** · 08-04-2011, 08:05 AM

Ho Joro,

did you get any warnings about reads being skipped? Otherwise, could you send me some examples of SAM lines that are in the input but not the output file, togetehr with your GTF file? Thanks.

Simon

**joro** · 08-05-2011, 01:15 AM

Thanks for your reply. No I didn't get any warnings about reads being skipped. I'll email the files to you.

**joro** · 08-05-2011, 02:48 AM

Oops, I didn't realise that I had the quiet option on! Yes I'm now getting warnings about the chromosomes not appearing in the GFF file. I guess these reads are not included in the output SAM file? Thanks for your help Simon.

**fabrice** · 08-05-2011, 03:24 AM

I am runing DEXseq and htseq-count and have some problems. Maybe others are interesting to this or have the answer, so I post to here again.

When I run dexseq_count.py in DEXseq on my sam file from bwa, it always has this warning:

~/python/lib/python2.7/site-packages/HTSeq-0.5.3p1-py2.7-linux-x86_64.egg/HTSeq/__init__.py:543:
UserWarning: Malformed SAM line: RNAME != '*' although flag bit
&0x0004 set
algnt = SAM_Alignment.from_SAM_line( line )
~/python/lib/python2.7/site-packages/HTSeq-0.5.3p1-py2.7-linux-x86_64.egg/HTSeq/__init__.py:592:
UserWarning: Read GA2-EAS337_0017_FC:1:8:11202:16260#0 claims to have
an aligned mate which could not be found. (Is the SAM file properly
sorted?)
"which could not be found. (Is the SAM file properly sorted?)" )

but I can found GA2-EAS337_0017_FC:1:8:11202:16260#0 mate in the data,
as show in the attachments a.sam. And the a.sam is sorted by the name
(default by bwa).

In htseq-count, it also have this problem.

htseq-count a.sam Homo_sapiens.GRCh37.63/Homo_sapiens.GRCh37.63.gtf
100000 GFF lines processed.
200000 GFF lines processed.
...
2000000 GFF lines processed.
2064769 GFF lines processed.
Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
Warning: Read GA2-EAS337_0017_FC:1:8:11202:16260#0 claims to have an
aligned mate which could not be found. (Is the SAM file properly
sorted?)
4 sam line pairs processed.

Why here report 4 line pairs? In fact, there are 3 sam line pairs in a.sam.

BTW: Do this kind of count using BWA or tophat
mapping, which one is better? (Illumina pair-end human RNA-seq data)

Thank you.

Attached Files

a.sam.tar.gz (996 Bytes, 63 views)

**usad** · 08-15-2011, 12:32 PM

Install instructions for CentOS (RHEL)

Hi,

for what it's worth I detailed the additional steps necessary to install HTSeq on CentOS (should work on Red Head Enterprise as well) in RNA-Seq howto wiki. (Or rather in the first incomplete alpha version)

SEQanswers

http://seqanswers.com/wiki/How-to/RNASeq_analysis#Summarizing_Reads

Best Wishes,
Björn

**Tobias Kockmann** · 09-26-2011, 07:42 AM

alignment_not_unique

Hi HTSeq users,

I realized that the htseq-count script is very conservative when counting reads matching to known gene models:

I generated an alignment of my RNA-seq reads to the Drosophila genome using tophat in combination with known Ensembl gene models (provided by gtf file). The output in sam format looks like this:

Code:

[tobiasko@bs-dsvr09 tophat_out]$ head accepted_hits.sam
-1350550920:7:9:328:542 0       chr2L   8880    3       35M     *       0       0       CTGGACTCGCAAAGTGGACTTGTTCAGCAAGGACA     6IC@/IIIIIFABI2II1III3'E2>@I:,&'(31     NM:i:0  NH:i:2  CC:Z:chrX       CP:i:19097273   XS:A:-
-1350550920:7:77:1586:563       0       chr2L   8883    3       35M     *       0       0       GACTCGCAAAGTGGACTTGTTCAGCAAGGACATAA     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII     NM:i:0  NH:i:2  CC:Z:chrX       CP:i:19097270   XS:A:-
-1350550920:7:24:1131:1918      0       chr2L   9327    3       35M     *       0       0       TTTTACAAATCTTACGTAAACACTCCAAGCATGAA     IIIIIIIIIIIIIIII?III=ICEI;EI:IIDII;     NM:i:0  NH:i:2  CC:Z:chrX       CP:i:19096819   XS:A:-

As indicated by the optional NH flag some of the reads are mapped to multiple locations (e.g. first line -> NH:i:2). My problem is, that htseq-counts ignores all reads/alignments that have been mapped to multiple locations by default, which in my case leads to a massive loss of coverage (>50%).

I wonder, if there is a way to either tune tophat to only report the best alignment per read like one can do when using bowtie with -k1 --best, or to modify the couting scheme of htseq-count to be less stringent (e.g. count every read once at the best alignment position).

Greetings,
Tobi

**joro** · 09-27-2011, 12:39 AM

Hi Tobi, could you write a script to change the NH flag to NH:i:1 in the sam file so that htseq-count doesn't ignore them?

**Tobias Kockmann** · 09-27-2011, 05:57 AM

That is easily doable - but this would imply that reads are counted several time as part of different alignments. To my knowledege such a way of counting is not compatible with the statistical model used by DESeq and this is actually what I would like to use the counts for.

**Simon Anders** · 09-28-2011, 01:41 AM

For an explanation why htseq-count skips multiply matched reads, see post #4 of this thread: http://seqanswers.com/forums/showthread.php?t=9129

Now, if you have two mappings for a read and one has high quality and the other does not, it would indeed make sense to use the good one and discard the bad one rather than skipping both. Hence, you could sort you SAM file by read ID, so that multiple mappings are in adjacent lines and the write a script to filter the best one. Here is a quick attempt to do this with HTSeq:

Code:

import sys, re
import HTSeq

insam = HTSeq.SAM_Reader( sys.stdin )

# Go through all reads, with their alignments bundled up:
for bundle in HTSeq.bundle_multiple_alignments( insam ):
   bestAlmt = None
   # Go through all alignments of a given read, looking
   # for the one with the best alignment score
   for almt in bundle:
      if bestAlmt is None:
         bestAlmt = almt
      elif almt.aQual > bestAlmt.aQual:
         bestAlmt = almt
      elif almt.aQual == bestAlmt:
         # If there are more than one best alignment, 
         # better skip the read
         bestAlmt = None
   if bestAlmt is not None:
      # Change the NH field to 1 and print the line
      print re.sub( "NH:i:\d+", "NH:i:1", bestAlmt.original_sam_line ),

Call this with something like

Code:

sort samfile.sam | python chooseBest.py > filtered.sam

.

Be aware hat I haven't really tested this thoroughly.

**ssharma** · 09-28-2011, 12:49 PM

Error while using htseq-count

Hi,
I am getting an error while using htseq-count.
I used the following command:
htseq-count CL1_forBWA_left_DSS3.sam NC_003911.gff

This is the error i am getting. I got all the files from ncbi.
Error occured in line 2169 of file NC_003911.gff.
Error: Feature tRNA does not contain a 'gene_id' attribute
[Exception type: SystemExit, raised in count.py:55]

I would really appreciate if anyone can help me out.

Thanks
Shalabh

**joro** · 09-29-2011, 12:50 AM

Hi Shalabh, perhaps it would help if you showed us line 2169 of the *gff file.

**Tobias Kockmann** · 09-29-2011, 01:26 AM

Originally posted by Simon Anders View Post

For an explanation why htseq-count skips multiply matched reads, see post #4 of this thread: http://seqanswers.com/forums/showthread.php?t=9129

Now, if you have two mappings for a read and one has high quality and the other does not, it would indeed make sense to use the good one and discard the bad one rather than skipping both. Hence, you could sort you SAM file by read ID, so that multiple mappings are in adjacent lines and the write a script to filter the best one. Here is a quick attempt to do this with HTSeq:

Code:

import sys, re
import HTSeq

insam = HTSeq.SAM_Reader( sys.stdin )

# Go through all reads, with their alignments bundled up:
for bundle in HTSeq.bundle_multiple_alignments( insam ):
   bestAlmt = None
   # Go through all alignments of a given read, looking
   # for the one with the best alignment score
   for almt in bundle:
      if bestAlmt is None:
         bestAlmt = almt
      elif almt.aQual > bestAlmt.aQual:
         bestAlmt = almt
      elif almt.aQual == bestAlmt:
         # If there are more than one best alignment, 
         # better skip the read
         bestAlmt = None
   if bestAlmt is not None:
      # Change the NH field to 1 and print the line
      print re.sub( "NH:i:\d+", "NH:i:1", bestAlmt.original_sam_line ),

Call this with something like

Code:

sort samfile.sam | python chooseBest.py > filtered.sam

.

Be aware hat I haven't really tested this thoroughly.

Thanks for the code! I tested it using a tophat alignment:

Code:

[tobiasko@bs-bsse01 tophat_output_n_mode]$ ls -l
total 269918
-rw-r--r-- 1 tobiasko bsse-paro 275526474 Sep 26 10:26 accepted_hits.bam
-rw-r--r-- 1 tobiasko bsse-paro        52 Sep 26 10:24 deletions.bed
-rw-r--r-- 1 tobiasko bsse-paro        54 Sep 26 10:24 insertions.bed
-rw-r--r-- 1 tobiasko bsse-paro        52 Sep 26 10:24 junctions.bed
-rw-r--r-- 1 tobiasko bsse-paro        68 Sep 26 10:09 left_kept_reads.info
drwxr-xr-x 2 tobiasko bsse-paro        13 Sep 26 10:21 logs
[tobiasko@bs-bsse01 tophat_output_n_mode]$ samtools sort -n accepted_hits.bam accepted_hits.sortedByName
[bam_sort_core] merging from 2 files...
[tobiasko@bs-bsse01 tophat_output_n_mode]$ samtools view accepted_hits.sortedByName.bam | python /usr/local/paro/scripts/selectBest.py > accepted_hits.sortedByName.best.sam

I think in principle it works, but there are blank lines in the sam output. Could this be fixed quickly???

These are the sam file heads:

Code:

[tobiasko@bs-bsse01 tophat_output_n_mode]$ samtools view accepted_hits.sortedByName.bam | head
-1350550920:7:1:1:309   0       chr2R   12473665        255     35M     *       0       0       AAAGTTGTGTTTTTCGTACATTTTTCCAATTCACG     .<;:3I8=@2IFII2*G;2B?III%+,:7GI,2')     NM:i:1  NH:i:1
-1350550920:7:1:1:862   16      chr3L   9776491 255     35M     *       0       0       AGAAGAAGAAACCTAGTCCGCTGGAGAAAACTGGA     I,>A'DC,IB0&/.G7.(&8(6>BI7IC@B:G39I     NM:i:0  NH:i:1
-1350550920:7:1:1:1324  0       chr2L   9788256 255     35M     *       0       0       TTAGATAAGAGAAGAGTATTCACCTCATCGCCCGG     III6;III+<4ID%=)$@ID9I>1$88>5%667%)     NM:i:2  NH:i:1
-1350550920:7:1:1:1598  0       chr2L   19448548        255     35M     *       0       0       TGAGGCCGACTTCTGGCCCTTCAAAGCCTTCGCGG     ;18,;E>/?7II:I<+;:>CI<11-.43;C.(2''     NM:i:0  NH:i:1
-1350550920:7:1:1:1881  16      chrX    6547455 255     35M     *       0       0       TTCAAACGCTTCTGGATAGCGGTGAGCACATCCCA     +*+CE.37,;+D5C;6;E4@53IGII85II36<2>     NM:i:0  NH:i:1
-1350550920:7:1:2:92    0       chrX    13614346        255     35M     *       0       0       ACGGTGGCAGCTGAGTCCGTGACATCGTGTTCGCT     ?2CE16?/+>9I-4:=/7A+-4'(8(88++1$($6     NM:i:0  NH:i:1
-1350550920:7:1:2:195   0       chr3R   16964869        255     35M     *       0       0       GACACCACCAACGCGGGCTCCTACGTGGCCTGTTC     /7.1'36.+2I/<16:7/B.,-.+/)4*%++),*)     NM:i:0  NH:i:1
-1350550920:7:1:2:198   0       chrX    9515569 255     35M     *       0       0       AAGCCAAATGCACTGCCCCGTGCCCACACATGCTT     A98''G:?05+;;>/30'/?05%$--,.,0+.-,-     NM:i:0  NH:i:1
-1350550920:7:1:2:222   16      chr3R   10049214        255     35M     *       0       0       GAGCTCATAGAAGCGATTAAAGAAGCTATTGGGCG     )8+%6&?9C.9?5*7CI?III7EF<94:3C//6.5     NM:i:0  NH:i:1
-1350550920:7:1:2:237   16      chr3R   26711769        255     35M     *       0       0       GTTGGCCACCAAGACGTCGCGAAGGTGCTTTACTA     -242-(<+,:=8:6596DCFB>D:?176GI@6A:A     NM:i:0  NH:i:1

[tobiasko@bs-bsse01 tophat_output_n_mode]$ head  accepted_hits.sortedByName.best.sam
-1350550920:7:1:1:309   0       chr2R   12473665        255     35M     *       0       0       AAAGTTGTGTTTTTCGTACATTTTTCCAATTCACG     .<;:3I8=@2IFII2*G;2B?III%+,:7GI,2')     NM:i:1  NH:i:1

-1350550920:7:1:1:862   16      chr3L   9776491 255     35M     *       0       0       AGAAGAAGAAACCTAGTCCGCTGGAGAAAACTGGA     I,>A'DC,IB0&/.G7.(&8(6>BI7IC@B:G39I     NM:i:0  NH:i:1

-1350550920:7:1:1:1324  0       chr2L   9788256 255     35M     *       0       0       TTAGATAAGAGAAGAGTATTCACCTCATCGCCCGG     III6;III+<4ID%=)$@ID9I>1$88>5%667%)     NM:i:2  NH:i:1

-1350550920:7:1:1:1598  0       chr2L   19448548        255     35M     *       0       0       TGAGGCCGACTTCTGGCCCTTCAAAGCCTTCGCGG     ;18,;E>/?7II:I<+;:>CI<11-.43;C.(2''     NM:i:0  NH:i:1

-1350550920:7:1:1:1881  16      chrX    6547455 255     35M     *       0       0       TTCAAACGCTTCTGGATAGCGGTGAGCACATCCCA     +*+CE.37,;+D5C;6;E4@53IGII85II36<2>     NM:i:0  NH:i:1

**Tobias Kockmann** · 09-29-2011, 01:46 AM

Sorry, my mistake. The comma at the end of the print line escaped the c&p.

Topics	Statistics	Last Post
Evaluating Genome Sequencing for ECMO Patients in the NICU by seqadmin Started by seqadmin, 12-17-2024, 10:28 AM	0 responses 33 views 0 likes	Last Post by seqadmin 12-17-2024, 10:28 AM
New Genetic Toolkit Refines Studies on Gene Function and Disease by seqadmin Started by seqadmin, 12-13-2024, 08:24 AM	0 responses 48 views 0 likes	Last Post by seqadmin 12-13-2024, 08:24 AM
Study Links Brain Mechanism to Emotional Responses in Animals and Humans by seqadmin Started by seqadmin, 12-12-2024, 07:41 AM	0 responses 34 views 0 likes	Last Post by seqadmin 12-12-2024, 07:41 AM
Study Identifies Ribosomal RNA Fingerprints as Early Cancer Biomarkers by seqadmin Started by seqadmin, 12-11-2024, 07:45 AM	0 responses 46 views 0 likes	Last Post by seqadmin 12-11-2024, 07:45 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News