Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • In principle, you could follow the 'Tour' in the HTSeq documentation to learn how to do this yourself.

    The htseq-count script falls into two parts: the main part doing the actual counting, and surrounding it a thin wrapper to get the intervals out of the SAM file. This is all that needs to be changed, and once I separate these into two modular parts, it will be even easier. Might take a while until I find the time to so, though.

    S

    Comment


    • Hi,

      I have some reads in the original SAM file that don't seem to be accounted for in the output of htseq-count. To look in more detail I created a new SAM file using --samout. The reads aren't in this SAM file. Presumably they should be in the SAM file but assigned something like "no_feature"? Any suggestions are welcome, thanks.

      Comment


      • Ho Joro,

        did you get any warnings about reads being skipped? Otherwise, could you send me some examples of SAM lines that are in the input but not the output file, togetehr with your GTF file? Thanks.

        Simon

        Comment


        • Thanks for your reply. No I didn't get any warnings about reads being skipped. I'll email the files to you.

          Comment


          • Oops, I didn't realise that I had the quiet option on! Yes I'm now getting warnings about the chromosomes not appearing in the GFF file. I guess these reads are not included in the output SAM file? Thanks for your help Simon.

            Comment


            • I am runing DEXseq and htseq-count and have some problems. Maybe others are interesting to this or have the answer, so I post to here again.

              When I run dexseq_count.py in DEXseq on my sam file from bwa, it always has this warning:

              ~/python/lib/python2.7/site-packages/HTSeq-0.5.3p1-py2.7-linux-x86_64.egg/HTSeq/__init__.py:543:
              UserWarning: Malformed SAM line: RNAME != '*' although flag bit
              &0x0004 set
              algnt = SAM_Alignment.from_SAM_line( line )
              ~/python/lib/python2.7/site-packages/HTSeq-0.5.3p1-py2.7-linux-x86_64.egg/HTSeq/__init__.py:592:
              UserWarning: Read GA2-EAS337_0017_FC:1:8:11202:16260#0 claims to have
              an aligned mate which could not be found. (Is the SAM file properly
              sorted?)
              "which could not be found. (Is the SAM file properly sorted?)" )

              but I can found GA2-EAS337_0017_FC:1:8:11202:16260#0 mate in the data,
              as show in the attachments a.sam. And the a.sam is sorted by the name
              (default by bwa).

              In htseq-count, it also have this problem.

              htseq-count a.sam Homo_sapiens.GRCh37.63/Homo_sapiens.GRCh37.63.gtf
              100000 GFF lines processed.
              200000 GFF lines processed.
              ...
              2000000 GFF lines processed.
              2064769 GFF lines processed.
              Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
              Warning: Read GA2-EAS337_0017_FC:1:8:11202:16260#0 claims to have an
              aligned mate which could not be found. (Is the SAM file properly
              sorted?)
              4 sam line pairs processed.

              Why here report 4 line pairs? In fact, there are 3 sam line pairs in a.sam.

              BTW: Do this kind of count using BWA or tophat
              mapping, which one is better? (Illumina pair-end human RNA-seq data)

              Thank you.
              Attached Files

              Comment


              • Install instructions for CentOS (RHEL)

                Hi,

                for what it's worth I detailed the additional steps necessary to install HTSeq on CentOS (should work on Red Head Enterprise as well) in RNA-Seq howto wiki. (Or rather in the first incomplete alpha version)




                Best Wishes,
                Björn

                Comment


                • alignment_not_unique

                  Hi HTSeq users,

                  I realized that the htseq-count script is very conservative when counting reads matching to known gene models:

                  I generated an alignment of my RNA-seq reads to the Drosophila genome using tophat in combination with known Ensembl gene models (provided by gtf file). The output in sam format looks like this:

                  Code:
                  [tobiasko@bs-dsvr09 tophat_out]$ head accepted_hits.sam
                  -1350550920:7:9:328:542 0       chr2L   8880    3       35M     *       0       0       CTGGACTCGCAAAGTGGACTTGTTCAGCAAGGACA     6IC@/IIIIIFABI2II1III3'E2>@I:,&'(31     NM:i:0  NH:i:2  CC:Z:chrX       CP:i:19097273   XS:A:-
                  -1350550920:7:77:1586:563       0       chr2L   8883    3       35M     *       0       0       GACTCGCAAAGTGGACTTGTTCAGCAAGGACATAA     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII     NM:i:0  NH:i:2  CC:Z:chrX       CP:i:19097270   XS:A:-
                  -1350550920:7:24:1131:1918      0       chr2L   9327    3       35M     *       0       0       TTTTACAAATCTTACGTAAACACTCCAAGCATGAA     IIIIIIIIIIIIIIII?III=ICEI;EI:IIDII;     NM:i:0  NH:i:2  CC:Z:chrX       CP:i:19096819   XS:A:-
                  As indicated by the optional NH flag some of the reads are mapped to multiple locations (e.g. first line -> NH:i:2). My problem is, that htseq-counts ignores all reads/alignments that have been mapped to multiple locations by default, which in my case leads to a massive loss of coverage (>50%).

                  I wonder, if there is a way to either tune tophat to only report the best alignment per read like one can do when using bowtie with -k1 --best, or to modify the couting scheme of htseq-count to be less stringent (e.g. count every read once at the best alignment position).

                  Greetings,
                  Tobi

                  Comment


                  • Hi Tobi, could you write a script to change the NH flag to NH:i:1 in the sam file so that htseq-count doesn't ignore them?

                    Comment


                    • That is easily doable - but this would imply that reads are counted several time as part of different alignments. To my knowledege such a way of counting is not compatible with the statistical model used by DESeq and this is actually what I would like to use the counts for.

                      Comment


                      • For an explanation why htseq-count skips multiply matched reads, see post #4 of this thread: http://seqanswers.com/forums/showthread.php?t=9129

                        Now, if you have two mappings for a read and one has high quality and the other does not, it would indeed make sense to use the good one and discard the bad one rather than skipping both. Hence, you could sort you SAM file by read ID, so that multiple mappings are in adjacent lines and the write a script to filter the best one. Here is a quick attempt to do this with HTSeq:

                        Code:
                        import sys, re
                        import HTSeq
                        
                        insam = HTSeq.SAM_Reader( sys.stdin )
                        
                        # Go through all reads, with their alignments bundled up:
                        for bundle in HTSeq.bundle_multiple_alignments( insam ):
                           bestAlmt = None
                           # Go through all alignments of a given read, looking
                           # for the one with the best alignment score
                           for almt in bundle:
                              if bestAlmt is None:
                                 bestAlmt = almt
                              elif almt.aQual > bestAlmt.aQual:
                                 bestAlmt = almt
                              elif almt.aQual == bestAlmt:
                                 # If there are more than one best alignment, 
                                 # better skip the read
                                 bestAlmt = None
                           if bestAlmt is not None:
                              # Change the NH field to 1 and print the line
                              print re.sub( "NH:i:\d+", "NH:i:1", bestAlmt.original_sam_line ),
                        Call this with something like
                        Code:
                        sort samfile.sam | python chooseBest.py > filtered.sam
                        .

                        Be aware hat I haven't really tested this thoroughly.

                        Comment


                        • Error while using htseq-count

                          Hi,
                          I am getting an error while using htseq-count.
                          I used the following command:
                          htseq-count CL1_forBWA_left_DSS3.sam NC_003911.gff

                          This is the error i am getting. I got all the files from ncbi.
                          Error occured in line 2169 of file NC_003911.gff.
                          Error: Feature tRNA does not contain a 'gene_id' attribute
                          [Exception type: SystemExit, raised in count.py:55]

                          I would really appreciate if anyone can help me out.

                          Thanks
                          Shalabh

                          Comment


                          • Hi Shalabh, perhaps it would help if you showed us line 2169 of the *gff file.

                            Comment


                            • Originally posted by Simon Anders View Post
                              For an explanation why htseq-count skips multiply matched reads, see post #4 of this thread: http://seqanswers.com/forums/showthread.php?t=9129

                              Now, if you have two mappings for a read and one has high quality and the other does not, it would indeed make sense to use the good one and discard the bad one rather than skipping both. Hence, you could sort you SAM file by read ID, so that multiple mappings are in adjacent lines and the write a script to filter the best one. Here is a quick attempt to do this with HTSeq:

                              Code:
                              import sys, re
                              import HTSeq
                              
                              insam = HTSeq.SAM_Reader( sys.stdin )
                              
                              # Go through all reads, with their alignments bundled up:
                              for bundle in HTSeq.bundle_multiple_alignments( insam ):
                                 bestAlmt = None
                                 # Go through all alignments of a given read, looking
                                 # for the one with the best alignment score
                                 for almt in bundle:
                                    if bestAlmt is None:
                                       bestAlmt = almt
                                    elif almt.aQual > bestAlmt.aQual:
                                       bestAlmt = almt
                                    elif almt.aQual == bestAlmt:
                                       # If there are more than one best alignment, 
                                       # better skip the read
                                       bestAlmt = None
                                 if bestAlmt is not None:
                                    # Change the NH field to 1 and print the line
                                    print re.sub( "NH:i:\d+", "NH:i:1", bestAlmt.original_sam_line ),
                              Call this with something like
                              Code:
                              sort samfile.sam | python chooseBest.py > filtered.sam
                              .

                              Be aware hat I haven't really tested this thoroughly.
                              Thanks for the code! I tested it using a tophat alignment:

                              Code:
                              [tobiasko@bs-bsse01 tophat_output_n_mode]$ ls -l
                              total 269918
                              -rw-r--r-- 1 tobiasko bsse-paro 275526474 Sep 26 10:26 accepted_hits.bam
                              -rw-r--r-- 1 tobiasko bsse-paro        52 Sep 26 10:24 deletions.bed
                              -rw-r--r-- 1 tobiasko bsse-paro        54 Sep 26 10:24 insertions.bed
                              -rw-r--r-- 1 tobiasko bsse-paro        52 Sep 26 10:24 junctions.bed
                              -rw-r--r-- 1 tobiasko bsse-paro        68 Sep 26 10:09 left_kept_reads.info
                              drwxr-xr-x 2 tobiasko bsse-paro        13 Sep 26 10:21 logs
                              [tobiasko@bs-bsse01 tophat_output_n_mode]$ samtools sort -n accepted_hits.bam accepted_hits.sortedByName
                              [bam_sort_core] merging from 2 files...
                              [tobiasko@bs-bsse01 tophat_output_n_mode]$ samtools view accepted_hits.sortedByName.bam | python /usr/local/paro/scripts/selectBest.py > accepted_hits.sortedByName.best.sam
                              I think in principle it works, but there are blank lines in the sam output. Could this be fixed quickly???

                              These are the sam file heads:

                              Code:
                              [tobiasko@bs-bsse01 tophat_output_n_mode]$ samtools view accepted_hits.sortedByName.bam | head
                              -1350550920:7:1:1:309   0       chr2R   12473665        255     35M     *       0       0       AAAGTTGTGTTTTTCGTACATTTTTCCAATTCACG     .<;:3I8=@2IFII2*G;2B?III%+,:7GI,2')     NM:i:1  NH:i:1
                              -1350550920:7:1:1:862   16      chr3L   9776491 255     35M     *       0       0       AGAAGAAGAAACCTAGTCCGCTGGAGAAAACTGGA     I,>A'DC,IB0&/.G7.(&8(6>BI7IC@B:G39I     NM:i:0  NH:i:1
                              -1350550920:7:1:1:1324  0       chr2L   9788256 255     35M     *       0       0       TTAGATAAGAGAAGAGTATTCACCTCATCGCCCGG     III6;III+<4ID%=)$@ID9I>1$88>5%667%)     NM:i:2  NH:i:1
                              -1350550920:7:1:1:1598  0       chr2L   19448548        255     35M     *       0       0       TGAGGCCGACTTCTGGCCCTTCAAAGCCTTCGCGG     ;18,;E>/?7II:I<+;:>CI<11-.43;C.(2''     NM:i:0  NH:i:1
                              -1350550920:7:1:1:1881  16      chrX    6547455 255     35M     *       0       0       TTCAAACGCTTCTGGATAGCGGTGAGCACATCCCA     +*+CE.37,;+D5C;6;E4@53IGII85II36<2>     NM:i:0  NH:i:1
                              -1350550920:7:1:2:92    0       chrX    13614346        255     35M     *       0       0       ACGGTGGCAGCTGAGTCCGTGACATCGTGTTCGCT     ?2CE16?/+>9I-4:=/7A+-4'(8(88++1$($6     NM:i:0  NH:i:1
                              -1350550920:7:1:2:195   0       chr3R   16964869        255     35M     *       0       0       GACACCACCAACGCGGGCTCCTACGTGGCCTGTTC     /7.1'36.+2I/<16:7/B.,-.+/)4*%++),*)     NM:i:0  NH:i:1
                              -1350550920:7:1:2:198   0       chrX    9515569 255     35M     *       0       0       AAGCCAAATGCACTGCCCCGTGCCCACACATGCTT     A98''G:?05+;;>/30'/?05%$--,.,0+.-,-     NM:i:0  NH:i:1
                              -1350550920:7:1:2:222   16      chr3R   10049214        255     35M     *       0       0       GAGCTCATAGAAGCGATTAAAGAAGCTATTGGGCG     )8+%6&?9C.9?5*7CI?III7EF<94:3C//6.5     NM:i:0  NH:i:1
                              -1350550920:7:1:2:237   16      chr3R   26711769        255     35M     *       0       0       GTTGGCCACCAAGACGTCGCGAAGGTGCTTTACTA     -242-(<+,:=8:6596DCFB>D:?176GI@6A:A     NM:i:0  NH:i:1
                              
                              [tobiasko@bs-bsse01 tophat_output_n_mode]$ head  accepted_hits.sortedByName.best.sam
                              -1350550920:7:1:1:309   0       chr2R   12473665        255     35M     *       0       0       AAAGTTGTGTTTTTCGTACATTTTTCCAATTCACG     .<;:3I8=@2IFII2*G;2B?III%+,:7GI,2')     NM:i:1  NH:i:1
                              
                              -1350550920:7:1:1:862   16      chr3L   9776491 255     35M     *       0       0       AGAAGAAGAAACCTAGTCCGCTGGAGAAAACTGGA     I,>A'DC,IB0&/.G7.(&8(6>BI7IC@B:G39I     NM:i:0  NH:i:1
                              
                              -1350550920:7:1:1:1324  0       chr2L   9788256 255     35M     *       0       0       TTAGATAAGAGAAGAGTATTCACCTCATCGCCCGG     III6;III+<4ID%=)$@ID9I>1$88>5%667%)     NM:i:2  NH:i:1
                              
                              -1350550920:7:1:1:1598  0       chr2L   19448548        255     35M     *       0       0       TGAGGCCGACTTCTGGCCCTTCAAAGCCTTCGCGG     ;18,;E>/?7II:I<+;:>CI<11-.43;C.(2''     NM:i:0  NH:i:1
                              
                              -1350550920:7:1:1:1881  16      chrX    6547455 255     35M     *       0       0       TTCAAACGCTTCTGGATAGCGGTGAGCACATCCCA     +*+CE.37,;+D5C;6;E4@53IGII85II36<2>     NM:i:0  NH:i:1

                              Comment


                              • Sorry, my mistake. The comma at the end of the print line escaped the c&p.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                29 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X