Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #61
    Originally posted by bbl View Post
    HTSeq seems to be the right tool I need for my work. I am trying to count how many times each base in an interesting region of exon has been covered by a read. Can 'counting' function in HTSeq do that? If so, how to do it?
    'Bbl' send me a private mail with additional information, but I answer here, in case somebody else is interested, too. He has a SAM file, and a BED file with positions of selcted exons. For these exons, he wants to know the coverage vector.

    Here my answer to Bbl:

    First of all, HTSeq is not a "program" but a framework that allows you to conveniently write your own scripts. Hence, I have to assume that you are a bioinformatician with at least a bit of experience in programming. If you are not, this might be difficult.

    If you already know some Python, read the tutorial (Tour) in the HTSeq documentation. If you don't know Python yet but do know Perl or something similar, maybe look first through the first few pages of the Python Tutorial on the Python web site. You don't need to know much for HTSeq.

    Given a SAM file (not BAM), you can calculate the coverage in a HTSeq.GenomicArray, as shown in the section "Calculating coverage vectors" of the Tour.

    These here are the lines in the Tour that you need to use and adapt:

    Code:
    import HTSeq
    
    yeast_chroms = [ "2-micron", "MT", "I", "II", "III", "IV", "V", "VI", 
       "VII", "VIII", "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI" ]
    
    alignment_file = HTSeq.SAM_Reader( "yeast_RNASeq_excerpt.sam" )
    
    cvg = HTSeq.GenomicArray( yeast_chroms, stranded=True, typecode='i' )
    
    for alngt in alignment_file:
        if alngt.aligned:
           cvg.add_value( 1, alngt.iv )
    Now, you can query the coverage at a selected interval, e.g.:
    Code:
    iv = HTSeq.GenomicInterval( "II", 120100, 120120, "+" )
    print list( cvg[iv] )
    If you do this with my example data, you will just get 20 zeroes, because no reads mapped there.

    You can now go through your BED file and do this for each interval given there. I assume that your BED file has four columns, for chromosome, start, end, and name of the exon. (If you have another format, you have to change the line below with the 'split' method.)

    Code:
    for line in open( "mybedfile.bed" ):
    
       # Split the line into the four fields:
       chrom, start, end, name = line.rstrip().split( "\t" )
    
       # Make two intervals, one for '+', one for '-':
       ivPlus  = HTSeq.GenomicInterval( chrom, start, end, "+" )
       ivMinus = HTSeq.GenomicInterval( chrom, start, end, "-" )
    
       # Write out the results:
       print "Exon", name, ":"
       print "+:", list( cvg[ivPlus] )
       print "-:", list( cvg[ivMinus] )
       print
    Is this roughly what you wanted?

    In any case, I guess your work is not done here; you might want to calculate statistics or make plots from the output. This should be easy wth Python (incl numpy/scipy and matplotlib), too, so it's worth learning it.

    Simon

    Comment


    • #62
      Hi Simon,

      Firstly thanks for this tool we're finding it very useful.

      I have a couple of questions:

      1. Is there a way to find out which version of htseq-count we're using?

      2. Should the sum of column 2 in the htseq-count output file (including no_feature and ambiguous) equal the number of non-header lines in the SAM file?

      Comment


      • #63
        Originally posted by joro View Post
        1. Is there a way to find out which version of htseq-count we're using?
        Yes. Look at the end of the output of 'htseq-count -h'.

        2. Should the sum of column 2 in the htseq-count output file (including no_feature and ambiguous) equal the number of non-header lines in the SAM file?
        Yes, but only if you use the latest version, which offers you two further counters, namely "unmapped" (for reads which are listed without alignment in the SAM file) and "lowqual" (for reads below the alignment quality threshold, in case you set one).

        Simon
        Last edited by Simon Anders; 10-08-2010, 03:31 AM. Reason: sp

        Comment


        • #64
          Thanks for your quick reply.

          I must be missing something because this is the output of 'htseq-count -h':

          Code:
          $ htseq-count -h
          Usage: htseq-count [options] sam_file gff_file
          
          This script takes an alignment file in SAM format and a feature file in GFF
          format and calculates for each feature the number of reads mapping to it. See
          http://www-huber.embl.de/users/anders/HTSeq/doc/count.html for details.
          
          Options:
            -h, --help            show this help message and exit
            -m MODE, --mode=MODE  mode to handle reads overlapping more than one
                                  feature(choices: union, intersection-strict,
                                  intersection-nonempty; default: union)
            -t FEATURETYPE, --type=FEATURETYPE
                                  feature type (3rd column in GFF file) to be used, all
                                  features of other type are ignored (default, suitable
                                  for Ensembl GTF files: exon)
            -i IDATTR, --idattr=IDATTR
                                  GFF attribute to be used as feature ID (default,
                                  suitable for Ensembl GTF files: gene_id)
            -s STRANDED, --stranded=STRANDED
                                  whether the data is from a strand-specific assay
                                  (default: yes)
            -q, --quiet           suppress progress report
          
          Written by Simon Anders ([email protected]), European Molecular Biology
          Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
          Public License v3. Part of the 'HTSeq' framework.
          What is the best way to get the latest version?

          Comment


          • #65
            Hi joro,

            this must be quite an old version. I've added the version information to the last sentence of the output of 'htseq-count -h' a few months ago.

            Plese follow these instruction to get the newest version: http://www-huber.embl.de/users/ander...c/install.html

            Avoid using easy_install. It tends to get outdated versions.

            Simon

            Comment


            • #66
              Thanks, I've got the latest version now and 'htseq-count -h' provides the version information.

              I wonder if you could help me with an error I'm now seeing?
              Code:
              Error: ('Inconsistent CIGAR operation.', 'line 15641 of file output.sam')
              The CIGAR string for this line is
              Code:
              32M113N18M
              Can you see why this is causing a problem?

              Comment


              • #67
                Yes, that's a bug I introduced in v0.4.5 and fixed, half an hour ago, in v0.4.5p1.

                Comment


                • #68
                  It's running without errors now thanks. However, the sum of column 2 in the htseq-count output file doesn't equal the number of non-header lines in the SAM file. The difference is around 40,000. Do you know what is causing this difference?

                  Comment


                  • #69
                    Hi Simon,

                    In htseq-count, if a read maps to an exon shared by two transcripts of the same gene, what will be the count for the gene. For example Exon2 in ENSECAT00000020941 and Exon1 in ENSECAT00000028962 (see below)?

                    Code:
                    15    protein_coding    exon    76927858    76928059    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000020941"; exon_number "1"; gene_name "LOC100056640"; transcript_name "LOC100056640";
                    15    protein_coding    exon    76913463    76915467    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000020941"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640";
                    15    protein_coding    CDS    76913932    76914786    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000020941"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640"; protein_id "ENSECAP00000017218";
                    15    protein_coding    start_codon    76914784    76914786    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000020941"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640";
                    15    protein_coding    stop_codon    76913929    76913931    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000020941"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640";
                    15    protein_coding    exon    76915073    76915408    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "1"; gene_name "LOC100056640"; transcript_name "LOC100056640";
                    15    protein_coding    CDS    76915073    76915408    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "1"; gene_name "LOC100056640"; transcript_name "LOC100056640"; protein_id "ENSECAP00000023033";
                    15    protein_coding    start_codon    76915406    76915408    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "1"; gene_name "LOC100056640"; transcript_name "LOC100056640";
                    15    protein_coding    exon    76915024    76915071    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640";
                    15    protein_coding    CDS    76915024    76915071    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "2"; gene_name "LOC100056640"; transcript_name "LOC100056640"; protein_id "ENSECAP00000023033";
                    15    protein_coding    exon    76915011    76915022    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "3"; gene_name "LOC100056640"; transcript_name "LOC100056640";
                    15    protein_coding    CDS    76915011    76915022    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "3"; gene_name "LOC100056640"; transcript_name "LOC100056640"; protein_id "ENSECAP00000023033";
                    15    protein_coding    exon    76914863    76915009    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "4"; gene_name "LOC100056640"; transcript_name "LOC100056640";
                    15    protein_coding    CDS    76914863    76915009    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "4"; gene_name "LOC100056640"; transcript_name "LOC100056640"; protein_id "ENSECAP00000023033";
                    15    protein_coding    exon    76913932    76914861    .    -    .     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "5"; gene_name "LOC100056640"; transcript_name "LOC100056640";
                    15    protein_coding    CDS    76913932    76914861    .    -    0     gene_id "ENSECAG00000019779"; transcript_id "ENSECAT00000028962"; exon_number "5"; gene_name "LOC100056640"; transcript_name "LOC100056640";

                    Comment


                    • #70
                      Originally posted by joro View Post
                      In htseq-count, if a read maps to an exon shared by two transcripts of the same gene, what will be the count for the gene.
                      One. If a read overlaps with two exons of the same gene, it will be counted once for this gene. If the read overlaps with two exons of two different genes, it will not be counted for either gene but as "ambiguous".

                      Two exons are deemed to belong to the same gene, if their lines in the GFF file have the same value for the "gene_id" attribute. The purpose of the '--idattr' command line option is to tell htseq-count to look at another attribute than 'gene_id' for this purpose. The transcript ID is hence ignored unless you tell htseq-count otherwise.

                      Simon
                      Last edited by Simon Anders; 10-14-2010, 03:53 AM. Reason: spelling

                      Comment


                      • #71
                        Thanks, that makes sense.

                        What about if you used option '--idattr' to look at transcript_id in the example I gave before? i.e. if the read overlaps with two exons of two different transcripts, would it be "ambiguous"?

                        Comment


                        • #72
                          Originally posted by joro View Post
                          What about if you used option '--idattr' to look at transcript_id in the example I gave before? i.e. if the read overlaps with two exons of two different transcripts, would it be "ambiguous"?
                          I would even be ambiguous if the read overlaps with one exon that appear in two different transcripts and hence appears in two different lines of the GFF files with two transcript IDs.

                          This means that htseq-count is suitable for gene-level counting only. However, the point of HTSeq is, after all, that you cane easily write your own scripts with a bit of Python knowledge, so if you need to modify the counting logic, you can do so. (Use the Tour in the documentation as starting point.)

                          Simon

                          Comment


                          • #73
                            Ok thanks Simon. I'll have a go at changing the counting logic if I need to at any point! I think I only need gene-level counting for now.

                            Comment


                            • #74
                              Originally posted by joro View Post
                              It's running without errors now thanks. However, the sum of column 2 in the htseq-count output file doesn't equal the number of non-header lines in the SAM file. The difference is around 40,000. Do you know what is causing this difference?
                              Thanks for offering to look through my input files but I've resolved this now. There were reads mapping to sequences that weren't defined in the gtf. My mistake!

                              Comment


                              • #75
                                Hi,

                                I have seen a post with the same issue before, but I didn't see any replay.. I am working with SOLiD data, aligned with SHRiMP to the reference mRNA library.
                                I am getting this error message with htseq-count:

                                Error occured in line 16241 of file F_mrna.sam.
                                Error: ("'seq' and 'qualstr' do not have the same length.", 'line 16241 of file F_mrna.sam')
                                [Exception type: ValueError, raised in _HTSeq.pyx:626]

                                The error applies to the whole SAM file (every line will give this error).
                                The example line from SAM file:

                                2_512_865_F3 16 Esi0595_0002 conserved unknown protein [1335] f:2354-3688 613 255 3H47M * 0 0 * AS:i:347

                                Do you have any solutions?

                                Thanks.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                30 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                53 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X