Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Gene reads count from RNA-seq sam file by CIGAR

    Hi all,

    Previously, I am using HTSeq-count to get gene expression levels for my RNA-seq data. It works perfectly for me.

    Nevertheless, in several cases, I noticed that some genes have 0 read count, but in IGV they have lots of reads mapped. I known this could be due to the mis-annotation of the GTF files.

    I am planing to write my own perl script to count gene expression levels from sam file, and to compare the results between my-count and HTSeq-count.

    However, I have a problem to define the end mapping position of the read in sam file.
    This information is in the CIGAR column, right?
    Therefore, the end = Start + (no. of M) + (no. of D) + (no. of N). Am I right?

    Many thanks.

    Best,
    Jerry

  • #2
    subtract 'I' insert.

    ?

    See dpryan post ( below ).
    Last edited by Richard Finney; 01-31-2013, 01:25 PM.

    Comment


    • #3
      If I just want to know the end location of the read on the reference, do I need to subtract the "I"?

      Originally posted by Richard Finney View Post
      subtract 'I' insert.

      Comment


      • #4
        Originally posted by Jerry_Zhao View Post
        If I just want to know the end location of the read on the reference, do I need to subtract the "I"?
        From the starting position I would:
        - either add the number of Ms, Ns and Ds
        - or add the length of the read sequence and subtract the number of Is
        Shouldn't this give the same?

        As for your counting issue, you can also consider BEDtools.

        Comment


        • #5
          Yes, I am planning to use START+Ms+Ds+Ns, but I am not sure whether this is correct.

          But for START + length - Is, it is definitely wrong for RNA-seq sam files.
          Introns are considered as Ns, not Is.
          My first version of scripts are using this method, and I now know it's not correct.


          Does anyone know how HTSeq-count deal with this?





          Originally posted by syfo View Post
          From the starting position I would:
          - either add the number of Ms, Ns and Ds
          - or add the length of the read sequence and subtract the number of Is
          Shouldn't this give the same?

          As for your counting issue, you can also consider BEDtools.

          Comment


          • #6
            It's probably simplest to just have a peek at how it's done in samtools (the bam_calend function in the api):
            Code:
            uint32_t bam_calend(const bam1_core_t *c, const uint32_t *cigar)
            {
                    uint32_t k, end;
                    end = c->pos;
                    for (k = 0; k < c->n_cigar; ++k) {
                            int op = cigar[k] & BAM_CIGAR_MASK;
                            if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                                    end += cigar[k] >> BAM_CIGAR_SHIFT;
                    }
                    return end;
            }
            So, set the end equal to the beginning. Then, for each M/D/N in the cigar string, increment the end position by the associated count. Keep in mind that if you encounter a = or X, then things will get thrown off. Of course, I've never actually seen one of those in practice. I should note that you should not subtract I's, as those increment the position in the read, rather than the genome.

            BTW, your original observation could be caused by multi-mapped reads, which wouldn't be included in the count.
            Last edited by dpryan; 01-31-2013, 01:25 PM. Reason: Addendum

            Comment


            • #7
              Hi dpryan,
              Thanks for your kind suggestions. Nevertheless, I am not an expert of samtools.
              For Single end RNA-seq sam, I am planning use a simple Perl script like bellow:

              $region_length=0;
              while ( $cigar =~ /(\d+)[M|D|N]/g ) { $region_length+=$1; }
              $end=$start + $region_length -1;


              For paired-end sam file, is it the same?

              Best,
              Jerry


              Originally posted by dpryan View Post
              It's probably simplest to just have a peek at how it's done in samtools (the bam_calend function in the api):
              Code:
              uint32_t bam_calend(const bam1_core_t *c, const uint32_t *cigar)
              {
                      uint32_t k, end;
                      end = c->pos;
                      for (k = 0; k < c->n_cigar; ++k) {
                              int op = cigar[k] & BAM_CIGAR_MASK;
                              if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                                      end += cigar[k] >> BAM_CIGAR_SHIFT;
                      }
                      return end;
              }
              So, set the end equal to the beginning. Then, for each M/D/N in the cigar string, increment the end position by the associated count. Keep in mind that if you encounter a = or X, then things will get thrown off. Of course, I've never actually seen one of those in practice. I should note that you should not subtract I's, as those increment the position in the read, rather than the genome.

              BTW, your original observation could be caused by multi-mapped reads, which wouldn't be included in the count.
              Last edited by Jerry_Zhao; 02-01-2013, 08:40 AM.

              Comment


              • #8
                Originally posted by Jerry_Zhao View Post
                $region_length=0;
                while ( $cigar =~ /(\d+)[M|D|N]/g ) { $region_length+=$1; }
                $end=$start + $region_length -1;


                For paired-end sam file, is it the same?
                For a single read of a set, yes. Of course the reads are separated by a distance, so you just need to take that into account. The bounds would be the smaller of the start positions to the greater of the end positions (trimming and such can result in one read being entirely contained within another).

                FYI, from what you've written, I foresee you writing a program doing something like:
                1) Compute the end bounds of a fragment.
                2) If the end bounds are within or span an exon (or exons) of a gene, then count it as mapping there.

                Doing such would result in counting transcripts which are spanned over, but to which the reads themselves don't actually map. The same issue would apply when you only increment a gene/transcript counter when a read uniquely maps to it and there are remotely overlapping genes.

                Comment


                • #9
                  Yes, you are right.

                  I am a user of HTSeq-count, and I really like it.
                  However, I just try to write a simple perl script by myself, and to compare the result between my count and the HTSeq-count result.

                  I have gotten the CDS regions for each gene from the Ensembl GTF file.

                  Based on the information of strand, left location, and right location for each read, I will count the number of reads for each gene.

                  Basically, the program I am writing is identical to union mode of HTSeq-count using "-t CDS" parameter.

                  For overlapping genes, because the data set I am working on is strand-specific, I think I can tell where the reads are from if the overlapping genes are on different strand of the genome.
                  Nevertheless, if the gene annotations are not correct and the CDS regions of two gene really overlap, I will not count reads within this overlapping region.




                  Originally posted by dpryan View Post
                  For a single read of a set, yes. Of course the reads are separated by a distance, so you just need to take that into account. The bounds would be the smaller of the start positions to the greater of the end positions (trimming and such can result in one read being entirely contained within another).

                  FYI, from what you've written, I foresee you writing a program doing something like:
                  1) Compute the end bounds of a fragment.
                  2) If the end bounds are within or span an exon (or exons) of a gene, then count it as mapping there.

                  Doing such would result in counting transcripts which are spanned over, but to which the reads themselves don't actually map. The same issue would apply when you only increment a gene/transcript counter when a read uniquely maps to it and there are remotely overlapping genes.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  25 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  29 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  25 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  52 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X