**Richard Finney** · 01-30-2013, 09:10 AM

Subtract the 'I' (insertion) lengths?

See dpryan's post (below).

**Jerry_Zhao** · 01-30-2013, 10:46 AM

If I just want to know the end location of the read on the reference, do I need to subtract the "I"?

**syfo** · 01-31-2013, 03:15 AM

Originally posted by Jerry_Zhao View Post

If I just want to know the end location of the read on the reference, do I need to subtract the "I"?

From the starting position I would:
- either add the lengths of the Ms, Ns and Ds
- or add the length of the read sequence and subtract the lengths of the Is
Shouldn't these give the same result?

As for your counting issue, you can also consider BEDtools.

**Jerry_Zhao** · 01-31-2013, 10:24 AM

Yes, I am planning to use START+Ms+Ds+Ns, but I am not sure whether this is correct.

START + length - Is, however, is definitely wrong for RNA-seq SAM files.
Introns are reported as Ns, not Is.
The first version of my script used this method, and I now know it is not correct.

Does anyone know how HTSeq-count deals with this?

**dpryan** · 01-31-2013, 01:22 PM

It's probably simplest to just have a peek at how it's done in samtools (the bam_calend function in the api):

Code:

uint32_t bam_calend(const bam1_core_t *c, const uint32_t *cigar)
{
        uint32_t k, end;
        end = c->pos;
        for (k = 0; k < c->n_cigar; ++k) {
                int op = cigar[k] & BAM_CIGAR_MASK;
                if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                        end += cigar[k] >> BAM_CIGAR_SHIFT;
        }
        return end;
}

So, set the end equal to the beginning. Then, for each M/D/N in the cigar string, increment the end position by the associated count. Keep in mind that if you encounter a = or X, then things will get thrown off. Of course, I've never actually seen one of those in practice. I should note that you should not subtract I's, as those increment the position in the read, rather than the genome.
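For anyone working from the text CIGAR of a SAM line rather than the BAM API, the same idea can be sketched as below. This is only an illustration: `sam_ref_end` is a hypothetical helper, not a samtools function. It assumes a 1-based POS as in SAM, returns a 1-based inclusive end, and also counts = and X, which consume reference bases just like M.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of samtools): 1-based inclusive end
 * position on the reference for a SAM record, given its 1-based POS
 * and its text CIGAR. Returns 0 if the CIGAR consumes no reference
 * bases (for example "*"). */
uint32_t sam_ref_end(uint32_t pos, const char *cigar)
{
    assert(cigar != NULL);
    uint32_t ref_len = 0, n = 0;
    for (const char *p = cigar; *p != '\0'; ++p) {
        if (isdigit((unsigned char)*p)) {
            /* accumulate the operation length */
            n = n * 10 + (uint32_t)(*p - '0');
            continue;
        }
        switch (*p) {
        case 'M': case 'D': case 'N': case '=': case 'X':
            ref_len += n;   /* these operations advance along the reference */
            break;
        default:            /* I, S, H, P do not advance along the reference */
            break;
        }
        n = 0;
    }
    return ref_len ? pos + ref_len - 1 : 0;
}
```

With POS 100, the CIGAR 20M1000N30M gives an end of 1149, while 10M2I38M gives 147: the insertion is simply ignored, not subtracted.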

BTW, your original observation could be caused by multi-mapped reads, which wouldn't be included in the count.

**Jerry_Zhao** · 02-01-2013, 08:38 AM

Hi dpryan,
Thanks for your kind suggestions. However, I am no expert in samtools.
For a single-end RNA-seq SAM file, I am planning to use a simple Perl script like the one below:

$region_length=0;
while ( $cigar =~ /(\d+)[MDN]/g ) { $region_length+=$1; }
$end=$start + $region_length -1;

Is it the same for a paired-end SAM file?

Best,
Jerry

**dpryan** · 02-01-2013, 09:42 AM

Originally posted by Jerry_Zhao View Post

$region_length=0;
while ( $cigar =~ /(\d+)[MDN]/g ) { $region_length+=$1; }
$end=$start + $region_length -1;

Is it the same for a paired-end SAM file?

For each individual read of a pair, yes. Of course the two reads are separated by some distance, so you just need to take that into account. The bounds of the fragment run from the smaller of the two start positions to the greater of the two end positions (trimming and such can result in one read being entirely contained within the other).
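A minimal sketch of that min/max rule, assuming both mates map to the same chromosome and that each mate's start and end have already been computed from its own CIGAR. The `interval_t` type and the `fragment_bounds` name are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: inclusive reference interval of one aligned read. */
typedef struct {
    uint32_t start;
    uint32_t end;
} interval_t;

/* Outer bounds of a paired-end fragment: the smaller of the two starts
 * to the greater of the two ends. Taking min and max separately also
 * covers the case where one mate lies entirely inside the other. */
interval_t fragment_bounds(interval_t mate1, interval_t mate2)
{
    assert(mate1.start <= mate1.end && mate2.start <= mate2.end);
    interval_t f;
    f.start = mate1.start < mate2.start ? mate1.start : mate2.start;
    f.end   = mate1.end   > mate2.end   ? mate1.end   : mate2.end;
    return f;
}
```

Mates at 100-149 and 300-349 give bounds of 100-349; mates at 100-199 and 120-150 give 100-199.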

FYI, from what you've written, I foresee you writing a program doing something like:
1) Compute the end bounds of a fragment.
2) If the end bounds are within or span an exon (or exons) of a gene, then count it as mapping there.

Doing so would count transcripts that the fragment merely spans, but to which the reads themselves don't actually map. The same issue arises if you only increment a gene/transcript counter when a read maps uniquely to it and there are genes that overlap only at a distance.
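One way around the spanning problem is to test each aligned block of the read against the feature instead of the outer bounds. The sketch below is only an illustration under stated assumptions: `read_covers_feature` is a hypothetical name, coordinates are 1-based and inclusive, and deletions are treated as covered reference, which is a design choice and not necessarily what HTSeq-count does.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if any aligned block of the read
 * (M, =, X, and here also D) overlaps the inclusive feature interval
 * [feat_start, feat_end], and 0 otherwise. Skipped regions (N) move
 * the reference position forward but never count as overlap. */
int read_covers_feature(uint32_t pos, const char *cigar,
                        uint32_t feat_start, uint32_t feat_end)
{
    assert(cigar != NULL);
    uint32_t ref = pos, n = 0;
    for (const char *p = cigar; *p != '\0'; ++p) {
        if (isdigit((unsigned char)*p)) {
            n = n * 10 + (uint32_t)(*p - '0');
            continue;
        }
        switch (*p) {
        case 'M': case '=': case 'X': case 'D':
            /* this block covers ref .. ref + n - 1 on the reference */
            if (n > 0 && ref <= feat_end && ref + n - 1 >= feat_start)
                return 1;
            ref += n;
            break;
        case 'N':
            ref += n;       /* intron: advance without testing overlap */
            break;
        default:            /* I, S, H, P do not touch the reference */
            break;
        }
        n = 0;
    }
    return 0;
}
```

For a read at position 100 with CIGAR 20M1000N30M, a feature at 500-600 lies inside the skipped region and is not counted, while features at 110-130 or 1140-1200 are.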

**Jerry_Zhao** · 02-01-2013, 12:01 PM

Yes, you are right.

I am a user of HTSeq-count, and I really like it.
However, I am trying to write a simple Perl script myself so that I can compare my counts with the HTSeq-count results.

I have extracted the CDS regions for each gene from the Ensembl GTF file.

Based on the strand, left location, and right location of each read, I will count the number of reads for each gene.

Basically, the program I am writing is meant to be identical to the union mode of HTSeq-count with the "-t CDS" parameter.

For overlapping genes, because the data set I am working on is strand-specific, I think I can tell where the reads come from if the overlapping genes are on different strands of the genome.
However, if the gene annotations are not correct and the CDS regions of two genes really do overlap, I will not count reads within the overlapping region.

Gene reads count from RNA-seq sam file by CIGAR