Seqanswers Leaderboard Ad

**dariober** · 06-21-2013, 11:39 PM

Originally posted by Heisman View Post

Hey all,

Basically I want to be able to for any given base, determine how many uniqe reads cover it, how many reads appearing in duplicate cover it, how many reads appearing in triplicate cover it, etc.

IE, for a base at say chr1, position 1000, if there are 10 100bp reads covering it starting at:

1. chr1:920
2. chr1:935
3. chr1:950
4. chr1:950
5. chr1:950
6. chr1:978
7. chr1:980
8. chr1:989
9. chr1:989
10. chr1:996

I want to know that the base 1000 is covered by reads starting at 5 sites uniquely (1,2,6,7,10 above), 1 site in duplicate (8,9 above), and 1 site in triplicate (3,4,5 above). I only want to consider one strand at a time. Thus, the output could be like:

Chr,Base,Start_sites_covering_1x,Start_sites_covering_2x,Start_sites_covering_3x,etc
1,1000,5,1,1

Does anybody have any insight on an efficient way to do this? Are there any tools designed to do this, and if not, any thoughts on the easiest tools to use to write a program to pull this off?

I would extremely appreciate any help.

Hi-

Just a thought... If produce the pileup http://samtools.sourceforge.net/pileup.shtml of your alignment, then at each position you can count how many reads start there by counting the occurrences of the ^ sign in column 5. (^ marks the start of a read and it is followed by the quality score).

Then you could summarize the counts by region in the way you show by using bedtools groupby

You would have to stitch the whole thing together with some python/perl/awk/whatever but it should be quite quick and easy.

Does it get you any closer to what you need?

Dario

**Heisman** · 06-22-2013, 09:03 AM

Originally posted by dariober View Post

Hi-

Just a thought... If produce the pileup http://samtools.sourceforge.net/pileup.shtml of your alignment, then at each position you can count how many reads start there by counting the occurrences of the ^ sign in column 5. (^ marks the start of a read and it is followed by the quality score).

Then you could summarize the counts by region in the way you show by using bedtools groupby

You would have to stitch the whole thing together with some python/perl/awk/whatever but it should be quite quick and easy.

Does it get you any closer to what you need?

Dario

This is very promising! I didn't realize I could utilize a pileup for this purpose. I can basically create a pileup from a .bam file, create a new column with the counts of '^', then extract regions I'm interested for any given base (ex: base n-99...n for 100bp reads), and then just actually use uniq -c based on the count column and get the output that I need. I don't think I'll need to use bedtools groupby but I wasn't even aware of that until this post and it looks quite useful for other purposes, so thanks for that as well.

One question, can you tell from the pileup with '^' if the read is on the forward or reverse strand? If not, I can easily just separate forward/reverse strand reads initially from the .bam file and then do the analysis.

**dariober** · 06-22-2013, 10:19 AM

Originally posted by Heisman View Post

This is very promising! I didn't realize I could utilize a pileup for this purpose. I can basically create a pileup from a .bam file, create a new column with the counts of '^', then extract regions I'm interested for any given base (ex: base n-99...n for 100bp reads)

Just take care that the ^ is followed by the mapping quality which might be (not sure) a ^ sign itself. Cases where you have ^^ should then be counted as one.

Also note that you can pass a bed file of regions of interest to mpileup via the -l option. This way you extract your regions while creating the pileup instead of afterwards.

One question, can you tell from the pileup with '^' if the read is on the forward or reverse strand? If not, I can easily just separate forward/reverse strand reads initially from the .bam file and then do the analysis.

I guess you could use the information in the pileup to tell which strand the read comes from. But I suspect it's going to be easier to just separate them
at the beginning from the bam file.

**Heisman** · 06-22-2013, 12:08 PM

Originally posted by dariober View Post

Just take care that the ^ is followed by the mapping quality which might be (not sure) a ^ sign itself. Cases where you have ^^ should then be counted as one.

Also note that you can pass a bed file of regions of interest to mpileup via the -l option. This way you extract your regions while creating the pileup instead of afterwards.

I guess you could use the information in the pileup to tell which strand the read comes from. But I suspect it's going to be easier to just separate them
at the beginning from the bam file.

Excellent, thank you.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 29 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Best way to determine distribution of read start sites covering a base?

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News