SEQanswers

06-11-2012, 10:48 AM   #1
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Very basic number question..

If I have two 100 bp paired-end read files, each containing x reads (2x reads in total), is the total number of base pairs represented by the two files together 2x or x?
shyam_la is offline   Reply With Quote
Old 06-11-2012, 12:03 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Technically they are "x" fragments represented by 2x reads.
If the reads from the two ends do not overlap then they are all unique bases.
06-11-2012, 02:12 PM   #3
Heisman
Senior Member

Location: St. Louis

Join Date: Dec 2010
Posts: 535

Quote:
Originally Posted by GenoMax
Technically they are "x" fragments represented by 2x reads.
If the reads from the two ends do not overlap then they are all unique bases.

The overlap idea is key: if all the bases are unique, it is 2x. If there is some overlap, it is between 1x and 2x.
06-11-2012, 03:31 PM   #4
shyam_la
Member

Location: California

Join Date: Mar 2012
Posts: 97

What if there are gaps (the opposite of overlaps)? That's possible, right? In that case, would the number of bases be more than 2x?
06-11-2012, 03:34 PM   #5
Heisman
Senior Member

Location: St. Louis

Join Date: Dec 2010
Posts: 535

Quote:
Originally Posted by shyam_la
What if there are gaps (the opposite of overlaps)? That's possible, right? In that case, would the number of bases be more than 2x?

No, it would not be more than 2x. You don't sequence the gap between the two reads of a pair, so it doesn't count toward the bases that pair covers.
06-11-2012, 03:54 PM   #6
shyam_la
Member

Location: California

Join Date: Mar 2012
Posts: 97

???

We were talking about the gap/overlap between the two members of one pair, right? The gap is going to be covered by some other pair of reads, won't it?

For example, if ABCDEFGHIJKLMNOPQRSTUVWXYZ were my target region and it was sliced up and made into a library, the fragments would be something like ABCDE, BCDEF, CDEFGHI, DEFGHI and so on, of varying size. The library is then sequenced and gives me paired-end reads 3 units long. ABCDE would give me ABC and EDC (with an overlap), but CDEFGHI would give me CDE and IHG (with a gap).

Sorry I am new to NGS and want to get the bare basics correct.

Is my concept correct?

Last edited by shyam_la; 06-11-2012 at 06:48 PM.
06-11-2012, 03:57 PM   #7
Heisman
Senior Member

Location: St. Louis

Join Date: Dec 2010
Posts: 535

Quote:
Originally Posted by shyam_la
???

We were talking about the gap/overlap between the two members of one pair, right? The gap is going to be covered by some other pair of reads, won't it?

For example, if ABCDEFGHIJKLMNOPQRSTUVWXYZ were my target region and it was sliced up and made into a library, the fragments would be something like ABCDE, BCDEF, CDEFGHI, DEFGHI and so on, of varying size. The library is then sequenced and gives me paired-end reads 3 units long. ABCDE would give me ABC and EDC (with an overlap), but CDEFGHI would give me CDE and IHG (with a gap).

Sorry I am new to NGS and want to get the bare basics correct.

Is my concept correct?

You are correct. Going along with that example, ABCDE would give you 5 base pairs' worth of coverage (ABC and EDC share the C), and CDEFGHI would give you 6 base pairs' worth. It doesn't matter how long the gap is: you still sequence the same total number of bases, so the pair contributes the same number of bases to your overall coverage.
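
A minimal sketch of the lettered example above, assuming a toy model in which each fragment is a plain string and a paired-end run reads a fixed number of letters inward from each end. The function names `paired_end_reads` and `unique_bases` are made up for illustration and do not come from any sequencing toolkit.

```python
# Toy model of the lettered example: each fragment is a string, and a
# paired-end run reads `read_len` letters inward from each end.
# Function names are illustrative only.

def paired_end_reads(fragment, read_len):
    """Return (forward, reverse) reads for one fragment.

    The reverse read is spelled from the far end inward, as in the post
    (toy letters, so no complementing is done).
    """
    forward = fragment[:read_len]
    reverse = fragment[-read_len:][::-1]
    return forward, reverse


def unique_bases(fragment, read_len):
    """Count distinct fragment positions covered by at least one read."""
    n = len(fragment)
    covered = set(range(min(read_len, n)))          # forward read positions
    covered |= set(range(max(0, n - read_len), n))  # reverse read positions
    return len(covered)


for frag in ("ABCDE", "CDEFGHI"):
    fwd, rev = paired_end_reads(frag, 3)
    print(frag, fwd, rev, unique_bases(frag, 3))
```

Both fragments yield 6 sequenced letters; only the number of distinct positions differs (5 with the overlap, 6 with the gap).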
06-11-2012, 06:49 PM   #8
shyam_la
Member

Location: California

Join Date: Mar 2012
Posts: 97

I am still not clear.. But thanks anyway.
06-11-2012, 06:53 PM   #9
Heisman
Senior Member

Location: St. Louis

Join Date: Dec 2010
Posts: 535

If you have two 100 bp reads, the total number of bases you get is 200. Period, end of story. The question is how many of those 200 bases give you new information. If you are only sequencing a 100 bp insert, then both reads of the pair will sequence the same bases. So you get 200 bases, but 100 of them are redundant (assuming there are no errors in the reads). Hence the outcome is only 100 distinct bases that count for your coverage.

If you are sequencing a 300 bp insert, you get 200 bases of information. Here all 200 bases are unique, because 100 unsequenced bases lie between the two reads. So you get 200 bases that count for your coverage.
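
The two insert-size cases above can be written as a small calculation. This is only a sketch under the assumption that both reads have the same length and start at opposite ends of the insert; `pair_summary` is a hypothetical helper name.

```python
def pair_summary(insert_len, read_len=100):
    """Bases sequenced, distinct, and redundant for one read pair.

    Assumes both reads have the same length and start at opposite ends
    of the insert (a simplification for illustration).
    """
    if insert_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    per_read = min(read_len, insert_len)   # a read cannot run past the insert
    sequenced = 2 * per_read               # bases actually read out
    unique = min(insert_len, sequenced)    # distinct insert positions covered
    return {
        "sequenced": sequenced,
        "unique": unique,
        "redundant": sequenced - unique,
    }


for insert_len in (100, 150, 300):
    print(insert_len, pair_summary(insert_len))
```

The 150 bp case is an extra intermediate value added here to show partial overlap: 200 bases sequenced, 150 distinct, 50 redundant.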

Is this clear? If so, what else is confusing you?
06-11-2012, 10:17 PM   #10
Jeremy
Senior Member

Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190

Quote:
Originally Posted by shyam_la
If I have two 100 bp paired-end read files, each containing x reads (2x reads in total), is the total number of base pairs represented by the two files together 2x or x?

If you have 30 Gbp of sequence data, then you have 30 Gbp of sequence data. What difference does it make how many files that data is split into?
06-11-2012, 11:35 PM   #11
shyam_la
Member

Location: California

Join Date: Mar 2012
Posts: 97

Heisman: For a 300 bp insert, the 200 bp from one read pair will cover it from the ends, but the 100 bp gap will be covered by some other read pair from a different, overlapping insert in the library. What I don't understand is why your answer sounds like those 100 bp have just vanished..

Maybe I am just not able to get the big picture at the moment, from our rather simple discussion, but whatever..
06-11-2012, 11:40 PM   #12
shyam_la
Member

Location: California

Join Date: Mar 2012
Posts: 97

Quote:
Originally Posted by Jeremy
If you have 30 Gbp of sequence data, then you have 30 Gbp of sequence data. What difference does it make how many files that data is split into?

Do you think it makes no difference whether that 30 Gbp of raw sequence data corresponds to 300 Mbp or 150 Mbp of a reference after alignment?

Maybe this is getting confusing, because we haven't invoked the idea of coverage in a proper way..
06-12-2012, 12:09 AM   #13
Jeremy
Senior Member

Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190

Quote:
Originally Posted by shyam_la
Do you think it makes no difference whether that 30 Gbp of raw sequence data corresponds to 300 Mbp or 150 Mbp of a reference after alignment?

Maybe this is getting confusing, because we haven't invoked the idea of coverage in a proper way..

I think your original question was worded confusingly if you were asking about reference coverage. It only talked about raw data and mentioned nothing of coverage.

If your original question pertained to coverage of a genome, then assume a random read distribution and divide the amount of sequence data by the genome size. For example, if you have 30 Gbp of sequence data and a 1 Gbp genome, then theoretically you have an average of 30x coverage. It makes little difference how much of that redundancy comes from overlap within read pairs versus from different reads. Overlap within a pair is very little to nil anyway, because fragments are size-selected to 200-500 bp and you sequence only 75-100 bp from each end.
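
As a rough sketch of that division, assuming reads are spread randomly over the genome: the helper names below are hypothetical, and the figure of 150 million read pairs is an assumed number chosen only so that 100 bp pairs add up to the 30 Gbp in the example.

```python
def total_bases_from_pairs(n_pairs, read_len):
    """Total sequenced bases for n_pairs paired-end fragments (two reads each)."""
    return 2 * n_pairs * read_len


def mean_coverage(total_bases, genome_size):
    """Theoretical average depth: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# The example from the post: 30 Gbp of data on a 1 Gbp genome.
print(mean_coverage(30_000_000_000, 1_000_000_000))

# Same data expressed as read pairs (150 million pairs is an assumed figure).
bases = total_bases_from_pairs(150_000_000, 100)
print(bases, mean_coverage(bases, 1_000_000_000))
```

The number of files the reads are stored in never enters the calculation; only the total base count and the genome size do.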
06-12-2012, 09:07 AM   #14
shyam_la
Member

Location: California

Join Date: Mar 2012
Posts: 97

Quote:
Originally Posted by Jeremy
I think your original question was worded confusingly if you were asking about reference coverage. It only talked about raw data and mentioned nothing of coverage.

If your original question pertained to coverage of a genome, then assume a random read distribution and divide the amount of sequence data by the genome size. For example, if you have 30 Gbp of sequence data and a 1 Gbp genome, then theoretically you have an average of 30x coverage. It makes little difference how much of that redundancy comes from overlap within read pairs versus from different reads. Overlap within a pair is very little to nil anyway, because fragments are size-selected to 200-500 bp and you sequence only 75-100 bp from each end.

Perfect! Answers everything.. Thanks.

All times are GMT -8.

