SEQanswers

Old 05-06-2012, 12:08 PM   #1
avm
Junior Member
 
Location: Gothenburg

Join Date: May 2012
Posts: 8
Question Read lengths, inserts, fragment size...

Hi,

I am new to sequencing and bioinformatics and am trying to get the terms right.

I am doing WGS of E. coli genomes on Illumina HiSeq machines.

Read length: Is that the length of the DNA fragment between the tags being replicated on the flow cell?

Inserts: The actual sequence you get from the machine?

Fragment size: Not sure...

Cycles: I have samples run with 75 and 100 cycles. What does that mean?

Please help me, so confused...
Old 05-06-2012, 05:04 PM   #2
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 534
Default

You should read through the stickies in the Illumina library prep section. That said:

Read length and cycles are related terms. With Illumina, one base is sequenced per cycle, so 100 cycles gives you 100 bp reads. The read length is the number of bases sequenced. It does not matter where along the entire strand the bases are sequenced (although typically they start right after the sequencing primer site, which is part of the adapter ligated on to each fragment during library prep).

Insert refers to the sequence between the universal adapters that are ligated on. These adapters typically have the flow cell adapter sequence and sequencing primer. They can also have an index or a barcode. If they have a barcode, the barcode is considered to be included in the insert since it will be sequenced at the beginning of the read. An index would not be sequenced at the beginning of the read and thus does not contribute to the insert size.

Fragment size is similar to insert size: it is the average size of your fragments. It should be specified whether this refers to the size before or after the adapters are ligated on.
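
To make the relationship between cycles, read length and insert concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: the function name `simulate_read` and the toy sequences are made up for this example, the adapter is a placeholder string rather than a real Illumina adapter sequence, and real base calling is of course not a string slice.

```python
def simulate_read(insert: str, adapter: str, cycles: int) -> str:
    """Return the bases one read would report (toy model).

    Sequencing starts at the first base of the insert and reports one
    base per cycle. If the insert is shorter than the number of cycles,
    the read continues into the adapter on the far side of the insert.
    """
    template = insert + adapter
    return template[:cycles]


# Hypothetical toy sequences, chosen only for illustration.
insert = "ACGTTGCAAGGCTTAACCGGTTAGC"   # a 25-base insert
adapter = "x" * 10                      # placeholder for adapter bases

for cycles in (10, 25, 30):
    read = simulate_read(insert, adapter, cycles)
    print(cycles, len(read), read)
```

With 10 cycles the read is 10 bases long regardless of how long the insert is. With 30 cycles on a 25-base insert, the last 5 bases of the read come from the adapter, which is why adapter trimming matters when inserts are shorter than the read length.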
Old 05-06-2012, 09:32 PM   #3
avm
Junior Member
 
Location: Gothenburg

Join Date: May 2012
Posts: 8
Default

Okay,

I have reads of 75 bp, but the mean insert size in one sample is 239 bp. Is the barcode that long?

Thank you for making it clearer!
Old 05-06-2012, 09:38 PM   #4
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 534
Default

No, that (probably) means there is a 239 bp DNA fragment between the two adapters that were ligated on. Defined this way, the insert size is independent of the read length.

To explain better: I think some people define fragment size as the distance between where the adapters are ligated, and insert size as the distance between where the two reads end in paired-end sequencing (that is, the unsequenced gap between the mates). With that definition, the insert size would vary with read length. However, I do not believe that definition is universal, and I always use insert size to mean the total length of DNA between where the adapters are ligated.

If you use software where you have to specify the insert size, check to make sure that you understand what definition the software is using.
Old 05-06-2012, 09:45 PM   #5
avm
Junior Member
 
Location: Gothenburg

Join Date: May 2012
Posts: 8
Default

Thank you for explaining and for quick answers!!
Old 05-07-2012, 06:59 AM   #6
archie.chauhan
Junior Member
 
Location: tn

Join Date: Nov 2011
Posts: 9
Default

We got our Illumina paired-end data for a 2x100 bp run processed with CASAVA 1.8 (demultiplexed FASTQ files). Since this is our very first run and we are new to downstream Illumina data processing, I would appreciate it if you could answer our queries:
1) Our data for almost all the lanes looks as below. Is this normal? The position of the runs of Ns is almost the same in each sample from different lanes. If not, what is the cause of such data?

********************************************************************************************************
@DJG64KN1:78:C0MG3ACXX:4:1101:1119:1986 1:Y:0:GCCAATA
TTCTCCCCTTNNNNNNNNNNNTTCTTTGAACCCACNNNNNNNNTATCATGACTACTTATGTAANNNNNNNTACACAGCCACCATTTCTGANNNCTGCTCA
+
<<<?@???@@###########228???????????########--<=????????????;?@@#####################################
@DJG64KN1:78:C0MG3ACXX:4:1101:1212:1989 1:Y:0:GCCAATA
TATGAAAAATNNNNNNNNNNAATGTTATAATTTCTANGNNNNNGAGGGCTATTTATAGTCTAANNNNNTCAACTATGCTAATTATCACAATTAGCCCCTT
+
<<<@?@@@?@##########42@=@??@@?@?????#0#####00==????????>???@@@@#####,,9==>>?>???>>?????==========<<<
@DJG64KN1:78:C0MG3ACXX:4:1101:1473:1987 1:Y:0:GCCAATA
CTTACATATANNNNNNNNNNNAAAAGTAAGTTTGAGNCNNNNNTCCAATTTAGATGAAGAATCNNNNNACATTTCATATTTTTAATAGATACTTAACTAT
+
<<<@@@@@@@###########22==@>???@???)=#0#####00<????>???>??=?9;?;#####,,9==?>=>>>?????;=:===26;===<===
@DJG64KN1:78:C0MG3ACXX:4:1101:1253:1997 1:Y:0:GCCAATA
ATTTGTATTANNNNNNNNNTCAAAAATTAAGATGAGTATNNNNTGAAGTAAACATGATTTGGCNNNNNTGAAAACATAGACGAGATAGGAAAATAGAAAG
+
<<<@@@@@@@#########34=@@@@@@??@????????####00=??????????>?@@@@?#####--=???><>?>??<<<<<======<=======
@DJG64KN1:78:C0MG3ACXX:4:1101:1385:1998 1:Y:0:GCCAATA
AACCAAAGCTNNNNNNNNNAATTAAAGTCATTTCTCAACNNNNAGTATCAACATCTATACATANNNNNATTATCGATCAGTTATATAAAGTTCTTTTCTA
+
<<<@@@@@@@#########32@@@@?@????????????####00<=???????????@@@@>#####-,9=????=?<??????===============
@DJG64KN1:78:C0MG3ACXX:4:1101:1667:1982 1:Y:0:GCCAATA
ANGACTTAAGNNNNNNNNNNNTCCAGAGATAATTANNNNNNNNTTTTTTTCTTATTTATGAGNNNNNNNAACATCCAAAAAACTATTGTATTTTTGTGTC
+
<#0@@@@@@@###########22@>@>????????########00<????????????????######################################
@DJG64KN1:78:C0MG3ACXX:4:1101:1519:1984 1:Y:0:GCCAATA
TNCCCATTTTNNNNNNNNNNNCTTATTCACAAATCNNNNNNNNAACTTACAGTAGTTTTCATNNNNNNNAAAAACAGTTCAAACTGCAATTGTATTTGTG
+
9#0<@@(.@@##########################################################################################
@DJG64KN1:78:C0MG3ACXX:4:1101:1594:1985 1:Y:0:GCCAATA
TTATAATCAANNNNNNNNNNNAAAAAAAAAGCCCGNNNNNNNNAATTAAACATTGTTAAACCANNNNNNAACATTGTTAAACCAATAATAAGCAGTTATT
+
<<<@?@?@??###########22@@?????8>???#################################################################
@DJG64KN1:78:C0MG3ACXX:4:1101:1644:1989 1:Y:0:GCCAATA
AGATGAGTAANNNNNNNNNNTACATGCTCGAACGCTNTNNNNNGAGCAAATACGTTTTAAAACNNNNNAAGTTAAAACAACTTCTTGAAAATGAATCAAG
+
<<<@??@@@?##########32=?????????????#-#####.-<=??9;>??????@@???#####################################
@DJG64KN1:78:C0MG3ACXX:4:1101:1809:1988 1:Y:0:GCCAATA
TAGCCTTATCNNNNNNNNNNNCCAAACTAGACACCTNANNNNNCAACACTATGCCTTCTTTAANNNNNAAATGACATTTTTCCCAATTAAGAACAAGGTG
*****************************************************************************************************************

2) We have got around 21-30 FASTQ files per lane for both read 1 and read 2, named like: SJL-2b_ACAGTGA_L008_R1_001.fastq.gz ..................... SJL-2b_ACAGTGA_L008_R2_021.fastq.gz.
Does this mean that the read length of this sample is only 21 bp?
Old 05-07-2012, 07:07 AM   #7
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 534
Default

1. Having stretches of N's like that is not normal. I'm not sure what the cause would be. You should check with whatever sequencing core ran those samples to see if there was anything weird with the run as a whole. If so you may be able to get them to rerun it for free.

2. No, your reads are 100 bp. That number is just part of the file name and says nothing about the read length.
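
If you want to check the read length from the data rather than from file names, a small script is enough. The following is a minimal Python sketch, assuming plain 4-line FASTQ records (no wrapped sequence lines); the function names `fastq_records` and `read_length_counts` are just names picked for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def fastq_records(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    # zip over the same iterator four times groups the lines in fours;
    # an incomplete record at the end of the input is silently dropped.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def read_length_counts(lines: Iterable[str]) -> Counter:
    """Count how many reads have each length."""
    return Counter(len(seq) for _, seq, _ in fastq_records(lines))


# Tiny in-memory example with two toy records.
example = [
    "@read1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n",
    "@read2\n", "ACGTAC\n", "+\n", "IIIIII\n",
]
print(read_length_counts(example))
```

For a real file you would pass an open text handle instead of the list, for example one obtained from `gzip.open(path, "rt")` for a .fastq.gz file.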
Old 05-07-2012, 07:24 AM   #8
archie.chauhan
Junior Member
 
Location: tn

Join Date: Nov 2011
Posts: 9
Default

Thanks for the response. The pattern looks the same in almost all the samples. Is this a problem with the library preparation or just a problem with the sequencing run?

I want to elaborate on my second question (for both R1 and R2):
Some samples have R1_001..to ...20.fastq.gz and some have R1_001..to ...35.fastq.gz. Why do different samples have different numbers of files? What does this suggest?

Can you please let me know which software to use for clipping the adapter sequences and the indices, and for further downstream processing?

Thanks a lot, sir!
Old 05-07-2012, 07:29 AM   #9
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 534
Default

No idea if it's a problem with the library prep or the run; I would check with the sequencing core (I'd imagine it's a problem with the run, though).

Different numbers of files most likely reflect different total numbers of reads, not different read lengths.

From the reads you showed, it looks like the indices have already been clipped and put into the headers. You may want to look at the FASTX-Toolkit to find a way to trim adapter sequences. I align with Novoalign, which does it during the alignment.
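
To see where the index ends up, here is a minimal Python sketch that splits one of the header lines posted above into its fields. The field layout is the CASAVA 1.8 style header as I understand it (instrument, run, flowcell, lane, tile, x, y, then read number, filter flag, control number and index); the class and function names are only assumptions made for this example.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int            # 1 or 2
    is_filtered: bool    # 'Y' in the header: the read was filtered out
    control: int
    index: str           # index (barcode) sequence, already clipped off


def parse_header(line: str) -> IlluminaHeader:
    """Parse a CASAVA 1.8 style FASTQ header line."""
    left, right = line.lstrip("@").strip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read, filtered, control, index = right.split(":")
    return IlluminaHeader(
        instrument, run, flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), index,
    )


header = parse_header("@DJG64KN1:78:C0MG3ACXX:4:1101:1119:1986 1:Y:0:GCCAATA")
print(header.lane, header.read, header.is_filtered, header.index)
```

Headers written by other pipelines or software versions may be laid out differently, so check a few lines of your own files before relying on this.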
Old 05-07-2012, 07:34 AM   #10
archie.chauhan
Junior Member
 
Location: tn

Join Date: Nov 2011
Posts: 9
Default

Thanks a lot... it helped.
Old 05-07-2012, 12:11 PM   #11
archie.chauhan
Junior Member
 
Location: tn

Join Date: Nov 2011
Posts: 9
Default

Just a follow-up to the above. Illumina support gave the following answer to the problem, and I did find that the sequences in the middle of the file are good.

"The data that you provided looks to be very normal. Generally speaking there will be data at the beginning and end of the FASTQ that is of lower quality than the data in the middle of the file. This is simply due to sorting. This data appears to be of normal quality and appears to be intact. "

If you have time, I would like to discuss my course of action:

I have 454 unpaired and paired data as well as Illumina reads. I have assembled the 454 data using Newbler. I plan to assemble the Illumina data using Velvet and then combine the assemblies using Minimus.

arc
Old 05-07-2012, 01:30 PM   #12
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 534
Default

Oh, right; the first reads of a fastq file for Illumina will be around the edge of the flowcell, I think, making them more likely to be weird. Maybe do "less +1000000" and see what that looks like.

I've never done any assembly so you'll have to find somebody else.
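
Regarding the "less +1000000" suggestion: that opens the file at line 1,000,000, which is about 250,000 records into a 4-line-per-record FASTQ file. If you would rather have a number than eyeball it, the following Python sketch compares the fraction of N base calls in a window of records at the start of a file with a window further in. The function name `n_fraction` and its parameters are assumptions for this example, and it assumes plain 4-line FASTQ records.

```python
from itertools import islice
from typing import Iterable


def n_fraction(lines: Iterable[str], skip_records: int = 0,
               n_records: int = 1000) -> float:
    """Fraction of N base calls in a window of FASTQ records.

    skip_records: how many records to skip before the window starts.
    n_records:    how many records the window contains.
    """
    start = skip_records * 4 + 1          # index of the first sequence line
    stop = start + n_records * 4
    total = n_count = 0
    # Every 4th line, starting at 'start', is a sequence line.
    for seq in islice(lines, start, stop, 4):
        seq = seq.strip()
        total += len(seq)
        n_count += seq.upper().count("N")
    return n_count / total if total else 0.0


# Toy input: the first record is all N, the second has no N at all.
example = [
    "@a\n", "NNNN\n", "+\n", "####\n",
    "@b\n", "ACGT\n", "+\n", "IIII\n",
]
print(n_fraction(example, skip_records=0, n_records=1))
print(n_fraction(example, skip_records=1, n_records=1))
```

When calling this on an open file handle, reopen the file (or seek back to the start) between calls, because each call consumes lines from the handle.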
Old 05-09-2012, 07:09 AM   #13
archie.chauhan
Junior Member
 
Location: tn

Join Date: Nov 2011
Posts: 9
Default

Hi, sorry for the delayed response. I did "less +1000000" and the data looked good.

I have a few more queries:
1) I can see sequences flagged with both "N" and "Y", which indicates that the sequences have not been filtered. Are there programs to do that?
2) Our sequencing provider has given us multiple fastq.gz files per lane. What is the protocol for concatenating such files?
3) I am confused about the Illumina paired-end library in comparison to the 454 PE library. The Illumina library has the following setup: adapter-seq-adapter, whereas 454 has seq-linker-seq. If the sequences are 100 bp each, then in 454 we end up getting 200 bp PE reads, whereas in Illumina we get separate 100 bp R1 and R2 reads (for a 2x100 run). This seems to mean that in Illumina we are just getting extra 100 bp reads from the PE run which do not carry any linking information, so we could save money by doing unpaired Illumina runs. What is the use of doing a PE Illumina run?

Sorry for bombarding you with so many questions.

Regards,
arc
Old 05-09-2012, 07:38 AM   #14
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 534
Default

1. I don't know, as we only receive the unfiltered ones. Maybe "grep -v"?

2. I think you'll need to unzip them all and then concatenate.

3. With Illumina paired end runs you have something like this:

[flowcell adapter][sequencing primer][insert][sequencing primer][flowcell adapter]

The key is that the insert may be say 300bp, and if you do 2x100 reads, you'll sequence it like this (the dots are only spacers):

........--------->...................<---------.....
........xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.....

So when aligning paired-end data it is clear that the two mates of one pair should align in that orientation, facing each other and fairly close together.
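
For points 1 and 2, here is a minimal Python sketch that reads several .fastq.gz files in order, keeps only the records whose header filter flag is "N", and writes them to one output .fastq.gz. It assumes CASAVA 1.8 style headers like the ones posted above (where, as far as I know, "Y" means the read was filtered out and "N" means it passed) and plain 4-line records; the function names and any file paths you pass in are up to you.

```python
import gzip
from typing import Iterable, Iterator, List


def passing_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield 4-line FASTQ records whose header filter flag is 'N'."""
    it = iter(lines)
    for record in zip(it, it, it, it):
        header = record[0]
        # Header layout: '@... <read>:<is filtered>:<control>:<index>'
        flag = header.split(" ", 1)[1].split(":")[1]
        if flag == "N":                 # 'N' = not filtered out
            yield list(record)


def merge_and_filter(in_paths: Iterable[str], out_path: str) -> int:
    """Write passing reads from several .fastq.gz files to one file.

    Returns the number of records kept.
    """
    kept = 0
    with gzip.open(out_path, "wt") as out:
        for path in in_paths:
            with gzip.open(path, "rt") as fh:
                for record in passing_records(fh):
                    out.writelines(record)
                    kept += 1
    return kept
```

Two notes. A plain "grep -v" on the header would only drop the header line and leave the other three lines of the record behind, so the whole record has to be handled together as above. And if you only need to concatenate without filtering, gzip files can be joined directly (for example with cat) without unzipping first, since a file made of several gzip members is still valid gzip. Either way, process the R1 and R2 files in the same order so the mates stay in sync.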
Old 05-09-2012, 07:48 AM   #15
archie.chauhan
Junior Member
 
Location: tn

Join Date: Nov 2011
Posts: 9
Default

thanks a lot.
Old 05-25-2012, 10:15 PM   #16
modi2020
Member
 
Location: New York

Join Date: May 2012
Posts: 21
Default

Hi Heisman,

I just need a simple clarification, please.
Suppose you have a fragment like (AAACGGCGTTTCCC)
and you want to sequence it using an Illumina paired-end run.
Does paired end mean you will get the sequence of the ends?
I.e., does it imply we only sequence AAA and CCC in the sequence above?
If that is true, I assume that with sequencing by synthesis we would get TTT and GGG.
I understand that if we mapped the sequences back to the reference we would anticipate that they are 8 bases apart (given no indels are present in our DNA at hand). Is this right?
However, my main concern is the sequence in between.
What really happens to it?
I guess my question is:
What is the benefit of paired-end reads if we only sequence the ends and not what's in between?

I would really appreciate help clarifying this.

Old 05-26-2012, 07:26 AM   #17
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 534
Default

I have never thought about this and am honestly not sure if you get TTT or AAA in your read.

You are correct that you sequence the ends, and if you did 2x3bp reads you would get AAA and CCC (or maybe TTT and GGG). EDIT: You would actually get reverse complements, so AAA and GGG or TTT and CCC.

You would probably NOT anticipate that they are 8 base pairs apart. When you do a library prep you will almost certainly get some distribution of insert size fragments around a mean. So you would anticipate they will be 8 +/- 2bp apart, for example (more realistically maybe 250 +/- 50bp or something like that). The sequence in between remains unknown to you.

So, the benefit to paired end sequencing is three-fold, in my opinion. First, it makes it easier to map each fragment. If your read has two ends, A and B, and read A can be mapped almost equivalently to two locations in the genome, but read B can only be mapped to one location, the aligner will put read A at the location close to where read B maps.

Second, if you are at all interested in detecting larger CNVs/structural variants, PE reads are much more helpful. Two examples: first, if one read maps and the other does not, it's possible the unmapped read spans a breakpoint of a CNV/SV, and you can do a split-read mapping of that read to try to determine the breakpoint. Second, if both reads map but the orientation is abnormal (i.e., both map like "---->" instead of "---->" and "<----"), or if the distance between the mapped reads is abnormal (i.e., you expect 250 +/- 50 bp but observe that the two reads of one pair are mapped 1000 bp apart), that gives you a lot of information.

Third, and possibly the most useful (although the first point is quite useful), with PE reads it's much easier to remove duplicate reads and be more confident that they are in fact PCR duplicates as opposed to just being two random reads that align to the same location. If you have 1x100 bp reads, you can have at most 100x coverage of any base without duplication (barring indels in the read). If you have 2x100 bp reads and the insert size distribution is, say, 250 +/- 50 bp, you can potentially have 10,000x coverage or higher after removing all reads that look like duplicates, since a pair only looks like a duplicate when both of its ends coincide with those of another pair.
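
The second and third points can be sketched in a few lines of Python. This is a toy illustration, not how any particular aligner or duplicate-marking tool works: the `Mate` record, the category names, the default of 250 +/- 50 bp and the duplicate key are all assumptions made for the example.

```python
from typing import NamedTuple, Tuple


class Mate(NamedTuple):
    chrom: str
    start: int      # leftmost aligned position on the reference
    end: int        # rightmost aligned position on the reference
    forward: bool   # True if the read aligned to the forward strand


def classify_pair(a: Mate, b: Mate, expected: int = 250,
                  tolerance: int = 50) -> str:
    """Label a mapped pair as proper or as one kind of abnormal."""
    if a.chrom != b.chrom:
        return "different_chromosomes"
    if a.forward == b.forward:
        return "same_orientation"         # '---->' and '---->'
    left, right = sorted((a, b), key=lambda m: m.start)
    if not left.forward:
        return "facing_outward"           # '<----' then '---->'
    span = right.end - left.start         # outer distance of the pair
    if abs(span - expected) > tolerance:
        return "unexpected_distance"
    return "proper"


def duplicate_key(a: Mate, b: Mate) -> Tuple[str, int, int]:
    """Simplified key: pairs sharing both outer ends look like duplicates."""
    left, right = sorted((a, b), key=lambda m: m.start)
    return (left.chrom, left.start, right.end)


r1 = Mate("chr1", 100, 200, True)
r2 = Mate("chr1", 250, 350, False)
far = Mate("chr1", 1000, 1100, False)
print(classify_pair(r1, r2))
print(classify_pair(r1, far))
```

Because the key uses both ends of the pair, two pairs that start at the same position but come from fragments of different lengths are not treated as duplicates, which is what allows the much higher coverage mentioned above.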

Last edited by Heisman; 05-27-2012 at 08:15 AM.
Old 05-26-2012, 09:45 AM   #18
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 534
Default

Sorry, the two ends would be the reverse complements of each other's strands.
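
As a quick check of the toy example from the question, here is a minimal Python sketch. It treats the fragment as a plain string and takes read 1 from the start of one strand and read 2 from the start of the opposite strand; the function names are just assumptions for this illustration.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int):
    """Return (read1, read2) for a fragment in this toy model.

    read1 is the first read_length bases of the given strand;
    read2 is the first read_length bases of the opposite strand,
    i.e. the reverse complement of the fragment's last bases.
    """
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2


print(paired_end_reads("AAACGGCGTTTCCC", 3))   # ('AAA', 'GGG')
```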
Old 05-27-2012, 07:29 AM   #19
modi2020
Member
 
Location: New York

Join Date: May 2012
Posts: 21
Default

Hi Heisman,

This clarifies the process a lot for me. Thank you so much for your detailed answer and help with this.

Best

Old 08-01-2013, 12:40 AM   #20
jp.
Senior Member
 
Location: NikoNarita.jp

Join Date: Jul 2013
Posts: 135
Question

Dear Senior Member Heisman
I read everything above in this thread, and thank you very much for helping newcomers, including me, understand this.

I have a question that I think you may be able to answer easily. I received the results from the company and do not understand what the insert size is, which I need to specify for alignment. I called the company but they didn't give me an exact answer; they said the adapter details are in the report. The details I received are:
HiSeq 2000: PE
Read Length: 101 x 2
Insert Size: 80~380 (main 150)
Adapter 5': (1) TruSeq Universal Adapter, 58 bp; (2) TruSeq Adapter Index 1-12, 63 bp; (3) TruSeq Adapter Index 13-27, 65 bp.

My question concerns --mate-inner-dist / -r (TopHat2). If I calculate it as per the TopHat manual, I get this:

Example   Adaptor   PE(x2)   Whole_Insert_Size   Calculated(-r)
Tophat    50        100      300                 200
If (1)    58        116      380                 264
If (2)    63        126      380                 254
If (3)    65        130      380                 250

Q1. Which of the -r values calculated above is correct for my sample, if any? Is it 250 (+/-14)?
Q2. Do I need more information from the sequencing company to calculate these values?
Q3. What am I missing for calculating the insert size?

Please do reply, as I am struggling a lot with this.
Thank you.
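
For what it is worth, here is the arithmetic as a minimal Python sketch. In the TopHat manual's example for -r/--mate-inner-dist (fragments selected at 300 bp, each end 50 bp, so -r is 200), the 50 is the read length of each mate, not an adapter length, so the inner distance is the fragment size minus twice the read length. The function name is made up for this example, and the calculation assumes that the reported "Insert Size" is the DNA between the adapters; whether the company's 80~380 figure includes the adapters is something only they can confirm.

```python
def mate_inner_distance(fragment_size: int, read_length: int) -> int:
    """Expected inner distance between mates.

    fragment_size: length of the DNA between the adapters (assumed).
    read_length:   length of each of the two reads.
    A negative result means the two mates overlap.
    """
    return fragment_size - 2 * read_length


print(mate_inner_distance(300, 50))    # 200, the TopHat manual example
print(mate_inner_distance(150, 101))   # -52, for the 'main 150' insert
print(mate_inner_distance(380, 101))   # 178, for the upper end of the range
```

Under that assumption, with 2x101 reads and a typical insert of about 150 bp, the mates overlap and the inner distance is negative rather than around 250. If the reported sizes turn out to include the adapters, the adapter lengths would have to be subtracted from the fragment size first.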

