SEQanswers

Old 11-06-2011, 09:33 PM   #41
ywlim
Junior Member
 
Location: California

Join Date: Jul 2011
Posts: 6
Default

Hi all, I am still struggling to align my paired-end reads with BWA. I used the command:

bwa sampe -P -s hg19.fasta CATTCG_1.sai CATTCG_3.sai CATTCG_1.fastq CATTCG_3.fastq > CATTCG_PE.sam

and the first few lines of the program's output look like this:

Quote:
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] fail to infer insert size: too few good pairs
[bwa_sai2sam_pe_core] time elapses: 10.96 sec
[bwa_sai2sam_pe_core] changing coordinates of 6 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_sai2sam_pe_core] time elapses: 0.00 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.82 sec
[bwa_sai2sam_pe_core] print alignments... 1.99 sec
[bwa_sai2sam_pe_core] 262144 sequences have been processed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (3520, 39961, 70863)
[infer_isize] low and high boundaries: 94 and 205549 for estimating avg and std
[infer_isize] inferred external isize from 27 pairs: 37726.370 +/- 35311.353
[infer_isize] skewness: 0.341; kurtosis: -1.395; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 251007 (6.04 sigma)
[bwa_sai2sam_pe_core] time elapses: 10.87 sec
[bwa_sai2sam_pe_core] changing coordinates of 178 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_sai2sam_pe_core] time elapses: 0.00 sec
[bwa_sai2sam_pe_core] refine gapped alignments... 0.82 sec
[bwa_sai2sam_pe_core] print alignments... 1.97 sec
[bwa_sai2sam_pe_core] 524288 sequences have been processed.
The program seems to be running fine and quite quickly, but when I look at the output file, I see something like this:

Quote:
DJB775P1_0215:5:1101:1262:2347#0 65 chr13 92966174 37 94M chr5 33346129 0 AATAACCACCTAGATAAATGTTCACTCATCTCGCCTGTCTAGCCTGTCTTGAGGCCGGTTTCATCATGAGTCACTCCACCAATTACTTCAAAAC cgggeghhhhfhhfffgffegbcfgffdhfffhhhdb^efgfddfhhhhffS\eefgg\W`c]RZ^__GU]UMMZ_Z]\_a^_TR_bbbbYYYW XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:94
DJB775P1_0215:5:1101:1495:2155#0/3 129 chr5 33346129 37 94M chr13 92966174 0 AATTAACTTCCTTTTTTTGTCTTCATATAACACTGTTGACCTACTCATATTGAGCCCTCAGTCTTTTTTGTACACATGCTCATCCCTGGCATGT ceggggiiiiiiiiiiiiihiiiiihiiihihifgffhiig`fgfghhfhghffhhihigfgfgeeceacabbcccb`b_bcccccc_X[`b^Y XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:94
DJB775P1_0215:5:1101:1465:2351#0 81 chr14 44601925 37 94M chr5 33346129 0 TCTCCTACCTCCTCTCCCTTATAGAAATCCCTGTGATTCCATTAGTCTCACCTGGATAAACCAGAGTATTCTTATTATCCCAAGATCCCCATCT XTR^YGG\^ZZZRbb]]ccc_db]d^dgfcb`bZZe_bSfc`fgf_fc^Iffhhhhfhfagddhfhhhhfffgeefddfedbbd_agfcgebe\ XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:38A55
DJB775P1_0215:5:1101:1483:2169#0/3 161 chr5 33346129 37 94M chr14 44601925 0 AATTAACTTCCTTTTTTTGTCTTCATATAACACTGTTGACCTACTCATATTGAGCCCTCAGTCTTTTTTGTACACATGCTCATCCCTGGCATGT ceYbgae_egihffhhd_^efhihfhhfXcghfcgacgfI^[cba\eecgZ_HWWHLaZ`VVVb`gacccZb_RZ]`]bbcbbbb^bc^`X][S XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:94
DJB775P1_0215:5:1101:1317:2430#0 81 chr9 8239023 37 94M chr12 51288201 0 TTAAGTATTAAATGACATAAAACCTATAAAGCACATAGCAGGTAAATGTGGTAAACTCTTGATAAATGTTATTGTTATCATCATCATCATCACT b`]a^VHRcaggggbgeghgc\bhgefbZ\MW[gfgce^[gfff^^aa^OXeaYIae[hhhhgf^[d[hgfge_hhhgd_hY^bfdbcegecZc XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:94
DJB775P1_0215:5:1101:1461:2198#0/3 161 chr12 51288201 37 94M chr9 8239023 0 CTGTTGGCTGGAATGTAAAATGGTGCAGCTGCTGTGGAAAACTGCATGGCAGTTCCTAGAAAAATTAAAAATAGAATTACCATATGATCCAGCA egggggefdf`egg[bdgh`]gh^dbe`dfhhbffbgIIX^^e_fgffhabgH\\_\HM\d]dUGV\\VV_ZVVHHUZ__bbc]`BBBBBBBBB XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0
It is really peculiar that most, if not all, read pairs have been mapped to different chromosomes. How is this possible? When I use samtools to filter out the correctly paired reads, I obtain only a very small file.

Can someone tell me why my reads are paired so weirdly? Any help is greatly appreciated!
Old 11-07-2011, 08:53 AM   #42
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

If the reads have wildly different names, they aren't supposed to be paired with each other.

bwa assumes that the first read of the first fq goes with the first read of the second fq, and so on. That doesn't appear to be the case here, which is why your "pairs" are all over the place.
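A minimal sketch of how one might check that two FASTQ files are in the same order before running bwa sampe. It assumes plain 4-line FASTQ records and mate suffixes of the form /1, /2 or /3 (as in the read names above); the function names are hypothetical and not part of bwa.

```python
import io
from itertools import zip_longest


def read_names(handle):
    """Yield the read name (first token after '@') of each 4-line FASTQ record."""
    lines = (line.strip() for line in handle)
    lines = (line for line in lines if line)  # ignore blank lines
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each record
            tokens = line[1:].split()
            yield tokens[0] if tokens else ""


def base_name(name):
    """Strip a trailing mate suffix such as /1, /2 or /3, if present."""
    if len(name) > 2 and name[-2] == "/" and name[-1].isdigit():
        return name[:-2]
    return name


def first_mismatch(fq1, fq2):
    """Return (record_index, name1, name2) for the first out-of-sync record, or None."""
    pairs = zip_longest(read_names(fq1), read_names(fq2))
    for idx, (n1, n2) in enumerate(pairs):
        # A missing name means one file has more records than the other.
        if n1 is None or n2 is None or base_name(n1) != base_name(n2):
            return idx, n1, n2
    return None


# Tiny in-memory demo; these read names are invented for illustration.
fq1 = io.StringIO("@frag1/1\nACGT\n+\nIIII\n@frag2/1\nACGT\n+\nIIII\n")
fq2 = io.StringIO("@frag1/3\nACGT\n+\nIIII\n@frag7/3\nACGT\n+\nIIII\n")
print(first_mismatch(fq1, fq2))
```

With real data, open the two FASTQ files and pass the file handles instead of the StringIO objects. A result of None means every record lined up by name.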
Old 11-08-2011, 09:38 PM   #43
ywlim
Junior Member
 
Location: California

Join Date: Jul 2011
Posts: 6
Default

Thank you so much! Now that I have made sure both input files list each fragment's reads in the same order, everything is working.
Old 02-08-2012, 05:22 AM   #44
Arturo S.G.
Junior Member
 
Location: Barcelona, Spain

Join Date: Feb 2012
Posts: 1
Default

Hello to the SEQanswers community!

I came looking for the answer to a simple question on paired-end reads and I found much more (useful) info in this thread.

Thanks to all the contributors
Old 11-09-2012, 12:34 AM   #45
wanfahmi
Member
 
Location: North Sea

Join Date: Apr 2008
Posts: 34
Default

Hey,

I want to ask about filtering paired-end data. Do I need to filter read 1 and read 2 separately, or combine them and then filter? Later on I want to use the filtered data for RNA-seq analysis with TopHat and Cufflinks, but TopHat requires read 1 and read 2 as separate inputs, not as a single combined paired-end file.
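One possible approach, sketched here under assumptions rather than taken from any tool's documentation: filter the two mates jointly, so that a pair is kept only if both reads pass and the two output files stay in the same order (the pairing requirement discussed earlier in this thread). The mean-quality criterion, the threshold of 20 and the Phred+33 offset are all illustrative choices.

```python
def mean_quality(qual, offset=33):
    """Mean Phred score of a FASTQ quality string (offset 33 is an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_pairs(pairs, min_mean_q=20, offset=33):
    """Keep a pair only if BOTH mates pass, so read 1 and read 2 stay in sync.

    pairs: iterable of ((name1, seq1, qual1), (name2, seq2, qual2)).
    """
    kept = []
    for r1, r2 in pairs:
        ok1 = mean_quality(r1[2], offset) >= min_mean_q
        ok2 = mean_quality(r2[2], offset) >= min_mean_q
        if ok1 and ok2:
            kept.append((r1, r2))
    return kept


# Invented example records: the second pair has a low-quality read 2.
example = [
    (("a/1", "ACGT", "IIII"), ("a/2", "TTGA", "IIII")),
    (("b/1", "ACGT", "IIII"), ("b/2", "TTGA", "####")),
]
for r1, r2 in filter_pairs(example):
    print(r1[0], r2[0])
```

Writing the first element of each kept pair to one file and the second to another gives two files with matching order, which is the layout aligners that take separate read 1 and read 2 files expect.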
Old 12-12-2012, 04:39 PM   #46
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

Hi!

I'm analyzing a "second-hand" dataset generated using SOLiD 4. It is a transcriptome mate pair library that is 52 x 37 nt, and I cannot for the life of me find the protocol that was used to generate those specific read lengths. I have F3 and R3 reads, so I am assuming it is a circularization protocol, but I do not know what the size selection parameters were, or how the circles were cut to produce the final fragments. This info would be very valuable for a more accurate mapping.

Any knowledge would be greatly appreciated!

Thanks a lot,

Carmen
Old 06-13-2013, 11:07 PM   #47
naveedakhtar
Junior Member
 
Location: Pakistan

Join Date: Jun 2013
Posts: 2

Quote:
Originally Posted by ECO View Post
biocc,

"paired end" or "mate pair" refers to how the library is made, and then how it is sequenced. Both are methodologies that, in addition to the sequence information, give you information about the physical distance between the two reads in your genome.

For example, you shear up some genomic DNA, and cut a region out at ~500bp. Then you prepare your library, and sequence 35bp from each end of each molecule. Now you have three pieces of information:

--the tag 1 sequence
--the tag 2 sequence
--that they were about 500bp apart in your genome

This gives you the ability to map to a reference (or de novo for that matter) using that distance information. It helps dramatically to resolve larger structural rearrangements (insertions, deletions, inversions), as well as helping to assemble across repetitive regions.

Structural rearrangements can be deduced when your read pairs map to a reference at a distance that is substantially different from how that library was constructed (~500bp in the above example). Let's say you had two reads that mapped to your reference 1000bp apart; this suggests there has been a deletion between those two sequence reads within your genome. The same goes for an insertion: if your reads mapped 100bp apart on the reference, this suggests that your genome has an insertion.
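As a rough illustration of the distance reasoning in the paragraph above, here is a small sketch that labels a read pair by comparing its mapped distance with the expected library size. The tolerance of 100bp and the function name are assumptions made for the example; real tools estimate the expected range from the data.

```python
def classify_pair(mapped_distance, expected=500, tolerance=100):
    """Label a read pair from the distance between its mates on the reference.

    expected and tolerance are illustrative values, not taken from any tool.
    """
    if mapped_distance > expected + tolerance:
        # Mates are farther apart on the reference than in the sample:
        # sequence present in the reference is missing from the sample.
        return "possible deletion in sample"
    if mapped_distance < expected - tolerance:
        # Mates are closer on the reference than in the sample:
        # the sample carries extra sequence between them.
        return "possible insertion in sample"
    return "consistent with library size"


for distance in (1000, 100, 480):
    print(distance, classify_pair(distance))
```

A single discordant pair is weak evidence; in practice one would look for several pairs supporting the same event.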

Mapping over repeats is similar...if one read is unmappable because it falls in a very repetitive region (eg. LINE, LTR, SINE), but the other is unique, you can again use that distance information to map both reads. The first read would likely come from the repeat that is ~500bp away from your unique second read.

Hope that helps. It's a weird concept at first, but very useful for all types of sequencing. It's been around at some levels since the days of shotgun sequencing.

And lastly, the terminology between "paired end" and "mate pair" is typically that "paired end" refers to sequencing both ends of the same molecule, while "mate pair" (in ABI's case) refers to sequencing only two tags (made by Type IIS restriction enzymes a la SAGE) from the ends of a typically much larger molecule. I could be wrong here though...
How can paired-end sequencing detect an inversion? You mentioned it alongside the detection of structural rearrangements.
Old 06-13-2013, 11:18 PM   #48
naveedakhtar
Junior Member
 
Location: Pakistan

Join Date: Jun 2013
Posts: 2
Default

My other question: in whole-genome shotgun assembly, paired-end sequencing of specially prepared large-insert clones is used as a strategy to overcome the assembly problems caused by repetitive sequences in complex genomes. How does paired-end sequencing achieve this in the absence of a reference genome?
Old 06-14-2013, 03:56 AM   #49
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 659
Default what is a paired-end read?

Quote:
Originally Posted by naveedakhtar View Post
My other question: in whole-genome shotgun assembly, paired-end sequencing of specially prepared large-insert clones is used as a strategy to overcome the assembly problems caused by repetitive sequences in complex genomes. How does paired-end sequencing achieve this in the absence of a reference genome?
You use a program that does de novo assembly (Velvet, ABySS, MIRA, SOAPdenovo, many others). If there is a reference genome of a related species you can use that as a reference for the assembly. Having paired reads can help to scaffold the contigs.
Old 08-15-2013, 07:37 AM   #50
OTU
Member
 
Location: Utah

Join Date: May 2013
Posts: 44
Default

Hi all!

I have a question. In some old papers (1999) I found the term "forward-reverse constraints". Is this term the same as "paired-end reads"?

OTU
Old 08-19-2013, 06:56 AM   #51
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

No. Not the same as 'paired-end reads' although it has to do with paired-end. Google the term. That should be enlightening.
Old 05-21-2014, 05:29 PM   #52
binlangman
Member
 
Location: China

Join Date: Dec 2013
Posts: 11
Default what is the paired-end distance?

I have been reading papers that mention 'the paired-end distance' many times. What is the paired-end distance?
Example:
If |-----75----|----------------------100-----------------|-----75-----|,
and both paired-end reads are 75bp, is the paired-end distance in this case 100, 250, or something else?

Thanks!
Old 05-22-2014, 01:41 AM   #53
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 659
Default

It could be either 250 or 100.

Different software packages may have their own definition of 'insert length', so it's best to read the documentation carefully.

For example, in the case you have illustrated, velvet would define the 'insert length' as 250.

This is the definition given in the velvet manual:

"The insert length is understood to be the length of the sequenced fragment, i.e. it includes the length of the reads themselves."

https://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf

Bowtie2 also uses the same definition of fragment length.
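To make the two candidate definitions concrete, here is a small sketch using the numbers from the diagram in the question (two 75bp reads separated by 100bp). The function names are made up for this example.

```python
def fragment_length(read1_len, inner_distance, read2_len):
    """Length of the whole sequenced fragment, reads included.

    This matches the definition quoted from the Velvet manual above.
    """
    return read1_len + inner_distance + read2_len


def inner_distance(fragment_len, read1_len, read2_len):
    """Gap between the two reads, i.e. the fragment minus both reads."""
    return fragment_len - read1_len - read2_len


total = fragment_length(75, 100, 75)
print(total)
print(inner_distance(total, 75, 75))
```

Which of the two numbers a program expects depends on its own definition, so the parameter description in its manual is the thing to check.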
Old 10-21-2015, 08:07 AM   #54
mido1951
Senior Member
 
Location: Tunisia

Join Date: May 2014
Posts: 123
Default

How do I do an assembly if I have paired-end reads (two files, R1.fq and R2.fq)?
Thank you
Old 10-21-2015, 08:15 AM   #55
OTU
Member
 
Location: Utah

Join Date: May 2013
Posts: 44
Default

What kind of data do you have: a metagenome or a single genome?
Which sequencing platform did you use? How much computing power do you have available?
Old 10-21-2015, 08:48 AM   #56
mido1951
Senior Member
 
Location: Tunisia

Join Date: May 2014
Posts: 123
Default

I have Illumina paired-end data.
I want to make an assembly from these data.
My problem is that I do not understand the two files, F1.fq and F2.fq.
Are the reads in F1.fq and the reads in F2.fq complementary or not?
For the assembly, do I have to overlap only the reads in F1.fq, or the reads in both F1.fq and F2.fq?
Thank you
Old 10-21-2015, 09:02 AM   #57
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,436
Default

Quote:
Originally Posted by mido1951 View Post
I have Illumina paired-end data.
I want to make an assembly from these data.
My problem is that I do not understand the two files, F1.fq and F2.fq.
Are the reads in F1.fq and the reads in F2.fq complementary or not?
For the assembly, do I have to overlap only the reads in F1.fq, or the reads in both F1.fq and F2.fq?
Thank you
Cross-posted: https://www.biostars.org/p/162806/

@mido1951: See this page (https://en.wikipedia.org/wiki/Shotgun_sequencing) for a simple explanation of "shotgun sequencing". In the past, people used Sanger sequencing for this; it has now been replaced by NGS.

R1/R2 are merely sequences from the two ends of a fragment. They do not need to be complementary (in fact, in most cases they will not be). You do not need to worry about R1/R2 reads individually; use them as a set for assembly.
Old 10-22-2015, 01:53 PM   #58
mido1951
Senior Member
 
Location: Tunisia

Join Date: May 2014
Posts: 123
Default

For example:
we have the sequence S1: ATCGTTGAGCAGACT and the sequence S2: TGAGCAGACTTAAGTAGTTTT.
Suppose the reads sequenced from S1 are: R1 = ATCGTTGAG
R2 = AGTCTGCTC (reverse complement of the right end)
and from S2: R1 = TGAGCAGAC
R2 = AAAACTACT (reverse complement of the right end)
So we have the two paired-end files:
F1.fq:
S1: R1=ATCGTTGAG
S2: R1=TGAGCAGAC
F2.fq:
S1: R2=AGTCTGCTC
S2: R2=AAAACTACT

In the assembly there is an overlap between R1(S1) and R1(S2).
In an assembly, can we also have an overlap between R1 and R2 from two different sequences?
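The reads in the example above can be reproduced with a short sketch: R1 is taken from the left end of the fragment as-is, and R2 is the reverse complement of the right end. The function names and the fixed read length of 9 are choices made for this illustration.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_reads(fragment, read_len):
    """Return (R1, R2) for a fragment: R1 from the left end, R2 from the
    right end, read from the opposite strand."""
    r1 = fragment[:read_len]
    r2 = reverse_complement(fragment[-read_len:])
    return r1, r2


for name, fragment in (("S1", "ATCGTTGAGCAGACT"), ("S2", "TGAGCAGACTTAAGTAGTTTT")):
    r1, r2 = paired_reads(fragment, 9)
    print(name, "R1 =", r1, "R2 =", r2)
```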

Last edited by mido1951; 10-22-2015 at 01:55 PM.
Old 10-22-2015, 01:58 PM   #59
OTU
Member
 
Location: Utah

Join Date: May 2013
Posts: 44
Default

I don't see why
"F1.fq:
S1: R1=ATCGTTGAG
S2: R1=TGAGCAGAC " would overlap. They only match over 4 bp (TGAG). Assemblers won't combine them.
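For reference, a suffix-prefix overlap between two reads can be measured with a brute-force sketch like the one below. It is only an illustration of the idea, not how any particular assembler works; real assemblers use indexes or k-mer graphs, require a minimum overlap length and tolerate mismatches. Comparing a read against the reverse complement of another read is one way to relate reads that come from opposite strands, such as an R1 and an R2.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_length(a, b):
    """Length of the longest suffix of a that is also a prefix of b (exact match)."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


r1_s1 = "ATCGTTGAG"
r1_s2 = "TGAGCAGAC"
r2_s1 = "AGTCTGCTC"

# Overlap between the two R1 reads on the same strand.
print(overlap_length(r1_s1, r1_s2))

# Overlap between R1 of S2 and R2 of S1 once R2 is flipped to the same strand.
print(overlap_length(r1_s2, reverse_complement(r2_s1)))
```

A minimum overlap threshold (a hypothetical parameter here) would normally be applied before two reads are joined, which is why very short matches are ignored.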
Old 10-22-2015, 02:01 PM   #60
mido1951
Senior Member
 
Location: Tunisia

Join Date: May 2014
Posts: 123
Default

I am speaking of an example.
Assume that it is an overlap.
I want to create an assembly tool, but first I need to know how to detect overlaps between paired-end reads (from the two files) and how to do an assembly with paired-end data.