SEQanswers

Old 07-31-2011, 09:41 PM   #1
ashwatha
Member
 
Location: India

Join Date: Jul 2011
Posts: 14
Question about paired end reads

Hi,

I have a slightly weird question about paired end reads. I will try to explain as best as I can:

For simplicity, let's assume that the read length is just 3 base pairs. Let the DNA fragment being read have the sequence AGCTAAGGTCG.

With paired end reads, my understanding is that we will read the first three (AGC) and last three (TCG) bases of this sequence, with the middle section (TAAGG) unknown.

With the common data formats used to represent paired end reads (FastQ etc), how is the pair represented? Are the two pairs shown as AGC and TCG (both reads running left to right on the original sequence) or as AGC and GCT - the "left" read running from left to right and the "right" read running from right to left, presumably the direction in which the two reads were extracted?

I guess what I am asking is: Is there a directionality to the reads? Are all reads represented in the same "direction" as related to the genome from which they were extracted? Does this apply to the two pairs of a paired end read?

Please let me know if I am not making any sense at all :-)

Last edited by ashwatha; 07-31-2011 at 09:54 PM. Reason: grammar
Old 08-02-2011, 07:30 AM   #2
sphil
Senior Member
 
Location: Stuttgart, Germany

Join Date: Apr 2010
Posts: 192
Default

Quote:
Originally Posted by ashwatha View Post
Is there a directionality to the reads?
Yes, there is. Normally it is indicated for each read by a '+' or '-', W(atson)/C(rick), or F(orward)/R(everse), so you can distinguish between reads from the two strands.


Quote:
Originally Posted by ashwatha View Post
Are all reads represented in the same "direction" as related to the genome from which they were extracted? Does this apply to the two pairs of a paired end read?
No, all reads are considered to be written from left to right. The strand flag should make clear which strand the read originated from.
As for how to find mate pairs in the sequence file: usually in the FASTQ file there is a flag at the end of the header line (normally '/1' or '/2') which indicates whether it is a 'front' or an 'end' read. Coming back to your example, it should look like this:
>Read1 more headerinfo /1
AGC
>Read2 more headerinfo /2
TCG

A nice overview of all this can be found at http://en.wikipedia.org/wiki/FASTQ_format , for instance.
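The header convention described above can be sketched in Python. This is only an illustration: the function name `split_mate` and both regular expressions are made up here, and they assume the two common layouts, an old-style name ending directly in `/1` or `/2`, and the newer Illumina style (CASAVA 1.8 and later, as far as I know) where `1:N:0:...` or `2:N:0:...` follows a space after the name. Headers in any other layout simply come back with no mate number.

```python
import re

# Old-style names end in /1 or /2, attached directly to the read name.
OLD_STYLE = re.compile(r"^(?P<name>\S+)/(?P<mate>[12])$")
# Newer Illumina names carry "<mate>:<filtered Y/N>:<control>:<index>" after a space.
NEW_STYLE = re.compile(r"^(?P<name>\S+)\s+(?P<mate>[12]):[YN]:\d+:\S*$")

def split_mate(header):
    """Return (fragment_name, mate_number) for a FASTQ header line.

    mate_number is 1 or 2, or None if the header matches neither layout.
    """
    text = header.lstrip("@").strip()
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(text)
        if match:
            return match.group("name"), int(match.group("mate"))
    return text, None

print(split_mate("@frag42/2"))
print(split_mate("@M00569:20:000000000-A3EGF:1:1101:14488:1761 1:N:0:1"))
```

Two records belong to the same pair when the returned fragment names are equal and the mate numbers are 1 and 2.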

hope that helps,

best

phil
Old 08-02-2011, 09:30 AM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by ashwatha View Post
Hi,

I have a slightly weird question about paired end reads. I will try to explain as best as I can:

For simplicity, let's assume that the read length is just 3 base pairs. Let the DNA fragment being read have the sequence AGCTAAGGTCG.

With paired end reads, my understanding is that we will read the first three (AGC) and last three (TCG) bases of this sequence, with the middle section (TAAGG) unknown.

With the common data formats used to represent paired end reads (FastQ etc), how is the pair represented? Are the two pairs shown as AGC and TCG (both reads running left to right on the original sequence) or as AGC and GCT - the "left" read running from left to right and the "right" read running from right to left, presumably the direction in which the two reads were extracted?

I guess what I am asking is: Is there a directionality to the reads? Are all reads represented in the same "direction" as related to the genome from which they were extracted? Does this apply to the two pairs of a paired end read?

Please let me know if I am not making any sense at all :-)
Here's a real example from a Staph aureus run we did a few weeks ago. The first is from read 1, the second is from read 2.

Quote:
@I-HWUSI-EAS1826:5:70N3AAAXX_FL:8:4:16707:8219 1:N:0:CGATGT
ATACATCCTCATTTCTCACTAATTTATTTCTGTTAAAATATTAAAACTAACATGATCCAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

@I-HWUSI-EAS1826:5:70N3AAAXX_FL:8:4:16707:8219 2:N:0:CGATGT
AATTACAGCGAAGGATTTATTAGAAAATATGCAAGCGTAGTAAATATTGAACCTAACCAA
+
IIIIIIIIIIIIIIIHIHIIIHIIIIIIIIIHIIIIIHIIIIIIIHIIIIIIIIIIGIII
If you blast those, you'll see that they run in opposite directions, and towards each other, as a proper paired end pair of reads should.

So actually, in your example, your reads would be AGC and CGA. Most alignment programs would report them both in the forward direction, and have a tag in there to tell you that the read is rev comped where appropriate.
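As a minimal Python sketch of that point, using the toy fragment and 3-base read length from the original question. The function names are invented for illustration; this is not what any sequencer or aligner actually runs, just the string manipulation that gives the same result.

```python
# Read 1 is read from the 5' end of one strand; read 2 is read from the 5' end
# of the opposite strand, so it is the reverse complement of the fragment's
# last bases.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def paired_end_reads(fragment, read_length):
    """Return (read1, read2) as they would appear in the two FASTQ files."""
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2

read1, read2 = paired_end_reads("AGCTAAGGTCG", 3)
print(read1, read2)  # AGC CGA
```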
Old 08-02-2011, 08:31 PM   #4
ashwatha
Member
 
Location: India

Join Date: Jul 2011
Posts: 14
Default

Hi Phil and swbarnes,

thanks for the info - very helpful.
Old 08-03-2011, 06:31 PM   #5
chenyao
Member
 
Location: Beijing

Join Date: Jul 2011
Posts: 74
Default

Quote:
Originally Posted by swbarnes2 View Post
Here's a real example from a Staph aureus run we did a few weeks ago. The first is from read 1, the second is from read 2.



If you blast those, you'll see that they run in opposite directions, and towards each other, as a proper paired end pair of reads should.

So actually, in your example, your reads would be AGC and CGA. Most alignment programs would report them both in the forward direction, and have a tag in there to tell you that the read is rev comped where appropriate.
I don't get it. Do paired-end reads have to come from opposite directions (one is "+", the other is "-")? If so, why does your example show both reads as "+"?
Old 08-03-2011, 07:41 PM   #6
Stuart Inglis
Registered Vendor
 
Location: New Zealand

Join Date: Jun 2011
Posts: 9
A little forward/reverse and paired end example

I thought maybe a little example would help (using the RTG Investigator tool chain, of course).

I grabbed a bit of the sequence above and manually made two reads. Two 10-mers, the first forward from the beginning of the sequence and second reverse complement from the end of the sequence. e.g. I grabbed the last 10-mer (CATGATCCAT) and reverse complemented it to get ATGGATCATG.

$ cat template.fasta
>test
ATACATCCTCATTTCTCACTAATTTATTTCTGTTAAAATATTAAAACTAACATGATCCAT

$ cat reads.fasta
>read1
ATACATCCTC
>read2
ATGGATCATG

$ rtg format -o t template.fasta

Run a single end mapping run:

$ rtg map -o o -i reads.fasta -F fasta -t t
$ zcat o/alignments.sam.gz | grep -v "@"
0 0 test 1 37 10= * 0 0 ATACATCCTC * AS:i:0 NM:i:0 IH:i:1 NH:i:1
1 16 test 51 37 10= * 0 0 CATGATCCAT * AS:i:0 NM:i:0 IH:i:1 NH:i:1

The second column of the SAM file is the FLAG field. Bit 0x10 (16 in decimal) is set if the read maps to the reverse strand.

The SAM file contains the read in the forward direction (same as the template sequence), but this extra flag allows you to determine the direction.
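To get back from a SAM record to the read as it appeared in the input file, one can test that bit and undo the reverse complement. A small Python sketch, with `original_read` being a name made up for this example:

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def original_read(flag, seq):
    """Return the read as the sequencer reported it, given SAM FLAG and SEQ."""
    if flag & 0x10:  # 0x10: SEQ is stored reverse complemented
        return seq.translate(COMPLEMENT)[::-1]
    return seq

# The two alignments from the single-end run above.
print(original_read(0, "ATACATCCTC"))   # ATACATCCTC
print(original_read(16, "CATGATCCAT"))  # ATGGATCATG
```

The second call recovers read2 exactly as it was written in reads.fasta.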

In the paired end world this may look like:

$ cat left.fasta
>read
ATACATCCTC

$ cat right.fasta
>read
ATGGATCATG


Then run a paired-end mapping run:

$ rtg map -o o -l left.fasta -r right.fasta -F fasta -t t
$ zcat o/mated.sam.gz | grep -v "@"
0 99 test 1 55 10= = 51 60 ATACATCCTC * AS:i:0 NM:i:0 MQ:i:255 XA:i:0 IH:i:1 NH:i:1
0 147 test 51 55 10= = 1 -60 CATGATCCAT * AS:i:0 NM:i:0 MQ:i:255 XA:i:0 IH:i:1 NH:i:1


The second column is harder to decode now. 99 and 147 mean mapped in correct orientation and correct insert size. For a breakdown of the two codes see http://ppotato.wordpress.com/2010/08...-paired-reads/
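The two codes can also be unpacked by hand, since the FLAG is just a sum of bits defined in the SAM specification. A short Python sketch follows; the short labels are my own wording, not official names.

```python
# (bit value, label) pairs for the SAM FLAG field.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]

print(decode_flag(99))   # ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(decode_flag(147))  # ['paired', 'proper_pair', 'reverse_strand', 'second_in_pair']
```

So 99 = 1 + 2 + 32 + 64 and 147 = 1 + 2 + 16 + 128: the first read is forward with its mate reversed, and the second read is the reversed one.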

Hope this helps.

cheers
Stu
__________________
Stuart Inglis, Ph.D.
Real Time Genomics
www.realtimegenomics.com
Old 08-03-2011, 10:46 PM   #7
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by chenyao View Post
I don't get it. Do paired-end reads have to come from opposite directions (one is "+", the other is "-")? If so, why does your example show both reads as "+"?
It's a FASTQ file; it hasn't been mapped. The software that made it has no idea whether a read is in the forward or reverse direction. It doesn't even know what reference I want to align it to.

The plus is just a placeholder. In the old days, before FASTQ files routinely had several million individual entries per file, the name of the read was rewritten after the + sign. Once FASTQ files started having millions of 40-mers and their 40-character quality scores, repeating the read name made each record 25% bigger than it had to be, so now no one writes anything after that plus sign.
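To make the role of that line concrete, here is a rough Python sketch of a FASTQ reader. It assumes the usual four-lines-per-record layout (no wrapped sequence lines), and `parse_fastq` is a name invented for this example. The third line is only checked for a leading '+'; whatever follows it is ignored, and strand never enters into it.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from an iterable of FASTQ lines."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of four non-blank lines")
    for i in range(0, len(stripped), 4):
        header, seq, separator, qual = stripped[i:i + 4]
        # The '+' line is a separator, optionally followed by the read name.
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at non-blank line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual

example = [
    "@frag/1\n", "AGC\n", "+\n", "III\n",         # bare '+' (modern style)
    "@frag/2\n", "CGA\n", "+frag/2\n", "IIH\n",   # name repeated (old style)
]
print(list(parse_fastq(example)))
# [('frag/1', 'AGC', 'III'), ('frag/2', 'CGA', 'IIH')]
```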

And if you do a standard paired end read, then yes, the reads should point in at each other. I think that with mate-pair reads, which are a more complex prep intended to greatly increase the genomic distance between the two ends, the reads point outward, but I might be mistaken on that point.

If you have paired end reads that don't point in at each other, then you have inaccurate reads, or an inaccurate reference as compared to the sample.
Old 02-14-2012, 08:21 PM   #8
Arthur123
Junior Member
 
Location: TX

Join Date: Feb 2012
Posts: 2
Default

Quote:
Originally Posted by swbarnes2 View Post
It's a FASTQ file; it hasn't been mapped. The software that made it has no idea whether a read is in the forward or reverse direction. It doesn't even know what reference I want to align it to.

.
So for Illumina paired-end data, read 1 and read 2 do not denote forward and reverse, right?
Old 02-15-2012, 08:30 AM   #9
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 274
Default

Quote:
Originally Posted by swbarnes2 View Post
I think that with mate-pair reads, which are a more complex prep intended to greatly increase the genomic distance between the two ends, the reads point outward, but I might be mistaken on that point.
Yes, that is my understanding as well. Paired-end reads are "innie" and mate pairs are "outie." Sanger paired ends are generated from a completely different process (sequencing the ends of BAC clones) and the result is that those paired ends are "outie." This leads to a lot of confusion when using a mix of technologies, or using software that expects your paired ends in a certain orientation.
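For anyone checking what orientation their mapped pairs actually have, the "innie"/"outie" labels can be derived from the mapping position and strand of each mate. The Python below is a simplified sketch under stated assumptions: both mates map to the same reference, positions are leftmost coordinates as in the SAM POS column, and the function name and labels are made up for this example. It does not handle heavily overlapping pairs specially.

```python
def pair_orientation(pos1, is_reverse1, pos2, is_reverse2):
    """Classify a mapped pair as 'innie', 'outie', or 'same-strand'."""
    if is_reverse1 == is_reverse2:
        return "same-strand"
    # Strand of whichever mate sits leftmost on the reference.
    left_is_reverse = is_reverse1 if pos1 <= pos2 else is_reverse2
    # Leftmost mate on the forward strand: the reads point toward each other.
    return "outie" if left_is_reverse else "innie"

# The mated pair from the RTG example earlier in the thread:
# read 1 at position 1 (forward), read 2 at position 51 (reverse).
print(pair_orientation(1, False, 51, True))  # innie
```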
Old 02-15-2012, 09:34 AM   #10
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by Arthur123 View Post
So for Illumina paired-end data, read 1 and read 2 do not denote forward and reverse, right?
The enzymes putting the adaptors on the piece of DNA have no idea which way your particular reference is oriented, and have no way of distinguishing which end of the DNA corresponds to the "forward" sequence. They are just molecules.

The only exception would be if you were doing something like a library of vectors with various insert sequences, and you wanted to know all the insert sequences. One could do PCR around those inserts, and put adaptor sequences on those PCR primers, and then adaptor 1 would be fixed at one point in the vector, and adaptor 2 would be fixed at the other end.

But if you are just randomly cutting DNA, then half of read 1 will be in one orientation, half will be in the other. Same with read 2.
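A toy simulation of that last statement, as a Python sketch only: the reference is random sequence, the fragment and read lengths are arbitrary, and the 50/50 strand choice is the assumption being illustrated, not something measured. All names here are invented for the example.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def simulate_pair(reference, fragment_length, read_length, rng):
    """Return (read1, read2, read1_is_forward) for one randomly cut fragment."""
    start = rng.randrange(len(reference) - fragment_length + 1)
    fragment = reference[start:start + fragment_length]
    # Assumption: adaptor 1 lands on either end of the fragment with equal odds.
    read1_is_forward = rng.random() < 0.5
    if not read1_is_forward:
        fragment = reverse_complement(fragment)
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2, read1_is_forward

rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(1000))
pairs = [simulate_pair(reference, 200, 50, rng) for _ in range(10000)]
forward_fraction = sum(is_fwd for _, _, is_fwd in pairs) / len(pairs)
print(f"read 1 on forward strand: {forward_fraction:.1%}")
```

In every simulated pair exactly one mate matches the reference as written and the other matches its reverse complement, but which file (read 1 or read 2) holds the forward mate varies from pair to pair.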
Old 02-15-2012, 12:14 PM   #11
Arthur123
Junior Member
 
Location: TX

Join Date: Feb 2012
Posts: 2
Default

Quote:
Originally Posted by swbarnes2 View Post
The enzymes putting the adaptors on the piece of DNA have no idea which way your particular reference is oriented, and have no way of distinguishing which end of the DNA corresponds to the "forward" sequence. They are just molecules.

The only exception would be if you were doing something like a library of vectors with various insert sequences, and you wanted to know all the insert sequences. One could do PCR around those inserts, and put adaptor sequences on those PCR primers, and then adaptor 1 would be fixed at one point in the vector, and adaptor 2 would be fixed at the other end.

But if you are just randomly cutting DNA, then half of read 1 will be in one orientation, half will be in the other. Same with read 2.
Thanks! You are awesome!
Old 07-18-2013, 03:58 PM   #12
fongchun
Member
 
Location: Vancouver, BC

Join Date: May 2011
Posts: 55
Default

Quote:
Originally Posted by sphil View Post
No, all reads are considered to be written from left to right. The strand flag should make clear which strand the read originated from.
As for how to find mate pairs in the sequence file: usually in the FASTQ file there is a flag at the end of the header line (normally '/1' or '/2') which indicates whether it is a 'front' or an 'end' read. Coming back to your example, it should look like this:
>Read1 more headerinfo /1
AGC
>Read2 more headerinfo /2
TCG
A nice overview of all this can be found at http://en.wikipedia.org/wiki/FASTQ_format , for instance.

hope that helps,

best

phil
Hi,

Are you sure about this? Because I have two paired fastq files from a MiSeq machine and here is the read pair:

Read Pair 1:
Quote:
@M00569:20:000000000-A3EGF:1:1101:14488:1761 1:N:0:1
ACAGAATGTAAGCTTTCTAACTCATAAAACTCTTTCTGGAGGTCTGTAATTTTCTGCATAGGATCTTCATAAATCTGTTCTGAAAGTCTTATCTTTTGCTCTCTTCCTTTCTGCTGCATAAATCCATTTTCTTCTTCTTGCCTTGTTAGCA
+
>>>334DBDB55EGGGGG65FGGBG5555FGHHHHHHFFBA?EFGFHEFGHHHHHBFBHBBB3FGHHHFHFBBFGHBFHHHE5E3BFGHH5GGHHHHFDHGFHHHHHHHHHHFFHBG3F43EFGHFHHHHHHFHHHHHHBFGHF3GGF4F4
Read Pair 2:
Quote:
@M00569:20:000000000-A3EGF:1:1101:14488:1761 2:N:0:1
NNGGGATGCTAATAGAGGATTATATTTATGAATCTTTAGTAGAAGACACGTACAATGGATCGGTAGATGGCAGTCTGCTAACAAGGCAAGAAGAAGAAAATGGATTTATGCAGCAGAAAGGAAGAGAGCAAAAGATAAGACTTTCAGAACA
+
##1111>1>D33B331111BFBGBGHHFHBFFFGGGHC1FB2B21CFBFCHFG?1FBB1FF//EA/AFDBG0EGGHFFHFFFFBGEFA0C00C10>BCCBGBB1FGHFGFGFFFF0C01CE0CAAG0>GHCBFFAHFFHEHHGHHBB2FF0
Here the second read of the pair is actually the reverse complement of the reference human sequence at that locus. So in the example that was given, I would have thought it would be:

>Read1 more headerinfo /1
AGC
>Read2 more headerinfo /2
CGA

Perhaps I am mistaken?
Old 07-19-2013, 01:15 AM   #13
sphil
Senior Member
 
Location: Stuttgart, Germany

Join Date: Apr 2010
Posts: 192
Default

As swbarnes2 stated above, the reads are just as you said. I only wanted to point out that every read is written from left to right, and therefore didn't mention that the second read is also reverse complemented. So your second read should always be reverse complemented relative to the strand the 'first' read maps to.

Maybe this http://www.illumina.com/technology/p...ing_assay.ilmn helps to clarify things for good.

Quote:
Originally Posted by swbarnes2 View Post
So actually, in your example, your reads would be AGC and CGA. Most alignment programs would report them both in the forward direction, and have a tag in there to tell you that the read is rev comped where appropriate.
Old 07-23-2017, 05:14 AM   #14
serrano.gus
Junior Member
 
Location: Philippines

Join Date: Mar 2017
Posts: 3
Default

Hi. I was given raw reads by a service provider but there were no Left or Right reads. Is there any way that I could revert back to separate R and L?
Old 07-23-2017, 05:22 AM   #15
serrano.gus
Junior Member
 
Location: Philippines

Join Date: Mar 2017
Posts: 3
Default

Please ignore my question. I already found the paired reads. Thanks.

Tags
paired end read, sequencing
