
  • Question about paired end reads

    Hi,

    I have a slightly weird question about paired end reads. I will try to explain as best as I can:

    For simplicity, let's assume that the read length is just 3 base pairs. Let the DNA fragment being read have the sequence AGCTAAGGTCG.

    With paired end reads, my understanding is that we will read the first three (AGC) and last three (TCG) bases of this sequence, with the middle section (TAAGG) unknown.

    With the common data formats used to represent paired end reads (FastQ etc), how is the pair represented? Are the two pairs shown as AGC and TCG (both reads running left to right on the original sequence) or as AGC and GCT - the "left" read running from left to right and the "right" read running from right to left, presumably the direction in which the two reads were extracted?

    I guess what I am asking is: Is there a directionality to the reads? Are all reads represented in the same "direction" as related to the genome from which they were extracted? Does this apply to the two pairs of a paired end read?

    Please let me know if I am not making any sense at all :-)
    Last edited by ashwatha; 07-31-2011, 08:54 PM. Reason: grammar

  • #2
    Originally posted by ashwatha View Post
    Is there a directionality to the reads?
    Yes, there is. Normally it is indicated for each read by a '+' or '-', W(atson)/C(rick), or F(orward)/R(everse), so you can distinguish between reads from the two strands.


    Originally posted by ashwatha View Post
    Are all reads represented in the same "direction" as related to the genome from which they were extracted? Does this apply to the two pairs of a paired end read?
    No, all reads are considered to be written from left to right. The strand flag should make clear which strand the read originated from.
    As for how to find mate pairs in the sequence file: usually the header line in the fastq file ends with a flag (normally '/1' or '/2') that indicates whether it is a 'front' or an 'end' read. Coming back to your example, it should look like this:
    >Read1 more headerinfo /1
    AGC
    >Read2 more headerinfo /2
    TCG

    A nice overview of all this can be found at http://en.wikipedia.org/wiki/FASTQ_format, for instance.

    hope that helps,

    best

    phil



    • #3
      Originally posted by ashwatha View Post
      Hi,

      I have a slightly weird question about paired end reads. I will try to explain as best as I can:

      For simplicity, let's assume that the read length is just 3 base pairs. Let the DNA fragment being read have the sequence AGCTAAGGTCG.

      With paired end reads, my understanding is that we will read the first three (AGC) and last three (TCG) bases of this sequence, with the middle section (TAAGG) unknown.

      With the common data formats used to represent paired end reads (FastQ etc), how is the pair represented? Are the two pairs shown as AGC and TCG (both reads running left to right on the original sequence) or as AGC and GCT - the "left" read running from left to right and the "right" read running from right to left, presumably the direction in which the two reads were extracted?

      I guess what I am asking is: Is there a directionality to the reads? Are all reads represented in the same "direction" as related to the genome from which they were extracted? Does this apply to the two pairs of a paired end read?

      Please let me know if I am not making any sense at all :-)
      Here's a real example from a Staph aureus run we did a few weeks ago. The first is from read 1, the second is from read 2.

      @I-HWUSI-EAS1826:5:70N3AAAXX_FL:8:4:16707:8219 1:N:0:CGATGT
      ATACATCCTCATTTCTCACTAATTTATTTCTGTTAAAATATTAAAACTAACATGATCCAT
      +
      IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

      @I-HWUSI-EAS1826:5:70N3AAAXX_FL:8:4:16707:8219 2:N:0:CGATGT
      AATTACAGCGAAGGATTTATTAGAAAATATGCAAGCGTAGTAAATATTGAACCTAACCAA
      +
      IIIIIIIIIIIIIIIHIHIIIHIIIIIIIIIHIIIIIHIIIIIIIHIIIIIIIIIIGIII
      If you blast those, you'll see that they run in opposite directions, and towards each other, as a proper paired end pair of reads should.

      So actually, in your example, your reads would be AGC and CGA. Most alignment programs would report them both in the forward direction, and have a tag in there to tell you that the read is rev comped where appropriate.
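To make the AGC / CGA answer concrete, here is a minimal Python sketch of an idealized paired-end read pair taken from the toy fragment in the question. The function names `reverse_complement` and `paired_end_reads` are made up for this illustration, and the model ignores adaptors, quality scores, and sequencing errors.

```python
# Complement table for unambiguous DNA bases (N maps to itself).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Simulate an idealized paired-end read pair from one fragment.

    Read 1 is the first `read_length` bases of the fragment as given.
    Read 2 is the reverse complement of the last `read_length` bases,
    because it is sequenced from the opposite strand, reading inward.
    """
    if read_length > len(fragment):
        raise ValueError("read_length exceeds fragment length")
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment[-read_length:])
    return read1, read2


if __name__ == "__main__":
    r1, r2 = paired_end_reads("AGCTAAGGTCG", 3)
    print(r1, r2)  # AGC CGA
```

The last three bases of the fragment are TCG, and their reverse complement is CGA, which is what would appear in the read 2 FASTQ file.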



      • #4
        Hi Phil and swbarnes,

        thanks for the info - very helpful.



        • #5
          Originally posted by swbarnes2 View Post
          Here's a real example from a Staph aureus run we did a few weeks ago. The first is from read 1, the second is from read 2.



          If you blast those, you'll see that they run in opposite directions, and towards each other, as a proper paired end pair of reads should.

          So actually, in your example, your reads would be AGC and CGA. Most alignment programs would report them both in the forward direction, and have a tag in there to tell you that the read is rev comped where appropriate.
          I don't get it. Do paired-end reads have to come from opposite directions (one is "+", the other is "-")? If so, why does your example show both reads as "+"?



          • #6
            A little forward/reverse and paired end example

            I thought maybe a little example would help (using the RTG Investigator tool chain, of course).

            I grabbed a bit of the sequence above and manually made two reads: two 10-mers, the first forward from the beginning of the sequence and the second the reverse complement of the end of the sequence. That is, I grabbed the last 10-mer (CATGATCCAT) and reverse complemented it to get ATGGATCATG.

            $ cat template.fasta
            >test
            ATACATCCTCATTTCTCACTAATTTATTTCTGTTAAAATATTAAAACTAACATGATCCAT

            $ cat reads.fasta
            >read1
            ATACATCCTC
            >read2
            ATGGATCATG

            $ rtg format -o t template.fasta

            Run a single end mapping run:

            $ rtg map -o o -i reads.fasta -F fasta -t t
            $ zcat o/alignments.sam.gz | grep -v "@"
            0 0 test 1 37 10= * 0 0 ATACATCCTC * AS:i:0 NM:i:0 IH:i:1 NH:i:1
            1 16 test 51 37 10= * 0 0 CATGATCCAT * AS:i:0 NM:i:0 IH:i:1 NH:i:1

            The second column of the SAM file is a bitwise flag. One of its bits (0x10, which equals 16 in decimal) is set if the read maps to the reverse strand.

            The SAM file contains the read in the forward direction (same as the template sequence), but this extra flag allows you to determine the direction.
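As a rough illustration of the point above, the following Python sketch tests the 0x10 bit and recovers the read as it appeared in reads.fasta. The helper name `original_read` is hypothetical; the (flag, SEQ) values are copied from the two single-end SAM records shown above.

```python
REVERSE_STRAND = 0x10  # SAM flag bit: SEQ is stored reverse complemented

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def original_read(flag, seq):
    """Return the read as it was in the input file.

    SAM stores SEQ on the forward strand of the reference, so a record
    with the 0x10 bit set has to be reverse complemented to get the
    sequencer's version of the read back.
    """
    if flag & REVERSE_STRAND:
        return seq.translate(_COMPLEMENT)[::-1]
    return seq


# (flag, SEQ) taken from the single-end mapping output above.
records = [(0, "ATACATCCTC"), (16, "CATGATCCAT")]
for flag, seq in records:
    strand = "-" if flag & REVERSE_STRAND else "+"
    print(strand, original_read(flag, seq))
```

For the second record this gives back ATGGATCATG, the sequence that was actually in reads.fasta.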

            In the paired end world this may look like:

            $ cat left.fasta
            >read
            ATACATCCTC

            $ cat right.fasta
            >read
            ATGGATCATG


            Then run a paired-end mapping run:

            $ rtg map -o o -l left.fasta -r right.fasta -F fasta -t t
            $ zcat o/mated.sam.gz | grep -v "@"
            0 99 test 1 55 10= = 51 60 ATACATCCTC * AS:i:0 NM:i:0 MQ:i:255 XA:i:0 IH:i:1 NH:i:1
            0 147 test 51 55 10= = 1 -60 CATGATCCAT * AS:i:0 NM:i:0 MQ:i:255 XA:i:0 IH:i:1 NH:i:1


            The second column is harder to decode now. 99 and 147 mean the reads mapped in the correct orientation and with the correct insert size. For a breakdown of the two codes see http://ppotato.wordpress.com/2010/08...-paired-reads/
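For anyone who wants to decode such flags by hand, here is a small Python sketch. The bit values are the ones defined in the SAM specification; the short labels in `SAM_FLAGS` and the function name `decode_flag` are just shorthand chosen for this example.

```python
# (bit value, label). Bit values follow the SAM specification;
# the labels are informal shorthand for this example.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM flag."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


if __name__ == "__main__":
    for flag in (99, 147):
        print(flag, decode_flag(flag))
    # 99  -> paired, proper_pair, mate_reverse, first_in_pair
    # 147 -> paired, proper_pair, reverse, second_in_pair
```

So 99 = 1 + 2 + 32 + 64 is the forward-strand first read whose mate is on the reverse strand, and 147 = 1 + 2 + 16 + 128 is that reverse-strand mate.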

            Hope this helps.

            cheers
            Stu
            Stuart Inglis, Ph.D.
            Real Time Genomics
            www.realtimegenomics.com



            • #7
              Originally posted by chenyao View Post
              I don't get it. Do paired-end reads have to come from opposite directions (one is "+", the other is "-")? If so, why does your example show both reads as "+"?
              It's a fastq file; it hasn't been mapped. The software that made it has no idea whether a read is in the forward or reverse direction. It doesn't even know what reference I want to align it to.

              The plus is just a place holder. In the old days, before fastqs routinely had several million individual entries per file, the name of the read was rewritten after the + sign. Once fastqs started having millions of 40-mers and their 40 character quality scores, repeating the read name made each read 25% bigger than it had to be, so now, no one writes anything after that plus sign.

              And if you do a standard paired-end run, then yes, the reads should point in at each other. I think that with mate-pair reads, which come from a more complex prep intended to greatly increase the genomic distance between the two ends, the reads point outward, but I might be mistaken on that point.

              If you have paired end reads that don't point in at each other, then you have inaccurate reads, or an inaccurate reference as compared to the sample.
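A rough way to check this on mapped data is sketched below in Python. It is a simplification under stated assumptions: it only looks at each read's leftmost mapped position and strand, uses the informal labels "innie" and "outie", and ignores edge cases such as heavily overlapping reads or mates on different reference sequences. The function name `pair_orientation` is made up for this example.

```python
def pair_orientation(pos1, reverse1, pos2, reverse2):
    """Classify a mapped read pair by relative orientation.

    pos1, pos2         -- leftmost mapped positions of the two reads
    reverse1, reverse2 -- True if the read maps to the reverse strand

    Returns "innie" when the forward-strand read lies at or to the left
    of the reverse-strand read (reads point toward each other), "outie"
    when it lies to the right, and "same-strand" otherwise.
    """
    if reverse1 == reverse2:
        return "same-strand"
    if reverse1:
        forward_pos, reverse_pos = pos2, pos1
    else:
        forward_pos, reverse_pos = pos1, pos2
    return "innie" if forward_pos <= reverse_pos else "outie"


if __name__ == "__main__":
    # Positions and strands from the paired-end SAM records earlier in
    # the thread: read at 1 (forward) and its mate at 51 (reverse).
    print(pair_orientation(1, False, 51, True))
```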



              • #8
                Originally posted by swbarnes2 View Post
                It's a fastq file; it hasn't been mapped. The software that made it has no idea whether a read is in the forward or reverse direction. It doesn't even know what reference I want to align it to.

                .
                So for Illumina paired-end data, read 1 and read 2 do not denote forward and reverse, right?



                • #9
                  Originally posted by swbarnes2 View Post
                  I think that with mate-pair reads, which come from a more complex prep intended to greatly increase the genomic distance between the two ends, the reads point outward, but I might be mistaken on that point.
                  Yes, that is my understanding as well. Paired-end reads are "innie" and mate pairs are "outie." Sanger paired ends are generated from a completely different process (sequencing the ends of BAC clones) and the result is that those paired ends are "outie." This leads to a lot of confusion when using a mix of technologies, or using software that expects your paired ends in a certain orientation.



                  • #10
                    Originally posted by Arthur123 View Post
                    So for Illumina paired-end data, read 1 and read 2 do not denote forward and reverse, right?
                    The enzymes putting the adaptors on the piece of DNA have no idea which way your particular reference is oriented, and have no way of distinguishing which end of the DNA corresponds to the "forward" sequence. They are just molecules.

                    The only exception would be if you were doing something like a library of vectors with various insert sequences, and you wanted to know all the insert sequences. One could do PCR around those inserts, and put adaptor sequences on those PCR primers, and then adaptor 1 would be fixed at one point in the vector, and adaptor 2 would be fixed at the other end.

                    But if you are just randomly cutting DNA, then half of read 1 will be in one orientation, half will be in the other. Same with read 2.
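The "half and half" claim can be illustrated with a toy simulation in Python. This is only a sketch under simple assumptions: fragments are cut uniformly at random, each fragment is equally likely to be read from either strand, and there are no sequencing errors. The names `sample_pair` and `fraction_read1_forward` are invented for the example, and the reference is the 60-base template used earlier in the thread.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sample_pair(reference, frag_len, read_len, rng):
    """Cut one random fragment and return (read1, read2, strand)."""
    start = rng.randrange(len(reference) - frag_len + 1)
    fragment = reference[start:start + frag_len]
    strand = "+"
    # Adaptor ligation cannot tell the strands apart, so half of the
    # time the fragment is sequenced starting from the other strand.
    if rng.random() < 0.5:
        fragment = revcomp(fragment)
        strand = "-"
    read1 = fragment[:read_len]
    read2 = revcomp(fragment[-read_len:])
    return read1, read2, strand


def fraction_read1_forward(reference, n, frag_len=30, read_len=10, seed=0):
    """Fraction of simulated pairs whose read 1 is on the forward strand."""
    rng = random.Random(seed)
    forward = 0
    for _ in range(n):
        _read1, _read2, strand = sample_pair(reference, frag_len, read_len, rng)
        if strand == "+":
            forward += 1
    return forward / n


if __name__ == "__main__":
    template = "ATACATCCTCATTTCTCACTAATTTATTTCTGTTAAAATATTAAAACTAACATGATCCAT"
    print(fraction_read1_forward(template, 10000))  # close to 0.5
```

Whichever strand read 1 comes from, read 2 always comes from the other one, which is why the pair still points inward after mapping.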



                    • #11
                      Originally posted by swbarnes2 View Post
                      The enzymes putting the adaptors on the piece of DNA have no idea which way your particular reference is oriented, and have no way of distinguishing which end of the DNA corresponds to the "forward" sequence. They are just molecules.

                      The only exception would be if you were doing something like a library of vectors with various insert sequences, and you wanted to know all the insert sequences. One could do PCR around those inserts, and put adaptor sequences on those PCR primers, and then adaptor 1 would be fixed at one point in the vector, and adaptor 2 would be fixed at the other end.

                      But if you are just randomly cutting DNA, then half of read 1 will be in one orientation, half will be in the other. Same with read 2.
                      Thanks! You are awesome!



                      • #12
                        Originally posted by sphil View Post
                        No, all reads are considered to be written from left to right. The strand flag should make clear which strand the read originated from.
                        As for how to find mate pairs in the sequence file: usually the header line in the fastq file ends with a flag (normally '/1' or '/2') that indicates whether it is a 'front' or an 'end' read. Coming back to your example, it should look like this:
                        >Read1 more headerinfo /1
                        AGC
                        >Read2 more headerinfo /2
                        TCG
                        A nice overview of all this can be found at http://en.wikipedia.org/wiki/FASTQ_format, for instance.

                        hope that helps,

                        best

                        phil
                        Hi,

                        Are you sure about this? Because I have two paired fastq files from a MiSeq machine and here is the read pair:

                        Read Pair 1:
                        @M00569:20:000000000-A3EGF:1:1101:14488:1761 1:N:0:1
                        ACAGAATGTAAGCTTTCTAACTCATAAAACTCTTTCTGGAGGTCTGTAATTTTCTGCATAGGATCTTCATAAATCTGTTCTGAAAGTCTTATCTTTTGCTCTCTTCCTTTCTGCTGCATAAATCCATTTTCTTCTTCTTGCCTTGTTAGCA
                        +
                        >>>334DBDB55EGGGGG65FGGBG5555FGHHHHHHFFBA?EFGFHEFGHHHHHBFBHBBB3FGHHHFHFBBFGHBFHHHE5E3BFGHH5GGHHHHFDHGFHHHHHHHHHHFFHBG3F43EFGHFHHHHHHFHHHHHHBFGHF3GGF4F4
                        Read Pair 2:
                        @M00569:20:000000000-A3EGF:1:1101:14488:1761 2:N:0:1
                        NNGGGATGCTAATAGAGGATTATATTTATGAATCTTTAGTAGAAGACACGTACAATGGATCGGTAGATGGCAGTCTGCTAACAAGGCAAGAAGAAGAAAATGGATTTATGCAGCAGAAAGGAAGAGAGCAAAAGATAAGACTTTCAGAACA
                        +
                        ##1111>1>D33B331111BFBGBGHHFHBFFFGGGHC1FB2B21CFBFCHFG?1FBB1FF//EA/AFDBG0EGGHFFHFFFFBGEFA0C00C10>BCCBGBB1FGHFGFGFFFF0C01CE0CAAG0>GHCBFFAHFFHEHHGHHBB2FF0
                        Here the second read of the pair is actually the reverse complement of the human reference sequence at that locus. So for the example given above, I would have thought it would be:

                        >Read1 more headerinfo /1
                        AGC
                        >Read2 more headerinfo /2
                        CGA

                        Perhaps I am mistaken?
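For matching up the two files, here is a minimal Python sketch that reads 4-line FASTQ records and splits a header into a fragment name and a mate number. It assumes one record is exactly four lines, and it handles only the two header styles seen in this thread: the older 'name/1' suffix and the newer 'name 1:N:0:...' comment field. The function names are made up for this example.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops the iteration
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no information here
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def split_header(header):
    """Return (fragment_name, mate_number); mate_number is None if unknown."""
    name, _, comment = header.partition(" ")
    if comment[:2] in ("1:", "2:"):      # e.g. 'name 1:N:0:1'
        return name, int(comment[0])
    if name.endswith(("/1", "/2")):      # e.g. 'name/1'
        return name[:-2], int(name[-1])
    return name, None


def is_mate_pair(header1, header2):
    """True if the headers name read 1 and read 2 of the same fragment."""
    name1, mate1 = split_header(header1)
    name2, mate2 = split_header(header2)
    return name1 == name2 and (mate1, mate2) == (1, 2)


if __name__ == "__main__":
    # Toy records (made up), not real sequencer output.
    r1_file = io.StringIO("@frag7/1\nAGC\n+\nIII\n")
    r2_file = io.StringIO("@frag7/2\nCGA\n+\nIII\n")
    for (h1, s1, _q1), (h2, s2, _q2) in zip(read_fastq(r1_file), read_fastq(r2_file)):
        print(h1, s1, h2, s2, is_mate_pair(h1, h2))
```

Note that nothing in either header says which strand a read came from; the mate number only says which of the two sequencing reads it was.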



                        • #13
                          As swbarnes2 stated above, the reads are just as you said. I only wanted to point out that each read is written from left to right, and therefore didn't mention that the second read is also reverse complemented. So your second read should always be reverse complemented relative to the strand the 'first' read maps to.

                          Maybe this http://www.illumina.com/technology/p...ing_assay.ilmn helps to clarify things for good.

                          Originally posted by swbarnes2 View Post
                          So actually, in your example, your reads would be AGC and CGA. Most alignment programs would report them both in the forward direction, and have a tag in there to tell you that the read is rev comped where appropriate.



                          • #14
                            Hi. I was given raw reads by a service provider but there were no Left or Right reads. Is there any way that I could revert back to separate R and L?



                            • #15
                              Please ignore my question. I already found the paired reads. Thanks.
