  • Mate Pair orientation in Illumina

    Hi everyone,
    I'm working on a university project about "resequencing" a small genome (the reference genome is laidlawii).
    I have the reference genome and two FASTQ files containing the reads of an Illumina mate-pair library from a target genome.
    To get to the point: I have a problem when asked to generate a track for IGV representing the "percentage of oriented mates". I simply can't work out which read in each pair is the left one and which is the right one.
    Each read in the two FASTQ files has an ID and is also marked with a /1 or /2 tag: one file holds all the /1 reads and the other holds all the /2 reads.
    The question is whether there is a strong relation between the tag and whether the read is the left or the right one.
    For alignment I use PASS (pass.cribi.unipd.it), which outputs a SAM file with various information, including whether the alignment is reverse complemented (flag bit 0x10 set).
    In almost every pair one read is aligned left->right while the other is aligned reverse complemented (maybe Illumina sequences the two ends from different strands?).

    To put it more simply: can I say that every mate pair with /1 aligning left->right and /2 aligning reverse complemented is left->right oriented on the reference?
    And that in the opposite case the mate pair aligns reversed on the reference?
    (The assumption to prove is that /1 is always the left (or right) mate.)

    Thank you,
    I hope I've made myself understood (English is not my first language and I'm a poor informatician :P)


    Edit: to be as thorough as possible, here is a situation that is driving me crazy:
    Code:
    sq_1607_4547_0_1_0_0_0:0:1_0:0:0_3f6b5	83	Chromosome	4547	50	50M	=	1607	-2990	GACTACATCGGTTCCGGAGGGGAAACGAAGTATTTTTTATATGAGCATAA	
    sq_1607_4547_0_1_0_0_0:0:1_0:0:0_3f6b5	163	Chromosome	1607	49	5M1D45M	=	4547	2990	ACTCGTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTA	
    sq_1610_3842_0_1_0_0_0:0:1_0:0:0_6dbfe	83	Chromosome	3842	50	50M	=	1612	-2280	ATACCCGGATACAGCAAAAATCATACCTGTTAATTTTCCTACTGTCATTA	
    sq_1610_3842_0_1_0_0_0:0:1_0:0:0_6dbfe	163	Chromosome	1612	49	49M	=	3842	2280	GTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAA
    sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e	99	Chromosome	220	49	9M1D41M	=	1612	1442	TAATAAATTGTCGTTTCTTATGCTATCATAGTTTTACATAAATTATTAAC	
    sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e	147	Chromosome	1612	50	50M	=	220	-1442	GTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAAA	
    sq_1611_4420_0_1_0_0_0:0:1_0:0:0_35f5b	83	Chromosome	4420	50	50M	=	1612	-2858	AAGCGTTAAAAAGTGCGCTTTTTTACTTATATTATGTTATAATATAATAG	
    sq_1611_4420_0_1_0_0_0:0:1_0:0:0_35f5b	163	Chromosome	1612	50	50M	=	4420	2858	GTTGTCAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAAA
    sq_1617_4456_0_1_0_0_0:0:0_0:0:0_3e90d	83	Chromosome	4456	50	50M	=	1617	-2889	TTATAATATAATAGGTAGGTGAATGAAGCGTATGAATCATTTTGAGTTAG	
    sq_1617_4456_0_1_0_0_0:0:0_0:0:0_3e90d	163	Chromosome	1617	50	50M	=	4456	2889	CAAAAAAATAGATTCACCATTATTAAAGTGATAAATGTTTATAAAAATGA
    This is the output of the PASS aligner, with mates sorted by ID.
    As you can see, in the first two mate pairs the "first segment in the template" is reverse complemented (flag = 83, with bits 0x10 and 0x40 set according to the SAM spec) and the "second segment" is forward aligned (flag = 163). The same holds for the hundreds of pairs preceding that point, so over the first 1612 bases I have /2 forward aligned and /1 reverse complemented.
    The third mate pair in the example is different: flag = 99 means "this is the first segment and it is forward aligned", and flag = 147 means "this is the second segment and it is reverse complemented". So in this case /2 is reversed and /1 is forward.
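For anyone following along, the four flag values discussed here can be checked with a few lines of Python. This is only a sketch: the bit constants are the ones defined in the SAM specification, while the function name `describe_flag` is made up for illustration.

```python
# SAM FLAG bits relevant to this thread (values from the SAM specification).
REVERSE = 0x10         # read aligned to the reverse strand
FIRST_IN_PAIR = 0x40   # first segment in the template (/1)
SECOND_IN_PAIR = 0x80  # last segment in the template (/2)

def describe_flag(flag):
    """Return (read label, strand) for one paired SAM record."""
    if flag & FIRST_IN_PAIR:
        read = "/1"
    elif flag & SECOND_IN_PAIR:
        read = "/2"
    else:
        read = "unpaired"
    strand = "reverse" if flag & REVERSE else "forward"
    return read, strand

for flag in (83, 163, 99, 147):
    print(flag, describe_flag(flag))
# 83 ('/1', 'reverse')
# 163 ('/2', 'forward')
# 99 ('/1', 'forward')
# 147 ('/2', 'reverse')
```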
    After that, everything returns to normal...
    This example makes me think that there is no strong relation between the /1 /2 tag and whether a read is left or right.
    If there were, how could I explain that /2 of the second mate pair aligns forward at position 1612 while /2 of the third mate pair aligns reversed at the same position?
    The only other explanation would be another region of my genome with the same sequence reversed, but then I'd see multiple alignments for the read, and that is not the case (the reads are sorted by ID, so I would notice...).
    An example of a read with multiple alignments is this:
    Code:
    sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68	99	Chromosome	73696	50	50M	=	76178	2532	ATTTATCGGTTTAAGAGGGGTCTGCGGCGCATTAGTTAGTTGGTGGGGTA
    sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68	147	Chromosome	76178	49	50M	=	73696	-2532	AATATATGCTAAGTGGAAACGGAAGTAGAGATGCACAAACAGCCAGGAGG
    sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68	83	Chromosome	1204898	50	50M	=	1202206	-2742	TACCCCACCAACTAACTAATGCGCCGCAGACCCCTCTTAAACCGATAAAT	
    sq_76677_74195_1_0_0_0_0:1:0_0:0:0_61c68	163	Chromosome	1202206	49	50M	=	1204898	2742	CCTCCTGGCTGTTTGTGCATCTCTACTTCCGTTTCCACTTAGCATATATT
    Here you can see that the same mate pair aligns correctly in two different parts of the genome: in the first alignment /1 is forward and /2 is reversed, while in the second alignment /1 is reversed and /2 is forward.
    This is plausible, given that the sequence around position 1200000 is probably (in fact I know it is) the reverse complement of the sequence around position 70000.
    But if I don't know which of /1 and /2 is the left mate, I can't say where in my target genome the inversion is.
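One way to look at the left/right question is to sort the two records of a pair by position and report the strand of each, together with which read ends up leftmost. A minimal sketch, assuming each record is reduced to a (flag, pos) tuple; `pair_orientation` and the "FR"/"RF" labels (strand of the leftmost record first) are conventions chosen here for illustration, not something PASS outputs.

```python
REVERSE = 0x10        # SAM: read aligned to the reverse strand
FIRST_IN_PAIR = 0x40  # SAM: first segment in the template (/1)

def pair_orientation(rec_a, rec_b):
    """Classify one pair from two (flag, pos) tuples.

    Returns (orientation, leftmost_read), where orientation lists the
    strand of the leftmost record first, e.g. "FR" or "RF".
    """
    # Order the two records by their leftmost mapped position.
    left, right = sorted((rec_a, rec_b), key=lambda rec: rec[1])
    strands = "".join("R" if flag & REVERSE else "F" for flag, _ in (left, right))
    leftmost = "/1" if left[0] & FIRST_IN_PAIR else "/2"
    return strands, leftmost

# First and third pairs from the SAM excerpt in this post.
print(pair_orientation((83, 4547), (163, 1607)))  # ('FR', '/2')
print(pair_orientation((99, 220), (147, 1612)))   # ('FR', '/1')
```

On these two pairs the strand pattern is the same and only the leftmost read differs, so the sketch keeps the two questions (strand pattern, and which tag is on the left) separate.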

    Anyway, do you know whether the read identifier can be split to extract more information? I've noticed a certain regularity, as if the first two "boolean" values could say which read is the left one (if "_0_1_" meant that the second read is the left one and "_1_0_" that the first read is the left one, that would answer my question). However, I have no documentation about that, and it does not match the Illumina FASTQ naming standard.
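The guess about the identifier could at least be tested against the alignments. The sketch below splits the read name on "_" and counts how often each pair of "boolean" fields co-occurs with each strand of the /1 read. The meaning of those fields is purely the hypothesis stated above, not documented anywhere in this thread, and the helper names `name_fields` and `tabulate` are invented for the example. It also assumes the sequence name at the start of the ID contains no underscore.

```python
from collections import Counter

def name_fields(qname):
    """Return the two small integer fields that follow the two coordinates.

    HYPOTHETICAL layout: <name>_<pos>_<pos>_<bool>_<bool>_...
    """
    parts = qname.split("_")
    return int(parts[3]), int(parts[4])

def tabulate(records):
    """Count (name fields, strand of the /1 read) over (qname, flag) tuples."""
    counts = Counter()
    for qname, flag in records:
        if flag & 0x40:  # only look at the /1 record of each pair
            strand = "reverse" if flag & 0x10 else "forward"
            counts[(name_fields(qname), strand)] += 1
    return counts

records = [
    ("sq_1607_4547_0_1_0_0_0:0:1_0:0:0_3f6b5", 83),
    ("sq_1607_4547_0_1_0_0_0:0:1_0:0:0_3f6b5", 163),
    ("sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e", 99),
    ("sq_1611_220_1_0_0_0_0:0:1_0:0:1_2059e", 147),
]
for key, n in sorted(tabulate(records).items()):
    print(key, n)
```

If one combination of fields always goes with one strand across the whole file, the guess gains some support; a few records cannot confirm it, and reads with multiple alignments would need to be handled separately.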
    Last edited by d3mux; 08-10-2014, 05:36 AM.

  • #2
    Do you mean paired-end or mate pair? For Illumina technology, the orientation of the two reads relative to each other is different for paired end and mate pair.

    The read identifiers in your sam file don't look like typical Illumina read IDs.

    Anyways, if your reads are paired end Illumina reads, it is just random whether the reads in file /1 align to the + strand or the - strand, some of the /1 reads will align to one strand, and some to the other strand.
    The reads in file /2 will align to the opposite strand from the paired read in file /1.


    • #3
      Originally posted by mastal
      Do you mean paired-end or mate pair? For Illumina technology, the orientation of the two reads relative to each other is different for paired end and mate pair.

      The read identifiers in your sam file don't look like typical Illumina read IDs.

      Anyways, if your reads are paired end Illumina reads, it is just random whether the reads in file /1 align to the + strand or the - strand, some of the /1 reads will align to one strand, and some to the other strand.
      The reads in file /2 will align to the opposite strand from the paired read in file /1.
      I was told mate pair (I also received written instructions).
      Well... if there's no relation between /1 /2 and orientation, I really wonder how I can decide which orientation my mate pairs have...
      As for the reads, I can't rule out that they were generated artificially (by simulation from a sample genome).
      Thanks for the reply


      • #4
        They should be RF. That being said, MP libraries generally contain significant PE contamination. Check out NextClip and the Illumina technical bulletin on MP library analysis.
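To see how the RF and FR fractions vary along the reference, and to get something IGV can load, pairs could be binned into windows and written out as a bedGraph track. The sketch below is one possible reading of "percentage of oriented mates", namely the percentage of pairs per window whose orientation matches an expected one. That definition, the function name `orientation_track`, and the default window size are assumptions. It also assumes the aligner sets the mate-strand bit (0x20) and the sign of TLEN as described in the SAM specification.

```python
from collections import defaultdict

def orientation_track(sam_lines, window=1000, expected="RF"):
    """Yield bedGraph lines: percent of pairs per window matching `expected`."""
    total = defaultdict(int)
    match = defaultdict(int)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        f = line.rstrip("\n").split("\t")
        flag, chrom, pos, tlen = int(f[1]), f[2], int(f[3]), int(f[8])
        # Keep paired records with both ends mapped (0x4 / 0x8 unset) and
        # count each pair once, through its leftmost record (TLEN > 0).
        if not (flag & 0x1) or (flag & 0xC) or tlen <= 0:
            continue
        # Strand of this (leftmost) read, then strand of its mate.
        orient = ("R" if flag & 0x10 else "F") + ("R" if flag & 0x20 else "F")
        key = (chrom, (pos - 1) // window)
        total[key] += 1
        match[key] += orient == expected
    for chrom, w in sorted(total):
        pct = 100.0 * match[(chrom, w)] / total[(chrom, w)]
        # bedGraph columns: chrom, 0-based start, end (exclusive), value
        yield f"{chrom}\t{w * window}\t{(w + 1) * window}\t{pct:.1f}"

with_sam = [
    "r1\t163\tChromosome\t1607\t49\t50M\t=\t4547\t2990\tACGT",
    "r1\t83\tChromosome\t4547\t50\t50M\t=\t1607\t-2990\tACGT",
    "r2\t99\tChromosome\t220\t49\t50M\t=\t1612\t1442\tACGT",
]
for row in orientation_track(with_sam, expected="FR"):
    print(row)
```

Windows that contain no pairs are simply left out of the output. Pairs are assigned to the window of their leftmost read, which is a simplification for inserts that span several windows.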
