Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mate pair sequencing - quality, duplication, throughput

    We recently got Illumina mate pair sequencing data for a human sample. I am trying to do some some Quality Check to see how good the data is.

    I used bwa for alignment and the output bam file gave these numbers:
    324061746 + 0 in total (QC-passed reads + QC-failed reads)
    265547763 + 0 mapped (81.94%:-nan%)
    239584614 + 0 properly paired (73.93%:-nan%)
    244063409 + 0 with itself and mate mapped
    21484354 + 0 singletons (6.63%:-nan%)
    993167 + 0 with mate mapped to a different chr
    812922 + 0 with mate mapped to a different chr (mapQ>=5)

    I knew (from FastQC) that the data had high rates of duplication (>90%), so I marked and removed duplicates with picard to get these numbers from the dedupped bam file:

    114026332 + 0 in total (QC-passed reads + QC-failed reads)
    55512349 + 0 mapped (48.68%:-nan%)
    114026332 + 0 paired in sequencing
    57259162 + 0 read1
    56767170 + 0 read2
    49051344 + 0 properly paired (43.02%:-nan%)
    50304349 + 0 with itself and mate mapped
    5208000 + 0 singletons (4.57%:-nan%)
    371423 + 0 with mate mapped to a different chr
    246089 + 0 with mate mapped to a different chr (mapQ>=5)

    As you can see, ~200 M reads are duplicates and consequently, removed from the alignment file.

    My questions are:
    1) Is it reasonable for human mate-pair libraries (insert size ~5kb) to have such high rates of duplication?
    2) Does this reflect an average/good/bad mate pair sequencing run?
    3) Any other suggestions regarding checking quality of mate pair sequencing data in general.

    I have looked at other threads and this seemed like the only one somewhat relevant.

    Thanks

  • #2
    General related question: What is the largest insert size that can be used with Illumina sequencing if you want the pairs to actually overlap? Is it something like 200bp?
    savetherhino.org

    Comment


    • #3
      Hi,
      which MP library protocol did you use? One using linkers or not? Did you trim the linkers?
      How exactly did you carry out the alignments?

      Comment


      • #4
        MiSeq does allow for 2x250 bp PE sequencing;
        HiSeq does allow for 2x150 bp PE sequencing;

        How much overlap do you want?

        There is one program which claims to merge reads even without actual overlap - given enough coverage:

        "COPE: An accurate k-mer based pair-end reads connection tool
        to facilitate genome assembly"



        Originally posted by rhinoceros View Post
        General related question: What is the largest insert size that can be used with Illumina sequencing if you want the pairs to actually overlap? Is it something like 200bp?

        Comment


        • #5
          Originally posted by rhinoceros View Post
          General related question: What is the largest insert size that can be used with Illumina sequencing if you want the pairs to actually overlap? Is it something like 200bp?
          Upwards of 400 bp. With MiSeq 2 x 250 bp runs you will get overlap in the middle. For HiSeq runs would be shorter since 2 x 150 bp is the max supported on the 2500 in rapid mode at the moment.

          Comment


          • #6
            The libraries were prepared by Illumina Mate Pair Library Preparation Kit v2 and sequenced using HiSeq2000 generating 2*100bp reads.

            The alignments were done using bwa aln with default parameters. I did no pre-processing of data in this current analysis.

            Code:
            bwa aln -t 4  /BWAIndex/genome.fa L001_R1.fastq.gz > L001_1.sai
            bwa aln -t 4  /BWAIndex/genome.fa L001_R2.fastq.gz > L001_2.sai
            bwa sampe /BWAIndex/genome.fa 1.sai 2.sai 1.fastq.gz 2.fastq.gz > L001.sam
            samtools view -bS L001.sam > L001.bam
            samtools sort L001.bam L001_sorted
            The duplicates were marked and removed, as I previously mentioned, using picard MarkDuplicates tool.

            Originally posted by luc View Post
            Hi,
            which MP library protocol did you use? One using linkers or not? Did you trim the linkers?
            How exactly did you carry out the alignments?

            Comment


            • #7
              Hi,

              I believe the "old" bwa has problems with mapping mate pairs:
              Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


              BWA-mem, Bowtie, Novalign are better suited to my knowledge.

              Similarly the old Illumina MP kit produces lots of artifacts and PCR duplicates; the Nextera based mate-pair kit is more efficient.

              What was the fragment size of your library? People used to trim the MP reads to the first 38 bases each, in part to avoid mapping chimeric sequences.

              Comment


              • #8
                BWA and BWA mem
                I an using reverse-complemented reads. That brings the --rf orientation mate pair reads into --fr paired end orientation, which can then be used for bwa alignment. I am running bwa-mem as I write this so that I have a direct comparison between bwa/ bwa mem

                Library prep:
                The samples were sent to a commercial vendor. So, I did nothing with the library or sequencing steps. We just got the data from the vendor. But, I will keep that in mind for the future.

                Comment


                • #9
                  Hi agupta,

                  you could look at the distances of PE alignments to get an idea if their sequencing libraries had short inserts and thus were more likely to include chimeric reads.

                  Comment


                  • #10
                    I have carried out de novo assembly of an organism of interest using velvet optimiser. I have contig files. How can i now, predict genes from the sequences? I have also carried out comparsion of the assembly with a reference using abacas and tried to view it usin artemis comparsion viewer. But, i am unable to understand the contig ordering part. How does one do that? Secondly, what is the way to predict genes from the contigs?

                    Comment


                    • #11
                      Originally posted by diptarka View Post
                      I have carried out de novo assembly of an organism of interest using velvet optimiser. I have contig files. How can i now, predict genes from the sequences? I have also carried out comparsion of the assembly with a reference using abacas and tried to view it usin artemis comparsion viewer. But, i am unable to understand the contig ordering part. How does one do that? Secondly, what is the way to predict genes from the contigs?
                      You should create a separate thread for this question since:
                      a) your questions are not related to current thread
                      b) people who could potentially answer your question will not notice it if it is embedded here.

                      New threads can be started by doing following:

                      Seqanswers.com --> Forums (Navigation Menu on Left) --> Select an appropriate forum (e.g. Bioinformatics) --> "New Thread" button at the top left corner on next page.

                      That said: What type of organism is this (prokaryote/eukaryote)?

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      11 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      10 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      51 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      68 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X