  • The purpose of joining/merging 2x150bp Illumina sequencing reads

    Hello, I have sequenced my metagenome with 2x150bp and 2x250bp Illumina HiSeq runs. I read some papers that recommend joining the read pairs before assembly. I understand the reason for 2x250bp: say you have a 50bp overlap, then you get a 450bp read after joining, which makes sense.

    What about 2x150bp, 2x75bp or 2x50bp (with their shorter inserts)? The forward and reverse reads are basically identical. I don't get it. Say you joined two 50bp reads and got 100bp. This doesn't make a lot of biological sense. To me, it just ligates two DNA sequences together, which is not what is in the organism.

    PS: I was wondering whether Illumina software trims the adapters automatically nowadays. I got the raw fastq files from the sequencing center and tried to trim the adapters. I was surprised that I couldn't find many.

    Thanks

  • #2
    If you have extremely short inserts so R1/R2 completely overlap then that is a waste of sequencing (or you like to be over cautious).

    Adapters may be cut automatically by MiSeq Reporter/BaseSpace if a setting is chosen at run time. Good libraries (long inserts) should not have adapter contamination, so it is not unusual to see clean reads.



    • #3
      Originally posted by GenoMax View Post
      If you have extremely short inserts so R1/R2 completely overlap then that is a waste of sequencing (or you like to be over cautious).

      Adapters may be cut automatically by MiSeq Reporter/BaseSpace if a setting is chosen at run time. Good libraries (long inserts) should not have adapter contamination, so it is not unusual to see clean reads.
      1> So 2x150bp WGS reads do not overlap completely? I thought this was short enough.

      2> Mine is from a HiSeq, but it came through BaseSpace. I didn't find many adapters. Does this mean they were removed?



      • #4
        Originally posted by SDPA_Pet View Post
        1> So 2x150bp WGS reads do not overlap completely? I thought this was short enough.
        If your inserts are longer (say 350 bp) then R1/R2 won't overlap. Use the BBMerge program from BBMap to quickly determine how many overlap in middle.

        2> Mine is from a HiSeq, but it came through BaseSpace. I didn't find many adapters. Does this mean they were removed?
        If all R1/R2 reads are not full length (equal to number of cycles) then it is possible that they were already trimmed.
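
One way to check this yourself is to tabulate the read lengths in the FASTQ file. The sketch below is only an illustration: the function names are made up for this example, it assumes a plain 4-line-per-record FASTQ (no wrapped sequence lines), and the toy records are not real data.

```python
from collections import Counter
import io


def read_length_histogram(fastq_handle):
    """Count read lengths in a FASTQ stream (assumes 4 lines per record)."""
    lengths = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            lengths[len(line.rstrip())] += 1
    return lengths


def looks_trimmed(lengths, cycles):
    """True if at least one read is shorter than the number of cycles."""
    return any(length < cycles for length in lengths)


# Toy input (hypothetical): two 10-base reads and one 7-base read
example = io.StringIO(
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r3\nACGTACG\n+\nIIIIIII\n"
)
hist = read_length_histogram(example)
print(dict(hist))
print(looks_trimmed(hist, cycles=10))
```

For a real 2 x 150 bp run you would open the file (for example with gzip.open(path, "rt") if it is compressed) and pass cycles=150. If every read is exactly 150 bases, the data were most likely not trimmed.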



        • #5
          Originally posted by GenoMax View Post
          If your inserts are longer (say 350 bp) then R1/R2 won't overlap. Use the BBMerge program from BBMap to quickly determine how many overlap in middle.



          If all R1/R2 reads are not full length (equal to number of cycles) then it is possible that they were already trimmed.
          "equal to number of cycles": I'm not sure what "cycles" means. Do you mean the nominal read length? I asked them for 2x150bp, so the full length will be 150bp. Does this also mean 150 cycles?



          • #6
            Originally posted by GenoMax View Post
            If your inserts are longer (say 350 bp) then R1/R2 won't overlap. Use the BBMerge program from BBMap to quickly determine how many overlap in middle.

            Which part of this report tells me how many overlap in the middle?

            BBMerge version 36.38
            Extend2 is defaulting to 50 because it was unset but rem mode is being used.
            Executing assemble.Tadpole2 [in=ecct.ecco.clean.ELM010016AB_S1_L001_R_interleaved.fastq, branchlower=3, branchmult1=20.0, branchmult2=3.0, mincountseed=3, mincountextend=2, minprob=0.5, k=62, prealloc=false, prefilter=0, ecctail=false, eccpincer=false, eccreassemble=true]

            Using 24 threads.
            Executing ukmer.KmerTableSetU [in=ecct.ecco.clean.ELM010016AB_S1_L001_R_interleaved.fastq, branchlower=3, branchmult1=20.0, branchmult2=3.0, mincountseed=3, mincountextend=2, minprob=0.5, k=62, prealloc=false, prefilter=0, ecctail=false, eccpincer=false, eccreassemble=true]

            Initial:
            Ways=61, initialSize=128000, prefilter=f, prealloc=f
            Memory: max=102900m, free=100216m, used=2684m

            Estimated kmer capacity: 1922031677
            After table allocation:
            Memory: max=102900m, free=99142m, used=3758m

            After loading:
            Memory: max=102900m, free=59791m, used=43109m

            Input: 8479764 reads 1249669006 bases.
            Unique Kmers: 648242372
            Load Time: 60.614 seconds.

            Writing mergable reads merged.
            Started output threads.
            Total time: 84.110 seconds.

            Pairs: 4239882
            Joined: 2489893 58.726%
            Ambiguous: 1749956 41.274%
            No Solution: 33 0.001%
            Too Short: 0 0.000%
            Fully Extended: 34222 0.404%
            Partly Extended: 88184 1.040%
            Not Extended: 8357336 98.556%
            Adapters Expected: 22 0.000%
            Adapters Found: 0 0.000%

            Avg Insert: 210.6
            Standard Deviation: 49.4
            Mode: 245

            Insert range: 35 - 390
            90th percentile: 274
            75th percentile: 252
            50th percentile: 216
            25th percentile: 173
            10th percentile: 138



            • #7
              Originally posted by SDPA_Pet View Post
              Pairs: 4239882
              Joined: 2489893 58.726%
              Ambiguous: 1749956 41.274%
              No Solution: 33 0.001%
              Too Short: 0 0.000%
              Fully Extended: 34222 0.404%
              Partly Extended: 88184 1.040%
              Not Extended: 8357336 98.556%
              Adapters Expected: 22 0.000%
              Adapters Found: 0 0.000%

              Avg Insert: 210.6
              Standard Deviation: 49.4
              Mode: 245

              Insert range: 35 - 390
              90th percentile: 274
              75th percentile: 252
              50th percentile: 216
              25th percentile: 173
              10th percentile: 138
              Right there in the log you posted: the "Joined" line counts the pairs that overlapped and were merged. If you wrote the merged reads to a file they will be in there as well. Looks like your average insert size is about 210 bp, so with 2 x 150 bp reads R1/R2 will overlap in the middle.
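
If you want to pull those numbers out of the log automatically, a small parser along these lines would do. This is a rough sketch, not part of BBTools: parse_summary is a name invented for this example, and it only assumes the "Name: value percent%" layout of the summary block shown above.

```python
import re


def parse_summary(log_text):
    """Collect 'Name: value [percent%]' lines from the summary into a dict."""
    stats = {}
    pattern = re.compile(r"\s*([A-Za-z0-9 ]+):\s+([\d.]+)(?:\s+([\d.]+)%)?\s*$")
    for line in log_text.splitlines():
        m = pattern.match(line)
        if m:
            name, value, pct = m.groups()
            stats[name.strip()] = (float(value), float(pct) if pct else None)
    return stats


# A few lines copied from the log posted above
log = """
Pairs: 4239882
Joined: 2489893 58.726%
Ambiguous: 1749956 41.274%
Avg Insert: 210.6
"""
stats = parse_summary(log)
joined, joined_pct = stats["Joined"]
pairs = stats["Pairs"][0]
print(f"{int(joined)} of {int(pairs)} pairs merged ({joined_pct}%)")
```

Lines that do not fit the pattern (such as "Insert range: 35 - 390") are simply skipped.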



              • #8
                Hi GenoMax,

                I am new to this field. Forgive me if I ask a naive question.

                1> What is the insert? I thought the insert was the fragment produced when they shear the genomic DNA. In my case they did 2x150bp, so the insert/fragment is about 150bp. If my understanding is wrong, what is the insert? How can this software calculate it?

                2> How can you calculate that they overlap? 2x150bp = 300bp. Because 210.6bp < 300bp, they overlap?

                Thanks



                • #9
                  Is 241-150=91bp roughly the length of the overlapping region between R1 and R2?



                  • #10
                    Originally posted by SDPA_Pet View Post
                    Hi GenoMax,

                    I am new to this field. Forgive me if I ask a naive question.

                    1> What is the insert? I thought the insert was the fragment produced when they shear the genomic DNA. In my case they did 2x150bp, so the insert/fragment is about 150bp. If my understanding is wrong, what is the insert? How can this software calculate it?
                    Fragments are what results from shearing the DNA. Even though library preps aim for a certain size, not all fragments end up that size (generally there is a distribution), so individual fragments can be smaller (or much larger) than the mean size determined by the Bioanalyzer. Adapters are added to each fragment during library prep (the fragment between the adapters is the "insert"), which adds another 120 bp or thereabouts, so the molecule that goes into the sequencer is actually longer than the insert.

                    2> How can you calculate that they overlap? 2x150bp = 300bp. Because 210.6bp < 300bp, they overlap?

                    Thanks
                    That is correct.

                    Code:
                    --------------------------------------------------   250 bp insert
                    ----------------------------->                       150 bp R1
                                        <-----------------------------   150 bp R2

                    --------------------==========--------------------   Merged read with 50 bp overlap (====) in middle
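
The arithmetic behind the diagram can be written out as a small function. This is just a sketch of the geometry; expected_overlap is a name chosen for this example, and it assumes both reads start at opposite ends of the insert with any adapter read-through ignored.

```python
def expected_overlap(read_len_r1, read_len_r2, insert_size):
    """Bases covered by both R1 and R2.

    A negative result is the size of the unsequenced gap between the reads.
    When the insert is shorter than a read, the overlap is capped at the
    insert size (the rest of the read runs into adapter).
    """
    covered_r1 = min(read_len_r1, insert_size)
    covered_r2 = min(read_len_r2, insert_size)
    return covered_r1 + covered_r2 - insert_size


print(expected_overlap(150, 150, 250))              # the case drawn above
print(expected_overlap(150, 150, 350))              # negative: reads do not meet
print(round(expected_overlap(150, 150, 210.6), 1))  # average insert from the log
```

With the 250 bp insert from the diagram this gives 50 bp, and with the 210.6 bp average from the log it gives roughly 89 bp of overlap on average.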



                    • #11
                      Thanks GenoMax,

                      As you said, the insert size is determined with a Bioanalyzer by the people doing the sequencing. My BBMerge results tell me the average insert is about 210.6. I just don't know how the software calculates this. I thought the sequencing person would be the only one who knows the insert size, because they have the analyzer and did the lab work.



                      • #12
                        BBMerge looked at the end result of the merge process across all the read pairs it could merge (the length of a merged pair is the insert size for that pair) and then came up with the average insert size.
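
As a rough illustration of the idea (not BBMerge's actual code, which is not shown in this thread), the summary could be reproduced from the lengths of the merged reads like this. The function name and the toy lengths are made up for the example.

```python
import statistics


def insert_size_summary(merged_lengths):
    """Summarise insert sizes, taking each merged read length as one insert."""
    return {
        "mean": statistics.mean(merged_lengths),
        "stdev": statistics.pstdev(merged_lengths),
        "median": statistics.median(merged_lengths),
        "min": min(merged_lengths),
        "max": max(merged_lengths),
    }


# Toy lengths (hypothetical), one per successfully merged pair
summary = insert_size_summary([200, 210, 220, 230, 240])
print(summary)
```

Only pairs that could be merged contribute a length, so an estimate made this way will tend to under-represent inserts that are too long for R1 and R2 to overlap.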



                        • #13
                          Thanks. Could you explain the relationship between cycles and length? You said "If all R1/R2 reads are not full length (equal to number of cycles) then it is possible that they were already trimmed."

                          I thought there was no relationship between cycles and length. If you run more cycles of sequencing, you get more reads, but not longer reads.

                          The read length (the nominal length, i.e. 2x150bp, 2x250bp, 2x300bp etc.) is decided by the reagents and the sequencing platform you use.



                          • #14
                            Number of sequencing cycles = length (number of bases) of each untrimmed read.

                            If Illumina software is asked to do trimming during post-processing of data then you may end up with some reads that will not be full length (not equal to number of cycles of sequencing, since the adapter sequence will have been removed, making those reads short).

                            If you run more cycles of sequencing you will get LONGER reads.

                            The number of reads is equal to the number of clusters that successfully produce sequence on an Illumina flowcell. The quality and concentration of the library loaded on the flowcell determine the number of clusters.

                            Not all clusters will pass quality filtering (e.g. some may produce mixed sequence if two clusters touch/mix and will be removed by Illumina software).

                            The number of reads you get in the final data file = number of clusters that passed Illumina QC filter.

                            The type of kit and sequencer used will determine how much sequence (and length of reads) you can get. Not to complicate things but it is possible to run asymmetric sequencing with kits (e.g. 2 x 300 bp kit can be used to give 1 x 600 bp reads, not that you would want to in most cases, but the upper cap is ~600 cycles of sequencing from this kit)
                            Last edited by GenoMax; 10-11-2016, 04:50 AM.
