  • Extreme 5' nucleotide bias in 2nd pair Illumina Hiseq reads

    Hi Everyone,

    I'm new to the forum and have a question regarding an odd pattern in a new transcriptome dataset of Illumina HiSeq 2000 (v1.9) paired end 100bp reads.

    Looking at the FastQC outputs provided with the raw data from our sequencing provider, I noticed that across 10 TruSeq mRNA libraries (2 tissue types, different sexes, replicate individual fish), all 2nd pair reads appear to start with the same two bases: either 'AT', 'TT' or 'GA'. Base quality is good overall. See the attached FastQC graphs for an example of the 'AT' bias. The 1st pair reads do not show this bias.

    I have emailed my sequencing provider for clarification, but I also wanted to know if anyone here has come across this pattern. Any thoughts on what the cause might be?
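For anyone wanting to check the same pattern outside FastQC, here is a minimal Python sketch that tallies the first two bases of every read in a FASTQ file. The function name and the example reads are made up for illustration; it assumes plain 4-line FASTQ records.

```python
from collections import Counter
from itertools import islice


def leading_kmer_counts(fastq_lines, k=2):
    """Count the first k bases of every read in an iterable of FASTQ lines."""
    counts = Counter()
    # In 4-line FASTQ records the sequence is the 2nd line of each record.
    for seq in islice(fastq_lines, 1, None, 4):
        counts[seq.strip()[:k]] += 1
    return counts


# Tiny made-up example; in practice pass an open file handle instead,
# e.g. gzip.open(path, "rt") for a (hypothetical) gzipped R2 file.
example = [
    "@r1/2", "ATGGCA", "+", "IIIIII",
    "@r2/2", "ATCCGT", "+", "IIIIII",
    "@r3/2", "TTGACA", "+", "IIIIII",
]
print(leading_kmer_counts(example).most_common())  # [('AT', 2), ('TT', 1)]
```

Running it on R1 and R2 separately should show whether the skew is confined to the 2nd pair reads.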

  • #2
    Do you know whether the libraries were stranded or non-stranded, and which kit was used for library prep?



    • #3
      Sorry, libraries will have been stranded, prepared with a TruSeq Stranded mRNA LT Sample Prep Kit.



      • #4
        Do you know which indices were used for barcoding the samples, and which libraries were pooled in the same lane if more than one lane was used for sequencing? If multiple lanes were used, were they on the same flow cell or on different ones? Is this bias present in all samples, or only in a pool of samples from one particular lane?



        • #5
          Yes, all samples were run on one lane and the bias occurs across all samples - but only for 2nd pair reads. The most common bias is by far 'AT', but 'TT' and 'GA' occur for some samples as well. I have pasted the index list used for barcoding samples below.
          ATCACG
          TTAGGC
          ACTTGA
          GATCAG
          TAGCTT
          GGCTAC
          GTGGCC
          GTTTCG
          CGATGT
          TGACCA
          ACAGTG
          GCCAAT
          CAGATC
          CTTGTA
          AGTCAA
          AGTTCC
          CGTACG
          GAGTGG
          ACTGAT
          ATTCCT



          • #6
            Thanks for providing enough information to explore the likely causes of this observation. I do not see any biological or technical reason for this:

            1) The bias is library specific and is not observed in replicates of the same sample.
            2) There is no biochemical reason for such an extreme bias that could be attributed to the reactions or the kit used during library prep, and I have never seen this. If it were kit specific, you would see the same bias in all or most of the samples.

            That leads me to think that these sequences are bases from the index reads which were somehow added to the start of R2 during demultiplexing. This can happen as follows:

            1) Your libraries were run in a flow cell where other libraries had dual indices or 8-base indices, so 8 index-read cycles had to be run in all lanes of the flow cell.
            2) LT indices are 6 bases long but are sequenced for 7 cycles. LT indices are normally read up to the 7th base; with an 8-cycle index read they will be read all the way to the 8th base, as listed below (I do not know why they have used 20 different indices for 10 libraries):

            ATCACGAT
            TTAGGCAT
            ACTTGAAT
            GATCAGAT
            TAGCTTAT
            GGCTACAT
            GTGGCCTT
            GTTTCGGT
            CGATGTAT
            TGACCAAT
            ACAGTGAT
            GCCAATAT
            CAGATCAT
            CTTGTAAT
            AGTCAACT
            AGTTCCGT
            CGTACGTT
            GAGTGGAT
            ACTGATAT
            ATTCCTTT

            If, during demultiplexing, the final two bases of the index reads (assuming they ran 8 index cycles) were added to the start of your read 2, you would see AT, GT, CT or TT depending on the index used for that particular library. This would not explain the GA bias (are you sure about this?). You may look for a similar explanation for those biases if other bases from the index reads have been added to the start of R2; for example, positions 6 and 7 would add GA, CA, AA, TA, CT, GG, AC, CG, GT or TT, and so on.
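The dinucleotide sets quoted above can be rechecked with a short Python sketch. It uses the 8-base sequences listed in this post; the helper name `dinucleotides` is made up for illustration, and positions are 1-based.

```python
# The 8-base index reads listed above (6-base LT index plus two adapter bases).
EXTENDED = [
    "ATCACGAT", "TTAGGCAT", "ACTTGAAT", "GATCAGAT", "TAGCTTAT",
    "GGCTACAT", "GTGGCCTT", "GTTTCGGT", "CGATGTAT", "TGACCAAT",
    "ACAGTGAT", "GCCAATAT", "CAGATCAT", "CTTGTAAT", "AGTCAACT",
    "AGTTCCGT", "CGTACGTT", "GAGTGGAT", "ACTGATAT", "ATTCCTTT",
]


def dinucleotides(seqs, start):
    """Sorted distinct 2-base substrings beginning at 1-based position `start`."""
    return sorted({s[start - 1:start + 1] for s in seqs})


print(dinucleotides(EXTENDED, 7))  # positions 7-8: ['AT', 'CT', 'GT', 'TT']
print(dinucleotides(EXTENDED, 6))  # positions 6-7: ten distinct pairs, including 'GA'
```

On this list, positions 7 and 8 never give GA, while positions 6 and 7 give GA for ATCACG, GATCAG, ACAGTG and GAGTGG.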



            • #7
              @nucacidhunter: I am scratching my head trying to imagine how #2 is possible. Since Illumina index reads (1D or 2D) should have been read as reads independent of the real sequence, there is no way for CASAVA/bcl2fastq to add them to the beginning of R2.

              @evt8: Can you ask the sequence provider if this happened to other lanes on the flowcell?



              • #8
                Originally posted by GenoMax
                @nucacidhunter: I am scratching my head to imagine how #2 is possible. Since illumina index reads (1D or 2D) should have been read as reads independent of real sequence there is no way for CASAVA/bcl2fastq to add them to beginning of R2.

                @evt8: Can you ask the sequence provider if this happened to other lanes on the flowcell?
                It's easy to get this screwed up if you have the wrong settings in bcl2fastq and ignore the errors that pop up. Most likely the run was sequenced with 8 index cycles, yet only 6 were used during the demux. So instead of --use-bases-mask y*,i6n*,y* they used y*,i6,y*. That ended up combining the remaining index cycles with the start of R2.
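To make the cycle arithmetic behind the two masks concrete, here is a toy Python sketch. It is not bcl2fastq's parser and does not reproduce what bcl2fastq does with leftover cycles; it only counts how many cycles each mask segment accounts for. The run layout (100 + 8 + 100 cycles) and the function names are assumptions for illustration.

```python
import re


def cycles_covered(segment, read_length):
    """How many cycles one mask segment (e.g. 'i6n*') accounts for."""
    total = 0
    for _kind, count in re.findall(r"([yin])(\d+|\*)?", segment.lower()):
        if count == "*":
            # '*' extends the segment to the end of the read.
            return read_length
        total += int(count) if count else 1
    return total


def uncovered(mask, read_lengths):
    """Cycles per read that the mask leaves unaccounted for."""
    segments = mask.split(",")
    return [n - cycles_covered(seg, n) for seg, n in zip(segments, read_lengths)]


run = [100, 8, 100]  # assumed layout: R1, 8-cycle index read, R2
print(uncovered("y*,i6n*,y*", run))  # [0, 0, 0]
print(uncovered("y*,i6,y*", run))    # [0, 2, 0]
```

With the second mask, two index cycles are left over, the same number of bases as the extra dinucleotide seen at the start of R2.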



                • #9
                  I have now resolved the issue with our sequencing provider, although unfortunately not with a full explanation. They simply indicated there was a 'Bcl2fastq bug', and fixed the issue by relaunching the program. No other details were given.
                  However, I suspect the answers received in this thread (which I relayed to our provider) hit on the problem, so thank you all for your insights!
