  • MiSeq producing various length reads

    Hello all

    I'm processing a micro-RNA-seq experiment for a collaborator of ours and see something very unusual. They sequenced three samples on a MiSeq with an expected read length of 51. Instead, I see lots of reads that are all N (NNNNNNNNNNNNNN) with a length of 20-21, and quite a few reads of intermediate length too.

    This is very unusual - do you have any idea about why it might have happened?

  • #2
    Are you saying that the reads contain actual N's, or just that they are shorter than 51 bp?

    If there are N's then that may indicate a failure of basecalling. It could be due to overloading. Generally sequencing facilities will not release this kind of data.

    If that is a result of some sort of post-run data processing (where they replaced the adapter sequences with N's for example, don't know if BaseSpace does something like that) then you would need to ask. If you ignore/strip the N's is the rest of the data good quality?
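
    One quick way to answer that question is to tabulate the reads by length and by N content before aligning anything. The following is a minimal Python sketch, assuming an uncompressed FASTQ with four lines per record; the function name `summarize_fastq` and the three example reads are hypothetical and not taken from the run discussed here.

```python
from collections import Counter
from io import StringIO


def summarize_fastq(handle):
    """Count reads by length and by N content from a 4-line-per-record FASTQ stream."""
    lengths = Counter()
    kinds = Counter()
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line (not needed for this summary)
        lengths[len(seq)] += 1
        n_count = seq.upper().count("N")
        if seq and n_count == len(seq):
            kinds["all_N"] += 1   # fully masked or failed read
        elif n_count > 0:
            kinds["some_N"] += 1  # partially masked read
        else:
            kinds["no_N"] += 1
    return lengths, kinds


# Hypothetical example records: one all-N read, one clean read, one partially masked read.
example = StringIO(
    "@r1\n" + "N" * 20 + "\n+\n" + "#" * 20 + "\n"
    "@r2\n" + "TGAGGTAGTAGGTTGTATAGTT" + "\n+\n" + "I" * 22 + "\n"
    "@r3\n" + "ACGTNNNN" + "\n+\n" + "IIII####" + "\n"
)
lengths, kinds = summarize_fastq(example)
print(dict(lengths))
print(dict(kinds))
```

    For a real file, pass `open("sample.fastq")` (or `gzip.open(path, "rt")` for a compressed file) instead of the `StringIO` example.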


    • #3
      There are a bunch of all-N reads that are 20 bp long, and there are a bunch of other reads that are not all N but have variable lengths. I'll try to align them to see if they at least look like micro-RNA, but the thing is, you need to clip the adapters, and that is hard to do on variable-length reads.

      It does not look like the cell is overloaded from FastQC report though. It looks like there's a small bubble there but that's all.

      It was not a sequencing facility that did it - just a small institute that ran it on their MiSeq. So they totally might have done something wrong there; they don't often run it for this sort of library - mostly they sequence strains of viruses.


      • #4
        Reads don't come off the machine with variable length unless you set the Illumina software to trim the adapters during base-calling or demultiplexing or something (not sure exactly when it happens), or they've been postprocessed in some way. You should ask how the data was generated, or better yet, see if you can get the raw fastq data.


        • #5
          Those were supposed to be raw fastq. But you are right, I was thinking along the same lines. I'll just come over and get the data from the device myself.


          • #6
            I've seen this with short small RNA libraries when using MiSeq Reporter to demultiplex with automatic adapter trimming.

            To fix this you can re-demultiplex the run with bcl2fastq, or remove the adapter sequences from your sample sheet and re-demultiplex with MiSeq Reporter. Then just trim the adapters yourself.
            Josh Kinman
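
            Trimming the adapters yourself can be sketched in a few lines of Python. This is only an illustration, assuming exact matching with no mismatches allowed; the function name `trim_3prime_adapter`, the `min_overlap` parameter, and the example read are hypothetical. The adapter shown is the sequence commonly quoted for the Illumina TruSeq Small RNA 3' adapter, which is an assumption, since the thread does not say which kit was used.

```python
def trim_3prime_adapter(seq, qual, adapter, min_overlap=8):
    """Cut a read at the first exact adapter match, or at an adapter prefix
    of at least `min_overlap` bases hanging off the 3' end of the read."""
    pos = seq.find(adapter)
    if pos == -1:
        # No full adapter found: look for the longest adapter prefix ending the read.
        longest = min(len(adapter) - 1, len(seq))
        for overlap in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:overlap]):
                pos = len(seq) - overlap
                break
    if pos == -1:
        return seq, qual  # no adapter detected, read left untouched
    return seq[:pos], qual[:pos]


# Assumed adapter sequence (TruSeq Small RNA 3' adapter); check your library prep kit.
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"

# Hypothetical 51 bp read: 22 nt insert, then the adapter, then 8 further bases.
insert = "TGAGGTAGTAGGTTGTATAGTT"
read = insert + ADAPTER + "AACTCCAG"
qual = "I" * len(read)

trimmed_seq, trimmed_qual = trim_3prime_adapter(read, qual, ADAPTER)
print(trimmed_seq, len(trimmed_seq))
```

            Real data contains sequencing errors, so in practice a dedicated trimmer that tolerates mismatches (BBDuk, mentioned later in this thread, or cutadapt) is a better choice than exact matching.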


            • #7
              Originally posted by apredeus
              Those were supposed to be raw fastq. But you are right, I was thinking along the same lines. I'll just come over and get the data from the device myself.
              If you can't get the raw data or can't get the facility to re-run the analysis then just trim the N's off. One can safely assume that Illumina would know how to identify their own adapter sequences. It sounds like they are masked by the default demux process.

              @Brian: What is an easy way to trim those N's using BBMap? I should add this to my BBMap tricks thread.


              • #8
                Originally posted by GenoMax
                If you can't get the raw data or can't get the facility to re-run the analysis then just trim the N's off. One can safely assume that Illumina would know how to identify their own adapter sequences. It sounds like they are masked by the default demux process.
                I couldn't find this info for MiSeq Reporter, but did see this in the bcl2fastq guide:

                --mask-short-adapter-reads arg (=22) smallest number of remaining bases (after masking bases below the minimum trimmed read length) below which whole read is masked

                So it looks possible that the adapters are being correctly identified, but the read remaining after trimming is shorter than 22 bp and so is being masked with N's.

                Since this is micro-RNA, I think it is worth trying to re-demultiplex without adapter trimming, or to change this variable, in order to unmask these reads instead of removing them. Doing this has worked for me when sequencing small RNA libraries on the MiSeq.
                Josh Kinman
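
                As a rough illustration of changing that variable, the sketch below only builds and prints a bcl2fastq command line; it does not run anything. It assumes the bcl2fastq2 v2 option names (`--runfolder-dir`, `--output-dir`, `--sample-sheet`, `--minimum-trimmed-read-length`, `--mask-short-adapter-reads`), so check `bcl2fastq --help` for your version. The helper name `build_bcl2fastq_command`, the paths, and the choice of 0 for both thresholds are assumptions for the example.

```python
import shlex


def build_bcl2fastq_command(run_folder, output_dir, sample_sheet,
                            min_trimmed_read_length=0,
                            mask_short_adapter_reads=0):
    """Return a bcl2fastq argument list with the short-read masking thresholds lowered."""
    return [
        "bcl2fastq",
        "--runfolder-dir", run_folder,
        "--output-dir", output_dir,
        "--sample-sheet", sample_sheet,
        # Shortest read length kept after adapter trimming (assumed value: 0).
        "--minimum-trimmed-read-length", str(min_trimmed_read_length),
        # Threshold below which the whole read is masked (default 22; assumed value: 0).
        "--mask-short-adapter-reads", str(mask_short_adapter_reads),
    ]


# Placeholder paths; replace them with the real run folder and sample sheet.
cmd = build_bcl2fastq_command(
    "/path/to/run_folder",
    "/path/to/fastq_out",
    "/path/to/SampleSheet.csv",
)
print(shlex.join(cmd))
```

                To actually run the conversion, the same list could be passed to `subprocess.run(cmd, check=True)` on a machine where bcl2fastq is installed.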


                • #9
                  From MiSeq Reporter User Guide

                  Masking Short Reads
                  MiSeq Reporter includes a setting that prevents reads that have been almost entirely trimmed or masked from confounding downstream analysis, which is based on the following criteria:
                  • If the adapter is encountered within the first 32 bases of the read, the adapter sequence is N-masked.
                  • If the adapter is identified in the first 32 bases and the read includes ten or more bases from the start of the adapter, the entire read is N-masked. This ten-base limit is controlled by the configuration setting NMaskShortAdapterReads.
                  Josh Kinman


                  • #10
                    Originally posted by GenoMax
                    One can safely assume that Illumina would know how to identify their own adapter sequences.
                    I'd like to think so...

                    What is an easy way to trim those N's using BBMap? I should add this to my BBMap tricks thread.
                    You can use BBDuk or Reformat with "qtrim=rl trimq=1". That will only trim leading and trailing bases with a Q-score below 1, which means Q0, which means N (in either fasta or fastq format). The BBMap package automatically changes the q-scores of Ns that are above 0 to 0, and of called bases with q-scores below 2 to 2, since occasionally some Illumina software versions produce odd things like a handful of Q0 called bases or Ns with Q>0, neither of which makes any sense in the Phred scale.
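
                    For readers without the BBMap package at hand, the effect on N's can be approximated in plain Python. This is a minimal sketch, not BBDuk's implementation: it trims by base identity (N) rather than by quality score, and the function name `trim_terminal_n` is hypothetical.

```python
def trim_terminal_n(seq, qual):
    """Remove runs of N from both ends of a read, keeping the quality string aligned.
    N's inside the read are left in place."""
    start = 0
    end = len(seq)
    while start < end and seq[start] in "Nn":
        start += 1  # skip leading N's
    while end > start and seq[end - 1] in "Nn":
        end -= 1    # skip trailing N's
    return seq[start:end], qual[start:end]


# Hypothetical reads: masked at both ends, fully masked, and with an internal N.
print(trim_terminal_n("NNACGTNNN", "##IIII###"))
print(trim_terminal_n("NNNN", "####"))
print(trim_terminal_n("ACNGT", "IIIII"))
```

                    A fully masked read comes back empty, so a caller would normally drop such reads before alignment.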

                    @jdk787, thanks for posting the specific details of what's going on. Looks like defaults that make sense in many cases but not for small RNAs.


                    • #11
                      Originally posted by apredeus
                      Hello all

                      I'm processing a micro-RNA-seq experiment for a collaborator of ours and see something very unusual. They sequenced three samples on a MiSeq with an expected read length of 51. Instead, I see lots of reads that are all N (NNNNNNNNNNNNNN) with a length of 20-21, and quite a few reads of intermediate length too.

                      This is very unusual - do you have any idea about why it might have happened?
                      Hello Alexander, have you found the cause of this problem yet?
                      I have the same problem with my latest sequencing data: read 1 is supposed to be 41 bp long, but the actual length varies from 35 bp to 41 bp, and some of the reads are poly-N!


                      • #12
                        Are your sequences adapter masked or are there genuine N's (no calls)?


                        • #13
                          Originally posted by GenoMax
                          Are your sequences adapter masked or are there genuine N's (no calls)?
                          I think these are masked adapter sequences, but I did not perform the sequencing experiment myself; I only process the raw fastq data.


                          • #14
                            Originally posted by agent_pilin
                            Hello Alexander, have you found the cause of this problem yet?
                            I have the same problem with my latest sequencing data: read 1 is supposed to be 41 bp long, but the actual length varies from 35 bp to 41 bp, and some of the reads are poly-N!
                            Hello,

                            I don't quite remember since it was a long time ago, but I'm pretty sure this happened because the Illumina software was confused by the adapter and the short read sequence. So you would need to get the untrimmed sequences. If these are not available, get the BCL files and convert them to fastq yourself.


                            • #15
                              Originally posted by apredeus
                              Hello,

                              I don't quite remember since it was a long time ago, but I'm pretty sure this happened because the Illumina software was confused by the adapter and the short read sequence. So you would need to get the untrimmed sequences. If these are not available, get the BCL files and convert them to fastq yourself.
                              Thank you for your answer, it's a good idea!
                              Stanislav
