
  • Originally posted by Marisa_Miller
    Hi all,

    My question is, will this matter for mapping and assembly?
    Most likely no.

    Worrisome part is "blasted some sequences but got no hits". What DB was that against? Are you sure you have the right data?

    Comment


    • Originally posted by Marisa_Miller
      Hi all,
      After reading around on the forums and elsewhere on the internet, it seems like seeing weird results for Kmer overrepresentation and per base sequence content after running FastQC on Nextera XT libraries is common.
      Hi Marisa,

      I believe the issue with Nextera kits is similar to the one you get with RNA-Seq libraries in that there is a random priming step which creates some selection bias and results in the types of distribution you see in the FastQC reports.

      For the RNA-Seq problem there is some literature about this, and I suspect something very similar will come from the Nextera fragmentation.

      You wouldn't expect that the 5' bias is removed by trimming since the bases are not of poor quality and they will be correct sequence from your library. Trimming them would not help with the library, nor would it remove the bias (you'll end up with sequences slightly downstream of biased positions instead of over them).

      The way to tell whether the bias is adversely affecting your results would be to look at the evenness of coverage of your genome in your mapped data. For RNA-Seq, although you see priming bias in the library you still get pretty even coverage of transcripts, so the practical effect doesn't seem to be too bad.
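One rough way to put a number on evenness of coverage is the coefficient of variation of the per-position depth. This is a minimal sketch, assuming the per-base depths have already been read into a list (for example from the third column of `samtools depth -a` output). The function name and the example depths are invented for illustration.

```python
from statistics import mean, pstdev

def coverage_cv(depths):
    """Return the coefficient of variation (stdev / mean) of per-base depths."""
    avg = mean(depths)
    if avg == 0:
        raise ValueError("mean depth is zero; nothing mapped")
    return pstdev(depths) / avg

# Invented depths: both lists have the same mean depth (30x).
even = [30, 31, 29, 30, 30, 31, 29, 30]
uneven = [5, 60, 2, 80, 1, 55, 3, 34]
print(f"even:   {coverage_cv(even):.2f}")
print(f"uneven: {coverage_cv(uneven):.2f}")
```

A value close to 0 means the depth is nearly uniform. What counts as "too uneven" depends on the genome and the application, so the number is best used to compare libraries rather than as an absolute threshold.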

      It would be great if it were possible to remove this bias at the library prep stage, but for now this is an effect everyone seems to get, and one which doesn't seem to have too detrimental an effect on your results.

      Simon.

      Comment


      • Originally posted by GenoMax
        Most likely no.

        Worrisome part is "blasted some sequences but got no hits". What DB was that against? Are you sure you have the right data?
        I tried blasting the short heptamers against the Nucleotide collection (nr/nt) database. Not sure why I don't get any hits back.

        Comment


        • Originally posted by simonandrews
          Hi Marisa,

          I believe the issue with Nextera kits is similar to the one you get with RNA-Seq libraries in that there is a random priming step which creates some selection bias and results in the types of distribution you see in the FastQC reports.

          For the RNA-Seq problem there is some literature about this, and I suspect something very similar will come from the Nextera fragmentation.

          You wouldn't expect that the 5' bias is removed by trimming since the bases are not of poor quality and they will be correct sequence from your library. Trimming them would not help with the library, nor would it remove the bias (you'll end up with sequences slightly downstream of biased positions instead of over them).

          The way to tell whether the bias is adversely affecting your results would be to look at the evenness of coverage of your genome in your mapped data. For RNA-Seq, although you see priming bias in the library you still get pretty even coverage of transcripts, so the practical effect doesn't seem to be too bad.

          It would be great if it were possible to remove this bias at the library prep stage, but for now this is an effect everyone seems to get, and one which doesn't seem to have too detrimental an effect on your results.

          Simon.
          Hi Simon,
          Thanks so much for the information. After mapping I will take a look at the coverage distribution to see if the bias is causing an issue. I'm just glad that the problem shouldn't interfere with mapping and assembly too much.

          Thanks again,
          Marisa

          Comment


          • Originally posted by Marisa_Miller
            I tried blasting the short heptamers against the Nucleotide collection (nr/nt) database. Not sure why I don't get any hits back.
            Heptamers are too short for blast searches. You should search with (a few full length) read(s) to get results.

            http://www.ncbi.nlm.nih.gov/blast/Why.shtml

            Short sequences (less than 20 bases) will often not find any significant matches to the database entries under the standard nucleotide-nucleotide BLAST settings. The usual reasons for this are that the significance threshold governed by the expect value parameter is set too stringently and the default word size parameter is set too high.
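A back-of-the-envelope estimate shows why a heptamer cannot give a significant hit. The sketch below assumes a random database with equal base frequencies, which real sequence is not, and the database size is a made-up round number rather than the actual size of nr/nt.

```python
def expected_chance_matches(query_len, db_bases):
    """Expected exact matches of one specific query on one strand of a
    random sequence with equal base frequencies (a simplifying assumption)."""
    positions = db_bases - query_len + 1
    return positions / 4 ** query_len

db_bases = 10**9  # hypothetical database size, for illustration only
for k in (7, 20, 100):
    print(k, expected_chance_matches(k, db_bases))
```

With a billion bases, any given 7-mer is expected tens of thousands of times by chance alone, so an exact match carries essentially no information. A full-length read does not have this problem.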

            Comment


            • Originally posted by GenoMax
              Heptamers are too short for blast searches. You should search with (a few full length) read(s) to get results.

              http://www.ncbi.nlm.nih.gov/blast/Why.shtml
              I didn't realize that heptamers would be too short to blast. I have searched with a few full reads and they all come up as wheat or wheat wild relative mitochondrial or plastid sequence, which is exactly what they should be since the libraries were prepared from purified organellar DNA from wheat.

              Comment


              • Perfect. You can move on to the analysis (do run your sequences through bbduk or trimmomatic to ensure that there are no remnants of adapters etc).

                Comment


                • Originally posted by Marisa_Miller
                  I didn't realize that heptamers would be too short to blast. I have searched with a few full reads and they all come up as wheat or wheat wild relative mitochondrial or plastid sequence, which is exactly what they should be since the libraries were prepared from purified organellar DNA from wheat.
                  Hi Marisa:

                  I faced a similar problem some time ago and, like you, I was puzzled by the weird Kmer content.

                  What I did was check (just by eye) whether the Kmers overlapped, and then I searched for the consensus sequence, which turned out to match Illumina sequences. I don't have the exact information right now but, as I recall, these sequences aligned with the sequences from "Process Controls for TruSeq Sample Preparation Kits". Illumina has a PDF on the web, illumina-customer-sequence-letter, describing these sequences.
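The overlap check can also be done in a few lines of code. This is a minimal sketch, not what was run at the time: it assumes the k-mers all come from one contiguous sequence and greedily chains each one onto the growing sequence. The k-mers below are invented for illustration.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def chain_kmers(kmers, min_len=4):
    """Greedily extend the first k-mer to the right or left with the others.

    Returns the merged sequence and any k-mers that could not be attached.
    """
    contig = kmers[0]
    remaining = list(kmers[1:])
    extended = True
    while remaining and extended:
        extended = False
        for k in remaining:
            right = overlap(contig, k, min_len)  # k continues the contig
            left = overlap(k, contig, min_len)   # k precedes the contig
            if right:
                contig += k[right:]
            elif left:
                contig = k[:-left] + contig
            else:
                continue
            remaining.remove(k)
            extended = True
            break  # restart the scan against the longer contig
    return contig, remaining

# Invented 7-mers, listed out of order.
kmers = ["GTTGCAA", "TGCAAGG", "ACGTTGC", "GCAAGGC"]
print(chain_kmers(kmers))
```

The merged sequence can then be compared against known adapter and control sequences. Short or low-complexity overlaps can chain k-mers that do not really belong together, so the result is worth a sanity check.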

                  I found that these sequences were over-represented in the library, which is why FASTQC was warning about Kmer content.

                  I finally removed those sequences from my fastq files and then performed the assembly. They *should not* have affected my assembly, as they matched only the adapters AND the identity % between them was very high. But just in case, I removed them.

                  Maybe that can help you solve this issue!

                  Cheers,

                  Gabriel

                  Comment


                  • Originally posted by gab0
                    Hi Marisa:

                    I faced a similar problem some time ago and, like you, I was puzzled by the weird Kmer content.

                    What I did was check (just by eye) whether the Kmers overlapped, and then I searched for the consensus sequence, which turned out to match Illumina sequences. I don't have the exact information right now but, as I recall, these sequences aligned with the sequences from "Process Controls for TruSeq Sample Preparation Kits". Illumina has a PDF on the web, illumina-customer-sequence-letter, describing these sequences.

                    I found that these sequences were over-represented in the library, which is why FASTQC was warning about Kmer content.

                    I finally removed those sequences from my fastq files and then performed the assembly. They *should not* have affected my assembly, as they matched only the adapters AND the identity % between them was very high. But just in case, I removed them.

                    Maybe that can help you solve this issue!

                    Cheers,

                    Gabriel
                    Hi Gabriel,
                    Thanks for the info! I will do as you suggest and see if I can overlap those sequences to find out what they are.

                    Thanks!

                    Comment


                    • Originally posted by simonandrews
                      ...
                      Does anyone here know if this format is something which is actually generated by an Illumina sequencer, or is it something an individual or maybe the ENA have done to the file? I can add a quick fix to just abandon the module if too many tiles are predicted, but if this is a format which might be more generally about then I should try to cope with this properly.
                      ...
                      I have to come back to the bug regarding the tile module. I just ran a simulation of RNA-Seq with the Flux Simulator. It puts out reads with identifiers that at first glance look like the ones from Illumina (numbers, separated by ':') but have completely different meanings. http://sammeth.net/confluence/displa...ad+Identifiers

                      I simply renamed the reads afterwards using sed, but in general it would be good to have some kind of mechanism like the one you proposed, which skips the tile module when there are too many tiles. The --limits option also seems good (as soon as it works), but I had a hard time figuring out why the error occurred. So an automatic "overflow" detection would be good, maybe with a warning in the log file or error output.
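A sketch of the kind of overflow check suggested here. It is not FastQC's actual code: it assumes the newer Illumina header layout, in which the tile is the fifth colon-separated field, and the limit of 2500 tiles is an arbitrary number picked for illustration. The example headers are invented and do not reproduce the Flux Simulator's real identifier format.

```python
def count_tiles(headers, max_tiles=2500):
    """Collect tile IDs from read headers; return None once the limit is exceeded.

    Assumes headers like '@inst:run:flowcell:lane:tile:x:y ...' (tile = field 5).
    """
    tiles = set()
    for header in headers:
        fields = header.split(" ")[0].split(":")
        if len(fields) < 7:
            continue  # not in the expected layout; ignore this read
        tiles.add(fields[4])
        if len(tiles) > max_tiles:
            # Far too many distinct "tiles": the field probably means something else.
            return None
    return len(tiles)

# Invented headers: four real-looking tiles versus a field that changes on every read.
real = [f"@M01:12:000000000-A1B2C:1:{1101 + i % 4}:{i}:{i} 1:N:0:1" for i in range(100)]
fake = [f"@chr1:{i}:{i + 200}:W:{i}:{i}:{i}" for i in range(5000)]
print(count_tiles(real))
print(count_tiles(fake))
```

A caller could treat `None` as the signal to skip the per-tile analysis and write a warning, instead of failing with an error that is hard to trace back to the read names.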

                      Comment


                      • Color Codes of Per Tile Sequence Quality

                        I had a run that clearly had a flow cell problem, in that the per tile sequence quality degraded starting around cycle 35 for a cluster of tiles. In the plot attached, there are 6 clusters, but these tiles are all actually contiguous (and on both sides of the flow cell).

                        What I'm trying to understand is how to interpret the colors, i.e., how poor is the quality for green/yellow/orange/red? When I look at the fastqc_data file that's generated, there's a "Mean" value for each tile and base, but I'm confused as to what this is a mean of. The numbers don't make sense to me as quality scores. For instance, for base 1 the mean scores range from -1.6 to 0.73852 (average is 0) for the 96 tiles, so this seems to be some sort of variance measure?

                        thanks for any help in interpretation!

                        Comment


                        • Originally posted by sahodges
                          I had a run that clearly had a flow cell problem, in that the per tile sequence quality degraded starting around cycle 35 for a cluster of tiles. In the plot attached, there are 6 clusters, but these tiles are all actually contiguous (and on both sides of the flow cell).

                          What I'm trying to understand is how to interpret the colors, i.e., how poor is the quality for green/yellow/orange/red? When I look at the fastqc_data file that's generated, there's a "Mean" value for each tile and base, but I'm confused as to what this is a mean of. The numbers don't make sense to me as quality scores. For instance, for base 1 the mean scores range from -1.6 to 0.73852 (average is 0) for the 96 tiles, so this seems to be some sort of variance measure?

                          thanks for any help in interpretation!
                          The tiles on the FastQC plot are sorted by number, so discontiguous numbers might actually be adjacent on the actual flowcell. You'd need to look at the Illumina documentation for the flowcell version you used (there's more than one).

                          The numbers you get are Phred score differences from the mean Phred score for that cycle of sequencing. For example if you had a chemistry cycle where the mean Phred score across all tiles was 30 and in tile 1234 the average Phred score was 25 then the per-tile measure for that tile would be -5.
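In code, the calculation described above looks roughly like this. It is a sketch only, not FastQC's own implementation: the tile numbers and scores are invented, and the cycle mean is taken here as the plain average of the per-tile means.

```python
from statistics import mean

def per_tile_deviation(tile_means):
    """Map each tile to (its mean Phred score - mean over all tiles) for one cycle."""
    cycle_mean = mean(tile_means.values())
    return {tile: score - cycle_mean for tile, score in tile_means.items()}

# Invented per-tile mean Phred scores for a single cycle; their average is 30.
cycle_35 = {1101: 32.0, 1102: 31.0, 1103: 32.0, 1234: 25.0}
for tile, dev in per_tile_deviation(cycle_35).items():
    print(tile, dev)
```

Tile 1234 comes out at -5, as in the example above, and the deviations for a cycle always average to 0, which is why the values in fastqc_data do not look like quality scores.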

                          The colour scale goes from 0 down to 0 minus whatever error threshold you have set in your config file (limits.txt), which is 10 by default.

                          Hope this helps.

                          Comment


                          • Yes that does help. Thanks very much! Really appreciate your fast response too!

                            I had figured out the tile location coding - the bad tiles are, in fact, all adjacent.

                            Comment


                            • FastQC Exception in thread "main" java.awt.HeadlessException

                              Hi,
                              A quick question before Christmas, I get the following error:

                              java -version

                              java version "1.6.0_33"
                              OpenJDK Runtime Environment (IcedTea6 1.13.5) (6b33-1.13.5-1ubuntu0.12.04)
                              OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

                              ./fastqc

                              Exception in thread "main" java.awt.HeadlessException
                              at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:173)
                              at java.awt.Window.<init>(Window.java:547)
                              at java.awt.Frame.<init>(Frame.java:419)
                              at java.awt.Frame.<init>(Frame.java:384)
                              at javax.swing.JFrame.<init>(JFrame.java:174)
                              at uk.ac.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:71)
                              at uk.ac.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:324)

                              Thanks,
                              Chris

                              Comment


                              • @chris: Simon will no doubt swing by, but it appears that you are trying to run fastqc in interactive mode on a server where X11 is not set up or not present.

                                You can get around this by running fastqc non-interactively. Just specify the name of the sequence file(s) on the command line, like this:

                                Code:
                                $ fastqc seq_file1 seq_file2

                                Comment
