  • You're the man! Thank you so much. The command line and --threads 8 really help when running multiple samples, and it's so much faster, in both setup and run time, than clicking through the interactive mode.

    Comment


    • Hello,

      I get the following error trying to run FastQC (v0.11.2) on some of my files:

      Code:
      fastqc --outdir Fastqc/ --noextract ctcf.cont.fq
      Started analysis of ctcf.cont.fq
      Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
              at uk.ac.babraham.FastQC.Utilities.QualityCount.<init>(QualityCount.java:13)
              at uk.ac.babraham.FastQC.Modules.PerTileQualityScores.processSequence(PerTileQualityScores.java:258)
              at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:88)
              at java.lang.Thread.run(Thread.java:662)
      I had this problem with v0.11.1 and thought updating would fix it, as memory issues were mentioned in the release notes, but I'm still getting the same problem. The files are not unusually large (around 2 GB gzipped), and other files of similar size have been fine. Any ideas?

      I can't figure out how to run FastQC so that I can specify the memory (I don't really know anything about Java). I've tried various things I found in the thread archives, along the lines of the command below, but I get errors like "Could not find the main class".

      java -Xmx500m -cp /path/to/FastQC

      Comment


      • Originally posted by liz_is View Post
        Hello,
        I can't figure out how to run Fastqc so that I can specify the memory (I don't really know anything about java). I've tried various things I found in the thread archives, along the lines of the command below, but get errors along the lines of "Could not find the main class"
        The most likely cause of this, unless your sequence file is really odd, is that for some reason the program is trying to read the whole file as a single line. We've seen this happen when a fastq file with Mac line endings (\r) is read on a Linux host: the host doesn't recognise the end of line, so it reads everything in at once and dies. If this is the case then messing around with memory settings won't help. The only immediate fix would be to uncompress the file and run mac2unix [filename] to fix the line endings.

        I guess odd things could also happen if you had some really long sequences, but they would have to be *very* long to cause problems.

        Could the line endings thing be what's happening in your case?
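
        If it helps, here's a small illustrative check for Mac-style line endings (just a sketch, not part of FastQC; the file name is only an example):

        Code:
        # Illustrative sketch (not part of FastQC): peek at the start of an
        # uncompressed FASTQ file and report which line endings it appears to use.
        # "ctcf.cont.fq" is just an example file name.
        with open("ctcf.cont.fq", "rb") as fh:
            chunk = fh.read(1_000_000)

        if b"\r" in chunk and b"\n" not in chunk:
            print("Mac (\\r) line endings - run mac2unix on the uncompressed file")
        elif b"\r\n" in chunk:
            print("Windows (\\r\\n) line endings - run dos2unix if needed")
        else:
            print("Unix (\\n) line endings - the problem is probably elsewhere")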

        Comment


        • I just tried unzipping a couple of the files and converting the line endings using mac2unix, and I get the same error for one of them. The other gives a different but presumably related error:

          Code:
          fastqc --outdir Fastqc/ --noextract ctcf.chip.fq 
          Started analysis of ctcf.chip.fq
          Exception in thread "Thread-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
                  at java.lang.String.toCharArray(String.java:2725)
          This is data from a published paper and other fastq files from the same paper have worked fine...

          I have just noticed that for these two files, at least at the top of the file, the records have quality scores that are all "B". I checked another file that did work, and that has more varied quality scores. This suggests to me there might be another problem with the files themselves.
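
          For reference, checking the quality characters can be done with a throwaway script along these lines (just a sketch; the file name and record count are only examples):

          Code:
          # Quick sketch: tally the quality characters in the first 100,000 records
          # of a FASTQ file to see how varied the scores are.
          # "ctcf.chip.fq" and the record count are just examples.
          from collections import Counter

          counts = Counter()
          with open("ctcf.chip.fq") as fh:
              for i, line in enumerate(fh):
                  if i % 4 == 3:                    # every 4th line is the quality string
                      counts.update(line.strip())
                  if i >= 4 * 100_000:
                      break

          print(counts.most_common(10))             # a file that is all "B" shows up immediately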

          Edit: my colleague tried with v0.10.1 and it finished! There are a lot of poor-quality reads... So I guess I can use an older version, but ideally I'd like to get this working.

          I also tried with subsets of the reads: with the head/tail 100,000 reads it runs fine, but taking 1 million it crashes about 20% of the way in. Taking 200,000 it says "Analysis complete for test.fq" but then also prints errors.

          Code:
          Approx 95% complete for test.fq
          Analysis complete for test.fq
          Exception in thread "Thread-2" java.lang.OutOfMemoryError: Java heap space
                  at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
                  at java.lang.StringCoding.encode(StringCoding.java:272)
                  at java.lang.StringCoding.encode(StringCoding.java:284)
                  at java.lang.String.getBytes(String.java:986)
                  at uk.ac.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:144)
                  at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:163)
                  at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:110)
                  at java.lang.Thread.run(Thread.java:662)
          Last edited by liz_is; 10-01-2014, 06:33 AM.

          Comment


          • Originally posted by liz_is View Post
            I just tried unzipping a couple of the files and converting the line endings using mac2unix, and I get the same error for one of them. [...]
            The errors being different isn't really a surprise; it's running out of memory, and the exact operation which triggers that might differ from case to case. If it's happening with 100k reads then something really weird is going on.

            Could you possibly put a file which triggers this somewhere I can see it? If I can have a look at the data which causes this I stand a better chance of getting to the bottom of it. If you don't have a site you can upload to then drop me a mail to [email protected] and I'll send you login details for an FTP server you can push to.

            Comment


            • The data is available on ENA here: http://www.ebi.ac.uk/ena/data/view/PRJEB3073

              The first couple of files (which are the CTCF chip and input) are examples of files which are giving these errors. Some of the other files in this dataset work fine though, e.g. the scc2 chip.

              Thanks!

              Comment


              • Originally posted by liz_is View Post
                The data is available on ENA here: http://www.ebi.ac.uk/ena/data/view/PRJEB3073

                The first couple of files (which are the CTCF chip and input) are examples of files which are giving these errors. Some of the other files in this dataset work fine though, e.g the scc2 chip.

                Thanks!
                That's great - I managed to download that and could reproduce the error on our cluster.

                I'll have a look now to see if I can find anything obvious, but unfortunately I'm away from the office for the rest of this week so I might not get to the bottom of this until next week when I can do some proper profiling to figure out what's going wrong on this data.

                Comment


                • Hi Simon,
                  Can you please explain the FastQC tile report in more detail?

                  I found this page:


                  I am not able to understand the meaning of
                  "This module will issue a warning if any tile shows a mean Phred score more than 2 less than the mean for that base across all tiles"

                  What is the meaning of "a mean Phred score more than 2 less than the mean for that base across all tiles"?

                  Kindly help me out.

                  Thanks

                  Comment


                  • Originally posted by srikant_verma View Post
                    Hi Simon, can you please explain the FastQC tile report in more detail? [...]
                    It means that it's looking for cases where one tile looks much worse than the other tiles on the flowcell lane for a given sequencing chemistry cycle. If you had a cycle where the average Phred score across the whole flowcell was 20, but on one particular tile the average Phred score was only 17, then this tile would be flagged up.

                    The idea is that it shouldn't matter if the whole flowcell is good or bad, but all of the tiles should look roughly the same. If one is worse than the rest then this indicates that there is a specific problem which might need to be looked at.
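
                    If it helps to see the rule written out, here's a minimal sketch of that check (this isn't FastQC's actual code, and the tile, cycle and quality numbers are made up):

                    Code:
                    # Minimal sketch of the per-tile rule described above (not FastQC's actual
                    # code; the tile and quality numbers below are invented).
                    from collections import defaultdict
                    from statistics import mean

                    # mean Phred score observed for each (cycle, tile); in practice these
                    # would be accumulated while reading the FASTQ file
                    tile_means = {
                        (0, 1101): 30.1, (0, 1102): 30.8, (0, 1103): 25.0,
                        (1, 1101): 29.0, (1, 1102): 29.5, (1, 1103): 28.2,
                    }

                    per_cycle = defaultdict(dict)
                    for (cycle, tile), m in tile_means.items():
                        per_cycle[cycle][tile] = m

                    for cycle, tiles in sorted(per_cycle.items()):
                        overall = mean(tiles.values())    # mean across all tiles for this cycle
                        for tile, m in tiles.items():
                            if m < overall - 2:           # "more than 2 less than the mean"
                                print(f"WARN: tile {tile}, cycle {cycle}: {m:.1f} vs {overall:.1f}")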

                    Comment


                    • Thanks Simon...

                      Comment


                      • Originally posted by liz_is View Post
                        The data is available on ENA here: http://www.ebi.ac.uk/ena/data/view/PRJEB3073

                        The first couple of files (which are the CTCF chip and input) are examples of files which are giving these errors. Some of the other files in this dataset work fine though, e.g the scc2 chip.

                        Thanks!
                        Hi Liz - sorry for taking a while to have a proper look at this; other things have been getting in the way. I've tracked down the problem: it's the per-tile quality module which was causing the runaway memory usage (which is why it worked in the old version, since that module wasn't there).

                        The problem seems to be that these files use a variant of the Illumina header format, which is close enough to the ones we've seen before that the program tries to parse it, but then the field it extracts for the tile number is wrong and it predicts an enormous number of tiles, which makes everything die!

                        The formats we've seen before are either:

                        Code:
                        @HWI-1KL136:211:D1LGAACXX:1:1101:18518:48851 3:N:0:ATGTCA
                        ..where the 4th field is the tile, or

                        Code:
                        @HWUSI-EAS493_0001:2:1:1000:16900#0/1
                        ..where the second field is the tile.

                        The ids in the file you found looked like:

                        Code:
                        @HWI-EAS212_1:8:1:4130:3711:0:1
                        ..where the format should be like my second example, except that the # and / have been replaced by :, which makes FastQC treat it like the first variant and pull out the wrong field.
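
                        Just to make that concrete, here's a rough sketch of the kind of parsing involved. This is not FastQC's actual parser, and telling the formats apart by counting colon-separated fields is only one plausible heuristic, used here for illustration:

                        Code:
                        # Rough sketch, not FastQC's actual parser: pull a "tile" number out of a
                        # read id, distinguishing the formats by the number of colon-separated
                        # fields (an illustrative heuristic only).
                        def guess_tile(header: str) -> int:
                            fields = header.lstrip("@").split(":")
                            if len(fields) >= 7:
                                # Newer-style id, e.g. @HWI-1KL136:211:D1LGAACXX:1:1101:...
                                # where 1101 is the tile
                                return int(fields[4])
                            # Older-style id, e.g. @HWUSI-EAS493_0001:2:1:1000:16900#0/1
                            # where the tile is the 1 after the lane
                            return int(fields[2])

                        print(guess_tile("@HWI-1KL136:211:D1LGAACXX:1:1101:18518:48851 3:N:0:ATGTCA"))  # 1101
                        print(guess_tile("@HWUSI-EAS493_0001:2:1:1000:16900#0/1"))                      # 1
                        # The problem ids have seven colon-separated fields, so they fall into the
                        # first branch and a read coordinate comes back as the "tile":
                        print(guess_tile("@HWI-EAS212_1:8:1:4130:3711:0:1"))                             # 3711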

                        The quick fix is that if you edit the limits.txt file in your FastQC installation (in the Configuration directory) you can turn off the per-tile quality module, and you should then be able to process these files.

                        Does anyone here know if this format is something which is actually generated by an Illumina sequencer, or is it something an individual or maybe the ENA have done to the file? I can add a quick fix to just abandon the module if too many tiles are predicted, but if this is a format which might be more generally around then I should try to cope with it properly.

                        Cheers

                        Simon.
                        Last edited by simonandrews; 10-09-2014, 04:49 AM. Reason: Added code tags to remove smilies from illumina ids!

                        Comment


                        • Thanks for the reply.

                          I've tried what you suggested but it doesn't help! I've tried both specifying a limits file using --limits and editing 'limits.txt' in the Configuration directory of the installed FastQC to include the line
                          Code:
                          tile                            ignore          1
                          I think that the change in the configuration isn't working to stop the per-tile module being used, as the error message still makes reference to it:

                          Code:
                          Started analysis of ctcf.cont.fq
                          Exception in thread "Thread-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
                                  at uk.ac.babraham.FastQC.Modules.PerTileQualityScores.processSequence(PerTileQualityScores.java:258)
                                  at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:88)
                                  at java.lang.Thread.run(Thread.java:745)

                          Comment


                          • Originally posted by liz_is View Post
                            Thanks for the reply.

                            I've tried what you suggested but it doesn't help! I've tried both specifying a limits file using --limits and editing 'limits.txt' in the Configuration directory of the installed FastQC to include the line
                            Code:
                            tile                            ignore          1
                            Aaargh - I'd forgotten that one of the other pending fixes for the next release was that the disable didn't work for the per-tile module (it is actually disabled if you turn off the adapter module, as it was reading the wrong parameter).

                            I've just put up a development snapshot at http://www.bioinformatics.babraham.a...11.3_devel.zip which contains the fix for both of these issues. You should be able to use that to process these files.

                            Comment


                            • Thanks, that version is working fine now!

                              Comment


                              • Kmer overrepresentation and per base sequence content in Nextera XT libraries

                                Hi all,
                                After reading around on the forums and elsewhere on the internet, it seems that weird results for Kmer overrepresentation and per base sequence content are common when running FastQC on Nextera XT libraries.

                                The data I have here are sequencing data (MiSeq V3, 300 bp reads) of mitochondrial genomes from wheat. The Nextera XT libraries were prepared from purified organellar DNA (~450 kb genome) so the coverage is really high (~400X after trimming).

                                The files with the no_trim prefix are the raw data. You can see that the "per base sequence content" looks weird for the first few bases. Also, the Kmer content is high in the first few bases. I have tried BLASTing these sequences and get no hits. The "Sequence Duplication Levels" are high, most likely because of the high coverage of a small genome. I suspect this because another library I sequenced has only 60X coverage and the duplication levels are fine.

                                The files with the trim prefix are the trimmed data. The data were quality- and length-trimmed (min. length 250 bp) with Trimmomatic. Unfortunately the trimming did not make a difference in the per base content or the Kmer overrepresentation.

                                My question is, will this matter for mapping and assembly? I plan on mapping these reads to already available mitochondrial genomes, as well as performing de novo assembly with Geneious.

                                Thanks in advance for any suggestions you all may have!
                                Attached Files

                                Comment
