
  • Originally posted by ganygan25 View Post
    I am using FastQC as part of a workflow analysis pipeline and running it from the command line. A single workflow produces numerous FASTQ files. I note from the documentation that FastQC takes several filenames as arguments and processes them in a single run.

    cmd:- fastqc filename1.fq filename2.fq filename3.fq

    How does the above command scale to a large number of files? Is it better than running the analysis for each file separately?
    It will be fine. It runs the files sequentially so it's exactly the same as doing

    for i in *fq; do fastqc "$i" ; done

    However, you can add the -t parameter to say how many files can be processed in parallel which will allow you to spread the load across multiple cores, which is probably the most efficient way to get through a large batch of files (and is what we do here).
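
For anyone scripting this, here is a minimal Python sketch of the batch invocation described above. It assumes `fastqc` is on the PATH; `build_fastqc_command` and `run_fastqc_batch` are hypothetical helper names, not part of FastQC. The only FastQC option used is the `-t` parameter mentioned in the reply.

```python
import subprocess


def build_fastqc_command(files, threads=1):
    """Return the argument list for a single FastQC run over all files."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # -t sets how many files FastQC may process in parallel.
    return ["fastqc", "-t", str(threads), *files]


def run_fastqc_batch(files, threads=1):
    """Run FastQC once over the whole batch (assumes fastqc is on the PATH)."""
    subprocess.run(build_fastqc_command(files, threads), check=True)
```

Passing the files as a list avoids shell quoting problems with unusual filenames, which is the same reason for quoting `$i` in a shell loop.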



    • Many Thanks



      • Simon, I have a feature suggestion for FastQC. As you know, when analysing bisulfite-sequencing data the nucleotide distribution is very different to standard DNA-seq, with only about 1% of sequenced nucleotides being cytosines. This has ramifications for the "Kmer Content" module, since any kmer involving multiple C's has a very low expected count and thus the results of this module are flooded by kmers involving multiple C's which have a massive observed/expected ratio. This makes the Kmer Content output for bisulfite sequencing data difficult to interpret (compared to the other FastQC modules).

        Might it be possible to have a --bisulfite mode that excludes kmers involving cytosines when computing the Kmer Content module? Alternatively (and perhaps more useful), could the results be stratified by whether the kmer contains a cytosine, with a plot and table for each case? I'm unsure how difficult this would be to implement and whether it might need further tweaking, e.g. to exclude kmers involving "CG" since these are likely due to methylation but to retain kmers that involve "CA" since these are perhaps more likely to be artefacts [failed bisulfite conversion, adaptor sequence, etc.] than real methylation.

        What do you think?
        Pete



        • Originally posted by PeteH View Post
          Simon, I have a feature suggestion for FastQC. As you know, when analysing bisulfite-sequencing data the nucleotide distribution is very different to standard DNA-seq, with only about 1% of sequenced nucleotides being cytosines. This has ramifications for the "Kmer Content" module, since any kmer involving multiple C's has a very low expected count and thus the results of this module are flooded by kmers involving multiple C's which have a massive observed/expected ratio. This makes the Kmer Content output for bisulfite sequencing data difficult to interpret (compared to the other FastQC modules).

          Might it be possible to have a --bisulfite mode that excludes kmers involving cytosines when computing the Kmer Content module? Alternatively (and perhaps more useful), could the results be stratified by whether the kmer contains a cytosine, with a plot and table for each case? I'm unsure how difficult this would be to implement and whether it might need further tweaking, e.g. to exclude kmers involving "CG" since these are likely due to methylation but to retain kmers that involve "CA" since these are perhaps more likely to be artefacts [failed bisulfite conversion, adaptor sequence, etc.] than real methylation.

          What do you think?
          Pete
          Pete,

          This is a generic problem with the assumptions made in the Kmer analysis. The basic problem is that we take global composition values and then assume that these are evenly distributed over the whole dataset. In reality poorly represented bases tend to occur in clumps, which get assigned a very low probability of occurring by chance (which would be right if bases were randomly chosen), and therefore get picked out as significantly enriched even if they're happening at fairly low levels.

          I don't really want to include a specific 'bisulphite' mode since I'm generally wary of application (or technology) specific modifications, and since bisulphite is just an exemplar of a wider problem.

          I guess one way to fix this would be to calculate two p-values for each Kmer. Have one based on the actual observed distribution of bases and a second based on the GC content of the library (so the probabilities of G and C are averaged), or even on a flat distribution of bases. You could then have a low level filter on the GC based p-value and only if that came out significant did you move on to test the current value. Your p-values for enriched C-rich regions would still look stupid, but they would probably mostly be excluded by the initial test. Any thoughts about whether this is viable or useful (or suggestions for a better way to do this) are most welcome.
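
A rough Python sketch of the two p-value idea above, as one possible reading of it and not FastQC's actual implementation. The binomial model, the function names, the thresholds and the example base composition are all assumptions made for illustration. The pre-filter averages only the G and C probabilities, as described in the post.

```python
import math


def kmer_probability(kmer, base_probs):
    """Chance of seeing this k-mer at one position if bases are independent."""
    p = 1.0
    for base in kmer:
        p *= base_probs[base]
    return p


def gc_averaged(base_probs):
    """Copy of the composition with the G and C probabilities averaged."""
    gc = (base_probs["G"] + base_probs["C"]) / 2
    averaged = dict(base_probs)
    averaged["G"] = averaged["C"] = gc
    return averaged


def binomial_upper_tail(observed, trials, p):
    """P(X >= observed) for X ~ Binomial(trials, p), summed term by term."""
    if observed <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for k in range(observed, trials + 1):
        log_term = (math.lgamma(trials + 1) - math.lgamma(k + 1)
                    - math.lgamma(trials - k + 1)
                    + k * math.log(p) + (trials - k) * math.log1p(-p))
        term = math.exp(log_term)
        total += term
        # Past the mean the terms only shrink, so stop once they are negligible.
        if k > trials * p and term <= total * 1e-15:
            break
    return min(total, 1.0)


def two_stage_enriched(kmer, observed, positions, base_probs,
                       prefilter_alpha=0.01, alpha=0.01):
    """Report enrichment only if it survives the GC-averaged pre-filter."""
    p_gc = binomial_upper_tail(
        observed, positions, kmer_probability(kmer, gc_averaged(base_probs)))
    if p_gc >= prefilter_alpha:
        return False
    p_obs = binomial_upper_tail(
        observed, positions, kmer_probability(kmer, base_probs))
    return p_obs < alpha


# Hypothetical bisulphite-like composition with about 1% cytosine.
BISULPHITE_LIKE = {"A": 0.30, "C": 0.01, "G": 0.20, "T": 0.49}
```

With this made-up composition, three occurrences of `CCCCC` in 100,000 positions look wildly significant against the observed composition but are stopped by the pre-filter, whereas a k-mer seen far more often than either model predicts still gets through.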



          • Originally posted by simonandrews View Post
            Pete,

            This is a generic problem with the assumptions made in the Kmer analysis. The basic problem is that we take global composition values and then assume that these are evenly distributed over the whole dataset. In reality poorly represented bases tend to occur in clumps, which get assigned a very low probability of occurring by chance (which would be right if bases were randomly chosen), and therefore get picked out as significantly enriched even if they're happening at fairly low levels.

            I don't really want to include a specific 'bisulphite' mode since I'm generally wary of application (or technology) specific modifications, and since bisulphite is just an exemplar of a wider problem.

            I guess one way to fix this would be to calculate two p-values for each Kmer. Have one based on the actual observed distribution of bases and a second based on the GC content of the library (so the probabilities of G and C are averaged), or even on a flat distribution of bases. You could then have a low level filter on the GC based p-value and only if that came out significant did you move on to test the current value. Your p-values for enriched C-rich regions would still look stupid, but they would probably mostly be excluded by the initial test. Any thoughts about whether this is viable or useful (or suggestions for a better way to do this) are most welcome.
            Thanks for your reply, Simon. I appreciate your reasons for not wanting to modify the code for every application- or technology-specific artefact. Your two-pass strategy might be useful and I'll keep thinking about the problem.

            My current solution is to parse the fastqc_data.txt file to look for any "non-C" kmers, and it works okay-ish. But I can only identify the mode of the spatial distribution of such kmers and cannot produce line plots similar to those generated by FastQC for the top 6 kmers (plots that I find particularly useful).
            Pete
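
A small Python sketch of the kind of fastqc_data.txt parsing described above. It assumes the usual layout of that file: each module starts with a `>>Module Name` line, is followed by a `#`-prefixed tab-separated header, and ends with `>>END_MODULE`. The function names are hypothetical, and the column names are read from the header and not hard-coded, since they may differ between FastQC versions.

```python
def read_module(lines, module_name):
    """Return (header, rows) for one module of a fastqc_data.txt file."""
    header, rows, inside = [], [], False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>END_MODULE"):
            if inside:
                break
            continue
        if line.startswith(">>"):
            # Module lines look like ">>Kmer Content<TAB>warn".
            inside = line[2:].split("\t")[0] == module_name
            continue
        if not inside or not line:
            continue
        if line.startswith("#"):
            header = line[1:].split("\t")
        else:
            rows.append(line.split("\t"))
    return header, rows


def kmers_without_base(lines, base="C"):
    """Rows of the Kmer Content module whose sequence lacks the given base."""
    header, rows = read_module(lines, "Kmer Content")
    if "Sequence" not in header:
        return []
    seq_col = header.index("Sequence")
    return [dict(zip(header, row)) for row in rows
            if base not in row[seq_col].upper()]
```

Typical use would be `kmers_without_base(open("fastqc_data.txt"))`. As noted in the post, this only recovers the summary columns in the table and not the per-position profile needed for the line plots.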



            • fastqc error

              hi all,

              When I run the Linux version, I get this message:

              jingjing@Chua-Server:~/software/FastQC$ ./fastqc
              This is the source distribution of FastQC. You need to get the compiled version if you want to run the program


              Can someone suggest what is wrong and what I should do about this error?

              Thanks!

              Jingjing



              • Originally posted by jjjscuedu View Post

                This is the source distribution of FastQC. You need to get the compiled version if you want to run the program
                There was a clue already. On the FastQC download page you can get the following files:

                FastQC v0.10.0 (Win/Linux zip file) - this is the right one
                Source Code for FastQC v0.10.0 (zip file) - this is the wrong one

                Hope this helps.



                • fastqc error

                  Hi all,

                  I have downloaded the FastQC v0.10.0 (Win/Linux zip file) version.

                  However, when I install it according to the manual, there are still some problems like this:

                  jingjing@Chua-Server:~/software/FastQC$ chmod 755 fastqc
                  jingjing@Chua-Server:~/software/FastQC$ ./fastqc
                  Exception in thread "main" java.awt.HeadlessException:
                  No X11 DISPLAY variable was set, but this program performed an operation which requires it.
                  at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:173)
                  at java.awt.Window.<init>(Window.java:437)
                  at java.awt.Frame.<init>(Frame.java:419)
                  at java.awt.Frame.<init>(Frame.java:384)
                  at javax.swing.JFrame.<init>(JFrame.java:174)
                  at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:70)
                  at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:324)


                  Then I found that someone else had this problem and solved it like this:

                  jingjing@Chua-Server:~/software/FastQC$ java -Xmx250m -classpath /home/jingjing/software/FastQC:$CLASSPATH uk.ac.bbsrc.babraham.FastQC.FastQCApplication
                  Exception in thread "main" java.awt.HeadlessException:
                  No X11 DISPLAY variable was set, but this program performed an operation which requires it.
                  at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:173)
                  at java.awt.Window.<init>(Window.java:437)
                  at java.awt.Frame.<init>(Frame.java:419)
                  at java.awt.Frame.<init>(Frame.java:384)
                  at javax.swing.JFrame.<init>(JFrame.java:174)
                  at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:70)
                  at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:324)
                  jingjing@Chua-Server:~/software/FastQC$ ./fastqc
                  Exception in thread "main" java.awt.HeadlessException:
                  No X11 DISPLAY variable was set, but this program performed an operation which requires it.
                  at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:173)
                  at java.awt.Window.<init>(Window.java:437)
                  at java.awt.Frame.<init>(Frame.java:419)
                  at java.awt.Frame.<init>(Frame.java:384)
                  at javax.swing.JFrame.<init>(JFrame.java:174)
                  at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.<init>(FastQCApplication.java:70)
                  at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:324)


                  However, there are still some problems.

                  Can anyone give me some suggestions?

                  Jingjing



                  • FastQC failed to launch the graphical user interface because it requires you to enable X11 tunneling (check e.g. here).

                    "No X11 DISPLAY variable was set, but this program performed an operation which requires it."

                    You should still be able to run FastQC on the command line by typing

                    ./fastqc filename.fastq

                    or fastqc --help for more options.
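
A tiny Python sketch of the rule described above, for use in a wrapper script. It assumes the behaviour shown in this thread: with no file arguments FastQC tries to open its GUI, which needs a `DISPLAY`, while passing files runs it non-interactively. `fastqc_invocation` is a hypothetical helper name.

```python
import os


def fastqc_invocation(files, environ=None):
    """Return a FastQC command line, refusing a GUI launch without a display."""
    environ = os.environ if environ is None else environ
    if not files and not environ.get("DISPLAY"):
        raise RuntimeError(
            "No DISPLAY is set, so the GUI cannot start; "
            "pass FASTQ files to run non-interactively")
    return ["./fastqc", *files]
```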



                    • FastQC v0.10.1 has just been released onto the project web site. This version adds a workaround for the problem that the Java gzip decompressor can't handle concatenated gzip files and only processes the first block. This version should now read to the end of the file, so there's no need to decompress and recompress FASTQ files coming out of the Illumina pipeline.

                      This version also adds a fix for a bug which was triggered when the program was installed in a directory whose path contained characters which needed to be quoted in URLs. It also adds an extra command line option which allows you to specify the location of a java interpreter where this isn't in your path.

                      Please note that the project's URL has now changed to http://www.bioinformatics.babraham.a...ojects/fastqc/, and that this means that the launchers distributed with earlier versions of the program will no longer work, so you'll need to use the ones which come with this one.

                      If you find any problems with this version please report them in our bugzilla at:

                      http://www.bioinformatics.babraham.ac.uk/bugzilla/
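
The concatenated-gzip problem mentioned in the release note above can be reproduced outside Java. The following Python sketch is purely an illustration and is not FastQC's code: a single gzip stream decoder stops at the end of the first member, whereas a multi-member aware reader carries on to the end of the file. The function names and the two tiny records are made up for the example.

```python
import gzip
import zlib


def first_member_only(data):
    """Decode one gzip stream; bytes after the first member are left unread."""
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS = gzip framing
    return decoder.decompress(data)


def all_members(data):
    """Decode every gzip member in the data, one after another."""
    return gzip.decompress(data)


# Two separately compressed blocks joined into one file.
RECORD_ONE = b"@read1\nACGT\n+\nIIII\n"
RECORD_TWO = b"@read2\nTTGA\n+\nIIII\n"
CONCATENATED = gzip.compress(RECORD_ONE) + gzip.compress(RECORD_TWO)
```

Here `first_member_only(CONCATENATED)` yields only the first record, which mirrors the symptom of a reader that silently stops after the first block.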



                      • Simon,
                        I am a newcomer to NGS and FastQC. I love your software.
                        My 10 FASTQ files were generated on an Illumina HiScan. They are 100 bp PE reads. In the report I get lots of green ticks, a scattering of gold and one consistent red (for every sample, R1 and R2): the duplicated sequences. Duplicates are off the charts in every case. What is going on? My target is small (exons for ~170 genes). This was a custom capture DNA project using Agilent SureSelect. Also, what are the units on the Y-axis in this report graph? And does this one bad mark doom all the samples in terms of usefulness?
                        patrick



                        • If you're capturing a very small region and sequencing this to huge depth then the warning about duplication is probably spurious since you might well be expecting that every sequence will be present multiple times. More details about how to interpret the duplicate plot, and when it's OK to ignore duplication can be found here.
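
To put a rough number on why deep sequencing of a small target produces duplicates, here is a back-of-envelope Python sketch. It assumes reads are drawn uniformly at random from a fixed pool of equally likely distinct fragments, which is a simplification and not how FastQC computes its duplication plot. The function name and the example figures are hypothetical.

```python
import math


def expected_duplicate_fraction(reads, distinct_fragments):
    """Expected share of reads that repeat an already sampled fragment."""
    if reads <= 0 or distinct_fragments <= 0:
        raise ValueError("reads and distinct_fragments must be positive")
    if distinct_fragments == 1:
        return 1.0 - 1.0 / reads
    # Expected number of distinct fragments seen: M * (1 - (1 - 1/M)^N).
    expected_distinct = -distinct_fragments * math.expm1(
        reads * math.log1p(-1.0 / distinct_fragments))
    return 1.0 - expected_distinct / reads
```

For example, 10 million reads drawn from 500,000 possible fragments (made-up figures) gives an expected duplicate fraction of about 0.95, while the same model predicts almost no duplicates when the pool is far larger than the number of reads.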



                          • Originally posted by simonandrews View Post
                            But it also had a bug in it :-)

                            This version should work on all systems (if they have perl installed), and will let you set both java arguments and pass in files as arguments. I may add it to the next release.

                            Code:
                            #!/usr/bin/perl
                            use warnings;
                            use strict;
                            use FindBin qw($Bin);
                            
                            
                            if ($ENV{CLASSPATH}) {
                            	$ENV{CLASSPATH} .= ":$Bin";
                            }
                            else {
                            	$ENV{CLASSPATH} = $Bin;
                            }
                            
                            my @java_args = '-Xmx250m';
                            my @files;
                            
                            foreach (@ARGV) {
                              if (/^\-/) {
                                push @java_args,$_;
                              }
                              else {
                                push @files,$_;
                              }
                            }
                            
                            
                            exec "java",@java_args, "uk.ac.bbsrc.babraham.FastQC.FastQCApplication", @files;

                            Hi,
                            This is my fastqc code, after placing the above content into it

                            Code:
                            #!/usr/bin/perl
                            use warnings;
                            use strict;
                            use FindBin qw($RealBin);
                            use Getopt::Long;
                            
                            # Check to see if they've mistakenly downloaded the source distribution
                            # since several people have made this mistake
                            
                            if (-e "$RealBin/uk/ac/babraham/FastQC/FastQCApplication.java") {
                                    die "This is the source distribution of FastQC.  You need to get the compiled version if you want to run the program\n";
                            }
                            
                            my $delimiter = ':';
                            
                            if ($^O =~ /Win/) {
                                    $delimiter = ';';
                            }
                            
                            if ($ENV{CLASSPATH}) {
                                    $ENV{CLASSPATH} .= "$delimiter$RealBin$delimiter$RealBin/sam-1.32.jar$delimiter$RealBin/jbzip2-0.9.jar";
                            }
                            else {
                                    $ENV{CLASSPATH} = "$RealBin$delimiter$RealBin/sam-1.32.jar$delimiter$RealBin/jbzip2-0.9.jar";
                            }
                            
                            
                            my @java_args = '-Xmx250m';
                            my @files;
                            
                            
                            foreach (@ARGV) {
                              if (/^\-/) {
                                push @java_args,$_;
                              }
                              else {
                                push @files,$_;
                              }
                            }
                            
                            
                            exec "java",@java_args, "uk.ac.bbsrc.babraham.FastQC.FastQCApplication", @files;
                            I am hit with an error now.


                            Code:
                            FASTQ type: Sanger or Phred+33 (standard, --phred33-quals)
                            Total reads processed: 40743144
                            Quality score range: (2, 41)
                            Converting to Sanger FASTQ...
                            Conversion done!
                            Statement unlikely to be reached at /home/bin/fastqc line 47.
                                    (Maybe you meant system() when you said exec()?)
                            Unrecognized option: -Xt
                            Error: Could not create the Java Virtual Machine.
                            Error: A fatal exception has occurred. Program will exit.
                            Any pointers would be of great help.



                            • I'm not exactly sure what you're trying to do with the code you posted. But in the context of the code you quoted I think all of the changes in there made it into the most recent FastQC release, so you should check the launcher distributed with the latest FastQC to see if it does what you need.
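
For reference, the argument handling in the posted Perl launcher can be mirrored in a few lines of Python. This is only a restatement of the `foreach (@ARGV)` loop quoted above, with a hypothetical function name. Whether it explains the `-Xt` message is not established by the thread, but it does show that every dash-prefixed argument is handed to the JVM and everything else is treated as a file.

```python
def split_launcher_args(argv):
    """Mimic the posted launcher: dash-prefixed args go to java, the rest are files."""
    java_args = ["-Xmx250m"]
    files = []
    for arg in argv:
        if arg.startswith("-"):
            java_args.append(arg)
        else:
            files.append(arg)
    return java_args, files
```

With this logic, `split_launcher_args(["-t", "4", "reads.fq"])` sends `-t` to the JVM and treats `4` as a filename, so options intended for FastQC itself cannot be passed through this version of the script.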



                              • Adapter sequences for new fastqc module

                                I've been working on a new analysis module for FastQC which will specifically plot out the occurrences of a small number of adapter sequences so you can easily tell what benefit you would derive from trimming your data. I've attached an example so you can see what it will look like.

                                At the moment I only search for 2 adapter sequences: the common start sequence of most Illumina libraries and the Illumina small RNA adapter. This covers all of the sequences we routinely see, but I suspect there are other sequences which are commonly seen in libraries and which would be removed by adapter trimmers. My sequences are below:

                                Illumina Universal Adapter AGATCGGAAGAG
                                Illumina Small RNA Adapter ATGGAATTCTCG

                                If you know of any others, could you please post them here, preferably with a link to a dataset which contains them so I can check the detection is working. You can also email them directly to me ([email protected]) if you prefer.

                                Thanks.
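
To make the idea concrete, here is a minimal Python sketch of an adapter-occurrence profile using the two sequences listed above. It is not the module's actual code. The cumulative counting (a read counts as adapter-containing from the first match onwards) and the function name are assumptions made for illustration.

```python
ADAPTERS = {
    "Illumina Universal Adapter": "AGATCGGAAGAG",
    "Illumina Small RNA Adapter": "ATGGAATTCTCG",
}


def adapter_profile(reads, adapters=ADAPTERS):
    """Percent of reads in which each adapter has started at or before each position."""
    longest = max((len(read) for read in reads), default=0)
    profile = {}
    for name, adapter in adapters.items():
        counts = [0] * longest
        for read in reads:
            start = read.find(adapter)
            if start >= 0:
                # Count the read at the match position and at every later position.
                for position in range(start, len(read)):
                    counts[position] += 1
        profile[name] = [100.0 * count / len(reads) for count in counts]
    return profile
```

The value at the last position estimates the share of reads that would be shortened by trimming that adapter.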
