  • Galaxy Sanger Fastq Groomer frustration!

    How long should grooming take on Galaxy? I started grooming two 20 GB datasets about 6 hours ago, and the job is still running. I also started four more datasets (also 20 GB each) 28 hours ago, and nothing has finished. Does anybody use this tool, or know a faster way to do the conversion?

    Btw, grooming takes a FASTQ file and converts its quality scores so that the file can be fed into Cufflinks. I wouldn't expect this to take so long.
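
For readers wondering what the conversion actually involves: in the simplest case (Illumina 1.3 to 1.7 reads, which store qualities as Phred+64), grooming to Sanger format shifts every quality character down by 31 so that the file becomes Phred+33. Below is a minimal Python sketch of that one case only. It is not the Galaxy Groomer's code and not the Perl script mentioned later in the thread; the names `to_sanger_quality` and `groom` are made up for illustration, and it assumes plain four-line records with no wrapped sequence lines. Older Solexa-scaled files need a non-linear conversion that this does not attempt.

```python
import io

# Assumption: input is Phred+64 (Illumina 1.3-1.7); output is Sanger Phred+33.
OFFSET_SHIFT = 64 - 33


def to_sanger_quality(qual: str) -> str:
    """Shift one Phred+64 quality string down to Phred+33."""
    shifted = []
    for ch in qual:
        code = ord(ch) - OFFSET_SHIFT
        if code < 33:
            # A character this low cannot be valid Phred+64 input.
            raise ValueError(f"quality character {ch!r} is below the Phred+64 range")
        shifted.append(chr(code))
    return "".join(shifted)


def groom(in_handle, out_handle) -> int:
    """Stream four-line FASTQ records, rewriting only the quality line.

    Returns the number of records written.
    """
    count = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = in_handle.readline()
        plus = in_handle.readline()
        qual = in_handle.readline()
        if not qual:
            raise ValueError(f"truncated record after record {count}")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record after record {count}")
        out_handle.write(header)
        out_handle.write(seq)
        out_handle.write(plus)
        out_handle.write(to_sanger_quality(qual.rstrip("\r\n")) + "\n")
        count += 1
    return count


# Tiny in-memory demonstration with one made-up read.
demo_in = io.StringIO("@r1\nACGT\n+\nhhB@\n")
demo_out = io.StringIO()
demo_count = groom(demo_in, demo_out)
```

With real data you would pass two file handles from `open()` instead of the `StringIO` objects. Because the sketch streams one record at a time, memory use stays flat regardless of file size, so the run time is dominated by disk reads and writes.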

  • #2
    I guess that's going to depend on how overloaded your Galaxy server is. You say you've got 6 datasets processing concurrently? Since grooming is I/O-bound, you're probably thrashing the disk that holds your sequences, because you're reading and writing 240 GB of data simultaneously (6 × 20 GB in, the same again out). That can do really bad things to a server and may well take a while. I know we tend to get complacent about the volumes of data we work with, but 240 GB is a serious amount of data to process (think of it as roughly 350 CDs!), and trying to do it all at once may not go smoothly.

    Get your server admin to see what the machine is doing in case there's a problem.


    • #3
      Thanks Simon.
      I removed all but two of the datasets (20 GB each) and ran a simple Perl script on our own machines to convert the data. Hopefully it's not rejected by Galaxy!
      A


      • #4
        Are you running this on the public Galaxy server or on your own local instance?


        • #5
          It's on the public server at PSU. Here's an update: after waiting a couple of days for the server to groom the .fq, I decided to look for an alternative. I found a Perl script that processed a 20 GB file in 10 minutes. I think something might be wrong with their scripting. The files are uploading now.
          As an aside, here at the Pellegrini Lab at UCLA we're planning to set up a local instance on our servers, but it might take a while. Oh, deadlines...


          • #6
            A new disk quota was recently instituted on the public instance of Galaxy. You may have bumped up against that. See here: http://wiki.g2.bx.psu.edu/News/Galax...Usage%20Quotas


            • #7
              I thought it might have been that, but I had a maximum of about 120 GB in my history, with each dataset at about 20 GB. The only way I can see it happening is if the processing added to my quota without updating what I see, and that pushed me over.
              And unless I'm mistaken, that link says the quotas are enforced only on the Galaxy Test Instance and are only suggestions for storage space on the main instance. Maybe they're keeping track, though.
              Thanks for your reply =]


              • #8
                I traded emails with one of the Galaxy developers: The Groomer is slow, excruciatingly slow sometimes. It does, however, handle all the known corner cases, which most scripts don't need to worry about. It will run faster if you set "Do not summarize Input (faster)" under "Summarize input data" under advanced options, but could still be considered slow.

                Hope this helps!


                • #9
                  Thanks tnabtaf,
                  It definitely helps, inasmuch as I now know not to use the groomer. It seems more efficient to transform the data on our own, though maybe we don't think of as many possible problems with the formatting. Anyway, here's a link to the thread containing the Perl script I ultimately used: http://seqanswers.com/forums/showthr...groomer+script
                  Hopefully we'll have the Galaxy instance running on our servers soon, and we'll find a way to personalize the groomer.
                  For now, I'm choosing not to use Galaxy at all since I'm on a deadline. Thanks for all the help!
                  A


                  • #10
                    This question is best asked on the Galaxy User mailing list. Lots of experienced Galaxy users will see the question there.

                    For some reason Galaxy is not recognizing the files as FASTQ. You can manually override this by changing the file type, but that may or may not have the desired effect since there is something in the files that Galaxy appears not to like.


                    • #11
                      Wikipedia has a description of the quality value ranges.
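
Those ranges can also be checked against your own file before deciding whether grooming is needed at all. The following is a rough Python sketch of the usual ASCII-range heuristic, not part of Galaxy; `guess_quality_encoding` and its return labels are names invented here, and the cut-offs (33, 59, 64, 74) are the commonly quoted lower bounds for Sanger, Solexa and Illumina 1.3+ plus the typical upper bound for Sanger-encoded Illumina reads. Treat the answer as a guess, since a small sample can be ambiguous.

```python
def guess_quality_encoding(quality_lines):
    """Guess a FASTQ quality encoding from the ASCII range of a sample.

    Returns one of the illustrative labels "phred33", "solexa64" or "phred64".
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("characters outside the printable FASTQ quality range")
    if lowest < 59:
        # Only Sanger-style Phred+33 uses characters this low.
        return "phred33"
    if highest <= 74:
        # Assumption: nothing above 'J', so high-quality Phred+33 is far more
        # plausible than offset-64 data confined to very low scores.
        return "phred33"
    if lowest < 64:
        # Characters 59-63 encode negative Solexa scores.
        return "solexa64"
    return "phred64"


# Made-up quality strings for illustration.
sanger_guess = guess_quality_encoding(["II#!"])
illumina_guess = guess_quality_encoding(["hhB@"])
```

Feeding it the quality lines of the first few thousand reads is usually enough. If it reports `phred33`, the file is probably already in the Sanger encoding.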


                      • #12
                        fastQ groomer

                        I am using Galaxy on BioCloudCentral to analyze my RNA-seq data. I have successfully transferred 4 datasets (the biggest is 10 GB) from S3 to Galaxy. Then I tried to run the FASTQ Groomer, but the job has been shown as waiting to run for a couple of hours. Could someone tell me why the job can't start?
                        Thank you so much.


                        • #13
                          I would never use the public server for files over 1 GB.
                          Install your own instance (it's easy), then follow the instructions to get admin privileges and install the Fastq parallel groomer tool. It's much faster.

                          One other question: why are you looking to groom these files?
                          I know MiSeq has gone back to the Sanger FASTQ quality encoding (which Galaxy uses), which requires no grooming.
                          Maybe HiSeq hasn't?


                          • #14
                            fastQ groomer problem

                            The instance of Galaxy on the cloud behaves just like a local instance of Galaxy, not like a public one.


                            • #15
                              Originally posted by JackieBadger
                              and install Fastq parallel groomer tool..much faster.
                              A colleague of mine wrote the parallel version.
                              I've made it a standalone script which you can use by itself now without Galaxy.

                              https://bitbucket.org/gvl/fastq-groomer-parallel

                              Cheers,
                              Last edited by kevyin; 09-16-2013, 08:46 PM.
