  • MaSuRCA error

    Hi everyone,

    I've been trying to run MaSuRCA to de novo assemble my Illumina fastq files. However, when I run the generated assemble.sh script, I get the following error:

    ./assemble.sh
    Processing pe library reads
    Unsupported input format for file '/home/frag_2.fastq'
    Unsupported input format for file '/home/frag_1.fastq'
    awk: fatal: division by zero attempted
    Average PE read length
    Illegal division by zero at -e line 1.
    choosing kmer size of for the graph
    MIN_Q_CHAR: 64
    Creating mer database for Quorum.
    Error correct PE.
    Cutoff computation failed. Pass it explicitly with -p switch.
    Error correction of PE reads failed. Check pe.cor.log.

    My config file looked like this:

    DATA
    PE= pe 250 37 /home/frag_1.fastq /home/frag_2.fastq
    END

    PARAMETERS
    GRAPH_KMER_SIZE=auto
    USE_LINKING_MATES=1
    NUM_THREADS=64
    JF_SIZE=600000000
    END

    Any help would be greatly appreciated!

    James

  • #2
    What do your fastq files look like? It seems that the assembler stumbles over your input files. Maybe you can post a few lines.


    • #3
      Yes, here is my fastq file:

      @M00534:14:000000000-A47HD:1:1101:15878:1324 2:N:0:1
      ATCCCGCAGGATGTGGAAAGCGCCGTCATGCCCGCAGAGCTACGGCCATTAACGCCAACCCGCGCCACACAAACCTACCCTTCTCCGGCCTCGGTCGAAAGCGCCGTGGCGCTGTTGCGTGCCGCGCGCAACCCGGGGATTCTGGCCGGGCACGGGGTTGCCAGAACCGGGCATGCGCCGGCGCTGGACGCCTTTGACTGGGGTTACGTCGTTTCGGTCGCACCACCTGTTATTGGGACGGGGGTGAGTCA
      +
      CBBCCCCCCCBFGGGGGGGGGGGGGEGDHHHHFGGGCECGCBGGGGGCHGGHHGGGGGEAAEE@EGGGGDGGHHGFHGFHGHHHBFD@DGGHGGDFCDEDDGD?@C<CCC>.C@.<DFCCDFHD-@@???CFF.9B-;-./BBBBBF-@B:B..-:@--;9;//./9.9---;:;/..:------;:..---..9;////:9-.-.:./.-;../.-..9----..../;//////.;99-@;@@;../;/
      @M00534:14:000000000-A47HD:1:1101:16163:1327 2:N:0:1
      GATAGCCAAACGCTCAACCTTTGGGTACAACACCCCGGCCCGCCAGACAGAGGTCCCTTATACACACCCCACCCCGCCAAGCATACCGAAGATTGCCATCTCGGCGGCCGCCGCGTCGTTAAAAAAAAAACGCACGGCGCCCACAAACTGCTCCTCGGCCGGGTAGAACAGAATCCGCACCCGCCCGGGCGGTGGCGCTTCGCGACGTTGTAGGCATAGGTATGGTAGCACTAGCACAGTGTGTAGGGTAA
      +
      A1>A1F1111111A111100A1D000111D100BEA/AA//////A0/0000BF01F11112221B/B/>////>>/////>B1110/////B1B1B1BG1F0/////>-<><<-:;:;:..00;A-A?-.-------9-;--;9AF-;/9BFEBB----9----99BFF//B///-:-9--;---------;-------------9-;9-//////////;//-///////;9//9--////-/9/-///
      @M00534:14:000000000-A47HD:1:1101:17525:1332 2:N:0:1
      ATATGCTATCGCGCCAAAACCGTCGCTGAACGGCGTAAGCACATCACGCTTGATTTCAACCCGGAAGGCCTTGGTAATCAGGCAGCGATTGAACGTGCGATTGAATACTTCTTAAAAGACAGCCTGTCCGTGCATCACCCAC

      I thought this might be the cause of the error message, so I tried the test data from the MaSuRCA ftp site, but I got the same error.
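
A quick check of the quality encoding can be sketched in shell. The excerpt above contains quality characters such as '-' and '.', which sit below ASCII 59 and therefore point to Phred+33 encoding. The helper names below (min_quality_ascii, guess_quality_offset) are made up for this illustration and are not part of MaSuRCA, and the sketch assumes plain four-line FASTQ records with no wrapped sequence lines.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of MaSuRCA.
# Assumes plain 4-line FASTQ records (header, sequence, '+', quality).

# Print the smallest ASCII code that occurs on any quality line.
min_quality_ascii() {
    awk 'NR % 4 == 0' "$1" \
        | tr -d '\r\n' \
        | od -An -v -tu1 \
        | awk '{ for (i = 1; i <= NF; i++) if (min == "" || $i + 0 < min) min = $i + 0 }
               END { if (min != "") print min }'
}

# Guess the Phred offset from the lowest quality character.
guess_quality_offset() {
    local min
    min=$(min_quality_ascii "$1")
    if [ -z "$min" ]; then
        echo "unknown"      # no quality lines were found
    elif [ "$min" -lt 59 ]; then
        echo "33"           # characters this low only occur with Phred+33
    else
        echo "ambiguous"    # could be Phred+64 or uniformly high-quality Phred+33
    fi
}
```

Calling guess_quality_offset on /home/frag_1.fastq would print 33 for data like the excerpt. The MIN_Q_CHAR: 64 line in the first log may simply reflect that no reads were parsed at all.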


      • #4
        Looks like standard MiSeq output. But if even their test data fails to assemble with MaSuRCA, you might have a problem with your MaSuRCA installation.
        You may want to check the assemble.sh script to see what MaSuRCA is trying to do.

        If there is nothing obvious, you should probably contact the authors.


        • #5
          I replaced 2:N:0:1 with \2 and it worked well for one genome, so it is worth a try.
          However, for a different genome it produced a similar error.

          If you are able to resolve the error, please post the solution here.


          • #6
            Solved

            Hi,

            I contacted the developers about this, and they said that read names do not matter during pre-processing of the data. They suggested running a test on my fastq file:

            Code:
            file -b -i jumps.A.fastq
            This gave me a result like:

            Code:
            text/x-python; charset=us-ascii
            I emailed the result to the developers, and they replied that the operating system thinks the fastq file is Python code. This is not correct; the type should be text/plain.

            Code:
            The simple way to fix this:
            
            Look at expand_fastq script under masurca bin folder and replace the line:
            
                (text/plain*)
            with
                (text/*)
             
            everything should work afterward.
            After this change, I was able to run the assembler correctly, with JF_SIZE set to a very high value (JF_SIZE=1800000000).

            Thanks
            Sagar
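
Why the pattern change helps can be seen in a small shell sketch. The two functions below are hypothetical and are not copied from expand_fastq; they only show how a shell case statement (which the quoted pattern syntax suggests is in use) treats the two patterns when given the output of file -b -i.

```shell
#!/usr/bin/env bash
# Illustration only: these functions are NOT taken from expand_fastq.
# They mimic a `case` dispatch on the MIME string printed by `file -b -i`.

# Original pattern: only MIME types that start with "text/plain" pass.
accepts_strict() {
    case "$1" in
        (text/plain*) echo "accepted" ;;
        (*)           echo "rejected" ;;
    esac
}

# Relaxed pattern: any "text/..." MIME type passes.
accepts_relaxed() {
    case "$1" in
        (text/*) echo "accepted" ;;
        (*)      echo "rejected" ;;
    esac
}

accepts_strict  "text/x-python; charset=us-ascii"   # prints "rejected"
accepts_relaxed "text/x-python; charset=us-ascii"   # prints "accepted"
```

A file reported as text/plain passes both patterns, so the relaxed pattern only widens what counts as readable text.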


            • #7
              Hi Sagar,

              Thanks for your solution. This worked well on my local system. However, when I amended the file on my university's Linux server, I got an error. It is different from the original error and is printed below:

              Creating k-unitigs with k=81
              [Wed Mar 26 12:34:33 GMT 2014] Computing super reads from PE
              Linking PE reads 329036
              [Wed Mar 26 12:35:06 GMT 2014] Celera Assembler
              ovlMerThreshold=75
              Overlap/unitig failed, check output under CA/ and runCA1.out

              I have tried this several times, varying the parameters each time and also using the MaSuRCA test data, but I continue to get this error.

              James


              • #8
                I am not sure how to handle this specific error. However, I received some additional advice from the MaSuRCA developers, quoted below:

                Code:
                MaSuRCA works best with coverage up to 150x.  To use 200x+ coverage data you need to set KMER_COUNT_THRESHOLD parameter in the config file to 2 or 3, or simply use part of the data.
                You can check this parameter or write to the authors with more specific questions.

                Thanks
                Sagar
                Last edited by sagarutturkar; 03-26-2014, 09:49 AM. Reason: minor
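
To judge whether a data set exceeds the 150x mentioned above, coverage can be estimated as the total number of sequenced bases divided by the expected genome size. This is a minimal sketch that assumes four-line FASTQ records; the function names are hypothetical, and the genome size has to come from your own estimate.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Total number of bases in one or more FASTQ files (4-line records assumed).
fastq_bases() {
    awk 'FNR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }' "$@"
}

# Fold coverage = bases / genome size, printed with one decimal place.
coverage() {
    LC_ALL=C awk -v b="$1" -v g="$2" 'BEGIN {
        if (g <= 0) { print "NA"; exit }   # avoid dividing by zero
        printf "%.1f\n", b / g
    }'
}
```

For a paired library, something like coverage "$(fastq_bases frag_1.fastq frag_2.fastq)" 5000000 would print the fold coverage for an assumed 5 Mb genome. If the result is well above 150, subsampling the reads or raising KMER_COUNT_THRESHOLD as quoted above are the two options on the table.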


                • #9
                  Hello people,
                  I have a problem with the MaSuRCA assembler. I think the problem is in the Jellyfish step.
                  The error is related to gap closing: "Gap close failed, you can still use pre-gap close files under CA/9-terminator/. Check gapClose.err for problems"

                  Checking gapClose.err ...

                  mkdir CA/10-gapclose
                  outputDirectory = CA/10-gapclose
                  /usr/local/MaSuRCA-2.0.3.1/bin/getEndSequencesOfContigs.perl /home/lays/jatoba/dados_norm/masurca/CA/9-terminator 51 200
                  /usr/local/MaSuRCA-2.0.3.1/bin/create_end_pairs.perl /home/lays/jatoba/dados_norm/masurca/CA/9-terminator 51 > /home/lays/jatoba/dados_norm/masurca/CA/10-gapclose/contig_end_pairs.51.fa
                  /usr/local/MaSuRCA-2.0.3.1/bin/create_end_pairs.perl /home/lays/jatoba/dados_norm/masurca/CA/9-terminator 200 > /home/lays/jatoba/dados_norm/masurca/CA/10-gapclose/contig_end_pairs.200.fa
                  /usr/local/MaSuRCA-2.0.3.1/bin/getMeanAndStdevForGapsByGapNumUsingCeleraAsmFile.perl /home/lays/jatoba/dados_norm/masurca/CA/9-terminator --contig-end-seq-file /home/lays/jatoba/dados_norm/masurca/CA/10-gapclose/contig_end_pairs.51.fa > gap.insertMeanAndStdev.txt
                  echo "cc 600 200" > meanAndStdevByPrefix.cc.txt
                  jellyfish count -s 7000000000 -C -t 12 -m 21 -L 100 -o restrictKmers /home/lays/jatoba/dados_norm/masurca/pA.renamed.fastq /home/lays/jatoba/dados_norm/masurca/pB.renamed.fastq /home/lays/jatoba/dados_norm/masurca/pC.renamed.fastq /home/lays/jatoba/dados_norm/masurca/pD.renamed.fastq
                  ln -sf restrictKmers_0 restrictKmers
                  jellyfish dump -L 1000 restrictKmers -c > highCountKmers.txt
                  jellyfish count -s 1 -C -t 12 -m 21 -o fishingAll /home/lays/jatoba/dados_norm/masurca/CA/10-gapclose/contig_end_pairs.200.fa
                  terminate called after throwing an instance of 'jellyfish::file_parser::FileParserError'
                  what(): Empty input file '/home/lays/jatoba/dados_norm/masurca/CA/10-gapclose/contig_end_pairs.200.fa'
                  child died with signal 6, with coredump

                  The pipeline doesn't create the file "contig_end_pairs.200.fa". What does that mean?
                  Can someone help me?

                  Thanks a lot ...
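
The jellyfish message suggests that contig_end_pairs.200.fa exists but is empty, so one place to start is to see which files in the gap-close directory came out with zero bytes. A minimal sketch follows; list_empty_files is a hypothetical helper, and the directory to pass in is the CA/10-gapclose directory from the log above.

```shell
#!/usr/bin/env bash
# Sketch only: list_empty_files is a hypothetical helper.

# Print every regular file directly inside DIR that is zero bytes long.
list_empty_files() {
    local f
    for f in "$1"/*; do
        [ -f "$f" ] && [ ! -s "$f" ] && echo "$f"
    done
    return 0
}
```

If contig_end_pairs.51.fa is also listed, the problem probably lies upstream of jellyfish, in the step that writes the contig end sequences.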


                  • #10
                    Originally posted by bsp017
                    Hi Sagar,

                    Thanks for your solution. This worked well on my local system. However, when I amended the file on my university's Linux server, I got an error. It is different from the original error and is printed below:

                    Creating k-unitigs with k=81
                    [Wed Mar 26 12:34:33 GMT 2014] Computing super reads from PE
                    Linking PE reads 329036
                    [Wed Mar 26 12:35:06 GMT 2014] Celera Assembler
                    ovlMerThreshold=75
                    Overlap/unitig failed, check output under CA/ and runCA1.out

                    I have tried this several times, varying the parameters each time and also using the MaSuRCA test data, but I continue to get this error.

                    James

                    Were you able to solve the problem? I'm stuck at the same point.

                    regards


                    • #11
                      Hello,

                      I am using MaSuRCA to assemble a plant genome with 454, PacBio, and Illumina PE and MP data. When I run the assembly, I get this error:

                      [Mon May 2 13:51:44 IST 2016] Processing pe library reads
                      [Mon May 2 13:51:44 IST 2016] Processing sj library reads
                      Invalid fastq format (Unexpected end of file) in file '/dev/fd/62' around position -1
                      Invalid fastq format (Unexpected end of file) in file '/dev/fd/62' around position -1
                      [Mon May 2 14:32:31 IST 2016] Processing pe library reads
                      [Mon May 2 14:32:31 IST 2016] Processing sj library reads
                      Average PE read length 166
                      choosing kmer size of 70 for the graph
                      MIN_Q_CHAR: 33
                      [Mon May 2 19:59:07 IST 2016] Creating mer database for Quorum.
                      [Mon May 2 21:00:21 IST 2016] Error correct PE.
                      [Wed May 4 23:54:47 IST 2016] Error correct JUMP.
                      [Thu May 5 03:34:52 IST 2016] Estimating genome size.
                      Estimated genome size: 1469569156
                      [Thu May 5 04:52:19 IST 2016] Creating k-unitigs with k=70
                      panic: memory wrap at -e line 1, <> line 3749709508.
                      END failed--call queue aborted, <> line 3749709508.
                      [Thu May 5 21:52:26 IST 2016] Creating k-unitigs with k=31
                      [Fri May 6 11:54:54 IST 2016] Filtering JUMP.
                      Assuming outtie orientation
                      Chimeric/Redundant jump reads:
                      43056 chimeric_sj.txt
                      5337014 redundant_sj.txt
                      5380070 total
                      Found extra chimeric mates:
                      37278 work2.1/readsToExclude.txt
                      [Sun May 8 01:29:48 IST 2016] Creating FRG files
                      [Sun May 8 03:20:10 IST 2016] Computing super reads from PE
                      Super reads failed, check super1.err and files in ./work1/

                      As suggested earlier in this thread, I also ran file -b -i on all fastq files; the result is text/plain; charset=us-ascii for every one of them.

                      Can anyone help me out here?


                      • #12
                        My config file looks like this:

                        DATA
                        PE= pa 250 100 300bp_R1.fastq 300bp_R2.fastq
                        PE= pb 300 150 500bp_R1.fastq 500bp_R2.fastq
                        PE= pc 400 300 Miseq_R1.fastq Miseq_R2.fastq

                        JUMP= sa 1700 1000 2kb1_R1.fastq 2kb1_R2.fastq
                        JUMP= ha 1700 1000 2kb2_R1.fastq 2kb2_R2.fastq
                        JUMP= ga 1700 1000 2kb3_R1.fastq 2kb3_R2.fastq
                        JUMP= sb 1700 1000 2kb4_R1.fastq 2kb4_R2.fastq
                        JUMP= hb 1700 1000 2kb5_R1.fastq 2kb5_R2.fastq
                        JUMP= gb 1700 1000 2kb6_R1.fastq 2kb6_R2.fastq
                        JUMP= sc 1700 1000 2kb7_R1.fastq 2kb7_R2.fastq
                        JUMP= hc 1700 1000 2kb8_R1.fastq 2kb8_R2.fastq

                        JUMP= gc 2500 1000 4kb1_R1.fastq 4kb1_R2.fastq
                        JUMP= sd 2500 1000 4kb2_R1.fastq 4kb2_R2.fastq
                        JUMP= hd 2500 1000 4kb3_R1.fastq 4kb3_R2.fastq
                        JUMP= gd 2500 1000 4kb4_R1.fastq 4kb4_R2.fastq

                        JUMP= se 3000 1000 6kb1_R1.fastq 6kb1_R2.fastq
                        JUMP= he 3000 1000 6kb2_R1.fastq 6kb2_R2.fastq
                        JUMP= ge 3000 1000 6kb3_R1.fastq 6kb3_R2.fastq

                        JUMP= mc 5500 1000 8kb1_R1.fastq 8kb1_R2.fastq
                        JUMP= md 5500 1000 8kb2_R1.fastq 8kb2_R2.fastq
                        JUMP= me 11500 1000 20kb1_R1.fastq 20kb1_R2.fastq
                        JUMP= mf 11500 1000 20kb2_R1.fastq 20kb2_R2.fastq


                        OTHER=SG_combined.frg
                        OTHER=IIWSK1V01.frg
                        OTHER=IIWSK1V02.frg
                        OTHER=IJK1LD202.frg
                        OTHER=pacbio.frg
                        END

                        PARAMETERS
                        #this is k-mer size for deBruijn graph values between 25 and 101 are supported, auto will compute the optimal size based on the read data and GC content
                        GRAPH_KMER_SIZE=auto
                        #set this to 1 for Illumina-only assemblies and to 0 if you have 1x or more long (Sanger, 454) reads, you can also set this to 0 for large data sets with high jumping clone coverage, e.g. >50x
                        USE_LINKING_MATES=0
                        #this parameter is useful if you have too many jumping library mates. Typically set it to 60 for bacteria and something large (300) for mammals
                        LIMIT_JUMP_COVERAGE= 300
                        #these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically. for mammals do not set cgwErrorRate above 0.15!!!
                        CA_PARAMETERS = ovlMerSize=30 cgwErrorRate=0.15 ovlMemory=4GB
                        #minimum count k-mers used in error correction 1 means all k-mers are used. one can increase to 2 if coverage >100
                        KMER_COUNT_THRESHOLD = 1
                        #auto-detected number of cpus to use
                        NUM_THREADS=58
                        #this is mandatory jellyfish hash size
                        JF_SIZE=35000000000
                        #this specifies if we do (1) or do not (0) want to trim long runs of homopolymers (e.g. GGGGGGGG) from 3' read ends, use it for high GC genomes
                        DO_HOMOPOLYMER_TRIM=0
                        END


                        • #13
                          Well, looks funny ;-)

                          What system are you running on? Mac/Linux? How much memory?
                          Is your data stored locally or on a NFS mount?
                          What is your expected coverage?

                          Is this the very first output containing an error message?

                          The first error seems to be related to reading fastq data from the pipes.
                          Check that the fastq files are not abnormally truncated.

                          Something like
                          Code:
                          tail -n 4 FILE.fq
                          The second error (3 days later!?) implies that something wants to allocate more memory than is available on your system. Check whether you have enough memory (e.g. with top).
                          It still continues ...

                          Another three days later it finally dies in another step of the pipeline ...

                          Hmm, no solution, but try to start searching at the very beginning ...
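
Looking at tail -n 4 by eye works for one file; with many libraries a scripted check may be easier. The sketch below tests that the line count is a multiple of four and that the last record is well formed. check_fastq_tail is a hypothetical name, and the check assumes unwrapped four-line records that end with a newline.

```shell
#!/usr/bin/env bash
# Sketch only: check_fastq_tail is a hypothetical helper.
# Returns 0 and prints "ok: N records" when the file looks complete,
# otherwise prints a reason and returns 1.
check_fastq_tail() {
    local file=$1 lines
    lines=$(wc -l < "$file")
    lines=$((lines + 0))          # normalise padding that some wc versions add
    if [ "$lines" -eq 0 ]; then
        echo "empty file"
        return 1
    fi
    if [ $((lines % 4)) -ne 0 ]; then
        # A missing final newline also lands here.
        echo "truncated: $lines lines is not a multiple of 4"
        return 1
    fi
    # The last record must be header, sequence, '+', quality of equal length.
    if ! tail -n 4 "$file" | awk '
            NR == 1 && substr($0, 1, 1) != "@" { bad = 1 }
            NR == 2 { seqlen = length($0) }
            NR == 3 && substr($0, 1, 1) != "+" { bad = 1 }
            NR == 4 && length($0) != seqlen { bad = 1 }
            END { exit (bad ? 1 : 0) }'; then
        echo "malformed last record"
        return 1
    fi
    echo "ok: $((lines / 4)) records"
}
```

Running it over both files of a pair and comparing the reported record counts also shows whether the two mates are still in step.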


                          • #14
                            Hi sklages,

                            I am running the script on a Linux system. The data is stored on a cluster. Expected coverage is 50x.

                            Yes, this is the very first output containing an error message.

                            I checked the fastq files. They are not abnormally truncated. Regarding memory, the requirements specified in the manual are met, so I am not sure whether that is the issue.


                            • #15
                              How are you accessing the data? Via NFS? I am asking as we had some issues with I/O in our cluster environment.

                              Just have a look at how much memory your job consumes; then you'll see if this is an issue. You seem to run 58 threads, and I have no idea whether the unitigging step is multi-threaded or how much RAM it really needs in total.

                              You should probably start from step "Creating k-unitigs with k=70" .. all intermediate files should be available ..
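
For the memory check suggested above, one rough way to watch a running job is to read its peak resident memory. The sketch below is Linux-oriented: it reads VmHWM from /proc and falls back to the current resident size from ps where that field is not available. peak_rss_kb is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: peak_rss_kb is a hypothetical helper.

# Print the peak resident memory of process PID in kB.
# Falls back to the current RSS from ps, or "unknown" if neither source works.
peak_rss_kb() {
    local pid=$1 kb=""
    if [ -r "/proc/$pid/status" ]; then
        # VmHWM is the resident-set "high water mark" reported by the kernel.
        kb=$(awk '/^VmHWM:/ { print $2 }' "/proc/$pid/status")
    fi
    if [ -z "$kb" ]; then
        kb=$(ps -o rss= -p "$pid" 2>/dev/null | tr -d ' ') || true
    fi
    echo "${kb:-unknown}"
}
```

Calling peak_rss_kb with the PID of the step that is currently running, and comparing the number with the RAM installed on the node, shows how close the job gets to the limit. On systems with GNU time, wrapping a command in /usr/bin/time -v reports a similar figure as "Maximum resident set size".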
