
  • SOAPdenovo and FASTA files - Newbie question

    I'm a newbie when it comes to NGS bioinformatics. I'm testing SOAPdenovo with a small FASTA file (3300 kb). The FASTA file is an unpaired set of reads exported from Tablet but originally generated in a BWA alignment, i.e. I extracted a subset of reads from my MiSeq run based on a reference sequence, for de novo assembly in SOAPdenovo. For some unknown reason, SOAPdenovo fails to read the FASTA file (see below).

    Has anyone else seen this and is there a fix out there?

    Version 2.04: released on July 13th, 2012
    Compile Apr 25 2013 16:59:53

    ********************
    Pregraph
    ********************

    Parameters: pregraph -s soap-fasta.config -K 63 -R -o fastatest

    In soap-fasta.config, 1 lib(s), maximum read length 260, maximum name length 256.

    8 thread(s) initialized.
    Import reads from file:
    /home/fasta/1700-sorted-bam_140k.txt
    --- 100000000th reads.
    --- 200000000th reads.

    ...and on and on, until I stopped it with "kill -9".

  • #2
    Hi

    What were your:

    1. command line invocations
    2. config file contents
    3. fasta header description



    • #3
      fasta, config, and command line

      FASTA headers (original, modified 1, and modified 2):

      >M00542:7:000000000:A3C86:1:2103:28825:15020 pos=33 len=89

      or
      >M00542_7_000000000-A3C86_1_1102_26169_15631_pos-125_len-90

      or
      >1101_11975_19083_pos-230_len-42




      .config file (I've also tried commenting out avg_ins with '#' and setting asm_flags to 1 or 3)


      #maximal read length
      max_rd_len=260
      [LIB]
      #average insert size
      avg_ins=420
      #if sequence needs to be reversed
      reverse_seq=0
      #in which part(s) the reads are used
      asm_flags=1
      #in which order the reads are used while scaffolding
      rank=1
      #fasta file (unpaired ends)
      f=/home/fasta/1700-sorted_bam.fasta


      Command line: (note that the executable was renamed to 'soapdenovo')

      >soapdenovo all -s soap-fasta.config -K 63 -R -o testfasta 1>fastaass.log 2>fastaass.err



      • #4
        There appear to be occasional asterisks (*) in the Tablet data that caused the fault. Removing these seemed to do the trick.

        Thanks for your time.



        • #5
          There appear to be occasional asterisks (*) in the Tablet data that caused the fault. Removing these seemed to do the trick.
          Gregory, where in your data did the asterisks appear? Among the nucleotides? Sequence names?

          I am experiencing a similar problem with SOAPdenovo. My input is a pair of gzipped FASTQ files, with 100 million 150 bp reads in each. SOAPdenovo ran for 60 hours and was reporting "--- 18600000000th reads" before I killed it.

          I extracted a small subset of the pairs (a couple thousand), and SOAPdenovo ran fine on those.
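For anyone wanting to repeat this kind of subset test, here is a minimal sketch. It is not necessarily how the subset in this post was made. It assumes each FASTQ record occupies exactly four lines, that both files list mates in the same order, and that taking the first records (rather than a random sample) is good enough for a smoke test. The function name `take_pairs` and its parameters are hypothetical.

```python
import gzip
from itertools import islice


def take_pairs(path_1, path_2, out_1, out_2, n_pairs):
    """Copy the first n_pairs records from each of two gzipped FASTQ files.

    Assumes 4-line FASTQ records and that mates appear in the same order
    in both files, so the same record count keeps the pairs in sync.
    """
    n_lines = 4 * n_pairs  # four lines per FASTQ record
    for src, dst in ((path_1, out_1), (path_2, out_2)):
        with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
            # islice stops after n_lines, so the large input is never read in full.
            fout.writelines(islice(fin, n_lines))
```

If a small subset assembles cleanly but the full files do not, that points toward something specific to the full input (or the run itself) rather than the config or command line.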

          An online search revealed your post of a similar problem. It sounds like I need to "fix" my reads to remove asterisks but I want to understand better what you actually changed.

          Thanks.



          • #6
            Tablet asterisks in nucleotide sequence

            The asterisk symbols were located in the actual DNA nucleotide sequence and, from what I was able to determine, were being used as 'gap' fillers by Tablet (as opposed to dashes or periods). Removing them using grep or other find-and-replace software did the trick for me.
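The removal described above can be sketched in a few lines of Python. This is only an illustration, not the exact procedure used in this thread. It assumes a plain FASTA file in which header lines start with ">" and every other line is sequence. The function name `strip_gap_chars` and the example reads are made up.

```python
def strip_gap_chars(lines, gap_chars="*"):
    """Yield FASTA lines with gap characters removed from sequence lines only."""
    # Translation table that deletes every character listed in gap_chars.
    table = str.maketrans("", "", gap_chars)
    for line in lines:
        if line.startswith(">"):
            # Header line: leave untouched.
            yield line
        else:
            # Sequence line: drop the gap characters, keep everything else.
            yield line.translate(table)


# Made-up example records, for illustration only.
example = [">read1 pos=33 len=10\n", "ACGT**ACGT\n", ">read2\n", "*TTGA\n"]
cleaned = list(strip_gap_chars(example))
print("".join(cleaned), end="")
```

Note that removing gap characters shortens the affected reads, so any length recorded in the header (such as `len=`) may no longer match the sequence.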



            • #7
              Thanks, Gary,

              I think my case triggered the same problem, but for different reasons.

              I think I must have experienced a transient corruption of my input files. I wasn't able to find any non-ACGTN characters among my nucleotide sequences. Eventually I decided to just run the same thing a second time, and it ran without any problems.
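A check for non-ACGTN characters like the one mentioned above might look roughly like this. It is a sketch, not the check actually run for this post. It assumes 4-line FASTQ records (no wrapped sequences), so the sequence is always the second line of each record. The function name `find_bad_fastq_reads` and the example records are hypothetical.

```python
import re

# Anything other than A, C, G, T, or N counts as unexpected.
NON_ACGTN = re.compile(r"[^ACGTN]")


def find_bad_fastq_reads(lines):
    """Return (line_number, offending_characters) for unexpected sequence lines."""
    problems = []
    for index, line in enumerate(lines):
        if index % 4 != 1:
            continue  # only the second line of each 4-line record is sequence
        bad = sorted(set(NON_ACGTN.findall(line.strip().upper())))
        if bad:
            # Report 1-based line numbers so they match what an editor shows.
            problems.append((index + 1, "".join(bad)))
    return problems


# Made-up example records, for illustration only.
example = ["@r1\n", "ACGTN\n", "+\n", "IIIII\n",
           "@r2\n", "AC*T-\n", "+\n", "IIIII\n"]
print(find_bad_fastq_reads(example))
```

For a gzipped input, the same function can be fed an open handle, e.g. `gzip.open(path, "rt")`, since it only iterates over lines.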

              We do occasionally have I/O issues on the machine I was running on. Usually these manifest as something like a "network time out" being reported to standard error. If this is what caused the problem with SOAPdenovo, the manifestation is quite unusual: there was no such report to stderr.

              Thanks for your response,
              T. Hattum
