  • boetsie
    Senior Member
    • Feb 2010
    • 245

    building scaffolds using a contig and mate pair

    Hi all,

    We are currently performing a de novo assembly using Illumina mate pairs. We have assembled them using CLCBio, but CLCBio produces only contigs, not scaffolds. Since we have mate pairs, we would like to use them to build scaffolds.

    The problem is that assembly programs like SOAPdenovo or SSAKE use files that were produced during contig assembly. They don't have a stand-alone program for just scaffolding a contig file.

    Is there any software/algorithm available that takes a contigs file (in .fasta format) and mate-pair files as input, and can produce a scaffold? Or does someone have a solution?

    Kind regards,
    Marten
  • Zigster
    Jeremy Leipzig
    • May 2009
    • 117

    #2
    I'm pretty sure MIRA can do that. Set aside a couple of days to read the manual, though (it is very long).
    --
    Jeremy Leipzig
    Bioinformatics Programmer

    • niazi84@hotmail.com
      Member
      • Jan 2010
      • 25

      #3
      Marten, you can use Bambus 2.33 by AMOS. It takes a contig file and a mate file as input. I am also trying to use it, but I don't have the mate file that Bambus requires. Do you know how to create a mate file? I have paired-end reads from Illumina.
      ~Adnan~


      • boetsie
        Senior Member
        • Feb 2010
        • 245

        #4
        Thanks for the replies, but I don't think your answers work.

        MIRA uses Bambus for scaffolding (if I'm correct?).
        However, Bambus doesn't read a .fasta file for scaffolding; it needs a .contig file, which I don't have. In addition, I can't supply the two mate-pair files I have (one for each read end), only a regular expression describing how the two reads of a pair are named.

        So, my input is:

        - 1 .fasta file containing contigs
        - 2 .fasta files containing the mate pairs

        Is there a way to do this?

        Kind regards,
        Marten


        • gabriel.lichtenstein
          Junior Member
          • Dec 2009
          • 7

          #5
          any updates on this....
          Last edited by gabriel.lichtenstein; 04-07-2010, 05:06 AM.


          • boetsie
            Senior Member
            • Feb 2010
            • 245

            #6
            Well, not yet. We also had the problem that we couldn't generate an .ace file for Bambus. We are currently working on a script for this problem, since none of the existing programs can do this.


            • mack
              Junior Member
              • Oct 2009
              • 4

              #7
              I believe CLCBio can export assemblies as .ace files.


              • boetsie
                Senior Member
                • Feb 2010
                • 245

                #8
                For large datasets, somehow no .ace files are produced.


                • mack
                  Junior Member
                  • Oct 2009
                  • 4

                  #9
                  How big is your dataset? I was able to export my dataset as .ace with 17k contigs + 250k singletons.


                  • boetsie
                    Senior Member
                    • Feb 2010
                    • 245

                    #10
                    More than 1 million contigs.


                    • danix
                      Junior Member
                      • Apr 2010
                      • 7

                      #11
                      building scaffold using Bambus - .mates problem

                      Hi, I'm trying to run Bambus but I don't have any .mates file. Does anyone know how I can create these files?
                      I have 454 output (fasta + sff) from a bacterial genome and I assembled it with phrap. I have already converted the .ace to .contig, using ace2contig from AMOS.
                      Thanx!


                      • boetsie
                        Senior Member
                        • Feb 2010
                        • 245

                        #12
                        This script I got from Sergey Koren from AMOS (which I adapted a bit):

                        grep '^>' my.fasta | sed 's/^>//' | awk '{print $1}' | sed 's/\/[12]$//' | sort | uniq -c | awk '{if ($1 == 2) print $2"/1\t"$2"/2\tsmall"}' > mates.txt

                        You need to put in the FASTA file with the read names as 'my.fasta'.

                        The read names in 'my.fasta' must end with /1 and /2.
                        If your read names have other suffixes, like .x and .y, you should replace

                        sed 's/\/[12]$//'

                        with, for example,

                        sed 's/\.[xy]$//'

                        in the code above, and change "/1" and "/2" in the final awk print to ".x" and ".y" so that the names written to the file match your reads.

                        If you have two FASTA files, just put in one of them and change
                        if ($1 == 2) to if ($1 == 1)
                        in the code; this way you only have to run it for one file.
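
                        For anyone who prefers a script to a shell pipeline, the same pairing idea can be sketched in Python. This is only a sketch under assumptions: the function names `read_names` and `mate_lines` are made up for illustration, the library name "small" is taken from the one-liner, and, unlike the one-liner, a pair is written only when both suffixes were actually seen for a read name.

```python
from collections import defaultdict


def read_names(fasta_lines):
    """Yield the first whitespace-delimited token of each FASTA header line."""
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def mate_lines(fasta_lines, library="small", suffixes=("/1", "/2")):
    """Return 'name<first>TABname<second>TABlibrary' for each complete pair."""
    seen = defaultdict(set)  # base read name -> set of suffixes observed
    for name in read_names(fasta_lines):
        for suffix in suffixes:
            if name.endswith(suffix):
                seen[name[: -len(suffix)]].add(suffix)
                break
    first, second = suffixes
    return [
        f"{base}{first}\t{base}{second}\t{library}"
        for base in sorted(seen)
        if len(seen[base]) == 2  # keep only reads with both mates present
    ]


if __name__ == "__main__":
    # Tiny in-memory example; r2 has no /2 mate, so it is skipped.
    demo = [">r1/1", "ACGT", ">r1/2", "TTGA", ">r2/1", "GGCC"]
    print("\n".join(mate_lines(demo)))
```

                        For reads named with .x and .y, the call would be mate_lines(lines, suffixes=(".x", ".y")).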

                        This will print the names to 'mates.txt'. The only thing left to do is to set your library name and insert sizes at the top of this file.
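
                        A possible way to add that header line, as a sketch only. It assumes the header has the form "library <name> <min size> <max size>" (tab-separated here), which should be checked against the Bambus documentation. The function name `with_library_header` is made up, and the sizes 2000 and 4000 are placeholders, not values from this thread.

```python
def with_library_header(pair_lines, library, min_size, max_size):
    """Return mates-file text: one library line followed by the pair lines."""
    # Assumed header layout: library <name> <min> <max>; verify before use.
    header = f"library\t{library}\t{min_size}\t{max_size}"
    return "\n".join([header, *pair_lines]) + "\n"


if __name__ == "__main__":
    pairs = ["r1/1\tr1/2\tsmall"]  # e.g. the lines written to mates.txt
    print(with_library_header(pairs, "small", 2000, 4000), end="")
```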

                        Bambus will probably generate a lot of errors, because some names are not found in the .contig file. But this shouldn't be a problem.

                        Hope this works; otherwise, ask me.


                        • danix
                          Junior Member
                          • Apr 2010
                          • 7

                          #13
                          building scaffold using Bambus - .mates problem

                          Thanx boetsie for your quick answer.
                          But I can't use your script in this project, because the 454 output files I have, 454Reads.01.MID4.fna and 454Reads.02.MID4.fna, contain sequences with different names, so every ID is unique and the script creates an empty mates.txt.
                          Besides, the other bacterium I'm working with has only one FASTA file from 454.

                          Both FASTA files look like this:
                          >F35ERS102DJ7GS rank=0000002 x=1343.0 y=826.0 length=56
                          ATCAGACACGGAGGCGTACGCGCCGCTGTTCCAGGTGATGCTGGCATTCCAGAACA
                          >F35ERS102DBYUE rank=0000006 x=1249.0 y=1428.0 length=69
                          ATCAGACACGCCGCCGGCACCTTCGCCGCTGCCGCGCTCGCCACCGGTGGCACCCGTCGT
                          GCTGTGGTC
                          >F35ERS102C47FN rank=0000036 x=1172.0 y=1361.0 length=68
                          ATCAGACACGAGGTGAAGACCGGTTTCCGTCGCGGCGGAGAATAGCCGAACATCAGCGCG
                          CGATCGGG

                          I'm wondering if there is a way to create the .mates from the data I have. Any other idea?

                          Thanx


                          • danix
                            Junior Member
                            • Apr 2010
                            • 7

                            #14
                            To add to the information I gave before:
                            454Reads.01.MID4.fna is like this:
                            >FZ92HC101CZUHH length=41 xy=1111_1155 region=1 run=R_2009_08_04_12_33_02_
                            CGCGCGTTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC
                            >FZ92HC101DJEHD length=46 xy=1334_0127 region=1 run=R_2009_08_04_12_33_02_
                            GTCTCGCGTCGTGTCTTCGCGTCGTATGCGGTACTGGTCAGGCGTT

                            454Reads.02.MID4.fna is like this:
                            >FZ92HC102IDBLW length=40 xy=3315_0370 region=2 run=R_2009_08_04_12_33_02_
                            CGCGCGTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC
                            >FZ92HC102JYG94 length=40 xy=3966_0618 region=2 run=R_2009_08_04_12_33_02_
                            CGCGCGTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC

                            Can I extract any information from these fastas to create a .mates?
                            Thanx
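
                            To make it easier to see what these headers carry, here is a small sketch that splits a 454 header into its read name and its key=value fields. The function name `parse_454_header` is made up for illustration. In the sample headers above the fields are length, xy, region and run (or rank, x, y and length in the first example).

```python
def parse_454_header(header):
    """Split a FASTA header into (read_name, {key: value}) for key=value fields."""
    fields = header.lstrip(">").split()
    name = fields[0]
    attrs = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        attrs[key] = value
    return name, attrs


if __name__ == "__main__":
    line = ">FZ92HC101CZUHH length=41 xy=1111_1155 region=1 run=R_2009_08_04_12_33_02_"
    name, attrs = parse_454_header(line)
    print(name, attrs)
```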


                            • boetsie
                              Senior Member
                              • Feb 2010
                              • 245

                              #15
                              Hmm, I see: it's 454 data, which doesn't have a suffix like .x or /1. (Sorry, I have never worked with 454 data before.)

                              Can you tell me what your .contig file looks like?

                              The read names in the mates file should be the same as the first string after the "#" on the read lines of the .contig file. Each of these lines represents a read that has mapped to the contig (the contig line itself starts with "##").

                              So if the line with "#" starts with e.g. FZ92HC102IDBLW, followed by the offset in parentheses, like:

                              #FZ92HC102IDBLW(0)

                              you should extract the names out of both files and put them in the same file.

                              If this is indeed the case, you can use the script I attached.
                              Use it with:

                              perl testmates.pl file1 file2

                              It will generate a txt file with the mates. The only thing left to do is to put the library sizes at the top of the file.

                              More info about the .contig file at http://www.cbcb.umd.edu/research/con...entation.shtml
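
                              The name check described above can be sketched in Python. This is a rough sketch, assuming read lines start with a single "#" followed by the read name and the offset in parentheses, and contig lines start with "##". The function names are made up for illustration, and the contig line in the example is a placeholder rather than real output.

```python
def contig_read_names(contig_lines):
    """Return read names from '#name(offset)' lines, skipping '##' contig lines."""
    names = []
    for line in contig_lines:
        if line.startswith("#") and not line.startswith("##"):
            names.append(line[1:].split("(", 1)[0].strip())
    return names


def missing_from_contig(mate_names, contig_lines):
    """Return the mate names that never appear as a read in the .contig lines."""
    known = set(contig_read_names(contig_lines))
    return [name for name in mate_names if name not in known]


if __name__ == "__main__":
    contig = [
        "##contig1 (placeholder contig line)",
        "#FZ92HC102IDBLW(0)",
        "CGCGCGTTCTCGTACGG",
        "#FZ92HC101CZUHH(12)",
        "CGCGCGTTTCTCGTACG",
    ]
    print(contig_read_names(contig))
    print(missing_from_contig(["FZ92HC102IDBLW", "FZ92HC102JYG94"], contig))
```

                              Names reported as missing are the ones Bambus would not be able to find in the .contig file.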

                              Hope this helps.
                              Attached Files
                              Last edited by boetsie; 04-15-2010, 05:25 AM.
