SEQanswers

Old 02-22-2010, 06:29 AM   #1
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default building scaffolds using a contig and mate pair

Hi all,

We are currently performing a de novo assembly using Illumina mate pairs. We assembled the reads with CLCBio, but CLCBio produces only contigs, not scaffolds. Since we have mate pairs, we would like to use them to build scaffolds.

The problem is that assembly programs like SOAPdenovo or SSAKE use files that were produced during their own contig assembly. They don't offer a stand-alone program for just scaffolding an existing contig file.

Is there any software or algorithm that takes a contig file (in .fasta format) and mate-pair files as input and produces scaffolds? Or does someone have another solution?

Kind regards,
Marten
Old 02-22-2010, 10:29 AM   #2
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

I'm pretty sure MIRA can do that. Set aside a couple of days to read the manual, though (it is very long).
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
Old 02-24-2010, 04:22 AM   #3
niazi84@hotmail.com
Member
 
Location: Uppsala

Join Date: Jan 2010
Posts: 25
Default

Marten, you can use Bambus 2.33 from AMOS. It takes a contig file and a mate file as input. I am also trying to use it, but I don't have the mate file that Bambus requires. Do you know how to create a mate file? I have paired-end reads from Illumina.
__________________
~Adnan~
Old 02-25-2010, 03:01 AM   #4
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Thanks for the replies, but I don't think your answers work for me.

MIRA uses Bambus for scaffolding (if I'm correct?).
However, Bambus doesn't read a .fasta file for scaffolding; it needs a .contig file, which I don't have. In addition, I can't pass in the two mate-pair files I have (one for each read end), only a regular expression describing how the two reads of a pair are named.

So, my input is:

- 1 .fasta file containing contigs
- 2 .fasta files containing the mate pairs

Is there a way to do this?

Kind regards,
Marten
Old 04-07-2010, 04:51 AM   #5
gabriel.lichtenstein
Junior Member
 
Location: Buenos Aires

Join Date: Dec 2009
Posts: 7
Default

Any updates on this?

Last edited by gabriel.lichtenstein; 04-07-2010 at 05:06 AM.
Old 04-08-2010, 05:07 AM   #6
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Quote:
Originally Posted by gabriel.lichtenstein View Post
Any updates on this?
Well, not yet. We also had the problem that we couldn't generate an .ace file for Bambus. We are currently working on a script for this, since as far as we know none of the existing programs can do it.
Old 04-08-2010, 10:07 AM   #7
mack
Junior Member
 
Location: Vancouver

Join Date: Oct 2009
Posts: 4
Default

I believe CLCBio can export assemblies as .ace files.
Old 04-09-2010, 12:20 AM   #8
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Quote:
Originally Posted by mack View Post
I believe CLCBio can export assemblies as .ace files.
For large datasets, somehow no .ace files are produced.
Old 04-09-2010, 08:51 AM   #9
mack
Junior Member
 
Location: Vancouver

Join Date: Oct 2009
Posts: 4
Default

Quote:
Originally Posted by boetsie View Post
For large datasets, somehow no .ace files are produced.
How big is your dataset? I was able to export my dataset as .ace with 17k contigs + 250k singletons.
Old 04-14-2010, 01:11 AM   #10
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Quote:
Originally Posted by mack View Post
How big is your dataset? I was able to export my dataset as .ace with 17k contigs + 250k singletons.
More than 1 million contigs.
Old 04-14-2010, 02:56 AM   #11
danix
Junior Member
 
Location: spain

Join Date: Apr 2010
Posts: 7
Default building scaffold using bambus - .mates problem

Hi, I'm trying to run Bambus but I don't have a .mates file. Does anyone know how I can create this file?
I have 454 output (fasta + sff) from a bacterial genome, which I assembled with phrap. I have already converted the .ace to .contig using ace2contig from AMOS.
Thanks!
Old 04-14-2010, 03:58 AM   #12
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Quote:
Originally Posted by danix View Post
Hi, I'm trying to run Bambus but I don't have a .mates file. Does anyone know how I can create this file?
I have 454 output (fasta + sff) from a bacterial genome, which I assembled with phrap. I have already converted the .ace to .contig using ace2contig from AMOS.
Thanks!
I got this script from Sergey Koren of AMOS and adapted it a bit:

grep ">" my.fasta | sed 's/>//g' | sed 's/\/1$/./;s/\/2$/./' | awk -F "." '{print $1}' | sort | uniq -c | awk '{if ($1 == 2) print $2"/1\t"$2"/2\tsmall"}' > mates.txt

Replace 'my.fasta' with the fasta file that contains your reads.

The read names in 'my.fasta' must end with /1 and /2.
If your read names end differently, for example with .x and .y, you should replace

sed 's/\/1$/./;s/\/2$/./'

with, for example,

sed 's/\.x$/./;s/\.y$/./'

in the command above, and change the "/1" and "/2" in the final awk print statement to match.

If you have two fasta files (one per read end), just pass in one of them and change
if ($1 == 2) to if ($1 == 1)
in the command; this way you only have to run it on one file.

This will write the mate pairs to 'mates.txt'. The only thing left to do is to add your library name and insert sizes at the top of this file.

Bambus will probably report a lot of errors, because some read names are not found in the .contig file, but this shouldn't be a problem.

Hope this works; otherwise, ask me.
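
For readers who prefer something easier to adapt than a one-liner, here is a minimal Python sketch of the same idea: collect the read names from the fasta headers, strip the pair suffix, and print one tab-separated line for every read whose two ends are both present. The function names, the default suffixes and the library label "small" are illustrative assumptions, not part of AMOS or Bambus, and the library header lines still have to be added by hand.

```python
from collections import defaultdict


def read_names(fasta_lines):
    """Yield the read name (first word after '>') of every fasta header line."""
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def mate_lines(names, suffixes=("/1", "/2"), library="small"):
    """Return 'read1<TAB>read2<TAB>library' lines for complete pairs only.

    `suffixes` and `library` are hypothetical defaults; adjust them to
    the naming scheme of your own reads.
    """
    first, second = suffixes
    seen = defaultdict(set)  # base name -> set of suffixes observed
    for name in names:
        for suffix in suffixes:
            if name.endswith(suffix):
                seen[name[: -len(suffix)]].add(suffix)
                break
    return [
        f"{base}{first}\t{base}{second}\t{library}"
        for base in sorted(seen)
        if seen[base] == {first, second}
    ]


if __name__ == "__main__":
    example = [">readA/1", "ACGT", ">readA/2", "TTGA", ">readB/1", "GGCC"]
    for line in mate_lines(read_names(example)):
        print(line)
```

To use it on a real file, pass an open file handle instead of the example list, e.g. `mate_lines(read_names(open("my.fasta")))`. Reads with only one end present (readB in the example) are skipped.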
Old 04-15-2010, 02:53 AM   #13
danix
Junior Member
 
Location: spain

Join Date: Apr 2010
Posts: 7
Default building scaffold using bambus - .mates problem

Thanks, boetsie, for your quick answer.
But I can't use your script in this project, because the 454 output files I have, 454Reads.01.MID4.fna and 454Reads.02.MID4.fna, contain sequences with different names, so every ID is unique and the script creates an empty mates.txt.
Besides, the other bacterium I'm working with has only one fasta file from 454.

Both fasta files look like this:
>F35ERS102DJ7GS rank=0000002 x=1343.0 y=826.0 length=56
ATCAGACACGGAGGCGTACGCGCCGCTGTTCCAGGTGATGCTGGCATTCCAGAACA
>F35ERS102DBYUE rank=0000006 x=1249.0 y=1428.0 length=69
ATCAGACACGCCGCCGGCACCTTCGCCGCTGCCGCGCTCGCCACCGGTGGCACCCGTCGT
GCTGTGGTC
>F35ERS102C47FN rank=0000036 x=1172.0 y=1361.0 length=68
ATCAGACACGAGGTGAAGACCGGTTTCCGTCGCGGCGGAGAATAGCCGAACATCAGCGCG
CGATCGGG

I'm wondering if there is a way to create the .mates file from the data I have. Any other ideas?

Thanks
Old 04-15-2010, 03:53 AM   #14
danix
Junior Member
 
Location: spain

Join Date: Apr 2010
Posts: 7
Default

To complement the information I gave before:
454Reads.01.MID4.fna looks like this:
>FZ92HC101CZUHH length=41 xy=1111_1155 region=1 run=R_2009_08_04_12_33_02_
CGCGCGTTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC
>FZ92HC101DJEHD length=46 xy=1334_0127 region=1 run=R_2009_08_04_12_33_02_
GTCTCGCGTCGTGTCTTCGCGTCGTATGCGGTACTGGTCAGGCGTT

454Reads.02.MID4.fna looks like this:
>FZ92HC102IDBLW length=40 xy=3315_0370 region=2 run=R_2009_08_04_12_33_02_
CGCGCGTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC
>FZ92HC102JYG94 length=40 xy=3966_0618 region=2 run=R_2009_08_04_12_33_02_
CGCGCGTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC

Can I extract any information from these fasta files to create a .mates file?
Thanks
Old 04-15-2010, 04:50 AM   #15
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Hmm, I see: it's 454 data, which doesn't have a suffix like .x or /1. (Sorry, I have never worked with 454 data before.)

Can you tell me what your .contig file looks like?

The read names in the mates file should be the same as the first string after the "#" on the read lines in the .contig file. Each of these lines identifies a read that was placed in the contig (the contig header lines themselves start with "##").

So if a line starting with "#" begins with e.g. FZ92HC102IDBLW, followed by the offset in parentheses, like

#FZ92HC102IDBLW(0)

then you should extract the read names from both fasta files and put them in the same mates file.

If this is indeed the case, you can use the script I attached.
Run it with:

perl testmates.pl file1 file2

It will generate a text file with the mates. The only thing left to do is to put the library sizes at the top of the file.

More info about the .contig file format is at http://www.cbcb.umd.edu/research/con...entation.shtml

Hope this helps.
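
To check whether the names in a mates file will actually be found by Bambus, it can help to pull the read names out of the .contig file first. The following Python sketch does that under the assumption that read lines look like `#NAME(offset) ...` and contig header lines start with `##`, as described above. The function names are made up for this example and are not part of AMOS.

```python
import re

# A read line starts with a single '#', then the read name, then the
# offset in parentheses. Contig headers start with '##' and are skipped
# because the name may not begin with '#'.
READ_LINE = re.compile(r"^#([^#(\s]+)\((-?\d+)\)")


def contig_read_names(contig_lines):
    """Return the set of read names placed in any contig."""
    names = set()
    for line in contig_lines:
        match = READ_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_from_contig(mate_names, contig_lines):
    """Return the mate names (sorted) that do not occur in the .contig file."""
    placed = contig_read_names(contig_lines)
    return sorted(name for name in mate_names if name not in placed)


if __name__ == "__main__":
    contig = [
        "##Contig1 1 458 bases, 00000000 checksum.",
        "agttcggcatggggtcaggtgg",
        "#FZ92HC101BPK62(0) [] 458 bases, 00000000 checksum. {1 458} <1 459>",
        "agttcggcatggggtcaggtgg",
    ]
    print(contig_read_names(contig))
    print(missing_from_contig(["FZ92HC101BPK62", "FZ92HC102IDBLW"], contig))
```

Names reported as missing are the ones Bambus is likely to complain about; as noted earlier in the thread, that is not necessarily a problem, since unplaced reads simply cannot contribute links.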
Attached Files
File Type: pl testmates.pl (820 Bytes, 91 views)

Last edited by boetsie; 04-15-2010 at 05:25 AM.
Old 04-15-2010, 05:38 AM   #16
danix
Junior Member
 
Location: spain

Join Date: Apr 2010
Posts: 7
Default

Hi boetsie, thanks again for your quick reply.
Here is part of my .contig file. It was created by ace2contig (from the AMOS package), using the .ace file that phrap generated after the assembly as input.
I'll try the script you attached.
Thank you so much again!

##Contig1 1 458 bases, 00000000 checksum.
agttcggcatggggtcaggtggttccactgcgctattgccgccaggcaaattcttcaatc
tgagaaagctgatgtaagtaattcgttcattcgctacaaggccagaaacacttcttgggt
gttgtatggttaagcctcacgggtaattagtatgggttagctcaacgtatcgctacgctt
acacaccccacctatcaacgttgtggtctccaacggccctttaggaccctcaaggggtca
gggatgactcatctcagggctcgcttcccgcttagatgctttcagcggttatcgattccg
aacttagctaccgggcagtgccactggcgtgacaacccgaacaccagaggttcgttcact
ccggtcctctcgtactaggagcaactcccttcaatcatccaacgcccacggcagataggg
accgaactgtctcacgacgttctgaacccagctcgcgt
#FZ92HC101BPK62(0) [] 458 bases, 00000000 checksum. {1 458} <1 459>
agttcggcatggggtcaggtggttccactgcgctattgccgccaggcaaattcttcaatc
tgagaaagctgatgtaagtaattcgttcattcgctacaaggccagaaacacttcttgggt
gttgtatggttaagcctcacgggtaattagtatgggttagctcaacgtatcgctacgctt
acacaccccacctatcaacgttgtggtctccaacggccctttaggaccctcaaggggtca
gggatgactcatctcagggctcgcttcccgcttagatgctttcagcggttatcgattccg
aacttagctaccgggcagtgccactggcgtgacaacccgaacaccagaggttcgttcact
ccggtcctctcgtactaggagcaactcccttcaatcatccaacgcccacggcagataggg
accgaactgtctcacgacgttctgaacccagctcgcgt
##Contig2 1 379 bases, 00000000 checksum.
ttctgagggaacacgcgttctgcgcgggttgtcttggtgctcactgttttccgccccgga
gtttgtggggtgttgggggtggtgggtgtgtgttgtttgagaagtgcatagtggatgcga
gcatctagcccggcgagttccttggtgttcttgttgggttgtgtgttctgcaatttcgat
tctggtttgtgcgatcgcgtgttgtgatcgttgatttttgtttgttgtccgcattcgcgt
ctcgggcactgtttggtgtgtggggtgtgtttgtgggtgttgttgtaagtgtttgagggc
gttcggtggatgccttggtaccaggagccgatgaaggacggccgtgcggtgggtcagtga
taaatcgacatgttaggtg
#FZ92HC101BFQDN(0) [] 379 bases, 00000000 checksum. {1 379} <1 380>
ttctgagggaacacgcgttctgcgcgggttgtcttggtgctcactgttttccgccccgga
gtttgtggggtgttgggggtggtgggtgtgtgttgtttgagaagtgcatagtggatgcga
gcatctagcccggcgagttccttggtgttcttgttgggttgtgtgttctgcaatttcgat
tctggtttgtgcgatcgcgtgttgtgatcgttgatttttgtttgttgtccgcattcgcgt
ctcgggcactgtttggtgtgtggggtgtgtttgtgggtgttgttgtaagtgtttgagggc
gttcggtggatgccttggtaccaggagccgatgaaggacggccgtgcggtgggtcagtga
taaatcgacatgttaggtg
Old 04-15-2010, 06:38 AM   #17
danix
Junior Member
 
Location: spain

Join Date: Apr 2010
Posts: 7
Default

Hi, I forgot to mention that I also have the .sff files. If I can use them to create the .mates file, that would be great.
Can I? If so, how?
Old 04-15-2010, 06:47 AM   #18
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Quote:
Originally Posted by danix View Post
Hi, I forgot to mention that I also have the .sff files. If I can use them to create the .mates file, that would be great.
Can I? If so, how?
I have no idea... I've never used an .sff file. What does it look like? Why do you want to use it? Does it contain additional data?

If the reads that are present in the .contig file are all present in the two .fasta files, you can just use the two fasta files to create the .mates file.
Old 04-15-2010, 07:10 AM   #19
danix
Junior Member
 
Location: spain

Join Date: Apr 2010
Posts: 7
Default

Hi, the 454 output is .sff (it looks like a binary file), but we use a script called sff_extract to convert this data into fasta, xml and quality files. I was just reading that "The 454 paired-end protocol will generate reads which contain the forward and reverse direction in one read, separated by a linker."
So I think the key to generating the .mates file is the .sff, but I don't know how.
I think it shouldn't be so complicated...
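
If each 454 paired-end read really contains both ends separated by a linker, the pairing information is inside the read itself rather than in the read names. The Python sketch below illustrates the idea only: it splits a read at the first exact occurrence of a linker and names the two halves so they can be listed as mates. The linker used in the example is a made-up placeholder (the real sequence depends on the library kit), the ".f"/".r" suffixes are hypothetical, and real data would need tolerant matching for sequencing errors plus care about the orientation of the two halves.

```python
def split_paired_read(name, sequence, linker, min_len=1):
    """Split a read at the first exact linker match.

    Returns ((name + '.f', left), (name + '.r', right)), or None when the
    linker is absent or one side is shorter than `min_len`.
    The '.f' / '.r' suffixes are illustrative, not a 454 or AMOS convention.
    """
    pos = sequence.find(linker)
    if pos == -1:
        return None
    left = sequence[:pos]
    right = sequence[pos + len(linker):]
    if len(left) < min_len or len(right) < min_len:
        return None
    return (f"{name}.f", left), (f"{name}.r", right)


def mate_line(name, library="libname"):
    """Build the tab-separated mates entry for a read that was split."""
    return f"{name}.f\t{name}.r\t{library}"


if __name__ == "__main__":
    placeholder_linker = "CCGG"  # hypothetical, NOT a real 454 linker
    halves = split_paired_read("read1", "AAAACCGGTTTT", placeholder_linker)
    if halves is not None:
        print(halves)
        print(mate_line("read1"))
```

Note that the two halves would also have to be the reads that went into the assembly, under these same names, for the resulting mates file to match the .contig file.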
Old 04-15-2010, 07:17 AM   #20
danix
Junior Member
 
Location: spain

Join Date: Apr 2010
Posts: 7
Default

Quote:
Originally Posted by boetsie View Post
I have no idea... I've never used an .sff file. What does it look like? Why do you want to use it? Does it contain additional data?

If the reads that are present in the .contig file are all present in the two .fasta files, you can just use the two fasta files to create the .mates file.
How do I create the .mates file? I tried the script you sent me and the output isn't right. Besides, I don't understand why FZ92HC101CZUHH.1 and FZ92HC102IDBLW.2 are on the same line. How can I tell that they are mates? I'm really lost and confused now...

FZ92HC101CZUHH.1 FZ92HC102IDBLW.2 libname
FZ92HC101DJEHD.1 FZ92HC102JYG94.2 libname
FZ92HC101DUWKQ.1 FZ92HC102HS1LU.2 libname
FZ92HC101CUUV5.1 FZ92HC102G8H4Z.2 libname
FZ92HC101EMKQX.1 FZ92HC102HOD38.2 libname
FZ92HC101CE653.1 FZ92HC102HO0J7.2 libname
FZ92HC101ECTBB.1 FZ92HC102IBNJJ.2 libname
FZ92HC101DXMSC.1 TGATCCGGCGCAGGCGTATCTGGGCTCGGATCGTGCCTGGTGCCGACGGCGATGAACGAC
libname
FZ92HC101C587C.1 FZ92HC102F3E16.2 libname
FZ92HC101BZ63S.1 CGGTCGGCCGCGGCCGATCTCGGGATTGCGCGGCGTGTGCAT
libname
FZ92HC101DEODE.1 CCGCGTGGACATGCCGTTCGAGGAACCGTGGACGCAACC
libname
FZ92HC101DP9HX.1 ATCGGCTATGCACAGGTCATCGAGTATCTCGACGGCG
libname
FZ92HC101EE90B.1 ACGTCCGACGTGATCAGGAGCGAGTCGGTGACGGCGCTTCGCACTCCGAGGG
libname
TTTGATGATCGACATCAAT GCGTTCGACTACCAGTTCGTCGGACCATCCGGGTAGCGTGTCGCAAGGGTCGGTTCCGAA
libname
CGTTCGCTGAGCACCGCCGAATCGAGCAGTTCGCGGATCTCGTCGAACGTCCNCGA FZ92HC102GE3MB.2 libname
CGTACGGATGTAGCTGGTGAAGAGGTCCCTTGCGGGCGGAGAAGTCGAGTCGTTCCGTCG TCGAGAGGCCGCGGAAGCGGCCGGAAAGGACGGCAACGATGTTTGACCGTTTCAACTCAG
libname
FZ92HC101DBOTK.1 FZ92HC102GVOHT.2 libname
FZ92HC101BEEQB.1 TCTGCGTGGAGACCGTGACGGCTGATCTACGGCCNCCTCGGCCGATGATCGCCGCCT