SEQanswers

11-28-2011, 06:43 AM #1
stevebaeyen
Member
Location: Belgium
Join Date: Aug 2011
Posts: 18
scaffolding GAII paired-end library with Hiseq mate-pairs

Hello,
We are a Belgian research team studying the roughly 4 Mbp genome of a bacterial plant pathogen (and we are newcomers to NGS data analysis). We are getting some unexpected results during de novo assembly of our target genome using a combined paired-end and mate-pair library. No good reference genome is available, so de novo assembly is our only option. We would like to share some of our results for your consideration. Maybe some of you can tell us whether this is a normal result, or whether we are doing something wrong here.
First off, the data-sets:
1. One Illumina GA, paired-end short read set (50bp reads, 350Mb, 375bp insert), which gives us a theoretical 70x coverage.
2. One Illumina Hiseq, Mate-pair short read set (100bp reads, 500Mb, 5kb insert), which gives us a combined 160x coverage.
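For reference, theoretical coverage is simply the total number of sequenced bases divided by the genome size. A minimal Python sketch follows. The function name and the 4 Mbp genome size are assumptions made for illustration (the post only gives "+/- 4Mbp"), so the printed values need not match the figures quoted in the list, which may rest on a different genome size or on filtered read yields.

```python
def theoretical_coverage(total_bases: float, genome_size: float) -> float:
    """Return fold coverage: sequenced bases divided by genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


GENOME_SIZE = 4_000_000  # assumed: the post gives the genome only as "+/- 4Mbp"

pe_cov = theoretical_coverage(350_000_000, GENOME_SIZE)  # paired-end set, 350 Mb
mp_cov = theoretical_coverage(500_000_000, GENOME_SIZE)  # mate-pair set, 500 Mb

print(f"PE: {pe_cov:.1f}x, MP: {mp_cov:.1f}x, combined: {pe_cov + mp_cov:.1f}x")
```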
When we used the PE set alone for de novo assembly in CLC-Bio, we got 478 contigs with an N50 of about 20 kbp. Looking at the contigs, we saw that repetitive fragments (IS sequences) were the major cause of the contig break-up. Based on the literature, we thought most of these gaps could be closed if we combined the PE set with an additional MP dataset.
However, if we combine both sets in a de novo assembly in CLC-Bio's beta assembler (plugin in v4.8), we get 493 contigs and an N50 of 22 kb.
When we try to scaffold the 478 contigs of the PE-only assembly with the MP set in SSPACE, we can reduce them to 63 scaffolds, but the program has to introduce some 300,000 Ns into the sequence (total 4.2 Mb) to accomplish it. DNAStar also has problems with the Illumina 1.9 format from the HiSeq 2000. Does anybody have experience using HiSeq data with this software?
Does anybody here have a clue what we are doing wrong and how we could improve this, or is there a logical explanation for why the MP set is not giving us better gap closure?
Thank you for any remarks/suggestions!
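
The post above compares assemblies by contig count and N50. As a reminder of what N50 measures, here is a small Python sketch; the function name and the example lengths are made up for illustration and are not data from this thread.

```python
def n50(lengths):
    """Return N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Hypothetical contig lengths, not taken from the thread
example = [100, 80, 60, 40, 20]
print(n50(example))
```

A higher N50 with the same total length means the sequence sits in fewer, longer pieces, which is why the scaffolding results later in the thread are reported this way.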

11-28-2011, 07:43 AM #2
boetsie
Senior Member
Location: NL, Leiden
Join Date: Feb 2010
Posts: 245

Hi Steve,

Just wanted to say that it is important to set the insert size for SSPACE as accurately as possible. There are tools to determine the median/mean insert size and its deviation. One of them is included in the SSPACE premium package. Since you are a BaseClear customer, you can get the SSPACE premium version for free if you do not have it already.
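
To illustrate the kind of summary such a tool produces, here is a minimal Python sketch that computes the median, mean and deviation from a list of observed insert sizes, for example distances between read pairs mapped to the same contig. It is not the SSPACE premium tool; the function name, the filtering thresholds and the numbers are assumptions for illustration.

```python
import statistics


def insert_size_stats(sizes, min_size=0, max_size=None):
    """Summarise observed insert sizes (bp) of read pairs mapped to one contig.

    min_size / max_size optionally discard implausible values first, e.g.
    short-insert pairs in a mate-pair library.
    """
    kept = [s for s in sizes
            if s >= min_size and (max_size is None or s <= max_size)]
    if len(kept) < 2:
        raise ValueError("need at least two usable insert sizes")
    mean = statistics.mean(kept)
    stdev = statistics.stdev(kept)
    return {
        "n": len(kept),
        "median": statistics.median(kept),
        "mean": mean,
        "stdev": stdev,
        "relative_deviation": stdev / mean,
    }


observed = [4800, 5100, 4950, 5200, 5050, 390, 4700]  # hypothetical values
print(insert_size_stats(observed, min_size=1000))
```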

Furthermore, the assembly will not improve with the mate pairs; sometimes it will even get worse. You already have enough coverage with your paired-end sequences. The main reason is that mate-pair sequencing gives biased coverage along the genome: some regions are covered more than others. I would suggest not including the mate pairs in the initial assembly. Use the mate pairs only for scaffolding, together with the paired-end reads used in the initial assembly.

What might improve the assembly is trimming low-quality nucleotides and removing low-quality reads using CLC bio's trimmer.
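
As a rough illustration of what 3' quality trimming does, here is a minimal Python sketch. It is not CLC bio's trimming algorithm; the function name, the threshold of Q20 and the Phred+33 quality offset are assumptions for the example.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim trailing bases whose Phred quality is below min_q.

    qual is the FASTQ quality string; offset=33 assumes Phred+33 encoding.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    # Step back from the 3' end while the last remaining base is low quality.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


seq, qual = trim_3prime("ACGTACGT", "IIIIII##")  # toy read
print(seq, qual)
```

Reads that end up very short after trimming are usually dropped as well, together with their mate if pairing has to be preserved.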

Once you have obtained the scaffolds, you can fill the gaps (Ns) with tools like SOAP's GapCloser from BGI, or IMAGE. We are currently also working on a tool to do this.

Regards,
Marten Boetzer
BaseClear

11-28-2011, 08:06 AM #3
nickloman
Senior Member
Location: Birmingham, UK
Join Date: Jul 2009
Posts: 356

You are using mate-pair data to scaffold (join) contigs together rather than actually closing gaps, so what you are seeing is not unusual. The Ns represent repeat sequences of known length.

If you want to attempt to close gaps within those scaffolds, one option is to use a local assembly approach like GapCloser (part of SOAPdenovo), which I have had good results with, but be aware that you won't be able to close all (or maybe even many) of the gaps this way.

But in fact I'd try your assembly with Velvet or SOAPdenovo rather than CLC-Bio first off and see if it does a better job.

11-28-2011, 09:00 AM #4
pmiguel
Senior Member
Location: Purdue University, West Lafayette, Indiana
Join Date: Aug 2008
Posts: 2,316

So, the process of replacing those "N's" between scaffolds with actual sequence using PE and ME data is apparently called "gap closing". Since you are using a commercial software package to do your assembly, etc, you might want to ask them if they include a "gap closing" module.

Otherwise, there are liberated programs available for doing gap closing. (At least one is mentioned elsewhere in this thread.)
--
Phillip

11-28-2011, 09:14 AM #5
nickloman
Senior Member
Location: Birmingham, UK
Join Date: Jul 2009
Posts: 356

Quote:
Originally Posted by pmiguel
So, the process of replacing those "N's" between scaffolds with actual sequence using PE and ME data is apparently called "gap closing".
That's what I call it anyway. One point about scaffolding (that perhaps is not well recognised) is that you don't usually end up with fewer gaps, just that the gaps become better characterised, e.g. you now know that contig A joins to contig B with a gap of N bases.
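
To see how well characterised the gaps in a scaffold are, one can simply list the runs of N. The Python sketch below does that for a single sequence string; the function name and the toy scaffold are illustrative assumptions, and reading the sequence from a FASTA file is left out.

```python
import re


def find_gaps(scaffold_seq, min_len=1):
    """Return (start, length) for each run of N/n in a scaffold (0-based start)."""
    return [
        (m.start(), m.end() - m.start())
        for m in re.finditer(r"[Nn]+", scaffold_seq)
        if m.end() - m.start() >= min_len
    ]


scaffold = "ACGT" + "N" * 10 + "GGCC" + "N" * 3 + "TTAA"  # toy scaffold
gaps = find_gaps(scaffold)
print(len(gaps), "gaps,", sum(length for _, length in gaps), "N bases")
```

Running this before and after a gap-closing step gives the "gaps closed" and "N bases remaining" figures quoted elsewhere in the thread.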

12-14-2011, 06:47 AM #6
stevebaeyen
Member
Location: Belgium
Join Date: Aug 2011
Posts: 18
IMAGE2 gap closing

Quote:
Originally Posted by boetsie
Hi Steve,
Once you have obtained the scaffolds, you can fill the gaps (Ns) with tools like SOAP's GapCloser from BGI, or IMAGE. We are currently also working on a tool to do this.
BaseClear
Hi Boetsie,
We obtained very nice scaffolds using your SSPACE Premium v2 software (up to 937 kb, N50 = 275 kb). I tried using IMAGE2, but there is no 'readme' or 'install' file and I can't find any information that helps me run the software on the example provided with the program (the program runs but does not close the gaps). I tried to contact Jason Tsai but have had no reply so far.
This is what I did:
I downloaded the Dec. 2 version (v2.3) from SourceForge, copied the precompiled binaries to /usr/local/bin and made them executable on a 64-bit Ubuntu 11.10 Linux system. I looked at the script run.sh and saw some variables that have to be set (such as the paths to velvet, ssaha, etc.), but I still do not get any gaps closed by iteration 10 (see output in attachment imagetest.txt).

# software path
# this is the path where the IMAGE path is
# Please change it accordingly
VELPATH=~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/
SSAHADIR=~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/
WALKPATH=~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/
Then I did:
cd /home/sbaeyen/Bio/IMAGE/IMAGE_version2/example
sbaeyen@PXLSEQ:~/Bio/IMAGE/IMAGE_version2/example$ home/sbaeyen/Bio/IMAGE/IMAGE_version2/image.pl -prefix 76bp -iteration 1 -all_iteration 10 -dir_prefix iteration > imagetest.txt
When I run the 'image_run_summary.pl' script, I get:
sbaeyen@PXLSEQ:~/Bio/IMAGE/IMAGE_version2/example$ perl /home/sbaeyen/Bio/IMAGE/IMAGE_version2/image_run_summary.pl iteration
The prefix is : iteration
iteration Starting_gaps Gap_closed Gap_extend_oneside Gap_extend_bothside
1 5 0 0 0
2 5 0 0 0
3 5 0 0 0
4 5 0 0 0
5 5 0 0 0
6 5 0 0 0
7 5 0 0 0
8 5 0 0 0
9 5 0 0 0
10 5 0 0 0
Do you have any clue what I need to adapt to get this program running, or what I did wrong?
Best regards and thanks (again) for any advice!
Steve
PS: if you want, I can send you the program output imagetest.txt.
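
One thing that may be worth checking in the setup above (an observation, not something confirmed anywhere in this thread): the three variables in run.sh start with `~/home/sbaeyen/...`. A shell expands a leading `~` to the home directory, so if the home directory is `/home/sbaeyen`, as the prompt suggests, those paths point to a location under `/home/sbaeyen/home/sbaeyen/...`. Likewise, the `image.pl` call is written as `home/sbaeyen/...` without a leading slash, which is relative to the current directory, although that may simply be a transcription slip in the post. The Python sketch below mimics this resolution; the `resolve` helper and the `HOME` and `CWD` values are assumptions for illustration.

```python
import posixpath


def resolve(path, home, cwd):
    """Mimic how a POSIX shell resolves a path: expand a leading '~/'
    to the home directory and interpret relative paths against cwd."""
    if path == "~" or path.startswith("~/"):
        path = home + path[1:]
    if not posixpath.isabs(path):
        path = posixpath.join(cwd, path)
    return posixpath.normpath(path)


HOME = "/home/sbaeyen"  # assumed from the shell prompt in the post
CWD = "/home/sbaeyen/Bio/IMAGE/IMAGE_version2/example"

print(resolve("~/home/sbaeyen/Bio/IMAGE/IMAGE_version2/", HOME, CWD))
print(resolve("home/sbaeyen/Bio/IMAGE/IMAGE_version2/image.pl", HOME, CWD))
```

If the printed locations do not exist, writing the paths either as `/home/sbaeyen/Bio/...` or as `~/Bio/...` would be the thing to try.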

12-14-2011, 07:19 AM #7
boetsie
Senior Member
Location: NL, Leiden
Join Date: Feb 2010
Posts: 245

Hi Steve,

I've tried to run IMAGE too, but did not succeed. The input is very complex and I even had to change the code to get it running, and even then it did not close any gaps. I asked one of the authors but did not get a reply. I would go for GapCloser from SOAP, which is very good but does not include the remaining gaps and seems to join repeated areas. We have finished our tool but are working on a publication; after that it will be released.

Regards,
Boetsie

12-15-2011, 05:14 AM #8
stevebaeyen
Member
Location: Belgium
Join Date: Aug 2011
Posts: 18

Hi Boetsie,
Thanks for the advice to use SOAP's GapCloser! Using the PE reads, I was able to close 161 of 400 gaps (runs of Ns) in the scaffolds. Do you think GapFiller would perform even better?
Regards,
Steve

12-15-2011, 03:09 PM #9
ragowthaman
Member
Location: Seattle, USA
Join Date: Nov 2009
Posts: 12

stevebaeyen: I recently started to use IMAGE2. It seems to work well for me. At least it finished the example and closed gaps. But when it comes to my own genome, it did extend the contig ends but did not close very many gaps. That may be a problem with the data, not with IMAGE2...

Did you make sure velveth, velvetg, smalt, etc. are in your PATH?
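
A quick way to check this is to look each executable up on the PATH. The Python sketch below uses the standard-library `shutil.which`; the helper name is made up, and the tool list just repeats the names mentioned in this post, so adjust it to whatever your IMAGE version actually calls.

```python
import shutil


def missing_tools(names):
    """Return the subset of executable names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Tool names as mentioned in the thread; adjust to your IMAGE version.
required = ["velveth", "velvetg", "smalt"]
missing = missing_tools(required)
print("missing:", ", ".join(missing) if missing else "none")
```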

01-19-2012, 03:41 AM #10
boetsie
Senior Member
Location: NL, Leiden
Join Date: Feb 2010
Posts: 245

Has anyone ever succeeded in running IMAGE on their own data?

I want to run it with my scaffolds, but I'm having trouble creating the input files required by IMAGE. Does anyone have a script to automatically generate these files from the original scaffolds?

Regards,
Boetsie

01-19-2012, 04:13 AM #11
Stegger
Member
Location: Copenhagen
Join Date: Nov 2008
Posts: 21

Hi,
I had similar NGS data on a 3 Mbp bacterial pathogen (PE and MP Illumina data), and at least with an older version of the CLC GW assembler I also got much worse results with a combined assembly, even though I thought adding the MP data would significantly reduce the number of contigs. I have not tried combining these with CLC's new beta assembler, although it works better on my PE Illumina data alone. Have you tried that?
The solution for us was to use Velvet on both datasets, which brought our number of contigs down from approx. 70 to something like 15. These were verified by optical mapping, and we only saw one major error in these Velvet contigs... perhaps it is worth a try?

01-19-2012, 05:08 AM #12
stevebaeyen
Member
Location: Belgium
Join Date: Aug 2011
Posts: 18

Hi Stegger, I tried a de novo assembly of the PE+MP datasets with the new CLC scaffolder but didn't get a big improvement over the PE dataset alone. Scaffolding with SSPACE Premium v2 was successful, giving a reduction of about 70%, and the gaps were closed with SOAP GapCloser. Thanks for the Velvet tip, I'll give it a try! Do you have a good reference on optical mapping?

01-19-2012, 05:39 AM #13
Stegger
Member
Location: Copenhagen
Join Date: Nov 2008
Posts: 21

My pleasure!
And yes, I had a very good reference...

01-19-2012, 06:00 AM #14
stevebaeyen
Member
Location: Belgium
Join Date: Aug 2011
Posts: 18

And can you give me a link to a review article about optical mapping?

07-09-2012, 08:07 AM #15
hylei
Member
Location: Maryland
Join Date: Dec 2010
Posts: 12
How to close the CLC-bio contigs according to the reference genome sequence?

Hi, I used CLC-bio de novo assembly to analyze MiSeq 150 bp PE data, and I have 150 contigs. I also have the 170 kb reference genome sequence. I tried to use IMAGE to close the gaps, but it did not work for me. Can anyone suggest how to close the gaps? Thank you very much.

07-09-2012, 08:09 AM #16
boetsie
Senior Member
Location: NL, Leiden
Join Date: Feb 2010
Posts: 245

I've just posted our new tool called GapFiller, which should be able to do this. See this thread:

http://seqanswers.com/forums/showthread.php?t=21493

Boetsie

02-26-2013, 01:20 AM #17
sivasubramani
Member
Location: India
Join Date: Apr 2011
Posts: 14

I just finished reading the first (parent) post. I have a few questions about where the analysis could have gone in the wrong direction.
1. What k-mer length did you use for the assembly? With such high-coverage data, an inappropriate k-mer length (for example 23 for 72 bp reads) will of course produce a lot of false contigs and send you in the wrong direction.
2. You didn't mention any filtering, contamination check or quality filters. If this were an alignment, you would not need to worry about these factors, but in a de novo assembly with high-coverage data each copy of a read will affect your assembly.

Rather than using the default setup in CLC Workbench, you could try different parameters based on the data, after filtering out contaminated and low-quality reads.

Thanks
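
For the k-mer question, a simple way to avoid relying on a single default is to sweep over several values. The Python sketch below only generates candidate k-mer lengths for a given read length; the function name, the starting value of 21 and the step of 4 are assumptions for illustration, and running the assembler for each value is left out.

```python
def kmer_candidates(read_length, k_min=21, step=4):
    """Return odd k-mer lengths from k_min up to (but below) read_length.

    Odd values are used because some de Bruijn graph assemblers (Velvet,
    for example) require them: an odd k-mer cannot be its own reverse
    complement.
    """
    if k_min % 2 == 0:
        k_min += 1
    if step % 2 == 1:
        step += 1  # an even step keeps every candidate odd
    return list(range(k_min, read_length, step))


print(kmer_candidates(50))  # e.g. for the 50 bp paired-end reads in this thread
```

Each candidate would then be assembled separately and the results compared on contig count and N50.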

02-27-2013, 02:45 AM #18
stevebaeyen
Member
Location: Belgium
Join Date: Aug 2011
Posts: 18
sivasubramani

Thank you for your excellent tips, sivasubramani, but the problem is solved by now. I did indeed use very strict trimming, duplicate removal, etc. before doing the de novo assembly with a k-mer parameter sweep.
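
As an illustration of the duplicate-removal step mentioned here, the Python sketch below keeps one copy of each identical read pair. This is only a minimal sketch and not necessarily how the duplicates were removed in this project; the function name and the toy reads are assumptions.

```python
def dedupe_pairs(pairs):
    """Keep the first occurrence of each (read1, read2) sequence pair,
    preserving the original order."""
    seen = set()
    unique = []
    for r1, r2 in pairs:
        key = (r1, r2)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


pairs = [("ACGT", "TTGA"), ("ACGT", "TTGA"), ("ACGT", "CCGA")]  # toy read pairs
print(dedupe_pairs(pairs))
```

Keying on both mates together means a pair is only dropped when both reads are identical to an earlier pair.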

Tags
illumina hiseq 2000 reads, mate-pair, scaffolding
