SEQanswers

Old 03-14-2012, 12:11 PM   #1
Meligethes
Member
 
Location: Belgium

Join Date: Mar 2012
Posts: 23
[help] de novo genome assembly: beginner

Hi there.

I am completely new to the world of (de novo) genome assembly and I don't know where to begin. When I asked for help at my department they said "go to SEQanswers", so here I am.

I have been given sequencing data from an insect (the colza pollen beetle) and have to produce a genome assembly. The data are Illumina reads in paired-end format.

There are 3 fastq files :
- lane 5/1 : 11 423 167 reads of length 76
- lane 5/2 : 11 423 167 reads of length 76
- lane 7 : 9 294 857 reads of length 152

An average beetle genome size is said to be about 650Mbp.
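
A quick way to put these numbers in perspective is to estimate the sequencing depth they represent. The sketch below is only a back-of-the-envelope calculation: it assumes the 650 Mbp family-level average applies to this species, which is a guess, and the variable and function names are purely illustrative.

```python
# Rough depth estimate from the read counts and lengths listed above.
# Assumption: the genome is about 650 Mbp (an average for beetles, not a measurement).
GENOME_SIZE_BP = 650_000_000

# (number of reads, read length in bp) for each FASTQ file
libraries = {
    "lane 5/1": (11_423_167, 76),
    "lane 5/2": (11_423_167, 76),
    "lane 7": (9_294_857, 152),
}


def total_bases(libs):
    """Sum of reads * read length over all files."""
    return sum(n_reads * length for n_reads, length in libs.values())


def coverage(libs, genome_size_bp):
    """Average number of times each genome position is sequenced."""
    return total_bases(libs) / genome_size_bp


print(f"total bases: {total_bases(libraries):,}")
print(f"estimated coverage: {coverage(libraries, GENOME_SIZE_BP):.1f}x")
```

Under that assumption the three files hold about 3.15 Gbp, or roughly 4.8x coverage before any filtering, which is worth keeping in mind when judging what an assembler can do with them.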

Apparently "we" have a server with 192GB RAM where SOAPdenovo is/will be installed.

I have been told to check the sequence quality first, so after a bit of searching I found FastQC (with a good YouTube tutorial). I have no idea what to do after that.

I am not asking you to do the job for me, and I know I will have a lot of reading and research to do, but I would like to know the main workflow to follow, what to watch out for, which traps to avoid, etc.

Thank you in advance for any kind of help,

M.

(PS: according to the FastQC tutorial, the data quality is quite poor; I can post the output on request.)
Old 03-14-2012, 01:03 PM   #2
twaddlac
Member
 
Location: Pittsburgh, PA

Join Date: Feb 2011
Posts: 49
Default

Hey Meli,

The first thing would be to trim the primers/adapters/barcodes. I do this by mapping the known sequences (primers/adapters/barcodes) to the reads and then trimming them with a Perl script or something similar.
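
To make the trimming step concrete, here is a minimal sketch in Python rather than Perl. It only handles exact matches of a 3' adapter (a full copy anywhere in the read, or the start of the adapter running off the end of the read), so it is far cruder than a real mapping-based approach. The function name, the `min_overlap` parameter and the adapter string are all illustrative assumptions; substitute the sequences supplied by your sequencing facility.

```python
# Placeholder adapter for illustration only; use the real sequences from your facility.
EXAMPLE_ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter using exact matching only (no mismatches allowed)."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    # Case 1: the whole adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the beginning of the adapter fits before the read ends.
    # Try the longest possible partial match first.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


print(trim_adapter("ACGTACGTTTGACC" + EXAMPLE_ADAPTER + "TTT", EXAMPLE_ADAPTER))
print(trim_adapter("ACGTACGTTTGACC" + "AGATCGG", EXAMPLE_ADAPTER))
```

Real reads contain sequencing errors, so a production trimmer needs to tolerate mismatches; this sketch is only meant to show the logic.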

Next would be to get the closest possible reference sequence (if known and/or available) and map your paired reads to it to sort out the good, the bad, and the ugly. If no reference is known, or none is close enough, it may be worthwhile to skip this step.

After that I generally filter my reads based on quality score. Trimming the reads themselves to a shorter length has also produced very good results, so if you're not getting the assemblies you want with the full-length reads, I STRONGLY recommend trying it.
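
As an illustration of what quality filtering and trimming can look like, here is a small sketch. It assumes Phred+33 encoded quality strings (older Illumina pipelines used Phred+64, hence the `offset` parameter), and the thresholds and function names are arbitrary choices for the example, not recommendations.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality]


def trim_low_quality_tail(seq, quality, min_q=20, offset=33):
    """Drop bases from the 3' end until a base with quality >= min_q is reached."""
    scores = phred_scores(quality, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def passes_filter(quality, min_mean_q=20, min_len=30, offset=33):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    scores = phred_scores(quality, offset)
    if not scores or len(scores) < min_len:
        return False
    return sum(scores) / len(scores) >= min_mean_q


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(trim_low_quality_tail("ACGTACGT", "IIIII###"))
```

Dedicated trimmers typically look at a sliding window of qualities rather than one base at a time, which is more robust to a single good base in a bad tail.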

SOAPdenovo is a good program, but there are many others and, as is usually the case, you really have to pick an assembler that fits your data. I would recommend Velvet and ABySS for starters. There are also a lot of good papers about how assemblers perform. Here are some of my favorites:

http://www.ncbi.nlm.nih.gov/pubmed/22147368
http://www.plosone.org/article/info%...l.pone.0031002
http://www.ncbi.nlm.nih.gov/pubmed/20724458

Also, it would be beneficial to install AMOScmp so that you can use its tools to help analyze your assemblies. This technique has a learning curve, but it's so fun! Be patient and don't be scared to ask questions... there's a lot of data out there.

I hope this helps and good luck!
Old 03-14-2012, 01:07 PM   #3
Meligethes
Member
 
Location: Belgium

Join Date: Mar 2012
Posts: 23
Default

Thank you very much. I will take a look at all this tomorrow, because right now I am a bit overwhelmed by it all :/
Old 03-14-2012, 02:46 PM   #4
rahularjun86
Member
 
Location: Frankfurt(M), Germany

Join Date: Jan 2011
Posts: 58
Default

Hi,
1) You can use the Sickle tool (https://github.com/najoshi/sickle) for data preprocessing, then view the data statistics with FastQC or the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/).
2) You can try the Velvet assembler (http://www.ebi.ac.uk/~zerbino/velvet/) with k-mer lengths from 21 to 65 in increments of 2. For the expected coverage you can use "auto" or calculate it with R as explained in the manual, and try coverage cutoffs from 2 to 15. Or try other assemblers such as SOAPdenovo or ABySS.
3) Choose the assembly with the best N50 and other metrics (genome size, largest contig, reads used, number of contigs).
4) Use Minimus2 or Minimus2_blat (http://sourceforge.net/apps/mediawik...etting_Started) for merging assemblies, and Bambus2 or SSPACE (http://www.baseclear.com/landingpages/sspacev12/) for scaffolding. SSPACE is very easy to use, with very simple input options.
5) Check the completeness of the genome using the CEGMA pipeline (http://korflab.ucdavis.edu/Datasets/cegma/).
6) Use RepeatMasker (http://www.repeatmasker.org/) or other tools for repeat element prediction, and AUGUSTUS (http://augustus.gobics.de/) or other tools such as GENSCAN or geneid for gene prediction.
7) Finally, use MUMmer (http://mummer.sourceforge.net/) for comparative analysis.
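
For steps 2 and 3, comparing many assemblies is easier with a small helper. The sketch below lists the k-mer values of the sweep and computes N50 plus the other summary numbers from a list of contig lengths. How you obtain those lengths (for example by parsing each assembler's contig FASTA file) is left out, and the function names are just illustrative.

```python
# k-mer lengths for the sweep described in step 2: 21, 23, ..., 65
KMER_VALUES = list(range(21, 66, 2))


def n50(lengths):
    """Length L such that contigs of length >= L contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def summarize(lengths):
    """Collect the metrics from step 3 that depend only on contig lengths."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "largest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Toy contig lengths, made up for the example.
print(summarize([100, 80, 60, 40, 20]))
```

N50 alone can be misleading (a misassembled chimeric contig raises it), so it is worth looking at it together with the total size and the completeness check from step 5.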

Best Wishes,
Rahul
__________________
Rahul Sharma,
Ph.D
Frankfurt am Main, Germany
Old 03-16-2012, 08:36 AM   #5
Meligethes
Member
 
Location: Belgium

Join Date: Mar 2012
Posts: 23
Default

OK thanks for all this help !

I asked for the primer and adapter sequences so that I can trim them off. (I thought this had already been done when I received the FASTQ files, but the sequence duplication level is so high (92%!!) that I suppose they are still in the reads.)
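
One way to investigate a duplication level like this is to count exact duplicate reads directly and look at the most frequent sequences, since leftover adapters or primers tend to show up there. This is only a sketch: it counts whole-read exact duplicates, which is not necessarily how FastQC arrives at its figure, it holds every distinct sequence in memory, and the file name in the usage comment is hypothetical.

```python
from collections import Counter
from itertools import islice


def duplication_summary(fastq_lines, top=5):
    """Return (fraction of reads that are repeats of an earlier read, most common sequences).

    Assumes plain 4-line FASTQ records, so the sequence is every 4th line
    starting from the second line.
    """
    counts = Counter(line.strip() for line in islice(fastq_lines, 1, None, 4))
    total = sum(counts.values())
    rate = 1 - len(counts) / total if total else 0.0
    return rate, counts.most_common(top)


# Usage with a real file (hypothetical name):
# with open("lane-5-1.fastq") as handle:
#     rate, top_sequences = duplication_summary(handle)

# Tiny made-up example: three copies of ACGT and one TTTT.
toy = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "IIII",
       "@r3", "TTTT", "+", "IIII", "@r4", "ACGT", "+", "IIII"]
print(duplication_summary(toy, top=1))
```

If the top sequences turn out to match the adapter or primer sequences, trimming should bring the duplication level down; if they look like genuine genomic sequence, the cause is more likely in the library itself.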

I have been told to try to find a reference genome close enough to rely on for the assembly.
I am currently searching the NCBI Taxonomy Browser, but I still can't find anything close to any insect.

The software suggested for this kind of assembly:
- Velvet
- Mira
- SOAPdenovo
- Bowtie (?)

I am looking into installing them.
Old 03-16-2012, 10:35 PM   #6
mjp
Member
 
Location: USA

Join Date: Mar 2011
Posts: 25
Default why don't you have a look at wiki

http://seqanswers.com/wiki/How-to/de_novo_assembly
Old 03-17-2012, 03:15 AM   #7
Meligethes
Member
 
Location: Belgium

Join Date: Mar 2012
Posts: 23
Default

+1, thank you
Old 03-17-2012, 11:24 AM   #8
Meligethes
Member
 
Location: Belgium

Join Date: Mar 2012
Posts: 23
Default

I am waiting for the primer and adapter sequences, as well as the access codes for the remote server (a kind of supercomputer: UPPMAX, UPNEXT).

In the meantime I am a bit "stuck": what other kinds of information (apart from the quality analyses provided by the FASTX-Toolkit and FastQC) can I get from my "plain" FASTQ files?

Thanks again for your help.
Old 03-19-2012, 02:19 AM   #9
Meligethes
Member
 
Location: Belgium

Join Date: Mar 2012
Posts: 23
Default

Oops, I just realized I wrote that in French, sorry. Anyway, it was not important.

I just cannot understand why all the reads are EXACTLY the same length (76).
The reads come from lane 5, but I have the files "lane-5-1" and "lane-5-2". Why is this split in two? Because of the paired ends? I mean, is one 5'-3' and the other 3'-5'?
All reads from lane 5-1 and lane 5-2 are the same length, and the numbers of reads are equal...?
Old 03-19-2012, 04:56 AM   #10
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

Illumina (and SOLiD) technology inherently generates reads of exactly the same length, unless you have trimmed them. The machine reads the data in cycles, and each cycle can acquire one and only one base.

If the two files are paired ends, then the identifiers should be the same or very similar (perhaps differing only by a /1 or /2 suffix or the like); look at the first read identifier in each file.
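
The identifier check can also be automated over the whole files rather than just the first read. The sketch below assumes plain 4-line FASTQ records and read names that differ only by a trailing /1 or /2, or by whatever follows the first space; header layouts vary between pipeline versions, so treat `pair_key` as something to adapt. The file names in the usage comment and the read names in the example are made up.

```python
from itertools import islice, zip_longest


def pair_key(header):
    """Read name without the leading '@', anything after a space, or a /1 or /2 suffix."""
    name = header.strip().lstrip("@").split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def headers(fastq_lines):
    """Every 4th line starting from the first is a record header."""
    return islice(fastq_lines, 0, None, 4)


def first_mismatch(lines_1, lines_2):
    """Index of the first record whose names differ (or is missing), else None."""
    pairs = zip_longest(headers(lines_1), headers(lines_2))
    for index, (h1, h2) in enumerate(pairs):
        if h1 is None or h2 is None or pair_key(h1) != pair_key(h2):
            return index
    return None


# Usage with real files (hypothetical names):
# with open("lane-5-1.fastq") as f1, open("lane-5-2.fastq") as f2:
#     print(first_mismatch(f1, f2))

# Made-up records for illustration.
file_1 = ["@x:5:1:10:20#0/1", "ACGT", "+", "IIII", "@x:5:1:11:21#0/1", "ACGT", "+", "IIII"]
file_2 = ["@x:5:1:10:20#0/2", "TTGA", "+", "IIII", "@x:5:1:11:21#0/2", "TTGA", "+", "IIII"]
print(first_mismatch(file_1, file_2))
```

A result of `None` means every record in the first file has a partner with the same name at the same position in the second file, which is what most assemblers expect from paired input.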
Old 03-19-2012, 05:06 AM   #11
Meligethes
Member
 
Location: Belgium

Join Date: Mar 2012
Posts: 23
Default

OK, thank you, I get it. But how does the machine know that "sequence xx" at this position is the same as "sequence xx" at that other position in the other lane??

I searched the internet but it didn't help me with this...
Old 03-19-2012, 01:39 PM   #12
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

Optically -- the system uses high-precision imagery & aligns images between the first read & the second read. Indeed, it takes a set of images for each cycle and must align these to call the bases for a single end.
Old 03-19-2012, 01:44 PM   #13
Meligethes
Member
 
Location: Belgium

Join Date: Mar 2012
Posts: 23
Default

Do you mean that the machine has 2 main phases:
1) only forward cycles at each cluster position
2) only reverse cycles at each cluster position

and then it "aligns" the images, so that the same points come from the same cluster and therefore the same fragment?

Sorry, I feel a bit like an idiot about this, but I really can't figure out how this works, and "because this is the paired-end technique" or "because these are high-end optical lasers" is really not sufficient for me.

Last edited by Meligethes; 03-19-2012 at 01:48 PM.
Old 03-20-2012, 10:36 AM   #14
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

Yes.

The system runs through all of read 1. Then there is a clever molecular biology scheme which flips things around and then read 2 is generated.

http://seqanswers.com/forums/showthread.php?t=21
Old 04-05-2012, 11:38 AM   #15
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Hello,


You may want to try Ray, an easy-to-use distributed assembler.

http://denovoassembler.sf.net
Old 05-02-2012, 02:29 AM   #16
shal
Junior Member
 
Location: china

Join Date: Apr 2012
Posts: 4
Default De novo genome assembly: beginner

Hi folks!!

I too am completely new to NGS, and I am struggling with a few questions:

What determines the amount of data needed for a de novo genome assembly of a particular organism? What determines the insert size during library construction?

How do I decide on important parameters such as coverage, size, accuracy, and sensitivity; library type (fragment or mate-paired?); and read length?


Can anyone please help me?

many thanks in advance!!
shal
Old 05-02-2012, 06:38 AM   #17
Linnea
Member
 
Location: Uppsala, Sweden

Join Date: Mar 2010
Posts: 23
Default

Hi shal,

It is very hard to say in advance how much data you will need for a de novo assembly of a particular organism. It depends not only on the genomic content (especially repeats, polymorphism, GC content) and the assembly software you're using, but also on the properties of the reads you get (for example, their quality and distribution).

I have assembled a 1 Gb genome and got the best result when I used just a subset of my data (less than 30X coverage); for others it works better using 50X or 80X. You can always start with a smaller amount (but probably never below 20X) and then sequence more if you are unsatisfied with the results.

Also, I would say that once you have decided what coverage you would like to have, sequence at least 1.5 times more (or even 2 times more), since you will lose some in the filtering steps (some reads will be duplicates, some will be of too poor quality, etc.).
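
To turn a target coverage into a number of reads to order, the arithmetic is simple enough to sketch. The function below multiplies genome size by target coverage and by the safety margin mentioned above, then divides by the read length. The function name and the example values (a 1 Gb genome, 30X, 100 bp reads) are only illustrative.

```python
import math


def reads_needed(genome_size_bp, target_coverage, read_length, safety_factor=1.5):
    """Number of reads to sequence so that, after losses, the target coverage remains.

    safety_factor=1.5 corresponds to sequencing 1.5 times more than strictly needed.
    """
    bases = genome_size_bp * target_coverage * safety_factor
    return math.ceil(bases / read_length)


# Example: 1 Gb genome, 30X target, 100 bp reads, 1.5x margin.
n_reads = reads_needed(1_000_000_000, 30, 100)
print(f"{n_reads:,} reads, i.e. {n_reads // 2:,} read pairs")
```

For paired-end libraries each fragment yields two reads, so the number of pairs is half the number of reads.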

For the insert size, you should preferably have a mix of short and long libraries. The shorter paired-end libraries (insert <1000 bp) are used for building contigs, and the longer mate-pair libraries are used for joining the contigs into scaffolds. For mate pairs I would say "the longer the better": a longer insert size in mate-pair libraries will certainly give you a larger assembly N50. But it's usually the cost that sets the limit... Note that some assemblers (like ALLPATHS-LG) have specific recommendations for setting up the libraries.

Paired-end/mate-pair reads are better than single-end reads for de novo assembly. With Illumina (I suppose you intend to use this since you chose this forum) the read length doesn't vary much; reads go up to ~150 bp. Most of our libraries were 100 bp (which worked fine); when we tried longer reads, the read quality seemed much poorer in the last 50 bp, so we ended up trimming them anyway.

I'm not sure if I understood your questions regarding sensitivity and accuracy (in reads or assembly?), but hope this helps a bit!

Good luck!
Old 05-03-2012, 01:13 AM   #18
shal
Junior Member
 
Location: china

Join Date: Apr 2012
Posts: 4
Default

Dear Linnea,

Thank you so much for sharing your knowledge and experience. Your reply was helpful for me and it answered my queries.

Thanks again
Shal
Old 05-17-2012, 05:31 PM   #19
Meligethes
Member
 
Location: Belgium

Join Date: Mar 2012
Posts: 23
Default

Hi ! Me again

I was just wondering about all the possible clues that are used for scaffolding. I know we can map contigs against reference genomes or use long paired-end reads, but are there other ways to find things like orientation, order, and distances?