SEQanswers

Old 10-30-2014, 06:54 AM   #1
bioman1
Member
 
Location: US

Join Date: May 2012
Posts: 80
Default Making sense of low coverage plant genome

In our lab we have de novo assembled WGS data from a non-model plant (one paired-end library, 2x101 bp, insert size 240 bp), giving an N50 of around 1 kbp. The estimated genome size is around 2 Gb. Through read mapping we found coverage of around 5x. I would like ideas on how to make a publication possible with this data.

I have some ideas in mind for making use of this low-coverage genome:

1. Calling variants: finding SNPs, heterozygosity and homozygosity (samtools, GATK)
2. Finding microsatellites (MISA etc.)
3. Finding repeats using RepeatMasker
4. Extracting and assembling the mitochondrial and chloroplast genomes

Please share any ideas or related papers that could help make use of this low-coverage genome.
Old 10-30-2014, 08:10 AM   #2
scami
Member
 
Location: italy

Join Date: Sep 2010
Posts: 55
Default

Hi bioman1,

If I were you I would proceed with annotation first, that is, finding the coding sequences in your assembled genome. The number of genes found will give you an idea of how good your assembly is. You can then run a gene ontology analysis on those genes.
Old 10-30-2014, 08:27 AM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

It does not make a lot of sense to me to try to publish (or spend lots of time on) such a low-coverage assembly. It would be much more cost-effective and useful to the rest of the world if you generated more coverage, and hence a better assembly, before going forward with further analysis.
Old 10-30-2014, 08:32 AM   #4
scami
Member
 
Location: italy

Join Date: Sep 2010
Posts: 55
Default

I agree with Brian, actually. I would not trust SNPs and indels called at such low coverage in the absence of a reference genome.
Old 10-31-2014, 12:43 AM   #5
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

In your original post on the quality metrics of your assembly (http://seqanswers.com/forums/showthread.php?t=45673) we already discussed that your data is not good enough for publication. If the backbone of your analysis (i.e. the genome reference) is not in adequate shape, how can any downstream analysis (#1-3) be?
You might have sufficient coverage to assemble the mitochondrial or chloroplast genome, but unless they are extremely unusual, I doubt that this alone will suffice for a publication.
Old 10-31-2014, 09:31 AM   #6
SNPsaurus
Registered Vendor
 
Location: Eugene, OR

Join Date: May 2013
Posts: 521
Default

You would not be able to call heterozygosity with any accuracy. Think of a region with 5X read depth (your average). This means you are sampling the two chromosomes (if diploid) with 5 reads. What is the chance of never sampling one particular chromosome? It is 0.5^5, or about 3%, so there is about a 6% chance of missing one or the other. You also couldn't call a SNP from just 1 read, and one of the two chromosomes would get only 1X coverage about 30% of the time.

At 3X coverage you miss a chromosome 25% of the time (2 x 0.5^3), and the best case is that one chromosome gets 1 read and the other 2, so you would never be able to call a SNP.
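
The arithmetic above can be sketched in a few lines of Python. This is only a rough illustration, assuming a diploid site where each read independently samples either homologous chromosome with probability 0.5 at a fixed depth (a simple binomial model); the function names are made up for this example and do not come from any library.

```python
from math import comb


def prob_reads_from_one_chromosome(depth: int, k: int) -> float:
    """P(exactly k of `depth` reads come from one particular chromosome),
    assuming each read picks either homolog with probability 0.5."""
    return comb(depth, k) * 0.5 ** depth


def prob_missing_either(depth: int) -> float:
    """P(all reads come from the same chromosome, so one homolog is unseen)."""
    return 2 * 0.5 ** depth


for depth in (5, 3):
    miss_one = prob_reads_from_one_chromosome(depth, 0)
    miss_any = prob_missing_either(depth)
    # Exactly one read on one chromosome or on the other (disjoint for depth > 2).
    single_read = 2 * prob_reads_from_one_chromosome(depth, 1)
    print(f"{depth}X: miss a given chromosome {miss_one:.1%}, "
          f"miss either {miss_any:.1%}, "
          f"one chromosome covered by a single read {single_read:.1%}")
```

Under this model the 5X figures come out at about 3%, 6% and 31%, and the chance of missing a chromosome at 3X is 25%. Real depth varies around the average, so the true picture at a 5X mean is somewhat worse than the fixed-depth case.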
__________________
Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
Old 10-31-2014, 10:59 AM   #7
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Quote:
Originally Posted by bioman1 View Post
The estimated genome size is around 2 Gb. Through read mapping we found coverage around 5x.
If you know the genome size, a more accurate estimate of coverage could be obtained by simply summing the total number of bases sequenced and dividing by the genome size, rather than trying to infer it from mapping. This is really a minor point but it may make some difference.
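
As a minimal sketch of that calculation in Python (the helper names and the read-pair count below are hypothetical, chosen only to illustrate the arithmetic; the 2x101 bp read length and roughly 2 Gb genome size are the figures from the original post):

```python
def total_bases_in_fastq(lines):
    """Sum the sequence lengths in an iterable of FASTQ lines.

    Assumes the plain 4-line-per-record layout, where the sequence
    is the second line of each record.
    """
    return sum(len(line.strip()) for i, line in enumerate(lines) if i % 4 == 1)


def estimate_coverage(total_bases, genome_size):
    """Average coverage = total sequenced bases / estimated genome size."""
    return total_bases / genome_size


# Hypothetical example: 50 million read pairs of 2x101 bp on a 2 Gb genome.
read_pairs = 50_000_000  # assumed value, not from the thread
total = read_pairs * 2 * 101
print(f"Estimated coverage: {estimate_coverage(total, 2_000_000_000):.2f}x")
```

For real data, `total_bases_in_fastq` could be applied to each open FASTQ file handle and the results summed before dividing by the genome size.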

As others have said, anything close to 5X is way too low for producing an assembly but you still have plenty of data for exploring a number of interesting questions. For example, you have more than enough coverage for assembling the organelle genomes and for describing repeat properties in the genome (I can offer specific advice for each of these tasks if that is of interest).
Old 10-31-2014, 10:49 PM   #8
bioman1
Member
 
Location: US

Join Date: May 2012
Posts: 80
Default

Thank you all for the suggestions. We have budget constraints; we can apply for further funding once we have one publication from the available data. I am open to any kind of suggestion.
SES, please let me know your advice regarding organelle genome assembly and repeat identification.
Old 11-03-2014, 09:23 AM   #9
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Quote:
Originally Posted by bioman1 View Post
Thank you all for the suggestions. We have budget constraints; we can apply for further funding once we have one publication from the available data. I am open to any kind of suggestion.
SES, please let me know your advice regarding organelle genome assembly and repeat identification.
I recommend trying Chloro for chloroplast genome assembly; the same program can be used for mitochondrial genomes given a database (just a FASTA file) of mitochondrial genomes to screen against. Transposome is a tool for identifying repeats from WGS reads, so the input would be your unassembled sequence reads. Please let me know if you have questions about either tool; email or private message would probably be more appropriate, since this is getting a bit off the topic of this thread.
Old 11-03-2014, 12:17 PM   #10
bioman1
Member
 
Location: US

Join Date: May 2012
Posts: 80
Default

Thanks, SES. I will try them and will contact you if I run into any difficulties.

Tags
bioinformatics, bioinformatics analysis, genome analyzer

All times are GMT -8.