SEQanswers

Old 02-03-2011, 06:01 AM   #1
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default Complete Genomics Releasing 60 Human Genomes

Complete Genomics has announced plans for public release of 60 human genomes -- 40 now and 20 more next month. 55X mean read coverage. 17 are from a single CEPH 3-generation pedigree; there are two trios and the rest are unrelated. Samples include people of northern European, African (African American, Kenya - Maasai, Kenya - Luhya, Yoruban), Chinese, Japanese, Mexican, and Italian ancestry.

Release is via Bionimbus (a cloud service I hadn't heard about before) and on Complete Genomics' website. An open source suite of tools (CGA Tools) will enable access to the data and conversion to other formats (code; quick start; README).
Old 02-03-2011, 05:47 PM   #2
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

I wish CG could release alignments in the BAM format. I am impressed by the accuracy of their variant calls, but for certain things we need alignments as well; for SNPs alone, 1000g is probably a better resource. Personally I am mostly interested in the 3-generation pedigree, but it has not been released yet.
Old 02-07-2011, 04:14 PM   #3
anoopmandaher
Junior Member
 
Location: California

Join Date: Nov 2010
Posts: 5
Default BAM files from CGI alignments

While CGI doesn't provide alignments in BAM format, the CGAtools software package does contain a map2sam tool and an evidence2sam tool, which convert the data to SAM format so that it can be processed by SAMtools.

For example, this command pipeline creates an indexed, reference-sorted BAM file for our evidence mappings:
cgatools evidence2sam \
--beta \
--evidence-dnbs=/path/to/evidenceDnbs-chrN-XXX.tsv.bz2 \
--reference=/path/to/build36.crr | \
samtools view -uS - | \
samtools sort - result && samtools index result.bam

Download CGAtools and related documentation from http://cgatools.sourceforge.net/

Anoop Grewal
Complete Genomics Technical Support
Old 02-08-2011, 09:30 AM   #4
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Many thanks for the reply. I need the whole-genome alignment, not just the alignments around variants. While I can convert alignments, that will take quite a while. Alternatively, do you provide a BED file indicating the regions where SNPs can be called (sorry I have not read through the documentation)?
Old 02-08-2011, 03:29 PM   #5
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by lh3 View Post
Many thanks for the reply. I need the whole-genome alignment, not just the alignments around variants. While I can convert alignments, that will take quite a while. Alternatively, do you provide a BED file indicating the regions where SNPs can be called (sorry I have not read through the documentation)?
It's my impression that the REF files contain the alignments over non-variant positions and the EVIDENCE files contain the de novo assemblies over the variants.

You can use their evidence2sam tool in CGAtools to make BAM files from the EVIDENCE files.

You can use map2sam to make BAM files from the REF files.

(If this is wrong, please correct.)
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Old 02-08-2011, 06:28 PM   #6
Rick Tearle
Junior Member
 
Location: UK

Join Date: Feb 2011
Posts: 1
Default

Quote:
Originally Posted by Michael.James.Clark View Post
It's my impression that the REF files contain the alignments over non-variant positions and the EVIDENCE files contain the de novo assemblies over the variants.

You can use their evidence2sam tool in CGAtools to make BAM files from the EVIDENCE files.

You can use map2sam to make BAM files from the REF files.

(If this is wrong, please correct.)
Michael,

You are correct when you say that the EVIDENCE files contain the de novo assemblies over the variants. However, the alignments over (mostly) non-variant positions can be found in the MAP directory, i.e. all of our reads and their mappings to the reference genome are found there.

But please note that there will be some information in the EVIDENCE files that is missing from the MAP files. For example, where a region in the sample genome contains a 5bp deletion, reads across this region will not initially map to the reference genome, but will be aligned correctly following local de novo assembly.

This is why we provide two tools, map2sam to convert our initial mappings to sam format, and evidence2sam which converts all of the mappings across variant regions to sam format.

Hope this helps,

Rick Tearle
Complete Genomics
Senior Applications Specialist - Europe
Old 02-08-2011, 06:39 PM   #7
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Thanks Rick. I had confused the REF sub-directory with the MAP sub-directory in your file structure.
Old 02-08-2011, 07:13 PM   #8
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,352
Default

Quote:
Originally Posted by anoopmandaher View Post
Anoop Grewal
Complete Genomics Technical Support
Hi Anoop!

Wanted to welcome you to the site, and thank you for offering help. I really like seeing companies engage the community...and I can imagine that everyone appreciates talking to someone inside CGI!
Old 02-09-2011, 05:26 AM   #9
slincoln
Junior Member
 
Location: Seat 6C

Join Date: Feb 2011
Posts: 4
Default

To the question asked by lh3:

> Alternatively, do you provide a BED file indicating the regions where SNPs can be called?

First, some background: CG's local de novo assembly pipeline makes a firm distinction between a called region (whether called homozygous reference or variant) and a region which is "no-called". No-calls can be due either to thin coverage at a spot or to difficulty in accurately calling the region because of, for example, repetitive or low-complexity sequence. "Called" is thus a more stringent metric than "covered", although the two are often confused. FYI: typically we call >95% of each sample's genome, and our minimum spec is 90%.

Now the answer: The masterVar files indicate called vs. no-called regions by genome coordinates. One could easily make a BED track from that file with a short script. The BED track would not just indicate SNP callability (short indels and subs are included in the masterVar files as calls as well) but that sounds close to what you want.

There are a few complexities you may wish to consider in how you count: At some sites the assembler can determine partial information (for example an allele sequence containing some N's) and we do report that result, although it is flagged as a no-call in the interest of being conservative. Similarly at some sites we may determine one but not both of the diploid alleles, which we flag as a "half-call".
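
A minimal sketch of the kind of short script described above, for a POSIX shell with awk. It assumes a layout that this thread does not spell out: `#` metadata lines, a header row beginning with `>`, columns named `chromosome`, `begin`, `end` and `zygosity`, 0-based half-open coordinates, and `no-call` / `half` as the zygosity values for no-calls and half-calls. The function name `mastervar_to_bed` is made up for illustration, so check the column names against the CG data format documentation before relying on it.

```shell
# Hypothetical helper: read a decompressed masterVar-style TSV on stdin and
# print merged BED intervals covering loci that are fully called.
# Assumed (not confirmed here): header row starts with ">", columns are named
# chromosome/begin/end/zygosity, coordinates are 0-based half-open.
mastervar_to_bed() {
  awk -F'\t' -v OFS='\t' '
    /^#/ || /^[[:space:]]*$/ { next }            # skip metadata and blank lines
    /^>/ {                                        # header row: map names to columns
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    !("zygosity" in col) { next }                 # no header seen yet: ignore row
    {
      z = $(col["zygosity"])
      if (z == "no-call" || z == "half") next     # keep only fully called loci
      c = $(col["chromosome"]); b = $(col["begin"]) + 0; e = $(col["end"]) + 0
      if (e <= b) next                            # zero-length loci add no bases
      if (have && c == pc && b <= pe) {           # touches previous interval: extend
        if (e > pe) pe = e
      } else {
        if (have) print pc, pb, pe
        pc = c; pb = b; pe = e; have = 1
      }
    }
    END { if (have) print pc, pb, pe }
  '
}
```

Usage would be along the lines of `bzcat masterVarBeta-XXX.tsv.bz2 | mastervar_to_bed > called.bed`, where the file name is a placeholder. Half-calls are dropped here to stay conservative; counting them as called is a one-line change.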
__________________
Steve Lincoln
VP, Scientific Applications
Complete Genomics, Inc.
Old 02-09-2011, 07:55 AM   #10
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

CGers: Do your tools have, or have you considered adding, the ability to access data on remote HTTP/FTP sites the way SAMtools can? This is a useful feature for folks focused on particular regions of the genome who might not want to slurp the entire data structures.

Also, I haven't looked at your data but was curious how you handle simple tandem repeats that cannot be resolved given your technology? Is there a marker in the assembly to note ambiguity in repeat array lengths?
Old 02-09-2011, 09:24 AM   #11
slincoln
Junior Member
 
Location: Seat 6C

Join Date: Feb 2011
Posts: 4
Default

> Does your tool have, or have you considered adding, the ability to access data on remote HTTP/FTP sites the way SAMTools can? This is a useful feature for folks focused on particular regions of the genome who might not want to slurp the entire data structures.

Yup. The CGA Tools for genome-genome comparisons operate on ~40GB per genome assembly results (and often on ~1GB/genome variation files), so it's less of an issue in that case.

Hosting genome-wide BAMs via HTTP is an idea we have thought about. For the public data we might need to find a partner or two who would be able to do that (any volunteers? email us!). For customer data this is a feature request we're looking into. Obviously security is a big concern in that case.

SAM/BAM is a great thing, but one of the challenges is that CG data do not map perfectly into it. The format has some limitations, not just for our read structure but also for the semantics of our mapping and assembly pipeline results. Also BAM files tend to be much larger than the CG native bz2 files. Thus, BAM is very useful for visualizing CG data and for doing some computations on, but these limitations make BAM not as useful for other purposes with CG data, so we can't quite use it as our native format in its current form. That said, we continue to work on this in collaboration with outside groups who use BAM more heavily than we do.

- Steve L
Old 02-09-2011, 09:31 AM   #12
slincoln
Junior Member
 
Location: Seat 6C

Join Date: Feb 2011
Posts: 4
Default

Quote:
Also, I haven't looked at your data but was curious how you handle simple tandem repeats that cannot be resolved given your technology? Is there a marker in the assembly to note ambiguity in repeat array lengths?
That would be call vs. no-call.

Some no-calls are length known but the specific bases are not (N's).

Others have ? in the allele sequence, which means that we don't know the exact length. Unfortunately we don't presently distinguish the case of (say) +1 bp from +much_more_than_that.
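
As an illustration only, the three cases described above (fully called, length known but bases unknown, length unknown) could be told apart from an allele sequence string like this. The function name and the labels it prints are made up for this sketch; only the meaning of `N` and `?` comes from the post.

```shell
# Hypothetical helper: classify one allele sequence string.
#   contains "?"  -> exact length of the allele is not known
#   contains "N"  -> length is known, but some bases are not
#   otherwise     -> fully specified sequence (an empty string counts as called)
classify_allele() {
  case "$1" in
    *\?*)   echo "length-unknown" ;;
    *[Nn]*) echo "bases-unknown" ;;
    *)      echo "called" ;;
  esac
}
```

For example, `classify_allele "ACNNT"` prints `bases-unknown`, and `classify_allele "AC?"` prints `length-unknown`.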
Old 02-09-2011, 09:01 PM   #13
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Thanks a lot slincoln. The masterVarBeta file seems to be what I want. On the other hand, in my experience, I think the callable region from CG is overestimated. The evidence is that the heterozygosity (#hets/#callable) from CG is lower than other estimates inferred in various ways. That is why I prefer to use alignments. Nonetheless, this is probably only important to me, and not a big issue for you.
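
For concreteness, a rough sketch of how the #hets/#callable ratio mentioned above might be tallied from a masterVar-style file. This is not necessarily how the estimate in this post was made. It assumes columns named `begin`, `end`, `zygosity` and `varType`, that heterozygous loci are labelled `het-ref` or `het-alt`, that SNPs have varType `snp`, and that no-calls and half-calls are labelled `no-call` and `half`; all of that should be checked against the CG format documentation. The function name `het_rate` is hypothetical.

```shell
# Hypothetical helper: print "<het SNP loci>\t<called bases>\t<ratio>" from a
# decompressed masterVar-style TSV on stdin. Column names and zygosity labels
# are assumptions, not taken from this thread.
het_rate() {
  awk -F'\t' '
    /^#/ || /^[[:space:]]*$/ { next }                     # metadata and blank lines
    /^>/ { for (i = 1; i <= NF; i++) col[$i] = i; next }  # header: names to columns
    !("zygosity" in col) { next }                         # ignore rows before header
    {
      z = $(col["zygosity"])
      if (z == "no-call" || z == "half") next             # not counted as callable
      callable += $(col["end"]) - $(col["begin"])         # called bases at this locus
      if ((z == "het-ref" || z == "het-alt") && $(col["varType"]) == "snp") hets++
    }
    END {
      ratio = (callable > 0) ? hets / callable : 0
      printf "%d\t%d\t%.6g\n", hets, callable, ratio
    }
  '
}
```

If the callable denominator is inflated by regions that are called but not truly resolvable, this ratio drops, which is the effect described in the post.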

As for SAM, I agree that for internal use, specialized formats are easier. But when the data are released to the public domain, conforming to a standard would make things much easier for users (at least for me). The same might be true for variants. In my opinion, releasing a BED+VCF pair would be friendlier to us.

On a more technical note, I am a little suspicious that CG's alignment files can be "much" smaller than BAM if you do not duplicate sequences/qualities for multiple hits (though I agree it is harder to get the read structure from SAM). Also, bzip2 helps the compression ratio, but on decompression gzip is >7X faster than bzip2, which is the single reason why BAM adopts gzip/zlib.

If policies permit, perhaps you could consider dumping the variants to UCSC/Ensembl (they probably do not host alignments).
Old 02-10-2011, 07:34 AM   #14
slincoln
Junior Member
 
Location: Seat 6C

Join Date: Feb 2011
Posts: 4
Default

Quote:
Thanks a lot slincoln.
No worries. We are happy to help.

You had many good points in your post so this may be a multi-part reply. Here's a quick start:

Quote:
On the other hand, in my experience, I think the callable region from CG is overestimated. The evidence is that the heterozygosity (#hets/#callable) from CG is lower than other estimates inferred in various ways.
Well, that doesn't sound entirely consistent with some other data we and our users have, but of course the devil is always in the details on such comparisons. Drilling into it is not the easiest conversation to have in a bulletin-board format, unfortunately, so feel free to contact us at [email protected] and we'd be happy to set up a phone call to trade results and observations back and forth with you. Obviously any input you have which would help us improve is always welcome.

Certainly you are correct in at least two important senses:

(A) Our calls are made at moderately stringent thresholds chosen to provide high accuracy yet retain sensitivity. Depending on the application, one of course may wish to be more stringent or apply additional filters to shift the FP/FN (or more properly, accuracy/no-call) trade off, either broadly or on a case-by-case (say, variant-type) basis. Good methodologies for doing so change considerably, and become far more powerful, in genome-genome comparisons as opposed to single-genome analysis*. We know a fair bit about how to do this on our data, so feel free to contact us for more info. However again external eyes are always valuable.

* Translation for those less familiar with CG data: Look into using the referenceScore as a measure of confidence in calls of homozygous reference. And please consider using CGA Tools (cgatools.sourceforge.net) for comparisons.

(B) As you well know, as one learns how to make calls increasingly accurately, the distribution of remaining errors can prove increasingly "interesting" to look into. For example Roach et al (Science 2010) detected 535 regions in that family of four's genomes which contained a very disproportionate fraction of the Mendelian errors. Upon investigation these regions include cases like undetected hemizygous deletions (falsely called as homozygotes) and larger, highly-conserved duplications in the sample vs. reference (presumably causing mis-mapping which our de novo assembler was not able to rectify or detect). Some of this has improved since with newer assembly algorithms and newer genome builds, and as well we've since added CNV and SV analysis which can provide more information in some cases. Nevertheless the basic notion still applies and these kinds of regions you might wish to consider not callable for some purposes as you suggest. The $64,000 question is how conserved these regions are between individuals, ethnicities, cell-lines vs. bloods, technologies, etc. We know a few parts of that story but certainly not all.

Quote:
That is why I prefer to use alignments. Nonetheless, this is probably only important to me, and not a big issue for you.
Indeed that's why we provide them.

Just always keep in mind the distinction between the rough initial mappings and the more refined (however localized) de novo assemblies in CG BAMs.

Also remember that our mapper has some different behaviors than MAQ or BWA, particularly when reads can map to multiple locations.

- Steve L

Last edited by slincoln; 02-10-2011 at 08:24 AM.
All times are GMT -8.