
  • Complete Genomics Releasing 60 Human Genomes

    Complete Genomics has announced plans for the public release of 60 human genomes: 40 now and 20 more next month, at 55X mean read coverage. 17 are from a single CEPH 3-generation pedigree; there are two trios, and the rest are unrelated. Samples include people of northern European, African (African American, Kenya - Maasai, Kenya - Luhya, Yoruban), Chinese, Japanese, Mexican, and Italian ancestry.

    Release is via Bionimbus (a cloud service I hadn't heard about before) and on Complete Genomics' website. An open source suite of tools (CGA Tools) will enable access to the data and conversion to other formats (code; quick start; README).

  • #2
    I wish CG could release alignments in the BAM format. I am impressed by the accuracy of their variant calls, but for certain things we need alignments as well; for SNPs alone, 1000g is probably a better resource. Personally I am mostly interested in the 3-generation pedigree, but it has not been released yet.



    • #3
      BAM files from CGI alignments

      While CGI doesn't provide alignments in BAM format, the CGAtools software package does contain a map2sam tool and an evidence2sam tool, which convert the data to SAM format so that it can be processed by SAMtools.

      For example, this command pipeline creates an indexed, reference-sorted BAM file for our evidence mappings:
      cgatools evidence2sam \
      --beta \
      --evidence-dnbs=/path/to/evidenceDnbs-chrN-XXX.tsv.bz2 \
      --reference=/path/to/build36.crr | \
      samtools view -uS - | \
      samtools sort - result && samtools index result.bam

      Download CGAtools and related documentation from http://cgatools.sourceforge.net/

      Anoop Grewal
      Complete Genomics Technical Support



      • #4
        Many thanks for the reply. I need the whole-genome alignment, not just the alignments around variants. While I can convert alignments, that will take quite a while. Alternatively, do you provide a BED file indicating the regions where SNPs can be called (sorry I have not read through the documentation)?



        • #5
          Originally posted by lh3
          > Many thanks for the reply. I need the whole-genome alignment, not just the alignments around variants. While I can convert alignments, that will take quite a while. Alternatively, do you provide a BED file indicating the regions where SNPs can be called (sorry I have not read through the documentation)?
          It's my impression that the REF files contain the alignments over non-variant positions and the EVIDENCE files contain the de novo assemblies over the variants.

          You can use their evidence2sam tool in CGAtools to make BAM files from the EVIDENCE files.

          You can use map2sam to make BAM files from the REF files.

          (If this is wrong, please correct.)
          Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
          Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
          Projects: U87MG whole genome sequence [Website] [Paper]



          • #6
            Originally posted by Michael.James.Clark
            > It's my impression that the REF files contain the alignments over non-variant positions and the EVIDENCE files contain the de novo assemblies over the variants.
            >
            > You can use their evidence2sam tool in CGAtools to make BAM files from the EVIDENCE files.
            >
            > You can use map2sam to make BAM files from the REF files.
            >
            > (If this is wrong, please correct.)
            Michael,

            You are correct when you say that the EVIDENCE files contain the de novo assemblies over the variants. However, the alignments over (mostly) non-variant positions can be found in the MAP directory, i.e., all of our reads and their mappings to the reference genome are found there.

            But please note that there will be some information in the EVIDENCE files that is missing from the MAP files. For example, where a region in the sample genome contains a 5bp deletion, reads across this region will not initially map to the reference genome, but will be aligned correctly following local de novo assembly.

            This is why we provide two tools: map2sam, which converts our initial mappings to SAM format, and evidence2sam, which converts all of the mappings across variant regions to SAM format.

            Hope this helps,

            Rick Tearle
            Complete Genomics
            Senior Applications Specialist - Europe



            • #7
              Thanks Rick. I had confused the REF sub-directory with the MAP sub-directory in your file structure.
              Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
              Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
              Projects: U87MG whole genome sequence [Website] [Paper]



              • #8
                Originally posted by anoopmandaher
                > Anoop Grewal
                > Complete Genomics Technical Support
                Hi Anoop!

                Wanted to welcome you to the site, and thank you for offering help. I really like seeing companies engage the community...and I can imagine that everyone appreciates talking to someone inside CGI!



                • #9
                  To the question asked by lh3:

                  > Alternatively, do you provide a BED file indicating the regions where SNPs can be called?

                  First, some background: CG's local de novo assembly pipeline makes a firm distinction between a called region (whether called homozygous reference or variant) and a region that is "no-called". No-calls can be due either to thin coverage at a spot or to difficulty in accurately calling the region because of (for example) repetitive or low-complexity sequence. "Called" is thus a more stringent metric than "covered", although the two are often confused. FYI: we typically call >95% of each sample's genome, and our minimum spec is 90%.

                  Now the answer: The masterVar files indicate called vs. no-called regions by genome coordinates. One could easily make a BED track from that file with a short script. The BED track would not just indicate SNP callability (short indels and subs are included in the masterVar files as calls as well) but that sounds close to what you want.

                  There are a few complexities you may wish to consider in how you count: At some sites the assembler can determine partial information (for example an allele sequence containing some N's) and we do report that result, although it is flagged as a no-call in the interest of being conservative. Similarly at some sites we may determine one but not both of the diploid alleles, which we flag as a "half-call".
                  Steve Lincoln
                  VP, Scientific Applications
                  Complete Genomics, Inc.
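
Editor's note on the "short script" mentioned in #9: the following is a minimal sketch, not CG's own tool. It assumes a tab-separated masterVar file whose header line starts with ">" and names the columns `chromosome`, `begin`, `end`, and `zygosity`; that no-called loci carry the zygosity value `no-call` and half-called loci carry `half`; that rows are sorted by position; and that coordinates are 0-based, half-open as in BED. All of these are assumptions to check against the file format documentation for your data release. The sample rows are made up for illustration.

```python
def called_regions(lines, count_half_calls=False):
    """Yield merged (chrom, begin, end) intervals covering called loci.

    Assumes rows are sorted by chromosome and begin coordinate.
    Half-calls are treated as not called unless count_half_calls is True.
    """
    skipped = {"no-call"} if count_half_calls else {"no-call", "half"}
    columns = None
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines and metadata lines
        if line.startswith(">"):
            columns = line[1:].split("\t")  # assumed header convention
            continue
        if columns is None:
            raise ValueError("data row seen before the header line")
        row = dict(zip(columns, line.split("\t")))
        if row["zygosity"] in skipped:
            continue
        chrom, begin, end = row["chromosome"], int(row["begin"]), int(row["end"])
        if current and current[0] == chrom and begin <= current[2]:
            # Overlapping or abutting locus: extend the open interval.
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                yield current
            current = (chrom, begin, end)
    if current:
        yield current


def to_bed(regions):
    """Format intervals as BED text (chrom, start, end)."""
    return "".join(f"{c}\t{b}\t{e}\n" for c, b, e in regions)


# Made-up example rows, for illustration only.
EXAMPLE = (
    ">locus\tploidy\tchromosome\tbegin\tend\tzygosity\n"
    "1\t2\tchr1\t0\t100\tno-call\n"
    "2\t2\tchr1\t100\t250\thom\n"
    "3\t2\tchr1\t250\t251\thet-ref\n"
    "4\t2\tchr1\t251\t300\thalf\n"
    "5\t2\tchr1\t300\t400\thom\n"
)

strict = list(called_regions(EXAMPLE.splitlines()))
lenient = list(called_regions(EXAMPLE.splitlines(), count_half_calls=True))
print(to_bed(strict), end="")
```

The `count_half_calls` switch reflects the counting choice raised in the post: whether a half-call should count toward the callable region depends on the analysis.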



                  • #10
                    CGers: Do your tools have, or have you considered adding, the ability to access data on remote HTTP/FTP sites the way SAMTools can? This is a useful feature for folks focused on particular regions of the genome who might not want to slurp the entire data structures.

                    Also, I haven't looked at your data but was curious how you handle simple tandem repeats that cannot be resolved given your technology? Is there a marker in the assembly to note ambiguity in repeat array lengths?



                    • #11
                      > Does your tool have, or have you considered adding, the ability to access data on remote HTTP/FTP sites the way SAMTools can? This is a useful feature for folks focused on particular regions of the genome who might not want to slurp the entire data structures.

                      Yup. The CGA Tools for genome-genome comparisons operate on ~40GB per genome assembly results (and often on ~1GB/genome variation files), so it's less of an issue in that case.

                      Hosting genome-wide BAMs via HTTP is an idea we have thought about. For the public data we might need to find a partner or two who would be able to do that (any volunteers? email us!). For customer data this is a feature request we're looking into. Obviously security is a big concern in that case.

                      SAM/BAM is a great thing, but one of the challenges is that CG data do not map perfectly into it. The format has some limitations, not just for our read structure but also for the semantics of our mapping and assembly pipeline results. Also BAM files tend to be much larger than the CG native bz2 files. Thus, BAM is very useful for visualizing CG data and for doing some computations on, but these limitations make BAM not as useful for other purposes with CG data, so we can't quite use it as our native format in its current form. That said, we continue to work on this in collaboration with outside groups who use BAM more heavily than we do.

                      - Steve L
                      Steve Lincoln
                      VP, Scientific Applications
                      Complete Genomics, Inc.



                      • #12
                        > Also, I haven't looked at your data but was curious how you handle simple tandem repeats that cannot be resolved given your technology? Is there a marker in the assembly to note ambiguity in repeat array lengths?
                        That would be call vs. no-call.

                        Some no-calls are length known but the specific bases are not (N's).

                        Others have ? in the allele sequence, which means that we don't know the exact length. Unfortunately we don't presently distinguish the case of (say) +1 bp from +much_more_than_that.
                        Steve Lincoln
                        VP, Scientific Applications
                        Complete Genomics, Inc.
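
Editor's note on #12: the three cases described there can be told apart with a small helper like the sketch below. It relies only on the two markers named in the post (`N` for unknown bases of known length, `?` for unknown length); the function name and the category labels are hypothetical, chosen here for illustration.

```python
def classify_allele(seq):
    """Classify an allele sequence string by how completely it was called.

    "?" anywhere means the length is unknown; otherwise "N" means the
    length is known but some bases are not; otherwise it is fully called.
    """
    if "?" in seq:
        return "no-call, length unknown"
    if "N" in seq.upper():
        return "no-call, length known"
    return "called"


for allele in ("ACGT", "ACNNT", "AC?"):
    print(allele, "->", classify_allele(allele))
```

As the post notes, a `?` does not say how much sequence is missing, so this classification cannot separate a one-base uncertainty from a much longer one.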



                        • #13
                          Thanks a lot slincoln. The masterVarBeta file seems to be what I want. On the other hand, in my experience, the callable region from CG is overestimated. The evidence is that the heterozygosity (#hets/#callable) from CG is lower than other estimates inferred in various ways. That is why I prefer to use alignments. Nonetheless, this is probably only important for me, not a big issue for you.
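
Editor's note: the argument above rests on simple arithmetic, sketched below. If the callable region is overestimated while the number of heterozygous sites stays fixed, the ratio #hets/#callable comes out too low. The counts used here are invented round numbers for illustration, not measurements from any CG genome.

```python
def heterozygosity(n_het_sites, n_callable_bases):
    """Return heterozygous sites per callable base (#hets / #callable)."""
    if n_callable_bases <= 0:
        raise ValueError("callable region must be positive")
    return n_het_sites / n_callable_bases


# Invented numbers: same het count, two different callable-region sizes.
n_hets = 2_000_000
reported_callable = 2_700_000_000   # hypothetical reported callable bases
stricter_callable = 2_500_000_000   # hypothetical stricter estimate

h_reported = heterozygosity(n_hets, reported_callable)
h_stricter = heterozygosity(n_hets, stricter_callable)
print(f"reported: {h_reported:.2e}  stricter: {h_stricter:.2e}")
```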

                          As for SAM, I agree that for internal use, specialized formats are easier. But when the data are released to the public domain, conforming to a standard would make things much easier for users (at least for me). The same might be true for variants. In my opinion, releasing a BED+VCF pair would be friendlier to us.

                          On a more technical note, I am a little suspicious that CG's alignment file can be "much" smaller than BAM if you do not duplicate sequences/qualities for multiple hits (though I agree it is harder to get the read structure from SAM). Also, bzip2 helps the compression ratio, but on decompression gzip is >7X faster than bzip2, which is the single reason why BAM adopts gzip/zlib.
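
Editor's note: the gzip versus bzip2 trade-off can be checked on your own machine with a rough timing sketch like the one below. It uses only the Python standard library and a synthetic payload of random read-like lines, which is an assumption made for convenience; real alignment files compress differently, so the ratio you see will not necessarily match the figure quoted above.

```python
import bz2
import gzip
import random
import time


def make_payload(n_reads=5000, read_len=70, seed=0):
    """Build synthetic tab-separated 'read' lines (illustrative data only)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_reads):
        seq = "".join(rng.choices("ACGT", k=read_len))
        lines.append(f"read{i}\t{seq}\n")
    return "".join(lines).encode("ascii")


def best_time(func, blob, repeats=3):
    """Return the fastest wall-clock time of func(blob) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(blob)
        best = min(best, time.perf_counter() - start)
    return best


payload = make_payload()
gz_blob = gzip.compress(payload)
bz_blob = bz2.compress(payload)

gz_seconds = best_time(gzip.decompress, gz_blob)
bz_seconds = best_time(bz2.decompress, bz_blob)

print(f"compressed size: gzip {len(gz_blob)} bytes, bzip2 {len(bz_blob)} bytes")
print(f"decompression:   gzip {gz_seconds:.4f} s, bzip2 {bz_seconds:.4f} s")
```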

                          If policies permit, perhaps you could consider depositing the variants with UCSC/Ensembl (they probably do not host alignments).



                          • #14
                            > Thanks a lot slincoln.
                            No worries. We are happy to help.

                            You had many good points in your post so this may be a multi-part reply. Here's a quick start:

                            > On the other hand, in my experiences, I think the callable region from CG is overestimated. The evidence is the heterozygosity (#hets/#callable) from CG is lower than other estimates inferred in various ways.
                            Well, that doesn't sound entirely consistent with some other data we and our users have, but of course the devil is always in the details on such comparisons. Unfortunately, drilling into it is not the easiest conversation to have in a bulletin-board format, so feel free to contact us at [email protected] and we'd be happy to set up a phone call to trade results and observations back and forth with you. Obviously any input you have which would help us improve is always welcome.

                            Certainly you are correct in at least two important senses:

                            (A) Our calls are made at moderately stringent thresholds chosen to provide high accuracy yet retain sensitivity. Depending on the application, one may of course wish to be more stringent or apply additional filters to shift the FP/FN (or more properly, accuracy/no-call) trade-off, either broadly or on a case-by-case (say, variant-type) basis. Good methodologies for doing so change considerably, and become far more powerful, in genome-genome comparisons as opposed to single-genome analysis*. We know a fair bit about how to do this on our data, so feel free to contact us for more info. However, again, external eyes are always valuable.

                            * Translation for those less familiar with CG data: Look into using the referenceScore as a measure of confidence in calls of homozygous reference. And please consider using CGA Tools (cgatools.sourceforge.net) for comparisons.

                            (B) As you well know, as one learns how to make calls increasingly accurately, the distribution of remaining errors can prove increasingly "interesting" to look into. For example, Roach et al. (Science 2010) detected 535 regions in that family of four's genomes which contained a very disproportionate fraction of the Mendelian errors. Upon investigation these regions include cases like undetected hemizygous deletions (falsely called as homozygotes) and larger, highly-conserved duplications in the sample vs. reference (presumably causing mis-mapping which our de novo assembler was not able to rectify or detect). Some of this has since improved with newer assembly algorithms and newer genome builds, and we've also since added CNV and SV analysis, which can provide more information in some cases. Nevertheless the basic notion still applies, and you might wish to consider these kinds of regions not callable for some purposes, as you suggest. The $64,000 question is how conserved these regions are between individuals, ethnicities, cell-lines vs. bloods, technologies, etc. We know a few parts of that story but certainly not all.
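
Editor's note: for readers unfamiliar with the term, a Mendelian error is a child genotype that cannot be formed by taking one allele from each parent. The sketch below is a generic check for a single autosomal diploid site, written for illustration; it is not the method used in the cited paper, and the `-` symbol for a deleted allele is a convention assumed here.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """Return True if the child's genotype can be formed by taking one
    allele from the father and one from the mother.

    Genotypes are 2-tuples of allele strings at an autosomal diploid site.
    """
    target = sorted(child)
    return any(sorted((f, m)) == target for f, m in product(father, mother))


# The child looks homozygous A/A, but the mother is called G/G: an apparent error.
print(is_mendelian_consistent(("A", "A"), ("A", "A"), ("G", "G")))

# If the mother actually carries a deletion ("-") on one haplotype and the
# child inherited it, the true genotypes are consistent.
print(is_mendelian_consistent(("A", "-"), ("A", "A"), ("G", "-")))
```

The second call illustrates the hemizygous-deletion case from the post: when the deletion goes undetected, the remaining allele is reported as a homozygote and the site shows up as a Mendelian error.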

                            > That is why I prefer to use alignment. Nonetheless, probably this may be only important for myself, not a big issue for you.
                            Indeed, that's why we provide them.

                            Just always keep in mind the distinction between the rough initial mappings and the more refined (however localized) de novo assemblies in CG BAMs.

                            Also remember that our mapper behaves somewhat differently from MAQ or BWA, particularly when reads can map to multiple locations.

                            - Steve L
                            Last edited by slincoln; 02-10-2011, 08:24 AM.
                            Steve Lincoln
                            VP, Scientific Applications
                            Complete Genomics, Inc.
