Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by gtyrelle View Post
    Hi ritzriya,

    Using samtools mpileup to call variants off CG evidence reads is going to generate worse small variant calls calls than the native CG calls in the var and masterVar files. I would strongly advise using the native CG calls, unless you have very specific reasons for using samtools mpileup on the evidence reads.
    Hi gtyrelle,

    So you mean that its not advisable to use samtools mpileup for CG data to find variants? The command "generatemasterVar" (beta) from cgatools ONLY must be used to detect SNPs and INDELs?

    Comment


    • #17
      Hi ritzriya,

      if this is the exact command you used then awk is missing the input and didn't remove anything... I would use samtools view to pipe the output to the awk command and then write the output file, like this:

      samtools view -S myfile.sam | awk '{if ($6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P$|^[[:digit:]]+I/) print $0}' > myeditedfile.sam

      Comment


      • #18
        Originally posted by ritzriya View Post
        So you mean that its not advisable to use samtools mpileup for CG data to find variants?
        That is exactly what I mean. We have our own pipeline for small variants that uses local de novo assembly, and exports the results in both the var and masterVar files. The generate masterVar function in cgatools does not call variants. It aggregates information from a genome assembly into a single file that is easy to parse and process. For example, small variant calls (multiple lines per variant) along with scores and cross-references are exported to the var file. The masterVar file includes the same variants present in the var file (one variant per line), but has additional columns with information like allele read counts, annotations etc.

        Nora's commands will help you use the evidence files in samtools for visualization. I don't recommend using these files for variant calling, one reason is that there is a post-processing filter on the exported evidence reads, after we call variants, such that not all reads will be present for re-calling. In addition read depth is calculated by weighting reads that map in multiple locations, there is no mechanism in samtools to consider read depth calculations with multiple weighted mappings (at least as far as I know).
        Bioinformatics Applications, Europe
        Lifetech Inc. http://www.lifetech.com/

        Comment


        • #19
          Originally posted by gtyrelle View Post
          We have our own pipeline for small variants that uses local de novo assembly, and exports the results in both the var and masterVar files.
          Can you please throw some light on what kind of "pipeline" you are offering to call variants using CGAtools? Or other open source softwares like GATK need to be used to call variants only?

          Thanks and Regards,
          R.

          Comment


          • #20
            Originally posted by nora View Post
            Hi ritzriya,

            if this is the exact command you used then awk is missing the input and didn't remove anything... I would use samtools view to pipe the output to the awk command and then write the output file, like this:

            samtools view -S myfile.sam | awk '{if ($6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P$|^[[:digit:]]+I/) print $0}' > myeditedfile.sam
            I converted the sam file using the awk command mentioned above. But for downstream analysis, I have to convert it into BAM file using samtools view. When I do the conversion, I get the following error, nora:

            >samtools view -bh -S new.sam -o new.bam
            [samopen] no @SQ lines in the header.
            [sam_read1] missing header? Abort!
            How to resolve this one?

            Comment


            • #21
              The paper linked in my previous comment is a good starting point for understanding our assembly approach. Our assembly pipeline is not implemented in cgatools, again: cgatools does not call variants. The assembly pipeline is developed and run in-house on sequenced genomes, and the results (small variant calls, CNV, SV, MEIs) provided to customers. In other words, you don't need to "call variants" from CG data using open source software, as the variant calls are done for customers as part of the CG service data deliverables.

              Currently there is no open source software that will provide equivalent quality variant calls from CG data. My advice would be to find the var and masterVar files (containing SNPs/Indels/Subs) from the CG genome assemblies you are working with and start from there.
              Bioinformatics Applications, Europe
              Lifetech Inc. http://www.lifetech.com/

              Comment


              • #22
                Thank you gtyrelle for the clear guidance about the CG data.

                Comment


                • #23
                  Hi gtyrelle,

                  I have understood that variants are a part of your CG deliverables. But also I want to know that the evidence2sam command which produces the SAM file, contains records/lines where a read maps to multiple locations, such as:

                  GS27657-FS3-L02-8:1660459 179 chr14 19089795 16 12M1I1P4I6M6N5M1I4M = 19090178 383 CCTAATTCTTATTTTTATTTTTTTATTTATTTT 9::::656877887;<<<:::<;6-47737783 RG:Z:NA19238-L2-200-37-ASM-chr14 GC:Z:3S2G28S GS:Z:AAAA GQ:Z:::4-
                  GS27657-FS3-L02-8:1660459 115 chr14 19090178 16 10M5N23M = 19089795 -383 TTCATGAGAGGGTCCACTATTTTTCCCTTGTTA .08877587857;*;1<<9778877871;;::7 RG:Z:NA19238-L2-200-37-ASM-chr14 GC:Z:28S2G3S GS:Z:TGTG GQ:Z:77;;
                  Here, the read 'GS27657-FS3-L02-8:1660459' maps to the same chromosome (chr14) at two adjacent locations. Is there a way where we can obtain only unique mapping - i.e. one read maps to only one location in the genome using evidence2sam?

                  The reason why I am asking you about this is: that it causes a problem in the downstream analysis using GATK. GATK claims that it handles and finds polymorphisms from CG data. But it produced the following error on more than one test dataset I have tried:

                  ERROR MESSAGE: SAM/BAM file SAMFileReader{CGA_test/originalSAM/output_sorted.bam} is malformed: Adjacent I/D events in read GS27657-FS3-L02-8:1660459
                  Please let me know if you think this is feasible. I merely want to understand the difference of the output from a public tool and the one from CG themselves. I hope you got my concern. Thanks in advance.

                  Comment


                  • #24
                    Hi ritzriya,

                    I was going to ask if you could repost this question in the Complete Genomics forum but I see you have already done that, thank you! I'll post my reply to the new thread.

                    Greg
                    Bioinformatics Applications, Europe
                    Lifetech Inc. http://www.lifetech.com/

                    Comment


                    • #25
                      Originally posted by ritzriya View Post
                      I converted the sam file using the awk command mentioned above. But for downstream analysis, I have to convert it into BAM file using samtools view. When I do the conversion, I get the following error, nora:
                      How to resolve this one?
                      Sorry ritzriya, I forgot you need to save the header of your file to a separate file and then concatenate them (e.g. with the cat command) after awk filtering.

                      Comment


                      • #26
                        Originally posted by nora View Post
                        Sorry ritzriya, I forgot you need to save the header of your file to a separate file and then concatenate them (e.g. with the cat command) after awk filtering.
                        Oops, thank you. I observed that there were no headers in the new SAM file formed. Now I am able to create the BAM file successfully. Thanks again, nora.

                        Though this conversion of SAM to BAM is successful, I am back to square ONE! Oh god, I am getting the same error which I started to post with:

                        $./samtools mpileup -uf hg1to24.fa CGA/output_sorted.bam | bcftools/bcftools view -vcg - > mpileup_view.bcf
                        [mpileup] 1 samples in 1 input files
                        <mpileup> Set max per-file depth to 8000
                        samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed.
                        [afs] 0:13637.821 1:1240.324 2:329.855
                        Last edited by ritzriya; 03-12-2012, 05:06 AM. Reason: Error not resolved!

                        Comment


                        • #27
                          Why not just fix cga pipeline so that the bam files are generated correctly?

                          Comment


                          • #28
                            Originally posted by ritzriya View Post
                            Oops, thank you. I observed that there were no headers in the new SAM file formed. Now I am able to create the BAM file successfully. Thanks again, nora.

                            Though this conversion of SAM to BAM is successful, I am back to square ONE! Oh god, I am getting the same error which I started to post with:
                            ritzriya - Did you ever figure a workaround for this? I made the above changes and am also getting the same error again.
                            Last edited by chemic91; 02-15-2013, 11:04 AM.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 06:37 PM
                            0 responses
                            11 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 06:07 PM
                            0 responses
                            10 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            51 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            67 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X