  • Exome sequencing analysis manual

    Hi Folks,

    As I was writing a short guide to exome analysis at our institute, I thought it might be of some use to others, especially newbies who need some kind of starting point for analyzing exome data (pretty much like the RNA-seq manual I once read in an older thread...). Instead of explaining everything in 100 new threads, one could then point to that manual...

    It is the way we do exome analysis at our Institute, but I would be happy if people help improve the manual, add their knowledge and expand it, like a common knowledge base for exome-level analysis.

    I attached the pdf version and a .doc version within a zip folder, as the filesize was too large for uploading the doc file alone.

    The most updated version can be found in the SeqWiki (http://seqanswers.com/wiki/How-to/exome_analysis)
    (just to make it clear, it is not regularly updated and its only goal is to get people started on the use of tools often used in exome sequencing)

    Any comments highly appreciated!

    P.S. I added a (very) short visualization chapter
    Last edited by ulz_peter; 04-12-2012, 10:08 PM. Reason: updated manual

  • #2
    Thanks a lot.

    I have learned a lot.



    • #3
      You have a weird typo "foolwong" just above the FASTQ example.

      Also, your introduction about the different FASTQ encodings is out of date now. Illumina now follows the Sanger convention. They also changed the read naming convention; in particular, the old /1 and /2 suffixes are gone.

      See this thread for details:
      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


      Also I've never heard FASTQ called a "FastAlignment and Quality file" (glossary on last page).
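For readers who want to check which convention their own files use, here is a minimal Python sketch. It assumes the commonly documented offsets (Sanger and current Illumina use ASCII offset 33, older Illumina pipelines used 64) and the newer header layout in which the mate number moves into a space-separated comment such as `1:N:0:ATCACG`. The function names are made up for illustration, and the offset guess is only a heuristic.

```python
import re


def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from a sample of quality strings.

    Heuristic only: characters below ';' (ASCII 59) cannot occur in the old
    offset-64 files, so seeing one implies offset 33. Without such a
    character we assume 64, which can be wrong for uniformly high-quality
    offset-33 reads.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


def decode_qualities(quality, offset=33):
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def read_number(header):
    """Return (read_name, mate) for old-style and new-style headers.

    Old style: '@name/1' or '@name/2'.
    New style: '@name 1:N:0:INDEX' (mate number starts the comment field).
    mate is None when neither pattern is found.
    """
    header = header.lstrip("@")
    name, _, comment = header.partition(" ")
    if comment and re.match(r"^[12]:[YN]:\d+", comment):
        return name, int(comment[0])
    if name.endswith(("/1", "/2")):
        return name[:-2], int(name[-1])
    return name, None


if __name__ == "__main__":
    print(guess_phred_offset(["!!!IIII"]))
    print(decode_qualities("!I"))
    print(read_number("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
    print(read_number("@HWUSI-EAS100R:6:73:941:1973#0/2"))
```

Running the offset guess on a few thousand reads rather than a single record makes the heuristic considerably more reliable.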



      • #4
        Thanks for the hints. As we do not produce Illumina data in our lab (yet), I hadn't heard of those changes, although they seem to have been implemented a while ago...

        The typo should read "following". I will rewrite that part and repost it...



        • #5
          This is a great document.

          Thanks a lot. This is a great document. I wish I had read this document earlier.



          • #6
            As the GATK local realignment around indels portion of the website does not explicitly state to "FixMateInformation", I am curious whether that will affect downstream analysis in any way?

            Great document, by the way.



            • #7
              Very nice document, thanks for sharing.



              • #8
                Great work ulz_peter ! That is exactly what my pipeline is like and I'm glad to see that my choices of tools are also someone else's favorites.

                I have seen others use different tools but I think the BWA + GATK + ANNOVAR is the best combination of tools so far...



                • #9
                  Hi ulz_peter,

                  I have gone through almost the whole process according to your suggestions. However, at "3.2. Variant quality score recalibration", I encountered some problems. (I used the GATK-1.0.5506 version.)

                  I got the error message: "Argument with name '--cluster_file' is missing." However, I did not put "--cluster_file" at all.

                  I looked at some help documents, and found that this kind of "cluster_file" is supposed to be generated by "GenerateVariantClusters". Have you used GenerateVariantClusters before? Is it necessary?

                  Thanks again for the wonderful manual.



                  • #10
                    Originally posted by NGSfan
                    Great work ulz_peter ! That is exactly what my pipeline is like and I'm glad to see that my choices of tools are also someone else's favorites.

                    I have seen others use different tools but I think the BWA + GATK + ANNOVAR is the best combination of tools so far...
                    We use Novoalign + SAMtools... I'm curious if there are any papers out there comparing the methods?



                    • #11
                      Hi guys ,
                      Thanks for all your responses. I must admit that the GATK parts are a little outdated (already). I'm gonna switch to the new version this week and will update the manual accordingly...

                      @pc2009open: I can't find any hint for the use of a cluster_file argument in variant quality score recalibration... Has anyone else seen that?



                      • #12
                        Originally posted by Heisman
                        We use Novoalign + SAMtools... I'm curious if there are any papers out there comparing the methods?
                        Papers covering all variations and combinations have been hard to find. I did find one under review (Nature proceedings?) where they claim CASAVA 1.8 comes pretty close to GATK.

                        I think Novoalign is an excellent aligner, although it requires some tweaking to increase sensitivity on indels that are missed with default settings.

                        We have done a comparison in our lab with BWA, Stampy, Novoalign, and BFAST. Stampy is the best aligner in our hands (it detected more of our SNV and INDEL training set), but Novoalign alignments looked a lot cleaner. I think that with some tweaking of the gap open penalty for indels, Novoalign might have performed better. It just takes some effort to test the parameters more to see if it can handle all cases.

                        GATK is definitely ahead of the game for SNV and indel calling (sensitivity and specificity wise). SAMtools is sufficient - probably you can lean on it if you set the parameters to emphasize specificity instead of sensitivity.
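A comparison like the one described above (how much of a training set each pipeline detects) can be sketched in a few lines of Python. This is only an illustration under assumptions: variants are matched exactly on (chromosome, position, ref, alt), the function names are invented here, and precision is reported in place of specificity because true negatives are not well defined for a list of variant calls.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalise a variant into a hashable key for exact matching."""
    return (chrom, int(pos), ref.upper(), alt.upper())


def compare_calls(truth, calls):
    """Compare a call set against a truth (training) set.

    Both arguments are iterables of variant keys. Returns counts plus
    sensitivity (fraction of truth variants recovered) and precision
    (fraction of calls that are in the truth set).
    """
    truth, calls = set(truth), set(calls)
    tp = len(truth & calls)   # called and in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    fp = len(calls - truth)   # called but not in the truth set
    sensitivity = tp / (tp + fn) if truth else float("nan")
    precision = tp / (tp + fp) if calls else float("nan")
    return {"tp": tp, "fn": fn, "fp": fp,
            "sensitivity": sensitivity, "precision": precision}


if __name__ == "__main__":
    # Toy data, purely illustrative.
    truth = [variant_key("chr1", 100, "A", "G"),
             variant_key("chr1", 200, "C", "T"),
             variant_key("chr2", 50, "G", "GA"),
             variant_key("chr2", 75, "T", "C")]
    calls = [variant_key("chr1", 100, "a", "g"),
             variant_key("chr1", 200, "C", "T"),
             variant_key("chr2", 50, "G", "GA"),
             variant_key("chr3", 10, "A", "T")]
    print(compare_calls(truth, calls))
```

Exact matching is the simplest choice; indels in particular can be represented in several equivalent ways, so a real comparison would normalise them first.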



                        • #13
                          Thank you so much for posting this pipeline, I've been doing the same for some time. Tomorrow I will post some comments about my results so far.

                          I think you could add this pipeline to yours:



                          Let's make this thread a big reference for anyone doing exome sequencing... Please!!!



                          • #14
                            One question: how many raw SNPs are you getting after running UnifiedGenotyper for the first time?

                            Here I'm getting about 300,000 SNPs and I think there is something wrong with these numbers... Shouldn't it be around 20,000 SNPs?

                            I'm running my analysis again using a BED file from SeqCap EZ Human Exome Library v2.0 (http://www.nimblegen.com/products/se...tml#annotation) but still... 300 thousand SNPs is a lot...
                            Last edited by raonyguimaraes; 10-10-2011, 05:41 PM. Reason: english mistaken ... :)
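One quick sanity check for a count like this is to see how many of the raw calls actually fall inside the capture targets. The following Python sketch does that for a plain-text VCF and a BED file. It assumes standard coordinates (BED is 0-based and half-open, VCF positions are 1-based) and that chromosome names match between the two files ("chr1" vs "1" would silently give zero hits). The function names are illustrative only.

```python
import bisect
from collections import defaultdict


def load_targets(bed_lines):
    """Parse BED lines into sorted, merged intervals per chromosome."""
    raw = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:  # overlaps or touches the previous one
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def on_target(targets, chrom, pos):
    """True if the 1-based VCF position lies inside a 0-based BED interval."""
    intervals = targets.get(chrom, [])
    # Index of the last interval whose start is <= pos - 1.
    i = bisect.bisect_right(intervals, (pos - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos - 1 < intervals[i][1]


def count_variants(vcf_lines, targets):
    """Return (total_records, records_on_target) for VCF data lines."""
    total = inside = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split()[:2]
        total += 1
        if on_target(targets, chrom, int(pos)):
            inside += 1
    return total, inside


if __name__ == "__main__":
    bed = ["chr1\t100\t200", "chr1\t150\t300", "chr2\t0\t10"]
    vcf = ["##fileformat=VCFv4.1",
           "chr1\t101\t.\tA\tG", "chr1\t100\t.\tC\tT", "chr1\t300\t.\tG\tA",
           "chr1\t301\t.\tT\tC", "chr2\t10\t.\tA\tC", "chr3\t5\t.\tG\tT"]
    print(count_variants(vcf, load_targets(bed)))
```

With real files, pass open file handles instead of the toy lists. If most of the raw calls turn out to be off target, the total is inflated by reads mapping outside the captured regions.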



                            • #15
                              Originally posted by NGSfan
                              Papers covering all variations and combinations have been hard to find. I did find one under review (Nature proceedings?) where they claim CASAVA 1.8 comes pretty close to GATK.

                              I think Novoalign is an excellent aligner, although it requires some tweaking to increase sensitivity on indels that are missed with default settings.

                              We have done a comparison in our lab with BWA, Stampy, Novoalign, and BFAST. Stampy is the best aligner in our hands (it detected more of our SNV and INDEL training set), but Novoalign alignments looked a lot cleaner. I think that with some tweaking of the gap open penalty for indels, Novoalign might have performed better. It just takes some effort to test the parameters more to see if it can handle all cases.

                              GATK is definitely ahead of the game for SNV and indel calling (sensitivity and specificity wise). SAMtools is sufficient - probably you can lean on it if you set the parameters to emphasize specificity instead of sensitivity.
                              Interesting. Thank you for your post. We do a pretty good job (I think) using the latest SAMtools mpileup command with the -A and -B options and a minimum mapping quality per read of 50, but I haven't done anything rigorous to determine what our sensitivity/specificity is. I may go ahead and look at comparing it with GATK.
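For anyone wanting to reproduce settings like these, here is a small Python sketch that assembles the corresponding `samtools mpileup` command line. It only builds and prints the command, it does not run it. The file names are placeholders, and using `-q` for the mapping-quality cutoff and `-f` for the reference is an assumption about how the poster's setup would look, not something stated in the post.

```python
import shlex


def mpileup_command(reference, bam, min_mapq=50):
    """Build an argv list for samtools mpileup with the options discussed.

    -A  keep anomalous read pairs instead of skipping them
    -B  disable BAQ (per-base alignment quality) computation
    -q  skip alignments with mapping quality below this value
    -f  reference FASTA
    """
    return ["samtools", "mpileup",
            "-A", "-B",
            "-q", str(min_mapq),
            "-f", reference,
            bam]


if __name__ == "__main__":
    # Placeholder file names; substitute your own reference and BAM.
    cmd = mpileup_command("reference.fa", "sample.sorted.bam")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be passed directly to `subprocess.run` once the paths are real, which avoids shell quoting problems with unusual file names.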
