
  • processing the outputs of illumina hi seq

    Hi,

    We will be using an Illumina HiSeq 2000 to sequence exomes. I have not received the data yet, and I am putting together a plan for the analysis steps.

    Does anyone know what type of files I will be starting with (the output from the Illumina sequencer)? Will they be in "fastq" format? Is there an outline of how to process the files up to the analysis stage?

    thanks

  • #2
    The output of the sequencer will be fastq files. If the facility you are getting these from uses the new version (v.1.8) of the Illumina pipeline, each sample may have multiple gzip-compressed files that you will need to merge (or analyze in parallel and then merge). The quality values in the fastq files will be in the "sanger" format (http://en.wikipedia.org/wiki/FASTQ_format). The files will be ready for analysis (starting with some QC).
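A minimal Python sketch of the merge step and of what "sanger" quality encoding means, in case it helps. The function names and any file names are hypothetical, not something the Illumina pipeline provides. It relies on two things: gzip files can be joined byte-for-byte into a valid multi-member gzip file, and Sanger-format qualities are Phred scores stored as ASCII characters with an offset of 33.

```python
import gzip
import shutil


def merge_fastq_gz(parts, merged_path):
    """Concatenate gzip-compressed FASTQ chunks into one .gz file.

    Gzip members can be joined byte-for-byte, so nothing is decompressed here.
    """
    with open(merged_path, "wb") as out:
        for part in sorted(parts):  # keep chunk order (e.g. _001, _002, ...)
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def count_fastq_records(path):
    """Count records in a gzip-compressed FASTQ file (4 lines per record)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def sanger_to_phred(quality_string):
    """Decode a Sanger-encoded quality string (ASCII offset 33) to Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]
```

Comparing `count_fastq_records` on the merged file against the sum over the chunks is a cheap sanity check before starting QC.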

    Are you planning to analyze the data using local computing infrastructure or with an online tool like Galaxy?


    • #3
      thanks GenoMax for the input

      It looks like I will probably be getting the raw data in ".bcl" format. Based on reading some papers, I can use CASAVA to convert it into "fastq" format and then use BWA to align against the reference. I have seen other papers use "Maq" or "ELAND". Does anyone know the difference between BWA, Maq and ELAND?
      In terms of using an online tool such as Galaxy, I have never used it before. Is there an online tutorial on how to use it?

      thanks



      • #4
        kjaja,

        I doubt you are going to get data in the BCL format. You will need the Illumina pipeline software to process the raw data in BCL format. Last I checked this software was not freely available. If you were doing this only for one experiment then you would not want to spend time on installing CASAVA (assuming you got your hands on a copy).

        In general, the bcl --> fastq conversion step is performed by the facility that provides your sequence data. Depending on their policy, you can request that your sequences be aligned to your "reference" genome using ELAND. ELAND is Illumina's short-read alignment tool. The most commonly used aligners are bwa, bowtie and SOAP (this site has a long list of software for NGS data analysis: http://seqanswers.com/wiki/Software/list).

        Galaxy has tutorials available at the links below for RNA-seq analysis:

        Galaxy is a community-driven web-based analysis platform for life science research.


        They also have video tutorials ("live quickies") on the main page of Galaxy (http://main.g2.bx.psu.edu/) to get you started.



        • #5
          There are a number of aligners out there. ELAND is Illumina's. Maq is pretty old school. bwa uses a Burrows-Wheeler Transform algorithm, which has been the preferred approach for a while. Speed doesn't matter if you are working on small genomes, like bacteria, but for anything larger than tens of megabases you will be better off with a BWT-based aligner.

          You want your output to be in SAM or BAM format (BAM is binary, compressed SAM). This is becoming the standard: after the header, every line is one read alignment, with all the information about where and how well that read mapped.

          You are then going to want to do stuff to the BAM files. SAMtools is one suite of programs that can help, as is the Broad Institute's Genome Analysis Toolkit (GATK). SAMtools is a lot less complex. There are a few different tools for visualization, like Galaxy and IGV. BEDTools can be useful too. For exome capture, you probably want to align to the whole genome, then filter for just the reads that overlap your exons. BEDTools can do that.
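To make the "filter for reads that overlap your exons" idea concrete, here is a rough Python sketch. It is only an illustration of the logic, not how BEDTools is implemented: the function names and the in-memory exon list are assumptions, and the linear scan would be far too slow for a real exome. It reads the standard SAM columns (flag, reference name, 1-based position, CIGAR) and compares the alignment span to exon intervals given as 0-based, half-open coordinates, as in BED files.

```python
import re

# CIGAR operations that consume reference bases
REF_CONSUMING_OPS = "MDN=X"


def sam_alignment_span(sam_line):
    """Return (chrom, start, end), 0-based half-open, or None if the read is unmapped."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
    if flag & 0x4 or cigar == "*":  # 0x4 is the "segment unmapped" flag bit
        return None
    ref_len = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING_OPS
    )
    start = pos - 1  # SAM positions are 1-based
    return chrom, start, start + ref_len


def overlaps_any(span, exons):
    """True if the span overlaps at least one (chrom, start, end) exon interval."""
    chrom, start, end = span
    return any(
        exon_chrom == chrom and start < exon_end and exon_start < end
        for exon_chrom, exon_start, exon_end in exons
    )
```

In practice you would let BEDTools or SAMtools do this on the BAM file directly; the sketch just shows what "overlap" means for one alignment record.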



          • #6
            Thank you all, that was helpful.

            I have a question related to using Galaxy. I tried to map one sample to the reference using BWA and it took a few hours! Is that normal? How do people go about processing many samples? Would Galaxy be the tool to use? Can we use the command line or scripts to process data using Galaxy?


            • #7
              Yes, a few hours to align tens or hundreds of millions of reads to a mammalian genome is normal. If you can use multiple processors (with the -t option in bwa), that'll speed things up.
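As a hedged illustration of the -t option, the sketch below only builds the command line and does not run anything. It assumes a current bwa install where the `mem` subcommand is available and a reference that has already been indexed; `build_bwa_command` and the file names in the comment are made up for the example.

```python
import os


def build_bwa_command(reference, fastq_files, threads=None):
    """Build a 'bwa mem' command line that uses several threads via -t."""
    if threads is None:
        threads = os.cpu_count() or 1  # fall back to 1 if the count is unknown
    return ["bwa", "mem", "-t", str(threads), reference, *fastq_files]


# Hypothetical usage; bwa writes SAM to standard output:
#   import subprocess
#   cmd = build_bwa_command("ref.fa", ["sample_R1.fastq.gz", "sample_R2.fastq.gz"])
#   with open("sample.sam", "w") as out:
#       subprocess.run(cmd, stdout=out, check=True)
```

Looping over samples and calling this once per sample is one simple way to script many alignments on local hardware.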



              • #8
                Originally posted by kjaja
                I have a question related to using Galaxy. I tried to map one sample to the reference using BWA and it took a few hours! Is that normal?
                Yes. That is normal for Galaxy. Remember you are sharing the site with tens of other users and jobs. Even if you do this locally on your own hardware, alignments for large genomes (human) will take on the order of a few hours per sample.

                Originally posted by kjaja
                How do people go about processing many samples? Would Galaxy be the tool to use? Can we use the command line or scripts to process data using Galaxy?
                Some people do not have access to local compute infrastructure, so for them Galaxy (or Galaxy on the Amazon cloud) is a good (or the only) option.

                If you are comfortable with the command line and have access to local compute infrastructure then you do not need public Galaxy. But if you still want to use the easy web interface of Galaxy then consider setting up a local instance of Galaxy (http://wiki.g2.bx.psu.edu/) and use it that way.


                • #9
                  CASAVA .bcl to fastq output

                  Hi,

                  We have a control run on a new HiSeq machine installed recently. The fastq files extracted with CASAVA 1.8.2 have a different format (pasted below):


                  @HWI-ST1072:1440BVUACXX:2:1101:1242:2124 1:N:0:
                  CGGTTTTTATTAAACATATAAACAATTCTTACAGATTGACATCGTACGAGC
                  +
                  ;@@DDD++<CD:2:A<<a@F:333<3AFAC9+1**1:C**11CE0?DGF

                  The manual says that when sequences are filtered they will have "Y" in the header. However, all my sequences (100%) have "N". I have run FastQC on these sequences and it shows the quality to be excellent. I am also attaching a picture of the read quality. The presence of "N" worries me, and I want to know whether this is good sequence or bad. What does "N" actually mean here?


                  • #10
                    CASAVA v.1.8.2 FASTQ files only contain reads that passed filtering (unless you run the analysis with the "--with-failed-reads" option, which includes reads that would normally be filtered out).

                    "N" here means the sequence is *not* filtered, i.e. it passed the quality filter.


                    Last edited by GenoMax; 11-23-2011, 09:36 AM. Reason: added info
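For anyone who wants to check the flag across a whole file, here is a small Python sketch that parses the part of a CASAVA 1.8 header after the space, which has the layout read:is_filtered:control_number:index_sequence. The class and function names are made up for this example; only the field layout comes from the header format itself.

```python
from typing import NamedTuple


class ReadFlags(NamedTuple):
    read_number: int      # 1 or 2 for paired-end runs
    is_filtered: bool     # True means the read FAILED the filter ("Y")
    control_number: int   # 0 when no control bits are set
    index_sequence: str   # may be empty, as in the header quoted in this thread


def parse_casava18_flags(header):
    """Parse the 'read:is_filtered:control:index' part of a CASAVA 1.8 header."""
    _, _, description = header.strip().partition(" ")
    read, filtered, control, index = description.split(":", 3)
    if filtered not in ("Y", "N"):
        raise ValueError(f"unexpected filter flag: {filtered!r}")
    return ReadFlags(int(read), filtered == "Y", int(control), index)


flags = parse_casava18_flags("@HWI-ST1072:1440BVUACXX:2:1101:1242:2124 1:N:0:")
print(flags.is_filtered)  # False: the read was not filtered out
```

So "N" for every read is exactly what you expect from a default run, since failed reads are left out of the files.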



                    • #11
                      yep ... why simply say 'passed' if you can say 'not failed' ... keep things complicated ;-)

                      SCNR,
                      Sven
