  • RNAseq Pipeline questions

    We recently ran our FIRST RNAseq run on our new Illumina NextSeq. We ran a paired-end run with 12 indexed samples. The data generated comes in 8 fastq files for each sample (4 lanes, Read 1, Read 2) for a total of 96 fastq files. We are planning on using the following pipeline to analyze our data for differential gene expression.

    Trimmomatic ---> Rockhopper

    We are running that analysis on a Windows 7 machine (I don't have any other options), and I have Geneious installed as well.

    So with all that background here is my question:

    At what point in the pipeline do I combine the 8 fastq files for each sample into 2 fastq files (R1, R2)?

    I was thinking that I would combine the files before I run Trimmomatic, so as to save myself MANY repetitions of the same analysis steps. The only way I have found to combine fastq files on my Windows 7 machine is to use Geneious (which gives me a warning that some of the metadata may be lost).

    I ran a side by side comparison of the 2 combined fastq files with the 8 separate fastq files using my workflow and the output in Rockhopper said I had differential gene expression present in ~35% of genes (which doesn't make ANY sense since these are EXACTLY the same samples, just one set has been combined and the other has not).

    Any guidance would be greatly appreciated!!!!!

  • #2
    You should merge the R1/R2 fastq file pieces for each sample before doing any downstream analysis.

    You can (probably) set up the NextSeq to generate a single file for each sample by default, so you should not need to do this manually in the future.


    • #3
      Originally posted by GenoMax
      You should merge the R1/R2 fastq file pieces for each sample before doing any downstream analysis.

      You can (probably) set up the NextSeq to generate a single file for each sample by default, so you should not need to do this manually in the future.
      For clarification, do you mean combining the four lanes for R1 into one file and the four lanes for R2 into one file, for a total of 2 paired-end files, OR combining all 8 files (L1_R1, L1_R2, etc.) into 1 file that has the paired-end reads interleaved?


      • #4
        Leave the R1/R2 reads in separate files unless you have a program that requires them to be interleaved. In that case you can't just concatenate the files together.
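
        A minimal Python sketch of the lane merge described above, for anyone who has Python on a Windows box but no cat. It assumes bcl2fastq-style file names such as SampleA_S1_L001_R1_001.fastq.gz (check your own names and adjust the pattern); the function name merge_lanes and the folder names in the usage note are made up for illustration. It relies on the fact that gzip files joined end to end are still a valid gzip file, so nothing is decompressed.

```python
import gzip
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme (not confirmed by the thread):
#   <sample>_L<3-digit lane>_<R1|R2>_001.fastq.gz
LANE_FILE = re.compile(
    r"^(?P<sample>.+)_L(?P<lane>\d{3})_(?P<read>R[12])_001\.fastq\.gz$"
)


def merge_lanes(in_dir, out_dir):
    """Concatenate per-lane files into one R1 and one R2 file per sample."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the lane files by (sample, read direction).
    groups = defaultdict(list)
    for path in in_dir.glob("*.fastq.gz"):
        match = LANE_FILE.match(path.name)
        if match:
            groups[(match["sample"], match["read"])].append((match["lane"], path))

    merged = {}
    for (sample, read), lane_files in groups.items():
        target = out_dir / f"{sample}_{read}.fastq.gz"
        with open(target, "wb") as out:
            # Sort by lane so R1 and R2 are joined in the same lane order,
            # which keeps each read pair at the same position in both files.
            for _lane, path in sorted(lane_files):
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        merged[(sample, read)] = target
    return merged
```

        Calling merge_lanes("raw_fastq", "merged_fastq") would write one R1 and one R2 file per sample into merged_fastq. R1 and R2 are never mixed, so the output is two paired files, not an interleaved one.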

        Look into the BBMap suite for your trimming/alignment needs, since you will be able to use it on Windows.

        Do you have the option of running a virtual machine/unix on this box?


        • #5
          Thank you for the clarification and for the info on BBMap. I will look into it.

          I do have the option to run a virtual machine on this box, and I have discussed it with my supervisor multiple times while getting a data analysis pipeline in place. I would really like to concatenate the files (I think this is the right word, as in the cat command in unix) and not have to import them into Geneious, combine them (with probable loss of metadata), and then export them again. I will also look into having the NextSeq combine the files for me (we are using BaseSpace Onsite and I am not very familiar with it, so I will investigate this possibility as well).

          Thank you SOO much for your help!


          • #6
            Follow-up.

            I ended up installing Oracle VirtualBox (https://www.virtualbox.org/) and setting up Ubuntu as a virtual machine so I could combine the fastq files using the zcat command in the terminal. You have to make sure that you have the Guest Additions installed so that you can create a "shared" folder that the host machine and the guest OS both recognize. The user manual for VirtualBox is pretty good at explaining this (and Google anything you don't understand).


            • #7
              To put a fine point on using zcat, see post #4 in this thread: http://seqanswers.com/forums/showthread.php?t=51395


              • #8
                Thank you for the heads up about the zcat command.

                Just to make sure, I should double-check my concatenated files and make sure that all the reads that I expect to be there are actually there, right?

                Thank you again for all your guidance on this.


                • #9
                  Wouldn't hurt to check the size of the files. You could count the reads too, if you want to be extra careful (count the ^@ characters).

                  Edit: See #11 below for an amendment.
                  Last edited by GenoMax; 11-06-2015, 04:19 AM.


                  • #10
                    Originally posted by GenoMax
                    Wouldn't hurt to check the size of the files. You could count the reads too, if you want to be extra careful (count the ^@ characters).
                    Counting ^@ might give a wrong result, as "@" is a valid ASCII character in the quality string. Counting the lines with "wc -l" and dividing by 4 is the safest way to be sure that everything is there.
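
                    For readers without wc, here is a small Python sketch of the same check. It assumes standard 4-line fastq records (no wrapped sequence lines); the function names are illustrative only. count_at_lines is included to show how the "^@" count can overshoot when a quality line happens to start with "@".

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed fastq file as text."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def count_reads(path):
    """Count reads as (number of lines) / 4, like 'wc -l' divided by 4."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        # A remainder suggests a truncated or malformed file.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def count_at_lines(path):
    """Naive count of every line starting with '@' (can overcount)."""
    with open_fastq(path) as handle:
        return sum(1 for line in handle if line.startswith("@"))
```

                    After merging, count_reads on the combined file should equal the sum of count_reads over the per-lane files, and the R1 and R2 totals for a sample should match each other.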


                    • #11
                      Good point.

                      @beki.renberg: Use "^@HWI" (or whatever common machine identifier you see in your data) instead.
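
                      A rough Python equivalent of the prefix count, as a sketch only. The "@HWI" default is just the example from this post; use whatever prefix your own header lines share. Since a quality line could in principle start with the same characters, this version looks only at the first line of each 4-line record. It assumes unwrapped 4-line fastq records, and count_headers is a made-up name.

```python
import gzip
from itertools import islice


def count_headers(path, prefix="@HWI"):
    """Count records whose header line starts with the given prefix."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Step through lines 1, 5, 9, ... which are the header lines
        # when every record is exactly 4 lines long.
        headers = islice(handle, 0, None, 4)
        return sum(1 for line in headers if line.startswith(prefix))
```

                      If this number differs from the line count divided by 4, either the prefix is not shared by every read or the file is not in clean 4-line format.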
