  • Find Common Reads between two FASTQ files

    Dear experts,

    I have two FASTQ files containing RNA-Seq reads from two technical replicates of one sample (replicated at the level of running the sequencing machine twice). I want to select the reads that appear in both FASTQ files by comparing the read sequences between the two files. How can I do that? Do any bioinformatics tools do this?

  • #2
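    The easiest way to do this:
    Code:
    comm -12 <(sort "file1") <(sort "file2")
    Redirect the output to get the results in a file:
    Code:
    comm -12 <(sort "file1") <(sort "file2") > common_lines.txt
    This command compares two files and prints the lines common to both. comm expects sorted input, hence the sort calls (the <(...) syntax needs bash or a similar shell).
    Using the -23 flag you get the lines unique to file1; using the -13 flag you get the lines unique to file2.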

    May I ask why you want to do this?


    • #3
      Thanks mknut for your reply.

      Well... comm may not work here, because each read spans 4 lines and I need to select reads that have the same sequence (but may differ in their quality strings). So I am wondering if there is an alternative.

      Regarding your question about why I am trying to do this:
      I have two technical replicates of RNA-Seq sequencing for my sample, so I have two libraries of reads (2 FASTQ files). To represent the sample by one library, I have two options:
      1) Combine the reads from both files. This will increase coverage. However, artifact reads generated in either of the two libraries will not be detected.
      2) Select the reads that appear in both libraries (the same sequence, though the sequencing quality may differ). This option is applicable in my case because my technical replicates went through the same library preparation procedure (same fragmentation, etc.). The only difference is that the Illumina sequencing machine was run twice!

      Maybe there is another, better option. If you or anyone else has another suggestion, I would appreciate a reply to this message.
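The requirement described in this post (keep reads whose sequence occurs in both files, ignoring the quality string) could be sketched in Python roughly as follows. This is only a minimal illustration, not an established tool: the function names `read_fastq` and `common_reads` are made up here, the parser assumes the simple 4-lines-per-record FASTQ layout, and all sequences of the second file are held in memory, which may not be practical for large runs.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines.

    Assumption: every record is exactly 4 lines (no wrapped sequences).
    A trailing incomplete record is silently ignored.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        header, seq, _plus, qual = record
        yield header, seq, qual


def common_reads(lines_a, lines_b):
    """Return the records of A whose sequence also occurs somewhere in B.

    Only the sequence line is compared; headers and quality strings may differ.
    """
    seqs_b = {seq for _, seq, _ in read_fastq(lines_b)}
    return [rec for rec in read_fastq(lines_a) if rec[1] in seqs_b]


# Tiny in-memory example; real use would pass open file handles instead.
file1 = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n".splitlines(keepends=True)
file2 = "@x1\nACGT\n+\n####\n@x2\nGGGG\n+\nIIII\n".splitlines(keepends=True)

for header, seq, qual in common_reads(file1, file2):
    print(header, seq, qual)
```

Note that this keeps every copy of a sequence in the first file as long as it appears at least once in the second, so highly expressed transcripts with identical reads are not matched one-to-one.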


      • #4
        I am not entirely sure why you want to make one library from two technical replicates. If you keep the reads as technical replicates, you preserve information about the variability introduced by the method, which is the idea behind having technical replicates in the first place. I think it would be better to continue with the analysis without merging the files: do QC and mapping for them separately, then use software that accommodates replicates (most do) in the further analysis (e.g. cufflinks, cuffdiff). What exactly are you investigating, differential gene expression or something else?

        One other thing -
        "my technical replicates do have the same library preparation procedure"
        Correct me if I'm wrong, but I understand that you had one sample, divided it into two, and then both parts went through the same library prep protocol and sequencing. In that case you will see not only variability originating from the sequencing, but variability originating from library prep as well. Have a look at this thread.


        • #5
          Yes. I am studying differential gene expression between samples. I have n samples, each with m technical replicates, where m differs from one sample to another.

          The thread you referred to is useful. My technical replicates of each sample are just "the same library was sequenced on m different lanes". So, as per that thread, the variance between these replicates reflects the technical variance of Illumina sequencing. Consequently, my purpose is to remove this technical sequencing variance between technical replicates (e.g. artifact reads) and focus on the biological variance between samples. That is what I am still inclined to do, but I welcome criticism to find out whether this approach can be replaced by a better one.

          Just found this thread..
          http://seqanswers.com/forums/showthread.php?t=16918


          • #6
            I would just combine the files for the technical replicates.

            For example, see the vignette (documentation) for the Bioconductor package DESeq:

            Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution


            • #7
              Originally posted by Fernas:
              "My technical replicates of each sample are just "the same library was sequenced on m different lanes"."
              I would just merge them then, probably after the mapping step. You can still run some QC on them separately to make sure you observe a good correlation between these technical replicates, but my understanding is that the results should not differ from what you would get by sequencing deeper.
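As a rough illustration of the "QC separately, then merge" suggestion, the sketch below computes a Pearson correlation between per-gene counts from two lanes and then sums the counts. Everything in it is hypothetical: the gene names, the count values, and the helper names `pearson` and `merge_counts` are invented for the example, and the per-gene counts are assumed to come from whatever mapping and counting step you already use.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def merge_counts(counts_a, counts_b):
    """Sum per-gene counts of two technical replicates (missing genes count as 0)."""
    genes = sorted(set(counts_a) | set(counts_b))
    return {g: counts_a.get(g, 0) + counts_b.get(g, 0) for g in genes}


# Made-up per-gene read counts for the same library sequenced on two lanes.
lane1 = {"geneA": 10, "geneB": 200, "geneC": 35}
lane2 = {"geneA": 12, "geneB": 190, "geneC": 40}

genes = sorted(set(lane1) & set(lane2))
r = pearson([lane1[g] for g in genes], [lane2[g] for g in genes])
merged = merge_counts(lane1, lane2)

print(f"correlation between lanes: {r:.4f}")
print(merged)
```

If the lanes correlate well, the summed counts can be treated as one deeper-sequenced sample in the downstream differential expression analysis.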


              • #8
                hello all,

                I have a similar problem. I have two FASTQ files, R1 and R2, which are not equal in size.

                I want to extract the reads that match between them for further mapping.

                Is there any tool or command for this?
                Kindly help...
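One way to read this question is that R1 and R2 are paired-end files that have lost some mates (for example after filtering), and only reads present in both should be kept. Under that assumption, a minimal Python sketch could match records by read ID. The helpers `read_fastq`, `read_id` and `pair_reads` are hypothetical names, the ID is assumed to be the first whitespace-delimited token of the header with an optional `/1` or `/2` suffix, and the whole R2 file is indexed in memory.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from FASTQ lines (4 lines per record assumed)."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return
        header, seq, _plus, qual = record
        yield header, seq, qual


def read_id(header):
    """Reduce a header like '@name/1 extra' to 'name' (assumed naming scheme)."""
    parts = header[1:].split()
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def pair_reads(lines_r1, lines_r2):
    """Return (r1_record, r2_record) tuples for IDs found in both files, in R1 order."""
    r2_by_id = {read_id(rec[0]): rec for rec in read_fastq(lines_r2)}
    pairs = []
    for rec in read_fastq(lines_r1):
        mate = r2_by_id.get(read_id(rec[0]))
        if mate is not None:
            pairs.append((rec, mate))
    return pairs


# Tiny in-memory example: only read "b" has a mate in both files.
r1 = "@a/1\nAAAA\n+\nIIII\n@b/1\nCCCC\n+\nIIII\n".splitlines(keepends=True)
r2 = "@b/2\nGGGG\n+\nIIII\n@c/2\nTTTT\n+\nIIII\n".splitlines(keepends=True)

for first, second in pair_reads(r1, r2):
    print(first[0], second[0])
```

Writing the two halves of each pair to new R1 and R2 files in the same order gives synchronized inputs for a paired-end mapper.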


                • #9
                  Have a look at this or this thread on biostars for a number of ways to do this.
