Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • removal of unpaired reads

    Hi I would like to know is there any tool for removing the unpaired fastq reads from a huge set of paired end fastq data?

  • #2
    It should be a pretty simple scripting job, but the details will depend on how you have your data (e.g. one single file with read name suffixes like /1 /2, or separate files for forward and reverse reads).

    Comment


    • #3
      Hi It is in one single file having forward and reverse (1 & 2)

      Comment


      • #4
        You said huge, but how many reads are there (roughly)? One million? Ten million? More?

        Would a Python script using Biopython be useful? It would help to show a small sample of the reads (say the first ten) to ensure I understand your read naming convention. You can wrap the example with [ code ] and [ /code ] tags to format it nicely on the forum.

        Comment


        • #5
          hi maubp .. there are around 50 million reads

          Comment


          • #6
            About 50 million reads... that would make any approach using a list of IDs in memory rather tricky.

            Can we assume the read pairs are next to each other, e.g. you started out with this:

            Code:
            read1.f
            read1.r
            read2.f
            read2.r
            read3.f
            read3.r
            read4.f
            read4.r
            read5.f
            read5.r
            and you now have just a subset but in the same order, e.g.
            Code:
            read1.f
            read1.r
            read2.r
            read3.f
            read4.f
            read4.r
            read5.r
            from which the only complete pairs remaining are:
            Code:
            read1.f
            read1.r
            read4.f
            read4.r
            If we can assume this (that pairs if present are consecutive) this should be pretty easy to do with minimal RAM requirements.

            If the reads are in a random order, things will be much harder...

            Comment


            • #7
              hi the read order is in

              0/1
              0/2
              0/1
              0/2
              0/1
              0/2

              Comment


              • #8
                This is a general slow version (you can switch "fastq" to any other supported file format like "fasta" or "qual"):

                Code:
                from Bio import SeqIO
                
                mixed_file = "mixed.fastq"
                paired_file = "paired.fastq"
                
                def get_paired(iterator):
                    prev = None
                    for curr in iterator:
                        if curr.id.endswith("/1"):
                            prev = curr
                        elif not curr.id.endswith("/2"):
                            raise ValueError("Expect IDs to end /1 and /2")
                        elif prev and prev.id == curr.id[:-2] + "/1":
                            yield prev
                            yield curr
                            prev = None
                
                records = get_paired(SeqIO.parse(mixed_file, "fastq"))
                count = SeqIO.write(records, paired_file, "fastq")
                print "Saved %i records (%i pairs)" % (count, count/2)
                This next script should be about four times faster, but is only suitable for FASTQ files:

                Code:
                from Bio.SeqIO.QualityIO import FastqGeneralIterator
                
                mixed_file = "mixed.fastq"
                paired_file = "paired.fastq"
                
                out_handle = open(paired_file, "w")
                prev = None
                for curr in FastqGeneralIterator(open(mixed_file, "rU")):
                    if curr[0].split()[0].endswith("/1"):
                        prev = curr
                    elif not curr[0].split()[0].endswith("/2"):
                        raise ValueError("Expect IDs to end /1 and /2,\n%s" % curr[0])
                    elif prev and prev[0].split()[0] == curr[0].split()[0][:-2] + "/1":
                        out_handle.write("@%s\n%s\n+\n%s\n" % prev)
                        out_handle.write("@%s\n%s\n+\n%s\n" % curr)
                        prev = None
                out_handle.close()
                See also http://news.open-bio.org/news/2009/0...on-fast-fastq/

                This was written and tested using Biopython 1.54beta, should be fine on Biopython 1.51 or later. I've only used a 500MB test file - it took about 30s.

                If you are willing to make more assumptions about the file layout, or skip the minimal error checking, I could make this faster still - but I would need to see a sample of the data as suggested before (e.g. the first ten reads).

                Comment


                • #9
                  Another way to do this: first run tophat, and check the accepted_hits.sam and delete the line with "*"
                  HWI-EAS266_0005:1:32:2465:21083#0 177 chr1 120 255 42M = 199385343 0 AAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACC ^^[YVRK[bb^bbbb_babbb^bbbbbbaabbcbbb`bbbba NM:i:1
                  HWI-EAS266_0005:1:59:17297:6482#0 177 chr1 120 255 42M = 199385343 0 AAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACC ORXYaa\__bbbTbabbb`b[bbb``abbbbbb_bbb`b^bb NM:i:1
                  HWI-EAS266_0005:1:3:4093:8164#0 145 chr1 550 255 42M = 241327181 0 GTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG Rbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb NM:i:1
                  HWI-EAS266_0005:1:60:8918:19240#0 145 chr1 550 255 42M = 241327181 0 GTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG BBBBa_\`a_`a[a[_aaaXTa```a_a[a_]]_`T]`__aa NM:i:1
                  HWI-EAS266_0005:1:79:5108:21091#0 73 chr1 1146 1 42M * 0 0 GCGCCCCCTGCTGGCGCCGGGGCGCTGCAGGGCCCTCTTGCT aQ_a`^ca_`a__c]`\cbbbbbb^bbb]ba^abbbbbbbbb NM:i:1
                  HWI-EAS266_0005:1:8:7442:18432#0 73 chr1 1168 1 42M * 0 0 CACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCC _Sb_ab]bbb^_ba^b_b\`V]\b``b_ababbbababbbb` NM:i:0
                  HWI-EAS266_0005:1:35:18364:16230#0 147 chr1 1303 255 42M = 1458 0 TTCTTGCTCCAACAGTAGTGGCGGATTATAGGGAAACACCCG B_bb`bbcbbbbbcbbbbbcbbbbb`bbbbbbbabababbbb NM:i:2
                  HWI-EAS266_0005:1:26:17783:8912#0 137 chr1 1585 0 42M * 0 0 GGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCC BBBBBababaaaa_bb]bb`bbbacbbbcc_bc`bbcbbbbb NM:i:1

                  Is this method reasonable?

                  Comment


                  • #10
                    I have tried to run your script but the problem is that my file does not have the /1 or /2 ending. In my case the file is told with 1 or 2 before the N on the last letters of the file (bold) have two separate fastq files which look like:
                    File 1:
                    @M00289:3:000000000-A752F:1:1102:13093:3364 1:N:0:20
                    GGCTGTGATCACGGGGCAAAACAGCCACTTCATTAACTTTCAGTAAGGGCTGCAAGGTAACTCCAGCAGGTGCCTTTTGTGTTTCACTCAACCCTAGGTCAAAACGTTGTTCACTCATGGCTTGTTCCAACCAAGGCGACTCTAAGG
                    +
                    BCCCBCFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHGHHHHHGHGGHHHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHGHHHGHHHHHHHHHHHHHGGHHHHHHGGHHGHGGGGGHHHHH
                    @M00289:3:000000000-A752F:1:1110:7454:18361 1:N:0:20
                    ATCTCAACGCGTCCAAATGAGGCATCGCTGTATTCAGGTTACTTTACATAAGAGTTTTTATGTTAAATAGGACTAAAAATATACTCTAATTTTAGAGTTTTCTTTTAGGTATGATGTAAAAACATACAAGCCTAAGAGTTTAATTTAAAGG
                    +
                    CCCCCFFFDDDDGGGGGGGGGGGGHHGGHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHGHHHHHHHHHHGHHHHHHGG

                    FIle 2:

                    @M00289:3:000000000-A752F:1:1112:9370:8062 2:N:0:20
                    CCACACCCCCTGCAGAGCGTTCTCGCAGACACAGTCCGCAAAGCCAGTGCCGACTTGAGCCACCTTGACCAGTGTTTTTATTAGAACTAGAAACTAGAGGATTTGTTGCAC
                    +
                    BBCDDCCEDEEEGGGGGGGGGGGHGGGGGHHHHHHHHGGGGGHHHFHGHHHGCGGGHHHHHHHHHHHGHHHHHGGHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

                    @M00289:3:000000000-A752F:1:1114:14694:7357 2:N:0:20
                    TAATTTAACTGTAGTTCATCCGCATTTTCCCCTCTAATGCCCCAGTTTTCTTTAGGGGTTTCAAATATAGTTATCTCAACATCAT
                    +
                    BBBBCFFFFFFFGGGGGGGGGGGGGGHHHHHHHGHHHHHHHHGGGHHHHHHHHHHHGGEGGHHHHHHHHHHHHHHHHHHHHHHFH
                    @M00289:3:000000000-A752F:1:1102:13093:3364 2:N:0:20
                    CCTTAGAGTCGCCTTGGTTGGAACAAGCCATGAGTGAACAACGTTTTGACCTAGGGTTGAGTGAAACACAAAAGGCACCTGCTGGAGTTACCTTGCAGCCCTTACTGAAAGTTAATGAAGTGGCTGTTTTGCCCCGTGATCACAGCC
                    +
                    BBBBCFFFFFCDGGGGGGGGGGHHHHHHGHHHHHHHHHHHHHGHHHGHGHHGHHHHHHGHHGHHHHHHHHGHHHGHHHGHHHHHHHHHHHHHHHHHFHHHGGHHHHHHHHHHHHHHHHHGHHHGHG0FGH03GHGF<DC2FFGHHHH

                    As you can see the reads are not correlative in each file so I supposse this makes the thing more complicated. My files are not very big (5.2Mb and 5.4 Mb respectively) so I wonder if you know a way for selecting only the ones which have data for both ends.

                    This next script should be about four times faster, but is only suitable for FASTQ files:

                    Code:
                    from Bio.SeqIO.QualityIO import FastqGeneralIterator
                    
                    mixed_file = "mixed.fastq"
                    paired_file = "paired.fastq"
                    
                    out_handle = open(paired_file, "w")
                    prev = None
                    for curr in FastqGeneralIterator(open(mixed_file, "rU")):
                        if curr[0].split()[0].endswith("/1"):
                            prev = curr
                        elif not curr[0].split()[0].endswith("/2"):
                            raise ValueError("Expect IDs to end /1 and /2,\n%s" % curr[0])
                        elif prev and prev[0].split()[0] == curr[0].split()[0][:-2] + "/1":
                            out_handle.write("@%s\n%s\n+\n%s\n" % prev)
                            out_handle.write("@%s\n%s\n+\n%s\n" % curr)
                            prev = None
                    out_handle.close()
                    See also http://news.open-bio.org/news/2009/0...on-fast-fastq/

                    This was written and tested using Biopython 1.54beta, should be fine on Biopython 1.51 or later. I've only used a 500MB test file - it took about 30s.

                    If you are willing to make more assumptions about the file layout, or skip the minimal error checking, I could make this faster still - but I would need to see a sample of the data as suggested before (e.g. the first ten reads).[/QUOTE]
                    Last edited by chariko; 08-05-2014, 07:15 AM.

                    Comment


                    • #11
                      Originally posted by chariko View Post
                      I have tried to run your script but the problem is that my file does not have the /1 or /2 ending. In my case the file is told with 1 or 2 before the N on the last letters of the file (bold) have two separate fastq files which look like: ...

                      As you can see the reads are not correlative in each file so I supposse this makes the thing more complicated. My files are not very big (5.2Mb and 5.4 Mb respectively) so I wonder if you know a way for selecting only the ones which have data for both ends.
                      Illumina changed their naming schema. There are scripts out there to convert this back to the /1 and /2 style (even as Unix one-line commands with sed or awk).

                      I have a related Python script (which has a Galaxy wrapper) which knows more naming schemes, but it expects the reads to be sorted/interleaved - yours are more mixed up:

                      Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


                      Do you have the original Illumina FASTQ files before whatever processing was done to them?

                      Peter

                      Comment


                      • #12
                        Originally posted by maubp View Post
                        Illumina changed their naming schema. There are scripts out there to convert this back to the /1 and /2 style (even as Unix one-line commands with sed or awk).

                        I have a related Python script (which has a Galaxy wrapper) which knows more naming schemes, but it expects the reads to be sorted/interleaved - yours are more mixed up:

                        Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


                        Do you have the original Illumina FASTQ files before whatever processing was done to them?

                        Peter
                        Because I obtained a huge number of duplicates in my original fastq file (it is a de novo experiment)I first removed them in each file separately and that resulted in a different number of reads and also mixed all of them. Any clue?

                        Comment


                        • #13
                          Either use a pair-aware filtering pipeline, or, I suggest:

                          1. Interleave the pairs (one FASTQ file with r1, r2, r1, r2, etc)
                          2. Filter the interleaved file (preserving the original order)
                          3. Run https://github.com/peterjc/pico_gala...aired_unpaired

                          Comment


                          • #14
                            Originally posted by chariko View Post
                            I have tried to run your script but the problem is that my file does not have the /1 or /2 ending. In my case the file is told with 1 or 2 before the N on the last letters of the file (bold) have two separate fastq files
                            You may want to try Pairfq for a more general solution (handling multi-line Fasta/q, compressed or uncompressed, multiple Illumina formats). The specific command you would want is pairfq makepairs, and this is described on the wiki for that command.

                            Comment


                            • #15
                              Thanks SES for your suggestion. I will try it also.

                              Anyway, I finally made it work by doing the following:

                              First I changed my ending form 1:N to /1 and 2:N to /1 and /2 with:

                              gawk '{print((NR % 4 == 1) ? $1"/1" : $0)}' Sample1_R1.fq > Sample1_newtags_R1.fq
                              gawk '{print((NR % 4 == 1) ? $1"/2" : $0)}' Sample1_R2.fq > Sample1_newtags_R2.fq

                              (https://wikis.utexas.edu/display/bio...nux+one-liners).

                              I interleaved the pairs using the script interleave_fastq.py (https://gist.github.com/ngcrawford/2232505).

                              I filter my fastq file using the following script:
                              (https://github.com/brentp/bio-playgr...r/reads-utils/)

                              the problem is that the filtered file does not mantain the order of paired reads so you will have it to do it manually (in my case with Excel).

                              And finally after installing galaxy locally I run the galaxy script made by maubp.
                              Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.


                              When running again fastqc with my files, the duplication levels where ok.

                              Thanks a lot for your suggestions.
                              Last edited by chariko; 08-08-2014, 06:44 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 08:47 AM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              57 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              53 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X