SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Mate-Pair sequencing versa Bioinformatics 0 02-09-2011 11:51 PM
Mate-Pair biotinylation djlarsen Sample Prep / Library Generation 11 01-27-2011 10:43 AM
Interleaved mate pair fastq after quality filtering natstreet Illumina/Solexa 1 08-10-2010 03:19 AM
file format headaches - producing interleaved fastq natstreet SOLiD 1 07-28-2010 01:04 AM
Difference between mate pair and pair end bassu General 2 06-19-2010 06:13 AM

Reply
 
Thread Tools
Old 07-28-2010, 01:12 AM   #1
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
Default Interleaved mate pair fastq after quality filtering

I am looking for a script that can produce an interleaved fastq file and a file of orphaned [single end] reads from two input fastq files. The two read pairs have been quality filtered using fastx-toolkit 'fastq_quality_filter', which breaks the exact pairing that previously existed.

Has anyone come across a script to do this rather than just to interleave to perfectly paired sets of fastq files?

This is a cross-post (apologies) but I wasn't sure where it fitted best.
natstreet is offline   Reply With Quote
Old 07-28-2010, 06:46 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

I once wrote a script to do just that a while back using Biopython's Bio.SeqIO.index() function for random access to FASTQ files. This holds the offsets in memory so there is a limit on how many reads you can work with before having to store the index on disk, e.g. using SQLite.

How many reads do you have?

You'd also need to specify how to match up the read pairs, for example are you using the /1 and /2 suffix convention?
maubp is offline   Reply With Quote
Old 07-28-2010, 11:38 PM   #3
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
Default #Reads

I have about 250 millions reads. I concatenated the reads from all lanes into a single file but I could go back and work on the per-lane files. My machine has 96 GB RAM so it should be able to handle it I think.

The pairs are specified by /1 /2 for the Illumina data I have and _F _R for the SOLiD data - but it's mainly the Illumina data I'm interested in doing this for.

Last edited by natstreet; 07-28-2010 at 11:41 PM.
natstreet is offline   Reply With Quote
Old 07-29-2010, 03:01 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Just to be clear, your two input files will be for the forward and reverse reads?
maubp is offline   Reply With Quote
Old 07-29-2010, 03:03 AM   #5
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
Default

Yes. Ideally I want to input read1 and read2 (forward and reverse reads) that have each been pre-filtered using the fastx fastq_quality_filter tool and to then output an interleaved file of proper pairs and a file of the reads that have been orphaned by the quality filtering.
natstreet is offline   Reply With Quote
Old 07-29-2010, 05:10 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Personally I would process the reads in per-lane batches, which will help greatly with reducing the scale of the problem (and let you spread the work over multiple machines or cores).

I couldn't find the original script I used, so I rewrote one using a straightforward but fairly memory naive approach:
Code:
from Bio import SeqIO #Biopython 1.54 or later needed

#######################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_pairs_filename = "out_interleaved_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = "/1"
r_suffix = "/2"

#######################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(name):
        """Remove the suffix from a forward read name."""
        assert name.endswith(f_suffix), name
        return name[:f_suffix_crop]
else:
    #No suffix, don't need a function to fix the name
    f_name = None

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(name):
        """Remove the suffix from a reverse read name."""
        assert name.endswith(r_suffix), name
        return name[:r_suffix_crop]
else:
    #No suffix, don't need a function to fix the name
    r_name = None
    
print "Indexing forward file..."
forward_dict = SeqIO.index(input_forward_filename, "fastq", key_function=f_name)

print "Indexing reverse file..."
reverse_dict = SeqIO.index(input_reverse_filename, "fastq", key_function=r_name)

print "Ouputing pairs and forward orphans..."
pair_handle = open(output_pairs_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for key in forward_dict:
    if key in reverse_dict:
         pair_handle.write(forward_dict.get_raw(key))
         pair_handle.write(reverse_dict.get_raw(key))
    else:
         orphan_handle.write(forward_dict.get_raw(key))
pair_handle.close()
print "Ouputing reverse orphans..."
for key in reverse_dict:
    if key not in forward_dict:
         orphan_handle.write(reverse_dict.get_raw(key))
orphan_handle.close()
print "Done"
Given you have a big memory machine, this may work fine as is.

Alternatively, this second version should need much less memory but produces three files forward/reverse/orphaned, so you'd need to interleave the forward/reverse files afterwards. This makes three passes though the files and just keeps a list of IDs in memory (as a Python set for speed) rather what the first script does which holds a map of IDs to file offsets (as a Python dictionary). The records are output in the order found in the input files, so as long as the input files follow the same sorting, doing the interleaving step is trivial (and easy enough to add on in Python).

Code:
from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later

##########################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_paired_forward_filename = "out_forward_pairs.fastq"
output_paired_reverse_filename = "out_reverse_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = "/1"
r_suffix = "/2"

##########################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(title):
        """Remove the suffix from a forward read name."""
        name = title.split()[0]
        assert name.endswith(f_suffix), name
        return name[:f_suffix_crop]
else:
    def f_name(title):
        return title.split()[0]

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(title):
        """Remove the suffix from a reverse read name."""
        name = title.split()[0]
        assert name.endswith(r_suffix), name
        return name[:r_suffix_crop]
else:
    def r_name(title):
        return title.split()[0]

print "Scaning reverse file to build list of names..."    
reverse_ids = set()
paired_ids = set()
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    reverse_ids.add(r_name(title))

print "Processing forward file..."
forward_handle = open(output_paired_forward_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = f_name(title)
    if name in reverse_ids:
        #Paired
        paired_ids.add(name)
        reverse_ids.remove(name) #frees a little memory
        forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
forward_handle.close()
del reverse_ids #frees memory, although we won't need more now

print "Processing reverse file..."
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    name = r_name(title)
    if name in paired_ids:
        #Paired
        reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
orphan_handle.close()
reverse_handle.close()
print "Done"

Last edited by maubp; 08-10-2010 at 03:17 AM. Reason: Added second script; correction from hl450; show name in asserts
maubp is offline   Reply With Quote
Old 07-29-2010, 10:38 AM   #7
natstreet
Member
 
Location: Sweden

Join Date: Nov 2009
Posts: 83
Smile

That's brilliant thanks! It's also great to have some example python scripts to help me learn from

I really appreciate this and hopefully others will also.
natstreet is offline   Reply With Quote
Old 08-02-2010, 08:06 AM   #8
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

wow I think it's time someone really sat down and made Fastx-toolkit "pair-safe"
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
My blog
Twitter
Zigster is offline   Reply With Quote
Old 08-09-2010, 01:12 PM   #9
hl450
Junior Member
 
Location: IN, USA

Join Date: Jun 2010
Posts: 8
Default

Quote:
Originally Posted by maubp View Post
Code:
from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later

##########################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_paired_forward_filename = "out_forward_pairs.fastq"
output_paired_reverse_filename = "out_reverse_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = "/1"
r_suffix = "/2"

##########################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(title):
        """Remove the suffix from a forward read name."""
        name = title.split()[0]
        assert name.endswith(f_suffix)
        return name[:f_suffix_crop]
else:
    def f_name(title):
        return title.split()[0]

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(title):
        """Remove the suffix from a reverse read name."""
        name = title.split()[0]
        assert name.endswith(r_suffix)
        return name[:r_suffix_crop]
else:
    def r_name(title):
        return title.split()[0]

print "Scaning reverse file to build list of names..."    
reverse_ids = set()
paired_ids = set()
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    reverse_ids.add(r_name(title))

print "Processing forward file..."
forward_handle = open(output_paired_forward_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = f_name(title)
    if name in reverse_ids:
        #Paired
        paired_ids.add(name)
        reverse_ids.remove(name) #frees a little memory
        forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
forward_handle.close()
del reverse_ids #frees memory, although we won't need more now

print "Processing reverse file..."
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = r_name(title)
    if name in paired_ids:
        #Paired
        reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
orphan_handle.close()
reverse_handle.close()
print "Done"

I think there is an error in the code it should be input_reverse_filename rather than input_foward_filename

Code:
print "Processing reverse file..."
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
hl450 is offline   Reply With Quote
Old 08-10-2010, 03:13 AM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by hl450 View Post
I think there is an error in the code it should be input_reverse_filename rather than input_foward_filename
Yes, that does look better. I've updated the example. Thanks
maubp is offline   Reply With Quote
Old 07-05-2011, 01:05 AM   #11
menenuh
Junior Member
 
Location: sweden

Join Date: Jan 2010
Posts: 8
Default casava 1.8

Hello,
I have been using this script with great ease and pleasure, till Casava 1.8 update. Now the read id structure is changed, the script no longer recognizes the pair information. Do you have an update for this script?
menenuh is offline   Reply With Quote
Old 07-05-2011, 01:40 AM   #12
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by menenuh View Post
Hello,
I have been using this script with great ease and pleasure, till Casava 1.8 update. Now the read id structure is changed, the script no longer recognizes the pair information. Do you have an update for this script?
You've run into the very problem I predicted on this thread announcing the Casava 1.8 changes:
http://seqanswers.com/forums/showthread.php?t=8895

Simple answer, just look at the name (first word before the space) and they should match. i.e. There is no suffix - change these lines:
Code:
f_suffix = "/1"
r_suffix = "/2"
to:
Code:
f_suffix = ""
r_suffix = ""
and the script should work.

A slightly more robust script would look at the read number field in the second word.
maubp is offline   Reply With Quote
Old 03-06-2012, 05:07 AM   #13
nposnien
Member
 
Location: Göttingen, Germany

Join Date: May 2011
Posts: 13
Default

Hi,
@maubp: thank you very much for the script! I have no experience in programming but I guess that this script https://github.com/lexnederbragt/den...leave_pairs.py is e is exactly what I need (produce paired/singleton file from trimmed/filtered paired-end reads). However, my data is also based on CASAVA 1.8. Thus, I guess that I will need to change the script as recommended in the previous post. But my question is: How does the script identify the pairs? Based on the file they come from (forward or reverse)?
Thank you very much!!
nposnien is offline   Reply With Quote
Old 03-07-2012, 07:15 AM   #14
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

If your FASTQ pairs come in separate forward and reverse files that tells you which is which. However, for the CASAVA 1.8 naming you could/should also look for 1 or 2 after the first space.
maubp is offline   Reply With Quote
Old 03-19-2012, 09:22 PM   #15
odoyle81
Member
 
Location: United States

Join Date: Aug 2011
Posts: 31
Default

Thanks guys for these scripts! I was struggling trying to create one from scratch and these have been a huge help!
odoyle81 is offline   Reply With Quote
Old 03-22-2012, 03:06 AM   #16
chrishah
Member
 
Location: Oslo

Join Date: Jul 2011
Posts: 18
Default

Hi guys,

Thanks for this script! I used it successfully both on standard fastq as well as on casava format (with the appropriate changes in the script)

But now I hit a wall again:

I have been doing some error correction and this added some information to the header so afterwards the script does not work anymore.

my data looks now like this:
@PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
NACAGAACTCATTTGGCAGGCAAAACCCTGAGACAGATTCTGACAGGAAGTGGATACCTGATGTGTTGTATTACCT
+PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1
BGIFHIMLMM_______W_Y_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
TATAGACCCTTTTATCTATCATTTCTTCACAAGCTTAGGACAGAAGACTCTTATTGTGCATGATAAAGTAAAAGTC
+PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

I tried to change the script in different ways, but it always results basically in the following error message:

Traceback (most recent call last):
File "/usit/titan/u1/chrishah/python/pair_up_reads.py", line 49, in <module>
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
File "/site/VERSIONS/python-2.6.2/lib/python2.6/site-packages/Bio/SeqIO/QualityIO.py", line 907, in FastqGeneralIterator
raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.

I could of course just remove the parts from the header that have been added by the errorcorrection tool and use the normal script...
But I really think it should be easy to adjust the script so that it works for my data - I just dont know enough about Python...

Can anyone give me a hint how to change the script? Any tips are highly appreciated!

Much obliged!
Christoph
chrishah is offline   Reply With Quote
Old 03-22-2012, 03:12 AM   #17
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by chrishah View Post
Hi guys,

Thanks for this script! I used it successfully both on standard fastq as well as on casava format (with the appropriate changes in the script)

But now I hit a wall again

I have been doing some error correction and this added some information to the header so afterwards the script does not work anymore.
You should either edit BOTH the @ and the + lines like this,
Code:
@PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
NACAGAACTCATTTGGCAGGCAAAACCCTGAGACAGATTCTGACAGGAAGTGGATACCTGATGTGTTGTATTACCT
+PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
BGIFHIMLMM_______W_Y_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
TATAGACCCTTTTATCTATCATTTCTTCACAAGCTTAGGACAGAAGACTCTTATTGTGCATGATAAAGTAAAAGTC
+PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
or use a blank + line like this (which is recommended as the file is smaller):
Code:
@PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
NACAGAACTCATTTGGCAGGCAAAACCCTGAGACAGATTCTGACAGGAAGTGGATACCTGATGTGTTGTATTACCT
+
BGIFHIMLMM_______W_Y_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
TATAGACCCTTTTATCTATCATTTCTTCACAAGCTTAGGACAGAAGACTCTTATTGTGCATGATAAAGTAAAAGTC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
The Biopython parser is complaining about the inconsistency.
maubp is offline   Reply With Quote
Old 03-22-2012, 05:32 AM   #18
chrishah
Member
 
Location: Oslo

Join Date: Jul 2011
Posts: 18
Default

Thanks for your rapid reply maubp!

Have tried the second solution you suggested and it works fine with the standard version of the script!

I will use this approach to proceed. Thanks again!

take care,
Christoph
chrishah is offline   Reply With Quote
Old 04-17-2012, 10:12 AM   #19
vkartha
Member
 
Location: Boston

Join Date: Feb 2012
Posts: 28
Default

I have data generated from Illumina Hiseq 2000 which uses 1:N and 2:N as opposed to /1 and /2 as suffices.
Ex:

@D8CXHXP1:222:C0HVHACXX:2:1101:1836:2165 2:N:0:ACAGTG

Do I just follow replacing those 2 lines of code with:

f_suffix = ""
r_suffix = ""

Would it work then?? Sorry I don't understand
vkartha is offline   Reply With Quote
Old 04-18-2012, 01:10 AM   #20
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by vkartha View Post
Would it work then?? Sorry I don't understand
Yes, i.e.

Code:
from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later

#######################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_pairs_filename = "out_interleaved_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = ""
r_suffix = ""
#For older Illumina files use this instead:
#f_suffix = "/1"
#r_suffix = "/2"

#######################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(title):
        """Remove the suffix from a forward read name."""
        name = title.split()[0]
        assert name.endswith(f_suffix), name
        return name[:f_suffix_crop]
else:
    def f_name(title):
        return title.split()[0]

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(title):
        """Remove the suffix from a reverse read name."""
        name = title.split()[0]
        assert name.endswith(r_suffix), name
        return name[:r_suffix_crop]
else:
    def r_name(title):
        return title.split()[0]

print "Scanning reverse file to build list of names..."    
reverse_ids = set()
paired_ids = set()
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    reverse_ids.add(r_name(title))

print "Processing forward file..."
forward_handle = open(output_paired_forward_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = f_name(title)
    if name in reverse_ids:
        #Paired
        paired_ids.add(name)
        reverse_ids.remove(name) #frees a little memory
        forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
forward_handle.close()
del reverse_ids #frees memory, although we won't need more now

print "Processing reverse file..."
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    name = r_name(title)
    if name in paired_ids:
        #Paired
        reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
orphan_handle.close()
reverse_handle.close()
print "Done"
A more rigorous script would verify the 1 or 2 as well.

Last edited by maubp; 04-24-2012 at 12:59 AM. Reason: Correcting messed up example - I'd copy & pasted the wrong bits from earlier posts.
maubp is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:40 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO