Seqanswers Leaderboard Ad

**maubp** · 05-30-2012, 11:54 PM

Originally posted by MattB View Post

Just wondering if anyone has a script to do the opposite to the scripts in this thread, ie. split an already interleaved forward and reverse file into separate forward and reverse files (for Casava 1.8 and beyond read naming)?

I've got a tool (fastq_paired_unpaired) in the Galaxy Tool Shed to do just that (also handling the older Illumina /1 and /2 names, Sanger naming, etc). It is just a Python script internally but since it uses the Galaxy FASTQ library perhaps not ideal for your needs? http://toolshed.g2.bx.psu.edu/

**MattB** · 05-31-2012, 02:02 AM

Thanks, I had actually planned to use the Galaxy option but our local installation is undergoing some maintenance at the moment hence I was after a simple script alternative.

The Perl script that Wallysb01 pointed me to seems to work nicely in any case!

Matt

**cwzkevin** · 11-07-2012, 02:44 PM

Hi, maubp
Thank you for your python code. Just a question, if the sequences order in the 2 input fastq files are randomized, then the sequences order in the output two paired fastq files are not the same. The two output fastq files are indeed paired files, just the sequence orders are not the same. Can you please update your code to make the output paired fastq files with the same sequence order if the input fastq files are randomized? Thank you.

Originally posted by maubp View Post

Alternatively, this second version should need much less memory but produces three files forward/reverse/orphaned, so you'd need to interleave the forward/reverse files afterwards. This makes three passes though the files and just keeps a list of IDs in memory (as a Python set for speed) rather what the first script does which holds a map of IDs to file offsets (as a Python dictionary). The records are output in the order found in the input files, so as long as the input files follow the same sorting, doing the interleaving step is trivial (and easy enough to add on in Python).

Code:

from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later

##########################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_paired_forward_filename = "out_forward_pairs.fastq"
output_paired_reverse_filename = "out_reverse_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = "/1"
r_suffix = "/2"

##########################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(title):
        """Remove the suffix from a forward read name."""
        name = title.split()[0]
        assert name.endswith(f_suffix), name
        return name[:f_suffix_crop]
else:
    def f_name(title):
        return title.split()[0]

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(title):
        """Remove the suffix from a reverse read name."""
        name = title.split()[0]
        assert name.endswith(r_suffix), name
        return name[:r_suffix_crop]
else:
    def r_name(title):
        return title.split()[0]

print "Scaning reverse file to build list of names..."    
reverse_ids = set()
paired_ids = set()
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    reverse_ids.add(r_name(title))

print "Processing forward file..."
forward_handle = open(output_paired_forward_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = f_name(title)
    if name in reverse_ids:
        #Paired
        paired_ids.add(name)
        reverse_ids.remove(name) #frees a little memory
        forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
forward_handle.close()
del reverse_ids #frees memory, although we won't need more now

print "Processing reverse file..."
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    name = r_name(title)
    if name in paired_ids:
        #Paired
        reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
orphan_handle.close()
reverse_handle.close()
print "Done"

**maubp** · 11-07-2012, 04:05 PM

Originally posted by cwzkevin View Post

Hi, maubp
Thank you for your python code. Just a question, if the sequences order in the 2 input fastq files are randomized, then the sequences order in the output two paired fastq files are not the same. The two output fastq files are indeed paired files, just the sequence orders are not the same. Can you please update your code to make the output paired fastq files with the same sequence order if the input fastq files are randomized? Thank you.

If the order is randomised, you can either presort both files (and then you are back in the original situation), or use an indexing strategy (you could use the SeqIO indexing functions in Biopython for this if you wanted to).

One way to do the sorting is to convert your FASTQ file into a tab-separated file, use the Unix sort command, and then turn the data back into FASTQ. The trick here is the Unix sort command sorts at the line level, so we must transform the data to be one line per record.

There is probably an 'off the shelf' solution to this, perhaps using biopieces?

Or go back and find out why your FASTQ files are out of order, and fix that

**Daniel1977** · 03-20-2013, 03:11 AM

how about fasta files?

Hi,
could anyone please modify the script so that it works with fasta files instead of fastq files?

many many thanks!

**safina** · 03-26-2015, 08:57 PM

Error: Sequence and quality captions differ.

I ran the script but its giving me the error:
Sequence and quality captions differ.

Can anyone help me with this error for the script posted above.???

**maubp** · 03-26-2015, 11:38 PM

Originally posted by safina View Post

I ran the script but its giving me the error:
Sequence and quality captions differ.

Can anyone help me with this error for the script posted above.???

In FASTQ, the @ lines and + lines should be consistent (or the plus line can leave out the caption completely which is preferable as it otherwise is a waste of disk space).

It sounds like somehow your FASTQ files have been messed up

**safina** · 03-27-2015, 11:17 PM

Im sending you some of my fastq file reads. pls see.

==> forward_sequences.fastq <==
@SRR1562087.10.1/1
GAGCTAGATCAGCACCATATATTACACGATGATCAGCTGTAACATTTACCTGCATCTGGTTCTTCATTCCTATCCGACCATCCTTGG
+
JJJJJJIIJJJJJJJJIJJJJJJJJJJJJJJJIJJJJJJJGIIJJJJIJJJJJJJJJIJJJJDHIHHHHHHHFDFFDDDDDDDDD>C
@SRR1562087.11.1/1
AGGTTGACTATGGTCCAGGCCATGCCAGGAGAGCAACCGAAAACAGAGAGAACGGTAAGCCAGGAGAAGAACAGTATGAGTATATAG
+
IJJGHIJIIIFIBHHGAFHGGIHJIJGJEGIGGGHGIJJJJHHGFEFEDACEEDDBDBCCCDDDDDDBDDDCDDCADDDCCCDDDDD
@SRR1562087.15.1/1
TAACATCCACAATCTCCTTCTACCCAAGAAGTCTGGAACTTCAGCATCAAAGGCTGGTGATGACGACAACTAATCCATTTACTGAAT

==> reverse_sequences.fastq <==
@SRR1562087.7.2/2
CCTGTAGATATACGTACTGCCAAAGGGTAGATAGTTGCCCATCTCAGAAAACACAACTTCAACAGCCAAGATTAATATCCATGTGAT
+
IJJJGGJBHIJJGHHHIIHJJGJGJIIDFHIJIJJJGHJJJJJJJIJGIGH@FHJIJIHIIIHHH=BDFFAEECCEEFDEDDCDCA>
@SRR1562087.9.2/2
GTAATCCAAATAAGGTATACTCACTCATCGGAGGATTTTGTGCTTCCCCTGTGAATTTCCACGCTAAGGATGGCTCCGGCTATAAAT
+
JIJIIJJJGGIIJIBC@FH@HHJGIJGCHGIEGIFHDFHJIJIJIHHIIIIJGGHHHHHCDDFDDDBDDDDDDDCDBDDBD@CDCEE
@SRR1562087.11.2/2
GAAACACTGATTGGTTCACGTATCCAGGTGTATGGACCACCTATATACTCATACTGTTCTTCTCCTGGCTTACCGTTCTCTCTGTTT

**maubp** · 03-28-2015, 04:21 AM

Your FASTQ files look fine from those snippets. Which script exactly are you using, what was the command line you used, and most importantly, the exact error message?

Update: Hang on, the names don't match, e.g. SRR1562087.10.1/1 vs SRR1562087.7.2/2

**SES** · 03-29-2015, 10:40 AM

Originally posted by MattB View Post

Thanks for quick reply. Actually, I'm looking for something even simpler, where I just have one interleaved fastq file that I want to split into separate forward and reverse. Probably should have posted in a separate thread since I'm not so worried about trimming in this case.

I'm sure it is some very simple python that I could manage in the end, such a script is probably already out there.

This is a late reply, but you can split interleaved files with Pairfq like so:

Code:

curl -sL git.io/pairfq_lite | perl - splitpairs -i interl.fq.gz -f forward.fq -r reverse.fq

Interleaving the pairs is also possible with a slightly different command:

Code:

curl -sL git.io/pairfq_lite | perl - joinpairs -f forward.fq -r reverse.fq -o interl.fq

Originally posted by Daniel1977 View Post

Hi,
could anyone please modify the script so that it works with fasta files instead of fastq files?

many many thanks!

These commands work with FASTA/Q, compressed or uncompressed. There is more information about the commands on the Pairfq wiki. The 'splitpairs' command should work on any system and has a tiny memory footprint, but the 'joinpairs' and 'makepairs' (for fixing broken paired-end files) require more memory. If you have something like 250 million reads, as described in the original post, I would follow the install instructions for this tool and use the "--index" flag to reduce the memory usage (for pairing files, that is).

**safina** · 03-30-2015, 12:07 AM

Hi. Im not getting any eroor infact the pairfq program putting all my reads in single.fq where as 1-paired.fq and 2_paired.fq remains empty. the link to the script is below:

GitHub - sestaton/Pairfq: Sync paired-end FASTA/Q files and keep singleton reads

https://github.com/sestaton/Pairfq

Sync paired-end FASTA/Q files and keep singleton reads - sestaton/Pairfq

the cammand was:

$ pairfq makepairs -f s_1_1_trimmed.fq \
-r s_1_2_trimmed.fq \
-fp s_1_1_trimmed_p.fq \
-rp s_1_2_trimmed_p.fq \
-fs s_1_1_trimmed_s.fq \
-rs s_1_2_trimmed_s.fq \
--index

**safina** · 03-30-2015, 12:20 AM

Hi i tries pairfq but it giving me empty paired files whereas putting all reads in unique. Can anyone help?

**safina** · 03-30-2015, 12:21 AM

Hi. Im not getting any eroor infact the pairfq program putting all my reads in single.fq where as 1-paired.fq and 2_paired.fq remains empty. the link to the script is below:

GitHub - sestaton/Pairfq: Sync paired-end FASTA/Q files and keep singleton reads

https://github.com/sestaton/Pairfq

Sync paired-end FASTA/Q files and keep singleton reads - sestaton/Pairfq

the cammand was:

$ pairfq makepairs -f s_1_1_trimmed.fq \
-r s_1_2_trimmed.fq \
-fp s_1_1_trimmed_p.fq \
-rp s_1_2_trimmed_p.fq \
-fs s_1_1_trimmed_s.fq \
-rs s_1_2_trimmed_s.fq \
--index

**safina** · 03-30-2015, 12:22 AM

Originally posted by maubp View Post

Your FASTQ files look fine from those snippets. Which script exactly are you using, what was the command line you used, and most importantly, the exact error message?

Hi. Im not getting any eroor infact the pairfq program putting all my reads in single.fq where as 1-paired.fq and 2_paired.fq remains empty. the link to the script is below:

GitHub - sestaton/Pairfq: Sync paired-end FASTA/Q files and keep singleton reads

https://github.com/sestaton/Pairfq

Sync paired-end FASTA/Q files and keep singleton reads - sestaton/Pairfq

the cammand was:

$ pairfq makepairs -f s_1_1_trimmed.fq \
-r s_1_2_trimmed.fq \
-fp s_1_1_trimmed_p.fq \
-rp s_1_2_trimmed_p.fq \
-fs s_1_1_trimmed_s.fq \
-rs s_1_2_trimmed_s.fq \
--index

**safina** · 03-30-2015, 02:32 AM

i got this error while running the above script:

File "new_pairing.py", line 51, in <module>
forward_handle = open(output_paired_forward_filename, "w")
NameError: name 'output_paired_forward_filename' is not defined

Topics	Statistics	Last Post
TIGR Systems Offer a Compact Alternative to CRISPR for Gene Editing by seqadmin Started by seqadmin, 03-03-2025, 01:15 PM	0 responses 154 views 0 likes	Last Post by seqadmin 03-03-2025, 01:15 PM
Highlights from AGBT 2025 – Part II by seqadmin Started by seqadmin, 02-28-2025, 12:58 PM	0 responses 237 views 0 likes	Last Post by seqadmin 02-28-2025, 12:58 PM
Highlights from AGBT 2025 – Part I by seqadmin Started by seqadmin, 02-24-2025, 02:48 PM	0 responses 607 views 0 likes	Last Post by seqadmin 02-24-2025, 02:48 PM
Selecting the Right AI Model for Bioinformatics Research by seqadmin Started by seqadmin, 02-21-2025, 02:46 PM	0 responses 263 views 0 likes	Last Post by seqadmin 02-21-2025, 02:46 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News