Thanks, I had actually planned to use the Galaxy option, but our local installation is undergoing some maintenance at the moment, hence I was after a simple script alternative.
The Perl script that Wallysb01 pointed me to seems to work nicely in any case!
Matt
Comment
-
Hi, maubp
Thank you for your Python code. Just a question: if the sequence order in the two input FASTQ files is randomized, then the sequence order in the two paired output FASTQ files is not the same. The two output files are indeed paired, but the reads appear in a different order in each. Could you please update your code so that the paired output files have the same sequence order even when the input files are randomized? Thank you.
Originally posted by maubp
Alternatively, this second version should need much less memory but produces three files (forward, reverse, and orphaned), so you'd need to interleave the forward and reverse files afterwards. This makes three passes through the files and just keeps a list of IDs in memory (as a Python set for speed), rather than what the first script does, which is to hold a map of IDs to file offsets (as a Python dictionary). The records are output in the order found in the input files, so as long as the input files follow the same sorting, doing the interleaving step is trivial (and easy enough to add on in Python).
Code:
from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later

##########################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_paired_forward_filename = "out_forward_pairs.fastq"
output_paired_reverse_filename = "out_reverse_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = "/1"
r_suffix = "/2"

##########################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(title):
        """Remove the suffix from a forward read name."""
        name = title.split()[0]
        assert name.endswith(f_suffix), name
        return name[:f_suffix_crop]
else:
    def f_name(title):
        return title.split()[0]

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(title):
        """Remove the suffix from a reverse read name."""
        name = title.split()[0]
        assert name.endswith(r_suffix), name
        return name[:r_suffix_crop]
else:
    def r_name(title):
        return title.split()[0]

# print(...) with a single string works on both Python 2 and Python 3
print("Scanning reverse file to build list of names...")
reverse_ids = set()
paired_ids = set()
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    reverse_ids.add(r_name(title))

print("Processing forward file...")
forward_handle = open(output_paired_forward_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = f_name(title)
    if name in reverse_ids:
        #Paired
        paired_ids.add(name)
        reverse_ids.remove(name) #frees a little memory
        forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
forward_handle.close()
del reverse_ids #frees memory, although we won't need more now

print("Processing reverse file...")
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    name = r_name(title)
    if name in paired_ids:
        #Paired
        reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
orphan_handle.close()
reverse_handle.close()
print("Done")
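The interleaving step mentioned in the quoted post could look roughly like the sketch below. It is not part of maubp's script: the names `read_fastq` and `interleave` are made up for illustration, it assumes plain four-line FASTQ records with no blank lines, it assumes the two paired files are already in the same order, and it assumes two-character "/1" and "/2" suffixes as in the settings above.

```python
import io
from itertools import zip_longest

def read_fastq(handle):
    """Yield (title, seq, qual) tuples, assuming strict four-line records."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield title[1:], seq, qual  # drop the leading '@'

def interleave(forward_handle, reverse_handle, out_handle):
    """Write forward/reverse records alternately; return the number of pairs."""
    pairs = 0
    for fwd, rev in zip_longest(read_fastq(forward_handle), read_fastq(reverse_handle)):
        if fwd is None or rev is None:
            raise ValueError("Input files contain different numbers of records")
        # Compare the names with the two-character /1 and /2 suffixes removed
        if fwd[0].split()[0][:-2] != rev[0].split()[0][:-2]:
            raise ValueError("Out of order: %s vs %s" % (fwd[0], rev[0]))
        for title, seq, qual in (fwd, rev):
            out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
        pairs += 1
    return pairs

# Tiny in-memory demonstration with made-up reads
demo_out = io.StringIO()
interleave(io.StringIO("@r1/1\nAC\n+\nII\n"), io.StringIO("@r1/2\nGT\n+\nJJ\n"), demo_out)
print(demo_out.getvalue())
```

With real data you would pass handles opened on the paired forward and reverse output files instead of the `io.StringIO` objects. The name check is there so that out-of-order inputs fail loudly rather than producing a silently mispaired file.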
Comment
-
Originally posted by cwzkevin
Hi, maubp
Thank you for your Python code. Just a question: if the sequence order in the two input FASTQ files is randomized, then the sequence order in the two paired output FASTQ files is not the same. The two output files are indeed paired, but the reads appear in a different order in each. Could you please update your code so that the paired output files have the same sequence order even when the input files are randomized? Thank you.
One way to do the sorting is to convert your FASTQ file into a tab-separated file, use the Unix sort command, and then turn the data back into FASTQ. The trick here is the Unix sort command sorts at the line level, so we must transform the data to be one line per record.
There is probably an 'off the shelf' solution to this, perhaps using biopieces?
Or go back and find out why your FASTQ files are out of order, and fix that.
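As a rough illustration of the one-line-per-record idea, here is a minimal Python sketch. It keeps everything in memory and uses Python's built-in `sorted()` in place of the Unix sort command, so it only suits files that fit in RAM. The function names are invented for this example, and it assumes four-line FASTQ records whose titles contain no tab characters.

```python
import io

def fastq_to_tab_lines(handle):
    """Yield one 'title<TAB>seq<TAB>qual' string per four-line FASTQ record."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield "\t".join((title[1:], seq, qual))

def sort_fastq(in_handle, out_handle):
    """Sort records by their one-line form (title first) and write FASTQ back out."""
    # sorted() stands in for the Unix sort command, so all records are held in memory
    for line in sorted(fastq_to_tab_lines(in_handle)):
        title, seq, qual = line.split("\t")
        out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))

# Tiny in-memory demonstration with made-up reads
demo_out = io.StringIO()
sort_fastq(io.StringIO("@r2/1\nAC\n+\nII\n@r1/1\nGT\n+\nJJ\n"), demo_out)
print(demo_out.getvalue())
```

Running the same sort over the paired forward and reverse files should put them in matching order, provided the read names differ only in their "/1" and "/2" suffixes.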
Comment
-
Originally posted by safina
I ran the script but it's giving me the error:
Sequence and quality captions differ.
Can anyone help me with this error from the script posted above?
It sounds like your FASTQ files have somehow been messed up.
Comment
-
I'm sending you some reads from my FASTQ files. Please see below.
==> forward_sequences.fastq <==
@SRR1562087.10.1/1
GAGCTAGATCAGCACCATATATTACACGATGATCAGCTGTAACATTTACCTGCATCTGGTTCTTCATTCCTATCCGACCATCCTTGG
+
JJJJJJIIJJJJJJJJIJJJJJJJJJJJJJJJIJJJJJJJGIIJJJJIJJJJJJJJJIJJJJDHIHHHHHHHFDFFDDDDDDDDD>C
@SRR1562087.11.1/1
AGGTTGACTATGGTCCAGGCCATGCCAGGAGAGCAACCGAAAACAGAGAGAACGGTAAGCCAGGAGAAGAACAGTATGAGTATATAG
+
IJJGHIJIIIFIBHHGAFHGGIHJIJGJEGIGGGHGIJJJJHHGFEFEDACEEDDBDBCCCDDDDDDBDDDCDDCADDDCCCDDDDD
@SRR1562087.15.1/1
TAACATCCACAATCTCCTTCTACCCAAGAAGTCTGGAACTTCAGCATCAAAGGCTGGTGATGACGACAACTAATCCATTTACTGAAT
==> reverse_sequences.fastq <==
@SRR1562087.7.2/2
CCTGTAGATATACGTACTGCCAAAGGGTAGATAGTTGCCCATCTCAGAAAACACAACTTCAACAGCCAAGATTAATATCCATGTGAT
+
IJJJGGJBHIJJGHHHIIHJJGJGJIIDFHIJIJJJGHJJJJJJJIJGIGH@FHJIJIHIIIHHH=BDFFAEECCEEFDEDDCDCA>
@SRR1562087.9.2/2
GTAATCCAAATAAGGTATACTCACTCATCGGAGGATTTTGTGCTTCCCCTGTGAATTTCCACGCTAAGGATGGCTCCGGCTATAAAT
+
JIJIIJJJGGIIJIBC@FH@HHJGIJGCHGIEGIFHDFHJIJIJIHHIIIIJGGHHHHHCDDFDDDBDDDDDDDCDBDDBD@CDCEE
@SRR1562087.11.2/2
GAAACACTGATTGGTTCACGTATCCAGGTGTATGGACCACCTATATACTCATACTGTTCTTCTCCTGGCTTACCGTTCTCTCTGTTT
Comment
-
Your FASTQ files look fine from those snippets. Which script exactly are you using, what was the command line you used, and most importantly, the exact error message?
Update: Hang on, the names don't match, e.g. SRR1562087.10.1/1 vs SRR1562087.7.2/2
Last edited by maubp; 04-01-2015, 12:20 AM.
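One possible reading of those snippets is that the mate number appears twice in each name (".1/1" on forward reads, ".2/2" on reverse reads), so stripping only "/1" and "/2" would leave SRR1562087.11.1 and SRR1562087.11.2, which never compare equal. The sketch below shows a name function for that case. The helper `pair_key` and its pattern are hypothetical and assume every read name ends this way; names without the doubled suffix are returned unchanged.

```python
import re

# Assumed naming scheme: names end in ".1/1" (forward) or ".2/2" (reverse)
_DOUBLE_MATE_SUFFIX = re.compile(r"\.([12])/\1$")

def pair_key(title):
    """Return the read name with a trailing '.1/1' or '.2/2' removed."""
    name = title.split()[0]
    return _DOUBLE_MATE_SUFFIX.sub("", name)

# Reads from the posted snippets that look like mates of each other
print(pair_key("SRR1562087.11.1/1") == pair_key("SRR1562087.11.2/2"))
```

If the assumption holds, a function like this could take the place of both `f_name` and `r_name` in the pairing script quoted earlier in the thread.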
Comment
-
Originally posted by MattB
Thanks for quick reply. Actually, I'm looking for something even simpler, where I just have one interleaved fastq file that I want to split into separate forward and reverse. Probably should have posted in a separate thread since I'm not so worried about trimming in this case.
I'm sure it is some very simple python that I could manage in the end, such a script is probably already out there.
Code:
curl -sL git.io/pairfq_lite | perl - splitpairs -i interl.fq.gz -f forward.fq -r reverse.fq
Code:
curl -sL git.io/pairfq_lite | perl - joinpairs -f forward.fq -r reverse.fq -o interl.fq
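For anyone who would rather do the split in plain Python, as the quoted post suggests, a minimal sketch follows. It is not how pairfq works internally; `split_interleaved` is an invented name, and the code assumes strict four-line records that alternate forward, reverse, forward, reverse.

```python
import io

def split_interleaved(in_handle, forward_handle, reverse_handle):
    """Send alternate four-line records to the forward and reverse handles.

    Returns the number of pairs written.
    """
    pairs = 0
    while True:
        first = [in_handle.readline() for _ in range(4)]
        if not first[0]:
            break  # end of file
        second = [in_handle.readline() for _ in range(4)]
        if not second[3]:
            raise ValueError("Odd number of records; file is not cleanly interleaved")
        forward_handle.writelines(first)
        reverse_handle.writelines(second)
        pairs += 1
    return pairs

# Tiny in-memory demonstration with made-up reads
demo_fwd, demo_rev = io.StringIO(), io.StringIO()
demo_in = io.StringIO("@r1/1\nAC\n+\nII\n@r1/2\nGT\n+\nJJ\n")
print(split_interleaved(demo_in, demo_fwd, demo_rev))
```

For a gzipped input such as interl.fq.gz, the input handle could be opened with `gzip.open(filename, "rt")` from the standard library.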
Originally posted by Daniel1977
Hi,
Could anyone please modify the script so that it works with FASTA files instead of FASTQ files?
Many, many thanks!
Comment
-
Hi. I'm not getting any error; in fact, the pairfq program is putting all my reads in single.fq, whereas 1-paired.fq and 2_paired.fq remain empty. The link to the script is below:
The command was:
$ pairfq makepairs -f s_1_1_trimmed.fq \
-r s_1_2_trimmed.fq \
-fp s_1_1_trimmed_p.fq \
-rp s_1_2_trimmed_p.fq \
-fs s_1_1_trimmed_s.fq \
-rs s_1_2_trimmed_s.fq \
--index
Comment
-
Originally posted by maubp
Your FASTQ files look fine from those snippets. Which script exactly are you using, what was the command line you used, and most importantly, the exact error message?
The command was:
$ pairfq makepairs -f s_1_1_trimmed.fq \
-r s_1_2_trimmed.fq \
-fp s_1_1_trimmed_p.fq \
-rp s_1_2_trimmed_p.fq \
-fs s_1_1_trimmed_s.fq \
-rs s_1_2_trimmed_s.fq \
--index
Comment