Seqanswers Leaderboard Ad

**chrishah** · 03-22-2012, 03:06 AM

Hi guys,

Thanks for this script! I used it successfully both on standard fastq as well as on casava format (with the appropriate changes in the script)

But now I hit a wall again:

I have been doing some error correction and this added some information to the header so afterwards the script does not work anymore.

my data looks now like this:
@PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
NACAGAACTCATTTGGCAGGCAAAACCCTGAGACAGATTCTGACAGGAAGTGGATACCTGATGTGTTGTATTACCT
+PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1
BGIFHIMLMM_______W_Y_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
TATAGACCCTTTTATCTATCATTTCTTCACAAGCTTAGGACAGAAGACTCTTATTGTGCATGATAAAGTAAAAGTC
+PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

I tried to change the script in different ways, but it always results basically in the following error message:

Traceback (most recent call last):
File "/usit/titan/u1/chrishah/python/pair_up_reads.py", line 49, in <module>
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
File "/site/VERSIONS/python-2.6.2/lib/python2.6/site-packages/Bio/SeqIO/QualityIO.py", line 907, in FastqGeneralIterator
raise ValueError("Sequence and quality captions differ.")
ValueError: Sequence and quality captions differ.

I could of course just remove the parts from the header that have been added by the errorcorrection tool and use the normal script...
But I really think it should be easy to adjust the script so that it works for my data - I just dont know enough about Python...

Can anyone give me a hint how to change the script? Any tips are highly appreciated!

Much obliged!
Christoph

**maubp** · 03-22-2012, 03:12 AM

Originally posted by chrishah View Post

Hi guys,

Thanks for this script! I used it successfully both on standard fastq as well as on casava format (with the appropriate changes in the script)

But now I hit a wall again

I have been doing some error correction and this added some information to the header so afterwards the script does not work anymore.

You should either edit BOTH the @ and the + lines like this,

Code:

@PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
NACAGAACTCATTTGGCAGGCAAAACCCTGAGACAGATTCTGACAGGAAGTGGATACCTGATGTGTTGTATTACCT
+PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
BGIFHIMLMM_______W_Y_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
TATAGACCCTTTTATCTATCATTTCTTCACAAGCTTAGGACAGAAGACTCTTATTGTGCATGATAAAGTAAAAGTC
+PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

or use a blank + line like this (which is recommended as the file is smaller):

Code:

@PCUS-319-EAS487_0006_FC:7:1:2093:983#0/1 0 0 0 0 0 f: b:
NACAGAACTCATTTGGCAGGCAAAACCCTGAGACAGATTCTGACAGGAAGTGGATACCTGATGTGTTGTATTACCT
+
BGIFHIMLMM_______W_Y_____BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@PCUS-319-EAS487_0006_FC:7:1:2415:988#0/1 0 2 0 0 0 f: b:39G/1T
TATAGACCCTTTTATCTATCATTTCTTCACAAGCTTAGGACAGAAGACTCTTATTGTGCATGATAAAGTAAAAGTC
+
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

The Biopython parser is complaining about the inconsistency.

**chrishah** · 03-22-2012, 05:32 AM

Thanks for your rapid reply maubp!

Have tried the second solution you suggested and it works fine with the standard version of the script!

I will use this approach to proceed. Thanks again!

take care,
Christoph

**vkartha** · 04-17-2012, 10:12 AM

I have data generated from Illumina Hiseq 2000 which uses 1:N and 2:N as opposed to /1 and /2 as suffices.
Ex:

@D8CXHXP1:222:C0HVHACXX:2:1101:1836:2165 2:N:0:ACAGTG

Do I just follow replacing those 2 lines of code with:

f_suffix = ""
r_suffix = ""

Would it work then?? Sorry I don't understand

**maubp** · 04-18-2012, 01:10 AM

Originally posted by vkartha View Post

Would it work then?? Sorry I don't understand

Yes, i.e.

Code:

from Bio.SeqIO.QualityIO import FastqGeneralIterator #Biopython 1.51 or later

#######################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_pairs_filename = "out_interleaved_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = ""
r_suffix = ""
#For older Illumina files use this instead:
#f_suffix = "/1"
#r_suffix = "/2"

#######################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(title):
        """Remove the suffix from a forward read name."""
        name = title.split()[0]
        assert name.endswith(f_suffix), name
        return name[:f_suffix_crop]
else:
    def f_name(title):
        return title.split()[0]

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(title):
        """Remove the suffix from a reverse read name."""
        name = title.split()[0]
        assert name.endswith(r_suffix), name
        return name[:r_suffix_crop]
else:
    def r_name(title):
        return title.split()[0]

print "Scanning reverse file to build list of names..."    
reverse_ids = set()
paired_ids = set()
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    reverse_ids.add(r_name(title))

print "Processing forward file..."
forward_handle = open(output_paired_forward_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = f_name(title)
    if name in reverse_ids:
        #Paired
        paired_ids.add(name)
        reverse_ids.remove(name) #frees a little memory
        forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
forward_handle.close()
del reverse_ids #frees memory, although we won't need more now

print "Processing reverse file..."
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    name = r_name(title)
    if name in paired_ids:
        #Paired
        reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
orphan_handle.close()
reverse_handle.close()
print "Done"

A more rigorous script would verify the 1 or 2 as well.

**borjito** · 04-23-2012, 08:17 AM

Hi everybody and thanks for supporting this forum. I'm running the latter script and I'm getting this error:

Traceback (most recent call last):
File "clean-fastQ.py", line 46, in <module>
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
NameError: name 'FastqGeneralIterator' is not defined

Any ideas for someone that is pretty new in bioinformatics??

Regards

Originally posted by maubp View Post

Yes, i.e.

Code:

from Bio import SeqIO #Biopython 1.54 or later needed

#######################################################
#
# Change the following settings to suit your needs
#

input_forward_filename = "SRR001666_1_rnd.fastq"
input_reverse_filename = "SRR001666_2_rnd.fastq"

output_pairs_filename = "out_interleaved_pairs.fastq"
output_orphan_filename = "out_unpaired_orphans.fastq"

f_suffix = ""
r_suffix = ""
#For older Illumina files use this instead:
#f_suffix = "/1"
#r_suffix = "/2"

#######################################################

if f_suffix:
    f_suffix_crop = -len(f_suffix)
    def f_name(name):
        """Remove the suffix from a forward read name."""
        assert name.endswith(f_suffix), name
        return name[:f_suffix_crop]
else:
    #No suffix, don't need a function to fix the name
    f_name = None

if r_suffix:
    r_suffix_crop = -len(r_suffix)
    def r_name(name):
        """Remove the suffix from a reverse read name."""
        assert name.endswith(r_suffix), name
        return name[:r_suffix_crop]
else:
    #No suffix, don't need a function to fix the name
    r_name = None

print "Scaning reverse file to build list of names..."    
reverse_ids = set()
paired_ids = set()
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    reverse_ids.add(r_name(title))

print "Processing forward file..."
forward_handle = open(output_paired_forward_filename, "w")
orphan_handle = open(output_orphan_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_forward_filename)):
    name = f_name(title)
    if name in reverse_ids:
        #Paired
        paired_ids.add(name)
        reverse_ids.remove(name) #frees a little memory
        forward_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
forward_handle.close()
del reverse_ids #frees memory, although we won't need more now

print "Processing reverse file..."
reverse_handle = open(output_paired_reverse_filename, "w")
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
    name = r_name(title)
    if name in paired_ids:
        #Paired
        reverse_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
    else:
        #Orphan
        orphan_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
orphan_handle.close()
reverse_handle.close()
print "Done"

A more rigorous script would verify the 1 or 2 as well.

**maubp** · 04-23-2012, 08:20 AM

Originally posted by borjito View Post

Hi everybody and thanks for supporting this forum. I'm running the latter script and I'm getting this error:

Traceback (most recent call last):
File "clean-fastQ.py", line 46, in <module>
for title, seq, qual in FastqGeneralIterator(open(input_reverse_filename)):
NameError: name 'FastqGeneralIterator' is not defined

Any ideas for someone that is pretty new in bioinformatics??

Regards

That's a typical Python error (exception), it is saying 'FastqGeneralIterator' has not been defined. In this case I think you are missing the 'import' line which would have loaded FastqGeneralIterator from Biopython (v1.51 or later). i.e.

from Bio.SeqIO.QualityIO import FastqGeneralIterator

Update: It looks like when I wrote that re-cap post I must have slipped up.

**borjito** · 04-23-2012, 11:15 PM

Many thanks! So now the exception disappeared, but I found this new error:

$ python clean-fastQ.py
Scaning reverse file to build list of names...
Traceback (most recent call last):
File "clean-fastQ.py", line 46, in <module>
reverse_ids.add(r_name(title))
TypeError: 'NoneType' object is not callable

I'm wondering whether this error comes from the file format; this is an example of the files I'm trying to "clean":

@HTKZQN1:399:C0JJVACXX:5:1101:1680:2187 2:N:5122:AGTCAA
ACTAGTATGGCCCGGGGGATCCTAGAGACCATTCGCGATTCCATGAGACTCCAAGGGTTCTGCACAACTTATGCACCTCTATTAGATCATTGTGTTCTACG
+
CCCDFDFFHGHHHIJJJJDHIJJIJJIIIIIIIJJJHDDEBCEECCCDDDCDDDDDDAADDDDCCDDBDDDDAAACCCDCACDDCCDDDCDCDDCCDDCCD

Many thanks in advance

Borja

**maubp** · 04-24-2012, 12:57 AM

Originally posted by borjito View Post

Many thanks! So now the exception disappeared, ...

Yep - I messed up that recap post. Updating it now... sorry.

**borjito** · 04-24-2012, 06:28 AM

Thanks again, I've also changed the

output_pairs_filename = "out_interleaved_pairs.fastq"

at the beggining by

output_paired_forward_filename = "IPLA2_1fc.fq"
output_paired_reverse_filename = "IPLA2_2fc.fq"

o create 2 files, as defined in the script, and it works perfectly! I've used the two outputs as input for SOAPdenovo and got a notable decrease in computation time.

Many thanks!!

**MattB** · 05-30-2012, 10:21 PM

Just wondering if anyone has a script to do the opposite to the scripts in this thread, ie. split an already interleaved forward and reverse file into separate forward and reverse files (for Casava 1.8 and beyond read naming)?

**Wallysb01** · 05-30-2012, 10:42 PM

Looks like you have some answers here, but I've used this and it works wonderfully well: https://github.com/ucdavis-bioinformatics/sickle

It won't make the interleaved, instead it retains the split left/right files with a new single end file. But you can easily use other scripts to create the interleaved file if you must.

**MattB** · 05-30-2012, 11:21 PM

Thanks for quick reply. Actually, I'm looking for something even simpler, where I just have one interleaved fastq file that I want to split into separate forward and reverse. Probably should have posted in a separate thread since I'm not so worried about trimming in this case.

I'm sure it is some very simple python that I could manage in the end, such a script is probably already out there.

**Wallysb01** · 05-30-2012, 11:42 PM

Hi Matt,

Just googling around I see this post with a script that claims to do such a thing, I've never had to do this type of thing however: http://bioinfomemaid.blogspot.com/

**MattB** · 05-30-2012, 11:46 PM

Ah, that looks promising, will give it a go. Don't know why my google search didn't hit upon that one!

Thanks!

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News