SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
FastX-toolkit liu_xt005 Bioinformatics 13 10-11-2014 04:52 AM
FASTX-Toolkit: quality score value thinkRNA Bioinformatics 13 09-30-2014 09:25 AM
Has anyone used FastX Toolkit with IonTorrent data? gkijak Ion Torrent 7 06-05-2013 05:16 PM
Filtering Illumina GAIIx reads using fastx rubi Bioinformatics 8 05-02-2011 03:54 AM
matching unmapped paired SOLiD reads smarkel General 6 10-09-2009 11:49 AM

Reply
 
Thread Tools
Old 10-12-2012, 02:52 PM   #1
bmtb
Member
 
Location: canada

Join Date: Oct 2012
Posts: 11
Default matching up paired-end reads after fastx-toolkit filtering

Hi,

I just learned that most genome assemblers rely on the order of sequences within paired read files to maintain read pairings. Unfortunately, I have already gone and processed each of my paired-end read files separately using fastx-toolkit and now sequences within each file are not paired up correctly. I've searched for ways to compare the two files and extract matching sequences into two new files and put all the orphaned reads into a third file, but so far I haven't found anything that has worked. Does anyone have any simple workarounds for this (eg. perl scripts) other than starting over using other filtering software (eg. Trimmomatic)? I'm working with RADseq data (fastq = phred33).

Anyhelp would be great!

Thanks,
Bryan
bmtb is offline   Reply With Quote
Old 10-12-2012, 04:38 PM   #2
bioliyezhang
Member
 
Location: Boston, MA

Join Date: Mar 2011
Posts: 19
Default

maybe try FixMateInformation of picard tools.
Not sure whether it will work.
bioliyezhang is offline   Reply With Quote
Old 10-12-2012, 06:16 PM   #3
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381
Default

You can easily install a local instance of Galaxy and edit the start up script to give you admin rights. Then you can add the tool via Galaxy toolshed Fastq interlacer/deinterlacer. this groups by sequence ID into pairs i.e. cluster coordinates/1 and /2.

These guys also have a perl script for the same function...http://sfg.stanford.edu/
I noticed the scrit in question was not working properly and the authors sent me an updated working version. They may or may not have updated the download
JackieBadger is offline   Reply With Quote
Old 10-13-2012, 01:40 AM   #4
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by bmtb View Post
Does anyone have any simple workarounds for this (eg. perl scripts) other than starting over using other filtering software (eg. Trimmomatic)?
Hi- Sure you files can be fixed. However, I suspect that the easiest and fastest solution is to go back to the original files and use some tool that preserves the pairing order. trim_galore for example is very easy to use and does a great job.

All the best
Dario
dariober is offline   Reply With Quote
Old 10-14-2012, 04:03 PM   #5
bmtb
Member
 
Location: canada

Join Date: Oct 2012
Posts: 11
Default

Thanks Jackie! I downloaded the scripts available from the link you posted and checked the PEcombine.sh script, and it appears to be exactly what I need. I'll post again to let those interested know if it worked or not.
Cheers,
Bryan
bmtb is offline   Reply With Quote
Old 10-14-2012, 06:23 PM   #6
bmtb
Member
 
Location: canada

Join Date: Oct 2012
Posts: 11
Default

Quote:
Originally Posted by JackieBadger View Post
You can easily install a local instance of Galaxy and edit the start up script to give you admin rights. Then you can add the tool via Galaxy toolshed Fastq interlacer/deinterlacer. this groups by sequence ID into pairs i.e. cluster coordinates/1 and /2.

These guys also have a perl script for the same function...http://sfg.stanford.edu/
I noticed the scrit in question was not working properly and the authors sent me an updated working version. They may or may not have updated the download
Hello again, so I made several attempts to use the PECombiner.sh/fastqcombinepairedreads.py script to work on my paired reads but Jackie previously noted the script is faulty.
bmtb is offline   Reply With Quote
Old 10-14-2012, 06:45 PM   #7
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381
Default

I was sent the following fixed script from the authors with instructions. Reference them if you use it. I haven't tested it so contact the authors if it fails again

"You can either replace the name in the PECombiner.sh script with fastqcombinepairedend_revision.py or delete the fastqcombinepairedend.py that you downloaded with the SFG and remove the _revision from the revised script name to match the old one"



#!/usr/bin/env python

import sys
#####Authors: Team Ladner-Barshis-Tepolt
######Usage########
######This script takes a set of separate quality trimmed paired end .fastq files
######and pairs the reads that still have a mate and combined the solitary reads
######into a single file of orphans for input into denovo assembly software (e.g. CLC Genomics).
######Command line usage: fastqcombinepairedend.py 'the seqheader text' 'the delimiter' infiledirection1.fastq infiledirection2.fastq


seqheader=sys.argv[1]
paireddelim=sys.argv[2]
in1=sys.argv[3]
in2=sys.argv[4]
# filecount=0
# pairedlist=[]
# for file in infilelist:
# if filecount==0:
# firstname=file
# filecount+=1
# else:
# secondname=file
# filecount=0
# pairedlist.append((firstname,secondname))

#Opens an infile specified by the user.
IN1 = open(sys.argv[3], 'r')
IN2 = open(sys.argv[4], 'r')
setoffiles=[in1, in2]
#Opens an output text file as specified by user
SIN = open('%s_trimmed_clipped_singles.fastq'%(sys.argv[3][:-23]), 'w') #Unpaired reads from both directions for CLC
PAIR1 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[3][:-6]), 'w') #Read one of paired reads for CLC
PAIR2 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[4][:-6]), 'w') #Read two of paired reads for CLC


names1=[]
names2=[]
seqs1={}
seqs2={}
quals1={}
quals2={}
linenum1=0

#Starts a for loop to read through the infile line by line
for line in IN1:

line = line.rstrip()
linenum1+=1

#sets count to 1 when it finds the header before a sequence
if line[0:4]==seqheader:
line=line.split(paireddelim)
name=line[0][1:]
# print name
names1.append(name)
count = 0
else:
if count==1:
seqs1[name]=line
if count==3:
quals1[name]=line

count=count+1
# print linenum1
print 'finished reading in: %s' %(setoffiles[0])
for line in IN2:

# print 2

line = line.rstrip()

#sets count to 1 when it finds the header before a sequence
if line[0:4]=='@HWI':
line=line.split(paireddelim)
name=line[0][1:]
names2.append(name)
count = 0
else:

if count==1:
seqs2[name]=line
if count==3:
quals2[name]=line

count=count+1

# print 'here'
print 'finished reading in: %s'%(setoffiles[0])
paired = list(set(names1) & set(names2))
# print len(paired)
print 'number of paired reads: %d'%(len(paired))
single = list(set(names1) ^ set(names2))
# print len(single)
single1 = list(set(single) & set(names1))
# print len(single1)
print 'number of single reads left from %s: %d'%(setoffiles[0],len(single1))
single2 = list(set(single) & set(names2))
# print len(single2)
print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))
# for label in paired:
# PAIR.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

for label in single1:
SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')

for label in single2:
SIN.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

for label in paired:
PAIR1.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
PAIR2.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

IN1.close()
IN2.close()
SIN.close()
PAIR1.close()
PAIR2.close()
JackieBadger is offline   Reply With Quote
Old 10-16-2012, 12:07 PM   #8
bmtb
Member
 
Location: canada

Join Date: Oct 2012
Posts: 11
Default

Thanks Jackie! I tried the revised script but it still didn't work correctly. Only after I went through the script re-setting all the indents and after correcting some lines did I manage to get it working. My only concern now is that I have to ramp up the requested RAM on the cluster i'm using just to use it with my data. So far 8GB of RAM won't cut it as I keep getting 'MemoryErrors'. It works fine on smaller datasets though. Anyway, here is the corrected code (again reference the authors if you use it):


####################################################
#!/usr/bin/env python

import sys
#####Authors: Team Ladner-Barshis-Tepolt
######Usage########
######This script takes a set of separate quality trimmed paired end .fastq files
######and pairs the reads that still have a mate and combined the solitary reads
######into a single file of orphans for input into denovo assembly software (e.g. CLC Genomics).
######Command line usage: fastqcombinepairedend.py 'the delimiter' 'the seqheader text' infiledirection1.fastq infiledirection2.fastq


seqheader=sys.argv[1]
paireddelim=sys.argv[2]
in1=sys.argv[3]
in2=sys.argv[4]

###This line had to be added 15Oct2012 for script to work properly
setoffiles=[in1, in2]

#filecount=0
#pairedlist=[]
#for file in infilelist:
#if filecount==0:
#firstname=file
#filecount+=1
#else:
#secondname=file
#filecount=0
#pairedlist.append((firstname,secondname))

#Opens an infile specified by the user.
IN1 = open(sys.argv[3], 'r')
IN2 = open(sys.argv[4], 'r')

#Opens an output text file as specified by user
SIN = open('%s_trimmed_clipped_singles.fastq'%(sys.argv[3][:-23]), 'w') #Unpaired reads from both directions for CLC
PAIR1 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[3][:-6]), 'w') #Read one of paired reads for CLC
PAIR2 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[4][:-6]), 'w') #Read two of paired reads for CLC


names1=[]
names2=[]
seqs1={}
seqs2={}
quals1={}
quals2={}
linenum1=0

#Starts a for loop to read through the infile line by line
for line in IN1:

line = line.rstrip()
linenum1+=1

#sets count to 1 when it finds the header before a sequence
if line[0:4]==seqheader:
line=line.split(paireddelim)
name=line[0][1:]
# print name
names1.append(name)
count = 0
else:
if count==1:
seqs1[name]=line
if count==3:
quals1[name]=line

count=count+1
# print linenum1
print 'finished reading in: %s' %(setoffiles[0])

for line in IN2:

# print 2

line = line.rstrip()

#sets count to 1 when it finds the header before a sequence
if line[0:4]==seqheader: #change to seqheader from '@HWI' 15oct2012 - I'm not sure why they didn't just use seqheader to begin with as done in previous for loop.
line=line.split(paireddelim)
name=line[0][1:]
names2.append(name)
count = 0
else:
if count==1:
seqs2[name]=line
if count==3:
quals2[name]=line

count=count+1
# print linenum1
# print 'here'
print 'finished reading in: %s'%(setoffiles[1])

paired = list(set(names1) & set(names2))
#print paired
# print len(paired)
print 'number of paired reads: %d'%(len(paired))
single = list(set(names1) ^ set(names2))
# print len(single)
single1 = list(set(single) & set(names1))
# print len(single1)
print 'number of single reads left from %s: %d'%(setoffiles[0],len(single1))
single2 = list(set(single) & set(names2))
# print len(single2)
print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))

# for label in paired:
# PAIR.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n@' + str(label) + '2\n' + str(seqs2[label]) + $

for label in single1:
SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')

for label in single2:
SIN.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

for label in paired:
PAIR1.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
PAIR2.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n')

IN1.close()
IN2.close()
SIN.close()
PAIR1.close()
PAIR2.close()
####################################################
bmtb is offline   Reply With Quote
Old 10-16-2012, 12:15 PM   #9
bmtb
Member
 
Location: canada

Join Date: Oct 2012
Posts: 11
Default

Hi again, I just realized my last post with the correct code didn't keep the indents. so I've attached a file containing the corrected script. To use it remove the .txt extension.

Cheers,
Bryan
Attached Files
File Type: txt fastqcombinepairedend.py.txt (3.6 KB, 646 views)
bmtb is offline   Reply With Quote
Old 10-18-2012, 09:04 AM   #10
bmtb
Member
 
Location: canada

Join Date: Oct 2012
Posts: 11
Default

Just thought I'd let you know I needed ~227GB of RAM in order to use this script on my data which consisted of two paired-end files 58GB and 29GB in size (after filtering for quality).
bmtb is offline   Reply With Quote
Old 12-28-2012, 11:00 AM   #11
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

bmtb,

Thanks for the corrected script and the info on memory usage.

I'd like to ask what the 'the delimiter' 'the seqheader text' refer to in the program usage.

Command line usage: fastqcombinepairedend.py 'the delimiter' 'the seqheader text' infiledirection1.fastq infiledirection2.fastq

Carmen
carmeyeii is offline   Reply With Quote
Old 01-03-2013, 12:02 PM   #12
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

Hi! Follow-up / bump to my question above:

Does i have to be the entire line, ie,

@HWI-M00149:16:000000000-A12VK:1:1101:17796:2300 1:N:0:

, or just @HWI ?

I'm guessing for delimiter I should use " : " ?

Thanks!
carmeyeii is offline   Reply With Quote
Old 01-03-2013, 01:35 PM   #13
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

Dear btmb,

I have tried the code and it seems not to be working 'name' and 'count' are not defined before they are evaluated it seems, and the 'count' block seems to me as it sets to 0 when the header is encountered, but I'm not sure. Are there any python-saavy bioinformaticians that would care to help us out with this code?

Thanks!
carmeyeii is offline   Reply With Quote
Old 01-03-2013, 06:09 PM   #14
bmtb
Member
 
Location: canada

Join Date: Oct 2012
Posts: 11
Default

Hi carmeyeii,

Sorry it took so long for me to reply. If you open the script in a text editor there should be a description in the commented out section at the top of the page. If I recall correctly, the seqheader text should just be "@HWI" and the delimiter should be what ever separates the string of characters adjacent to the seqheader text from the read identifier part of the header line. In your case, you would use " " as the delimiter since there is a space that separates the read identifier (i.e. 1:N:0) from the rest of the line.

Cheers,

Bryan





Quote:
Originally Posted by carmeyeii View Post
Hi! Follow-up / bump to my question above:

Does i have to be the entire line, ie,

@HWI-M00149:16:000000000-A12VK:1:1101:17796:2300 1:N:0:

, or just @HWI ?

I'm guessing for delimiter I should use " : " ?

Thanks!
bmtb is offline   Reply With Quote
Old 01-03-2013, 06:38 PM   #15
bmtb
Member
 
Location: canada

Join Date: Oct 2012
Posts: 11
Default

Hi again,

I just tested the attached copy of the script on 2500 paired seqs and it worked as expected using the following command:

python fastqcombinepairedend.py "@HWI" " " fastq1 fastq2

Make sure to change the .py.txt extension to just .py.

Bryan



Quote:
Originally Posted by carmeyeii View Post
Dear btmb,

I have tried the code and it seems not to be working 'name' and 'count' are not defined before they are evaluated it seems, and the 'count' block seems to me as it sets to 0 when the header is encountered, but I'm not sure. Are there any python-saavy bioinformaticians that would care to help us out with this code?

Thanks!
Attached Files
File Type: txt fastqcombinepairedend.py.txt (3.5 KB, 341 views)
bmtb is offline   Reply With Quote
Old 01-17-2013, 04:52 PM   #16
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

Quote:
Originally Posted by bioliyezhang View Post
maybe try FixMateInformation of picard tools.
Not sure whether it will work.
I believe FixMateInformation is to be used on ALIGNED SAM files, to make sure both mates of a paired-end read have the same info, and not on pre-mapped fastq files to ensure both files contain the same mate pairs and in the same order
carmeyeii is offline   Reply With Quote
Old 01-17-2013, 05:29 PM   #17
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

Dear btmb,
I'm afraid I still cannot run it. Sorry to keep bothering?

I have corrected tabs and spaces to avoid getting the Unexpected indent Error,

but now I get:

Quote:
Traceback (most recent call last):
File "fastq_mate_pair", line 61, in <module>
if count==1:
NameError: name 'count' is not defined
Thanks again for any help,

Carmen
carmeyeii is offline   Reply With Quote
Old 01-17-2013, 09:09 PM   #18
a_mt
Member
 
Location: C:/Program files/Google/Chrome

Join Date: Jul 2012
Posts: 34
Default

Hi bmtb,

I hope this is what you are looking for. Even I had the same problem where reads in forward and reverse files were not in order. I use this shell script to extract common reads and orphan reads. It produces 4 files forward_matched, reverse_matched, forward_orphan and reverse_orphan.

Usage is simple

Code:
filteresCommonPEreads.sh forward.fastq reverse.fastq
Hope this helps.

P.S : change file extension to sh.
Attached Files
File Type: txt filteredCommonPEreads.txt (1.8 KB, 873 views)
a_mt is offline   Reply With Quote
Old 10-21-2013, 02:11 PM   #19
etwatson
Member
 
Location: Los Angeles, CA

Join Date: Jun 2012
Posts: 18
Default

btmb,

I got your edited script to run today, thank you. However, it prints nothing out to the created files. I get the following error.
Quote:
finished reading in: 46na_1_trimmed_clipped.fastq
finished reading in: 46na_1_trimmed_clipped.fastq
number of paired reads: 37819703
number of single reads left from 46na_1_trimmed_clipped.fastq: 3984165
number of single reads left from 46na_2_trimmed_clipped.fastq: 4582643
Traceback (most recent call last):
File "/usr/bin/fastqcombinepairedend_edit.py", line 106, in <module>
SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
KeyError: 'HWI-ST594:1:2110:2163:8239#TTAGGCTTAGGC'
Here are the corresponding lines in the script:
Quote:
print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))
for label in single1:
SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
It looks like the dictionary (label) contains nothing, and I can not find it anywhere else in the script. This is my first Python script to use and I would like to get it running. How would I go about fixing this?
etwatson is offline   Reply With Quote
Old 10-28-2013, 05:19 AM   #20
sphil
Senior Member
 
Location: Stuttgart, Germany

Join Date: Apr 2010
Posts: 192
Default

Hi, maybe I'm tot late but for someone looking into this. this neat one liner does the job as well:
Code:
paste file1_PE_DATA.fastq file2_PE_DATA.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf  | head | sed 's/\t\t/\n/g' | awk '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}'
it joins, sorts and splits again.

best,

Phil
sphil is offline   Reply With Quote
Reply

Tags
fastx-toolkit, paired-end reads

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:45 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO