![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
FastX-toolkit | liu_xt005 | Bioinformatics | 13 | 10-11-2014 05:52 AM |
FASTX-Toolkit: quality score value | thinkRNA | Bioinformatics | 13 | 09-30-2014 10:25 AM |
Has anyone used FastX Toolkit with IonTorrent data? | gkijak | Ion Torrent | 7 | 06-05-2013 06:16 PM |
Filtering Illumina GAIIx reads using fastx | rubi | Bioinformatics | 8 | 05-02-2011 04:54 AM |
matching unmapped paired SOLiD reads | smarkel | General | 6 | 10-09-2009 12:49 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: canada Join Date: Oct 2012
Posts: 11
|
![]()
Hi,
I just learned that most genome assemblers rely on the order of sequences within paired read files to maintain read pairings. Unfortunately, I have already gone and processed each of my paired-end read files separately using fastx-toolkit and now sequences within each file are not paired up correctly. I've searched for ways to compare the two files and extract matching sequences into two new files and put all the orphaned reads into a third file, but so far I haven't found anything that has worked. Does anyone have any simple workarounds for this (eg. perl scripts) other than starting over using other filtering software (eg. Trimmomatic)? I'm working with RADseq data (fastq = phred33). Anyhelp would be great! Thanks, Bryan |
![]() |
![]() |
![]() |
#2 |
Member
Location: Boston, MA Join Date: Mar 2011
Posts: 19
|
![]()
maybe try FixMateInformation of picard tools.
Not sure whether it will work. |
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Halifax, Nova Scotia Join Date: Mar 2009
Posts: 381
|
![]()
You can easily install a local instance of Galaxy and edit the start up script to give you admin rights. Then you can add the tool via Galaxy toolshed Fastq interlacer/deinterlacer. this groups by sequence ID into pairs i.e. cluster coordinates/1 and /2.
These guys also have a perl script for the same function...http://sfg.stanford.edu/ I noticed the scrit in question was not working properly and the authors sent me an updated working version. They may or may not have updated the download |
![]() |
![]() |
![]() |
#4 | |
Senior Member
Location: Cambridge, UK Join Date: May 2010
Posts: 311
|
![]() Quote:
All the best Dario |
|
![]() |
![]() |
![]() |
#5 |
Member
Location: canada Join Date: Oct 2012
Posts: 11
|
![]()
Thanks Jackie! I downloaded the scripts available from the link you posted and checked the PEcombine.sh script, and it appears to be exactly what I need. I'll post again to let those interested know if it worked or not.
Cheers, Bryan |
![]() |
![]() |
![]() |
#6 | |
Member
Location: canada Join Date: Oct 2012
Posts: 11
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: Halifax, Nova Scotia Join Date: Mar 2009
Posts: 381
|
![]()
I was sent the following fixed script from the authors with instructions. Reference them if you use it. I haven't tested it so contact the authors if it fails again
"You can either replace the name in the PECombiner.sh script with fastqcombinepairedend_revision.py or delete the fastqcombinepairedend.py that you downloaded with the SFG and remove the _revision from the revised script name to match the old one" #!/usr/bin/env python import sys #####Authors: Team Ladner-Barshis-Tepolt ######Usage######## ######This script takes a set of separate quality trimmed paired end .fastq files ######and pairs the reads that still have a mate and combined the solitary reads ######into a single file of orphans for input into denovo assembly software (e.g. CLC Genomics). ######Command line usage: fastqcombinepairedend.py 'the seqheader text' 'the delimiter' infiledirection1.fastq infiledirection2.fastq seqheader=sys.argv[1] paireddelim=sys.argv[2] in1=sys.argv[3] in2=sys.argv[4] # filecount=0 # pairedlist=[] # for file in infilelist: # if filecount==0: # firstname=file # filecount+=1 # else: # secondname=file # filecount=0 # pairedlist.append((firstname,secondname)) #Opens an infile specified by the user. IN1 = open(sys.argv[3], 'r') IN2 = open(sys.argv[4], 'r') setoffiles=[in1, in2] #Opens an output text file as specified by user SIN = open('%s_trimmed_clipped_singles.fastq'%(sys.argv[3][:-23]), 'w') #Unpaired reads from both directions for CLC PAIR1 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[3][:-6]), 'w') #Read one of paired reads for CLC PAIR2 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[4][:-6]), 'w') #Read two of paired reads for CLC names1=[] names2=[] seqs1={} seqs2={} quals1={} quals2={} linenum1=0 #Starts a for loop to read through the infile line by line for line in IN1: line = line.rstrip() linenum1+=1 #sets count to 1 when it finds the header before a sequence if line[0:4]==seqheader: line=line.split(paireddelim) name=line[0][1:] # print name names1.append(name) count = 0 else: if count==1: seqs1[name]=line if count==3: quals1[name]=line count=count+1 # print linenum1 print 'finished reading in: %s' %(setoffiles[0]) for line in IN2: # print 2 line = line.rstrip() #sets count to 1 when it finds the header before a sequence if line[0:4]=='@HWI': line=line.split(paireddelim) name=line[0][1:] names2.append(name) count = 0 else: if count==1: seqs2[name]=line if count==3: quals2[name]=line count=count+1 # print 'here' print 'finished reading in: %s'%(setoffiles[0]) paired = list(set(names1) & set(names2)) # print len(paired) print 'number of paired reads: %d'%(len(paired)) single = list(set(names1) ^ set(names2)) # print len(single) single1 = list(set(single) & set(names1)) # print len(single1) print 'number of single reads left from %s: %d'%(setoffiles[0],len(single1)) single2 = list(set(single) & set(names2)) # print len(single2) print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2)) # for label in paired: # PAIR.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n') for label in single1: SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n') for label in single2: SIN.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n') for label in paired: PAIR1.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n') PAIR2.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n') IN1.close() IN2.close() SIN.close() PAIR1.close() PAIR2.close() |
![]() |
![]() |
![]() |
#8 |
Member
Location: canada Join Date: Oct 2012
Posts: 11
|
![]()
Thanks Jackie! I tried the revised script but it still didn't work correctly. Only after I went through the script re-setting all the indents and after correcting some lines did I manage to get it working. My only concern now is that I have to ramp up the requested RAM on the cluster i'm using just to use it with my data. So far 8GB of RAM won't cut it as I keep getting 'MemoryErrors'. It works fine on smaller datasets though. Anyway, here is the corrected code (again reference the authors if you use it):
#################################################### #!/usr/bin/env python import sys #####Authors: Team Ladner-Barshis-Tepolt ######Usage######## ######This script takes a set of separate quality trimmed paired end .fastq files ######and pairs the reads that still have a mate and combined the solitary reads ######into a single file of orphans for input into denovo assembly software (e.g. CLC Genomics). ######Command line usage: fastqcombinepairedend.py 'the delimiter' 'the seqheader text' infiledirection1.fastq infiledirection2.fastq seqheader=sys.argv[1] paireddelim=sys.argv[2] in1=sys.argv[3] in2=sys.argv[4] ###This line had to be added 15Oct2012 for script to work properly setoffiles=[in1, in2] #filecount=0 #pairedlist=[] #for file in infilelist: #if filecount==0: #firstname=file #filecount+=1 #else: #secondname=file #filecount=0 #pairedlist.append((firstname,secondname)) #Opens an infile specified by the user. IN1 = open(sys.argv[3], 'r') IN2 = open(sys.argv[4], 'r') #Opens an output text file as specified by user SIN = open('%s_trimmed_clipped_singles.fastq'%(sys.argv[3][:-23]), 'w') #Unpaired reads from both directions for CLC PAIR1 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[3][:-6]), 'w') #Read one of paired reads for CLC PAIR2 = open('%s_trimmed_clipped_stillpaired.fastq'%(sys.argv[4][:-6]), 'w') #Read two of paired reads for CLC names1=[] names2=[] seqs1={} seqs2={} quals1={} quals2={} linenum1=0 #Starts a for loop to read through the infile line by line for line in IN1: line = line.rstrip() linenum1+=1 #sets count to 1 when it finds the header before a sequence if line[0:4]==seqheader: line=line.split(paireddelim) name=line[0][1:] # print name names1.append(name) count = 0 else: if count==1: seqs1[name]=line if count==3: quals1[name]=line count=count+1 # print linenum1 print 'finished reading in: %s' %(setoffiles[0]) for line in IN2: # print 2 line = line.rstrip() #sets count to 1 when it finds the header before a sequence if line[0:4]==seqheader: #change to seqheader from '@HWI' 15oct2012 - I'm not sure why they didn't just use seqheader to begin with as done in previous for loop. line=line.split(paireddelim) name=line[0][1:] names2.append(name) count = 0 else: if count==1: seqs2[name]=line if count==3: quals2[name]=line count=count+1 # print linenum1 # print 'here' print 'finished reading in: %s'%(setoffiles[1]) paired = list(set(names1) & set(names2)) #print paired # print len(paired) print 'number of paired reads: %d'%(len(paired)) single = list(set(names1) ^ set(names2)) # print len(single) single1 = list(set(single) & set(names1)) # print len(single1) print 'number of single reads left from %s: %d'%(setoffiles[0],len(single1)) single2 = list(set(single) & set(names2)) # print len(single2) print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2)) # for label in paired: # PAIR.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n@' + str(label) + '2\n' + str(seqs2[label]) + $ for label in single1: SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n') for label in single2: SIN.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n') for label in paired: PAIR1.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n') PAIR2.write('@' + str(label) + '2\n' + str(seqs2[label]) + '\n+' + str(label) + '2\n' + str(quals2[label]) + '\n') IN1.close() IN2.close() SIN.close() PAIR1.close() PAIR2.close() #################################################### |
![]() |
![]() |
![]() |
#9 |
Member
Location: canada Join Date: Oct 2012
Posts: 11
|
![]()
Hi again, I just realized my last post with the correct code didn't keep the indents. so I've attached a file containing the corrected script. To use it remove the .txt extension.
Cheers, Bryan |
![]() |
![]() |
![]() |
#10 |
Member
Location: canada Join Date: Oct 2012
Posts: 11
|
![]()
Just thought I'd let you know I needed ~227GB of RAM in order to use this script on my data which consisted of two paired-end files 58GB and 29GB in size (after filtering for quality).
|
![]() |
![]() |
![]() |
#11 |
Senior Member
Location: Mexico Join Date: Mar 2011
Posts: 137
|
![]()
bmtb,
Thanks for the corrected script and the info on memory usage. I'd like to ask what the 'the delimiter' 'the seqheader text' refer to in the program usage. Command line usage: fastqcombinepairedend.py 'the delimiter' 'the seqheader text' infiledirection1.fastq infiledirection2.fastq Carmen |
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: Mexico Join Date: Mar 2011
Posts: 137
|
![]()
Hi! Follow-up / bump to my question above:
Does i have to be the entire line, ie, @HWI-M00149:16:000000000-A12VK:1:1101:17796:2300 1:N:0: , or just @HWI ? I'm guessing for delimiter I should use " : " ? Thanks! |
![]() |
![]() |
![]() |
#13 |
Senior Member
Location: Mexico Join Date: Mar 2011
Posts: 137
|
![]()
Dear btmb,
I have tried the code and it seems not to be working ![]() Thanks! |
![]() |
![]() |
![]() |
#14 |
Member
Location: canada Join Date: Oct 2012
Posts: 11
|
![]()
Hi carmeyeii,
Sorry it took so long for me to reply. If you open the script in a text editor there should be a description in the commented out section at the top of the page. If I recall correctly, the seqheader text should just be "@HWI" and the delimiter should be what ever separates the string of characters adjacent to the seqheader text from the read identifier part of the header line. In your case, you would use " " as the delimiter since there is a space that separates the read identifier (i.e. 1:N:0) from the rest of the line. Cheers, Bryan |
![]() |
![]() |
![]() |
#15 | |
Member
Location: canada Join Date: Oct 2012
Posts: 11
|
![]()
Hi again,
I just tested the attached copy of the script on 2500 paired seqs and it worked as expected using the following command: python fastqcombinepairedend.py "@HWI" " " fastq1 fastq2 Make sure to change the .py.txt extension to just .py. Bryan Quote:
|
|
![]() |
![]() |
![]() |
#16 | |
Senior Member
Location: Mexico Join Date: Mar 2011
Posts: 137
|
![]() Quote:
![]() |
|
![]() |
![]() |
![]() |
#17 | |
Senior Member
Location: Mexico Join Date: Mar 2011
Posts: 137
|
![]()
Dear btmb,
I'm afraid I still cannot run it. Sorry to keep bothering? I have corrected tabs and spaces to avoid getting the Unexpected indent Error, but now I get: Quote:
Carmen |
|
![]() |
![]() |
![]() |
#18 |
Member
Location: C:/Program files/Google/Chrome Join Date: Jul 2012
Posts: 34
|
![]()
Hi bmtb,
I hope this is what you are looking for. Even I had the same problem where reads in forward and reverse files were not in order. I use this shell script to extract common reads and orphan reads. It produces 4 files forward_matched, reverse_matched, forward_orphan and reverse_orphan. Usage is simple Code:
filteresCommonPEreads.sh forward.fastq reverse.fastq P.S : change file extension to sh. |
![]() |
![]() |
![]() |
#19 | ||
Member
Location: Los Angeles, CA Join Date: Jun 2012
Posts: 18
|
![]()
btmb,
I got your edited script to run today, thank you. However, it prints nothing out to the created files. I get the following error. Quote:
Quote:
|
||
![]() |
![]() |
![]() |
#20 |
Senior Member
Location: Stuttgart, Germany Join Date: Apr 2010
Posts: 192
|
![]()
Hi, maybe I'm tot late but for someone looking into this. this neat one liner does the job as well:
Code:
paste file1_PE_DATA.fastq file2_PE_DATA.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head | sed 's/\t\t/\n/g' | awk '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}' best, Phil |
![]() |
![]() |
![]() |
Tags |
fastx-toolkit, paired-end reads |
Thread Tools | |
|
|