SEQanswers

Old 02-24-2012, 06:43 PM   #1
dejavu2010
Member
 
Location: usa

Join Date: Jan 2012
Posts: 21
Default Program to make paired-end files have an equal number of sequences

I have a PE100 run. The problem is that some tiles of read 2 are corrupted, which caused an uneven number of reads in read 1 and read 2. Are there any programs that can match up sequences from both reads and keep only the matched ones for downstream analysis? Thanks.

mike
Old 02-27-2012, 02:53 AM   #2
mgg
Member
 
Location: London, UK

Join Date: Nov 2011
Posts: 12
Default re-pairing PE files

I remember a nice contribution from kmcarr a while back which can probably help; search for thread 10392. (incidentally I can't recommend my own contribution to that thread - it is hideously slow)

best

m

Last edited by mgg; 02-27-2012 at 02:55 AM. Reason: typo correction
Old 02-27-2012, 03:33 AM   #3
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,168
Default

Quote:
Originally Posted by mgg View Post
I remember a nice contribution from kmcarr a while back which can probably help; search for thread 10392. (incidentally I can't recommend my own contribution to that thread - it is hideously slow)

best

m
m,

Thanks for the acknowledgement. Here is a link to the thread. If you go there you'll see that I just posted an update. Due to a limitation in cdbfasta my method will not work for large input fastq files. The only work-around at the moment is to split the input up into smaller chunks.
Old 02-27-2012, 04:50 AM   #4
Hobbe
Member
 
Location: Uppsala, Sweden

Join Date: Apr 2010
Posts: 29
Default

The trimmer Trimmomatic outputs files with intact pairs as well as files with single reads. It should be able to split your files in the way you want, as well as do trimming at the same time if you wish.
Old 02-27-2012, 05:04 AM   #5
azneto
Member
 
Location: Brazil

Join Date: Dec 2009
Posts: 24
Default

I wrote a script exactly to tackle that issue. You'll find a copy attached.
The script will output either an interleaved mate pair fastq or two fastq files. The unpaired reads will also be saved in a separate file. The script uses a regular expression to identify the ID, so let me know if you need help with that. It requires at least as much RAM as the size of the first file. Feel free to use it, and please let me know how we can improve it.
Adhemar
Attached Files
File Type: pl mergeShuffledFastqSeqs.pl (4.3 KB, 751 views)
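
For readers without the attachment, here is a minimal Python sketch of the approach described above: hold the first file in memory keyed by read ID, stream the second file, and write matched pairs and leftovers to separate files. It is not the attached Perl script. The function names and the `ID_PATTERN` default are assumptions made for illustration, and the pattern will need adjusting to your own read names.

```python
import re

# Hypothetical default pattern: capture the header up to the first whitespace
# and drop a trailing /1 or /2 if present. Adjust to your own read names.
ID_PATTERN = re.compile(r"^@(\S+?)(?:/[12])?(?:\s|$)")


def read_fastq(path):
    """Yield (read_id, record_text) for each 4-line FASTQ record."""
    with open(path) as handle:
        while True:
            lines = [handle.readline() for _ in range(4)]
            if not lines[0]:
                break
            match = ID_PATTERN.match(lines[0])
            if match is None:
                raise ValueError("unrecognised header: " + lines[0].rstrip())
            yield match.group(1), "".join(lines)


def repair_pairs(fq1, fq2, out1, out2, out_unpaired):
    """Write matched pairs to out1/out2 and everything else to out_unpaired."""
    first = dict(read_fastq(fq1))  # the whole first file is held in memory
    n_pairs = 0
    with open(out1, "w") as o1, open(out2, "w") as o2, \
            open(out_unpaired, "w") as unpaired:
        for read_id, record in read_fastq(fq2):
            mate = first.pop(read_id, None)
            if mate is None:
                unpaired.write(record)  # read 2 without a mate
            else:
                o1.write(mate)
                o2.write(record)
                n_pairs += 1
        for record in first.values():  # reads of file 1 that never matched
            unpaired.write(record)
    return n_pairs
```

The paired outputs follow the order of the second file, so the two files stay in sync with each other even if the inputs were shuffled.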
Old 02-27-2012, 05:28 AM   #6
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 628
Default

Maybe of interest as well, PairedreadFinder:
Quote:
Usage: PairedreadFinder, Version 1.01. This tool takes two fasta/q files and looks for matching readnames in both files. [OPTION]...

-h, --help displays this help message
-v, --version return program version
-s1, --source1 input file 1
-s2, --source2 input file 2
-f, --format input file format
-t1, --target1 target file 1
-t2, --target2 target file 2
-n, --nr-threads nr of threads to use (default 1)
-is, --suffix-ignore nr of characters to ignore from the END of the readname (in case paired reads are named like /1 /2 it should be set to 2) (default 0)
-ip, --prefix-ignore nr of characters to ignore from the BEGINNING of the readname (in case paired reads are named like s_1.. s_2.. it should be set to 3) (default 0)
from FAR, http://sourceforge.net/apps/mediawik...itle=Main_Page
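
To make the two ignore options concrete, here is a small Python sketch of what trimming a fixed number of characters from either end of a read name does. It only illustrates the idea from the help text above and is not PairedreadFinder's code; the function name and its parameters are hypothetical.

```python
def normalise_name(name, ignore_prefix=0, ignore_suffix=0):
    """Drop a fixed number of characters from the start and end of a read name."""
    end = max(len(name) - ignore_suffix, ignore_prefix)
    return name[ignore_prefix:end]


# Mates named .../1 and .../2 compare equal once 2 trailing characters are ignored.
print(normalise_name("read7/1", ignore_suffix=2) == normalise_name("read7/2", ignore_suffix=2))
# Mates named s_1... and s_2... compare equal once 3 leading characters are ignored.
print(normalise_name("s_1_0001", ignore_prefix=3) == normalise_name("s_2_0001", ignore_prefix=3))
```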

Sven
Old 02-27-2012, 02:47 PM   #7
dejavu2010
Member
 
Location: usa

Join Date: Jan 2012
Posts: 21
Default

Hi azneto, how do I set up the regular expression for a header like the following?
@HWI-ST829:138071VACXX:1:1101:1131:2048 1:N:0:ATCACG.

Thanks.
Old 02-28-2012, 02:12 AM   #8
azneto
Member
 
Location: Brazil

Join Date: Dec 2009
Posts: 24
Default

Hi dejavu2010.

@HWI-ST829:138071VACXX:1:1101:1131:2048 1:N:0:ATCACG.

you can use: '^@(\S+)\s[1|2]\S+$'

Assuming that 1 and 2 will appear right after the space char.
'@' 'ID' 'space' '1 or 2' '...'

I'll add this example to the script.
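
A quick way to check a pattern before running it on a large file is to try it on one header from each read file. The sketch below does that in Python with the pattern from this post; the `strict` variant and the `read_id` helper are assumptions added for illustration and are not part of the script.

```python
import re

header_r1 = "@HWI-ST829:138071VACXX:1:1101:1131:2048 1:N:0:ATCACG"
header_r2 = "@HWI-ST829:138071VACXX:1:1101:1131:2048 2:N:0:ATCACG"

# Pattern from the post. Note that [1|2] is a character class, so it also
# accepts a literal '|' in that position.
posted = re.compile(r"^@(\S+)\s[1|2]\S+$")
# Stricter variant (an assumption, not part of the script): only 1 or 2.
strict = re.compile(r"^@(\S+)\s[12]\S+$")


def read_id(pattern, header):
    """Return the captured read ID, or None if the header does not match."""
    match = pattern.match(header)
    return match.group(1) if match else None


# Both mates must reduce to the same ID for the pairing to work.
print(read_id(posted, header_r1) == read_id(posted, header_r2))
print(read_id(strict, header_r1))
```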
Old 02-28-2012, 09:56 AM   #9
dejavu2010
Member
 
Location: usa

Join Date: Jan 2012
Posts: 21
Default

Hi

my process got killed

perl mergeShuffledFastqSeqs.pl -f1 2044-BH-1_1_sequence.txt -f2 2044-BH-1_2_sequence.txt -r '^@(\S+)\s[1|2]\S+$' -o 2044-BH-1 -t

Loading the first file...Killed

2044-BH-1_1_sequence.txt is 18 GB and the other file is 17 GB. We have a server with 32 dual-core CPUs and 192 GB of memory. I wonder what could be the reason it got killed.

thx
Old 02-28-2012, 10:27 AM   #10
epistatic
Senior Member
 
Location: Dronning Maud Land

Join Date: Mar 2009
Posts: 129
Default

Picard has a FixMateInformation tool to "Ensure that all mate-pair information is in sync between each read and its mate pair."
http://picard.sourceforge.net/comman...ateInformation

If you are in Galaxy this is implemented under Picard as: Paired Read Mate Fixer
Old 02-28-2012, 11:09 AM   #11
azneto
Member
 
Location: Brazil

Join Date: Dec 2009
Posts: 24
Default

Hi,
It most probably is a memory issue.
The script loads only the first file into the memory and starts to match with the entries in the second file. You'll have to monitor the memory usage ('top' or 'free -m').
I just ran a test and Perl used 220 GB of RAM for two 33 GB fastq files.
Soon I'll start to search for alternative ways to handle memory using perl in order to improve the script. I'll let you know.
-Adhemar
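
One possible way to cut memory use, sketched here in Python rather than Perl, is to keep only read IDs in memory instead of whole records and make extra passes over the files. This is an assumption about what a lower-memory design could look like, not the script's actual method. All function names and the ID rule (text after '@' up to the first whitespace, minus a trailing /1 or /2) are hypothetical.

```python
def fastq_records(path):
    """Yield (read_id, record_text) for each 4-line FASTQ record."""
    with open(path) as handle:
        while True:
            lines = [handle.readline() for _ in range(4)]
            if not lines[0]:
                return
            # Hypothetical ID rule: text after '@' up to the first whitespace,
            # with a trailing /1 or /2 removed.
            name = lines[0][1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name, "".join(lines)


def shared_ids(fq1, fq2):
    """Return the set of read IDs present in both files (IDs only in memory)."""
    ids1 = {name for name, _ in fastq_records(fq1)}
    return {name for name, _ in fastq_records(fq2) if name in ids1}


def filter_to_shared(fq_in, fq_out, keep):
    """Copy records whose ID is in keep, preserving input order."""
    kept = 0
    with open(fq_out, "w") as out:
        for name, record in fastq_records(fq_in):
            if name in keep:
                out.write(record)
                kept += 1
    return kept
```

Each output keeps the order of its own input, so the two filtered files line up only when the shared reads appear in the same relative order in both inputs. That holds when reads were simply dropped from one file, but not when the files were shuffled independently.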
Old 02-28-2012, 02:51 PM   #12
dejavu2010
Member
 
Location: usa

Join Date: Jan 2012
Posts: 21
Default Thanks, everybody, I got it resolved

Thanks, everybody. I got it resolved.
Old 03-21-2012, 01:22 PM   #13
dejavu2010
Member
 
Location: usa

Join Date: Jan 2012
Posts: 21
Default

I think Trimmomatic pairs your reads based on input order, not on the lane_position_... combination. I tested it on one unmatched dataset and it could not handle it.
Old 05-08-2012, 02:02 PM   #14
SES
Senior Member
 
Location: Vancouver, BC

Join Date: Mar 2010
Posts: 275
Default

Quote:
Originally Posted by kmcarr View Post
m,

Thanks for the acknowledgement. Here is a link to the thread. If you go there you'll see that I just posted an update. Due to a limitation in cdbfasta my method will not work for large input fastq files. The only work-around at the moment is to split the input up into smaller chunks.
The other work-around is to stick with fasta files (smaller index files). If you are mapping to a reference and the quality scores are important then this probably won't help, but for assembly it doesn't matter.

Quote:
Originally Posted by sklages View Post
Maybe of interest as well, PairedreadFinder:
from FAR, http://sourceforge.net/apps/mediawik...itle=Main_Page

Sven
Has anyone actually used this program and found it to work correctly? I gave it a try but found several bugs. First, it inserted random blank lines in the individual "paired" output files. Second, the individual "paired" files actually differed by more than 800 records with my data, which means that trying to interleave the two files causes them to get all out of order. Of course, this could be due to the data or the user, but I've successfully used other methods with the same data and the usage of the program is quite simple, so I'm a bit skeptical on this one.
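
A check like the one described above (do the two "paired" files really line up?) can be sketched in a few lines of Python. This assumes strict four-line FASTQ records with no blank lines inside the file, and the same hypothetical ID rule as elsewhere: text after '@' up to the first whitespace, minus a trailing /1 or /2. The function names are made up for illustration.

```python
from itertools import zip_longest


def header_ids(path):
    """Yield the read ID from every 4th line, starting with the first."""
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0 and line.strip():
                name = line[1:].split()[0]
                yield name[:-2] if name.endswith(("/1", "/2")) else name


def first_mismatch(fq1, fq2):
    """Return (record_index, id1, id2) for the first out-of-sync record, else None."""
    pairs = zip_longest(header_ids(fq1), header_ids(fq2))
    for index, (id1, id2) in enumerate(pairs):
        if id1 != id2:
            return index, id1, id2
    return None
```

A result of None means every record index holds the same ID in both files. If one file is longer, the missing side is reported as None in the returned tuple.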
Old 08-11-2012, 11:59 AM   #15
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

Hi All,

I have exactly this problem as well, but with fasta files. Anybody know of a program that will work with Fasta or could modify 'mergeShuffledFastqSeqs.pl' so it would work on that format as well?

Much appreciated.
Old 08-11-2012, 08:11 PM   #16
carojasq
Junior Member
 
Location: Bogotá , Colombia

Join Date: Aug 2012
Posts: 4
Default

I'm not sure this will work for you, but if you have left and right fasta files and want to separate the paired-end reads from the single-end reads, you may find this script useful: https://github.com/lexnederbragt/den...leave_pairs.py
Needs biopython installed. http://biopython.org/wiki/Biopython
Old 08-12-2012, 05:07 AM   #17
kga1978
Senior Member
 
Location: Boston, MA

Join Date: Nov 2010
Posts: 100
Default

That would probably have done the job, but unfortunately we don't have Biopython installed on our cluster so I can't run it (same with Bioperl).

What I am looking for is basically a standalone script that would do the trick - like the one for Fastq files, which works well.
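
For what it's worth, a standalone version for FASTA needs nothing beyond the Python standard library. The sketch below is not a port of mergeShuffledFastqSeqs.pl. It assumes mates share an ID once a trailing /1 or /2 is removed, it holds the first file in memory, and the function names are hypothetical.

```python
def read_fasta(path):
    """Yield (read_id, record_text) for each FASTA record; sequences may wrap."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                fields = line[1:].split()
                name = fields[0] if fields else ""
                # Assumed naming scheme: mates differ only by a /1 or /2 suffix.
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                chunks = [line]
            elif name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def pair_fasta(fa1, fa2, out1, out2, out_single):
    """Write matched pairs to out1/out2 and unmatched reads to out_single."""
    first = dict(read_fasta(fa1))  # the first file is held in memory
    n_pairs = 0
    with open(out1, "w") as o1, open(out2, "w") as o2, \
            open(out_single, "w") as single:
        for name, record in read_fasta(fa2):
            mate = first.pop(name, None)
            if mate is None:
                single.write(record)
            else:
                o1.write(mate)
                o2.write(record)
                n_pairs += 1
        for record in first.values():  # reads of file 1 that never matched
            single.write(record)
    return n_pairs
```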
Old 11-29-2012, 03:44 PM   #18
bmtb
Member
 
Location: canada

Join Date: Oct 2012
Posts: 11
Default mergeShuffledfastqseqs.pl script issues

I can't get the mergeShuffledFastqSeqs.pl script to work with my data. I have two shuffled paired read files (one is 63GB in size and the other is 35GB). When I submit the job via batch script, it gets killed and I'm left with the following message: swap rate due to memory oversubscription is too high.
I've allocated 512GB of memory for this run so I don't think it has to do with that. Also, I've tried predefining the hash table size in the merge...pl script to be between 4 and 100 billion, but this hasn't worked. Does anyone have any ideas?

Thanks,
bmtb



Quote:
Originally Posted by azneto View Post
Hi,
It most probably is a memory issue.
The script loads only the first file into the memory and starts to match with the entries in the second file. You'll have to monitor the memory usage ('top' or 'free -m').
I just ran a test and Perl used 220 GB of RAM for two 33 GB fastq files.
Soon I'll start to search for alternative ways to handle memory using perl in order to improve the script. I'll let you know.
-Adhemar
Old 12-17-2012, 03:04 PM   #19
azneto
Member
 
Location: Brazil

Join Date: Dec 2009
Posts: 24
Default

Hi bmtb,
Sorry it took me so long to reply.
The version of the script you have needs about 40x the size of the f1 file in RAM.
I've just attached a version that uses about 6x.
So, if you use the 35GB file as f1 you should be able to run it this time.
Please let me know if it worked.
Perl hashes are really memory-consuming structures and we're studying alternatives.
Best,
Adhemar
Attached Files
File Type: pl mergeShuffledFastqSeqs.pl (4.8 KB, 102 views)
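
Taking the rough multipliers above at face value, the arithmetic for the file sizes mentioned in this thread works out as follows. The multipliers are approximations (the earlier test of 220 GB for 33 GB files comes to roughly 6.7x), so treat the numbers as order-of-magnitude estimates; the helper name is hypothetical.

```python
def estimated_ram_gb(f1_size_gb, multiplier):
    """Rough RAM estimate: size of the f1 file times the script's multiplier."""
    return f1_size_gb * multiplier


cases = [
    ("first script, 63 GB file as f1", 63, 40),
    ("first script, 35 GB file as f1", 35, 40),
    ("updated script, 35 GB file as f1", 35, 6),
]
for label, size_gb, factor in cases:
    print(f"{label}: ~{estimated_ram_gb(size_gb, factor)} GB")
# first script, 63 GB file as f1: ~2520 GB
# first script, 35 GB file as f1: ~1400 GB
# updated script, 35 GB file as f1: ~210 GB
```

Only the last case fits within a 512 GB allocation, which is why passing the smaller file as f1 matters.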
Old 12-17-2012, 03:12 PM   #20
bmtb
Member
 
Location: canada

Join Date: Oct 2012
Posts: 11
Default

Hi azneto,

I guess I should have posted this earlier, but I actually got your first script to work by increasing the memory allocation. Thanks for the updated script though.

Cheers,
bmtb
Tags
match reads, paired end read, program, uneven number

All times are GMT -8.