Thread | Thread Starter | Forum | Replies | Last Post |
BSMAP: whole genome Bisulfite Sequence MAPping program | wei | Epigenetics | 4 | 03-20-2014 02:13 PM |
How to make HiSeq indexed paired-end library with homemade oligos? | ostrakon | Illumina/Solexa | 6 | 03-16-2012 05:22 AM |
Difference between mate pair and pair end | bassu | General | 2 | 06-19-2010 07:13 AM |
pair-end sequencing produces single-end read artifact | pparg | Bioinformatics | 9 | 03-29-2010 12:15 PM |
whole genome Bisulfite Sequence MAPping program | wei | Bioinformatics | 0 | 08-07-2009 03:46 PM |
#1
Member
Location: usa Join Date: Jan 2012
Posts: 21
I have a PE100 run. The problem is that some tiles of read 2 are corrupted, which has left an uneven number of reads in read 1 and read 2. Are there any programs which can match up the sequences from both reads and keep only the matched pairs for downstream analysis? Thanks,
mike
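As a quick first check, counting the records in each file confirms this kind of mismatch before any pairing is attempted. This is only a minimal sketch, and the file names are placeholders:

```python
# Tiny check (not a fix): count records in each FASTQ to confirm that the two
# files really do contain different numbers of reads.
def count_fastq_records(path):
    with open(path) as fh:
        return sum(1 for _ in fh) // 4   # 4 lines per FASTQ record

if __name__ == "__main__":
    for f in ("reads_1.fastq", "reads_2.fastq"):
        print(f, count_fastq_records(f))
```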
#2
Member
Location: London, UK Join Date: Nov 2011
Posts: 12
I remember a nice contribution from kmcarr a while back which can probably help; search for thread 10392. (Incidentally, I can't recommend my own contribution to that thread - it is hideously slow.)
Best, m

Last edited by mgg; 02-27-2012 at 03:55 AM. Reason: typo correction
#3
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
Thanks for the acknowledgement. Here is a link to the thread. If you go there you'll see that I just posted an update: due to a limitation in cdbfasta, my method will not work for large input fastq files. The only work-around at the moment is to split the input up into smaller chunks.
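For anyone taking the split-into-chunks route, below is a minimal sketch of one way to do the splitting. The chunk size and output naming are arbitrary choices for illustration, not part of kmcarr's method:

```python
# Sketch: stream a large FASTQ into numbered chunk files of a fixed record
# count so each chunk stays within an indexer's limits. Chunk size and the
# "chunk_NNN.fastq" naming are examples only.
from itertools import islice

def split_fastq(path, records_per_chunk=4_000_000, prefix="chunk"):
    with open(path) as fq:
        part = 0
        while True:
            first = fq.readline()
            if not first:
                break                      # no more records
            with open(f"{prefix}_{part:03d}.fastq", "w") as out:
                out.write(first)
                # copy the rest of this chunk line by line (4 lines per record)
                for line in islice(fq, records_per_chunk * 4 - 1):
                    out.write(line)
            part += 1

if __name__ == "__main__":
    split_fastq("reads_1.fastq")
```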
#4
Member
Location: Uppsala, Sweden Join Date: Apr 2010
Posts: 29
The trimmer Trimmomatic outputs files with intact pairs as well as files with single reads. It should be able to split your files in the way you want, as well as do trimming at the same time if you wish.
#5
Member
Location: Brazil Join Date: Dec 2009
Posts: 24
I wrote a script exactly to tackle that issue. You'll find a copy attached.
The script will output either an interleaved mate-pair fastq or two fastq files. The unpaired reads will also be saved in a separate file. The script uses a regular expression to identify the ID, so let me know if you need help with that. It requires at least as much RAM as the size of the first file. Feel free to use it, and please let me know how we can improve it.
Adhemar
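For readers who just want the general idea rather than the attachment, here is a minimal sketch of this kind of pairing. It is not azneto's mergeShuffledFastqSeqs.pl; the file names and the "ID = header text before the first space" rule are assumptions for illustration:

```python
# Sketch of pairing two FASTQ files by read ID: hold file 1 in a dict keyed by
# ID, stream file 2, and write matched pairs plus any leftover singletons.
import re

ID_RE = re.compile(r'^@(\S+)')  # assume the pair ID is everything before the first space

def read_fastq(path):
    """Yield (read_id, full_4_line_record) from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            seq, plus, qual = fh.readline(), fh.readline(), fh.readline()
            yield ID_RE.match(header).group(1), header + seq + plus + qual

def pair_up(fq1, fq2, out1="paired_1.fastq", out2="paired_2.fastq", single="unpaired.fastq"):
    first = dict(read_fastq(fq1))          # needs RAM roughly proportional to fq1
    with open(out1, "w") as o1, open(out2, "w") as o2, open(single, "w") as so:
        for rid, rec in read_fastq(fq2):
            mate = first.pop(rid, None)
            if mate is None:
                so.write(rec)              # read 2 with no mate
            else:
                o1.write(mate)
                o2.write(rec)
        for rec in first.values():         # reads from file 1 left unmatched
            so.write(rec)

if __name__ == "__main__":
    pair_up("reads_1.fastq", "reads_2.fastq")
```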
#6
Senior Member
Location: Berlin, DE Join Date: May 2008
Posts: 628
Maybe of interest as well, PairedreadFinder:
Sven
#7
Member
Location: usa Join Date: Jan 2012
Posts: 21
Hi azneto, how do I set up the regular expression for a header like the following?
@HWI-ST829:138 ...
Thanks.
#8
Member
Location: Brazil Join Date: Dec 2009
Posts: 24
Hi dejavu2010,
For a header like @HWI-ST829:138 ... you can use: '^@(\S+)\s[1|2]\S+$', assuming that the 1 or 2 will appear right after the space character. The pattern is: '@' 'ID' 'space' '1 or 2' '...'
I'll add this example to the script.
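If you want to sanity-check that pattern against your own headers before running the script, a quick test like the one below works. The full header shown is made up for illustration, and Python's regex engine accepts the same expression here:

```python
# Quick check of the ID regex from the post above against an invented
# Casava 1.8-style header. Group 1 should be identical for read 1 and read 2.
import re

pattern = re.compile(r'^@(\S+)\s[1|2]\S+$')

header_r1 = "@HWI-ST829:138:D0ABCDEFG:1:1101:1232:2199 1:N:0:ATCACG"
header_r2 = "@HWI-ST829:138:D0ABCDEFG:1:1101:1232:2199 2:N:0:ATCACG"

for h in (header_r1, header_r2):
    m = pattern.match(h)
    print(m.group(1) if m else "no match")
# Both lines print the same ID, so the two mates can be matched on group 1.
```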
#9
Member
Location: usa Join Date: Jan 2012
Posts: 21
Hi,
My process got killed:

perl mergeShuffledFastqSeqs.pl -f1 2044-BH-1_1_sequence.txt -f2 2044-BH-1_2_sequence.txt -r '^@(\S+)\s[1|2]\S+$' -o 2044-BH-1 -t
Loading the first file...Killed

2044-BH-1_1_sequence.txt is 18 GB and the other one is 17 GB. We have a server with 32 dual-core CPUs and 192 GB of memory. I wonder what could be the reason it got killed. Thanks.
#10
Senior Member
Location: Dronning Maud Land Join Date: Mar 2009
Posts: 129
Picard has a FixMateInformation tool to "Ensure that all mate-pair information is in sync between each read and it's mate pair."
http://picard.sourceforge.net/comman...ateInformation
If you are in Galaxy, this is implemented under Picard as: Paired Read Mate Fixer
#11
Member
Location: Brazil Join Date: Dec 2009
Posts: 24
Hi,
It is most probably a memory issue. The script loads only the first file into memory and then starts matching against the entries in the second file. You'll have to monitor the memory usage ('top' or 'free -m'). I just ran a test and perl uses 220 GB of RAM for two 33 GB fastq files. Soon I'll start searching for alternative ways to handle memory in perl in order to improve the script. I'll let you know.
-Adhemar
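One common way to cut the memory footprint is to keep only read IDs in RAM and make extra passes over the files instead of loading whole records. This is not the attached Perl script, just a sketch of that alternative, under the assumption that IDs are much smaller than full records and that the ID is the header text before the first space:

```python
# Sketch of a lower-memory pairing strategy: pass 1 collects only the IDs of
# file 2, pass 2 streams file 1 and keeps reads whose mate exists, pass 3 does
# the same for file 2. File names are examples only.
def fastq_records(path):
    """Yield (read_id, full_4_line_record) from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            rest = fh.readline() + fh.readline() + fh.readline()
            yield header.split()[0].lstrip("@"), header + rest

def pair_up_low_mem(fq1, fq2, out1="matched_1.fastq", out2="matched_2.fastq"):
    ids2 = {rid for rid, _ in fastq_records(fq2)}   # pass 1: IDs of file 2 only
    kept = set()
    with open(out1, "w") as o1:                     # pass 2: file 1 reads with a mate
        for rid, rec in fastq_records(fq1):
            if rid in ids2:
                o1.write(rec)
                kept.add(rid)
    with open(out2, "w") as o2:                     # pass 3: file 2 reads with a mate
        for rid, rec in fastq_records(fq2):
            if rid in kept:
                o2.write(rec)

if __name__ == "__main__":
    pair_up_low_mem("reads_1.fastq", "reads_2.fastq")
```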
#12
Member
Location: usa Join Date: Jan 2012
Posts: 21
Thanks, everybody, I got it resolved.
#13
Member
Location: usa Join Date: Jan 2012
Posts: 21
I feel that Trimmomatic indexes your reads based on input order, not on the lane_position_... combination. I tested one unmatched dataset and it could not handle it.
#14
Senior Member
Location: Vancouver, BC Join Date: Mar 2010
Posts: 275
#15
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
Hi all,
I have exactly this problem as well, but with fasta files. Does anybody know of a program that will work with fasta, or could someone modify 'mergeShuffledFastqSeqs.pl' so it would work on that format as well? Much appreciated.
#16
Junior Member
Location: Bogotá , Colombia Join Date: Aug 2012
Posts: 4
I'm not sure whether this works for you, but if you have left and right fasta files and want to detect the paired-end and single-end reads, you will find this script useful: https://github.com/lexnederbragt/den...leave_pairs.py
It needs Biopython installed: http://biopython.org/wiki/Biopython
#17
Senior Member
Location: Boston, MA Join Date: Nov 2010
Posts: 100
That would probably have done the job, but unfortunately we don't have Biopython installed on our cluster, so I can't run it (same with BioPerl).
What I am looking for is basically a standalone script that would do the trick - like the one for fastq files, which works well.
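In case a rough starting point helps, the same ID-matching idea can be done for FASTA with only the standard library. The sketch below handles multi-line sequences and assumes mates share the header text before the first space, which may not hold for every naming scheme; the file names are examples:

```python
# Sketch of pairing left/right FASTA files without Biopython: load file 1 into
# a dict keyed by ID, stream file 2, and write pairs plus leftover singletons.
def fasta_records(path):
    """Yield (read_id, full_record_text) from a FASTA file."""
    with open(path) as fh:
        rid, lines = None, []
        for line in fh:
            if line.startswith(">"):
                if rid is not None:
                    yield rid, "".join(lines)
                rid, lines = line[1:].split()[0], [line]
            else:
                lines.append(line)
        if rid is not None:
            yield rid, "".join(lines)

def pair_fasta(fa1, fa2, out1="paired_1.fasta", out2="paired_2.fasta", single="unpaired.fasta"):
    left = dict(fasta_records(fa1))                  # holds file 1 in memory
    with open(out1, "w") as o1, open(out2, "w") as o2, open(single, "w") as so:
        for rid, rec in fasta_records(fa2):
            mate = left.pop(rid, None)
            if mate is None:
                so.write(rec)                        # right read with no mate
            else:
                o1.write(mate)
                o2.write(rec)
        for rec in left.values():                    # leftovers from file 1
            so.write(rec)

if __name__ == "__main__":
    pair_fasta("reads_1.fasta", "reads_2.fasta")
```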
#18
Member
Location: canada Join Date: Oct 2012
Posts: 11
I can't get the mergeShuffledFastqSeqs.pl script to work with my data. I have two shuffled paired-read files (one is 63GB in size and the other is 35GB). When I submit the job via a batch script, it gets killed and I'm left with the following message: "swap rate due to memory oversubscription is too high".
I've allocated 512GB of memory for this run, so I don't think it has to do with that. Also, I've tried predefining the hash table size in the merge...pl script to be between 4-100 billion, but this hasn't worked. Anyone have any ideas?
Thanks, bmtb
#19
Member
Location: Brazil Join Date: Dec 2009
Posts: 24
Hi bmtb,
Sorry it took me so long to reply. The version of the script you have uses 40x the size of the f1 file; I've just attached a version that uses about 6x. So, if you use the 35GB file as f1, you should be able to run it this time. Please let me know if it worked. Perl hashes are really memory-consuming structures and we're studying alternatives.
Best, Adhemar
#20
Member
Location: canada Join Date: Oct 2012
Posts: 11
Hi azneto,
I guess I should have posted this earlier, but I actually got your first script to work by increasing the memory allocation. Thanks for the updated script, though.
Cheers, bmtb
Tags: match reads, paired end read, program, uneven number