SEQanswers

Old 02-02-2016, 05:51 AM   #1
spabinger
Member
 
Location: Europe

Join Date: Jun 2011
Posts: 13
Duplicate read names - BWA mem - paired reads have different names

Hi,

Running BWA mem (paired-end, Illumina data), I'm getting the following error (IDs replaced):



[mem_sam_pe] paired reads have different names: "XXX:5:YYY:1:11102:4257:13510", "XXX:5:YYY:1:11102:15792:1058"

I checked the FASTQ files and found that each read name occurs 7 times in each file (exact same name). However, the copies do not sit at the same positions in the two files: in the example below, the line numbers of the third and fifth hits differ between R1 and R2.

Example:

> grep -n "XXX:5:YYY:1:11102:4257:13510" R1.fastq
761397:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
862085:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
962773:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
1063461:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
1164149:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
1264837:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA
1365525:@XXX:5:YYY:1:11102:4257:13510 1:N:0:AGGCAGAA+GCGATCTA

> grep -n "XXX:5:YYY:1:11102:4257:13510" R2.fastq
761397:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
862085:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
1028309:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
1063461:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
1229685:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
1264837:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA
1365525:@XXX:5:YYY:1:11102:4257:13510 2:N:0:AGGCAGAA+GCGATCTA


Is it ok for a fastq file to have multiple reads with the same read name?
If not, could this be a problem of BCL conversion?
How can I fix it?


Thanks for your help,
Stephan


PS: bwa mem command:

bwa mem -t 40 -v 1 hg19.fa R1.fastq R2.fastq > aln.sam
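
To see how widespread the duplication is, rather than grepping for one ID at a time, the read names in a file can be counted. The following is only a minimal sketch: the function names `read_names` and `duplicated_names` are made up for illustration, and it assumes an uncompressed FASTQ with exactly four lines per record (no wrapped sequences).

```python
from collections import Counter
import io


def read_names(handle):
    """Yield the read ID of each record: the header line up to the first
    whitespace, without the leading '@'. Assumes 4 lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 0 and line.strip():
            yield line[1:].split()[0]


def duplicated_names(handle):
    """Return {read_name: count} for every name seen more than once."""
    counts = Counter(read_names(handle))
    return {name: n for name, n in counts.items() if n > 1}


# Tiny in-memory example; for a real file use: open("R1.fastq")
demo = io.StringIO(
    "@r1 1:N:0:AAA\nACGT\n+\nIIII\n"
    "@r2 1:N:0:AAA\nTTGA\n+\nIIII\n"
    "@r1 1:N:0:AAA\nACGT\n+\nIIII\n"
)
print(duplicated_names(demo))  # {'r1': 2}
```

Run on R1 and R2 separately, this would show whether every name is repeated 7 times or only a subset of them.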
Old 02-02-2016, 06:06 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,574

FASTQ headers should always start with an "@", so what you have is not following the standard. Have you asked the folks who gave you this data whether it has been post-processed in some way? And there should be no duplicates (let alone multiples) in raw sequence files, as far as the FASTQ header IDs are concerned.

Last edited by GenoMax; 02-02-2016 at 06:44 AM.
Old 02-02-2016, 06:24 AM   #3
spabinger
Member
 
Location: Europe

Join Date: Jun 2011
Posts: 13

Hi,

that's not the problem. See "head" result (Sequence and quality trimmed) and also the grep result I posted.

> head R1.fastq
@XXX:5:YYY:1:11101:12923:1051 1:N:0:AGGCAGAA+NCGATCTA
CTT...TTC
+
AAA...</<
@XXX:5:YYY:1:11101:4797:1055 1:N:0:AGGCAGAA+NCGATCTA
ACC...CTA
+
AAA...<A/


Thanks,
Stephan
Old 02-02-2016, 06:43 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,574

My apologies.

If the order of the reads in your files is messed up, then you can "re-pair" the reads using the repair tool from the BBMap suite as follows:

Code:
$ repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq
That said, each FASTQ sequence header should be unique within a sequence file. If that is not the case, then there is something wrong with this data.
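
To find where two paired files first fall out of sync, the read names can be compared record by record. The error message suggests BWA makes a comparison of this kind, but the code below is only an illustrative sketch, not BWA's implementation. The names `read_names` and `first_mismatch` are hypothetical, and the files are assumed to be uncompressed FASTQ with four lines per record.

```python
import io
from itertools import zip_longest


def read_names(handle):
    """Yield read IDs (header up to first whitespace, no '@').
    A legacy '/1' or '/2' suffix is dropped so mates compare equal."""
    for i, line in enumerate(handle):
        if i % 4 == 0 and line.strip():
            name = line[1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_mismatch(r1_handle, r2_handle):
    """Return (record_number, name_in_r1, name_in_r2) for the first record
    whose names differ, or None if both files agree throughout.
    A missing mate (files of different length) shows up as None."""
    pairs = zip_longest(read_names(r1_handle), read_names(r2_handle))
    for idx, (n1, n2) in enumerate(pairs, start=1):
        if n1 != n2:
            return idx, n1, n2
    return None


# Tiny in-memory example; for real data use open("R1.fastq"), open("R2.fastq")
r1 = io.StringIO("@a 1:N:0\nAC\n+\nII\n@b 1:N:0\nGT\n+\nII\n@c 1:N:0\nTT\n+\nII\n")
r2 = io.StringIO("@a 2:N:0\nAC\n+\nII\n@c 2:N:0\nAA\n+\nII\n@b 2:N:0\nCA\n+\nII\n")
print(first_mismatch(r1, r2))  # (2, 'b', 'c')
```

A result of `None` after re-pairing would indicate that the two output files are in the same order.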
Old 02-02-2016, 06:52 AM   #5
spabinger
Member
 
Location: Europe

Join Date: Jun 2011
Posts: 13

Thanks for your reply.

I was also suspecting that the raw file is not ok.

Best regards,
Stephan
Old 02-02-2016, 07:00 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,574

If the sequences/Q-scores are identical for those 7 copies, then you could potentially keep just one and throw away the other 6.

I am puzzled by how this could have happened though. No logical explanation comes to mind.
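
The "keep one copy only if all copies are identical" idea could be sketched as below. This is a hedged illustration, not a vetted tool: `parse_fastq` and `collapse_identical` are hypothetical names, the input is assumed to be uncompressed four-line FASTQ, and everything is held in memory, which may not be practical for large files.

```python
import io


def parse_fastq(handle):
    """Yield (header, sequence, quality) from strict 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def collapse_identical(handle):
    """Keep the first copy of each read name, in order of first appearance.
    Raise ValueError if a later copy has a different sequence or quality,
    because then the copies are not true duplicates."""
    seen = {}
    for header, seq, qual in parse_fastq(handle):
        name = header[1:].split()[0]
        if name not in seen:
            seen[name] = (header, seq, qual)
        elif seen[name][1:] != (seq, qual):
            raise ValueError(f"copies of {name} differ; do not collapse blindly")
    return list(seen.values())


# Tiny in-memory example; for a real file use: open("R1.fastq")
demo = io.StringIO(
    "@r1 1:N:0\nACGT\n+\nIIII\n"
    "@r2 1:N:0\nTTGA\n+\nIIII\n"
    "@r1 1:N:0\nACGT\n+\nIIII\n"
)
for header, seq, qual in collapse_identical(demo):
    print(header, seq, qual)
```

Because the copies sit at different positions in R1 and R2, the collapsed files would not necessarily come out in the same order, so the pairing would still need to be checked or restored afterwards.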
Tags
bwa, duplicates, fastq, reads