Seqanswers Leaderboard Ad

**fanli** · 04-21-2016, 07:17 AM

Can you paste a few lines from your R1 and R2 fastq files? My guess is that something went wrong in the BAM->FASTQ conversion and that's why they are showing up as discordant

**GenoMax** · 04-21-2016, 07:22 AM

@ErikFas: This does sound like a classic case of reads having gone out of order. Can you try repair.sh from BBMap on your fastq files and see if that helps any?

@fanli may also be on to something. If the read headers were changed/messed-up in some way then the re-pair programs may not be able to realign the reads.

**ErikFas** · 04-21-2016, 10:21 PM

Originally posted by fanli View Post

Can you paste a few lines from your R1 and R2 fastq files? My guess is that something went wrong in the BAM->FASTQ conversion and that's why they are showing up as discordant

First 5 read pairs from each file:

Code:

R1:
@HWUSI-EAS101E:4:FC:2:1:1028:9892_2:N:0:/1
CCTTGCTCAGCTCACACCGCAGCGTGGCCGTGGCCCCTTCTGTGGCCTCCT
+
IIIIIFIIHIGHIFIIGDIIIIIGIIGIEIEHH@FEE+CC2;5;;;?94?5
@HWUSI-EAS101E:4:FC:2:1:1029:7552_2:N:0:/1
CTTGTTTGAGGAAGCAGAGAAGAATGAAAAGAAAGGCCTGAAGGACTGGGA
+
G:D@DD=13@@12@2;1774EEB<DB>9?>GAAAADB>DB?-EEDGGGG@G
@HWUSI-EAS101E:4:FC:2:1:1029:10586_2:N:0:/1
TCCAACTCTACATCACTTGAGTTCTCATCTTCAACCTCTCCTTCTTCAGCT
+
GIDIIDIIIGHIHIIG<GGEHIIIGHGIIIIIGHIGIIBIIHGGDIII@GI
@HWUSI-EAS101E:4:FC:2:1:1029:10648_1:N:0:/1
CGACAGCTCCTCAACTGCCTCCATGTCATCACCCTGTACAACCGCATCAAG
+
HDH:HGHHHHHHHGHHGHHHHHHHHHHHHHHEDFHHHHHHHHHDGHGHGHH
@HWUSI-EAS101E:4:FC:2:1:1029:10648_2:N:0:/1
CTGCCGGGTCATGGTTGACCACATCCCCGATGGCTGTGACGAGTCTGATGA
+
=HGIHGGEG<;B?BBIIG<GIEGGIEGDG<DED>EGABEEGEGDGDBDGG8

Code:

R2:
@HWUSI-EAS101E:4:FC:2:1:1028:9892_2:N:0:/2
CCTTGCTCAGCTCACACCGCAGCGTGGCCGTGGCCCCTTCTGTGGCCTCCT
+
IIIIIFIIHIGHIFIIGDIIIIIGIIGIEIEHH@FEE+CC2;5;;;?94?5
@HWUSI-EAS101E:4:FC:2:1:1029:7552_2:N:0:/2
CTTGTTTGAGGAAGCAGAGAAGAATGAAAAGAAAGGCCTGAAGGACTGGGA
+
G:D@DD=13@@12@2;1774EEB<DB>9?>GAAAADB>DB?-EEDGGGG@G
@HWUSI-EAS101E:4:FC:2:1:1029:10586_2:N:0:/2
TCCAACTCTACATCACTTGAGTTCTCATCTTCAACCTCTCCTTCTTCAGCT
+
GIDIIDIIIGHIHIIG<GGEHIIIGHGIIIIIGHIGIIBIIHGGDIII@GI
@HWUSI-EAS101E:4:FC:2:1:1029:10648_1:N:0:/2
CGACAGCTCCTCAACTGCCTCCATGTCATCACCCTGTACAACCGCATCAAG
+
HDH:HGHHHHHHHGHHGHHHHHHHHHHHHHHEDFHHHHHHHHHDGHGHGHH
@HWUSI-EAS101E:4:FC:2:1:1029:10648_2:N:0:/2
CTGCCGGGTCATGGTTGACCACATCCCCGATGGCTGTGACGAGTCTGATGA
+
=HGIHGGEG<;B?BBIIG<GIEGGIEGDG<DED>EGABEEGEGDGDBDGG8

I haven't really looked that much at raw FASTQ files before, but if I compare to other data I have (the actual raw data), they have the read pair info (_1 or _2) in the middle of the header. Looking at some of the rows above, it seems that a read can sometimes be _1 in the middle, while being /2 at the end. Could this be the problem?

I ran the same FASTQ files as above through BBMap's repair.sh:

Code:

Executing jgi.SplitPairsAndSingles [rp, in=exp2_1.fastq.gz, in2=exp2_2.fastq.gz, out=exp2.repaired_1.fastq.gz, out2=exp2.repaired_2.fastq.gz, outs=exp2_singletons.fastq.gz, -Xmx12g]

Set INTERLEAVED to false
Started output stream.

Input:                  	9205608 reads 		469486008 bases.
Result:                 	9205608 reads (100.00%) 	469486008 bases (100.00%)
Pairs:                  	9205608 reads (100.00%) 	469486008 bases (100.00%)
Singletons:             	0 reads (0.00%) 	0 bases (0.00%)

Time:   			16.581 seconds.
Reads Processed:       9205k 	555.18k reads/sec
Bases Processed:        469m 	28.31m bases/sec

The first 5 read pairs are identical to the ones above. If it is the problem as above, then the repair/syncing scripts won't work because they are, as far as they can tell, already synced because of the correct /1 or /2 at the ends, but the actual alignment step takes the _1 or _2 in the middle?

**GenoMax** · 04-22-2016, 03:16 AM

If you take a look at the FASTQ Illumina read header specification you will agree that someone has changed the read headers in your dataset so they are a mishmash of the two formats used. This does not even appear to be barcoded data (or if it was that barcode has been removed from the header).

I think you should change the headers back to one of the two types (if this is old data appropriate for the old headers then use those) and then try repair.sh again.

**ErikFas** · 04-22-2016, 03:32 AM

Yes indeed, they don't seem to match either format... I'm quite unsure as to what parts I should change, and how. Is it the "_2" part in reads with "/1" at the end, or other things as well? For example, I don't seem to have a flow cell ID, just "FC" with nothing next to it. What part(s) of the header are actually used in the alignments?

**GenoMax** · 04-22-2016, 04:16 AM

To me this data seems to be from Illumina v.1.8 so you could take out the /1/2 at the end of the reads and then remove the _ before the _1:N and _2:N. Flowcell ID also appears to have been taken out (leaving a simple FC) but that is not so critical.

Code:

@HWUSI-EAS101E:4:FC:2:1:1029:10648_2:N:0:/2

to

Code:

@HWUSI-EAS101E:4:FC:2:1:1029:10648 2:N:0:

If this is barcoded data then add that back in the header like

Code:

@HWUSI-EAS101E:4:FC:2:1:1029:10648 2:N:0:ACCGTC

One reason people do modifications like your data is if they want to preserve the read info in their BAM's. Generally aligners will drop the rest of read header text once they encounter a space in the name.

**fanli** · 04-22-2016, 06:56 AM

Am I blind or do the reads appear identical to me?

**GenoMax** · 04-22-2016, 07:28 AM

Originally posted by fanli View Post

Am I blind or do the reads appear identical to me?

Ha! Great catch.

Someone stuck a /1 and /2 at the end of fastq headers of the same datafile and called it a day. The files also seem to be a mix of Read1 and Read 2, if the _1 and _2 are real.

**ErikFas** · 04-22-2016, 07:59 AM

Yeah, they do seem identical... I didn't catch it either, so I did what Genomax said and I'm currently aligning the results - I guess we'll see what happens! But, what if they are identical?

**fanli** · 04-22-2016, 08:01 AM

It's pointless to redo the alignment if the reads are fundamentally flawed. You should probably go back and see if whoever did the BAM->FASTQ conversion made a mistake.

**GenoMax** · 04-22-2016, 08:09 AM

@ErikFas: Since the reads are mixed the results would probably look no better.

If you must work with what you have then separate Read 1 and Read 2 data into separate files and start over.

But I do agree with @fanli. If you have to mess with the raw data to this extent then it is always better to go back and and get an un-tampered version (for the lack of a better word).

**ErikFas** · 04-22-2016, 01:12 PM

Yes, that's what I've trying to do just after I got the first Cufflinks results, actually, but since as the person whom I'm in contact with isn't actually a bioinformatician and the person responsible for the analysis is not available... I'm up **** creek without a paddle in that regard!

I guess I'll have to figure out how to separate them by some script and see if that does it. Would you say that the "_1" and "_2" are the correct reads, then, and that I should separate based on that? The reads have to be in the same order in the two files, if I don't misremember, or?

**GenoMax** · 04-22-2016, 02:18 PM

I must say that we may not know for sure (until you confirm from the person in the know) that the _1 and _2 are actually represent original Read 1 and Read 2.

But desperate times call for similar measures. Following is tested only on the snippet above and may not work with your full file but give it a try.

Code:

$ grep -A 3 "_1:N" your_file | grep -v "\-\-" > read1_file

$ grep -A 3 "_2:N" your_file | grep -v "\-\-" > read2_file

Follow this up with removal of the _ and the /1/2 like you did this morning and then pass the two files through repair.sh again. This will ensure that the reads are in the same order in the final files.

Let us see if that works.

**ErikFas** · 04-23-2016, 03:27 AM

Well, I can't get repair.sh to run, because apparently some of the reads and different read and quality lengths, so I guess something happened in the code you gave. I'm not sure what the second grep in your command does. It greps the lines that don't contains "-"?

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 39 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 41 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 35 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Weird Cufflinks results from odd alignments

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News