I recently ran some FASTQ files from a collaboration through my standard Tophat/Cufflinks pipeline and got some really weird results. Cufflinks yielded mostly zero-FPKM genes, which started me on this frustrating journey. I started by checking samtools flagstat for one of the BAM files (output in the first Code block below).
I have never seen anything like this before. Only 72 properly paired reads? The Tophat alignment_summary.txt is shown in the second Code block below.
So the reads are getting mapped, but discordantly? What does that mean, and how can I fix it? This is obviously affecting the Cufflinks analysis greatly. I've also tried aligning with STAR, but the Cufflinks output from that yields similar results, so it seems it is something with the FASTQ files or their processing.
Some googling suggested that the paired files may be out of sync, so I used this python script to re-sync them. The output looks very similar to what I already had. I'm currently aligning the new FASTQ files, but I don't have high hopes of this fixing the issue. [EDIT: it didn't.]
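For reference, this is roughly how one could check whether two paired FASTQ files are in sync before (or after) running a re-sync script. It is only a minimal sketch, not the script mentioned above. It assumes plain 4-line FASTQ records, and that mates share the same read name once any trailing /1 or /2 and any comment after the first whitespace are stripped. The function names (`normalize_name`, `read_names`, `first_mismatch`) are made up for illustration.

```python
import gzip
from itertools import zip_longest


def normalize_name(header):
    """Strip the leading '@', any comment after whitespace, and a trailing /1 or /2."""
    fields = header.strip()[1:].split()
    name = fields[0] if fields else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def read_names(path):
    """Yield the normalized read name of every record (assumes 4 lines per record)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header line of each record
                yield normalize_name(line)


def first_mismatch(path1, path2, limit=None):
    """Return (record_index, name1, name2) for the first record whose names differ.

    Returns None if no mismatch is found within `limit` records (or in the
    whole files when limit is None). A shorter file shows up as a None name.
    """
    pairs = zip_longest(read_names(path1), read_names(path2))
    for index, (name1, name2) in enumerate(pairs):
        if limit is not None and index >= limit:
            break
        if name1 != name2:
            return index, name1, name2
    return None
```

Calling `first_mismatch("sample_1.fastq", "sample_2.fastq")` (hypothetical file names) returns `None` when every record lines up, and otherwise the 0-based index and the two names where the files first diverge. If this reports no mismatch but the pairs still align discordantly, the problem is probably not the record order.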
If it matters, the FASTQ files I received were, I think, not the actual raw FASTQ files, but rather BAM->FASTQ converted files (my collaborator is unsure, as she is not a bioinformatician and the bioinformatician who did the conversion is not available). I received the FASTQ QC reports and they look good, but I have not had time to run any QC myself.
Does anybody know what's wrong here? This collaboration is really a side-project for me, and I'm quite frustrated that there are so many problems with it...
[EDIT]: I've now also run the alignment for just one of the read pair files, and that gives much better results. So, this could mean that the reads are somehow out of sync, but that the script I used above didn't solve the problem?
Code:
14078966 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
14078966 + 0 mapped (100.00%:-nan%)
14078966 + 0 paired in sequencing
7039483 + 0 read1
7039483 + 0 read2
72 + 0 properly paired (0.00%:-nan%)
14078966 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
1785552 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Code:
Left reads:
          Input     :   4602804
           Mapped   :   4571790 (99.3% of input)
            of these:    328183 ( 7.2%) have multiple alignments (60211 have >20)
Right reads:
          Input     :   4602804
           Mapped   :   4571790 (99.3% of input)
            of these:    328183 ( 7.2%) have multiple alignments (60211 have >20)
99.3% overall read mapping rate.

Aligned pairs:   4571790
     of these:    328183 ( 7.2%) have multiple alignments
                 4571758 (100.0%) are discordant alignments
 0.0% concordant pair alignment rate.
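To dig further into why nearly every pair is reported as discordant, it may help to look at the FLAG column of the alignments directly. The sketch below decodes the standard SAM flag bits and tallies the flag values seen in SAM-formatted text lines. The bit meanings come from the SAM specification; the function names `decode_flag` and `tally_flags` are my own and purely illustrative.

```python
from collections import Counter

# Bit values and meanings as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def tally_flags(sam_lines):
    """Count how often each FLAG value occurs in an iterable of SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        counts[int(fields[1])] += 1
    return counts
```

For example, `decode_flag(99)` gives `['paired', 'proper_pair', 'mate_reverse', 'read1']`, a typical properly paired forward read, whereas `decode_flag(65)` gives only `['paired', 'read1']`: both mates on the forward strand and not a proper pair. Feeding the text output of `samtools view` for a BAM file into `tally_flags` shows which flag combinations dominate, which could hint at whether the mates have an unexpected orientation.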