**enomis** · 09-26-2013, 02:42 AM

No idea how to get rid of the problem?

**Simon Anders** · 09-26-2013, 03:35 AM

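Your SAM file has completely messed-up FLAG fields. The value 113 (decimal) = 0x71 = 0111 0001 (binary) means that this is a paired-end read (0x01 set), that the read and its mate are both on the reverse strand (0x10 and 0x20 set), and that it is the first read of the pair (0x40 set); the bit for the second read of the pair (0x80) is not set. A read flagged like this should come with a mate record, so if the mate cannot be found, something went seriously wrong.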
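
Checking a FLAG value bit by bit is easy to get wrong by hand, so it can help to decode it programmatically. Below is a minimal Python sketch using the bit definitions from the SAM specification; the names `SAM_FLAG_BITS` and `decode_flag` are made up for this illustration and are not part of samtools or HTSeq.

```python
# Bit meanings of the SAM FLAG field, as defined in the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


# The flag values that appear in this thread.
for value in (81, 99, 113, 147):
    print(value, hex(value), decode_flag(value))
```

For 113 this lists `paired`, `reverse_strand`, `mate_reverse_strand` and `first_in_pair`; for 81 it lists `paired`, `reverse_strand` and `first_in_pair`.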

**enomis** · 09-26-2013, 05:36 AM

Thank you for your answer, Simon.

Actually the 113 flag appears after running samtools fixmate.

When I convert my original BAM into SAM, the mentioned reads have flag 81:

Code:

HWI-ST858_57:1:1101:1228:14101#10@0	81	chr1	566918	255	76M	*	0	0	CTCCTNTATCTTAGGGGCCATNNATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA	######AC@DCABBDD??5-(##GEC;A@@F=CF=EEACBD9EGB8D@HF?:JJJJIJJJIJIHFHHHFFDBFCC@
HWI-ST858_57:1:1101:1228:14101#10@0	81	chrM	6369	255	76M	*	0	0	CTCCTNTATCTTAGGGGCCATNNATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA	######AC@DCABBDD??5-(##GEC;A@@F=CF=EEACBD9EGB8D@HF?:JJJJIJJJIJIHFHHHFFDBFCC@

But this does not seem much better: if I understand correctly, it says for both alignments that the read is the first in the pair.

However, when I just convert my BAM files into SAM files and sort them using samtools sort -n, dexseq_count doesn't work at all:

Traceback (most recent call last):
  File "[...]/dexseq_count.py", line 132, in <module>
    for af, ar in HTSeq.pair_SAM_alignments( HTSeq.SAM_Reader( sam_file ) ):
  File "[...]/__init__.py", line 610, in pair_SAM_alignments
    for almnt in alignments:
  File "[...]/__init__.py", line 549, in __iter__
    algnt = SAM_Alignment.from_SAM_line( line )
  File "_HTSeq.pyx", line 1321, in HTSeq._HTSeq.SAM_Alignment.from_SAM_line (src/_HTSeq.c:22925)
ValueError: ("Malformed SAM line: MRNM == '*' although flag bit &0x0008 cleared", 'line 15 of file [...]/myfile.sorted.sam')

Line 15 also has the 81 flag:

Code:

15: HWI-ST858_57:1:1101:10000:139104#10@0   81      chr2    142005556       255     76M     *       0       0       CACCTGAGGCCAGGAGTTTGAGACCAGCCTGGCCAACATGGTGAGACTCTGTCTCTACTAAAAATGCAA

This was the reason why I tried to run fixmate, but I guess this was a bad idea?!

By the way, these are all the entries with the same id as in line 15:

Code:

HWI-ST858_57:1:1101:10000:139104#10@0   81      chr2    142005556       255     76M     *       0       0       CACCTGAGGCCAGGAGTTTGAGACCAGCCTGGCCAACATGGTGAGACTCTGTCTCTACTAAAAATGCAA
HWI-ST858_57:1:1101:10000:139104#10@0   99      chr15   89194158        255     76M     =       89194263        180     GTAATTTTTGCATTTTTAGTAGAGACAGAGTCTCACCATGTTGGCCAGGCTGGTCTCAAAC
HWI-ST858_57:1:1101:10000:139104#10@0   147     chr15   89194263        255     76M     =       89194158        -180    GGATTACAGACGTGAGACACCGTGCCTGGCTGGTGGCCGGACTTCTTATAGAATTGCGGTC

So it seems that the last two are okay. However, I have lots of cases like this in the data ...

So how could I solve the problem correctly?
As I wrote, unfortunately I have only got the BAM files, the data was not produced in-house.

Enomis

**dpryan** · 09-26-2013, 06:15 AM

Man, that BAM file is a real mess. The HWI-ST858_57:1:1101:1228:14101#10@0 reads are actually not mates, but the same read mapping to multiple places. The read on line 15 says it has an aligned mate, but then it doesn't say where it maps. The whole file is a big violation of the spec. The flags aren't going to be easily salvageable by anything that I've seen. You might need to write something to clean up that file.
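
As a starting point for such a clean-up, the inconsistent records can at least be located. The Python sketch below flags paired records that claim a mapped mate (bit 0x8 cleared) but give no mate reference (RNEXT is `*`), which is the condition named in the HTSeq error message earlier in the thread. The function name `find_mate_problems` is hypothetical, and the demo records are the three alignments quoted above with SEQ and QUAL abbreviated to `*`; this only reports problems, it does not repair them.

```python
def find_mate_problems(sam_lines):
    """Yield (line_number, read_name, flag) for paired records that claim a
    mapped mate (bit 0x8 cleared) but carry no mate reference (RNEXT '*')."""
    for number, line in enumerate(sam_lines, start=1):
        if line.startswith("@") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        rnext = fields[6]
        is_paired = bool(flag & 0x1)
        mate_unmapped = bool(flag & 0x8)
        if is_paired and not mate_unmapped and rnext == "*":
            yield number, fields[0], flag


# Records from this thread; SEQ and QUAL are replaced by '*' for brevity.
QNAME = "HWI-ST858_57:1:1101:10000:139104#10@0"
records = [
    QNAME + "\t81\tchr2\t142005556\t255\t76M\t*\t0\t0\t*\t*",
    QNAME + "\t99\tchr15\t89194158\t255\t76M\t=\t89194263\t180\t*\t*",
    QNAME + "\t147\tchr15\t89194263\t255\t76M\t=\t89194158\t-180\t*\t*",
]

problems = list(find_mate_problems(records))
for number, name, flag in problems:
    print("line %d: %s has flag %d but RNEXT is '*'" % (number, name, flag))
```

Only the first record (flag 81) is reported; the 99/147 pair on chr15 passes this check.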

**capricy** · 11-06-2013, 07:31 PM

I ran into a similar situation. I figured it was because I did not specify the correct "inner mate distance" when I ran tophat, so the reads were not properly paired.

Under such circumstances, I wonder whether I could just use the BAM/SAM file as non-paired-end input for DEXSeq processing?

**areyes** · 11-06-2013, 10:49 PM

Hi @capricy,

The DEXSeq scripts are designed to count sequenced fragments, not reads. Therefore, if you use a paired-end alignment file and specify that it is a single-end file, the script will count all your fragments twice. This is not advisable.

Alejandro

**capricy** · 11-07-2013, 07:07 AM

@areyes,

Then should I just ignore those warnings using the -p option?

What do you suggest I do under such circumstances?

I figure that even if all the counts were doubled, the statistics could still tell me something?

**areyes** · 11-07-2013, 07:13 AM

@capricy,

I would not suggest ignoring the warnings, but rather fixing the problem in your alignment files. You should make sure that your SAM/BAM files follow the conventions for paired-end alignments laid out in the SAM format specification, and then use the DEXSeq Python scripts.

Alejandro

**dpryan** · 11-07-2013, 07:22 AM

Just to jump in quickly, the simplest solution is probably to just realign things. Incorrectly specifying the mate inner distance should still not result in the screwed up output shown in post #4. I wonder if this is some weird tophat bug.

BTW, don't run tophat with --fusion-search, in the off-chance that you're doing so (its output will also cause this sort of problem).

**capricy** · 11-07-2013, 07:46 AM

I ran tophat 2.0.0 with only -r specified:

tophat2.0.0 -r 300 genomeindex reads1 reads2

I knew this "300" was just a rough estimate, but from what I had read, this number does not affect the alignment. Then I came across the mate-finding issue when I ran DEXSeq.

What is the easiest way to estimate the inner mate distance parameter? I get the impression that many people just try different numbers, which sounds very tedious...
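
One way to avoid trial and error is to estimate the value from a first alignment. TopHat's -r is the expected inner distance between the mates, i.e. the fragment length minus the two read lengths. The Python sketch below takes the median TLEN of properly paired records and subtracts twice the read length. The function name `estimate_inner_distance` is hypothetical, the demo uses the chr15 pair quoted earlier in this thread with SEQ and QUAL abbreviated to `*`, and it is an assumption that TLEN reflects the fragment length: for spliced RNA-seq alignments TLEN can include introns, so a real estimate should use many pairs, ideally ones that map within a single exon.

```python
from statistics import median


def estimate_inner_distance(sam_lines, read_length):
    """Estimate the inner mate distance as median(TLEN) - 2 * read_length,
    using only properly paired records with a positive TLEN."""
    template_lengths = []
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        tlen = int(fields[8])
        # Bit 0x2 marks a properly paired read; TLEN > 0 picks the leftmost
        # mate, so every pair contributes exactly once.
        if flag & 0x2 and tlen > 0:
            template_lengths.append(tlen)
    if not template_lengths:
        raise ValueError("no properly paired records with positive TLEN")
    return median(template_lengths) - 2 * read_length


# The properly paired chr15 records from this thread (76 bp reads).
QNAME = "HWI-ST858_57:1:1101:10000:139104#10@0"
pair = [
    QNAME + "\t99\tchr15\t89194158\t255\t76M\t=\t89194263\t180\t*\t*",
    QNAME + "\t147\tchr15\t89194263\t255\t76M\t=\t89194158\t-180\t*\t*",
]
inner = estimate_inner_distance(pair, read_length=76)
print("estimated inner distance:", inner)
```

For this single pair the template length is 180 and the reads are 76 bp each, so the sketch gives 180 - 2 * 76 = 28. One pair is of course no basis for choosing -r; it only shows the arithmetic.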

**capricy** · 11-07-2013, 05:56 PM

I also noticed that the percentage of properly paired reads differs even when the same RNA-seq dataset is mapped to different reference databases:

Here are some samtools flagstat results for tophat results:
-----------------------
tophat2.0.0 -r 300 genomeindex reads1 reads2
21635810 + 0 properly paired (58.02%:-nan%)

tophat2.0.0 -r 300 GeneModelindex reads1 reads2
20153630 + 0 properly paired (74.69%:-nan%)

tophat2.0.0 -r 300 cufflinksAssemblyindex reads1 reads2
23748636 + 0 properly paired (79.03%: -nan%)
----------------------

Aren't they supposed to be roughly the same when I use the same -r parameter?

**capricy** · 11-08-2013, 07:27 AM

I also wonder what percentage of properly paired reads in the BAM file would be acceptable for a DEXSeq analysis?

**dpryan** · 11-08-2013, 08:18 AM

I wouldn't worry so much about the exact percentage, provided that it's similar between samples and groups. You don't want an analysis to be skewed simply because one group has poorer mapping.

**capricy** · 11-08-2013, 10:50 AM

It looks like most of my alignments have ~60% properly paired reads. I wonder if there is a way to filter the BAM file and separate the paired reads from the singletons before feeding it to DEXSeq...
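
A separation like this can be based on FLAG bit 0x2 (properly paired). The Python sketch below splits SAM records into header lines, properly paired records and everything else; the function name `split_by_proper_pair` is hypothetical and the demo records are taken from earlier in this thread with SEQ and QUAL abbreviated to `*`. On the command line, `samtools view -f 2` selects the same records. Whether the filtered file then passes dexseq_count, and whether discarding the remaining reads is appropriate for the analysis, is not something this sketch can answer.

```python
def split_by_proper_pair(sam_lines):
    """Split SAM lines into (header, properly_paired, other) lists,
    based on FLAG bit 0x2."""
    header, proper, other = [], [], []
    for line in sam_lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("@"):
            header.append(line)
        elif int(line.split("\t")[1]) & 0x2:
            proper.append(line)
        else:
            other.append(line)
    return header, proper, other


# Records from this thread; SEQ and QUAL are replaced by '*' for brevity.
QNAME = "HWI-ST858_57:1:1101:10000:139104#10@0"
records = [
    QNAME + "\t81\tchr2\t142005556\t255\t76M\t*\t0\t0\t*\t*",
    QNAME + "\t99\tchr15\t89194158\t255\t76M\t=\t89194263\t180\t*\t*",
    QNAME + "\t147\tchr15\t89194263\t255\t76M\t=\t89194158\t-180\t*\t*",
]

header, proper, other = split_by_proper_pair(records)
print(len(proper), "properly paired records,", len(other), "other records")
```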

DEXSeq: dexseq_count.py produces lots of warnings (mate could not be found)