SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
getting dexseq_count.py to work alpesh Bioinformatics 6 06-18-2013 10:32 PM
Problem with HTSeq dexseq_count.py script SanderEST Bioinformatics 7 04-01-2013 08:54 AM
DEXSEQ dexseq_count.py counts sszelinger Bioinformatics 1 02-11-2013 01:56 PM
singletons in dexseq_count.py Jetse Bioinformatics 1 12-14-2012 12:05 AM
DEXseq_count.py error newbietonextgen Bioinformatics 1 06-27-2012 04:02 PM

Reply
 
Thread Tools
Old 09-24-2013, 06:27 AM   #1
enomis
Junior Member
 
Location: Earth

Join Date: Jun 2013
Posts: 5
Default DEXSeq: dexseq_count.py produces lots of warnings (mate could not be found)

Hi!

I would like to do an analysis of alternatively used exons in RNAseq data (paired reads, strandless) with DEXSeq and I am struggling with generating the counts using the dexseq_count.py script contained in the DEXSeq package (latest version) several days already.

The problem which occurs is that I get the following warning for many reads:

Quote:
UserWarning: Read XXX claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
For example: The first SAM file for which I wanted to get the counts contains 65.851.460 lines with altogether 28.899.079 unique read ids and I get 1.844.364 warnings like the one quoted.

I searched a lot and saw that many people had the same problems, but none of the solutions seems to work for me. I sorted my file using "samtools sort -n" and also ran "samtools fixmate" (because before I got other warnings related to the bitflags set in the SAM files). I also tried to use "sort -sk -k 1,1" to sort the file, with "export LC_ALL=POSIX" as this was suggested elsewhere, but it didn't help.

Here are two examples of the reads corresponding to the first to warnings that appear when running "dexseq_count.py":

Code:
HWI-ST858_57:1:1101:1228:14101#10@0     113     chr1    566918  255     76M     chrM    6369    0       CTCCTNTATCTTAGGGGCCATNNATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA

HWI-ST858_57:1:1101:1228:14101#10@0     113     chrM    6369    255     76M     chr1    566918  0       CTCCTNTATCTTAGGGGCCATNNATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA
Code:
HWI-ST858_57:1:1101:1228:20037#10@0     163     chr1    45243352        255     76M     =       45244240        964     CCAAGACCCTGGTGAAGAATTGCATCGTGCTCATCGACAGCACACCGTACCGACAGTGGTA

HWI-ST858_57:1:1101:1228:20037#10@0     83      chr1    45244240        255     76M     =       45243352        -964    GCTTCNTGCGTGCATCGCTTCNNGGCCGGGACAGTGTGGCCGAGCAGATGGCTATGTGCTA

HWI-ST858_57:1:1101:1228:20037#10@0     97      chr14   106444649       255     76M     chr5    33162796        0       CAACTCTTTGCCCTCTAGCACATAGCCATCTGCTCGGCCACACTGTCCCGGCCNNGAAGCG

HWI-ST858_57:1:1101:1228:20037#10@0     81      chr5    33162796        255     76M     chr14   106444649       0       GCTTCNTGCGTGCATCGCTTCNNGGCCGGGACAGTGTGGCCGAGCAGATGGCTATGTGCTA
In the end, many people reported that they went back and used other aligners to generate the BAM/SAM files then, but I only have access to the BAM files, not to the original data. I read that there is a function "bamtofastq" available in "biobambam" and other tools, nevertheless I am wondering if there is no better way to solve my problem?

Enomis

Last edited by enomis; 09-24-2013 at 07:10 AM.
enomis is offline   Reply With Quote
Old 09-26-2013, 02:42 AM   #2
enomis
Junior Member
 
Location: Earth

Join Date: Jun 2013
Posts: 5
Default

No idea how to get rid of the problem?
enomis is offline   Reply With Quote
Old 09-26-2013, 03:35 AM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Your SAM file has completely messed up FLAG fields: The value 113d=71h=0111.0001h means that this is from a paired end read (01h set) with the read being from the first pass (40h set) and the second pass (80h set). Having all three bits set cannot be, so something went seriously wrong.
Simon Anders is offline   Reply With Quote
Old 09-26-2013, 05:36 AM   #4
enomis
Junior Member
 
Location: Earth

Join Date: Jun 2013
Posts: 5
Default

Thank you for your answer, Simon.

Actually the 113 flag appears after running samtools fixmate.

When I convert my original BAM into SAM, the mentioned reads have flag 81:

Code:
HWI-ST858_57:1:1101:1228:14101#10@0	81	chr1	566918	255	76M	*	0	0 CTCCTNTATCTTAGGGGCCATNNATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA######AC@DCABBDD??5-(##GEC;A@@F=CF=EEACBD9EGB8D@HF?:JJJJIJJJIJIHFHHHFFDBFCC@
HWI-ST858_57:1:1101:1228:14101#10@0	81	chrM	6369	255	76M	*	0	0	CTCCTNTATCTTAGGGGCCATNNATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA######AC@DCABBDD??5-(##GEC;A@@F=CF=EEACBD9EGB8D@HF?:JJJJIJJJIJIHFHHHFFDBFCC@
But this seems not to be really better, as it says for both reads that it is the first in pair, if I understand correctly.

However, when I just converted my BAM files into SAM files and sorted them using samtools sort -n, then dexseq_count doesn't work at all:

Quote:
Traceback (most recent call last):
File "[...]/dexseq_count.py", line 132, in <module>
for af, ar in HTSeq.pair_SAM_alignments( HTSeq.SAM_Reader( sam_file ) ):
File "[...]/__init__.py", line 610, in pair_SAM_alignments
for almnt in alignments:
File "[...]/__init__.py", line 549, in __iter__
algnt = SAM_Alignment.from_SAM_line( line )
File "_HTSeq.pyx", line 1321, in HTSeq._HTSeq.SAM_Alignment.from_SAM_line (src/_HTSeq.c:22925)
ValueError: ("Malformed SAM line: MRNM == '*' although flag bit &0x0008 cleared", 'line 15 of file [...]/myfile.sorted.sam')
Line 15 also has the 81 flag:

Code:
15: HWI-ST858_57:1:1101:10000:139104#10@0   81      chr2    142005556       255     76M     *       0       0       CACCTGAGGCCAGGAGTTTGAGACCAGCCTGGCCAACATGGTGAGACTCTGTCTCTACTAAAAATGCAA
This was the reason why I tried to run fixmate, but I guess this was a bad idea?!

By the way, these are all the entries with the same id as in line 15:

Code:
HWI-ST858_57:1:1101:10000:139104#10@0   81      chr2    142005556       255     76M     *       0       0       CACCTGAGGCCAGGAGTTTGAGACCAGCCTGGCCAACATGGTGAGACTCTGTCTCTACTAAAAATGCAA
HWI-ST858_57:1:1101:10000:139104#10@0   99      chr15   89194158        255     76M     =       89194263        180     GTAATTTTTGCATTTTTAGTAGAGACAGAGTCTCACCATGTTGGCCAGGCTGGTCTCAAAC
HWI-ST858_57:1:1101:10000:139104#10@0   147     chr15   89194263        255     76M     =       89194158        -180    GGATTACAGACGTGAGACACCGTGCCTGGCTGGTGGCCGGACTTCTTATAGAATTGCGGTC
So it seems as if the last two ones were okay. But however I have lots of cases like this in the data ...

So how could I solve the problem correctly?
As I wrote, unfortunately I have only got the BAM files, the data was not produced in-house.

Enomis
enomis is offline   Reply With Quote
Old 09-26-2013, 06:15 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Man, that BAM file is a real mess. The HWI-ST858_57:1:1101:1228:14101#10@0 reads are actually not mates, but the same read mapping to multiple places. The read on line 15 says it has an aligned mate, but then it doesn't say where it maps. The whole file is a big violation of the spec. The flags aren't going to be easily salvageable by anything that I've seen. You might need to write something to clean up that file.
dpryan is offline   Reply With Quote
Old 11-06-2013, 06:31 PM   #6
capricy
Senior Member
 
Location: 63130

Join Date: Apr 2012
Posts: 125
Default

I got the similar situation. I figured it was because when I ran tophat, I did not specify the correct "inner mate distance". So the reads were not properly paired.

Under such circumstance, I wonder if I could just use the bam/sam file as non-paired end input for Dexseq processing???
capricy is offline   Reply With Quote
Old 11-06-2013, 09:49 PM   #7
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

Hi @capricy,

the scripts from dexseq are designed to count sequenced fragments, not reads. Therefore, if you use a paired end aligned file and specify that is a single read file, the script will double count all your fragments. This is not recommendable.

Alejandro
areyes is offline   Reply With Quote
Old 11-07-2013, 06:07 AM   #8
capricy
Senior Member
 
Location: 63130

Join Date: Apr 2012
Posts: 125
Default

@areyes,

Then should I just ignore those warnings using -p option?

What do you suggest me to do under such circumstances?

I figure if I doubled all the accounts, the stat still can tell me something?
capricy is offline   Reply With Quote
Old 11-07-2013, 06:13 AM   #9
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

@capricy,

I would not suggest to ignore the warnings, but rather solve the problem of your alignment files. You should make sure that your sam/bam files follow the specifications of a paired alignment according to the samtools specifications and then use the dexseq python scripts.

Alejandro
areyes is offline   Reply With Quote
Old 11-07-2013, 06:22 AM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Just to jump in quickly, the simplest solution is probably to just realign things. Incorrectly specifying the mate inner distance should still not result in the screwed up output shown in post #4. I wonder if this is some weird tophat bug.

BTW, don't run tophat with --fusion-search, in the off-chance that you're doing so (its output will also cause this sort of problem).
dpryan is offline   Reply With Quote
Old 11-07-2013, 06:46 AM   #11
capricy
Senior Member
 
Location: 63130

Join Date: Apr 2012
Posts: 125
Default

When I run tophat2.0.0 and the only -r was specified

tophat2.0.0 -r 300 genomeindex reads1 reads2

I know this "300" was just an estimate since based on what I read, this number does not affect the alignment. Then I came across the mate finding issue when I ran Dexseq.

what is the easiest way to estimate this inner mate distance parameters? I feel many people just tried different numbers and this sounds very tedious....
capricy is offline   Reply With Quote
Old 11-07-2013, 04:56 PM   #12
capricy
Senior Member
 
Location: 63130

Join Date: Apr 2012
Posts: 125
Default

I also noticed that different percentages of properly paired reads came up even when the same RNAseq dataset was used to map to the different reference databases:


Here are some samtools flagstat results for tophat results:
-----------------------
tophat2.0.0 -r 300 genomeindex reads1 reads2
21635810 + 0 properly paired (58.02%:-nan%)

tophat2.0.0 -r 300 GeneModelindex reads1 reads2
20153630 + 0 properly paired (74.69%:-nan%)

tophat2.0.0 -r 300 cufflinksAssemblyindex reads1 reads2
23748636 + 0 properly paired (79.03%: -nan%)
----------------------

Aren't they supposed to be roughly same when I used the same parameter -r?
capricy is offline   Reply With Quote
Old 11-08-2013, 06:27 AM   #13
capricy
Senior Member
 
Location: 63130

Join Date: Apr 2012
Posts: 125
Default

I also wonder what percentage of the properly paired in bam file would be acceptable for Dexseq analysis?
capricy is offline   Reply With Quote
Old 11-08-2013, 07:18 AM   #14
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I wouldn't worry so much about the exact percentage, provided that it's similar between samples and groups. You don't want an analysis to be skewed simply because one group has poorer mapping.
dpryan is offline   Reply With Quote
Old 11-08-2013, 09:50 AM   #15
capricy
Senior Member
 
Location: 63130

Join Date: Apr 2012
Posts: 125
Default

looks like most of my alignment has ~60% properly paired reads. I wonder if there is way to filter the bam bile and separate the paired reads/singleton... to feed the dexseq...
capricy is offline   Reply With Quote
Old 11-08-2013, 01:48 PM   #16
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by capricy View Post
looks like most of my alignment has ~60% properly paired reads. I wonder if there is way to filter the bam bile and separate the paired reads/singleton... to feed the dexseq...
Was that meant to be a challenge ?

Code:
#!/usr/bin/env python
import csv
import sys

f = csv.reader(sys.stdin, dialect="excel-tab")
of = csv.writer(sys.stdout, dialect="excel-tab")
previous = None
for line in f :
    if((int(line[1]) & 0x100) != 0) :
        #Don't output secondary alignments
        continue 
    if((int(line[1]) & 0x2) == 0) :
        #Skip anything not properly paired
        continue
    if(previous != None and line[0] == previous[0]) :
        of.writerow(previous)  
        of.writerow(line)
        previous = None
    else :
        previous = line
Usage would be something like:
Code:
samtools view namesorted.bam | filter.py | dexseq_count.py ....
That will filter out anything that's not properly paired, has is a secondary alignment, or is missing its mate. You can use that as a basis if you want to filter in a different way.
dpryan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:54 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO