Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DEXSeq: dexseq_count.py produces lots of warnings (mate could not be found)

    Hi!

    I would like to do an analysis of alternatively used exons in RNAseq data (paired reads, strandless) with DEXSeq and I am struggling with generating the counts using the dexseq_count.py script contained in the DEXSeq package (latest version) several days already.

    The problem which occurs is that I get the following warning for many reads:

    UserWarning: Read XXX claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)
    For example: The first SAM file for which I wanted to get the counts contains 65.851.460 lines with altogether 28.899.079 unique read ids and I get 1.844.364 warnings like the one quoted.

    I searched a lot and saw that many people had the same problems, but none of the solutions seems to work for me. I sorted my file using "samtools sort -n" and also ran "samtools fixmate" (because before I got other warnings related to the bitflags set in the SAM files). I also tried to use "sort -sk -k 1,1" to sort the file, with "export LC_ALL=POSIX" as this was suggested elsewhere, but it didn't help.

    Here are two examples of the reads corresponding to the first to warnings that appear when running "dexseq_count.py":

    Code:
    HWI-ST858_57:1:1101:1228:14101#10@0     113     chr1    566918  255     76M     chrM    6369    0       CTCCTNTATCTTAGGGGCCATNNATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA
    
    HWI-ST858_57:1:1101:1228:14101#10@0     113     chrM    6369    255     76M     chr1    566918  0       CTCCTNTATCTTAGGGGCCATNNATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA
    Code:
    HWI-ST858_57:1:1101:1228:20037#10@0     163     chr1    45243352        255     76M     =       45244240        964     CCAAGACCCTGGTGAAGAATTGCATCGTGCTCATCGACAGCACACCGTACCGACAGTGGTA
    
    HWI-ST858_57:1:1101:1228:20037#10@0     83      chr1    45244240        255     76M     =       45243352        -964    GCTTCNTGCGTGCATCGCTTCNNGGCCGGGACAGTGTGGCCGAGCAGATGGCTATGTGCTA
    
    HWI-ST858_57:1:1101:1228:20037#10@0     97      chr14   106444649       255     76M     chr5    33162796        0       CAACTCTTTGCCCTCTAGCACATAGCCATCTGCTCGGCCACACTGTCCCGGCCNNGAAGCG
    
    HWI-ST858_57:1:1101:1228:20037#10@0     81      chr5    33162796        255     76M     chr14   106444649       0       GCTTCNTGCGTGCATCGCTTCNNGGCCGGGACAGTGTGGCCGAGCAGATGGCTATGTGCTA
    In the end, many people reported that they went back and used other aligners to generate the BAM/SAM files then, but I only have access to the BAM files, not to the original data. I read that there is a function "bamtofastq" available in "biobambam" and other tools, nevertheless I am wondering if there is no better way to solve my problem?

    Enomis
    Last edited by enomis; 09-24-2013, 07:10 AM.

  • #2
    No idea how to get rid of the problem?

    Comment


    • #3
      Your SAM file has completely messed up FLAG fields: The value 113d=71h=0111.0001h means that this is from a paired end read (01h set) with the read being from the first pass (40h set) and the second pass (80h set). Having all three bits set cannot be, so something went seriously wrong.

      Comment


      • #4
        Thank you for your answer, Simon.

        Actually the 113 flag appears after running samtools fixmate.

        When I convert my original BAM into SAM, the mentioned reads have flag 81:

        Code:
        HWI-ST858_57:1:1101:1228:14101#10@0	81	chr1	566918	255	76M	*	0	0 CTCCTNTATCTTAGGGGCCATNNATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA######AC@DCABBDD??5-(##GEC;A@@F=CF=EEACBD9EGB8D@HF?:JJJJIJJJIJIHFHHHFFDBFCC@
        HWI-ST858_57:1:1101:1228:14101#10@0	81	chrM	6369	255	76M	*	0	0	CTCCTNTATCTTAGGGGCCATNNATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAA######AC@DCABBDD??5-(##GEC;A@@F=CF=EEACBD9EGB8D@HF?:JJJJIJJJIJIHFHHHFFDBFCC@
        But this seems not to be really better, as it says for both reads that it is the first in pair, if I understand correctly.

        However, when I just converted my BAM files into SAM files and sorted them using samtools sort -n, then dexseq_count doesn't work at all:

        Traceback (most recent call last):
        File "[...]/dexseq_count.py", line 132, in <module>
        for af, ar in HTSeq.pair_SAM_alignments( HTSeq.SAM_Reader( sam_file ) ):
        File "[...]/__init__.py", line 610, in pair_SAM_alignments
        for almnt in alignments:
        File "[...]/__init__.py", line 549, in __iter__
        algnt = SAM_Alignment.from_SAM_line( line )
        File "_HTSeq.pyx", line 1321, in HTSeq._HTSeq.SAM_Alignment.from_SAM_line (src/_HTSeq.c:22925)
        ValueError: ("Malformed SAM line: MRNM == '*' although flag bit &0x0008 cleared", 'line 15 of file [...]/myfile.sorted.sam')
        Line 15 also has the 81 flag:

        Code:
        15: HWI-ST858_57:1:1101:10000:139104#10@0   81      chr2    142005556       255     76M     *       0       0       CACCTGAGGCCAGGAGTTTGAGACCAGCCTGGCCAACATGGTGAGACTCTGTCTCTACTAAAAATGCAA
        This was the reason why I tried to run fixmate, but I guess this was a bad idea?!

        By the way, these are all the entries with the same id as in line 15:

        Code:
        HWI-ST858_57:1:1101:10000:139104#10@0   81      chr2    142005556       255     76M     *       0       0       CACCTGAGGCCAGGAGTTTGAGACCAGCCTGGCCAACATGGTGAGACTCTGTCTCTACTAAAAATGCAA
        HWI-ST858_57:1:1101:10000:139104#10@0   99      chr15   89194158        255     76M     =       89194263        180     GTAATTTTTGCATTTTTAGTAGAGACAGAGTCTCACCATGTTGGCCAGGCTGGTCTCAAAC
        HWI-ST858_57:1:1101:10000:139104#10@0   147     chr15   89194263        255     76M     =       89194158        -180    GGATTACAGACGTGAGACACCGTGCCTGGCTGGTGGCCGGACTTCTTATAGAATTGCGGTC
        So it seems as if the last two ones were okay. But however I have lots of cases like this in the data ...

        So how could I solve the problem correctly?
        As I wrote, unfortunately I have only got the BAM files, the data was not produced in-house.

        Enomis

        Comment


        • #5
          Man, that BAM file is a real mess. The HWI-ST858_57:1:1101:1228:14101#10@0 reads are actually not mates, but the same read mapping to multiple places. The read on line 15 says it has an aligned mate, but then it doesn't say where it maps. The whole file is a big violation of the spec. The flags aren't going to be easily salvageable by anything that I've seen. You might need to write something to clean up that file.

          Comment


          • #6
            I got the similar situation. I figured it was because when I ran tophat, I did not specify the correct "inner mate distance". So the reads were not properly paired.

            Under such circumstance, I wonder if I could just use the bam/sam file as non-paired end input for Dexseq processing???

            Comment


            • #7
              Hi @capricy,

              the scripts from dexseq are designed to count sequenced fragments, not reads. Therefore, if you use a paired end aligned file and specify that is a single read file, the script will double count all your fragments. This is not recommendable.

              Alejandro

              Comment


              • #8
                @areyes,

                Then should I just ignore those warnings using -p option?

                What do you suggest me to do under such circumstances?

                I figure if I doubled all the accounts, the stat still can tell me something?

                Comment


                • #9
                  @capricy,

                  I would not suggest to ignore the warnings, but rather solve the problem of your alignment files. You should make sure that your sam/bam files follow the specifications of a paired alignment according to the samtools specifications and then use the dexseq python scripts.

                  Alejandro

                  Comment


                  • #10
                    Just to jump in quickly, the simplest solution is probably to just realign things. Incorrectly specifying the mate inner distance should still not result in the screwed up output shown in post #4. I wonder if this is some weird tophat bug.

                    BTW, don't run tophat with --fusion-search, in the off-chance that you're doing so (its output will also cause this sort of problem).

                    Comment


                    • #11
                      When I run tophat2.0.0 and the only -r was specified

                      tophat2.0.0 -r 300 genomeindex reads1 reads2

                      I know this "300" was just an estimate since based on what I read, this number does not affect the alignment. Then I came across the mate finding issue when I ran Dexseq.

                      what is the easiest way to estimate this inner mate distance parameters? I feel many people just tried different numbers and this sounds very tedious....

                      Comment


                      • #12
                        I also noticed that different percentages of properly paired reads came up even when the same RNAseq dataset was used to map to the different reference databases:


                        Here are some samtools flagstat results for tophat results:
                        -----------------------
                        tophat2.0.0 -r 300 genomeindex reads1 reads2
                        21635810 + 0 properly paired (58.02%:-nan%)

                        tophat2.0.0 -r 300 GeneModelindex reads1 reads2
                        20153630 + 0 properly paired (74.69%:-nan%)

                        tophat2.0.0 -r 300 cufflinksAssemblyindex reads1 reads2
                        23748636 + 0 properly paired (79.03%: -nan%)
                        ----------------------

                        Aren't they supposed to be roughly same when I used the same parameter -r?

                        Comment


                        • #13
                          I also wonder what percentage of the properly paired in bam file would be acceptable for Dexseq analysis?

                          Comment


                          • #14
                            I wouldn't worry so much about the exact percentage, provided that it's similar between samples and groups. You don't want an analysis to be skewed simply because one group has poorer mapping.

                            Comment


                            • #15
                              looks like most of my alignment has ~60% properly paired reads. I wonder if there is way to filter the bam bile and separate the paired reads/singleton... to feed the dexseq...

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin


                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                                Yesterday, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              39 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              41 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              35 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              55 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X