Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTseq Segmentation fault

    I have been trying to get counts for genes using HTseq but it keep crashing giving me a segmentation error and I have no idea what is causing it.

    The data I use is created using bowtie2 on Human data from Encode:
    bowtie2 -p 16 -x human_index/hg19 -1 wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd1Rep1 -2 wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd2Rep1 -S wgEncodeCshlLongRnaSeqHelas3CellPapFastqRep1.sam

    The commandline I use to run HTseq:
    htseq-count -s no wgEncodeCshlLongRnaSeqHelas3CellPapFastqRep1.sam Homo_sapiens.GRCh37.70.gtf

    I get 2 warnings and an error if I run this command:
    2200000 GFF lines processed.
    2280612 GFF lines processed.
    Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
    Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
    Segmentation fault (core dumped)

    The log for the segmentation fault has this error:
    kernel: [5076890.033992] python[27097]: segfault at 0 ip 00007f38cec6336c sp 00007fff4b6b7190 error 6 in _HTSeq.so[7f38cec24000+50000]

    I downloaded the GTF file from ensembl and from what I understand I had to replace the chromosome numbers e.g.: "12" to "chr12" (As before I did this it could not find the chromosomes), which I did using sed.

    My GTF file has the following format:
    chr11 processed_pseudogene exon 75780 76143 . + . gene_id "ENSG00000253826"; transcript_id "ENST00000519787"; exon_number "1"; gene_name "RP11-304M2.1"; gene_biotype "pseudogene"; transcript_name "RP11-304M2.1-001"; exon_id "ENSE00002139035";
    chr11 processed_transcript exon 86612 87605 . - . gene_id "ENSG00000224777"; transcript_id "ENST00000521196"; exon_number "1"; gene_name "OR4F2P"; gene_biotype "pseudogene"; transcript_name "OR4F2P-002"; exon_id "ENSE00002124594";
    My Sam file has the following format:
    BILLIEHOLIDAY_0004:1:1:2921:975#0 77 0 0 0 * * 0 0 NCAAAAGTGACAATCCAGCAATTCCAAATAAGGTATGAAAAGGATCCACCATATCTCCTGGCCTGTCTGCAAATCC BIKIHLJMNMb___b______________bQQ__bb____b_____b____bbb_____________bb______b YT:Z:UP
    BILLIEHOLIDAY_0004:1:1:2921:975#0 141 0 0 0 * * 0 0 NNTAGTTATNCTACTCATGTTGNTTCCNNGNNTCCCTAAAGATAATTNGAAGACTTCATTGGATTTATAGAGAGAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB YT:Z:UP
    BILLIEHOLIDAY_0004:1:1:3277:975#0 83 chr10 64566791 23 76M = 64566652 -215 CAGAAGATCACAGCTAGAGAATTGAGAATTAACTATACTACTAGCCATTTTAGGGCACCAAAACTTGGGATTAAAN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:75C0 YS:i:-39 YT:Z:CPCAGAAGATCACAGCTAGAGAATTGAGAATTAACTATACTACTAGCCATTTTAGGGCACCAAAACTTGGGATTAAAN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:75C0 YS:i:-39 YT:Z:CP

    What am I doing wrong?

    Thank you, Sipko

  • #2
    I wonder if the segfault is related to the malformed SAM file. The warnings are referring to unmapped reads (or their mates) still having chromosomes assigned to them. I've usually only seen that with bwa. You might try taking a small chunk of the your SAM file and see if you can find and remove the offending line. If removing just that read prevents the segfault, then you will have at least gotten around that.

    Comment


    • #3
      You are the first to report a segfault on HTSeq. Not sure how to debug this. You could send me your files and I could try to reproduce it on my machine. You could also try to use the "--samout" option to get htseq-count to write out each SAM line it reads in order to see which line is at fault and then isolate that one.

      Comment


      • #4
        Thanks for your quick response.

        Seeing the run crashes right after the GFF file is read in I presume the line it crashes on is the first one. Also when I try the --samout argument as you suggested the resulting file is empty (0 bytes). I think it has something to do with the format bowtie2 uses as out put. Maybe it could be something to do with spaces or tabs? I had a look at the python-count script but the parser is in a different file, which I did not look at. Maybe something the parser is expecting is having troubles with the bowtie2 format. However I would assume there has been at least a few other people that have used HTseq in conjunction with bowtie2 mappings...

        The sam file and gff file are about 3.5 gb together and I am not sure what is the best way of transferring them. I tried dropbox, but it is a bit slow (and it gave me an error). I could ask our server administrator to give you direct access to our server, but if you have an easier solution I would be happy to hear.

        Thanks, Sipko

        Comment


        • #5
          Actually, as it's the first line: Can you just post the first couple of lines?

          In a way, Python should not crash jsut because some spaces or tabs are mixed up. One notorious issue, however, is line endings. If they get messed up (can happen, for example, if you move files between Windows and Unxi environments, or through FTP), suddenly the whole huge file appears as a single line, which Python then tries to read in one go.

          Comment


          • #6
            I send you a truncated version of the file by mail as the forums seem to change the tabs into spaces.

            Comment


            • #7
              Any updates with this? I'm having the exact same issue. It seems that HTseq doesn't work with sam file generated by bowtie2.

              The reason I switched to bowtie2 is that, in my case, the bwa aligner always generates malformed sam file that can not be correctly converted into bam file using samtools. (with core dump error message "CIGAR and sequence length are inconsistent").

              which aligner to choose? please help!

              Comment


              • #8
                I didnt get any update concerning this topic. I am using STAR to align now and HTseq works with the resulting SAM file. I am very satisfied with STAR and I have seen more people with similar reactions. It was easy to install and I find it very intuitive to use. It is much faster and I like the fact that in their paper (http://bioinformatics.oxfordjournals...ts635.abstract) they show that STAR is the mapper that is the most likely to map reads in agreement with at least two other mappers.
                I specifically like this last part as one of my concerns is that different mappers can give different results leading to different conclusions, even though the data is the same...

                Comment


                • #9
                  I was a bit busy with other stuff, so very sorry I didn't reply earlier. Could one of you two try to make a test case for me, i.e., truncate your SAM and GFF file to just a few lines, and if the segfault still happens, send me the file. Or put the full files somewhere for me to download.

                  (Sipko: Yes, I got your file from two weeks ago, but I think you said it did not cause a segfault, right? Also, you email client changed the tabs to spaces, too.)

                  Comment


                  • #10
                    I would like to report that I'm also getting a segmentation fault when using a sorted SAM produced from a bowtie2 alignment and running htseq-count. I could send the files, but they are quite big. I could trim the SAM file, though I'm too sure how.

                    Comment


                    • #11
                      Add me to the list now. I've also used samtools to convert sam to bam, sort the bam file, and convert back to sam - and still get the same error. This means it's not the first line causing the problem.

                      Is this only happening for paired-end alignments? Mine is paired-end - anyone else?

                      Comment


                      • #12
                        Originally posted by DrS View Post
                        Add me to the list now. I've also used samtools to convert sam to bam, sort the bam file, and convert back to sam - and still get the same error. This means it's not the first line causing the problem.

                        Is this only happening for paired-end alignments? Mine is paired-end - anyone else?
                        If you could post a small version of your SAM file (such that it still causes a crash) somewhere then it'll be easier for this bug (if that's what it turns out to be) to get fixed.

                        Comment


                        • #13
                          I'm trying to narrow down which line is giving me problems right now..

                          Comment


                          • #14
                            It doesn't like the last line from these few lines (the lines before it are fine... this is actually the 74th line of my sam file)


                            M02019:4:000000000-A4P4A:1:1101:6364:9627 99 chr16 24051992 17 160M90S = 24051992 -160 AAAATGTCAATCTGCCCTGTAATAATGCTCATTCTGGCTGCACATCAACTACATGGAACTATAACAGACATTCAGAGACGGTTGAACTGGTTGCTGGAGGGTCACTGAAGACAAACATAAAGAGACGTGAGAGACTGAGTCTGGGCTCTGACTGCTCTCTCTGTCTCTTTAACCCTTTCCCGACCCCCCGAGCCCGACCAAGATCCGGTATGCCGTTTTTTGCTTGAAAAAAAAAAAAAAAAAAAATAAT ABCCAFFFFFFFGGGGGGGGGGHGHHHHHHHHHHHHHHHGGHHHHHHHHHHHHHHHHHHHIIHIIIHHHHHHHHHHHHHFEFGGHHHHHHHHGHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHGG/GGHHGHHHHHHFHHHHHHHHHHHHHHHHHHHHH2@HHHH222221/0011?1//->C.---<<-=.-----...000.---;0;09..;..--0;00;0BFFF;B=--:;-;-;:--.000 AS:i:31XS:i:217 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:8C151 YS:i:313 YT:Z:CP
                            M02019:4:000000000-A4P4A:1:1101:6364:9627 147 chr16 24051992 17 160M = 24051992 -160 AAAATGTCAATCTGCCCTGTAATAATGCTCATTCTGGCTGCACATCAACTACATGGAACTATAACAGACATTCAGAGACGGTTGAACTGGTTGCTGGAGGGTCACTGAAGACAAACATAAAGAGACGTGAGAGACTGAGTCTGGGCTCTGACTGCTCTCT HHHHHHHHHHGGGDGHHHHHHHHHHHHHHHHHHHGHHHHHGHHHHHHHHHHGHHHHHG4HHHHHHHHHHHHHGHHHGGGHHHHHHHHGHHHHHHHGHHHHHHHHHHHHHHHHHHHFHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGFFFFFCCCBBBA AS:i:313 XS:i:218 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:8C151 YS:i:313 YT:Z:CP
                            M02019:4:000000000-A4P4A:1:1101:6397:9032 89 chr16 19255696 21 104S146M = 19255696 0 TTTATATTCTGAAGACGGATTACGTGTTTTTGTTTTTTTTTTTCAAGCAGAAGACGGCATAAGAGATCTAGTACGGTCTCGTGGGCTCGGAGATGTGTATAAGAAGCTCAGGAATTACTTTGCCTACAGCCTTGGCAGCCCCAGTGGAGGCTGGGATGATGTTCTGACTGGCACCACGGCCATCCCTCCACAGCTTCCCAGAGGGCCCATCAACGGTCTTCTGTGTTGCTGTGATGGCATGAACAGTGCT 99;///;://9..;.://..:......9;...;FFFDFF;;0GGGGGGGGGGGGGHHHHHG:HGFCHHHHGEDGGGGGGFGHF>>>>/HHHHHHHHHHGHHHHHHHHHHBHHHHHHHHHHHHGHHHHHHGHHHHGGGGGHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHGHGFGGHHGGHGGGHHGHHHHHGHHHGHHHHGHHHHHHHGGGGHHHHHHHHGHHHHHGGGGGGGGGGFFFFFFFCCCCC AS:i:292 XS:i:72 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:146 YT:Z:UP
                            M02019:4:000000000-A4P4A:1:1101:6397:9032 165 chr16 19255696 0 * = 19255696 0 CCTTGATGTTCTTCCACTCTTCAGGTGTAACTGTAAAAGTTCTGAACTTCTATAATTTTCCGTCCTCATTGCACCACCGGTCTTCGAGCTCCAGATCCTCATCTGGCTATTGTCCCTCTTTCTGCCCTTCTCTCTTTATTCCACTACGAGCAACTCCTGTCGCCTGACGCAACACTCGCGGTTGCTCCTATCCTAATCATCCGCTACCGCCGCTGATCACAGGTAAGTCCTAGTTAGTAGGAGGTCACTT >111111B3BD333A1111E1A311111133D1DD3331133212B1DD2222222AA2D1/BB//1112211B00//A////B2//0//1@@11@1@1@1B1211101121BB11100B1B@21FCB001121B2122BB2FG1B0/////0<0B111B1///</00///</0/1</---.--.<00<<=0<00=0D00:0;.:-;/::----9;0;0/00;0900000000;///9///:/---//// YT:Z:UP

                            Comment


                            • #15
                              I've now tried this on 14 files I generated alignments for today, and every single one of them fails.

                              I just ran a quick test on a file I had aligned using tophat2, also paired-end reads, and it worked just fine, so I know it's not my install.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              51 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X