Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Sipkovandam@gmail.com
    Member
    • Mar 2013
    • 13

    HTseq Segmentation fault

    I have been trying to get counts for genes using HTseq but it keep crashing giving me a segmentation error and I have no idea what is causing it.

    The data I use is created using bowtie2 on Human data from Encode:
    bowtie2 -p 16 -x human_index/hg19 -1 wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd1Rep1 -2 wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd2Rep1 -S wgEncodeCshlLongRnaSeqHelas3CellPapFastqRep1.sam

    The commandline I use to run HTseq:
    htseq-count -s no wgEncodeCshlLongRnaSeqHelas3CellPapFastqRep1.sam Homo_sapiens.GRCh37.70.gtf

    I get 2 warnings and an error if I run this command:
    2200000 GFF lines processed.
    2280612 GFF lines processed.
    Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
    Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
    Segmentation fault (core dumped)

    The log for the segmentation fault has this error:
    kernel: [5076890.033992] python[27097]: segfault at 0 ip 00007f38cec6336c sp 00007fff4b6b7190 error 6 in _HTSeq.so[7f38cec24000+50000]

    I downloaded the GTF file from ensembl and from what I understand I had to replace the chromosome numbers e.g.: "12" to "chr12" (As before I did this it could not find the chromosomes), which I did using sed.

    My GTF file has the following format:
    chr11 processed_pseudogene exon 75780 76143 . + . gene_id "ENSG00000253826"; transcript_id "ENST00000519787"; exon_number "1"; gene_name "RP11-304M2.1"; gene_biotype "pseudogene"; transcript_name "RP11-304M2.1-001"; exon_id "ENSE00002139035";
    chr11 processed_transcript exon 86612 87605 . - . gene_id "ENSG00000224777"; transcript_id "ENST00000521196"; exon_number "1"; gene_name "OR4F2P"; gene_biotype "pseudogene"; transcript_name "OR4F2P-002"; exon_id "ENSE00002124594";
    My Sam file has the following format:
    BILLIEHOLIDAY_0004:1:1:2921:975#0 77 0 0 0 * * 0 0 NCAAAAGTGACAATCCAGCAATTCCAAATAAGGTATGAAAAGGATCCACCATATCTCCTGGCCTGTCTGCAAATCC BIKIHLJMNMb___b______________bQQ__bb____b_____b____bbb_____________bb______b YT:Z:UP
    BILLIEHOLIDAY_0004:1:1:2921:975#0 141 0 0 0 * * 0 0 NNTAGTTATNCTACTCATGTTGNTTCCNNGNNTCCCTAAAGATAATTNGAAGACTTCATTGGATTTATAGAGAGAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB YT:Z:UP
    BILLIEHOLIDAY_0004:1:1:3277:975#0 83 chr10 64566791 23 76M = 64566652 -215 CAGAAGATCACAGCTAGAGAATTGAGAATTAACTATACTACTAGCCATTTTAGGGCACCAAAACTTGGGATTAAAN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:75C0 YS:i:-39 YT:Z:CPCAGAAGATCACAGCTAGAGAATTGAGAATTAACTATACTACTAGCCATTTTAGGGCACCAAAACTTGGGATTAAAN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:75C0 YS:i:-39 YT:Z:CP

    What am I doing wrong?

    Thank you, Sipko
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    I wonder if the segfault is related to the malformed SAM file. The warnings are referring to unmapped reads (or their mates) still having chromosomes assigned to them. I've usually only seen that with bwa. You might try taking a small chunk of the your SAM file and see if you can find and remove the offending line. If removing just that read prevents the segfault, then you will have at least gotten around that.

    Comment

    • Simon Anders
      Senior Member
      • Feb 2010
      • 995

      #3
      You are the first to report a segfault on HTSeq. Not sure how to debug this. You could send me your files and I could try to reproduce it on my machine. You could also try to use the "--samout" option to get htseq-count to write out each SAM line it reads in order to see which line is at fault and then isolate that one.

      Comment

      • Sipkovandam@gmail.com
        Member
        • Mar 2013
        • 13

        #4
        Thanks for your quick response.

        Seeing the run crashes right after the GFF file is read in I presume the line it crashes on is the first one. Also when I try the --samout argument as you suggested the resulting file is empty (0 bytes). I think it has something to do with the format bowtie2 uses as out put. Maybe it could be something to do with spaces or tabs? I had a look at the python-count script but the parser is in a different file, which I did not look at. Maybe something the parser is expecting is having troubles with the bowtie2 format. However I would assume there has been at least a few other people that have used HTseq in conjunction with bowtie2 mappings...

        The sam file and gff file are about 3.5 gb together and I am not sure what is the best way of transferring them. I tried dropbox, but it is a bit slow (and it gave me an error). I could ask our server administrator to give you direct access to our server, but if you have an easier solution I would be happy to hear.

        Thanks, Sipko

        Comment

        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #5
          Actually, as it's the first line: Can you just post the first couple of lines?

          In a way, Python should not crash jsut because some spaces or tabs are mixed up. One notorious issue, however, is line endings. If they get messed up (can happen, for example, if you move files between Windows and Unxi environments, or through FTP), suddenly the whole huge file appears as a single line, which Python then tries to read in one go.

          Comment

          • Sipkovandam@gmail.com
            Member
            • Mar 2013
            • 13

            #6
            I send you a truncated version of the file by mail as the forums seem to change the tabs into spaces.

            Comment

            • Rainbird
              Member
              • Dec 2012
              • 16

              #7
              Any updates with this? I'm having the exact same issue. It seems that HTseq doesn't work with sam file generated by bowtie2.

              The reason I switched to bowtie2 is that, in my case, the bwa aligner always generates malformed sam file that can not be correctly converted into bam file using samtools. (with core dump error message "CIGAR and sequence length are inconsistent").

              which aligner to choose? please help!

              Comment

              • Sipkovandam@gmail.com
                Member
                • Mar 2013
                • 13

                #8
                I didnt get any update concerning this topic. I am using STAR to align now and HTseq works with the resulting SAM file. I am very satisfied with STAR and I have seen more people with similar reactions. It was easy to install and I find it very intuitive to use. It is much faster and I like the fact that in their paper (http://bioinformatics.oxfordjournals...ts635.abstract) they show that STAR is the mapper that is the most likely to map reads in agreement with at least two other mappers.
                I specifically like this last part as one of my concerns is that different mappers can give different results leading to different conclusions, even though the data is the same...

                Comment

                • Simon Anders
                  Senior Member
                  • Feb 2010
                  • 995

                  #9
                  I was a bit busy with other stuff, so very sorry I didn't reply earlier. Could one of you two try to make a test case for me, i.e., truncate your SAM and GFF file to just a few lines, and if the segfault still happens, send me the file. Or put the full files somewhere for me to download.

                  (Sipko: Yes, I got your file from two weeks ago, but I think you said it did not cause a segfault, right? Also, you email client changed the tabs to spaces, too.)

                  Comment

                  • pecanton
                    Member
                    • Jun 2011
                    • 13

                    #10
                    I would like to report that I'm also getting a segmentation fault when using a sorted SAM produced from a bowtie2 alignment and running htseq-count. I could send the files, but they are quite big. I could trim the SAM file, though I'm too sure how.

                    Comment

                    • DrS
                      Member
                      • Mar 2013
                      • 17

                      #11
                      Add me to the list now. I've also used samtools to convert sam to bam, sort the bam file, and convert back to sam - and still get the same error. This means it's not the first line causing the problem.

                      Is this only happening for paired-end alignments? Mine is paired-end - anyone else?

                      Comment

                      • dpryan
                        Devon Ryan
                        • Jul 2011
                        • 3478

                        #12
                        Originally posted by DrS View Post
                        Add me to the list now. I've also used samtools to convert sam to bam, sort the bam file, and convert back to sam - and still get the same error. This means it's not the first line causing the problem.

                        Is this only happening for paired-end alignments? Mine is paired-end - anyone else?
                        If you could post a small version of your SAM file (such that it still causes a crash) somewhere then it'll be easier for this bug (if that's what it turns out to be) to get fixed.

                        Comment

                        • DrS
                          Member
                          • Mar 2013
                          • 17

                          #13
                          I'm trying to narrow down which line is giving me problems right now..

                          Comment

                          • DrS
                            Member
                            • Mar 2013
                            • 17

                            #14
                            It doesn't like the last line from these few lines (the lines before it are fine... this is actually the 74th line of my sam file)


                            M02019:4:000000000-A4P4A:1:1101:6364:9627 99 chr16 24051992 17 160M90S = 24051992 -160 AAAATGTCAATCTGCCCTGTAATAATGCTCATTCTGGCTGCACATCAACTACATGGAACTATAACAGACATTCAGAGACGGTTGAACTGGTTGCTGGAGGGTCACTGAAGACAAACATAAAGAGACGTGAGAGACTGAGTCTGGGCTCTGACTGCTCTCTCTGTCTCTTTAACCCTTTCCCGACCCCCCGAGCCCGACCAAGATCCGGTATGCCGTTTTTTGCTTGAAAAAAAAAAAAAAAAAAAATAAT ABCCAFFFFFFFGGGGGGGGGGHGHHHHHHHHHHHHHHHGGHHHHHHHHHHHHHHHHHHHIIHIIIHHHHHHHHHHHHHFEFGGHHHHHHHHGHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHGG/GGHHGHHHHHHFHHHHHHHHHHHHHHHHHHHHH2@HHHH222221/0011?1//->C.---<<-=.-----...000.---;0;09..;..--0;00;0BFFF;B=--:;-;-;:--.000 AS:i:31XS:i:217 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:8C151 YS:i:313 YT:Z:CP
                            M02019:4:000000000-A4P4A:1:1101:6364:9627 147 chr16 24051992 17 160M = 24051992 -160 AAAATGTCAATCTGCCCTGTAATAATGCTCATTCTGGCTGCACATCAACTACATGGAACTATAACAGACATTCAGAGACGGTTGAACTGGTTGCTGGAGGGTCACTGAAGACAAACATAAAGAGACGTGAGAGACTGAGTCTGGGCTCTGACTGCTCTCT HHHHHHHHHHGGGDGHHHHHHHHHHHHHHHHHHHGHHHHHGHHHHHHHHHHGHHHHHG4HHHHHHHHHHHHHGHHHGGGHHHHHHHHGHHHHHHHGHHHHHHHHHHHHHHHHHHHFHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGFFFFFCCCBBBA AS:i:313 XS:i:218 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:8C151 YS:i:313 YT:Z:CP
                            M02019:4:000000000-A4P4A:1:1101:6397:9032 89 chr16 19255696 21 104S146M = 19255696 0 TTTATATTCTGAAGACGGATTACGTGTTTTTGTTTTTTTTTTTCAAGCAGAAGACGGCATAAGAGATCTAGTACGGTCTCGTGGGCTCGGAGATGTGTATAAGAAGCTCAGGAATTACTTTGCCTACAGCCTTGGCAGCCCCAGTGGAGGCTGGGATGATGTTCTGACTGGCACCACGGCCATCCCTCCACAGCTTCCCAGAGGGCCCATCAACGGTCTTCTGTGTTGCTGTGATGGCATGAACAGTGCT 99;///;://9..;.://..:......9;...;FFFDFF;;0GGGGGGGGGGGGGHHHHHG:HGFCHHHHGEDGGGGGGFGHF>>>>/HHHHHHHHHHGHHHHHHHHHHBHHHHHHHHHHHHGHHHHHHGHHHHGGGGGHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHGHGFGGHHGGHGGGHHGHHHHHGHHHGHHHHGHHHHHHHGGGGHHHHHHHHGHHHHHGGGGGGGGGGFFFFFFFCCCCC AS:i:292 XS:i:72 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:146 YT:Z:UP
                            M02019:4:000000000-A4P4A:1:1101:6397:9032 165 chr16 19255696 0 * = 19255696 0 CCTTGATGTTCTTCCACTCTTCAGGTGTAACTGTAAAAGTTCTGAACTTCTATAATTTTCCGTCCTCATTGCACCACCGGTCTTCGAGCTCCAGATCCTCATCTGGCTATTGTCCCTCTTTCTGCCCTTCTCTCTTTATTCCACTACGAGCAACTCCTGTCGCCTGACGCAACACTCGCGGTTGCTCCTATCCTAATCATCCGCTACCGCCGCTGATCACAGGTAAGTCCTAGTTAGTAGGAGGTCACTT >111111B3BD333A1111E1A311111133D1DD3331133212B1DD2222222AA2D1/BB//1112211B00//A////B2//0//1@@11@1@1@1B1211101121BB11100B1B@21FCB001121B2122BB2FG1B0/////0<0B111B1///</00///</0/1</---.--.<00<<=0<00=0D00:0;.:-;/::----9;0;0/00;0900000000;///9///:/---//// YT:Z:UP

                            Comment

                            • DrS
                              Member
                              • Mar 2013
                              • 17

                              #15
                              I've now tried this on 14 files I generated alignments for today, and every single one of them fails.

                              I just ran a quick test on a file I had aligned using tophat2, also paired-end reads, and it worked just fine, so I know it's not my install.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                Yesterday, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 12:03 PM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, Yesterday, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...