SEQanswers

Old 03-07-2013, 04:45 AM   #1
Sipkovandam@gmail.com
Member
 
Location: England

Join Date: Mar 2013
Posts: 13
Default HTseq Segmentation fault

I have been trying to get counts for genes using HTseq, but it keeps crashing with a segmentation fault and I have no idea what is causing it.

The data I use is created using bowtie2 on Human data from Encode:
bowtie2 -p 16 -x human_index/hg19 -1 wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd1Rep1 -2 wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd2Rep1 -S wgEncodeCshlLongRnaSeqHelas3CellPapFastqRep1.sam

The commandline I use to run HTseq:
htseq-count -s no wgEncodeCshlLongRnaSeqHelas3CellPapFastqRep1.sam Homo_sapiens.GRCh37.70.gtf

I get 2 warnings and an error if I run this command:
2200000 GFF lines processed.
2280612 GFF lines processed.
Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
Segmentation fault (core dumped)

The log for the segmentation fault has this error:
kernel: [5076890.033992] python[27097]: segfault at 0 ip 00007f38cec6336c sp 00007fff4b6b7190 error 6 in _HTSeq.so[7f38cec24000+50000]

I downloaded the GTF file from Ensembl and, from what I understand, I had to replace the chromosome names, e.g. "12" with "chr12" (before I did this it could not find the chromosomes), which I did using sed.
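For reference, the renaming step described above can be sketched in Python roughly as follows. This is only an illustration, not the sed command that was actually used: the function name `add_chr_prefix` is hypothetical, and it assumes that every sequence name only needs a "chr" prefix. That assumption does not hold for everything (Ensembl's "MT" corresponds to "chrM" in hg19, and unplaced scaffolds are named differently), so those records would still fail to match.

```python
import io

def add_chr_prefix(gtf_in, gtf_out):
    """Copy GTF lines from gtf_in to gtf_out, prefixing the first column with 'chr'."""
    for line in gtf_in:
        # Leave comments, blank lines and already-prefixed names untouched.
        if line.startswith("#") or not line.strip() or line.startswith("chr"):
            gtf_out.write(line)
        else:
            # The sequence name is the first column, so prefixing the line is enough.
            gtf_out.write("chr" + line)

# Small in-memory example; real use would pass two open file handles instead.
src = io.StringIO('11\tprocessed_pseudogene\texon\t75780\t76143\t.\t+\t.\tgene_id "ENSG00000253826";\n')
dst = io.StringIO()
add_chr_prefix(src, dst)
print(dst.getvalue(), end="")
```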

My GTF file has the following format:
chr11 processed_pseudogene exon 75780 76143 . + . gene_id "ENSG00000253826"; transcript_id "ENST00000519787"; exon_number "1"; gene_name "RP11-304M2.1"; gene_biotype "pseudogene"; transcript_name "RP11-304M2.1-001"; exon_id "ENSE00002139035";
chr11 processed_transcript exon 86612 87605 . - . gene_id "ENSG00000224777"; transcript_id "ENST00000521196"; exon_number "1"; gene_name "OR4F2P"; gene_biotype "pseudogene"; transcript_name "OR4F2P-002"; exon_id "ENSE00002124594";
My Sam file has the following format:
BILLIEHOLIDAY_0004:1:1:2921:975#0 77 0 0 0 * * 0 0 NCAAAAGTGACAATCCAGCAATTCCAAATAAGGTATGAAAAGGATCCACCATATCTCCTGGCCTGTCTGCAAATCC BIKIHLJMNMb___b______________bQQ__bb____b_____b____bbb_____________bb______b YT:Z:UP
BILLIEHOLIDAY_0004:1:1:2921:975#0 141 0 0 0 * * 0 0 NNTAGTTATNCTACTCATGTTGNTTCCNNGNNTCCCTAAAGATAATTNGAAGACTTCATTGGATTTATAGAGAGAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB YT:Z:UP
BILLIEHOLIDAY_0004:1:1:3277:975#0 83 chr10 64566791 23 76M = 64566652 -215 CAGAAGATCACAGCTAGAGAATTGAGAATTAACTATACTACTAGCCATTTTAGGGCACCAAAACTTGGGATTAAAN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:75C0 YS:i:-39 YT:Z:CP

What am I doing wrong?

Thank you, Sipko
Old 03-07-2013, 09:43 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I wonder if the segfault is related to the malformed SAM file. The warnings refer to unmapped reads (or their mates) still having chromosomes assigned to them. I've usually only seen that with bwa. You might try taking a small chunk of your SAM file and seeing if you can find and remove the offending line. If removing just that read prevents the segfault, then you will at least have a workaround.
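A rough way to locate such lines is to scan the SAM file for the two conditions named in the warnings. The sketch below is an assumption-laden illustration, not part of HTSeq: the function name `find_inconsistent_lines` is made up, and it expects tab-separated SAM records. A reported line is only a candidate to test, not proof that it triggers the crash.

```python
def find_inconsistent_lines(sam_lines):
    """Yield (line_number, reason) for records matching the two warnings."""
    for number, line in enumerate(sam_lines, start=1):
        if line.startswith("@"):  # header line, nothing to check
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            yield number, "fewer than 11 tab-separated fields"
            continue
        try:
            flag = int(fields[1])
        except ValueError:
            yield number, "FLAG is not an integer"
            continue
        rname, rnext = fields[2], fields[6]
        if flag & 0x4 and rname != "*":
            yield number, "read unmapped (0x4) but RNAME is not '*'"
        if flag & 0x8 and rnext != "*":
            yield number, "mate unmapped (0x8) but mate reference is not '*'"

# Tiny synthetic example: one header line and two records.
example = [
    "@HD\tVN:1.0\n",
    "r1\t165\tchr16\t19255696\t0\t*\t=\t19255696\t0\tACGT\tIIII\n",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
for number, reason in find_inconsistent_lines(example):
    print(number, reason)
```

Deleting the reported lines from a small chunk and re-running htseq-count on it would then show whether they are involved in the segfault.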
Old 03-07-2013, 10:27 AM   #3
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

You are the first to report a segfault on HTSeq. Not sure how to debug this. You could send me your files and I could try to reproduce it on my machine. You could also try to use the "--samout" option to get htseq-count to write out each SAM line it reads in order to see which line is at fault and then isolate that one.
Old 03-08-2013, 08:03 AM   #4
Sipkovandam@gmail.com
Member
 
Location: England

Join Date: Mar 2013
Posts: 13
Default

Thanks for your quick response.

Since the run crashes right after the GFF file is read in, I presume the line it crashes on is the first one. Also, when I try the --samout argument as you suggested, the resulting file is empty (0 bytes). I think it has something to do with the format bowtie2 uses as output. Maybe it has something to do with spaces or tabs? I had a look at the python-count script, but the parser is in a different file, which I did not look at. Maybe the parser is having trouble with the bowtie2 format. However, I would assume there have been at least a few other people who have used HTseq in conjunction with bowtie2 mappings...

The SAM file and GFF file are about 3.5 GB together and I am not sure what the best way of transferring them is. I tried Dropbox, but it is a bit slow (and it gave me an error). I could ask our server administrator to give you direct access to our server, but if you have an easier solution I would be happy to hear it.

Thanks, Sipko
Old 03-08-2013, 08:15 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Actually, as it's the first line: Can you just post the first couple of lines?

In a way, Python should not crash just because some spaces or tabs are mixed up. One notorious issue, however, is line endings. If they get messed up (which can happen, for example, if you move files between Windows and Unix environments, or through FTP), suddenly the whole huge file appears as a single line, which Python then tries to read in one go.
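One way to check for this is to count the different line terminators in the first part of the file. The following is a minimal sketch under stated assumptions: the names `count_line_endings` and `sniff_file` are hypothetical, and only the first megabyte is inspected, so a CRLF pair split exactly at the end of that chunk would be miscounted by one.

```python
def count_line_endings(data):
    """Count CRLF, lone CR and lone LF terminators in a bytes object."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf  # carriage returns not followed by a newline
    lf = data.count(b"\n") - crlf  # newlines not preceded by a carriage return
    return {"CRLF": crlf, "CR": cr, "LF": lf}

def sniff_file(path, size=1 << 20):
    """Inspect the first `size` bytes of a file, read in binary mode."""
    with open(path, "rb") as handle:
        return count_line_endings(handle.read(size))

# In-memory example with one terminator of each kind.
print(count_line_endings(b"a\r\nb\nc\rd"))
```

A file with only lone CR terminators (old Mac style) is the case where a reader that splits on "\n" would see the whole file as a single line.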
Old 03-08-2013, 10:56 AM   #6
Sipkovandam@gmail.com
Member
 
Location: England

Join Date: Mar 2013
Posts: 13
Default

I sent you a truncated version of the file by mail, as the forums seem to change the tabs into spaces.
Old 04-22-2013, 06:53 AM   #7
Rainbird
Member
 
Location: US

Join Date: Dec 2012
Posts: 16
Default

Any updates on this? I'm having the exact same issue. It seems that HTseq doesn't work with SAM files generated by bowtie2.

The reason I switched to bowtie2 is that, in my case, the bwa aligner always generates a malformed SAM file that cannot be correctly converted into a BAM file using samtools (with the core dump error message "CIGAR and sequence length are inconsistent").

Which aligner should I choose? Please help!
Old 04-22-2013, 07:12 AM   #8
Sipkovandam@gmail.com
Member
 
Location: England

Join Date: Mar 2013
Posts: 13
Default

I didn't get any update concerning this topic. I am using STAR to align now, and HTseq works with the resulting SAM file. I am very satisfied with STAR and I have seen more people with similar reactions. It was easy to install and I find it very intuitive to use. It is much faster, and I like the fact that in their paper (http://bioinformatics.oxfordjournals...ts635.abstract) they show that STAR is the mapper that is the most likely to map reads in agreement with at least two other mappers.
I specifically like this last part, as one of my concerns is that different mappers can give different results, leading to different conclusions, even though the data is the same...
Old 04-22-2013, 07:17 AM   #9
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

I was a bit busy with other stuff, so very sorry I didn't reply earlier. Could one of you two try to make a test case for me, i.e., truncate your SAM and GFF files to just a few lines and, if the segfault still happens, send me the files? Or put the full files somewhere for me to download.

(Sipko: Yes, I got your file from two weeks ago, but I think you said it did not cause a segfault, right? Also, your email client changed the tabs to spaces, too.)
Old 07-08-2013, 10:18 AM   #10
pecanton
Member
 
Location: Florida, USA

Join Date: Jun 2011
Posts: 13
Default

I would like to report that I'm also getting a segmentation fault when running htseq-count on a sorted SAM file produced from a bowtie2 alignment. I could send the files, but they are quite big. I could trim the SAM file, though I'm not too sure how.
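Trimming a SAM file for a test case mostly comes down to keeping the header lines plus the first few alignment records. A minimal Python sketch of that idea follows; the function name `head_of_sam` and the file names in the comment are hypothetical, and whether the trimmed file still triggers the segfault has to be checked by re-running htseq-count on it.

```python
def head_of_sam(sam_lines, n_records):
    """Yield all leading header lines plus the first n_records alignment records."""
    kept = 0
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are always kept
        elif kept < n_records:
            yield line
            kept += 1
        else:
            break  # enough records collected, stop reading

# In-memory example; with real files one could write something like
#   with open("full.sam") as src, open("small.sam", "w") as dst:
#       dst.writelines(head_of_sam(src, 100))
lines = ["@SQ\tSN:chr1\tLN:10\n", "read1\n", "read2\n", "read3\n"]
print(list(head_of_sam(lines, 2)))
```

For paired-end data where mates sit on adjacent lines, an even record count keeps pairs together. If the small file no longer crashes, the count can be increased step by step until it does.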
Old 08-19-2013, 12:49 PM   #11
DrS
Member
 
Location: North Carolina

Join Date: Mar 2013
Posts: 17
Default

Add me to the list now. I've also used samtools to convert sam to bam, sort the bam file, and convert back to sam - and still get the same error. This means it's not the first line causing the problem.

Is this only happening for paired-end alignments? Mine is paired-end - anyone else?
Old 08-19-2013, 01:03 PM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by DrS View Post
Add me to the list now. I've also used samtools to convert sam to bam, sort the bam file, and convert back to sam - and still get the same error. This means it's not the first line causing the problem.

Is this only happening for paired-end alignments? Mine is paired-end - anyone else?
If you could post a small version of your SAM file (such that it still causes a crash) somewhere then it'll be easier for this bug (if that's what it turns out to be) to get fixed.
Old 08-19-2013, 01:13 PM   #13
DrS
Member
 
Location: North Carolina

Join Date: Mar 2013
Posts: 17
Default

I'm trying to narrow down which line is giving me problems right now..
Old 08-19-2013, 01:23 PM   #14
DrS
Member
 
Location: North Carolina

Join Date: Mar 2013
Posts: 17
Default

It doesn't like the last line from these few lines (the lines before it are fine... this is actually the 74th line of my sam file)


Quote:
M02019:4:000000000-A4P4A:1:1101:6364:9627 99 chr16 24051992 17 160M90S = 24051992 -160 AAAATGTCAATCTGCCCTGTAATAATGCTCATTCTGGCTGCACATCAACTACATGGAACTATAACAGACATTCAGAGACGGTTGAACTGGTTGCTGGAGGGTCACTGAAGACAAACATAAAGAGACGTGAGAGACTGAGTCTGGGCTCTGACTGCTCTCTCTGTCTCTTTAACCCTTTCCCGACCCCCCGAGCCCGACCAAGATCCGGTATGCCGTTTTTTGCTTGAAAAAAAAAAAAAAAAAAAATAAT ABCCAFFFFFFFGGGGGGGGGGHGHHHHHHHHHHHHHHHGGHHHHHHHHHHHHHHHHHHHIIHIIIHHHHHHHHHHHHHFEFGGHHHHHHHHGHHHHHHGGGHHHHHHHHHHHHHHHHHHHHHHHHGG/GGHHGHHHHHHFHHHHHHHHHHHHHHHHHHHHH2@HHHH222221/0011?1//->C.---<<-=.-----...000.---;0;09..;..--0;00;0BFFF;B=--:;-;-;:--.000 AS:i:31XS:i:217 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:8C151 YS:i:313 YT:Z:CP
M02019:4:000000000-A4P4A:1:1101:6364:9627 147 chr16 24051992 17 160M = 24051992 -160 AAAATGTCAATCTGCCCTGTAATAATGCTCATTCTGGCTGCACATCAACTACATGGAACTATAACAGACATTCAGAGACGGTTGAACTGGTTGCTGGAGGGTCACTGAAGACAAACATAAAGAGACGTGAGAGACTGAGTCTGGGCTCTGACTGCTCTCT HHHHHHHHHHGGGDGHHHHHHHHHHHHHHHHHHHGHHHHHGHHHHHHHHHHGHHHHHG4HHHHHHHHHHHHHGHHHGGGHHHHHHHHGHHHHHHHGHHHHHHHHHHHHHHHHHHHFHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGFFFFFCCCBBBA AS:i:313 XS:i:218 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:8C151 YS:i:313 YT:Z:CP
M02019:4:000000000-A4P4A:1:1101:6397:9032 89 chr16 19255696 21 104S146M = 19255696 0 TTTATATTCTGAAGACGGATTACGTGTTTTTGTTTTTTTTTTTCAAGCAGAAGACGGCATAAGAGATCTAGTACGGTCTCGTGGGCTCGGAGATGTGTATAAGAAGCTCAGGAATTACTTTGCCTACAGCCTTGGCAGCCCCAGTGGAGGCTGGGATGATGTTCTGACTGGCACCACGGCCATCCCTCCACAGCTTCCCAGAGGGCCCATCAACGGTCTTCTGTGTTGCTGTGATGGCATGAACAGTGCT 99;///;://9..;.://..:......9;...;FFFDFF;;0GGGGGGGGGGGGGHHHHHG:HGFCHHHHGEDGGGGGGFGHF>>>>/HHHHHHHHHHGHHHHHHHHHHBHHHHHHHHHHHHGHHHHHHGHHHHGGGGGHHHHHHHHHGHHHHHHHHHHHHHHHHHHHHHHGHGFGGHHGGHGGGHHGHHHHHGHHHGHHHHGHHHHHHHGGGGHHHHHHHHGHHHHHGGGGGGGGGGFFFFFFFCCCCC AS:i:292 XS:i:72 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:146 YT:Z:UP
M02019:4:000000000-A4P4A:1:1101:6397:9032 165 chr16 19255696 0 * = 19255696 0 CCTTGATGTTCTTCCACTCTTCAGGTGTAACTGTAAAAGTTCTGAACTTCTATAATTTTCCGTCCTCATTGCACCACCGGTCTTCGAGCTCCAGATCCTCATCTGGCTATTGTCCCTCTTTCTGCCCTTCTCTCTTTATTCCACTACGAGCAACTCCTGTCGCCTGACGCAACACTCGCGGTTGCTCCTATCCTAATCATCCGCTACCGCCGCTGATCACAGGTAAGTCCTAGTTAGTAGGAGGTCACTT >111111B3BD333A1111E1A311111133D1DD3331133212B1DD2222222AA2D1/BB//1112211B00//A////B2//0//1@@11@1@1@1B1211101121BB11100B1B@21FCB001121B2122BB2FG1B0/////0<0B111B1///</00///</0/1</---.--.<00<<=0<00=0D00:0;.:-;/::----9;0;0/00;0900000000;///9///:/---//// YT:Z:UP
Old 08-19-2013, 01:40 PM   #15
DrS
Member
 
Location: North Carolina

Join Date: Mar 2013
Posts: 17
Default

I've now tried this on 14 files I generated alignments for today, and every single one of them fails.

I just ran a quick test on a file I had aligned using tophat2, also paired-end reads, and it worked just fine, so I know it's not my install.
Old 08-19-2013, 01:43 PM   #16
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by DrS View Post
It doesn't like the last line from these few lines (the lines before it are fine... this is actually the 74th line of my sam file)
Are those two reads sufficient to cause the crash, or are the earlier lines also required? Also, could you post the exact command you used that caused the crash (with options, like strandedness)?
Old 08-20-2013, 01:23 AM   #17
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

FYI, I was able to reproduce this crash in HTSeq-0.5.3 but it ran correctly in HTSeq-0.5.4. If you're not using the most up-to-date version, try upgrading and see if that fixes things.
Old 08-20-2013, 06:16 AM   #18
DrS
Member
 
Location: North Carolina

Join Date: Mar 2013
Posts: 17
Default

Thank you - I'll try to do that. And yes, those were sufficient to make it crash.
Old 08-20-2013, 06:24 AM   #19
DrS
Member
 
Location: North Carolina

Join Date: Mar 2013
Posts: 17
Default

Public License v3. Part of the 'HTSeq' framework, version 0.5.4p1.


Already running 0.5.4
Old 08-20-2013, 06:26 AM   #20
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The most recent version is 0.5.4p3
All times are GMT -8.