I have been trying to get counts for genes using HTseq but it keep crashing giving me a segmentation error and I have no idea what is causing it.
The data I use is created using bowtie2 on Human data from Encode:
bowtie2 -p 16 -x human_index/hg19 -1 wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd1Rep1 -2 wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd2Rep1 -S wgEncodeCshlLongRnaSeqHelas3CellPapFastqRep1.sam
The commandline I use to run HTseq:
htseq-count -s no wgEncodeCshlLongRnaSeqHelas3CellPapFastqRep1.sam Homo_sapiens.GRCh37.70.gtf
I get 2 warnings and an error if I run this command:
2200000 GFF lines processed.
2280612 GFF lines processed.
Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
Segmentation fault (core dumped)
The log for the segmentation fault has this error:
kernel: [5076890.033992] python[27097]: segfault at 0 ip 00007f38cec6336c sp 00007fff4b6b7190 error 6 in _HTSeq.so[7f38cec24000+50000]
I downloaded the GTF file from ensembl and from what I understand I had to replace the chromosome numbers e.g.: "12" to "chr12" (As before I did this it could not find the chromosomes), which I did using sed.
My GTF file has the following format:
chr11 processed_pseudogene exon 75780 76143 . + . gene_id "ENSG00000253826"; transcript_id "ENST00000519787"; exon_number "1"; gene_name "RP11-304M2.1"; gene_biotype "pseudogene"; transcript_name "RP11-304M2.1-001"; exon_id "ENSE00002139035";
chr11 processed_transcript exon 86612 87605 . - . gene_id "ENSG00000224777"; transcript_id "ENST00000521196"; exon_number "1"; gene_name "OR4F2P"; gene_biotype "pseudogene"; transcript_name "OR4F2P-002"; exon_id "ENSE00002124594";
My Sam file has the following format:
BILLIEHOLIDAY_0004:1:1:2921:975#0 77 0 0 0 * * 0 0 NCAAAAGTGACAATCCAGCAATTCCAAATAAGGTATGAAAAGGATCCACCATATCTCCTGGCCTGTCTGCAAATCC BIKIHLJMNMb___b______________bQQ__bb____b_____b____bbb_____________bb______b YT:Z:UP
BILLIEHOLIDAY_0004:1:1:2921:975#0 141 0 0 0 * * 0 0 NNTAGTTATNCTACTCATGTTGNTTCCNNGNNTCCCTAAAGATAATTNGAAGACTTCATTGGATTTATAGAGAGAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB YT:Z:UP
BILLIEHOLIDAY_0004:1:1:3277:975#0 83 chr10 64566791 23 76M = 64566652 -215 CAGAAGATCACAGCTAGAGAATTGAGAATTAACTATACTACTAGCCATTTTAGGGCACCAAAACTTGGGATTAAAN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:75C0 YS:i:-39 YT:Z:CPCAGAAGATCACAGCTAGAGAATTGAGAATTAACTATACTACTAGCCATTTTAGGGCACCAAAACTTGGGATTAAAN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:75C0 YS:i:-39 YT:Z:CP
What am I doing wrong?
Thank you, Sipko
The data I use is created using bowtie2 on Human data from Encode:
bowtie2 -p 16 -x human_index/hg19 -1 wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd1Rep1 -2 wgEncodeCshlLongRnaSeqHelas3CellPapFastqRd2Rep1 -S wgEncodeCshlLongRnaSeqHelas3CellPapFastqRep1.sam
The commandline I use to run HTseq:
htseq-count -s no wgEncodeCshlLongRnaSeqHelas3CellPapFastqRep1.sam Homo_sapiens.GRCh37.70.gtf
I get 2 warnings and an error if I run this command:
2200000 GFF lines processed.
2280612 GFF lines processed.
Warning: Malformed SAM line: MRNM != '*' although flag bit &0x0008 set
Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set
Segmentation fault (core dumped)
The log for the segmentation fault has this error:
kernel: [5076890.033992] python[27097]: segfault at 0 ip 00007f38cec6336c sp 00007fff4b6b7190 error 6 in _HTSeq.so[7f38cec24000+50000]
I downloaded the GTF file from ensembl and from what I understand I had to replace the chromosome numbers e.g.: "12" to "chr12" (As before I did this it could not find the chromosomes), which I did using sed.
My GTF file has the following format:
chr11 processed_pseudogene exon 75780 76143 . + . gene_id "ENSG00000253826"; transcript_id "ENST00000519787"; exon_number "1"; gene_name "RP11-304M2.1"; gene_biotype "pseudogene"; transcript_name "RP11-304M2.1-001"; exon_id "ENSE00002139035";
chr11 processed_transcript exon 86612 87605 . - . gene_id "ENSG00000224777"; transcript_id "ENST00000521196"; exon_number "1"; gene_name "OR4F2P"; gene_biotype "pseudogene"; transcript_name "OR4F2P-002"; exon_id "ENSE00002124594";
My Sam file has the following format:
BILLIEHOLIDAY_0004:1:1:2921:975#0 77 0 0 0 * * 0 0 NCAAAAGTGACAATCCAGCAATTCCAAATAAGGTATGAAAAGGATCCACCATATCTCCTGGCCTGTCTGCAAATCC BIKIHLJMNMb___b______________bQQ__bb____b_____b____bbb_____________bb______b YT:Z:UP
BILLIEHOLIDAY_0004:1:1:2921:975#0 141 0 0 0 * * 0 0 NNTAGTTATNCTACTCATGTTGNTTCCNNGNNTCCCTAAAGATAATTNGAAGACTTCATTGGATTTATAGAGAGAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB YT:Z:UP
BILLIEHOLIDAY_0004:1:1:3277:975#0 83 chr10 64566791 23 76M = 64566652 -215 CAGAAGATCACAGCTAGAGAATTGAGAATTAACTATACTACTAGCCATTTTAGGGCACCAAAACTTGGGATTAAAN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:75C0 YS:i:-39 YT:Z:CPCAGAAGATCACAGCTAGAGAATTGAGAATTAACTATACTACTAGCCATTTTAGGGCACCAAAACTTGGGATTAAAN BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:75C0 YS:i:-39 YT:Z:CP
What am I doing wrong?
Thank you, Sipko
Comment