SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
featureCounts Error: SAM_pairer_osr_next_name: Assertion `rlen < 1024' failed nachocab Bioinformatics 5 12-01-2015 12:41 PM
featureCounts SylvainL Bioinformatics 4 03-16-2015 03:35 AM
featureCounts error heso Bioinformatics 8 11-14-2014 12:14 PM
Bug in featureCounts? brinda Bioinformatics 1 01-23-2014 06:35 PM
Bug in featureCounts? brinda Bioinformatics 1 01-19-2014 04:22 PM

Reply
 
Thread Tools
Old 03-19-2017, 05:08 PM   #1
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,346
Default featureCounts error

I am running into a featureCounts error that has been referred to in prior reports (https://www.biostars.org/p/213666/ and http://seqanswers.com/forums/showthread.php?t=64587 etc).
Code:
featureCounts: readSummary.c:1578: process_pairer_header: Assertion `l_name < 100' failed.
It does not appear to have been addressed as far as I can tell. I am using data from a regular Illumina HiSeq 4000 run mapped using BBMap.

@Wei Shi: I will send you a PM in case you don't happen to see this thread.

Last edited by GenoMax; 03-19-2017 at 05:11 PM.
GenoMax is offline   Reply With Quote
Old 03-19-2017, 06:00 PM   #2
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Sorry about this. Could you please try out the latest version we released a couple of days ago? If the problem persists, we will need to take a look at the data.
shi is offline   Reply With Quote
Old 03-20-2017, 04:12 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,346
Default

Hi Wei: New version of subread still has the same error. I just emailed you with a sample of the data.

Last edited by GenoMax; 03-20-2017 at 04:20 AM.
GenoMax is offline   Reply With Quote
Old 03-22-2017, 02:48 PM   #4
yangliao
Member
 
Location: Melboune, VIC

Join Date: May 2013
Posts: 11
Default

Quote:
Originally Posted by GenoMax View Post
Hi Wei: New version of subread still has the same error. I just emailed you with a sample of the data.
Hi GenoMax,

I tried using Samtools (v 1.1) to convert the BAM file you provided to the SAM format. In the Samtools output, the chromosome names looked abnormal:

Code:
K00270:29:HFV25BBXX:6:1205:24616:23681 1:N:0:TTAGGC     163     chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38  11612   3       50M     =       11692   130     GTCTAGGAATGCCTGTTTCTCCACAAAGTGTTTACTTTTGGATTTTTGCC      AAFFFJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJ   XT:A:R  NM:i:0  AM:i:3
K00270:29:HFV25BBXX:6:1205:24616:23681 1:N:0:TTAGGC     83      chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38  11692   3       50M     =       11612   -130    ATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAATTTCCA      JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA   XT:A:R  NM:i:0  AM:i:3
The texts marked in red are where the chromosome names should be in the Samtools output from a normal BAM file. However, in the Samtools output from your BAM file, they are not only the chromosome names, but also include all the information tags ("LN", "M5", "AS", etc) in the BAM file header. This suggests that your BAM file had an incorrect format on the chromosome names.

Can you please look into your pipeline to see if there are format conversion issues when generating the BAM files?

Thanks!

Yang
yangliao is offline   Reply With Quote
Old 03-22-2017, 04:01 PM   #5
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Thanks for sending us an example bam file, @GenoMax. We found there were some tags added into the chr field in the bam file, which led to the crash of featureCounts. Below are the first two reads in the file
Quote:
K00270:29:HFV25BBXX:6:2220:19278:41018 1:N:0:TTAGGC 99 chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38 13603 3 50M = 13712 159 GTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTAGTGCCCGTTGGAGA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

JJJJJJJFFJJJF XT:A:R NM:i:0 AM:i:3
K00270:29:HFV25BBXX:6:2220:19278:41018 1:N:0:TTAGGC 147 chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38 13712 3 50M = 13603 -159 AGCTCCCCTCTGTGGAATCCAATCTGTCTTCCATCCTGCGTGGCCGAGGG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
JJJJJJJJFFFAA XT:A:R NM:i:0 AM:i:3
shi is offline   Reply With Quote
Old 03-22-2017, 04:09 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,580
Default

It looks like the issue is that BBMap was printing the original names. If you use the "trd" flag it will truncate the names after the first whitespace (to just "chr1"), which some tools expect. Samtools will happily accept names with whitespaces, and by default BBMap retains the entire headers, to avoid loss of information. Can you check whether BBMap with the "trd" flag resolves the issue, or if it's something else?
Brian Bushnell is offline   Reply With Quote
Old 03-23-2017, 03:39 PM   #7
dcameron
Member
 
Location: Australia

Join Date: Mar 2013
Posts: 25
Default

Quote:
Originally Posted by Brian Bushnell View Post
Samtools will happily accept names with whitespaces, and by default BBMap retains the entire headers, to avoid loss of information.
Given the SAM specifications disallow spaces in the RNAME field (Regular expression: "\*|[!-()+-<>-~][!-~]*"), it's unsurprising that downstream tools break.

Would changing the BBMap default output to comply with the file format specifications cause other issues? If loss of information is the issue then the full fasta headers could still be included in the SAM file in the @SQ headers, or is the use case you're defending against one in which the fasta header are not unique (eg "chr1 hg19" and "chr1 mm10" in the same reference)?

Edit: Out of curiosity: if the fasta header contains a TAB, does BBMap clean it up, or just write a malformed SAM record?

Last edited by dcameron; 03-23-2017 at 03:44 PM.
dcameron is offline   Reply With Quote
Old 03-23-2017, 06:15 PM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,580
Default

Quote:
Originally Posted by dcameron View Post
Given the SAM specifications disallow spaces in the RNAME field (Regular expression: "\*|[!-()+-<>-~][!-~]*"), it's unsurprising that downstream tools break.
BBMap reports names as they were stated in the input files. I have no idea why the sam format disallows spaces, since they don't cause any problems and are widely used in sequence headers. I did not initially notice that they might be disallowed since it's never explicitly stated, just encoded in a regex. I ignored that since regexes are language-specific and the sam spec has no mention of which language it refers to, so it's impossible to decode unambiguously. Also, it has no mention of the term "whitespace". It explicitly states that spaces are allowed in some fields, but never mentions any fields in which they are disallowed. Coupled with the fact that samtools supports spaces in names, I have never seen any evidence that spaces in names were disallowed in the sam format, and I conclude that they in fact are not disallowed, because "(Regular expression: "\*|[!-()+-<>-~][!-~]*")" has no meaning without a language being specified; and no language has been specified by the sam format. There are many regex languages. There are with numerous other absences in the sam spec, so you generally have to write an interpretation based on observation, and it involves a lot of guesswork. Consider the cigar string and MD tag, for example - I still have no idea what "P" means in the cigar string. It's defined as "P padding (silent deletion from padded reference)". What does that mean? There's no way to know. I've never encountered one. Similarly, MD tags are crucial, but they are never defined anywhere. The current sam spec does not even mention the term MD. You just have to kind of guess how it works from observation. So I'd say sam is kind of like English or other spoken languages - defined based on usage, since it's obviously not completely specified. As such, not randomly discarding information is probably a better choice.

That said -

1) If people tend to feel like the sam spec disallows spaces in reference names, I will probably change BBMap's default to truncate at the first whitespace, so lossless mode will require the "trd=f" flag. It's never been a problem in the past, though, and samtools obviously does handle spaces in reference names. Since the sam spec is not very informative, I tend to feel like samtools is the definitive answer on what is or is not sam compliant; by that metric, currently, spaces in names are compliant.
2) This change would be unfortunate, since fasta files may contain sequences like "chr1" and "chr1 alt3" which are not supported by this interpretation of the sam format, but are supported by both BBMap and samtools.

Quote:
Would changing the BBMap default output to comply with the file format specifications cause other issues? If loss of information is the issue then the full fasta headers could still be included in the SAM file in the @SQ headers, or is the use case you're defending against one in which the fasta header are not unique (eg "chr1 hg19" and "chr1 mm10" in the same reference)?
There's no problem, I just don't like to discard information when I don't need to. If an input sequence is named "chr1 hg19" I don't see any reason to rename it, as long as downstream tools can handle spaces, which they generally can. But I'll probably change the defaults to discard information, since compatibility is generally considered to be more important than accuracy.

Quote:
Edit: Out of curiosity: if the fasta header contains a TAB, does BBMap clean it up, or just write a malformed SAM record?
That's a good question - I think, in that case, BBMap would write a malformed sam record. I'll fix that.

Last edited by Brian Bushnell; 03-23-2017 at 07:04 PM.
Brian Bushnell is offline   Reply With Quote
Old 03-26-2017, 03:41 PM   #9
dcameron
Member
 
Location: Australia

Join Date: Mar 2013
Posts: 25
Default

Quote:
Originally Posted by Brian Bushnell View Post
I have no idea why the sam format disallows spaces, since they don't cause any problems and are widely used in sequence headers. I did not initially notice that they might be disallowed since it's never explicitly stated, just encoded in a regex. I ignored that since regexes are language-specific and the sam spec has no mention of which language it refers to, so it's impossible to decode unambiguously.
The specs are indeed a bit light on detail. Not only language, but character encoding is not specified. For VCF, I contributed a clarification that it is US-ASCII (technically headerless UTF-8 for VCF) but it appears that this, and many other clarifications, did not get ported to the SAM specs as well.

Quote:
Originally Posted by Brian Bushnell View Post
It explicitly states that spaces are allowed in some fields, but never mentions any fields in which they are disallowed. I have never seen any evidence that spaces in names were disallowed in the sam format, and I conclude that they in fact are not disallowed, because "(Regular expression: "\*|[!-()+-<>-~][!-~]*")" has no meaning without a language being specified; and no language has been specified by the sam format. There are many regex languages. There are with numerous other absences in the sam spec, so you generally have to write an interpretation based on observation, and it involves a lot of guesswork.
If we treated every ambiguous part of the SAM specifications as if it didn't exist, then nobody's output would be compatible with anyone else's programs. Yes we have to guess, but as everything I've seen in the SAM specs has used C-style syntax (e.g. the bit-wise operators in the BAM header definition), and the initial SAM/BAM implementation was was written in C, it seems reasonable to default to a C-style interpretation for such things - including things such as char = 1 byte US-ASCII.

There have been multiple attempts (including by me) to define a formal grammar for SAM/VCF, but none of these attempts have even made it into the specifications document. My attempt for VCF just ended up with me identifying ~10 different parsing ambiguities that required some sort of change to the specifications to resolve and stalled getting people to agree on what to do.

Quote:
Originally Posted by Brian Bushnell View Post
Similarly, MD tags are crucial, but they are never defined anywhere. The current sam spec does not even mention the term MD.
That's now defined in the SAM tags specifications document. Any yes, that document still doesn't even actually define it properly.

Quote:
Originally Posted by Brian Bushnell View Post
1) If people tend to feel like the sam spec disallows spaces in reference names, I will probably change BBMap's default to truncate at the first whitespace, so lossless mode will require the "trd=f" flag. It's never been a problem in the past, though, and samtools obviously does handle spaces in reference names. Since the sam spec is not very informative, I tend to feel like samtools is the definitive answer on what is or is not sam compliant; by that metric, currently, spaces in names are compliant.
Ideally, everyone's programs would be able to accept as much invalid input as possible and still function. Unfortunately, it's usually downstream of the alignment/samtools that everything blows up. Even more unfortunate is the fact that the VCF file format can't even unambiguously handle all the characters that are actually allowed by the SAM regex.

Quote:
Originally Posted by Brian Bushnell View Post
There's no problem, I just don't like to discard information when I don't need to. If an input sequence is named "chr1 hg19" I don't see any reason to rename it, as long as downstream tools can handle spaces, which they generally can.
Clearly you are more optimistic about the robustness of bioinformatics software than I am
dcameron is offline   Reply With Quote
Old 03-27-2017, 04:49 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,346
Default

I wanted to provide follow-up. The problem did turn out to be due to spaces in the fasta headers present in iGenomes GRCh38 genome fasta.

As suggested by Brian, after truncating the reference names after first space (trd=t option for bbmap.sh) the problem went away. featureCounts ran and produced counts.

Origins of the problem lie in the iGenomes bundle that I had downloaded (and then built BBMap index) for GRCh38. Instead of simple fasta ID’s there are long strings with spaces in the chromosome names which caused featureCounts to produce an error.

Based on the discussion above it appears that the “standard” for SAM is vague and there may be no reasonable solution.

Prompt assistance provided by @Brian, @Wei and @Yangliao is much appreciated. Bioinformaticians/developers like them who stand behind their software provide an incredible service for us all.

Last edited by GenoMax; 03-27-2017 at 04:53 AM.
GenoMax is offline   Reply With Quote
Old 03-27-2017, 03:47 PM   #11
dcameron
Member
 
Location: Australia

Join Date: Mar 2013
Posts: 25
Default

Quote:
Originally Posted by GenoMax View Post
Based on the discussion above it appears that the “standard” for SAM is vague and there may be no reasonable solution.
I've raised an issue with the SAM specs here to explicitly state that the regexes are POSIX basic regular expressions and the encoding is US-ASCII.

As to whether fasta/fastq information after a space is part of the sequence name or is considered a comment is outside the scope of the SAM specifications. Unfortunately, there's not even a FASTA specification document to argue over. I'm inclined to break at any whitespace but I suspect that's due to my familiarity with myriad downstream complications when doing SV calling in VCF breakend notation. Did you know "C]<=:-0>]" is a perfectly valid ALT allele?
dcameron is offline   Reply With Quote
Old 03-27-2017, 04:22 PM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,346
Default

Quote:
Originally Posted by dcameron View Post
Unfortunately, there's not even a FASTA specification document to argue over.
Slightly off-topic but Bill Pearson offered a historical perspective on FASTA format in a thread he participated in a few months back on Biostars.
GenoMax is offline   Reply With Quote
Old 03-28-2017, 01:01 AM   #13
jkbonfield
Senior Member
 
Location: Cambridge, UK

Join Date: Jul 2008
Posts: 143
Default

Quote:
Originally Posted by GenoMax View Post
I wanted to provide follow-up. The problem did turn out to be due to spaces in the fasta headers present in iGenomes GRCh38 genome fasta.
I would say the issues wasn't the spaces in the fasta headers but the program that interpreted the headers without any room for the usual ">name comment" syntax. There has never been, as far as I'm aware, any assumption that the fasta name after the first whitespace (or semicolon I think in early days) was to be used as part of the name itself.
jkbonfield is offline   Reply With Quote
Reply

Tags
featurecounts, subread

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:07 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO