Seqanswers Leaderboard Ad

**GenoMax** · 03-17-2016, 02:05 PM

Can you post a couple of lines of that SAM file?

Also give featureCounts a try as an alternative to htseq-count. It is fast.

**sunkid** · 03-17-2016, 02:12 PM

Here is a "broken" SAM file:

Code:

@HD     VN:1.3  SO:coordinate
@SQ     SN:NC_022272.1 Macaca fascicularis chromosome 1, Macaca_fascicularis_5.0, whole genome shotgun sequence LN:227556264
NB500958:24:HYTTJBGXX:1:13311:22248:4566 1:N:0:CTCTCG   0       NC_022272.1 Macaca fascicularis chromosome 1, Macaca_fascicularis_5.0, whole genome shotgun sequence    44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG        EEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1
NB500958:24:HYTTJBGXX:3:21404:5993:2338 1:N:0:CTCTCG    0       NC_022272.1 Macaca fascicularis chromosome 1, Macaca_fascicularis_5.0, whole genome shotgun sequence    44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG        EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1
NB500958:24:HYTTJBGXX:3:21406:25254:3292 1:N:0:CTCTCG   0       NC_022272.1 Macaca fascicularis chromosome 1, Macaca_fascicularis_5.0, whole genome shotgun sequence    44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG        EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1

and here is the corresponding "fixed" file:

Code:

@HD     VN:1.3  SO:coordinate
@SQ     SN:NC_022272.1 Macaca fascicularis chromosome 1, Macaca_fascicularis_5.0, whole genome shotgun sequence LN:227556264
NB500958:24:HYTTJBGXX:1:13311:22248:4566 1:N:0:CTCTCG   0       NC_022272.1     44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG
        EEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1
NB500958:24:HYTTJBGXX:3:21404:5993:2338 1:N:0:CTCTCG    0       NC_022272.1     44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG
        EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1
NB500958:24:HYTTJBGXX:3:21406:25254:3292 1:N:0:CTCTCG   0       NC_022272.1     44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG
        EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1

**GenoMax** · 03-17-2016, 02:21 PM

I assume that is how your fasta header was formatted in the genome file?

**sunkid** · 03-17-2016, 02:55 PM

Originally posted by GenoMax View Post

I assume that is how your fasta header was formatted in the genome file?

Yes. Seems like I am running into a problem with the description from the FASTA file with featureCounts as well:

Code:

Assertion failed: (l_name < MAX_CHROMOSOME_NAME_LEN), function process_pairer_header, file readSummary.c, line 1509.

This was tripped by the description in the SAM header... probably best to fix the FASTA file and remove all descriptions, no?

**GenoMax** · 03-17-2016, 03:00 PM

That would be the clean way of doing it but if it took a while to do the alignment then trying to edit the SAM file may be an option. You will have to fix the genome fasta file so the ID matches there for subsequent steps.

**sunkid** · 03-17-2016, 03:04 PM

Originally posted by GenoMax View Post

That would be the clean way of doing it but if it took a while to do the alignment then trying to edit the SAM file may be an option. You will have to fix the genome fasta file so the ID matches there for subsequent steps.

Live and learn

For now, I fixed the alignment files and increased MAX_CHROMOSOME_NAME_LEN to 250, which made featureCounts happy.

Thanks again for your help!

**Brian Bushnell** · 03-17-2016, 08:31 PM

It looks like the difference between the "broken" and "fixed" files is that the "broken" files allow whitespace in the sequence names, while the "fixed" ones truncate it. BBMap is capable of truncating sequence names at the first whitespace if you add the "trd" flag (short for "trimreaddescriptions"). Since so many sequence names contain spaces (such as the sequence in your input), I consider it bad practice to write a tool that won't process them by default.

**sunkid** · 03-17-2016, 09:09 PM

Originally posted by Brian Bushnell View Post

It looks like the difference between the "broken" and "fixed" files is that the "broken" files allow whitespace in the sequence names, while the "fixed" ones truncate it. BBMap is capable of truncating sequence names at the first whitespace if you add the "trd" flag (short for "trimreaddescriptions"). Since so many sequence names contain spaces (such as the sequence in your input), I consider it bad practice to write a tool that won't process them by default.

As per the FASTA format definition, sequence names do not contain white spaces. Rather, anything following a white space is considered a description and is typically treated as such by programs reading FASTA files. While I think it is useful to have an option to force a program to do so, IMHO it should be the default.

**Brian Bushnell** · 03-17-2016, 09:47 PM

Originally posted by sunkid View Post

As per the FASTA format definition, sequence names do not contain white spaces. Rather, anything following a white space is considered a description and is typically treated as such by programs reading FASTA files. While I think it is useful to have an option to force a program to do so, IMHO it should be the default.

One of my co-workers told me that, which is why I added that option and named it the way I did ("truncate read descriptions"). However, in my opinion, Fasta does not have a formal format. If it did have a format, there would be some person or agency in charge... and, who is that? Well... nobody. So, like all loose languages, it's defined by usage. And sequences very often contain spaces in their names, but never tabs. For example, "E.coli strain 1234 version B" is a perfectly valid identifier. I would prefer "E.coli_strain_1234_version_B", but that's not how biologists typically name things.

It is occasionally the job of programmers to impose order on a disordered world. That's pretty rare. Usually, it is the job of programmers to maintain compatibility with a disordered world.

IMO, SAM is a tab-delimited format. Each row ends with a newline, and each field ends with a tab. It's very simple. There are many programs incapable of processing this format, and from what you've said, it seems like htseq-count is one of them.

Unfortunately, the sam format specification does not actually state any of these things, and does not fully define most of its fields, which is why we are arguing. So - I will continue to maintain BBTools as compatible with my understanding of the standards.

**sunkid** · 03-18-2016, 08:20 AM

It's always good to check one's assumptions and having worked with and written software for the FASTA format myself for over 25 years now (mostly in the protein sequence world), I had another look at the authoritative source (authority over the format being granted to the inventors of it, not some agency). Accordingly:

FASTA format files consist of a description line, beginning with a ’>’ character, followed by the sequence itself:

Code:

     >sequence name and description 1
     A F A S Y T .... actual sequence.
     F S S       .... second line of sequence.
     >sequence name and description 2
     PMILTYV ... sequence 2

All of the characters of the description line are read, and special characters can be used to indicate additional information about the sequence. In general, non-amino-acid/non-nucleotide sequences in the sequence lines are ignored.

FASTA format files from major sequence distributors, like the NCBI and EBI, have specially formatted description lines, e.g.:

Code:

     >gi|54321|ref|np 12345| example NCBI refused sequence

or

Code:

     >sw:gstm1 human P01234 glutathione transferase GSTM1 - human

[...] You can build your own library by concatenating several sequence files. Just be sure that each sequence is preceded by a line beginning with a ’>’ followed by a sequence name/description.

So, indeed, there is a format specification, albeit a very simple one, which is what we like about FASTA, right!? That said, the specification does not say anything about the formatting required of the "sequence name/description" line and what I had assumed to be a rule (sequence name first, no white space; the remainder of the line is a description) is merely a convention adhered to by almost all of the software that I have used. In fact, some of those applications explicitly put additional restrictions on the FASTA format (e.g. the name must be unique to the file).

Be that as it may, the SAM format specification does include the TAB-delimited nature of the fields. htseq-count seems fully capable of parsing this format correctly, but apparently expects a string not containing any white space in the RNAME field. As to why or what for, I have no idea and am much to busy to dig into that code. featureCounts IS much faster and the results compare well between the two tools.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 27 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 31 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 27 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

BBMap, htseq-count, and reference sequence names

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News