Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BBMap, htseq-count, and reference sequence names

    After performing the alignment with bbmap.sh, the resulting SAM file contains both the sequence identifier and the description of each sequence from the FASTA file in the RNAME (reference sequence name) field. htseq-count does not seem to be able to handle these and annotates all reads as '__no_feature'.

    Am I missing something here?

    Other than "brute force", is there a tool to fix the SAM files?

    Again, thanks for any pointers and feedback!

  • #2
    Can you post a couple of lines of that SAM file?

    Also give featureCounts a try as an alternative to htseq-count. It is fast.

    Comment


    • #3
      Here is a "broken" SAM file:
      Code:
      @HD     VN:1.3  SO:coordinate
      @SQ     SN:NC_022272.1 Macaca fascicularis chromosome 1, Macaca_fascicularis_5.0, whole genome shotgun sequence LN:227556264
      NB500958:24:HYTTJBGXX:1:13311:22248:4566 1:N:0:CTCTCG   0       NC_022272.1 Macaca fascicularis chromosome 1, Macaca_fascicularis_5.0, whole genome shotgun sequence    44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG        EEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1
      NB500958:24:HYTTJBGXX:3:21404:5993:2338 1:N:0:CTCTCG    0       NC_022272.1 Macaca fascicularis chromosome 1, Macaca_fascicularis_5.0, whole genome shotgun sequence    44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG        EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1
      NB500958:24:HYTTJBGXX:3:21406:25254:3292 1:N:0:CTCTCG   0       NC_022272.1 Macaca fascicularis chromosome 1, Macaca_fascicularis_5.0, whole genome shotgun sequence    44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG        EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1
      and here is the corresponding "fixed" file:
      Code:
      @HD     VN:1.3  SO:coordinate
      @SQ     SN:NC_022272.1 Macaca fascicularis chromosome 1, Macaca_fascicularis_5.0, whole genome shotgun sequence LN:227556264
      NB500958:24:HYTTJBGXX:1:13311:22248:4566 1:N:0:CTCTCG   0       NC_022272.1     44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG
              EEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1
      NB500958:24:HYTTJBGXX:3:21404:5993:2338 1:N:0:CTCTCG    0       NC_022272.1     44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG
              EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1
      NB500958:24:HYTTJBGXX:3:21406:25254:3292 1:N:0:CTCTCG   0       NC_022272.1     44298   43      64M     *       0       0       TAATCCAGTGTCCAACCATATCAAAGCGGCTCTCAGTCTCCAACCGCCTGCTTCGCCTTCCTTG
              EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE        NM:i:0  AM:i:43 XM:i:1  NH:i:1

      Comment


      • #4
        I assume that is how your fasta header was formatted in the genome file?

        Comment


        • #5
          Originally posted by GenoMax View Post
          I assume that is how your fasta header was formatted in the genome file?
          Yes. Seems like I am running into a problem with the description from the FASTA file with featureCounts as well:
          Code:
          Assertion failed: (l_name < MAX_CHROMOSOME_NAME_LEN), function process_pairer_header, file readSummary.c, line 1509.
          This was tripped by the description in the SAM header... probably best to fix the FASTA file and remove all descriptions, no?

          Comment


          • #6
            That would be the clean way of doing it but if it took a while to do the alignment then trying to edit the SAM file may be an option. You will have to fix the genome fasta file so the ID matches there for subsequent steps.

            Comment


            • #7
              Originally posted by GenoMax View Post
              That would be the clean way of doing it but if it took a while to do the alignment then trying to edit the SAM file may be an option. You will have to fix the genome fasta file so the ID matches there for subsequent steps.
              Live and learn

              For now, I fixed the alignment files and increased MAX_CHROMOSOME_NAME_LEN to 250, which made featureCounts happy.

              Thanks again for your help!

              Comment


              • #8
                It looks like the difference between the "broken" and "fixed" files is that the "broken" files allow whitespace in the sequence names, while the "fixed" ones truncate it. BBMap is capable of truncating sequence names at the first whitespace if you add the "trd" flag (short for "trimreaddescriptions"). Since so many sequence names contain spaces (such as the sequence in your input), I consider it bad practice to write a tool that won't process them by default.

                Comment


                • #9
                  Originally posted by Brian Bushnell View Post
                  It looks like the difference between the "broken" and "fixed" files is that the "broken" files allow whitespace in the sequence names, while the "fixed" ones truncate it. BBMap is capable of truncating sequence names at the first whitespace if you add the "trd" flag (short for "trimreaddescriptions"). Since so many sequence names contain spaces (such as the sequence in your input), I consider it bad practice to write a tool that won't process them by default.
                  As per the FASTA format definition, sequence names do not contain white spaces. Rather, anything following a white space is considered a description and is typically treated as such by programs reading FASTA files. While I think it is useful to have an option to force a program to do so, IMHO it should be the default.

                  Comment


                  • #10
                    Originally posted by sunkid View Post
                    As per the FASTA format definition, sequence names do not contain white spaces. Rather, anything following a white space is considered a description and is typically treated as such by programs reading FASTA files. While I think it is useful to have an option to force a program to do so, IMHO it should be the default.
                    One of my co-workers told me that, which is why I added that option and named it the way I did ("truncate read descriptions"). However, in my opinion, Fasta does not have a formal format. If it did have a format, there would be some person or agency in charge... and, who is that? Well... nobody. So, like all loose languages, it's defined by usage. And sequences very often contain spaces in their names, but never tabs. For example, "E.coli strain 1234 version B" is a perfectly valid identifier. I would prefer "E.coli_strain_1234_version_B", but that's not how biologists typically name things.

                    It is occasionally the job of programmers to impose order on a disordered world. That's pretty rare. Usually, it is the job of programmers to maintain compatibility with a disordered world.

                    IMO, SAM is a tab-delimited format. Each row ends with a newline, and each field ends with a tab. It's very simple. There are many programs incapable of processing this format, and from what you've said, it seems like htseq-count is one of them.

                    Unfortunately, the sam format specification does not actually state any of these things, and does not fully define most of its fields, which is why we are arguing. So - I will continue to maintain BBTools as compatible with my understanding of the standards.

                    Comment


                    • #11
                      It's always good to check one's assumptions and having worked with and written software for the FASTA format myself for over 25 years now (mostly in the protein sequence world), I had another look at the authoritative source (authority over the format being granted to the inventors of it, not some agency). Accordingly:
                      FASTA format files consist of a description line, beginning with a ’>’ character, followed by the sequence itself:

                      Code:
                           >sequence name and description 1
                           A F A S Y T .... actual sequence.
                           F S S       .... second line of sequence.
                           >sequence name and description 2
                           PMILTYV ... sequence 2
                      All of the characters of the description line are read, and special characters can be used to indicate additional information about the sequence. In general, non-amino-acid/non-nucleotide sequences in the sequence lines are ignored.

                      FASTA format files from major sequence distributors, like the NCBI and EBI, have specially formatted description lines, e.g.:
                      Code:
                           >gi|54321|ref|np 12345| example NCBI refused sequence
                      or
                      Code:
                           >sw:gstm1 human P01234 glutathione transferase GSTM1 - human
                      [...] You can build your own library by concatenating several sequence files. Just be sure that each sequence is preceded by a line beginning with a ’>’ followed by a sequence name/description.
                      So, indeed, there is a format specification, albeit a very simple one, which is what we like about FASTA, right!? That said, the specification does not say anything about the formatting required of the "sequence name/description" line and what I had assumed to be a rule (sequence name first, no white space; the remainder of the line is a description) is merely a convention adhered to by almost all of the software that I have used. In fact, some of those applications explicitly put additional restrictions on the FASTA format (e.g. the name must be unique to the file).

                      Be that as it may, the SAM format specification does include the TAB-delimited nature of the fields. htseq-count seems fully capable of parsing this format correctly, but apparently expects a string not containing any white space in the RNAME field. As to why or what for, I have no idea and am much to busy to dig into that code. featureCounts IS much faster and the results compare well between the two tools.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM
                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      27 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      31 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      27 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      52 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X