Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BBmap generates Bam files with duplications in sequence sequence names

    After running BBmap with the command like this:
    Code:
    java -ea -Xmx30g align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx30g in=8601_RG15-S1_IndexD706-D508_GAATTCGT+GTACTGAC_L007_R1_001.fastq.gz out=8601_RG15-S1_IndexD706-D508_GAATTCGT+GTACTGAC_L007_R1_001.bam qtrim=t usequality=t minaveragequality=0 local=f strictmaxindel=f xstag=us maxindel=100000 intronlen=10 ambig=toss threads=8
    I get Bam files with duplication in the SN fields, like this:
    Code:
    mac00179:fastq lecka48$ samtools view -H 8601_RG15-S1_IndexD706-D508_GAATTCGT+GTACTGAC_L007_R1_001.bam | head -30
    @HD	VN:1.4	SO:unsorted
    @SQ	SN:chr1 1	LN:195471971
    @SQ	SN:chr2 2	LN:182113224
    @SQ	SN:chr3 3	LN:160039680
    @SQ	SN:chr4 4	LN:156508116
    @SQ	SN:chr5 5	LN:151834684
    @SQ	SN:chr6 6	LN:149736546
    @SQ	SN:chr7 7	LN:145441459
    @SQ	SN:chr8 8	LN:129401213
    @SQ	SN:chr9 9	LN:124595110
    @SQ	SN:chr10 10	LN:130694993
    @SQ	SN:chr11 11	LN:122082543
    @SQ	SN:chr12 12	LN:120129022
    @SQ	SN:chr13 13	LN:120421639
    @SQ	SN:chr14 14	LN:124902244
    @SQ	SN:chr15 15	LN:104043685
    @SQ	SN:chr16 16	LN:98207768
    @SQ	SN:chr17 17	LN:94987271
    @SQ	SN:chr18 18	LN:90702639
    @SQ	SN:chr19 19	LN:61431566
    @SQ	SN:chrX X	LN:171031299
    @SQ	SN:chrY Y	LN:91744698
    @SQ	SN:chrM MT	LN:16299
    @SQ	SN:GL456210.1 GL456210.1	LN:169725
    @SQ	SN:GL456211.1 GL456211.1	LN:241735
    @SQ	SN:GL456212.1 GL456212.1	LN:153618
    @SQ	SN:GL456213.1 GL456213.1	LN:39340
    @SQ	SN:GL456216.1 GL456216.1	LN:66673
    @SQ	SN:GL456219.1 GL456219.1	LN:175968
    @SQ	SN:GL456221.1 GL456221.1	LN:206961
    I can get rid of them using samtools reheader, but I'm wondering what is the reason for those duplications. Any ideas welcome.

  • #2
    Show us the headers in your reference file.
    Code:
     grep "^>" your_ref.fa
    BTW: I don't see a "path=" or "ref=" directive in your map command above.

    Comment


    • #3
      This is exactly how they are in the reference, silly me. But is there some hidden method in this...?

      $ grep "^>" GRCm38.primary_assembly.genome.fa | head -20
      >chr1 1
      >chr2 2
      >chr3 3
      >chr4 4
      >chr5 5
      >chr6 6
      >chr7 7
      >chr8 8
      >chr9 9
      >chr10 10

      Ad BTW: My reference is in the same directory, I assumed I don't need the "ref=" directive in such a case.

      Comment


      • #4
        Since you have those numbers duplicated in your reference file fasta headers those are showing up in your alignments. BBMap is one of few aligners that passes along the entire string (including spaces) in fasta header from your reference. Other aligners silently drop all characters past first space in names.

        You could choose to omit all characters after first space by using option "trd=t" (Truncate read and ref names at the first whitespace) when you align.

        Thanks for the clarification on the index. Looks like that works.

        Comment


        • #5
          Ran into the same problem recently, with over 500 bams to further process, I went the reheader route, but it would seem based on the comment "BBMap is one of few aligners that passes along the entire string (including spaces) in fasta header from your reference." it would make more sense to make the default behavior conform to the "norm". That is at least my opinion.

          Thanks for the trd=t option, I will put that in all my scripts now.

          Comment


          • #6
            The problem is there is no formal definition of fasta header. It can be anything following a ">" sign and that is what makes this problemmatic.

            BBMap is actually being very flexible by allowing you to control what gets chosen. If you choose to truncate the fasta header after the first space then some other programs start having issues with the headers. e.g. genome viewers will think that your fasta reference does not match your BAM. You will need to adjust your reference in that case.

            Comment


            • #7
              It is just something you need to know. I switched my source of fasta at the same time as I started trying bbmap. The aligner seemed much better than the one I was using before and then I ran it on a huge project before I realized that the gencode genome.fa had these ucsc/ensembl names in the fasta headers like that.

              Anyway, really happy with bbmap, so just wanted to make that clear.

              Maybe it could spit out a message if it finds multiple strings with a space in the fasta definitions just so people can be aware of it. It took me a bit of probing (in the wrong places of course at first, iirc, it was featureCounts that failed to deal with it) before I understood where the problem even was.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 03-27-2024, 06:37 PM
              0 responses
              13 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-27-2024, 06:07 PM
              0 responses
              11 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              53 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              69 views
              0 likes
              Last Post seqadmin  
              Working...
              X