Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Lechu
    Junior Member
    • Aug 2017
    • 4

    BBmap generates Bam files with duplications in sequence sequence names

    After running BBmap with the command like this:
    Code:
    java -ea -Xmx30g align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx30g in=8601_RG15-S1_IndexD706-D508_GAATTCGT+GTACTGAC_L007_R1_001.fastq.gz out=8601_RG15-S1_IndexD706-D508_GAATTCGT+GTACTGAC_L007_R1_001.bam qtrim=t usequality=t minaveragequality=0 local=f strictmaxindel=f xstag=us maxindel=100000 intronlen=10 ambig=toss threads=8
    I get Bam files with duplication in the SN fields, like this:
    Code:
    mac00179:fastq lecka48$ samtools view -H 8601_RG15-S1_IndexD706-D508_GAATTCGT+GTACTGAC_L007_R1_001.bam | head -30
    @HD	VN:1.4	SO:unsorted
    @SQ	SN:chr1 1	LN:195471971
    @SQ	SN:chr2 2	LN:182113224
    @SQ	SN:chr3 3	LN:160039680
    @SQ	SN:chr4 4	LN:156508116
    @SQ	SN:chr5 5	LN:151834684
    @SQ	SN:chr6 6	LN:149736546
    @SQ	SN:chr7 7	LN:145441459
    @SQ	SN:chr8 8	LN:129401213
    @SQ	SN:chr9 9	LN:124595110
    @SQ	SN:chr10 10	LN:130694993
    @SQ	SN:chr11 11	LN:122082543
    @SQ	SN:chr12 12	LN:120129022
    @SQ	SN:chr13 13	LN:120421639
    @SQ	SN:chr14 14	LN:124902244
    @SQ	SN:chr15 15	LN:104043685
    @SQ	SN:chr16 16	LN:98207768
    @SQ	SN:chr17 17	LN:94987271
    @SQ	SN:chr18 18	LN:90702639
    @SQ	SN:chr19 19	LN:61431566
    @SQ	SN:chrX X	LN:171031299
    @SQ	SN:chrY Y	LN:91744698
    @SQ	SN:chrM MT	LN:16299
    @SQ	SN:GL456210.1 GL456210.1	LN:169725
    @SQ	SN:GL456211.1 GL456211.1	LN:241735
    @SQ	SN:GL456212.1 GL456212.1	LN:153618
    @SQ	SN:GL456213.1 GL456213.1	LN:39340
    @SQ	SN:GL456216.1 GL456216.1	LN:66673
    @SQ	SN:GL456219.1 GL456219.1	LN:175968
    @SQ	SN:GL456221.1 GL456221.1	LN:206961
    I can get rid of them using samtools reheader, but I'm wondering what is the reason for those duplications. Any ideas welcome.
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    Show us the headers in your reference file.
    Code:
     grep "^>" your_ref.fa
    BTW: I don't see a "path=" or "ref=" directive in your map command above.

    Comment

    • Lechu
      Junior Member
      • Aug 2017
      • 4

      #3
      This is exactly how they are in the reference, silly me. But is there some hidden method in this...?

      $ grep "^>" GRCm38.primary_assembly.genome.fa | head -20
      >chr1 1
      >chr2 2
      >chr3 3
      >chr4 4
      >chr5 5
      >chr6 6
      >chr7 7
      >chr8 8
      >chr9 9
      >chr10 10

      Ad BTW: My reference is in the same directory, I assumed I don't need the "ref=" directive in such a case.

      Comment

      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        Since you have those numbers duplicated in your reference file fasta headers those are showing up in your alignments. BBMap is one of few aligners that passes along the entire string (including spaces) in fasta header from your reference. Other aligners silently drop all characters past first space in names.

        You could choose to omit all characters after first space by using option "trd=t" (Truncate read and ref names at the first whitespace) when you align.

        Thanks for the clarification on the index. Looks like that works.

        Comment

        • jiaco
          Member
          • May 2010
          • 35

          #5
          Ran into the same problem recently, with over 500 bams to further process, I went the reheader route, but it would seem based on the comment "BBMap is one of few aligners that passes along the entire string (including spaces) in fasta header from your reference." it would make more sense to make the default behavior conform to the "norm". That is at least my opinion.

          Thanks for the trd=t option, I will put that in all my scripts now.

          Comment

          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            The problem is there is no formal definition of fasta header. It can be anything following a ">" sign and that is what makes this problemmatic.

            BBMap is actually being very flexible by allowing you to control what gets chosen. If you choose to truncate the fasta header after the first space then some other programs start having issues with the headers. e.g. genome viewers will think that your fasta reference does not match your BAM. You will need to adjust your reference in that case.

            Comment

            • jiaco
              Member
              • May 2010
              • 35

              #7
              It is just something you need to know. I switched my source of fasta at the same time as I started trying bbmap. The aligner seemed much better than the one I was using before and then I ran it on a huge project before I realized that the gencode genome.fa had these ucsc/ensembl names in the fasta headers like that.

              Anyway, really happy with bbmap, so just wanted to make that clear.

              Maybe it could spit out a message if it finds multiple strings with a space in the fasta definitions just so people can be aware of it. It took me a bit of probing (in the wrong places of course at first, iirc, it was featureCounts that failed to deal with it) before I understood where the problem even was.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Pathogen Surveillance with Advanced Genomic Tools
                by seqadmin




                The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                03-24-2025, 11:48 AM
              • seqadmin
                New Genomics Tools and Methods Shared at AGBT 2025
                by seqadmin


                This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                The Headliner
                The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                03-03-2025, 01:39 PM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 03-20-2025, 05:03 AM
              0 responses
              49 views
              0 reactions
              Last Post seqadmin  
              Started by seqadmin, 03-19-2025, 07:27 AM
              0 responses
              57 views
              0 reactions
              Last Post seqadmin  
              Started by seqadmin, 03-18-2025, 12:50 PM
              0 responses
              50 views
              0 reactions
              Last Post seqadmin  
              Started by seqadmin, 03-03-2025, 01:15 PM
              0 responses
              201 views
              0 reactions
              Last Post seqadmin  
              Working...