Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BBmap generates Bam files with duplications in sequence sequence names

    After running BBmap with the command like this:
    Code:
    java -ea -Xmx30g align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx30g in=8601_RG15-S1_IndexD706-D508_GAATTCGT+GTACTGAC_L007_R1_001.fastq.gz out=8601_RG15-S1_IndexD706-D508_GAATTCGT+GTACTGAC_L007_R1_001.bam qtrim=t usequality=t minaveragequality=0 local=f strictmaxindel=f xstag=us maxindel=100000 intronlen=10 ambig=toss threads=8
    I get Bam files with duplication in the SN fields, like this:
    Code:
    mac00179:fastq lecka48$ samtools view -H 8601_RG15-S1_IndexD706-D508_GAATTCGT+GTACTGAC_L007_R1_001.bam | head -30
    @HD	VN:1.4	SO:unsorted
    @SQ	SN:chr1 1	LN:195471971
    @SQ	SN:chr2 2	LN:182113224
    @SQ	SN:chr3 3	LN:160039680
    @SQ	SN:chr4 4	LN:156508116
    @SQ	SN:chr5 5	LN:151834684
    @SQ	SN:chr6 6	LN:149736546
    @SQ	SN:chr7 7	LN:145441459
    @SQ	SN:chr8 8	LN:129401213
    @SQ	SN:chr9 9	LN:124595110
    @SQ	SN:chr10 10	LN:130694993
    @SQ	SN:chr11 11	LN:122082543
    @SQ	SN:chr12 12	LN:120129022
    @SQ	SN:chr13 13	LN:120421639
    @SQ	SN:chr14 14	LN:124902244
    @SQ	SN:chr15 15	LN:104043685
    @SQ	SN:chr16 16	LN:98207768
    @SQ	SN:chr17 17	LN:94987271
    @SQ	SN:chr18 18	LN:90702639
    @SQ	SN:chr19 19	LN:61431566
    @SQ	SN:chrX X	LN:171031299
    @SQ	SN:chrY Y	LN:91744698
    @SQ	SN:chrM MT	LN:16299
    @SQ	SN:GL456210.1 GL456210.1	LN:169725
    @SQ	SN:GL456211.1 GL456211.1	LN:241735
    @SQ	SN:GL456212.1 GL456212.1	LN:153618
    @SQ	SN:GL456213.1 GL456213.1	LN:39340
    @SQ	SN:GL456216.1 GL456216.1	LN:66673
    @SQ	SN:GL456219.1 GL456219.1	LN:175968
    @SQ	SN:GL456221.1 GL456221.1	LN:206961
    I can get rid of them using samtools reheader, but I'm wondering what is the reason for those duplications. Any ideas welcome.

  • #2
    Show us the headers in your reference file.
    Code:
     grep "^>" your_ref.fa
    BTW: I don't see a "path=" or "ref=" directive in your map command above.

    Comment


    • #3
      This is exactly how they are in the reference, silly me. But is there some hidden method in this...?

      $ grep "^>" GRCm38.primary_assembly.genome.fa | head -20
      >chr1 1
      >chr2 2
      >chr3 3
      >chr4 4
      >chr5 5
      >chr6 6
      >chr7 7
      >chr8 8
      >chr9 9
      >chr10 10

      Ad BTW: My reference is in the same directory, I assumed I don't need the "ref=" directive in such a case.

      Comment


      • #4
        Since you have those numbers duplicated in your reference file fasta headers those are showing up in your alignments. BBMap is one of few aligners that passes along the entire string (including spaces) in fasta header from your reference. Other aligners silently drop all characters past first space in names.

        You could choose to omit all characters after first space by using option "trd=t" (Truncate read and ref names at the first whitespace) when you align.

        Thanks for the clarification on the index. Looks like that works.

        Comment


        • #5
          Ran into the same problem recently, with over 500 bams to further process, I went the reheader route, but it would seem based on the comment "BBMap is one of few aligners that passes along the entire string (including spaces) in fasta header from your reference." it would make more sense to make the default behavior conform to the "norm". That is at least my opinion.

          Thanks for the trd=t option, I will put that in all my scripts now.

          Comment


          • #6
            The problem is there is no formal definition of fasta header. It can be anything following a ">" sign and that is what makes this problemmatic.

            BBMap is actually being very flexible by allowing you to control what gets chosen. If you choose to truncate the fasta header after the first space then some other programs start having issues with the headers. e.g. genome viewers will think that your fasta reference does not match your BAM. You will need to adjust your reference in that case.

            Comment


            • #7
              It is just something you need to know. I switched my source of fasta at the same time as I started trying bbmap. The aligner seemed much better than the one I was using before and then I ran it on a huge project before I realized that the gencode genome.fa had these ucsc/ensembl names in the fasta headers like that.

              Anyway, really happy with bbmap, so just wanted to make that clear.

              Maybe it could spit out a message if it finds multiple strings with a space in the fasta definitions just so people can be aware of it. It took me a bit of probing (in the wrong places of course at first, iirc, it was featureCounts that failed to deal with it) before I understood where the problem even was.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Recent Innovations in Spatial Biology
                by seqadmin


                Spatial biology is an exciting field that encompasses a wide range of techniques and technologies aimed at mapping the organization and interactions of various biomolecules in their native environments. As this area of research progresses, new tools and methodologies are being introduced, accompanied by efforts to establish benchmarking standards and drive technological innovation.

                3D Genomics
                While spatial biology often involves studying proteins and RNAs in their...
                Yesterday, 07:30 PM
              • seqadmin
                Advancing Precision Medicine for Rare Diseases in Children
                by seqadmin




                Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                12-16-2024, 07:57 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 12-30-2024, 01:35 PM
              0 responses
              21 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-17-2024, 10:28 AM
              0 responses
              41 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-13-2024, 08:24 AM
              0 responses
              55 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 12-12-2024, 07:41 AM
              0 responses
              40 views
              0 likes
              Last Post seqadmin  
              Working...
              X