Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Confident mapping with BWA mem using -A, -C, -L, -U options

    Dear all,

    Until now, I used BWA with default settings. For a new project, I would like to perform a "confident mapping", that is to keep pairs/reads with high probability of correct alignment if the cost is to loose a moderate amount of data.
    Here I use BWA mem version 0.7.12 on paired-ends 75 bp reads. These reads have been trimmed wih Trimmomatic (SLIDINGWINDOW:6:25 LEADING:25 TRAILING:25 AVGQUAL:25 MINLEN:40 + adapters trimming).

    I think 4 parameters could help to focuse on confident mapped reads.
    1. -A: Matching score.
      I wonder if this parameter allows to select reads with mapping >=20.
      I read that read with mapping quality (field 5 in SAM format) are the reads with multiple locations. In a sample that I tested, 1.9% of my reads have field 5 in SAM format=0.
      Does this -A parameter act on the field 5 or is it doing something different?
    2. -c
      Since I want to keep reads with a unique best mapping, should I reduce the number of occurrence of a MEM? Could it help?
    3. -L: clipping penalty
      I don't want to keep read with 10/20 bases aligning somewhere else in the genome. Is there a way to filter those reads?
      Despite reading some discussions on Biostars, I am still not very confortable with the notions of soft and hard clipping...
    4. -U: penalty for unpaired read pair
      Since I won't look for structural variations in this project, I don't want to keep pairs of reads with very abnormal edit distance. I am not sure if the -U parameter could do something for that.


    If you have some experience or advice for me, I would be very happy to hear it!
    Thank you in advance,
    Jane

  • #2
    I consider the BWA parameters to be worth tweaking if that is what is required to get "reasonable" alignments of most reads, because of sequencing technology, library characteristics, etc. If mapping rates are reasonably high, then BWA is already considering several choices during alignment and the mapping-quality value is meant to describe this probability you are seeking.

    With the older aln/sampe pipeline, I wanted 'stringent' mappings like you are seeking, so I restricted seed mismatches, but I soon realised this is a biased view of stringency (and reliance on a fixed seed creates biased mapping, one of the advantages of mem). As you have already found by considering just these options, when attempting to filter during mapping by tweaking parameter values, there are so many different cases that could potentially occur, that you cannot address them all, and the parameter changes are likely to make unusual modifications to mapping that introduce further complications.

    Note that the presence of clips does not necessarily mean a read is poorly mapped.

    If you only want uniquely-mapping reads with high mapping quality, that are properly paired with all mates mapped to the same scaffold, then standard samtools post-alignment filtering on mapping quality and on SAM flags should serve you well.
    Last edited by dgscofield; 12-15-2015, 05:05 AM.

    Comment


    • #3
      Thank you for your answer.
      If I do not change the alignment parameters, I can do post-alignment filtering as you suggested.
      1. Mapping score
        With samtools view -q 20
      2. "Multireads"
        Removal of reads with AS=XS (for those having mapping score>=20)
        Example in one of my samples:
        Code:
        K00103:19:H2MFLBBXX:4:1101:3953:1598	83	chr9	67316168	24	29S46M	=	67316086	-128	GAAACATCCTTGTGAGGTGTGCACTGAAGTCACAGAGTTGAAACTGTCT
        TTTGATTCAGCAGTTTTGAATCTCTC	KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFAAA	NM:i:2	MD:Z:14A1A29	AS:i:36	XS:i:36	RG:Z:WES
        Removal of all reads with "AS:i:X"=="XS:i:X" (command line)
      3. Clipped reads
        From what I understand, clipped reads come either from structural variations or from chimeric reads.
        Since I won't study structural variations for the moment, I wonder if I can remove them all.
        Removal of all reads containing "S" or "H" in field 6 of SAM file. (command line)
      4. Pairs aligned on different locations
        Removal of reads with pair on different locations in field 7 of SAM file. (command line)
        Example in one of my sample:
        Code:
        [merlevede@login01 Sample_X]$ cat file.sam | cut -f 7 | sort | uniq -c
             27 
        219456370 =
          85812 *
         116175 chr1
          80850 chr10
          73198 chr11
          74248 chr12
          37966 chr13
          47290 chr14
          47084 chr15
          52214 chr16
          55936 chr17
          31242 chr18
          50031 chr19
         128166 chr2
          32914 chr20
          20887 chr21
          22450 chr22
          96852 chr3
          86947 chr4
          81397 chr5
          80578 chr6
          78095 chr7
          65246 chr8
          63369 chr9
           3414 chrM
          39836 chrX
          23048 chrY
        Keep only reads with "=" in field 7 of SAM file.


      What do you think about this strategy?
      Jane
      Last edited by Jane M; 12-18-2015, 01:42 AM.

      Comment


      • #4
        Seems reasonable if stringency is what you are after. You might be able to also make use of samtools view flags that include (-f) or exclude (-F) certain SAM flags. It's convenient to use https://broadinstitute.github.io/pic...ain-flags.html to figure out which values to use.

        Clipped reads can also occur with biological causes, like indels, or from sequencing, e.g., incomplete adapter removal.

        Comment


        • #5
          Thank you for your answer.

          I ran some tests to check the percentage of reads with:
          - MAPQ<20: 2.5-2.7%
          - either soft or hard clipping: 0.50-0.64%
          - soft clipping: 0.49-0.62%
          - hard clipping: 0.008-0.01%
          - insertion: 0.18-0.29%
          - deletion: 0.22-0.31%

          The mapping quality seems the most important parameter. I am testing some lower MAPQ scores and if the proportion of soft and hard clipped reads diminish when removing low MAPQ reads.

          Comment


          • #6
            So that shows characteristics of reads left after removing MAPQ<20? To me those look like reasonable ranges likely arising from biological and/or sequencing causes. As you decrease MAPQ I would expect relative proportions (e.g., in vs del, soft vs hard) to remain about the same.

            To dig a bit deeper into clipping if you are particularly worried about that, I would check the distribution of clip lengths. I would expect them to be quite short, since this is after trimmomatic.

            Comment


            • #7
              Originally posted by dgscofield View Post
              So that shows characteristics of reads left after removing MAPQ<20?
              That are the characteristics of the "raw" SAM file obtained after default alignment.
              Currently I am removing reads with MAPQ<20 to check the other characteristics (clipping, indel and mapping). I should have the results by tomorrow for all my samples, I will let you know in case someone else is interested.

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM
              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              30 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              32 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              28 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              53 views
              0 likes
              Last Post seqadmin  
              Working...
              X