Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • High Percentage of Soft Clipping, BWA-MEM, MiSeq

    Hi everyone,

    I have 1,377,239 paired reads, 250bp each, 82,486 of them have been soft clipped for more than 200bp (eg a read with CIGAR string like 150S30M70S), is this normal?

    These data come from a small targeted sequencing project (~160Kb region). I checked my exome data set, none of the 50 millions reads show >200bp soft clipping. I kind of guess this could be due to contamination, but am not sure since I'm not the library preparation guy.

    command for counting soft clipping reads
    Code:
    perl -ne '@f=split /\t/; @a= $f[5]=~/(\d+)S/g; print "$_\n" if $a[0]+$a[1]>200' 1.1.sort.sam | wc -l
    aligner:
    BWA-MEM 0.7.4-r385

    sequencer:
    MiSeq

    Thanks a lot!

  • #2
    ps I think the adapter has been removed, otherwise there will be much more soft clipped reads. Also the soft-clipping length shouldn't be that long if it is caused by adapter.

    Comment


    • #3
      Have you looked at the quality score distribution of the reads? Was that normal?

      Comment


      • #4
        Originally posted by GenoMax View Post
        Have you looked at the quality score distribution of the reads? Was that normal?
        Overall distribution of quality score looks okay to me
        Click image for larger version

Name:	per_base_quality.png
Views:	1
Size:	10.2 KB
ID:	304429

        The distribution of quality for the reads with >200 soft clipping does look a little weird, though.
        Click image for larger version

Name:	per_base_quality2.png
Views:	1
Size:	8.9 KB
ID:	304430

        Comment


        • #5
          Did this run have a phiX spike? What was the % of the spike, if it was there? Was this a V2 or V3 chemistry run and what was the cluster concentration?

          It is possible that because of the small target region you have a large number of reads with similar (same) sequence (what was the average fold coverage?). If this was the case (and if the flowcell was loaded with high concentration of library) then the read qualities likely suffered because of overall reduced complexity of sequence across the clusters.

          Have you looked at the soft clipped reads ignoring the quality values to see if they align well across the length? They actually may.

          Comment


          • #6
            Originally posted by GenoMax View Post
            Did this run have a phiX spike? What was the % of the spike, if it was there? Was this a V2 or V3 chemistry run and what was the cluster concentration?

            It is possible that because of the small target region you have a large number of reads with similar (same) sequence (what was the average fold coverage?). If this was the case (and if the flowcell was loaded with high concentration of library) then the read qualities likely suffered because of overall reduced complexity of sequence across the clusters.

            Have you looked at the soft clipped reads ignoring the quality values to see if they align well across the length? They actually may.
            Thanks for your reply.

            1. I'm checking the sequencing chemistry, we used phiX spike and V2 chemistry.

            2. low complexity is an issue here, the average coverage is about 1200x, and we put multiple samples onto one lane. For another data set, we put even more (48) samples onto one MiSeq lane, and observed higher percentage of reads with >200 soft clipped bases. Do you think it's major reason?

            3. I converted FASTQ to FASTA and did mapping again, this time, more reads (149,574) have >200 soft clipped bases.
            Last edited by logicthief; 02-10-2014, 12:03 AM.

            Comment


            • #7
              Originally posted by logicthief View Post
              2. low complexity is an issue here, the average coverage is about 1200x, and we put multiple samples onto one lane. For another data set, we put even more (48) samples onto one MiSeq lane, and observed higher percentage of reads with >200 soft clipped bases. Do you think it's major reason?
              If the multiple samples are from the same region, that is not going to help with the low nucleotide diversity problem. You may want to consider deliberately under-loading the library and/or adding a phiX spike, if you happen to re-run. Are these amplicons or pull-down libraries?
              3. I converted FASTQ to FASTA and did mapping again, this time, more reads (149,574) have >200 soft clipped bases.
              Can you blat the soft clipped reads against the reference to check that there is nothing strange going on (chimeras)?
              Last edited by GenoMax; 02-09-2014, 03:55 PM.

              Comment


              • #8
                Hi GenoMax,

                Thanks for your reply.

                Our library is prepared from amplicon.

                I ran blat with default settings on the reads with more than 200 soft clipped reads. The first 18 columns of output are summarized by R. It seems the mapping rate is relatively low in these reads, any clue?

                1.matches - Number of matching bases that aren't repeats.
                2.misMatches - Number of bases that don't match.
                3.repMatches - Number of matching bases that are part of repeats.
                4.nCount - Number of 'N' bases.
                5.qNumInsert - Number of inserts in query.
                6.qBaseInsert - Number of bases inserted into query.
                7.tNumInsert - Number of inserts in target.
                8.tBaseInsert - Number of bases inserted into target.
                9.strand - defined as + (forward) or - (reverse) for query strand. In mouse, a second '+' or '-' indecates genomic strand.
                10.qName - Query sequence name.
                11.qSize - Query sequence size.
                12.qStart - Alignment start position in query.
                13.qEnd - Alignment end position in query.
                14.tName - Target sequence name.
                15.tSize - Target sequence size.
                16.tStart - Alignment start position in query.
                17.tEnd - Alignment end position in query.
                18.blockCount - Number of blocks in the alignment.

                HTML Code:
                       V1               V2               V3          V4
                 Min.   : 30.00   Min.   : 0.000   Min.   :0   Min.   :  0.00000
                 1st Qu.: 54.00   1st Qu.: 2.000   1st Qu.:0   1st Qu.:  0.00000
                 Median : 86.00   Median : 5.000   Median :0   Median :  0.00000
                 Mean   : 95.91   Mean   : 6.035   Mean   :0   Mean   :  0.01775
                 3rd Qu.:126.00   3rd Qu.: 9.000   3rd Qu.:0   3rd Qu.:  0.00000
                 Max.   :272.00   Max.   :25.000   Max.   :0   Max.   :186.00000
                
                       V5               V6               V7               V8
                 Min.   :0.0000   Min.   :-40.00   Min.   : 0.000   Min.   :      0
                 1st Qu.:0.0000   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:      0
                 Median :1.0000   Median :  1.00   Median : 1.000   Median :      2
                 Mean   :0.7156   Mean   : 12.89   Mean   : 1.015   Mean   :   4387
                 3rd Qu.:1.0000   3rd Qu.: 12.00   3rd Qu.: 2.000   3rd Qu.:     37
                 Max.   :9.0000   Max.   :211.00   Max.   :18.000   Max.   :1292032
                
                 V9                                                   V10
                 -:888733   M01558:18:000000000-A4GLM:1:1104:22686:11369:   2178
                 +:889831   M01558:18:000000000-A4GLM:1:1103:13877:23360:   1731
                            M01558:18:000000000-A4GLM:1:2106:9363:8478  :   1722
                            M01558:18:000000000-A4GLM:1:2110:21183:12995:   1689
                            M01558:18:000000000-A4GLM:1:2105:27871:14537:   1600
                            M01558:18:000000000-A4GLM:1:1104:18639:7680 :   1530
                            (Other)                                     :1768114
                      V11             V12              V13             V14
                 Min.   :231.0   Min.   :  0.00   Min.   : 30.0   1      : 121115
                 1st Qu.:252.0   1st Qu.:  5.00   1st Qu.:121.0   2      : 105838
                 Median :253.0   Median : 38.00   Median :189.0   17     : 103302
                 Mean   :252.4   Mean   : 61.82   Mean   :176.7   7      :  96979
                 3rd Qu.:253.0   3rd Qu.:110.00   3rd Qu.:244.0   3      :  93220
                 Max.   :253.0   Max.   :223.00   Max.   :253.0   12     :  86681
                                                                  (Other):1171429
                      V15                 V16                 V17                 V18
                 Min.   :     4262   Min.   :        2   Min.   :      122   Min.   : 1.000
                 1st Qu.: 90354753   1st Qu.: 30317200   1st Qu.: 30319283   1st Qu.: 1.000
                 Median :141213431   Median : 56869278   Median : 56875718   Median : 2.000
                 Mean   :142625131   Mean   : 70776742   Mean   : 70781231   Mean   : 2.191
                 3rd Qu.:180915260   3rd Qu.:102962065   3rd Qu.:102965836   3rd Qu.: 3.000
                 Max.   :249250621   Max.   :249234810   Max.   :249234945   Max.   :21.000

                Comment


                • #9
                  I have found this problem. I ran BLAT on the segments of reads which are soft-clipped and they map perfectly or almost perfectly to another chromosome.

                  An example is

                  TAGTGGAAATTCTTGGCCGCTTCTGAGCTGCTGATGGCAGTGTCTTGAGGGAGAGGCCTGGGAGAGGTTTCGTTATAACTGTGCAGAAAGAGTGCACAGTG

                  which has 57 matches followed by 44 soft-clipped bases. The 57 matches were mapped to chromosome 1, whereas the 44 other bases map perfectly to chromosome 2.

                  It's not a problem with adapters, but it seems BWA-MEM misses some easy alignments. The few other reads in the same region do not have that mapping characteristic. They map to that region from start to end.

                  Comment


                  • #10
                    For this particular read, the bwa-mem (0.7.10) alignment is:

                    1 0 1 25189065 60 57M44S
                    1 2048 2 169676055 60 51H50M


                    This is a chimeric alignment.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    8 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    8 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    49 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    66 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X