SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
BWA Soft Clipping Bio.X2Y Bioinformatics 11 03-09-2015 09:08 AM
Soft-clipping in Stampy msalm Bioinformatics 0 01-23-2014 03:41 AM
bwasw soft clipping of exact matches petrie Bioinformatics 0 11-07-2013 03:17 PM
bwa mem segfault; bwa bwasw breaks MarkDuplicates ElenaN Bioinformatics 2 06-30-2013 10:23 PM
unmapped vs soft-clipping CNVboy Bioinformatics 2 04-08-2012 10:41 PM

Reply
 
Thread Tools
Old 02-07-2014, 03:52 PM   #1
logicthief
Member
 
Location: California

Join Date: Jan 2012
Posts: 11
Default High Percentage of Soft Clipping, BWA-MEM, MiSeq

Hi everyone,

I have 1,377,239 paired reads, 250bp each, 82,486 of them have been soft clipped for more than 200bp (eg a read with CIGAR string like 150S30M70S), is this normal?

These data come from a small targeted sequencing project (~160Kb region). I checked my exome data set, none of the 50 millions reads show >200bp soft clipping. I kind of guess this could be due to contamination, but am not sure since I'm not the library preparation guy.

command for counting soft clipping reads
Code:
perl -ne '@f=split /\t/; @a= $f[5]=~/(\d+)S/g; print "$_\n" if $a[0]+$a[1]>200' 1.1.sort.sam | wc -l
aligner:
BWA-MEM 0.7.4-r385

sequencer:
MiSeq

Thanks a lot!
logicthief is offline   Reply With Quote
Old 02-07-2014, 03:55 PM   #2
logicthief
Member
 
Location: California

Join Date: Jan 2012
Posts: 11
Default

ps I think the adapter has been removed, otherwise there will be much more soft clipped reads. Also the soft-clipping length shouldn't be that long if it is caused by adapter.
logicthief is offline   Reply With Quote
Old 02-08-2014, 03:11 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780
Default

Have you looked at the quality score distribution of the reads? Was that normal?
GenoMax is offline   Reply With Quote
Old 02-08-2014, 11:28 AM   #4
logicthief
Member
 
Location: California

Join Date: Jan 2012
Posts: 11
Default

Quote:
Originally Posted by GenoMax View Post
Have you looked at the quality score distribution of the reads? Was that normal?
Overall distribution of quality score looks okay to me
per_base_quality.png

The distribution of quality for the reads with >200 soft clipping does look a little weird, though.
per_base_quality2.png
logicthief is offline   Reply With Quote
Old 02-08-2014, 03:47 PM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780
Default

Did this run have a phiX spike? What was the % of the spike, if it was there? Was this a V2 or V3 chemistry run and what was the cluster concentration?

It is possible that because of the small target region you have a large number of reads with similar (same) sequence (what was the average fold coverage?). If this was the case (and if the flowcell was loaded with high concentration of library) then the read qualities likely suffered because of overall reduced complexity of sequence across the clusters.

Have you looked at the soft clipped reads ignoring the quality values to see if they align well across the length? They actually may.
GenoMax is offline   Reply With Quote
Old 02-09-2014, 12:21 PM   #6
logicthief
Member
 
Location: California

Join Date: Jan 2012
Posts: 11
Default

Quote:
Originally Posted by GenoMax View Post
Did this run have a phiX spike? What was the % of the spike, if it was there? Was this a V2 or V3 chemistry run and what was the cluster concentration?

It is possible that because of the small target region you have a large number of reads with similar (same) sequence (what was the average fold coverage?). If this was the case (and if the flowcell was loaded with high concentration of library) then the read qualities likely suffered because of overall reduced complexity of sequence across the clusters.

Have you looked at the soft clipped reads ignoring the quality values to see if they align well across the length? They actually may.
Thanks for your reply.

1. I'm checking the sequencing chemistry, we used phiX spike and V2 chemistry.

2. low complexity is an issue here, the average coverage is about 1200x, and we put multiple samples onto one lane. For another data set, we put even more (48) samples onto one MiSeq lane, and observed higher percentage of reads with >200 soft clipped bases. Do you think it's major reason?

3. I converted FASTQ to FASTA and did mapping again, this time, more reads (149,574) have >200 soft clipped bases.

Last edited by logicthief; 02-09-2014 at 11:03 PM.
logicthief is offline   Reply With Quote
Old 02-09-2014, 02:43 PM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,780
Default

Quote:
Originally Posted by logicthief View Post
2. low complexity is an issue here, the average coverage is about 1200x, and we put multiple samples onto one lane. For another data set, we put even more (48) samples onto one MiSeq lane, and observed higher percentage of reads with >200 soft clipped bases. Do you think it's major reason?
If the multiple samples are from the same region, that is not going to help with the low nucleotide diversity problem. You may want to consider deliberately under-loading the library and/or adding a phiX spike, if you happen to re-run. Are these amplicons or pull-down libraries?
Quote:
3. I converted FASTQ to FASTA and did mapping again, this time, more reads (149,574) have >200 soft clipped bases.
Can you blat the soft clipped reads against the reference to check that there is nothing strange going on (chimeras)?

Last edited by GenoMax; 02-09-2014 at 02:55 PM.
GenoMax is offline   Reply With Quote
Old 02-11-2014, 09:18 PM   #8
logicthief
Member
 
Location: California

Join Date: Jan 2012
Posts: 11
Default

Hi GenoMax,

Thanks for your reply.

Our library is prepared from amplicon.

I ran blat with default settings on the reads with more than 200 soft clipped reads. The first 18 columns of output are summarized by R. It seems the mapping rate is relatively low in these reads, any clue?

1.matches - Number of matching bases that aren't repeats.
2.misMatches - Number of bases that don't match.
3.repMatches - Number of matching bases that are part of repeats.
4.nCount - Number of 'N' bases.
5.qNumInsert - Number of inserts in query.
6.qBaseInsert - Number of bases inserted into query.
7.tNumInsert - Number of inserts in target.
8.tBaseInsert - Number of bases inserted into target.
9.strand - defined as + (forward) or - (reverse) for query strand. In mouse, a second '+' or '-' indecates genomic strand.
10.qName - Query sequence name.
11.qSize - Query sequence size.
12.qStart - Alignment start position in query.
13.qEnd - Alignment end position in query.
14.tName - Target sequence name.
15.tSize - Target sequence size.
16.tStart - Alignment start position in query.
17.tEnd - Alignment end position in query.
18.blockCount - Number of blocks in the alignment.

HTML Code:
       V1               V2               V3          V4
 Min.   : 30.00   Min.   : 0.000   Min.   :0   Min.   :  0.00000
 1st Qu.: 54.00   1st Qu.: 2.000   1st Qu.:0   1st Qu.:  0.00000
 Median : 86.00   Median : 5.000   Median :0   Median :  0.00000
 Mean   : 95.91   Mean   : 6.035   Mean   :0   Mean   :  0.01775
 3rd Qu.:126.00   3rd Qu.: 9.000   3rd Qu.:0   3rd Qu.:  0.00000
 Max.   :272.00   Max.   :25.000   Max.   :0   Max.   :186.00000

       V5               V6               V7               V8
 Min.   :0.0000   Min.   :-40.00   Min.   : 0.000   Min.   :      0
 1st Qu.:0.0000   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:      0
 Median :1.0000   Median :  1.00   Median : 1.000   Median :      2
 Mean   :0.7156   Mean   : 12.89   Mean   : 1.015   Mean   :   4387
 3rd Qu.:1.0000   3rd Qu.: 12.00   3rd Qu.: 2.000   3rd Qu.:     37
 Max.   :9.0000   Max.   :211.00   Max.   :18.000   Max.   :1292032

 V9                                                   V10
 -:888733   M01558:18:000000000-A4GLM:1:1104:22686:11369:   2178
 +:889831   M01558:18:000000000-A4GLM:1:1103:13877:23360:   1731
            M01558:18:000000000-A4GLM:1:2106:9363:8478  :   1722
            M01558:18:000000000-A4GLM:1:2110:21183:12995:   1689
            M01558:18:000000000-A4GLM:1:2105:27871:14537:   1600
            M01558:18:000000000-A4GLM:1:1104:18639:7680 :   1530
            (Other)                                     :1768114
      V11             V12              V13             V14
 Min.   :231.0   Min.   :  0.00   Min.   : 30.0   1      : 121115
 1st Qu.:252.0   1st Qu.:  5.00   1st Qu.:121.0   2      : 105838
 Median :253.0   Median : 38.00   Median :189.0   17     : 103302
 Mean   :252.4   Mean   : 61.82   Mean   :176.7   7      :  96979
 3rd Qu.:253.0   3rd Qu.:110.00   3rd Qu.:244.0   3      :  93220
 Max.   :253.0   Max.   :223.00   Max.   :253.0   12     :  86681
                                                  (Other):1171429
      V15                 V16                 V17                 V18
 Min.   :     4262   Min.   :        2   Min.   :      122   Min.   : 1.000
 1st Qu.: 90354753   1st Qu.: 30317200   1st Qu.: 30319283   1st Qu.: 1.000
 Median :141213431   Median : 56869278   Median : 56875718   Median : 2.000
 Mean   :142625131   Mean   : 70776742   Mean   : 70781231   Mean   : 2.191
 3rd Qu.:180915260   3rd Qu.:102962065   3rd Qu.:102965836   3rd Qu.: 3.000
 Max.   :249250621   Max.   :249234810   Max.   :249234945   Max.   :21.000
logicthief is offline   Reply With Quote
Old 09-10-2014, 10:00 PM   #9
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 163
Default

I have found this problem. I ran BLAT on the segments of reads which are soft-clipped and they map perfectly or almost perfectly to another chromosome.

An example is

TAGTGGAAATTCTTGGCCGCTTCTGAGCTGCTGATGGCAGTGTCTTGAGGGAGAGGCCTGGGAGAGGTTTCGTTATAACTGTGCAGAAAGAGTGCACAGTG

which has 57 matches followed by 44 soft-clipped bases. The 57 matches were mapped to chromosome 1, whereas the 44 other bases map perfectly to chromosome 2.

It's not a problem with adapters, but it seems BWA-MEM misses some easy alignments. The few other reads in the same region do not have that mapping characteristic. They map to that region from start to end.
Dario1984 is offline   Reply With Quote
Old 09-11-2014, 05:19 PM   #10
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

For this particular read, the bwa-mem (0.7.10) alignment is:

1 0 1 25189065 60 57M44S
1 2048 2 169676055 60 51H50M


This is a chimeric alignment.
lh3 is offline   Reply With Quote
Reply

Tags
bwa-mem, miseq 250bp, soft clipping

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:02 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO