![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
BWA Soft Clipping | Bio.X2Y | Bioinformatics | 11 | 03-09-2015 10:08 AM |
Soft-clipping in Stampy | msalm | Bioinformatics | 0 | 01-23-2014 04:41 AM |
bwasw soft clipping of exact matches | petrie | Bioinformatics | 0 | 11-07-2013 04:17 PM |
bwa mem segfault; bwa bwasw breaks MarkDuplicates | ElenaN | Bioinformatics | 2 | 06-30-2013 11:23 PM |
unmapped vs soft-clipping | CNVboy | Bioinformatics | 2 | 04-08-2012 11:41 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: California Join Date: Jan 2012
Posts: 11
|
![]()
Hi everyone,
I have 1,377,239 paired reads, 250bp each, 82,486 of them have been soft clipped for more than 200bp (eg a read with CIGAR string like 150S30M70S), is this normal? These data come from a small targeted sequencing project (~160Kb region). I checked my exome data set, none of the 50 millions reads show >200bp soft clipping. I kind of guess this could be due to contamination, but am not sure since I'm not the library preparation guy. command for counting soft clipping reads Code:
perl -ne '@f=split /\t/; @a= $f[5]=~/(\d+)S/g; print "$_\n" if $a[0]+$a[1]>200' 1.1.sort.sam | wc -l BWA-MEM 0.7.4-r385 sequencer: MiSeq Thanks a lot! |
![]() |
![]() |
![]() |
#2 |
Member
Location: California Join Date: Jan 2012
Posts: 11
|
![]()
ps I think the adapter has been removed, otherwise there will be much more soft clipped reads. Also the soft-clipping length shouldn't be that long if it is caused by adapter.
|
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,087
|
![]()
Have you looked at the quality score distribution of the reads? Was that normal?
|
![]() |
![]() |
![]() |
#4 | |
Member
Location: California Join Date: Jan 2012
Posts: 11
|
![]() Quote:
per_base_quality.png The distribution of quality for the reads with >200 soft clipping does look a little weird, though. per_base_quality2.png |
|
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,087
|
![]()
Did this run have a phiX spike? What was the % of the spike, if it was there? Was this a V2 or V3 chemistry run and what was the cluster concentration?
It is possible that because of the small target region you have a large number of reads with similar (same) sequence (what was the average fold coverage?). If this was the case (and if the flowcell was loaded with high concentration of library) then the read qualities likely suffered because of overall reduced complexity of sequence across the clusters. Have you looked at the soft clipped reads ignoring the quality values to see if they align well across the length? They actually may. |
![]() |
![]() |
![]() |
#6 | |
Member
Location: California Join Date: Jan 2012
Posts: 11
|
![]() Quote:
1. I'm checking the sequencing chemistry, we used phiX spike and V2 chemistry. 2. low complexity is an issue here, the average coverage is about 1200x, and we put multiple samples onto one lane. For another data set, we put even more (48) samples onto one MiSeq lane, and observed higher percentage of reads with >200 soft clipped bases. Do you think it's major reason? 3. I converted FASTQ to FASTA and did mapping again, this time, more reads (149,574) have >200 soft clipped bases. Last edited by logicthief; 02-10-2014 at 12:03 AM. |
|
![]() |
![]() |
![]() |
#7 | ||
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,087
|
![]() Quote:
Quote:
Last edited by GenoMax; 02-09-2014 at 03:55 PM. |
||
![]() |
![]() |
![]() |
#8 |
Member
Location: California Join Date: Jan 2012
Posts: 11
|
![]()
Hi GenoMax,
Thanks for your reply. Our library is prepared from amplicon. I ran blat with default settings on the reads with more than 200 soft clipped reads. The first 18 columns of output are summarized by R. It seems the mapping rate is relatively low in these reads, any clue? 1.matches - Number of matching bases that aren't repeats. 2.misMatches - Number of bases that don't match. 3.repMatches - Number of matching bases that are part of repeats. 4.nCount - Number of 'N' bases. 5.qNumInsert - Number of inserts in query. 6.qBaseInsert - Number of bases inserted into query. 7.tNumInsert - Number of inserts in target. 8.tBaseInsert - Number of bases inserted into target. 9.strand - defined as + (forward) or - (reverse) for query strand. In mouse, a second '+' or '-' indecates genomic strand. 10.qName - Query sequence name. 11.qSize - Query sequence size. 12.qStart - Alignment start position in query. 13.qEnd - Alignment end position in query. 14.tName - Target sequence name. 15.tSize - Target sequence size. 16.tStart - Alignment start position in query. 17.tEnd - Alignment end position in query. 18.blockCount - Number of blocks in the alignment. HTML Code:
V1 V2 V3 V4 Min. : 30.00 Min. : 0.000 Min. :0 Min. : 0.00000 1st Qu.: 54.00 1st Qu.: 2.000 1st Qu.:0 1st Qu.: 0.00000 Median : 86.00 Median : 5.000 Median :0 Median : 0.00000 Mean : 95.91 Mean : 6.035 Mean :0 Mean : 0.01775 3rd Qu.:126.00 3rd Qu.: 9.000 3rd Qu.:0 3rd Qu.: 0.00000 Max. :272.00 Max. :25.000 Max. :0 Max. :186.00000 V5 V6 V7 V8 Min. :0.0000 Min. :-40.00 Min. : 0.000 Min. : 0 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0 Median :1.0000 Median : 1.00 Median : 1.000 Median : 2 Mean :0.7156 Mean : 12.89 Mean : 1.015 Mean : 4387 3rd Qu.:1.0000 3rd Qu.: 12.00 3rd Qu.: 2.000 3rd Qu.: 37 Max. :9.0000 Max. :211.00 Max. :18.000 Max. :1292032 V9 V10 -:888733 M01558:18:000000000-A4GLM:1:1104:22686:11369: 2178 +:889831 M01558:18:000000000-A4GLM:1:1103:13877:23360: 1731 M01558:18:000000000-A4GLM:1:2106:9363:8478 : 1722 M01558:18:000000000-A4GLM:1:2110:21183:12995: 1689 M01558:18:000000000-A4GLM:1:2105:27871:14537: 1600 M01558:18:000000000-A4GLM:1:1104:18639:7680 : 1530 (Other) :1768114 V11 V12 V13 V14 Min. :231.0 Min. : 0.00 Min. : 30.0 1 : 121115 1st Qu.:252.0 1st Qu.: 5.00 1st Qu.:121.0 2 : 105838 Median :253.0 Median : 38.00 Median :189.0 17 : 103302 Mean :252.4 Mean : 61.82 Mean :176.7 7 : 96979 3rd Qu.:253.0 3rd Qu.:110.00 3rd Qu.:244.0 3 : 93220 Max. :253.0 Max. :223.00 Max. :253.0 12 : 86681 (Other):1171429 V15 V16 V17 V18 Min. : 4262 Min. : 2 Min. : 122 Min. : 1.000 1st Qu.: 90354753 1st Qu.: 30317200 1st Qu.: 30319283 1st Qu.: 1.000 Median :141213431 Median : 56869278 Median : 56875718 Median : 2.000 Mean :142625131 Mean : 70776742 Mean : 70781231 Mean : 2.191 3rd Qu.:180915260 3rd Qu.:102962065 3rd Qu.:102965836 3rd Qu.: 3.000 Max. :249250621 Max. :249234810 Max. :249234945 Max. :21.000 |
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: Sydney, Australia Join Date: Jun 2011
Posts: 166
|
![]()
I have found this problem. I ran BLAT on the segments of reads which are soft-clipped and they map perfectly or almost perfectly to another chromosome.
An example is TAGTGGAAATTCTTGGCCGCTTCTGAGCTGCTGATGGCAGTGTCTTGAGGGAGAGGCCTGGGAGAGGTTTCGTTATAACTGTGCAGAAAGAGTGCACAGTG which has 57 matches followed by 44 soft-clipped bases. The 57 matches were mapped to chromosome 1, whereas the 44 other bases map perfectly to chromosome 2. It's not a problem with adapters, but it seems BWA-MEM misses some easy alignments. The few other reads in the same region do not have that mapping characteristic. They map to that region from start to end. |
![]() |
![]() |
![]() |
#10 |
Senior Member
Location: Boston Join Date: Feb 2008
Posts: 693
|
![]()
For this particular read, the bwa-mem (0.7.10) alignment is:
1 0 1 25189065 60 57M44S 1 2048 2 169676055 60 51H50M This is a chimeric alignment. |
![]() |
![]() |
![]() |
Tags |
bwa-mem, miseq 250bp, soft clipping |
Thread Tools | |
|
|