Old 06-01-2012, 09:55 PM   #1
sonia.bao
Member
 
Location: US

Join Date: May 2012
Posts: 12
Is trimming necessary before mapping SOLiD data with BFAST?

Hi,

I am wondering whether trimming is necessary when using BFAST to align SOLiD reads. BFAST has been pretty powerful at mapping colorspace reads accurately while tolerating mismatches and indels: ~90% of my reads (single-end, 75bp, from genomic DNA) were uniquely mapped with the best alignment score.

However, the base quality drops dramatically at the 3' end of the SOLiD reads. Here are the base quality distributions of the raw reads (csfastq format) and the mapped reads (BAM format):

- Raw reads: I suspect FastQC detects the wrong quality format here - Illumina rather than Sanger - though I'm not sure:
Code:
fastqc $reads.csfastq


- BFAST mapped reads: I think this time it detects the correct quality format - Sanger
Code:
fastqc $reads.mapped.srt.rmdup.bam
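For what it's worth, the usual heuristic for guessing the quality encoding is to look at the range of quality characters actually observed; a minimal sketch of that idea (`guess_phred_offset` is a made-up helper, and the quality strings are invented for illustration):

```python
def guess_phred_offset(quality_strings):
    """Guess the Phred offset from observed quality characters.

    ASCII codes below 59 (';') cannot occur in Phred+64 data, so
    their presence implies Phred+33 (Sanger).  Codes above 74 ('J')
    are outside the usual Phred+33 range and suggest Phred+64.
    """
    lo = min(min(ord(c) for c in q) for q in quality_strings)
    hi = max(max(ord(c) for c in q) for q in quality_strings)
    if lo < 59:
        return 33
    if hi > 74:
        return 64
    return None  # ambiguous: every character falls in the overlap zone

# Sanger-encoded qualities can go as low as '!' (ASCII 33):
print(guess_phred_offset(["IIII!##B", "IIIIHHGF"]))  # -> 33
```

If all the observed characters fall in the overlapping middle range, the guess is genuinely ambiguous, which may explain inconsistent detection between runs.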


If I map without trimming, the low-quality 3' ends of the reads are mapped, but with lots of mismatches. Here is a snapshot of the BAM file in GenomeView (all mismatches are at the 3' ends of the mapped reads):

- green: Read mapped to the forward strand
- blue: Read mapped to the reverse strand


I am debating whether to trim the reads before mapping, for the purpose of accurate multi-sample variant calling, as my preliminary analysis (without realignment or recalibration) using VarScan returned lots of false-positive SNPs located in the noisy mapping regions. On the other hand, I would prefer not to throw out reads that do map (even with mismatches).

Here are the thresholds that I set for variant calling:
Code:
java -Xmx4g -jar VarScan.v2.2.11.jar  mpileup2snp \
$reads.mapped.srt.rmdup.merged.mpileup \
--min-coverage 8 \
--min-reads2 2 \
--min-var-freq 0.2 \
--min-avg-qual 20 \
--p-value 0.01 \
> $reads.snp
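For reference, the per-site logic these thresholds imply can be sketched roughly as follows; this is an illustrative re-implementation, not VarScan's actual code, and `passes_filters` is a made-up helper name:

```python
def passes_filters(depth, alt_reads, avg_qual, p_value,
                   min_coverage=8, min_reads2=2,
                   min_var_freq=0.2, min_avg_qual=20,
                   max_p_value=0.01):
    """Rough sketch of the thresholds in the VarScan command above.

    A candidate SNP is kept only if the site is covered deeply
    enough, the variant allele is supported by enough reads at a
    high enough frequency and base quality, and the call is
    statistically significant.
    """
    var_freq = alt_reads / depth if depth else 0.0
    return (depth >= min_coverage
            and alt_reads >= min_reads2
            and var_freq >= min_var_freq
            and avg_qual >= min_avg_qual
            and p_value <= max_p_value)

# 10x coverage, 3 variant reads (30%), Q25, p=0.001 passes:
print(passes_filters(10, 3, 25, 0.001))  # -> True
# 2 variant reads out of 20 (10% frequency) fails --min-var-freq:
print(passes_filters(20, 2, 25, 0.001))  # -> False
```

Low-quality 3' mismatches mostly hurt through the frequency and quality terms, which is why trimming versus not trimming can shift which sites pass.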
Any help is highly appreciated! Many thanks in advance.

Sonia

Last edited by sonia.bao; 06-01-2012 at 10:21 PM.
Old 06-02-2012, 09:10 AM   #2
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324

You should definitely try a re-aligner like SRMA to clean up the alignments, especially for indel-calling. You could also try a higher cutoff for base quality (e.g. -Q 20) or do some post-alignment filtering (e.g. remove reads with lots of mismatches, low mapping quality or low average quality score) to get more reliable results.
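A sketch of what such a post-alignment filter could look like on raw SAM lines, assuming mismatch counts are stored in the standard NM tag and Phred+33 base qualities; the thresholds and the helper name `keep_read` are illustrative, not a real tool's interface:

```python
def keep_read(sam_line, min_mapq=20, max_mismatches=3, min_avg_qual=20):
    """Decide whether to keep an aligned read, using the three
    filters suggested above: mapping quality (SAM column 5),
    mismatch count (NM tag), and average base quality (SAM
    column 11, assumed Phred+33).
    """
    fields = sam_line.rstrip("\n").split("\t")
    mapq = int(fields[4])
    qual = fields[10]
    avg_qual = sum(ord(c) - 33 for c in qual) / len(qual)
    nm = 0
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            nm = int(tag[5:])
            break
    return (mapq >= min_mapq
            and nm <= max_mismatches
            and avg_qual >= min_avg_qual)

# A MAPQ 37 read with one mismatch and uniformly high qualities is kept:
rec = "read1\t0\tchr1\t100\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:1"
print(keep_read(rec))  # -> True
```

In practice you would stream `samtools view` output through something like this rather than parse BAM by hand.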
Old 06-02-2012, 09:44 AM   #3
twaddlac
Member
 
Location: Pittsburgh, PA

Join Date: Feb 2011
Posts: 49

I have found it fairly advantageous to trim reads in order to obtain a higher percentage of mapped reads, since the ambiguously mapped percentage isn't significantly different, at least in my data sets.

However, when calling variants with VarScan, I have found trimming the reads fairly pointless, though that probably depends on the average coverage of your reference. As one example, I had 35-50bp paired-end shotgun reads on a ~4MB genome with ~300x average coverage. I tried trimming and filtering, and the SNV counts did not differ significantly (as seen below). I didn't use BFAST, though; I used bowtie and SHRiMP.

Untrimmed, unfiltered: 11697
Untrimmed, filtered: 11198
Trimmed, filtered: 11340
Trimmed, unfiltered: 11366

I've also noticed a drastic difference in assembly for those who are crazy enough to go de novo with SOLiD.

I did notice, however, that when using VarScan, turning the strand filter off eliminated a lot of the false positives. Also note that it may act differently on indels, as I have not tried that. However, I have the data ready, so I can try it out and let you know. I hope this helps!

Last edited by twaddlac; 06-02-2012 at 09:45 AM. Reason: Forgot to mention assembly!
Old 06-02-2012, 09:47 AM   #4
twaddlac
Member
 
Location: Pittsburgh, PA

Join Date: Feb 2011
Posts: 49

This thread may also be of interest to you:

http://seqanswers.com/forums/showthread.php?t=16745
Old 06-02-2012, 06:39 PM   #5
sonia.bao
Member
 
Location: US

Join Date: May 2012
Posts: 12

Quote:
Originally Posted by Chipper View Post
You should definitely try a re-aligner like SRMA to clean up the alignments, especially for indel-calling. You could also try a higher cutoff for base quality (e.g. -Q 20) or do some post-alignment filtering (e.g. remove reads with lots of mismatches, low mapping quality or low average quality score) to get more reliable results.
Thank you Chipper! I tried realignment with both GATK and SRMA, and the alignment was not improved much. Here is a snapshot of the SRMA output (same as the GATK output):



I am not sure if that is because I have a small genome (~4MB) with only ~10M mapped reads?

Any comments and suggestions are highly appreciated!

Last edited by sonia.bao; 06-02-2012 at 08:25 PM. Reason: re-align finished
Old 06-02-2012, 07:02 PM   #6
sonia.bao
Member
 
Location: US

Join Date: May 2012
Posts: 12

Quote:
Originally Posted by twaddlac View Post
I have found it fairly advantageous to trim reads in order to obtain a higher percentage of mapped reads, since the ambiguously mapped percentage isn't significantly different, at least in my data sets.

However, when calling variants with VarScan, I have found trimming the reads fairly pointless, though that probably depends on the average coverage of your reference. As one example, I had 35-50bp paired-end shotgun reads on a ~4MB genome with ~300x average coverage. I tried trimming and filtering, and the SNV counts did not differ significantly (as seen below). I didn't use BFAST, though; I used bowtie and SHRiMP.

Untrimmed, unfiltered: 11697
Untrimmed, filtered: 11198
Trimmed, filtered: 11340
Trimmed, unfiltered: 11366

I've also noticed a drastic difference in assembly for those who are crazy enough to go de novo with SOLiD.

I did notice, however, that when using VarScan, turning the strand filter off eliminated a lot of the false positives. Also note that it may act differently on indels, as I have not tried that. However, I have the data ready, so I can try it out and let you know. I hope this helps!
Thank you very much twaddlac! This is of great help. I will try VarScan again and get back here.

I am curious why turning the strand filter off reduced the false-positive rate. Isn't it meant to filter out extreme strand bias?

Would you mind sharing which tools you used for trimming SOLiD reads? I've seen people using cutadapt and CLC bio but have not tried them yet. Thanks much!

Last edited by sonia.bao; 06-02-2012 at 07:31 PM.
Old 06-02-2012, 07:25 PM   #7
sonia.bao
Member
 
Location: US

Join Date: May 2012
Posts: 12

Quote:
Originally Posted by twaddlac View Post
This thread may also be of interest to you:

http://seqanswers.com/forums/showthread.php?t=16745
Thank you very much! That thread is very useful. I've gotten similar results with various mappers too - BFAST and LifeScope map the most reads, and bowtie works best on the high-quality reads (the "cleanest" alignments). We finally decided to go with BFAST, as LifeScope was a pain to install on the server (and runs relatively slowly), and novoalignCS was too slow with only one thread (the free version), though it gave pretty good results.

BFAST has been good at aligning reads from genomic DNA; I haven't tried it with RNA-Seq though.

Last edited by sonia.bao; 06-02-2012 at 07:36 PM.
Old 06-04-2012, 08:51 AM   #8
twaddlac
Member
 
Location: Pittsburgh, PA

Join Date: Feb 2011
Posts: 49

Quote:
I am curious why turning the strand filter off reduced the false-positive rate. Isn't it meant to filter out extreme strand bias?
Sorry, I need to mind my terminology more. It didn't produce false positives per se - with the filter on, I mainly got more junk calls than when running without it. Either way, I found it advantageous to turn it off when using VarScan on my datasets. I'd be interested to hear how it works for you!

Quote:
Would you mind sharing which tools you used for trimming SOLiD reads?
I use SOLiD_preprocess_meanFilter_SOPRA_v1.pl, which comes with the SOPRA package (http://www.physics.rutgers.edu/~anirvans/SOPRA/). I used the defaults, but there are many tuning parameters too. It also does the trimming if you opt for it.
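For anyone curious, a mean-quality filter of this general shape can be sketched as below. This is an illustration of the idea only - not the actual SOPRA algorithm - and the function name and threshold values are made up:

```python
def mean_filter(seq, quals, min_mean=20, min_len=25):
    """Sketch of a mean-quality filter/trimmer: trim bases off the
    3' end until the mean quality of what remains reaches the
    threshold, and discard reads that end up too short.
    """
    end = len(quals)
    while end >= min_len and sum(quals[:end]) / end < min_mean:
        end -= 1
    if end < min_len:
        return None  # read fails the filter entirely
    return seq[:end], quals[:end]

# A high-quality prefix with a noisy 3' tail gets trimmed back
# (min_len lowered so the toy read is long enough):
print(mean_filter("ACGTACGT", [30] * 5 + [2] * 3, min_mean=20, min_len=4))
# -> ('ACGTACG', [30, 30, 30, 30, 30, 2, 2])
```

Mean-based trimming is forgiving of single bad calls mid-read, which suits SOLiD's noisy 3' ends better than a hard per-base cutoff.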
Old 12-13-2012, 12:14 PM   #9
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137

Hi everyone on this great thread!

What did you end up using to trim your SOLiD reads? I've read elsewhere that SOLiD chemistry produces a very weird-looking per-base quality boxplot in FastQC - so does anybody know what the standard for a good-quality run is?

Thanks!!

Carmen
Old 12-13-2012, 12:18 PM   #10
twaddlac
Member
 
Location: Pittsburgh, PA

Join Date: Feb 2011
Posts: 49

For the SOLiD 4 I've had good results trimming to 35bp, and to 50bp with the SOLiD 5. As I recall, those are where the noticeable quality drop-offs were, especially with the SOLiD 4.
Old 12-13-2012, 04:55 PM   #11
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137

Thanks, twaddlac!

Would you mind sharing what program you use to trim your reads?

Thanks,

Carmen
Old 12-14-2012, 08:02 AM   #12
twaddlac
Member
 
Location: Pittsburgh, PA

Join Date: Feb 2011
Posts: 49

Quote:
Originally Posted by twaddlac View Post
I use SOLiD_preprocess_meanFilter_SOPRA_v1.pl, which comes with the SOPRA package (http://www.physics.rutgers.edu/~anirvans/SOPRA/). I used the defaults, but there are many tuning parameters too. It also does the trimming if you opt for it.
It's a Perl script that comes with the SOPRA download - you don't have to install anything to use it.
Old 12-14-2012, 09:02 AM   #13
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137

Cool.

For anyone out there looking for a way to do this, here is a simple command that will help in case you already have your .csfasta and .qual files merged into a single colorspace fastq file.

To keep 25 bases starting from base 1:
Code:
awk 'NR % 2 == 0 { print substr($1, 1, 25) } NR % 2 == 1' SRR388230_2.fastq > SRR388230_2_TRIM.fastq
It doesn't take too long to run!
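Note the awk assumes every record is exactly four lines with no wrapped sequence (even lines are sequence/quality, odd lines are headers). An equivalent sketch that is explicit about the 4-line fastq structure, trimming both the sequence and quality lines (`trim_fastq_records` is a hypothetical helper):

```python
def trim_fastq_records(lines, start=1, length=25):
    """Trim each 4-line fastq record to `length` characters,
    starting at base `start` (1-based), touching only the
    sequence (line 2) and quality (line 4) of each record.
    Same spirit as the awk one-liner above.
    """
    out = []
    for i, line in enumerate(lines):
        if i % 4 in (1, 3):  # sequence and quality lines
            out.append(line[start - 1:start - 1 + length])
        else:                # header and '+' lines pass through
            out.append(line)
    return out

record = ["@read1", "AAAACCCCGGGG", "+", "IIIIJJJJKKKK"]
print(trim_fastq_records(record, start=1, length=8))
# -> ['@read1', 'AAAACCCC', '+', 'IIIIJJJJ']
```

One caveat for colorspace data: the quality line typically has one fewer value than the sequence line (no quality for the primer base), so check whether your downstream tool expects the trimmed lengths to match.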
Tags
bfast, fastqc, solid reads, trimming, varscan
