SEQanswers

Old 10-03-2013, 09:54 AM   #1
persorrels
Junior Member
 
Location: San Jose, CA

Join Date: Oct 2013
Posts: 6
Minimum Read Length for BWA

Hi all,

I am preprocessing a dataset from a human sample sequenced on an Illumina HiSeq 2500 (paired-end reads, 100 bp each). I first trim each read based on quality. If the trimmed sequence is too short, I discard it.

My question is: how do you pick the length threshold below which a read is discarded? Would you discard reads shorter than 50, 40, or 30 bp? What is the right approach to picking a threshold?

I haven't been able to find any information on this on the web. (By the way, I am using BWA for alignment.)

Thanks in advance.
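
The trim-then-discard step described above could be sketched roughly as follows. This is only an illustration, assuming Phred+33 quality encoding and a simple "cut trailing bases below a cutoff" rule; the function name `trim_tail_and_filter` and the defaults `min_q=20` and `min_len=30` are hypothetical and not taken from any particular trimming tool.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def trim_tail_and_filter(seq, qual, min_q=20, min_len=30):
    """Trim trailing low-quality bases, then drop the read if it is too short.

    Returns (seq, qual) for a kept read, or None for a discarded read.
    """
    end = len(qual)
    # Walk back from the 3' end while the base quality is below the cutoff.
    while end > 0 and ord(qual[end - 1]) - PHRED_OFFSET < min_q:
        end -= 1
    if end < min_len:
        return None  # too short after trimming: discard
    return seq[:end], qual[:end]
```

For paired-end data, a discarded read also affects its mate, so the two files need to be kept in sync (or the mate handled as a single-end read).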

Last edited by persorrels; 10-03-2013 at 03:38 PM. Reason: Clarification
Old 10-03-2013, 10:55 PM   #2
vishnuamaram
Member
 
Location: india

Join Date: Jun 2013
Posts: 42

Hey persorrels,

Your approach of quality-trimming bases and then discarding short reads does not look appropriate to me.

Illumina reads are generally of good quality, except at the first 3-4 base positions and possibly the last 2.

--> It doesn't make sense to end up with reads of 30, 40, or 50 bp after trimming 100 bp reads.

--> Just check the quality of the first few bases. If they are above Q20, you can proceed with alignment; if not, trim only those bases. That will do.

Good luck ahead,
Vishnu.
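
As a rough sketch of the "trim the first few bases only if they are below Q20" suggestion: the snippet below assumes Phred+33 encoding, and `trim_low_quality_head` with its `max_trim=4` limit is a hypothetical helper written for illustration, not a function from an existing trimmer.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def trim_low_quality_head(seq, qual, min_q=20, max_trim=4):
    """Remove leading bases whose quality is below min_q, at most max_trim bases."""
    cut = 0
    limit = min(max_trim, len(qual))
    while cut < limit and ord(qual[cut]) - PHRED_OFFSET < min_q:
        cut += 1
    return seq[cut:], qual[cut:]
```

Reads whose first bases already pass the cutoff are returned unchanged.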
Old 10-04-2013, 09:22 AM   #3
persorrels
Junior Member
 
Location: San Jose, CA

Join Date: Oct 2013
Posts: 6

Vishnu, thanks for your reply.

It is true that most forward reads are of high quality. But the reverse reads aren't. Below are two examples:


NTATATTTTCCTCTTGGTGGTATTGAAAACCAGTGAGCAGAGAGCATAAGAACAGAACTTCAAGACCGTGGCAGGAGCTTGTATTTGTACAGCACAAACCC
+
#+12??A;A>@>C>CBBBA=?CBBBBBBCBAAA@ABBBB>ABB;=BBBB<=A;=AA>==AAAB=>AAAA################################




NAAGGAGCAGCTGCGTGCCGCGTGAGCTTTAGCAGGAGGACCAGTGATTAGCATTTACGATGCAAAGACAGAACAACTTCGTATAGGACTGTACCCCTGGA
+
#+1<?7AA<CBABBC<CCAAA=)?153*=A?A#####################################################################


These are extreme examples, but you can still recover an 18 bp region (AA<CBABBC<CCAAA=) from the second read. I guess what you're saying is that if I have to trim away a large portion of a read, I should discard it entirely.

Is that correct?

Also, I suspect short reads like this will affect the performance of BWA. I want to understand how the read length distribution affects it.

Any comments on that?

Cheers,

Per
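
To make the "recoverable region" idea concrete, here is a minimal sketch that finds the longest run of bases at or above a quality cutoff in a FASTQ quality string. It assumes Phred+33 encoding (where `#` is Q2 and `A` is Q32); `longest_good_run` and the Q20 cutoff are illustrative choices, not what any specific trimmer does.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def longest_good_run(qual, min_q=20):
    """Return (start, length) of the longest stretch with quality >= min_q.

    Assumes min_q > 0; returns (0, 0) if no base passes the cutoff.
    """
    best_start, best_len = 0, 0
    run_start = None
    # Append '!' (Q0) as a sentinel so a run reaching the end is closed.
    for i, ch in enumerate(qual + "!"):
        if ord(ch) - PHRED_OFFSET >= min_q:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return best_start, best_len
```

The returned length can then be compared against whatever minimum read length is chosen.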
Old 10-04-2013, 09:27 AM   #4
vishnuamaram
Member
 
Location: india

Join Date: Jun 2013
Posts: 42

Hey persorrels,

Which organism did you sequence, and what is the coverage?

How large is the data set? How many reads does it contain?

Why are you looking at individual reads?

First, run a quality check on the full read set, for Read 1 and Read 2 separately.
Then look at the quality plots, and only then proceed to trimming.

Best,
vishnu.
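
A dedicated QC tool is the usual way to do this, but the idea behind a per-base quality summary can be sketched in a few lines. The snippet assumes Phred+33 encoding and unwrapped four-line FASTQ records; `mean_quality_per_position` is a hypothetical name used only for this illustration.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def mean_quality_per_position(fastq_lines):
    """Mean Phred score at each read position over all records.

    fastq_lines: iterable of lines from a FASTQ file with exactly
    four lines per record (header, sequence, '+', qualities).
    """
    totals, counts = [], []
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # only the fourth line of each record holds qualities
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            if pos == len(totals):  # first time this position is seen
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - PHRED_OFFSET
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]
```

Running it once on the Read 1 file and once on the Read 2 file gives the two per-position profiles to compare before deciding how to trim.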
Old 10-04-2013, 09:47 AM   #5
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

You can usually use the default minimum size of whatever trimmer you're using as a guide. Often minimum sizes of 20 or 30 are used (I wouldn't bother going much lower than that, since anything shorter will probably just become a multi-mapper).
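
A back-of-the-envelope calculation gives some intuition for why very short reads tend to multi-map. The sketch below treats the genome as uniformly random sequence of about 3.1 Gb (an approximate human genome size, assumed here) and counts both strands; `expected_chance_hits` is a hypothetical helper. Real genomes are repetitive, so this underestimates multi-mapping in practice.

```python
def expected_chance_hits(read_len, genome_size=3.1e9):
    """Expected number of exact matches of a random read in a random genome.

    Assumes uniform base composition and counts both strands; this is an
    idealized lower bound, not a model of a real, repetitive genome.
    """
    return 2 * genome_size / 4 ** read_len


for length in (14, 16, 18, 20, 30):
    print(length, expected_chance_hits(length))
```

Under these assumptions, a 16 bp read is already expected to match somewhere by chance, while the expectation drops well below one for reads of 20 bp or more.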
Old 10-04-2013, 11:16 AM   #6
persorrels
Junior Member
 
Location: San Jose, CA

Join Date: Oct 2013
Posts: 6

Human, 10x exome. Total sequence size is about 30Gb. 100bp per read. Attached are some sample statistics from reverse reads.
Attached Images
File Type: png per_base_quality.png (11.2 KB, 16 views)
File Type: png per_sequence_quality.png (22.9 KB, 10 views)
Old 10-05-2013, 11:03 AM   #7
vishnuamaram
Member
 
Location: india

Join Date: Jun 2013
Posts: 42

Hey persorrels,

If I were in your place, I would proceed this way.

I suggest you trim the first 4 bases, run FastQC on the trimmed file, and check the QC chart.

--> If you look at the mean quality (the blue line), it stays more or less near 30, above 28.
--> The median quality (the red line) is above 30 for all bases.

That means only certain bases in certain reads are of poor quality.

In any case, trim the first 4 bases, align, and see.
Do it. You will learn a lot along the way.

Best,
Vishnu.
Tags
alignment, quality control, trimming
