SEQanswers

Old 01-07-2014, 02:36 PM   #1
choijae3
Junior Member
 
Location: ithaca

Join Date: Jan 2014
Posts: 8
How to randomly remove portions of the raw reads from the FASTQ file

Hi everyone

I'm a graduate student who has just started doing some NGS for my thesis project.

Most of the problems I've had I could search for and find answered here on SEQanswers, but I think I have a situation where I might need some help.

I have done 2x150 PE HiSeq sequencing, pooling 3 different populations of Drosophila. Using a reference-genome-based reassembly (bwa and yada yada), I ended up with pretty good coverage: for chromosome 2L alone, the average was about 70X.

This is really good, but I think it's a little overkill for me, since running the FASTQ files through FastQC indicated the level of duplication for the library was around 25%, and I'm now tending to think that I'm not really "learning" anything new and much of the sequencing is being wasted.

I'm on a very limited budget, and my dilemma is whether I can pool more samples (maybe 4 or even 5) in a sequencing run so I can sequence more populations.

With this in mind, I'm trying to mimic a situation where I had initially pooled 4 or 5 populations by decreasing the number of reads in my current FASTQ file.
So that was a long way of asking: how can I randomly delete a significant proportion of the paired reads from my initial FASTQ file?

Thanks again for reading this far!
Old 01-07-2014, 02:45 PM   #2
choijae3
Junior Member
 
Location: ithaca

Join Date: Jan 2014
Posts: 8

nvm so I've found useful links to solve my problem from here and here
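For anyone reading without those links, here is a minimal sketch of one way to do it, assuming uncompressed FASTQ files with exactly 4 lines per record and R1/R2 files listing the mates in the same order. The function names and file paths are hypothetical, not taken from any particular tool. The key idea is a single random draw per pair, so both mates are kept or dropped together.

```python
import random
from itertools import islice, zip_longest


def read_fastq(handle):
    """Yield one FASTQ record (a list of 4 lines) at a time."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=0):
    """Keep each read pair with probability `fraction`; return (kept, total)."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    kept = total = 0
    with open(r1_in) as f1, open(r2_in) as f2, \
         open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for rec1, rec2 in zip_longest(read_fastq(f1), read_fastq(f2)):
            if rec1 is None or rec2 is None:
                raise ValueError("R1 and R2 have different numbers of reads")
            total += 1
            # One random draw per pair keeps the two output files in sync.
            if rng.random() < fraction:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept, total
```

Something like `subsample_pairs("s_R1.fastq", "s_R2.fastq", "sub_R1.fastq", "sub_R2.fastq", 0.6)` would keep roughly 60% of the pairs. Note that this gives an approximate fraction, not an exact read count.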

However, does my approach make sense? Is decreasing the library size a valid way to see whether more pooling would be beneficial?

sorry if I'm derailing the post...
Old 01-07-2014, 03:00 PM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

FYI, I assume you mean "multiplexing" rather than "pooling". While there is pooling in both cases, the former is probably a more exact description of what you're doing (I assume you're looking for sequence differences between strains or something like that, so being able to separate reads by strain would be useful).

Regarding your strategy, it's often termed "saturation analysis" or "making a saturation/rarefaction curve" or various permutations thereof. It's a very good thing to do and I've seen a few papers (mostly RNAseq) specifically doing that to estimate maximal statistical power. 70x is overkill for a lot of common things, so I wouldn't be surprised if you can get away with throwing more samples on there.
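As a rough sketch of what a saturation/rarefaction curve computes: take nested random subsamples of your observations at increasing fractions and count how many distinct things you have seen at each depth. The function name and the choice of "observation" here are assumptions for illustration; in practice the observations might be variant sites called, genes detected, or similar, depending on the project.

```python
import random


def rarefaction_curve(observations, fractions, seed=0):
    """Return a list of (fraction, n_sampled, n_distinct) tuples."""
    rng = random.Random(seed)
    shuffled = list(observations)
    rng.shuffle(shuffled)  # any prefix of a shuffled list is a random subsample
    curve = []
    for frac in sorted(fractions):
        n = round(frac * len(shuffled))
        # Prefixes are nested, so the distinct count can only grow with depth.
        curve.append((frac, n, len(set(shuffled[:n]))))
    return curve
```

If the distinct count has nearly flattened well before fraction 1.0, that suggests the extra depth is adding little, which is the situation where more multiplexing is likely to be safe.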
Old 01-07-2014, 03:22 PM   #4
choijae3
Junior Member
 
Location: ithaca

Join Date: Jan 2014
Posts: 8

Yes, I should have said multiplexing instead of pooling. I'm conducting a population-genomics-type project and trying to sequence as many populations as possible without sacrificing coverage too much.

Thanks dpryan!
Old 01-08-2014, 08:14 AM   #5
barkasn
Junior Member
 
Location: Oxford, UK

Join Date: Mar 2012
Posts: 9

Hi choijae3,

70X coverage is very high and overkill for most applications, so in general you are better off sequencing more samples as opposed to sequencing the same thing over and over again.

With respect to the duplication rate, I would recommend you do not trust FastQC. FastQC estimates duplication rate by looking at the first and second reads independently. Given your high coverage, it is very likely that you will get 1st reads starting at the exact same spot. I would recommend you use Picard MarkDuplicates.jar to estimate the duplication rate after alignment, as this takes into account both the first and second reads of each pair.
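To illustrate the point about first reads coinciding by chance, here is a toy simulation. It is not how FastQC or Picard work internally: it just places fragments uniformly at random on a small hypothetical genome, with no PCR duplicates at all, and counts how many would look duplicated if judged by the first-read start alone versus by both ends of the pair. The function name, genome size, and insert-size range are all made up for illustration.

```python
import random


def simulate_duplication(genome_len, n_pairs, seed=0):
    """Return (rate judged by R1 start only, rate judged by both pair ends)."""
    rng = random.Random(seed)
    starts_seen, pairs_seen = set(), set()
    dup_by_r1 = dup_by_pair = 0
    for _ in range(n_pairs):
        start = rng.randrange(genome_len)
        insert = rng.randint(200, 500)  # hypothetical insert-size range
        # Single-end view: same start position counts as a duplicate.
        if start in starts_seen:
            dup_by_r1 += 1
        else:
            starts_seen.add(start)
        # Paired view: both fragment ends must match.
        key = (start, start + insert)
        if key in pairs_seen:
            dup_by_pair += 1
        else:
            pairs_seen.add(key)
    return dup_by_r1 / n_pairs, dup_by_pair / n_pairs
```

Calling `simulate_duplication(10_000, 5_000)` gives a much higher apparent rate for the single-end view than for the paired view, even though nothing in the simulation is a true duplicate. Real libraries are more complicated, but the direction of the bias is the same.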
Old 01-08-2014, 08:27 AM   #6
choijae3
Junior Member
 
Location: ithaca

Join Date: Jan 2014
Posts: 8

Hi barkasn

Thanks for the reply! I've been following the best practices from the Broad Institute and have done the mark duplicates step. I hadn't paid much attention to its output (I really should have), and I found that to be more helpful. Thanks again for the advice!

All times are GMT -8.