SEQanswers

Old 08-07-2014, 02:57 AM   #1
travelk
Member
 
Location: France

Join Date: Jul 2013
Posts: 20
Removing contaminating reads

Hey all,

I have a very basic question that I just cannot seem to find a straight answer to despite scouring Google and SEQanswers.

When I ran my dataset through FastQ Screen, I discovered that I have E. coli contamination. I then aligned my files against the E. coli genome with TopHat2 and determined that about 5% of my reads map to E. coli.

I would like to remove these reads but I cannot figure out how.

Most forum posts suggest removing such reads but do not specify exactly how. My initial thought was to take the unmapped.bam file from my E. coli alignment and use that as my "clean" data, but that means converting the data back to paired-read fastq format and adding various processing steps, so I'm not really sure what would be left in that file. Naturally, I'd like to lose as little data as possible.

Is there a simple way to take my original fastq file, remove the E. coli-positive reads, and end up with a clean data set?

Or, alternatively, does contamination like that not really matter because those reads won't map to the mouse genome anyway?

Obviously I'm pretty new to this and teaching myself so any advice, however simple it may seem, is much appreciated.

Thanks for your help!
Old 08-07-2014, 03:24 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Just align to the E. coli genome with bowtie2, which can be told to write unmapped reads in fastq format. You can then align the resulting fastq files directly.
Old 08-07-2014, 03:29 AM   #3
a.kmg
Member
 
Location: Internet

Join Date: Aug 2014
Posts: 15

You can run Bowtie2 with your fastq file against the E. coli index to remove the reads that correspond to E. coli:

Quote:
bowtie2 -U rawFastq -p nbproc --un fasqtFileWithoutEcoli indexEcoli -S ecoli_reads.sam
This command processes the raw fastq file, writing a new fastq file that contains the reads not mapping to E. coli (--un = unmapped reads) and saving the reads aligned to E. coli in a SAM file (-S option). Without -S, bowtie2 writes the SAM output to stdout, so redirect it (for example to /dev/null) if you do not want to keep the E. coli reads.
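
Since the original question mentions paired reads, here is a rough sketch of a paired-end variant. It assumes bowtie2's -1/-2 inputs and its --un-conc option, which writes pairs that fail to align concordantly and replaces % in the path with 1 or 2. The helper names decontam_paired and run, the DRY_RUN switch, and all file names are made up for illustration.

```shell
# Dry-run wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper (not part of bowtie2): keep the read pairs that do not
# align concordantly to the contaminant index and discard the SAM output.
decontam_paired() {
    index=$1; r1=$2; r2=$3; prefix=$4; threads=${5:-4}
    run bowtie2 -p "$threads" -x "$index" -1 "$r1" -2 "$r2" \
        --un-conc "${prefix}_%.fastq" -S /dev/null
}

# Example with placeholder file names; DRY_RUN=1 only prints the command.
DRY_RUN=1
decontam_paired indexEcoli raw_1.fastq raw_2.fastq clean
```

Note that --un-conc keeps any pair that does not align concordantly, so a pair with only one mate hitting E. coli would still be kept; whether that is acceptable depends on how strict the filtering needs to be.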
Old 08-07-2014, 03:42 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,057

BBMap example (remain in the same directory or adjust paths accordingly). Will work on PC/Mac/*nix.

1. Get the E. coli genome fasta file.

2. Build an index for the E. coli genome using BBMap.

Code:
$ bbmap.sh ref=./E_coli_genome.fa
3. Align against the E. coli genome index, saving the reads that don't align to a new file (these are the reads you want, the outu file below).

Code:
$ bbmap.sh in=your_fastq_file path=./ outm=E_coli_reads.fastq outu=reads_you_want.fastq qin=33
Old 08-07-2014, 07:56 AM   #5
travelk
Member
 
Location: France

Join Date: Jul 2013
Posts: 20

Thank you! This totally did the trick. I processed the files and then reran them through fastq_screen and all the E.coli reads were gone.

And thank you a.kmg and GenoMax for your clear step by step instructions.

I ended up using bowtie2 because that is what I've been working with so far. I initially got an error message but then realized I was missing the -x option on the command line. So, in the end, I used:

Quote:
bowtie2 -U raw.fastq -p nbproc --un FileWithoutEcoli.fastq -x indexEcoli -S ecoli_reads.sam
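
A quick sanity check, sketched under the assumption that both files are uncompressed FASTQ with exactly four lines per record: compare read counts before and after filtering to see whether roughly the expected fraction (about 5% in this thread) was removed. The function names count_reads and percent_removed are hypothetical.

```shell
# Count records in an uncompressed FASTQ file (assumes 4 lines per record).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print the percentage of reads removed between a raw and a filtered file.
percent_removed() {
    raw=$(count_reads "$1")
    clean=$(count_reads "$2")
    awk -v r="$raw" -v c="$clean" 'BEGIN {
        if (r == 0) { print "0.00" }
        else        { printf "%.2f\n", 100 * (r - c) / r }
    }'
}
```

With the file names from the command above, this would be called as percent_removed raw.fastq FileWithoutEcoli.fastq.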
Old 12-10-2014, 11:53 AM   #6
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57

Hi all,

I am trying to filter rRNA reads out of my input file:

Step1: Create Index

#bowtie2-build rRNA.fasta rRNA.index

Step 2: Align to the rRNA index in order to get an rRNA-free fastq file.

#bowtie2-align -p 2 -k 1 -q -U /filter_clean.fastq --un fasqFileWithoutrRNA -x rRNA.index

When I run second step, it comes with error saying:

" bowtie2-align: option '--un' is ambiguous; possibilities: '--ungapped' '--unpaired' "

So I replaced --un with --unpaired, but it is not working.

Can anyone please shed some light on this?

I would really appreciate your help.

Thanks,
Naresh
Old 12-11-2014, 03:51 AM   #7
jpnm
Junior Member
 
Location: Portugal

Join Date: May 2013
Posts: 1

As far as I know, "--un" is not an option of bowtie2-align, only of bowtie2!

So if you run the following, it should work fine:

bowtie2 -p 2 -k 1 -q -U /filter_clean.fastq --un fasqFileWithoutrRNA -x rRNA.index

I hope it helps!
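
Putting the two steps together, a minimal sketch might look like the following. It assumes that the bowtie2 wrapper script (rather than bowtie2-align) is what accepts --un, as described above, and adds -S /dev/null because bowtie2 otherwise writes SAM to stdout. have_tool and filter_rrna are hypothetical helper names; the index name and options are taken from the commands in this thread.

```shell
# Return success only if the named program is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: build the rRNA index, then keep the reads that do
# not align to it (written to the file given as the third argument).
filter_rrna() {
    ref=$1; reads=$2; out=$3
    for tool in bowtie2-build bowtie2; do
        have_tool "$tool" || { echo "missing: $tool" >&2; return 1; }
    done
    bowtie2-build "$ref" rRNA.index &&
    bowtie2 -p 2 -k 1 -q -U "$reads" --un "$out" -x rRNA.index -S /dev/null
}
```

It could then be called as filter_rrna rRNA.fasta /filter_clean.fastq fasqFileWithoutrRNA.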
Old 12-11-2014, 06:04 AM   #8
nareshvasani
Member
 
Location: NC

Join Date: Apr 2013
Posts: 57

Hi jpnm,

Thanks for your input. I am trying your suggestion and will keep you posted.


Naresh