Hey all,
I have a very basic question that I just cannot seem to find a straight answer to despite scouring Google and SEQanswers.
When I ran my dataset through Fastq_screen, I discovered that I have E. coli contamination. I aligned my files with TopHat2 against the E. coli genome and determined that about 5% of my reads map to E. coli.
I would like to remove these reads but I cannot figure out how.
Most forum posts suggest removing them but do not specify exactly how. My initial thought was to take the unmapped.bam file from my E. coli alignment and use that as my "clean" data, but that means converting the data back to paired-read FASTQ and adding various processing steps, so I'm not really sure what is left in that file. Naturally, I'd like to lose as little data as possible.
Is there a simple way to just take my original FASTQ file, remove the E. coli-positive reads, and end up with a clean data set?
Or alternatively, does contamination like that not really matter, because those reads won't map to the mouse genome anyway?
Obviously I'm pretty new to this and teaching myself so any advice, however simple it may seem, is much appreciated.
Thanks for your help!
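For anyone landing here with the same question, below is a minimal sketch of the name-based filtering asked about above: drop every record from the original FASTQ whose read name is in a set of E. coli-mapped names. It is an illustration under assumptions, not a vetted pipeline. The helper names `read_name` and `filter_fastq` are hypothetical, the FASTQ is assumed to use plain 4-line records, and the set of contaminant names is assumed to have been collected separately (for example from the first column of the mapped alignments in TopHat2's accepted_hits.bam).

```python
import io


def read_name(header_line):
    """Return the read name from a FASTQ header line.

    Drops the leading '@', any comment after whitespace, and a trailing
    /1 or /2 mate suffix, so the name matches what appears in a BAM file.
    """
    name = header_line.rstrip("\n").split()[0][1:]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def filter_fastq(fastq_in, fastq_out, contaminant_names):
    """Copy 4-line FASTQ records whose name is NOT in contaminant_names.

    Returns a (kept, dropped) tuple of record counts.
    Assumes plain 4-line records (no wrapped sequence lines).
    """
    kept = dropped = 0
    while True:
        record = [fastq_in.readline() for _ in range(4)]
        if not record[0].strip():  # end of file or trailing blank line
            break
        if read_name(record[0]) in contaminant_names:
            dropped += 1
        else:
            fastq_out.writelines(record)
            kept += 1
    return kept, dropped


# Tiny in-memory demonstration; real use would pass open file handles.
demo_in = io.StringIO(
    "@read1/1\nACGT\n+\nIIII\n"
    "@read2/1\nTTTT\n+\nIIII\n"
    "@read3/1\nGGCC\n+\nIIII\n"
)
demo_out = io.StringIO()
counts = filter_fastq(demo_in, demo_out, {"read2"})
print(counts)
```

For paired-end data, running the same filter with the same name set on both mate files should keep the two files in sync, since a pair is dropped from both files whenever either mate mapped to E. coli.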