SEQanswers

Old 04-22-2013, 12:14 AM   #1
Fernas
Member
 
Location: Middle East

Join Date: Apr 2013
Posts: 74
Find Common Reads between two FASTQ files

Dear experts,

I have two FASTQ files containing RNA-Seq reads from two technical replicates of one sample (the replicates come from running the sequencing machine twice). I want to select the reads that appear in both FASTQ files by comparing the read sequences between the two files. How can I do that? Do any of the bioinformatics tools do it?
Old 04-22-2013, 07:26 AM   #2
mknut
Member
 
Location: UK

Join Date: Jul 2012
Posts: 23

The easiest way to do this:
Code:
comm -12 "file1" "file2"
Redirect the output to save the results in a file:
Code:
comm -12 "file1" "file2" > common_lines.txt
This command compares two files and prints the lines common to both. Note that comm expects both files to be sorted.
With the -23 flag you get the lines unique to file1; with the -13 flag you get the lines unique to file2.
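
One caveat worth spelling out: comm assumes its inputs are already sorted, so unsorted files generally need a sort step first. A minimal sketch, assuming two tiny made-up text files (a.txt and b.txt are hypothetical names used only for illustration):

```shell
#!/bin/sh
# Hypothetical demo files; a.txt and b.txt are illustrative names only.
workdir=$(mktemp -d)
printf 'GATTACA\nACGT\nTTTT\n' > "$workdir/a.txt"
printf 'TTTT\nCCCC\nACGT\n' > "$workdir/b.txt"

# comm needs sorted input, so sort each file first. LC_ALL=C keeps the
# collation order identical for sort and comm.
LC_ALL=C sort "$workdir/a.txt" > "$workdir/a.sorted"
LC_ALL=C sort "$workdir/b.txt" > "$workdir/b.sorted"

# -12 suppresses columns 1 and 2, leaving only lines present in both files.
common=$(LC_ALL=C comm -12 "$workdir/a.sorted" "$workdir/b.sorted")
printf '%s\n' "$common"

rm -r "$workdir"
```

This still compares whole lines, so on a raw FASTQ file it would match header, sequence, separator and quality lines independently of each other.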

May I ask why you want to do this?
Old 04-22-2013, 12:03 PM   #3
Fernas
Member
 
Location: Middle East

Join Date: Apr 2013
Posts: 74

Thanks mknut for your reply.

Well... comm may not work here, because each read spans 4 lines and I need to select reads that have the same sequence (but may differ in their quality strings). So I am wondering if there is an alternative?
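
For this record-aware case, one possible approach is to extract only the sequence line of each 4-line record, intersect those, and then fetch the full records whose sequence is in the shared set. A rough sketch, assuming strictly 4-line FASTQ records (no wrapped sequences); rep1.fastq, rep2.fastq and the reads in them are made up for illustration:

```shell
#!/bin/sh
# Assumptions: each read occupies exactly 4 lines; rep1.fastq / rep2.fastq
# are hypothetical names holding tiny made-up reads.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$workdir/rep1.fastq"
printf '@x1\nTTAA\n+\n####\n@x2\nACGT\n+\n!!!!\n' > "$workdir/rep2.fastq"

# Line 2 of every 4-line record is the sequence; keep unique sorted sequences.
awk 'NR % 4 == 2' "$workdir/rep1.fastq" | LC_ALL=C sort -u > "$workdir/seq1"
awk 'NR % 4 == 2' "$workdir/rep2.fastq" | LC_ALL=C sort -u > "$workdir/seq2"

# Sequences present in both replicates, regardless of quality strings.
LC_ALL=C comm -12 "$workdir/seq1" "$workdir/seq2" > "$workdir/common_seqs"

# Print the full records from rep1 whose sequence is in the common set.
shared=$(awk '
    NR == FNR { keep[$0] = 1; next }      # first file: the common sequences
    { rec[FNR % 4] = $0 }                 # buffer the lines of the current record
    FNR % 4 == 0 && (rec[2] in keep) {    # record complete: test its sequence
        print rec[1]; print rec[2]; print rec[3]; print rec[0]
    }
' "$workdir/common_seqs" "$workdir/rep1.fastq")
printf '%s\n' "$shared"

rm -r "$workdir"
```

In this sketch every rep1 record with a shared sequence is kept, including duplicates, and the quality string comes from rep1 only. Whether that is the right behaviour depends on what the filtered reads are used for.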

Regarding your question about why I am trying to do this:
I have two technical replicates for the RNA-Seq sequencing of my sample, so I have two libraries of reads (2 FASTQ files). In order to represent the sample by one library, I have two options:
1) Combine the reads from both files. This will increase coverage. However, artifact reads generated in either of the two libraries will not be detected.
2) Select the reads that appear in both libraries (the same sequence, though possibly differing in sequencing quality). This option is applicable in my case because my technical replicates share the same library preparation procedure (same fragmentation, etc.). The only difference is that the Illumina sequencing machine was run twice!

Maybe there is another, better option. If you or anyone else has another suggestion, I would appreciate a reply to this message.
Old 04-22-2013, 01:02 PM   #4
mknut
Member
 
Location: UK

Join Date: Jul 2012
Posts: 23

I am not entirely sure why you want to make one library from two technical replicates. If you keep the reads as technical replicates, you preserve information about the variability introduced by the method; this is the idea behind having technical replicates in the first place. I think it would be better to continue the analysis without merging the files: do QC and mapping for them separately, and use software that accommodates replicates (the majority does) in further analysis (e.g. cufflinks, cuffdiff). What exactly are you investigating: differential gene expression or something else?

One other thing -
Quote:
my technical replicates do have the same library preparation procedure
Correct me if I'm wrong, but I understand that you had one sample, divided it into two, and then both parts went through the same library prep protocol and sequencing. In that case you will see not only variability originating from the sequencing, but variability originating from library prep as well. Have a look at this thread.
Old 04-22-2013, 10:38 PM   #5
Fernas
Member
 
Location: Middle East

Join Date: Apr 2013
Posts: 74

Yes, I am studying differential gene expression between samples. I have n samples, each with m technical replicates, where m differs from one sample to another.

The thread you referred to is useful. My technical replicates of each sample are just "the same library was sequenced on m different lanes". So, as per that thread, the variance between these replicates reflects the technical variance of Illumina sequencing. Consequently, my purpose is to remove this technical sequencing variance between technical replicates (e.g. artifact reads) and focus on the biological variance between samples. That is what I am still inclined to do, but I am looking for criticism of it, to find out whether this methodology can be replaced by a better one.

Just found this thread..
http://seqanswers.com/forums/showthread.php?t=16918
Old 04-23-2013, 05:50 AM   #6
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Find Common Reads between two FASTQ files

I would just combine the files for the technical replicates.

For example, see the vignette (documentation) for the Bioconductor package DESeq:

http://bioconductor.org/packages/2.1...tml/DESeq.html
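
If the combining is done at the FASTQ level, plain concatenation is usually enough for uncompressed files. A minimal sketch with made-up lane files (lane1.fastq and lane2.fastq are hypothetical names); whether to pool before or after mapping, or at the count level, is a separate choice:

```shell
#!/bin/sh
# Hypothetical lane files for one library; names and reads are made up.
workdir=$(mktemp -d)
printf '@a1\nACGT\n+\nIIII\n' > "$workdir/lane1.fastq"
printf '@b1\nGGCC\n+\nIIII\n@b2\nTTAA\n+\nIIII\n' > "$workdir/lane2.fastq"

# FASTQ is plain text made of independent records, so cat pools the lanes.
cat "$workdir/lane1.fastq" "$workdir/lane2.fastq" > "$workdir/pooled.fastq"

# Sanity check: the pooled record count should equal the sum of the inputs
# (assumes strictly 4-line records).
pooled_reads=$(awk 'END { print NR / 4 }' "$workdir/pooled.fastq")
echo "pooled reads: $pooled_reads"

rm -r "$workdir"
```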
Old 04-24-2013, 03:16 AM   #7
syfo
Just a member
 
Location: Southern EU

Join Date: Nov 2012
Posts: 103

Quote:
Originally Posted by Fernas
My technical replicates of each sample are just "the same library was sequenced on m different lanes".
I would just merge them then, probably after the mapping step. You can still run some QC on them separately to make sure you observe a nice correlation between these technical replicates, but my understanding is that the results should not differ from what you would get by sequencing deeper.
Old 01-21-2014, 03:52 AM   #8
emp
Member
 
Location: india

Join Date: Jan 2014
Posts: 11

Hello all,

I have a similar problem. I have two FASTQ files, R1 and R2, which are not equal in size.

I want to separate out the matching reads from them for further mapping.

Is there any tool or command for this?
Kindly help...
Old 01-21-2014, 04:03 AM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480

Quote:
Originally Posted by emp
Hello all,

I have a similar problem. I have two FASTQ files, R1 and R2, which are not equal in size.

I want to separate out the matching reads from them for further mapping.

Is there any tool or command for this?
Kindly help...
Have a look at this or this thread on biostars for a number of ways to do this.
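
If the goal is to keep only reads whose mate is present in both R1 and R2, one way to sketch it with standard tools is to filter each file by the read IDs seen in the other. This is only an illustration, not necessarily what the linked threads propose. It assumes 4-line records and that mates share the first whitespace-delimited token of the header, optionally followed by /1 or /2; the file names and reads are made up:

```shell
#!/bin/sh
# Assumptions: 4-line records; mates share the first whitespace-delimited
# token of the header line, optionally followed by /1 or /2.
workdir=$(mktemp -d)
printf '@p1/1\nACGT\n+\nIIII\n@p2/1\nGGCC\n+\nIIII\n' > "$workdir/R1.fastq"
printf '@p2/2\nTTAA\n+\nIIII\n@p3/2\nCCGG\n+\nIIII\n' > "$workdir/R2.fastq"

# awk program: print the records of the second file whose read ID also
# occurs among the headers of the first file.
filter='
    function id(h,   f) { split(h, f, " "); sub(/\/[12]$/, "", f[1]); return f[1] }
    NR == FNR { if (FNR % 4 == 1) seen[id($0)] = 1; next }
    FNR % 4 == 1 { k = id($0); keep = (k in seen) }
    keep
'
awk "$filter" "$workdir/R2.fastq" "$workdir/R1.fastq" > "$workdir/R1.paired.fastq"
awk "$filter" "$workdir/R1.fastq" "$workdir/R2.fastq" > "$workdir/R2.paired.fastq"

# Headers of the surviving records, R1 first and then R2.
paired_ids=$(awk 'NR % 4 == 1' "$workdir/R1.paired.fastq" "$workdir/R2.paired.fastq")
printf '%s\n' "$paired_ids"

rm -r "$workdir"
```

Each output keeps the order of its input file, so the two outputs stay in sync only if the inputs listed shared reads in the same relative order. The sketch also holds all read IDs of one file in memory, which may matter for large runs.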