**mknut** · 04-22-2013, 07:26 AM

The easiest way to do this:

Code:

comm -12 "file1" "file2"

Redirect the output to save the results in a file:

Code:

comm -12 "file1" "file2" > common_lines.txt

This command compares two files and prints the lines common to both. Note that comm expects both files to be sorted.
With the -23 flags you get the lines unique to file1; with the -13 flags you get the lines unique to file2.
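
Since comm only gives meaningful results on sorted input, here is a minimal sketch of the full sequence. The file contents are toy data made up for illustration, and the `.sorted` and `only_*` output names are hypothetical:

```shell
# Toy inputs (hypothetical contents), deliberately unsorted.
printf 'read_c\nread_a\nread_b\n' > file1
printf 'read_b\nread_d\nread_a\n' > file2

# comm expects sorted input, so sort each file first.
# LC_ALL=C keeps sort and comm on the same collation order.
LC_ALL=C sort file1 > file1.sorted
LC_ALL=C sort file2 > file2.sorted

LC_ALL=C comm -12 file1.sorted file2.sorted > common_lines.txt  # lines in both files
LC_ALL=C comm -23 file1.sorted file2.sorted > only_file1.txt    # lines unique to file1
LC_ALL=C comm -13 file1.sorted file2.sorted > only_file2.txt    # lines unique to file2
```

Here common_lines.txt ends up holding read_a and read_b, while read_c and read_d land in the two "only" files.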

May I ask why you want to do this?

**Fernas** · 04-22-2013, 12:03 PM

Thanks mknut for your reply.

Well... comm may not work here, because each read has 4 lines and I need to select reads that have the same sequence (but may differ in their quality string). So I am wondering if there is an alternative?
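
One possible alternative, as a rough sketch rather than a tested pipeline: match reads on the sequence line only, using awk. This assumes every record is exactly four lines (no wrapped sequences) and that the sequences of one file fit in memory. The file names rep1.fastq, rep2.fastq and rep1_common.fastq are hypothetical, and the contents are toy data.

```shell
# Toy FASTQ files (hypothetical names and contents), four lines per read.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n' > rep1.fastq
printf '@x1\nGGCC\n+\n####\n@x2\nCCCC\n+\n####\n@x3\nACGT\n+\n####\n' > rep2.fastq

# First file on the command line (rep2): remember every sequence line.
# Second file (rep1): buffer each 4-line record and print it only if
# its sequence was seen in rep2. Quality strings are never compared.
awk '
    NR == FNR    { if (FNR % 4 == 2) seen[$0] = 1; next }
    FNR % 4 == 1 { header = $0 }
    FNR % 4 == 2 { sequence = $0 }
    FNR % 4 == 3 { plus = $0 }
    FNR % 4 == 0 { if (sequence in seen) print header "\n" sequence "\n" plus "\n" $0 }
' rep2.fastq rep1.fastq > rep1_common.fastq
```

The output keeps the rep1 copy of each shared read, including its rep1 quality string. Swapping the two file arguments gives the rep2 copies instead. A sequence that occurs several times in rep1 is printed every time it occurs.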

Regarding your question about why I am trying to do so:
Actually, I have two technical replicates of RNA-seq for my sample, so I have two libraries of reads (2 fastq files). In order to represent the sample by one library, I have two options:
1) Combine the reads from both files. This will increase coverage. However, artifact reads generated in either of the two libraries will not be detected.
2) Select only those reads that appear in both libraries (the same sequence, though possibly differing in sequencing quality). This option is applicable in my case because my technical replicates share the same library preparation procedure (same fragmentation, etc.). The only difference is that the Illumina sequencing machine was run twice!

Maybe there is another option that is better. If anyone has another suggestion, I would appreciate a reply to this message.

**mknut** · 04-22-2013, 01:02 PM

I am not entirely sure why you want to make one library from two technical replicates. If you keep the reads as technical replicates, you will preserve information about the variability introduced by the method - this is the idea behind having technical replicates in the first place. I think it would be better to continue with the analysis without merging the files: do QC and mapping for them separately, and use software that accommodates replicates (the majority does) in further analysis (e.g. cufflinks, cuffdiff). What exactly are you investigating, differential gene expression or something else?

One other thing -

my technical replicates do have the same library preparation procedure

Correct me if I'm wrong, but I understand that you had one sample, divided it into two, and then both halves went through the same library prep protocol and sequencing. In this case you will see not only variability originating from the sequencing, but variability originating from library prep as well. Have a look at this thread.

**Fernas** · 04-22-2013, 10:38 PM

Yes. I am studying differential gene expression between samples. I have n samples, each with m technical replicates, where m differs from one sample to another.

The thread you referred to is useful. My technical replicates of each sample are just "the same library was sequenced on m different lanes". So, as per that thread, the variance between these replicates reflects the technical variance of the Illumina sequencing itself. Consequently, my purpose is to remove this technical sequencing variance between technical replicates (e.g. artifact reads) and focus on the biological variance between samples. I am still convinced this is what I should do, but I am open to criticism, to find out whether this methodology can be replaced by a better one.

Just found this thread..
http://seqanswers.com/forums/showthread.php?t=16918

**mastal** · 04-23-2013, 05:50 AM

Find Common Reads between two FASTQ files

I would just combine the files for the technical replicates.

For example, see the vignette (documentation) for the Bioconductor package DESeq:

DESeq

http://bioconductor.org/packages/2.11/bioc/html/DESeq.html

Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution

**syfo** · 04-24-2013, 03:16 AM

Originally posted by Fernas:

My technical replicates of each sample are just "the same library was sequenced on m different lanes".

I would just merge them then, probably after the mapping step. You can still run some QC on them separately to make sure you observe a nice correlation between these technical replicates but my understanding is that results should not be different from what you would get by sequencing deeper.

**emp** · 01-21-2014, 04:52 AM

Hello all,

I have a similar problem. I have two fastq files, R1 and R2, which are not equal in size.

I want to separate out the equal reads from them for further mapping.

Is there any tool or command for this?
Kindly help...

**dpryan** · 01-21-2014, 05:03 AM

Have a look at this or this thread on biostars for a number of ways to do this.
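
For illustration only (this is not taken from the linked threads), one way to do it is a small awk filter that keeps the reads whose ID occurs in both files. It reads "equal reads" as "reads that have a mate in the other file", which is an assumption. It also assumes four-line records and read names that differ only by a trailing /1 or /2 or by text after the first space. The function name keep_paired and the output file names are hypothetical.

```shell
# Toy paired files (hypothetical contents); read b has no mate in R2.
printf '@a/1\nACGT\n+\nIIII\n@b/1\nGGCC\n+\nIIII\n@c/1\nTTAA\n+\nIIII\n' > R1.fastq
printf '@a/2\nTTTT\n+\nIIII\n@c/2\nCCGG\n+\nIIII\n' > R2.fastq

# keep_paired OTHER THIS: print the records of THIS whose read ID also occurs in OTHER.
# The ID is the first field of the header line with a trailing /1 or /2 removed.
keep_paired() {
    awk '
        function read_id(header,    parts) {
            split(header, parts, " ")
            sub(/\/[12]$/, "", parts[1])
            return parts[1]
        }
        NR == FNR    { if (FNR % 4 == 1) ids[read_id($0)] = 1; next }
        FNR % 4 == 1 { id = read_id($0); keep = (id in ids) }
        keep
    ' "$1" "$2"
}

keep_paired R2.fastq R1.fastq > R1.paired.fastq
keep_paired R1.fastq R2.fastq > R2.paired.fastq
```

Each output keeps the original order of its input, so the two paired files line up record by record as long as the inputs listed their shared reads in the same relative order.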
