SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Data Analysis Suggestion for a Newbie? (http://seqanswers.com/forums/showthread.php?t=28913)

krispy 04-02-2013 11:27 AM

Data Analysis Suggestion for a Newbie?
 
Hi all, I am a newbie in Bioinformatics. I've got data for paired-end DNA sequencing on Illumina platform and I am wondering if anyone is kind enough to give me some suggestions on how to start with the analyzing part. :p It is in fastq format, with average of 2 millions reads (~50bp) each for read 1 and read 2.

For preprocessing, I will not trim any of the reads (for downstream analysis purposes) but I am concerned about the low quality. It's either I trim all the reads with a fixed length, say...5bp(?), or I just leave them that way. The next I try to filter reads with at least 70% Q20 (any comment?)

Then I will have to map all these processed reads with a reference genome. How do I combine read 1 and read 2 and map them as a single read? I was told to use <cat> but is it the right way? How do I specify the gap between read 1 and read 2 then?

And for mapping, someone suggested me to use Tophat (any others?) and I did a prerun with the reads(after command cat) and map it to a reference (built with bowtie previously with reference.fasta) and I've got a bam file.

Here's the important part. How do I merge all the same repeated reads into one? After this, is there a way to calculate how many unique reads are there per gene? By unique I mean reads which start/mapped at different site, in one gene. Kinda similar to determining expression profile but maybe a little different.


ABCDEFGHIJKLMNOPQRSTUVWXYZ (reference gene)
CDEFGHIJ (1)
1234ABCDEFG (2)
KLMNOPQR (3)
UVWXYZ (4)
XYZ1234567 (5)
MNOPQRST (6)

For example, unique reads = 5.
Read 2: 1234ABCDEFG is not counted because it doesn't start from that gene.
Read 5: XYZ1234567 is counted because it starts from the gene region, although the sequence continues to span the adjacent region.

Is there a way to do so and how?

Sorry if this is getting too long. I just need some ideas on what kind of tools should I use, because by far I only know the names of a few like fastx toolkit and bowtie and tophat, I know there are many more but I am not familiar with their functions and etc. It will be great if anyone can give me a brief guideline, very much appreciated! :)

Have a nice day! :D

swbarnes2 04-02-2013 01:57 PM

For starters, don't worry about quality filtering your reads.

Software like bwa and bowtie are designed to take two separate fastqs for paired end data. If you cat them together, that's like submitting a single end file. Which is fine, but you lose the benefit of paired end reads. As long as the order of the reads in the two separate files is undisturbed, bwa and bowtie will understand that they are paired.

You can get rid of exact duplicate pairs with samtools rmdup, or Picard's MarkDuplicates.

samtools idxstats will tell you how many reads there are per sequence in your reference.

krispy 04-03-2013 12:05 AM

Thanks for the reply! :D

But would you mind further explain what do you mean by "lose the benefit of paired end reads"? Does it mean it will be better if I don't cat the two files into one and just let bwa or bowtie to take them separately since bwa/bowtie can recognize them as pair? :o

Simon Anders 04-03-2013 05:43 AM

Yes, this is what he means.

shi 04-05-2013 06:21 PM

Hi Krispy,

The Rsubread package (http://www.bioconductor.org/packages.../Rsubread.html) seems to be able to answer most of your questions, if you know how to program in R. Have a look at its vignette that describes that this package can do.

Cheers,
Wei


All times are GMT -8. The time now is 09:31 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.