04-02-2013, 12:27 PM   #1
Junior Member
Location: Singapore

Join Date: Apr 2013
Posts: 7
Data Analysis Suggestion for a Newbie?

Hi all, I am a newbie in bioinformatics. I have paired-end DNA sequencing data from the Illumina platform, and I am wondering if anyone would be kind enough to give me some suggestions on how to start the analysis. The data are in FASTQ format, with an average of 2 million reads (~50 bp) each for read 1 and read 2.

For preprocessing, I would rather not trim the reads (for downstream analysis purposes), but I am concerned about low-quality bases. Either I trim a fixed length from every read, say 5 bp(?), or I leave them as they are. Next, I plan to keep only reads in which at least 70% of the bases are Q20 or better (any comments?).
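To make the "at least 70% Q20" filter concrete, here is a minimal Python sketch of the check for a single read. It assumes Phred+33 quality encoding (the Sanger / Illumina 1.8+ convention; older Illumina pipelines used Phred+64, in which case `offset` would need to change). The function names `fraction_q20` and `passes_filter` are made up for this illustration and do not come from any toolkit.

```python
def fraction_q20(quality_string, offset=33, min_q=20):
    """Return the fraction of bases whose Phred score is >= min_q.

    Assumption: the quality string uses the given ASCII offset (33 here).
    """
    if not quality_string:
        return 0.0
    good = sum(1 for ch in quality_string if ord(ch) - offset >= min_q)
    return good / len(quality_string)


def passes_filter(quality_string, min_fraction=0.7, offset=33, min_q=20):
    """True if at least min_fraction of the bases reach min_q."""
    return fraction_q20(quality_string, offset, min_q) >= min_fraction


# Under Phred+33, 'I' encodes Q40 and '#' encodes Q2.
print(passes_filter("IIIIIIII##"))  # 8 of 10 bases are >= Q20 -> True
print(passes_filter("IIIII#####"))  # 5 of 10 bases are >= Q20 -> False
```

For paired-end data, a filter like this would have to drop both mates together whenever either one fails; otherwise the two files fall out of step.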

Then I will have to map the processed reads to a reference genome. How do I combine read 1 and read 2 and map them as a single read? I was told to use <cat>, but is that the right way? And how do I specify the gap between read 1 and read 2?

For mapping, someone suggested that I use TopHat (any others?). I did a trial run with the reads (after the cat command), mapped them to a reference (an index built previously with bowtie from reference.fasta), and got a BAM file.

Here's the important part. How do I merge all identical, repeated reads into one? After that, is there a way to calculate how many unique reads there are per gene? By unique I mean reads that start (map) at different sites within one gene. It is somewhat similar to determining an expression profile, but maybe a little different.

1234ABCDEFG (2)
XYZ1234567 (5)

For example, unique reads = 5.
Read 2: 1234ABCDEFG is not counted because it doesn't start from that gene.
Read 5: XYZ1234567 is counted because it starts from the gene region, although the sequence continues to span the adjacent region.

Is there a way to do so and how?

Sorry if this is getting too long. I just need some ideas on what kind of tools I should use. So far I only know the names of a few, like the fastx toolkit, bowtie and tophat. I know there are many more, but I am not familiar with what they do. It would be great if anyone could give me a brief guideline. Very much appreciated!

Have a nice day!
krispy
04-02-2013, 02:57 PM   #2
Senior Member
Location: San Diego

Join Date: May 2008
Posts: 912

For starters, don't worry about quality filtering your reads.

Software like bwa and bowtie is designed to take two separate FASTQ files for paired-end data. If you cat them together, that's like submitting a single-end file. That works, but you lose the benefit of paired-end reads. As long as the order of the reads in the two separate files is undisturbed, bwa and bowtie will understand that they are paired.
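If you want to confirm that the read order in the two files really is undisturbed before aligning, a quick check could look like the Python sketch below. It assumes plain 4-line FASTQ records (no wrapped sequence lines) and that mates share a read name, optionally followed by `/1` or `/2`. The helpers `read_id` and `pairs_in_sync` are hypothetical names for this example, not part of bwa or bowtie.

```python
import io
from itertools import zip_longest


def read_id(header_line):
    """Extract the read name from a FASTQ header line."""
    fields = header_line.strip()[1:].split()  # drop '@' and any comment
    name = fields[0] if fields else ""
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]  # drop the mate suffix
    return name


def pairs_in_sync(handle1, handle2):
    """Return True if both files list the same read names in the same order."""
    # Every 4th line, starting at the first, is a header in 4-line FASTQ.
    headers1 = (line for i, line in enumerate(handle1) if i % 4 == 0)
    headers2 = (line for i, line in enumerate(handle2) if i % 4 == 0)
    for h1, h2 in zip_longest(headers1, headers2):
        if h1 is None or h2 is None:
            return False  # one file has more reads than the other
        if read_id(h1) != read_id(h2):
            return False
    return True


# Tiny in-memory example; with real data, pass two open file handles instead.
r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII\n")
r2 = io.StringIO("@read1/2\nTTAA\n+\nIIII\n@read2/2\nCCGG\n+\nIIII\n")
print(pairs_in_sync(r1, r2))  # True
```

This is also why filtering each file independently is risky: dropping a read from only one file shifts every later pair.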

You can get rid of exact duplicate pairs with samtools rmdup, or Picard's MarkDuplicates.
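As a rough illustration of what collapsing duplicate pairs means, the Python sketch below treats two pairs as duplicates when both mates align to the same positions and strands. This is a simplification for explanation only, not the actual samtools or Picard implementation: those tools work on the BAM file directly and choose which copy to keep by quality, whereas this sketch simply keeps the first one seen. `AlignedPair` and `collapse_duplicates` are names invented for this example.

```python
from collections import namedtuple

# One record per aligned read pair; the fields are assumptions for this sketch.
AlignedPair = namedtuple("AlignedPair", "name chrom pos1 strand1 pos2 strand2")


def collapse_duplicates(pairs):
    """Keep the first pair seen for each (chrom, pos1, strand1, pos2, strand2)."""
    seen = set()
    kept = []
    for p in pairs:
        key = (p.chrom, p.pos1, p.strand1, p.pos2, p.strand2)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept


pairs = [
    AlignedPair("a", "chr1", 100, "+", 250, "-"),
    AlignedPair("b", "chr1", 100, "+", 250, "-"),  # same coordinates as "a"
    AlignedPair("c", "chr1", 100, "+", 260, "-"),  # mate maps elsewhere
]
print([p.name for p in collapse_duplicates(pairs)])  # ['a', 'c']
```

Note that pair "c" survives even though its first mate starts at the same site as "a". That is one of the benefits of keeping the reads paired: the second mate helps tell true duplicates apart from independent fragments.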

samtools idxstats will tell you how many reads there are per sequence in your reference.
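Note that idxstats counts per reference sequence, so it only gives per-gene numbers if each gene is its own sequence in the reference. For the count asked about in the original question (distinct start sites inside a gene), here is a minimal Python sketch. It assumes you have already extracted each read's chromosome and leftmost mapped position (for example from the BAM file with a library such as pysam, not shown here) and that you have gene coordinates in the same coordinate system. It ignores strand, so for reverse-strand reads the leftmost position is not the 5' end. `unique_starts_per_gene` is a hypothetical helper name.

```python
from collections import defaultdict


def unique_starts_per_gene(read_starts, genes):
    """Count distinct read start positions falling inside each gene.

    read_starts: iterable of (chrom, start) tuples, one per aligned read.
    genes: dict mapping gene name -> (chrom, gene_start, gene_end),
           both ends inclusive.
    """
    starts = defaultdict(set)
    for chrom, pos in read_starts:
        for gene, (g_chrom, g_start, g_end) in genes.items():
            if chrom == g_chrom and g_start <= pos <= g_end:
                starts[gene].add(pos)
    return {gene: len(starts[gene]) for gene in genes}


genes = {"geneA": ("chr1", 100, 200)}
reads = [
    ("chr1", 90),   # starts before the gene: not counted
    ("chr1", 150),  # starts inside the gene
    ("chr1", 150),  # same start site again: counted once
    ("chr1", 195),  # starts inside, even if the read runs past the gene end
    ("chr2", 150),  # different chromosome: not counted
]
print(unique_starts_per_gene(reads, genes))  # {'geneA': 2}
```

The nested loop is fine for a handful of genes; for a whole genome annotation you would want an interval index or a dedicated read-counting tool instead.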
swbarnes2
04-03-2013, 01:05 AM   #3
Junior Member
Location: Singapore

Join Date: Apr 2013
Posts: 7

Thanks for the reply!

But would you mind explaining further what you mean by "lose the benefit of paired end reads"? Does it mean it would be better not to cat the two files into one, and instead let bwa or bowtie take them separately, since bwa/bowtie can recognize them as pairs?
krispy
04-03-2013, 06:43 AM   #4
Simon Anders
Senior Member
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994

Yes, this is what he means.
04-05-2013, 07:21 PM   #5
Wei Shi
Location: Australia

Join Date: Feb 2010
Posts: 235

Hi Krispy,

The Rsubread package seems to be able to answer most of your questions, if you know how to program in R. Have a look at its vignette, which describes what this package can do.


bioinformatics analysis, illumina fastq, mapping, read alignment, read count
