Hi all, I am a newbie in bioinformatics. I have paired-end Illumina DNA sequencing data in FASTQ format, with an average of 2 million reads (~50 bp) each for read 1 and read 2, and I am wondering if anyone would be kind enough to give me some suggestions on how to start the analysis.
For preprocessing, I would rather not trim reads to different lengths (for downstream analysis purposes), but I am concerned about low quality. Should I trim all reads by a fixed length, say 5 bp, or just leave them as they are? Next, I plan to keep only reads in which at least 70% of bases are at Q20 or above (any comments?).
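From what I have gathered, the FASTX-Toolkit's fastq_quality_filter can express this exact rule. A sketch, assuming Phred+33 quality encoding and the file names reads_1.fastq / reads_2.fastq (both assumptions, adjust to your data):

```shell
# Keep only reads in which at least 70% of bases have quality >= Q20.
# -Q33 assumes Phred+33 (Sanger / Illumina 1.8+) quality encoding.
fastq_quality_filter -Q33 -q 20 -p 70 -i reads_1.fastq -o reads_1.filtered.fastq
fastq_quality_filter -Q33 -q 20 -p 70 -i reads_2.fastq -o reads_2.filtered.fastq
```

One caveat: filtering the two files independently can discard one mate of a pair, so the files may need to be re-paired before mapping.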
Then I will have to map the processed reads to a reference genome. How do I handle read 1 and read 2? Can they be combined and mapped as a single read? I was told to use cat, but is that the right way? And how would I then specify the gap between read 1 and read 2?
For mapping, someone suggested TopHat (any other suggestions?). I did a pre-run with the concatenated reads, mapped them against a reference index (built previously with Bowtie from reference.fasta), and got a BAM file.
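From what I have read since, paired-end reads are normally not concatenated with cat; instead, the two FASTQ files are passed to the aligner separately, and the aligner itself accounts for the gap (insert size) between the mates. A sketch with Bowtie (the index name ref_index and the 500 bp insert limit are placeholders I made up):

```shell
# Build the index once from the reference FASTA.
bowtie-build reference.fasta ref_index

# Map the pairs: -1/-2 supply the two mate files, -S emits SAM output,
# -X caps the allowed insert size between mates at 500 bp.
bowtie -S -X 500 ref_index -1 reads_1.fastq -2 reads_2.fastq aligned.sam

# Convert SAM to BAM for downstream tools.
samtools view -bS aligned.sam > aligned.bam
```

Also, as I understand it, TopHat is a spliced aligner aimed at RNA-seq; for genomic DNA reads, plain Bowtie or BWA may be the better fit.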
Here's the important part. How do I collapse all the identical repeated reads into one? After that, is there a way to count how many unique reads there are per gene? By unique I mean reads that start (map) at distinct sites within a gene. It's somewhat similar to determining an expression profile, but maybe a little different.
ABCDEFGHIJKLMNOPQRSTUVWXYZ (reference gene)
CDEFGHIJ (1)
1234ABCDEFG (2)
KLMNOPQR (3)
UVWXYZ (4)
XYZ1234567 (5)
MNOPQRST (6)
For example, unique reads = 5.
Read 2 (1234ABCDEFG) is not counted because it does not start within the gene.
Read 5 (XYZ1234567) is counted because it starts within the gene region, even though its sequence extends into the adjacent region.
Is there a way to do so and how?
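In case it helps clarify what I mean: after collapsing duplicates (samtools rmdup on a sorted BAM seems to be one option), the counting rule could be sketched in Python. The positions below just mirror the lettered example; in practice the start coordinates would come from the BAM (e.g. via pysam):

```python
# Count "unique reads" for one gene: reads whose mapped start position
# lies inside the gene, collapsing reads that share the same start.

def count_unique_starts(gene_start, gene_end, read_starts):
    """Number of distinct start positions within [gene_start, gene_end)."""
    return len({s for s in read_starts if gene_start <= s < gene_end})

# The gene spans positions 0-25 (letters A..Z in the example above).
# Read 2 starts upstream (position -4), so it is excluded; read 5
# starts at position 23, inside the gene, so it counts even though
# it extends past the gene's end.
read_starts = [2, -4, 10, 20, 23, 12]  # reads (1)-(6) from the example
print(count_unique_starts(0, 26, read_starts))  # -> 5
```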
Sorry if this is getting too long. I just need some ideas about which tools to use; so far I only know a few by name, like the FASTX-Toolkit, Bowtie, and TopHat. I know there are many more, but I am not familiar with their functions. It would be great if anyone could give me a brief guideline; very much appreciated!
Have a nice day!