

05-19-2011, 10:26 AM   #1
Location: london, england

Join Date: May 2011
Posts: 12
I'm a total newbie and would love some help!

Hello everyone,

First of all, you have no idea how great it feels to find this place. I'm a programmer/comp. scientist and totally new to bioinformatics. I need to do some analysis for a job and I have no clue about many things.

Here's the story:

I am given a huge data set for 15 subjects with some sort of disease and I'm supposed to analyse it to find out common SNPs for that disease. Each data set contains a huge amount of fastq sequences such as this one:


My questions are (I apologize in advance if they are too many and too stupid to ask, please bear with me):

1- I know now that I need to do adapter trimming and quality filtering. For that, don't I need to be given the 'adapter' sequences? If not, where do I find common adapter sequences? Also, how long should the adapter sequences be? As you can see, my fastq sequence is 36 bp long. How do I determine how much of it is adapter?

2- What are indexed samples? Or, better said: what is my reference for indexing my cleaned-up sequence?

3- Can you give an outline of the logical steps that I will need to follow to implement my analysis? (All I need is the order of the steps at a very high level and I will dig into it.)

4- Why do I need two files each with a read from a different direction, for each subject?

Thank you, much appreciated and sorry
zhacker
05-19-2011, 08:24 PM   #2
Senior Member
Location: WashU

Join Date: Aug 2010
Posts: 115

Welcome to the wonderful world of bioinformatics, zhacker.

Many of the questions you ask have probably been answered elsewhere in SeqAnswers. As you search the forums you will also learn many other useful things that you may not have thought about yet. In the future, I would also recommend that you structure your posts a bit more to cover one topic at a time and use a descriptive subject line. This will be the most useful to future SeqAnswers newcomers. Briefly, here are some comments on your specific questions.

1.) Adapter trimming. You may or may not need to do this. It depends on how your libraries were constructed. You don't say what platform your sequence comes from but I'm going to assume Illumina. Even within this platform there are a number of different ways that libraries are being made. There are some popular approaches, but by no means a universal standard. Yes, ideally you would be told what the adapter sequences are. If possible, get in contact with the person/group that made the libraries and learn as much as you can about the details. This will help you understand where the data is coming from. If that is not possible, perhaps you can at least determine which kits were used and you can read the manuals yourself. If your reads are only 36 bp long, hopefully none of the sequence is adapter, or what is left after trimming will be very short by today's standards.
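To make the idea of trimming concrete, here is a minimal sketch in Python. It assumes the adapter sits at the 3' end of the read and matches exactly; real trimming tools also tolerate sequencing errors. The `ADAPTER` string and the function name `trim_adapter` are placeholders for illustration only, and the real adapter sequence has to come from whoever built the libraries.

```python
# Placeholder adapter for illustration only; use the sequence from your
# own library prep, not this one.
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter, or a truncated start of it, from a read.

    Exact matching only: no mismatches or indels are tolerated.
    """
    # Case 1: the complete adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter, so only the
    # first `overlap` bases of the adapter are present at the 3' end.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adapter found: return the read unchanged.
    return read


# A 16 bp read whose last 6 bases are the start of the adapter.
print(trim_adapter("ACGTACGTACAGATCG", ADAPTER))  # ACGTACGTAC
```

The `min_overlap` parameter is a trade-off: a small value removes more partial adapters but also clips genuine bases that happen to look like the start of the adapter.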

2.) Not sure about this question. Indexing in the context of 'samples' usually refers to adding a short, known sequence (the index) to each DNA fragment in a library. Each library (e.g. patient) is constructed with a different index. The index thus acts as a barcode for the library. This allows you to physically mix multiple libraries and sequence them as a pool within the same lane of a flowcell. Then, using the barcode, you can separate the reads computationally during analysis and figure out which patient each came from. Before indexing was possible you could not sequence more than 8 patients on a flowcell (because each flowcell only has 8 lanes).
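Separating pooled reads by barcode can be sketched in a few lines of Python. This is only an illustration: the index sequences, the patient labels and the function name `demultiplex` are made up, and it assumes the index has already been read out separately for each read and matches exactly.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Group (index, sequence) pairs by the sample the index belongs to.

    Reads whose index is not in the table go to 'undetermined'.
    """
    by_sample = defaultdict(list)
    for index, sequence in reads:
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Hypothetical index table: one barcode per library (patient).
index_to_sample = {"ATCACG": "patient_01", "CGATGT": "patient_02"}

# Hypothetical pooled reads from one lane, as (index, sequence) pairs.
pooled = [
    ("ATCACG", "ACGT"),
    ("CGATGT", "TTTT"),
    ("ATCACG", "GGGG"),
    ("NNNNNN", "CCCC"),  # index could not be read
]

for sample, sequences in demultiplex(pooled, index_to_sample).items():
    print(sample, sequences)
```

In practice this step is often done by the sequencing facility before you receive the files, so it is worth asking whether your data has already been split by sample.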

3.) It is not possible to give a good outline of the analysis steps required without knowing more about the experiment. You say you want to look for SNPs. Do you really mean SNPs (polymorphisms) or are you looking for de novo mutations (commonly referred to as SNVs)? Is the disease a cancer? If so, do you have a matched normal and tumor? Or is it another disease, in which case do you have an affected child and unaffected parent to compare? Again, I would suggest re-posting this question separately with a more detailed explanation of the experiment and goals. But in general the analysis might go something like this:

a.) Align each pair of fastq files to a reference genome (using BWA or Tophat, for example).
b.) Identify SNVs in each library relative to the reference genome (using SNVmix, for example).
c.) Determine the subset of these that are not already known to be SNPs (by comparing to dbSNP, for example).
d.) If you have a comparison sample, determine those SNVs that are present in the disease and not in the healthy sample (in cancer we would call these the 'somatic' SNVs).
e.) Classify the SNVs according to their locations in the genome (i.e. which are within exons, splice sites, introns, UTRs, intergenic space).
f.) For those that are within exons, determine which are likely to affect protein sequence/function (e.g. which are non-synonymous or truncating).
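Once variants have been called, the two filtering steps in this outline (dropping known SNPs and dropping anything also seen in the healthy sample) come down to set differences. The sketch below assumes each variant is represented as a `(chromosome, position, reference base, alternate base)` tuple; that representation, the function names and all of the example positions are invented for illustration and are not the output format of any particular caller.

```python
def novel_variants(calls, known_snps):
    """Variants that are not already listed as known SNPs."""
    return set(calls) - set(known_snps)


def somatic_variants(disease_calls, normal_calls):
    """Variants seen in the disease sample but not in the matched normal."""
    return set(disease_calls) - set(normal_calls)


# Made-up calls: (chromosome, position, reference base, alternate base).
disease = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T"), ("chr3", 42, "G", "A")}
normal = {("chr2", 500, "C", "T")}
known = {("chr3", 42, "G", "A")}

candidates = novel_variants(somatic_variants(disease, normal), known)
print(sorted(candidates))  # [('chr1', 1000, 'A', 'G')]
```

Comparing whole tuples means a variant only counts as known if the position and both alleles match, which is stricter than matching on position alone.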

4.) Why do you have two files for each subject? This may mean that you have 'paired' read data. A common approach is to take DNA from a subject and fragment it into small pieces. Chromosomes are huge (many millions of bases long) but these fragments will be small (e.g. 200-500 bp). These fragments are what you are actually sequencing. But the read lengths of next-gen sequencers are generally still too short to sequence the entire fragment. For example, you might have reads of 36 bp to 150 bp using the Illumina sequencer. Note that the length is not variable within a sequencing run; rather, you run the sequencing reaction for a certain number of cycles and this determines your read length for that data set. Anyway, the important point is that you cannot get all the way through the fragments. But what you can do is start a second read from the other side of each fragment. This is a common strategy because it gives you two reads that you know are separated by a certain distance and came from the same physical fragment of DNA. This greatly improves your ability to map the reads back to a reference genome and infer where the fragment actually came from. In the fastq format, these two reads are stored in two separate files, where each record (four lines per read) in file1 corresponds to the record at the same position in file2.

Note that your data may or may not be paired-end data. It is also possible that you just have two lanes of data for each library. You may be able to tell from the read names or file names whether your data is paired or not. You can also tell by mapping them and seeing if they appear to be paired. Really, this is information that you should confirm with whoever did the sequencing...
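A rough way to check the read names is sketched below in Python. It assumes plain four-line fastq records and that mates either share the same name or differ only by a trailing `/1` or `/2`; other naming schemes would need a different comparison. The function names and the tiny in-memory example are hypothetical.

```python
from itertools import islice


def read_names(fastq_lines):
    """Yield the read name from each 4-line fastq record."""
    # Every 4th line, starting with the first, is a header like '@name ...'.
    for header in islice(fastq_lines, 0, None, 4):
        # Drop the leading '@' and anything after the first whitespace.
        yield header.strip()[1:].split()[0]


def strip_mate_suffix(name):
    """Remove a trailing '/1' or '/2' mate marker, if present."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def looks_paired(lines1, lines2, max_records=1000):
    """True if the first records of both inputs have matching read names."""
    names1 = islice(read_names(lines1), max_records)
    names2 = islice(read_names(lines2), max_records)
    checked = 0
    for name1, name2 in zip(names1, names2):
        if strip_mate_suffix(name1) != strip_mate_suffix(name2):
            return False
        checked += 1
    return checked > 0


# Tiny made-up example; with real data, pass two open file handles instead.
file1 = ["@read1/1", "ACGT", "+", "IIII", "@read2/1", "TTTT", "+", "IIII"]
file2 = ["@read1/2", "GGCA", "+", "IIII", "@read2/2", "AAAA", "+", "IIII"]
print(looks_paired(file1, file2))  # True
```

This only compares the first `max_records` records, so it is a quick sanity check and not proof that the two files are fully in sync.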
malachig
05-24-2011, 06:59 AM   #3
Location: london, england

Join Date: May 2011
Posts: 12

Thank you very much. After re-reading your post many times, I now have a better idea of many things that I was oblivious to.
zhacker
