**malachig** · 05-19-2011, 08:24 PM

Welcome to the wonderful world of bioinformatics, zhacker.

Many of the questions you ask have probably been answered elsewhere in SeqAnswers. As you search the forums you will also learn many other useful things that you may not have thought about yet. In the future, I would also recommend that you structure your posts a bit more to cover one topic at a time and use a descriptive subject line. This will be the most useful to future SeqAnswers newcomers. Briefly, here are some comments on your specific questions.

1.) Adapter trimming. You may or may not need to do this. It depends on how your libraries were constructed. You don't say what platform your sequence comes from but I'm going to assume Illumina. Even within this platform there are a number of different ways that libraries are being made. There are some popular approaches, but by no means a universal standard. Yes, ideally you would be told what the adapter sequences are. If possible, get in contact with the person/group that made the libraries and learn as much as you can about the details. This will help you understand where the data is coming from. If that is not possible, perhaps you can at least determine which kits were used and read the manuals yourself. If your reads are only 36 bp long, hopefully none of the sequence is adapter, or the usable part of each read will be very short by today's standards.
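Once you do know the adapter sequence, a quick sanity check is to count how many reads contain it. Here is a minimal Python sketch, assuming an uncompressed fastq with 4 lines per record. The function names are just for illustration, it only looks for exact matches to the first few bases of the adapter, and you have to supply the real adapter sequence from your kit documentation yourself:

```python
def read_sequences(fastq_path):
    """Yield only the sequence line of each 4-line fastq record."""
    with open(fastq_path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # line 2 of every record is the sequence
                yield line.strip()


def fraction_with_adapter(sequences, adapter, min_match=8):
    """Fraction of reads containing the first `min_match` bases of `adapter`.

    Exact matching only, so reads with sequencing errors in the adapter
    or with fewer than `min_match` adapter bases at the 3' end are missed.
    """
    seed = adapter[:min_match]
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if seed in seq:
            hits += 1
    return hits / total if total else 0.0
```

If the fraction is close to zero you probably do not need to trim. If it is substantial, use a dedicated trimming tool rather than extending this sketch.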

2.) Not sure about this question. Indexing in the context of 'samples' usually refers to adding a linker to each DNA fragment in a library. Each library (e.g. patient) is constructed with a different index. The index thus acts as a barcode for the library. This allows you to physically mix multiple libraries and sequence them as a pool within the same lane of a flowcell. Then using the barcode, during analysis you can separate the reads computationally and figure out which patient each came from. Before indexing was possible you could not sequence more than 8 patients on a flowcell (because each flowcell only has 8 lanes).
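To make the "separate the reads computationally" step concrete, here is a rough Python sketch. It assumes the barcode sits at the very start of each read and that all barcodes have the same length. That is only one possible design (on Illumina the index is often delivered as a separate read or in the read name), and the barcodes and sample names in the usage note are made up:

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group (name, sequence) reads by the barcode at the start of each read.

    Assumes all barcodes have the same length and must match exactly.
    The barcode is removed from the sequence that is returned.
    """
    if not barcode_to_sample:
        raise ValueError("no barcodes given")
    barcode_len = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    for name, seq in reads:
        barcode = seq[:barcode_len]
        # Reads whose barcode is not recognised are kept apart.
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append((name, seq[barcode_len:]))
    return dict(by_sample)
```

For example, calling `demultiplex(reads, {"ACGT": "patientA", "TGCA": "patientB"})` returns a dictionary with one list of reads per patient, plus an `"unassigned"` list for anything that did not match. In practice the sequencing facility usually does this step for you.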

3.) It is not possible to give a good outline of the analysis steps required without knowing more about the experiment. You say you want to look for SNPs. Do you really mean SNPs (polymorphisms) or are you looking for de novo mutations (commonly referred to as SNVs)? Is the disease a cancer? If so, do you have matched normal and tumor samples? Or is it another disease, in which case do you have an affected child and unaffected parent to compare? Again, I would suggest re-posting this question on its own with a more detailed explanation of the experiment and goals. But in general the analysis might go something like this:

Align each pair of fastq files to a reference genome (using BWA or TopHat, for example). Identify SNVs in each library relative to the reference genome (using SNVmix, for example). Determine the subset of these that are not already known to be SNPs (by comparing to dbSNP, for example). If you have a comparison sample, determine those SNVs that are present in the disease sample and not in the healthy sample (in cancer we would call these the 'somatic' SNVs). Classify the SNVs according to their locations in the genome (i.e. which are within exons, splice sites, introns, UTRs, intergenic space). For those that are within exons, determine which are likely to affect protein sequence/function (e.g. which are non-synonymous or truncating).
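The two filtering steps in the middle (remove known SNPs, then remove anything also seen in the healthy sample) boil down to set subtraction. A toy Python sketch, assuming each variant has already been reduced to a `(chromosome, position, ref, alt)` tuple; that representation and the function name are my own simplification, not the output format of any particular caller:

```python
def somatic_novel_snvs(disease_snvs, normal_snvs, known_snps):
    """Return SNVs seen in the disease sample that are neither known
    polymorphisms nor present in the matched healthy sample.

    Each argument is a set of (chromosome, position, ref, alt) tuples.
    """
    novel = disease_snvs - known_snps   # drop anything already in e.g. dbSNP
    return novel - normal_snvs          # drop anything also in the normal
```

Real variant calls also carry quality scores and read depth, and a variant can be missed in the normal simply because coverage was low there, so treat a plain set difference as a first pass only.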

4.) Why do you have two files for each subject? This may mean that you have 'paired' read data. A common approach is to take DNA from a subject and fragment it into small pieces. Chromosomes are huge (many millions of bases long) but these fragments will be small (e.g. 200-500 bp). These fragments are what you are actually sequencing. But the read lengths of next-gen sequencers are generally still too short to sequence the entire fragment. For example, you might have reads of 36 bp to 150 bp using the Illumina sequencer. Note that the length is not variable within a sequencing run; rather, you run the sequencing reaction for a certain number of cycles and this determines your read length for that data set. Anyway, the important point is that you cannot get all the way through the fragments. But what you can do is start a second read from the other side of each fragment. This is a common strategy because it gives you two reads that you know are separated by a certain distance and came from the same physical fragment of DNA. This greatly improves your ability to map the reads back to a reference genome and infer where the fragment actually came from. In the fastq format, these two reads are stored in two separate files with the reads in the same order, so the Nth record (a block of four lines) in file1 is the mate of the Nth record in file2.
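If you want to walk through two such files together, a minimal Python sketch looks like this. It assumes plain (uncompressed) fastq files with exactly four lines per record and no blank lines in the middle; the function names are just for illustration:

```python
from itertools import zip_longest


def read_fastq(path):
    """Yield (name, sequence, quality) for each 4-line fastq record."""
    with open(path) as handle:
        while True:
            name = handle.readline().rstrip()
            if not name:  # end of file (or a blank line)
                return
            seq = handle.readline().rstrip()
            handle.readline()  # the '+' separator line, ignored
            qual = handle.readline().rstrip()
            yield name, seq, qual


def read_pairs(path1, path2):
    """Yield (record1, record2) for the Nth record of each file."""
    for rec1, rec2 in zip_longest(read_fastq(path1), read_fastq(path2)):
        if rec1 is None or rec2 is None:
            raise ValueError("fastq files have different numbers of records")
        yield rec1, rec2
```

If the two files do not contain the same number of records, that is already a strong hint that they are not a simple read 1 / read 2 pair.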

Note that your data may or may not be paired-end data. It is also possible that you just have two lanes of data for each library. You may be able to tell from the read names or file names whether your data is paired or not. You can also tell by mapping them and seeing if they appear to be paired. Really this is information that you should confirm with whoever did the sequencing...
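As a rough way to check the read names, you can compare the first few names from each file after stripping the mate suffix. This Python sketch assumes the two common Illumina naming styles, a trailing `/1` or `/2`, or a space followed by a comment field; other pipelines may name reads differently, so a negative result is not conclusive. The function names are my own:

```python
def mate_base_name(read_name):
    """Strip the leading '@', any comment after whitespace, and a /1 or /2 suffix."""
    name = read_name.lstrip("@").split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def looks_paired(names1, names2):
    """True if the names at each position agree once mate suffixes are removed.

    Only compares positions present in both lists, so pass the same
    number of names from each file (e.g. the first 1000).
    """
    return all(
        mate_base_name(a) == mate_base_name(b)
        for a, b in zip(names1, names2)
    )
```

If the names at the same positions do not match at all, you most likely have two independent lanes rather than read 1 and read 2 of the same fragments.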

**zhacker** · 05-24-2011, 06:59 AM

Thank you very much. After re-reading your post many times, I now have a better idea of many things that I was oblivious to.

I'm a total newbie and would love some help!
