
  • I'm a total newbie and would love some help!

    Hello everyone,

    First of all, you have no idea how great it feels to find this place. I'm a programmer/comp. scientist and totally new to bioinformatics. I need to do some analysis for a job and I have no clue about many things.

    Here's the story:

    I have been given a huge data set for 15 subjects with some sort of disease, and I'm supposed to analyse it to find SNPs that are common to that disease. Each data set contains a huge number of fastq records such as this one:

    @1:1:1166:20230:Y
    GAATGTAGATTTCTTCTAACACACAACACATNCATG
    +
    DDD?DBEEEED?DEE?EEDE?B5?CC########


    My questions are (I apologize in advance if they are too many and too stupid to ask, please bear with me):

    1- I now know that I need to do adapter trimming and quality filtering. For that, don't I need to be given the 'adapter' sequences? If not, where do I find common adapter sequences? Also, how long should the adapter sequences be? As you can see, my fastq sequence is 36 bp long. How do I determine how much of it is adapter?

    2- What are indexed samples? Or, better said: what is my reference for indexing my cleaned-up sequence?


    3- Can you give an outline of the logical steps that I will need to follow to implement my analysis? (All I need is the order of the steps at a very high level, and I will dig into it.)

    4- Why do I need two files each with a read from a different direction, for each subject?


    Thank you, much appreciated and sorry

  • #2
    Welcome to the wonderful world of bioinformatics, zhacker.

    Many of the questions you ask have probably been answered elsewhere in SeqAnswers. As you search the forums you will also learn many other useful things that you may not have thought about yet. In the future, I would also recommend that you structure your posts a bit more to cover one topic at a time and use a descriptive subject line. This will be the most useful to future SeqAnswers newcomers. Briefly, here are some comments on your specific questions.

    1.) Adapter trimming. You may or may not need to do this. It depends on how your libraries were constructed. You don't say what platform your sequence comes from, but I'm going to assume Illumina. Even within this platform there are a number of different ways that libraries are being made. There are some popular approaches, but by no means a universal standard. Yes, ideally you would be told what the adapter sequences are. If possible, get in contact with the person/group that made the libraries and learn as much as you can about the details. This will help you understand where the data is coming from. If that is not possible, perhaps you can at least determine which kits were used and read the manuals yourself. If your reads are only 36 bp long, hopefully none of the sequence is adapter, or what is left after trimming will be very short by today's standards.
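To make the trimming idea concrete, here is a minimal Python sketch. It assumes you already know the 3' adapter sequence and only handles exact matches; the adapter string and the function name `trim_adapter` are placeholders for illustration, not something taken from your data or from any particular kit manual. Dedicated trimming tools also tolerate sequencing errors, which this sketch does not.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter (exact match only) from a read.

    Handles a full adapter anywhere in the read, and a partial adapter
    (a prefix of it) hanging off the end of the read.
    """
    # Case 1: the complete adapter occurs inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with the first k bases of the adapter.
    # Try the longest possible overlap first.
    for k in range(min(len(adapter), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


# Placeholder adapter; substitute the sequence used for your libraries.
ADAPTER = "AGATCGGAAGAGC"

print(trim_adapter("ACGTACGTACGT" + ADAPTER + "TTT", ADAPTER))  # full adapter
print(trim_adapter("ACGTACGTACGTAGATCG", ADAPTER))              # partial adapter
print(trim_adapter("ACGTACGTACGT", ADAPTER))                    # no adapter
```

All three calls print the same 12-base insert, `ACGTACGTACGT`. The `min_overlap` parameter is a design choice: very short overlaps (1-2 bases) match by chance too often, so they are ignored here.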

    2.) Not sure about this question. Indexing in the context of 'samples' usually refers to adding a linker to each DNA fragment in a library. Each library (e.g. patient) is constructed with a different index. The index thus acts as a barcode for the library. This allows you to physically mix multiple libraries and sequence them as a pool within the same lane of a flowcell. Then, during analysis, you can use the barcode to separate the reads computationally and figure out which patient each came from. Before indexing was possible, you could not sequence more than 8 patients on a flowcell (because each flowcell only has 8 lanes).
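The barcode separation step can be sketched roughly as follows. This is only an illustration: it assumes each read arrives together with its index sequence already extracted, and the barcode-to-patient table, the sample names, and the function name `demultiplex` are all invented.

```python
from collections import defaultdict

# Hypothetical barcode table; real index sequences come from the library kit.
BARCODES = {"ACGT": "patient_01", "TGCA": "patient_02"}


def demultiplex(indexed_reads, barcodes):
    """Group read sequences by the sample their index sequence belongs to.

    indexed_reads: iterable of (index_sequence, read_sequence) tuples.
    Reads whose index is not in the table go into an 'unassigned' bin.
    """
    bins = defaultdict(list)
    for index_seq, read_seq in indexed_reads:
        sample = barcodes.get(index_seq, "unassigned")
        bins[sample].append(read_seq)
    return dict(bins)


# A made-up pool of reads from one lane.
pooled = [
    ("ACGT", "GAATGTAGAT"),
    ("TGCA", "TTCTTCTAAC"),
    ("ACGT", "ACACAACACA"),
    ("NNNN", "TNCATG"),
]
by_sample = demultiplex(pooled, BARCODES)
for sample, reads in sorted(by_sample.items()):
    print(sample, reads)
```

This matches index sequences exactly; an index read containing an error simply lands in the `unassigned` bin.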

    3.) It is not possible to give a good outline of the analysis steps required without knowing more about the experiment. You say you want to look for SNPs. Do you really mean SNPs (polymorphisms), or are you looking for de novo mutations (commonly referred to as SNVs)? Is the disease a cancer? If so, do you have matched normal and tumor samples? Or is it another disease, in which case do you have an affected child and unaffected parent to compare? Again, I would suggest re-posting this question specifically, with a more detailed explanation of the experiment and goals. But in general the analysis might go something like this:

    Align each pair of fastq files to a reference genome (using BWA or Tophat, for example). Identify SNVs in each library relative to the reference genome (using SNVmix, for example). Determine the subset of these that are not already known to be SNPs (by comparing to dbSNP, for example). If you have a comparison sample, determine those SNVs that are present in the disease sample and not in the healthy sample (in cancer we would call these the 'somatic' SNVs). Classify the SNVs according to their locations in the genome (i.e. which are within exons, splice sites, introns, UTRs, or intergenic space). For those that are within exons, determine which are likely to affect protein sequence/function (e.g. which are non-synonymous or truncating).
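Once alignment and SNV calling are done, the filtering steps in that outline boil down to set operations. Here is a toy Python sketch of that part only. Every variant, the "known SNP" set (a stand-in for a dbSNP lookup), and the exon coordinates are invented, and representing a variant as a `(chromosome, position, ref, alt)` tuple is an assumption made for illustration.

```python
# Each variant is (chromosome, position, reference_base, alternate_base).
# All values below are invented for illustration.
disease_snvs = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2500, "C", "T"),
    ("chr2", 300, "G", "A"),
    ("chr3", 42, "T", "C"),
}
normal_snvs = {("chr1", 2500, "C", "T")}   # calls from the comparison sample
known_snps = {("chr2", 300, "G", "A")}     # stand-in for a dbSNP lookup

# Hypothetical exon intervals per chromosome (1-based, inclusive).
EXONS = {"chr1": [(900, 1100)], "chr3": [(500, 800)]}


def in_exon(variant, exons):
    """True if the variant position falls inside any exon on its chromosome."""
    chrom, pos, _ref, _alt = variant
    return any(start <= pos <= end for start, end in exons.get(chrom, []))


novel = disease_snvs - known_snps      # not already known SNPs
somatic = novel - normal_snvs          # present in disease, absent in normal
exonic = sorted(v for v in somatic if in_exon(v, EXONS))

print(sorted(somatic))
print(exonic)
```

With real data the sets would be loaded from the variant caller's output rather than typed in, and deciding whether an exonic change is non-synonymous needs the gene annotation, which this sketch leaves out.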

    4.) Why do you have two files for each subject? This may mean that you have 'paired' read data. A common approach is to take DNA from a subject and fragment it into small pieces. Chromosomes are huge (many millions of bases long) but these fragments will be small (e.g. 200-500 bp). These fragments are what you are actually sequencing. But the read lengths of next-gen sequencers are generally still too short to sequence the entire fragment. For example, you might have reads of 36 bp to 150 bp using the Illumina sequencer. Note that the length does not vary within a sequencing run; rather, you run the sequencing reaction for a certain number of cycles, and this determines the read length for that data set. Anyway, the important point is that you cannot get all the way through the fragments. But what you can do is start a second read from the other side of each fragment. This is a common strategy because it gives you two reads that you know are separated by a certain distance and came from the same physical fragment of DNA. This greatly improves your ability to map the reads back to a reference genome and infer where the fragment actually came from. In the fastq format, these two reads are stored in two separate files, where the nth record (four lines) in file1 corresponds to the nth record in file2.
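Since you are a programmer, a short Python sketch of that file layout may help. It assumes the simple four-lines-per-record fastq layout and walks two files in lockstep; the read names and sequences are invented, and the function names `read_fastq` and `read_pairs` are just illustrative.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record fastq stream."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of file (or a truncated final record)
        name, seq, _plus, qual = record
        yield name[1:], seq, qual  # drop the leading '@' from the name


def read_pairs(handle1, handle2):
    """Yield (read1, read2); the nth record of each file forms a pair."""
    return zip(read_fastq(handle1), read_fastq(handle2))


# Invented two-record files, held in memory for the demo.
file1 = io.StringIO("@frag1/1\nACGT\n+\nIIII\n@frag2/1\nGGCC\n+\nIIII\n")
file2 = io.StringIO("@frag1/2\nTTAA\n+\nIIII\n@frag2/2\nCCGG\n+\nIIII\n")

pairs = list(read_pairs(file1, file2))
for read1, read2 in pairs:
    print(read1[0], read1[1], "<->", read2[0], read2[1])
```

For real files you would pass handles from `open(...)` instead of `io.StringIO`. Note that `zip` stops silently at the shorter file, so a production version should check that both files contain the same number of records.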

    Note that your data may or may not be paired-end data. It is also possible that you just have two lanes of data for each library. You may be able to tell from the read names or file names whether your data is paired or not. You can also tell by mapping them and seeing if they appear to be paired. Really, this is information that you should confirm with whoever did the sequencing.
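As a rough first check on the read names, something like the Python sketch below could be used. It assumes the older naming convention where mates carry a trailing `/1` or `/2`; your files may use a different convention, so treat a negative result as a hint, not proof. The function names and the example read names are made up.

```python
def base_name(read_name):
    """Strip a trailing /1 or /2 mate suffix, if present."""
    if read_name.endswith(("/1", "/2")):
        return read_name[:-2]
    return read_name


def looks_paired(names1, names2):
    """True if both files list the same fragments in the same order."""
    if len(names1) != len(names2):
        return False
    return all(base_name(a) == base_name(b) for a, b in zip(names1, names2))


# Invented read names: the first two lists look like mates,
# the last two look like independent lanes.
print(looks_paired(["frag1/1", "frag2/1"], ["frag1/2", "frag2/2"]))
print(looks_paired(["laneA_r1", "laneA_r2"], ["laneB_r1", "laneB_r2"]))
```

The first call prints `True` and the second prints `False`. In practice you would collect the names from the first few thousand records of each file instead of typing them in.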



    • #3
      Thank you very much. After re-reading your post many times, I now have a better idea of many things that I was oblivious to.
