  • use of export/sequence data

    Dear all,
    I am analysing sequencing data from pooled samples for a candidate gene, looking for rare variants. Using the data from the Illumina pipeline, I first took the filtered s_N_sequence.txt data and mapped it to my candidate gene. If I understand correctly, it is filtered by how well it aligns to the human genome, using certain parameters.
    If I repeat my analysis using the unfiltered data, which is s_N_export.txt, I get a better depth of coverage.
    Is it OK to use this data, or am I introducing errors?
    Because I already have some PCR-introduced errors, I am filtering out very low frequency SNPs from my data, so any very low frequency errors from the sequencing data will be filtered out here too.

    Any thoughts would be greatly appreciated.

    Best Wishes
    Michelle

  • #2
    Originally posted by mimi_lupton
    Dear all,
    If I understand correctly It is filtered by how well it aligns to the human genome, using certain parameters.
    Michelle
    Michelle,

    The filtering is independent of alignment; it is based solely on the relative intensity of the fluorescent signals. Illumina uses two measures of relative intensity, called Chastity and Purity. Chastity is the intensity of the most intense base for a cluster divided by the sum of the most intense and second most intense signals. Purity is the most intense signal divided by the sum of all four fluorescent signals. The default parameter used by GERALD when filtering reads is CHASTITY ≥ 0.6. Stated another way (after a little algebra), the most intense signal must be at least 1.5 times the second most intense signal. Also, filter passing is based only on the signals over the first 12 cycles. I am not sure whether this means that the value must be ≥ 0.6 for each of those 12 cycles or that the average must be ≥ 0.6.
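
    The two definitions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the pipeline's code: the function names are made up, the intensities are assumed to be four positive numbers per cycle, and since it is unclear whether the threshold applies per cycle or to the average, both readings are shown. The algebra behind the 1.5 figure is I1 / (I1 + I2) ≥ 0.6, which rearranges to 0.4 × I1 ≥ 0.6 × I2, i.e. I1 ≥ 1.5 × I2.

```python
from typing import Sequence

CHASTITY_THRESHOLD = 0.6  # GERALD default quoted in the post
FILTER_CYCLES = 12        # number of initial cycles considered, per the post


def chastity(intensities: Sequence[float]) -> float:
    """Brightest signal divided by the sum of the two brightest signals."""
    top, second = sorted(intensities, reverse=True)[:2]
    return top / (top + second)


def purity(intensities: Sequence[float]) -> float:
    """Brightest signal divided by the sum of all four signals."""
    return max(intensities) / sum(intensities)


def passes_filter(cycles: Sequence[Sequence[float]], per_cycle: bool = True) -> bool:
    """Apply the chastity threshold over the first FILTER_CYCLES cycles.

    per_cycle=True requires every cycle to pass; per_cycle=False requires
    only the mean chastity to pass. Which reading the pipeline uses is
    left open here, as it is in the post.
    """
    values = [chastity(c) for c in cycles[:FILTER_CYCLES]]
    if per_cycle:
        return all(v >= CHASTITY_THRESHOLD for v in values)
    return sum(values) / len(values) >= CHASTITY_THRESHOLD
```

    For example, chastity([150, 100, 10, 5]) sits exactly on the threshold, because 150 is 1.5 times 100, while a read with one ambiguous cycle among twelve clean ones fails the per-cycle reading but passes the averaged one.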

    You may have confused read filtering with quality score calculation. Initial quality scores are based on the observed intensities, but the scores may then be calibrated based on the alignment of the control sample to its reference sequence. Reads that do not pass filtering will tend to have lower overall quality scores.

    Given all that, I don't think I'm the one to answer your real question: can you use unfiltered reads to identify rare variants? I do know that MAQ uses the quality score information when calculating its alignments, but I don't know whether this carries over into its SNP calling algorithm(s). Hopefully someone with more experience in SNP analysis will offer some input.

    • #3
      Thanks for your reply, that makes things clearer.
      Because I am looking at pools of many individuals, I am not using the MAQ SNP calling algorithm but calling my own SNPs using the pileup function. So the main question I am really asking is whether the unfiltered data, once aligned to the reference, is reliable.
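
      As a rough illustration of this kind of pooled calling, the sketch below computes per-base frequencies at one position and keeps non-reference alleles above a frequency cutoff. It assumes the base calls covering the position have already been pulled out of the pileup output as plain letters; the function names and the min_freq and min_depth values are hypothetical and would need tuning against the known PCR error rate.

```python
from collections import Counter
from typing import Dict


def allele_frequencies(bases: str) -> Dict[str, float]:
    """Fraction of reads supporting each base at one position."""
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: n / depth for base, n in counts.items()}


def candidate_variants(bases: str, ref_base: str,
                       min_freq: float, min_depth: int) -> Dict[str, float]:
    """Non-reference alleles at or above min_freq, given enough depth.

    min_freq and min_depth are illustrative cutoffs, not recommended values.
    """
    called = [b for b in bases.upper() if b in "ACGT"]
    if len(called) < min_depth:
        return {}
    freqs = allele_frequencies(bases)
    ref = ref_base.upper()
    return {b: f for b, f in freqs.items() if b != ref and f >= min_freq}
```

      Running the same function on the filtered and the unfiltered alignments, and comparing which low-frequency alleles appear only in the unfiltered set, is one way to see how much extra noise the unfiltered reads bring in.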

      Any thoughts would be greatly appreciated.

      Thanks
      Michelle

      • #4
        The current filter is quite strict, in that it may filter out a lot of good data. People argue a lot about whether and how to use unfiltered data, but I think most of them agree that we should apply at least some filter. If you do not want to invest time in studying better filters, I would recommend using the filter implemented in the pipeline.
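
        If you keep working from s_N_export.txt, the pipeline's own filter decision can still be applied afterwards. A minimal sketch, assuming the export file is tab-separated and that its final field holds the filter flag (Y for passed, N for failed); that layout is an assumption here and should be checked against the documentation for your pipeline version. The function name is made up for illustration.

```python
from typing import Iterable, List, Tuple


def split_by_filter(lines: Iterable[str]) -> Tuple[List[List[str]], List[List[str]]]:
    """Partition export-style lines into (passed, failed) records.

    Assumes tab-separated fields with a trailing Y/N filter flag.
    Blank lines are skipped.
    """
    passed: List[List[str]] = []
    failed: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if fields[-1] == "Y":
            passed.append(fields)
        else:
            failed.append(fields)
    return passed, failed
```

        Called as split_by_filter(open("s_N_export.txt")), this gives two sets of reads whose coverage and variant calls can be compared directly.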
