
  • Finding the genomic location of an insert

    Is there some way to use RNA-seq and/or whole genome sequencing data (I have both for the relevant samples) to find the genomic location of an insert with an unknown location? The insert itself is of known sequence, and aligns correctly to a reference containing only itself + some minor control sequences.

    I was told that one thing I might do is to align my data to the reference containing only the insert sequences, but split my (paired-end) data into two, i.e. align each mate file separately as single-end data (the "..._1" files and the "..._2" files). I should then take all the reads that align (by name) and use them to subset the other original fastq file so that I get their mates (i.e. subset "..._2" by the reads that aligned from "..._1"), and align those mates to the normal reference genome, again single-end. I would then, hopefully, get reads aligning to the same region, and I would know the location of my insert (after which I could design some PCR primers and validate the results).

    I have done this with my WGS data, but the reads map more or less randomly across all chromosomes... I feel I might be subsetting the read names incorrectly, mostly because I'm not sure exactly how the reads are named and how to find the pairs properly. At the moment, this is what I'm doing:

    Code:
    (... alignment with BWA)
    
    samtools view mapped.sorted.rmdup.input_1.bam | \
    	gawk '{print $1}' | \
    	sort | \
    	uniq > unique.txt
    
    fastqutils filter -whitelist unique.txt input_2.fastq > 1-to-2.fastq
    
    Am I doing something wrong with the analysis, or is the idea somehow flawed? I am being fairly stringent in the first alignment step, using the -B 40 -O 60 -E 10 options (with BWA), in the hope of aligning only near-exact matches (I have also tried without this stringency, with more or less the same results).

    Does anybody have any idea what I'm doing wrong, what's wrong with the idea, or have any other idea on how to find an unknown insert?
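
    One possible source of trouble in the name-subsetting step is a mismatch between the read names in the BAM file and the headers in the mate fastq file: an aligner may drop a trailing "/1" or "/2" from the read name, and newer headers carry the mate number in a comment field after a space. It may also be worth confirming that the BAM holds only mapped reads (for example with samtools view -F 4), since unmapped reads would otherwise end up in the name list. The following is a minimal sketch of suffix-tolerant matching, not the behaviour of fastqutils; the function names are hypothetical and it assumes plain 4-line fastq records.

```python
import re

# Trailing "/1" or "/2" mate suffix used by older Illumina read names.
_MATE_SUFFIX = re.compile(r"/[12]$")


def normalize_read_name(header):
    """Reduce a fastq header or SAM read name to a mate-independent name."""
    if header.startswith("@"):
        header = header[1:]
    name = header.split()[0]          # drop any comment field after whitespace
    return _MATE_SUFFIX.sub("", name)  # drop a trailing /1 or /2


def filter_fastq_by_names(fastq_lines, wanted):
    """Yield 4-line fastq records whose normalized name is in `wanted`.

    Assumes every record is exactly four lines (no wrapped sequences).
    """
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if normalize_read_name(record[0]) in wanted:
                yield tuple(record)
            record = []


# Toy example: names of reads that aligned to the insert (from the _1 file).
aligned = {normalize_read_name(n) for n in ["read1/1", "read3"]}

# Toy _2 fastq content, mixing both header styles.
mates = [
    "@read1/2", "ACGT", "+", "IIII",
    "@read2/2", "TTTT", "+", "IIII",
    "@read3 2:N:0:1", "GGCC", "+", "IIII",
]
kept = list(filter_fastq_by_names(mates, aligned))
print([rec[0] for rec in kept])
```

    In real use, the name set would be read from unique.txt and the fastq lines from an open file handle on input_2.fastq.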

  • #2
    This is quite difficult in general and leads to false positive hits in my experience.

    It's difficult to estimate how many false positives you can expect without knowing the read length and the genome size / repetitiveness.
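
    One rough way to separate a real integration site from scattered false positives is to count the mate alignments per genomic window and see whether any window stands out. This is only a sketch under assumptions: the (chromosome, position) pairs are taken to have been extracted from the mate alignment already, the function names and the 1 kb window size are hypothetical, and a junction that straddles a window boundary will have its reads split across two bins.

```python
from collections import Counter


def count_reads_per_window(alignments, window=1000):
    """Count alignments per (chromosome, window index)."""
    counts = Counter()
    for chrom, pos in alignments:
        counts[(chrom, pos // window)] += 1
    return counts


def top_windows(alignments, window=1000, n=3):
    """Return the n most populated windows as ((chrom, window_start), count)."""
    counts = count_reads_per_window(alignments, window)
    return [
        ((chrom, idx * window), count)
        for (chrom, idx), count in counts.most_common(n)
    ]


# Toy data: three mates pile up on chr3 around 15 kb, the rest are scattered.
mate_positions = [
    ("chr1", 100),
    ("chr2", 5000),
    ("chr3", 15200),
    ("chr3", 15350),
    ("chr3", 15900),
    ("chrX", 42),
]
best = top_windows(mate_positions, window=1000, n=1)
print(best)
```

    If no window holds clearly more reads than the rest, that is consistent with the hits being mostly background.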

    Maybe you've tried this, but doing a couple of de novo assemblies and looking for the genomic regions flanking your insert, if present, would probably be more helpful. If these flanks are mappable and unique in the genome, then that is good evidence.
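
    As an illustration of the flank idea, here is a minimal sketch that looks for the insert inside an assembled contig and returns the sequence on either side, which could then be aligned to the reference genome. It assumes an exact match of the full insert in either orientation; in practice the contigs would more likely be compared to the insert with an aligner that tolerates mismatches and partial hits. The function names and the flank length are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_flanks(contig, insert_seq, flank_len=200):
    """Return (left_flank, right_flank) around the insert, or None if absent.

    Flanks are reported in the contig's orientation and may be shorter than
    flank_len (or empty) when the insert sits near a contig end.
    """
    for query in (insert_seq, reverse_complement(insert_seq)):
        start = contig.find(query)
        if start != -1:
            end = start + len(query)
            left = contig[max(0, start - flank_len):start]
            right = contig[end:end + flank_len]
            return left, right
    return None


# Toy contig: genomic sequence on both sides of a short "insert".
contig = "AAAACCCC" + "GATTACA" + "GGGGTTTT"
flanks = find_flanks(contig, "GATTACA", flank_len=4)
print(flanks)
```

    A contig in which the insert is found with non-empty flanks on both sides is the interesting case, since both flanks should then map next to each other in the genome.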



    • #3
      Ah, interesting... I have never done a de novo assembly before, at either the genomic or the transcriptome level. I assume you're advising that I do it at the genomic level? Could you point me towards some tool(s) that I could use for this?



      • #4
        For RNA-seq, a good de novo tool is Trinity. For genomic assemblies, perhaps ABySS, Minia or SOAPdenovo might suit your needs. If you have no experience, perhaps you can find these on a Galaxy instance somewhere, maybe at iPlant. I think Sweden also has a very good infrastructure setup that you could get time on (I forget what it's called).
