  • [Help] How to get those reads containing specified SNP?

    Hi, all,

    I am new to bioinformatics.

    After SNP calling with GATK/freebayes, we usually get a list of SNPs. I have some SNP sites of interest. Does anyone know how to identify the reads that contain these SNPs?

    Please note that these SNPs might be heterozygous. I have mapped the reads to a reference and have a sorted BAM file.

    Could anyone explain how to do this in detail, or just share your thoughts and any tools that might be helpful?

  • #2
    Assuming you have mapped your reads and now have a SAM/BAM file (this is the usual case), the samtools 'view' command will pull out the reads in the region of your choice.


    • #3
      I might not be understanding you, but you can pull out all the reads plus their info with
      grep -B 1 -A 2 GCCTATCGCAGATACACTCC sample.fastq > SNVreads.fastqish
      (the nucleotide string contains your SNP)

      You then need to remove the -- lines that grep prints between reads. Match the whole line, so that quality lines that happen to contain -- are kept:
      grep -v -e '^--$' SNVreads.fastqish > SNVreads.fastq

      You might have to tweak the length of your grep nucleotide pattern for specificity and to avoid other SNPs (I don't know what you are sequencing). One cross-platform visualization tool is Ugene.
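
A record-aware variant of the grep approach above, as a minimal sketch rather than a definitive tool: awk reads the FASTQ four lines at a time and tests only the base line, so no `--` separators appear and quality lines are never matched. It also checks the reverse complement, since an unmapped read may come from either strand. The helper names `revcomp` and `fastq_with_motif` and the toy reads are assumptions made for illustration, and the sketch assumes strict four-line records (no wrapped sequences) and an uppercase motif.

```shell
# revcomp SEQ: print the reverse complement of a DNA string (hypothetical helper).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# fastq_with_motif MOTIF FILE: print every 4-line record whose bases contain
# MOTIF or its reverse complement (hypothetical helper).
fastq_with_motif() {
    awk -v fwd="$1" -v rc="$(revcomp "$1")" '
        NR % 4 == 1 { header = $0 }      # @name line
        NR % 4 == 2 { bases = $0 }       # nucleotide line
        NR % 4 == 3 { plus = $0 }        # "+" separator line
        NR % 4 == 0 {                    # quality line: the record is complete
            if (index(bases, fwd) || index(bases, rc))
                print header "\n" bases "\n" plus "\n" $0
        }
    ' "$2"
}

# Toy input (made-up reads): r1 carries the motif, r2 its reverse complement, r3 neither.
tmp=$(mktemp -d)
printf '%s\n' \
    '@r1' 'AAGCCTATCGCAGATACACTCCTT' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
    '@r2' 'AAGGAGTGTATCTGCGATAGGCTT' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
    '@r3' 'ACGTACGTACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
    > "$tmp/sample.fastq"

fastq_with_motif GCCTATCGCAGATACACTCC "$tmp/sample.fastq" > "$tmp/SNVreads.fastq"
cat "$tmp/SNVreads.fastq"
```

On this toy file the result should hold the r1 and r2 records and skip r3. The match is still a plain substring test, so the caveats about pattern length and specificity apply unchanged.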

      Hope this is what you are looking for.

      Earl
      --Please take everything I say with a grain of salt, because, if grad school has taught me anything, it's that I'm an idiot--


      • #4
        reference -----------------------------------------------------------
        read1 ----------T-------------
        read2 -------------------------
        read3 ------T------------------
        read4 --------------------------

        I want to extract the IDs of all reads that carry the T SNP.


        • #5
          If your read file looks like that, then you can use

          [your/Directory]$ grep -e '-T-' YourReadFile.txt > YourSNPReadFile.txt
          # "-e" stops grep from treating the leading hyphen of the pattern as an option

          output:
          [your/Directory]$ more YourSNPReadFile.txt
          read1 ----------T-------------
          read3 ------T------------------

          _________________________________________________________________________
          If you have a .fastq file, all you need is the first line of each record (the read name), which comes just before the nucleotide string, like:

          @M01472:34:000000000-A40FG:1:1101:17765:1645 1:N:0:9
          NTTCCAGCGAGGTTCTGAGTTCTTAGTCTGGTGTCGGCGTACCCACACGGTG
          +
          #>>>ABFFB?DBGGGGGCEGGGHHHGHHHHHFAGHEEGGGGGGHHGFDEEFG


          just use:

          [your/Directory]$ grep -B 1 GCCTATCGCAGATACACTCC YourSample.fastq > NamesAndReads.txt
          #where "-B 1" prints the line before the pattern
          #and the pattern "GCCTATCGCAGATACACTCC" contains the SNP somewhere in the middle.

          [your/Directory]$ grep @M01472 NamesAndReads.txt > Names.txt
          # "@M01472" is something in all the names but not in any reads
          # for instance if your read names are actually read1, read2, read3, and read4 you could use "read"

          #output for my command
          [your/Directory]$ more Names.txt
          @M01472:34:000000000-A40FG:1:1101:17765:1645 1:N:0:9
          @M01472:34:000000000-A40FG:1:1101:18453:1656 1:N:0:9
          @M01472:34:000000000-A40FG:1:1101:16266:1658 1:N:0:9
          --More--(0%)
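
The two grep steps above can also be done in one pass, sketched here under a few assumptions: awk remembers each record's name line and prints the read ID only when the base line contains the pattern, so it does not rely on a prefix such as "@M01472" being shared by all names. The function name `fastq_ids_with_motif` and the toy reads are made up for illustration; only the forward-strand pattern is tested, and strict four-line records are assumed.

```shell
# fastq_ids_with_motif MOTIF FILE: print the ID of every read whose bases
# contain MOTIF (hypothetical helper; forward-strand match only).
fastq_ids_with_motif() {
    awk -v motif="$1" '
        NR % 4 == 1 { id = substr($1, 2) }              # drop "@", keep text up to the first space
        NR % 4 == 2 && index($0, motif) { print id }    # base line contains the pattern
    ' "$2"
}

# Toy input (made-up reads): only readA carries the pattern.
tmp=$(mktemp -d)
printf '%s\n' \
    '@readA 1:N:0:9' 'TTGCCTATCGCAGATACACTCCAA' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
    '@readB 1:N:0:9' 'ACGTACGTACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
    > "$tmp/YourSample.fastq"

ids=$(fastq_ids_with_motif GCCTATCGCAGATACACTCC "$tmp/YourSample.fastq")
printf '%s\n' "$ids"
```

Here the ID is everything between "@" and the first space, which is the part mappers usually carry over into the SAM read name.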

          NOTE: This is a quick solution. If your genome is repetitive, or if the SNP lies in a duplicated region, this approach might not be the best method. In that case, something a little more involved, working from a .sam file, might be necessary.
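
For the .sam route mentioned in the note, here is a minimal sketch of one way it could look, not a tested pipeline. For each alignment on the chosen contig, awk walks the CIGAR string to find which read base sits on the SNP's reference position, then prints the read ID if that base equals the allele of interest. Because SEQ in a SAM record is already given in reference orientation, no reverse complement is needed. The function name `reads_with_allele`, the contig `chr1`, position 12, and the toy alignments are all assumptions made for illustration.

```shell
# reads_with_allele CONTIG POS ALLELE: read SAM records on stdin and print the
# names of reads whose aligned base at 1-based reference position POS is ALLELE.
# (Hypothetical helper, a sketch only.)
reads_with_allele() {
    awk -F '\t' -v chr="$1" -v target="$2" -v allele="$3" '
        /^@/ { next }                              # skip SAM header lines
        $3 != chr || $6 == "*" { next }            # other contig, or no alignment
        {
            ref = $4; q = 1; cigar = $6            # ref: reference position, q: offset in SEQ
            while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
                len = substr(cigar, 1, RLENGTH - 1) + 0
                op = substr(cigar, RLENGTH, 1)
                cigar = substr(cigar, RLENGTH + 1)
                if (op == "M" || op == "=" || op == "X") {
                    if (target >= ref && target < ref + len) {
                        if (substr($10, q + target - ref, 1) == allele) print $1
                        break
                    }
                    ref += len; q += len           # aligned block: advances both
                } else if (op == "I" || op == "S") {
                    q += len                       # insertion or soft clip: read only
                } else if (op == "D" || op == "N") {
                    if (target < ref + len) break  # site is deleted or skipped in this read
                    ref += len                     # deletion or skip: reference only
                }
                # H and P consume neither read nor reference
            }
        }
    '
}

# Toy alignments (made up): the site of interest is chr1:12, the allele is T.
tmp=$(mktemp -d)
{
    printf '@SQ\tSN:chr1\tLN:100\n'
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
        read1 0 chr1 5 60 10M       '*' 0 0 AAAAAAATAA    '*' \
        read2 0 chr1 5 60 10M       '*' 0 0 AAAAAAACAA    '*' \
        read3 0 chr1 3 60 2S4M1I6M  '*' 0 0 GGAAAACAAAAAT '*' \
        read4 0 chr1 8 60 3M4D3M    '*' 0 0 AAATTT        '*' \
        read5 0 chr2 5 60 10M       '*' 0 0 AAAAAAATAA    '*'
} > "$tmp/demo.sam"

hits=$(reads_with_allele chr1 12 T < "$tmp/demo.sam")
printf '%s\n' "$hits"
```

On the toy file this should report read1 and read3: read2 carries C at the site, read4 has the site inside a deletion, and read5 maps to another contig. With a sorted and indexed BAM, something along the lines of `samtools view sorted.bam chr1:12-12 | reads_with_allele chr1 12 T | sort -u` would restrict the input to reads overlapping the site, with `sort -u` collapsing mates that share a name. For a heterozygous SNP, running it once per allele splits the reads by haplotype at that site.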

          Hope that helps.
          --Please take everything I say with a grain of salt, because, if grad school has taught me anything, it's that I'm an idiot--
