Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extracting features above a certain coverage

    Hi,

    I have a bam file of all my accepted hits (output from tophat run on data from strand specific libraries/HiSeq) and an gtf file with my genes of interest for which I am trying to find potential antisense transcripts. I would like to create a list - preferably one that can be visualized in a genome browser - that shows all genes for which there are antisense reads in the accepted hits.bam file provided that there are more than a certain number of reads mapping to that gene. I can easily use bedtools intersect to get all genes that have antisense reads mapping to them, but this gives me way too many hits. I have also exported my resulting hits to excel to try to sort the file there (bioinformatics newbie that I am) but I can't get it back to a format that works for visualizing in a genome browser.

    Thank you.

  • #2
    Hi again,

    I have come up with a possible solution, but if anyone had some input to give I would very much appreciate it. If I did bedtools coverage to get the antisense coverage for each of my genes, and then sorted the resulting file on the coverage value? And then extracted everything in the first x lines? Could that work?

    I am unsure how to do the sorting and extracting. "sort" and "grep"? Any suggestions on the exact "phrasing"?

    Thank you!

    Comment


    • #3
      Once you have a coverage file, you can sort based on the result with "sort". the -n flag will sort numbers numerically (eg: 10<20<100 instead of 10<100<11<20) and the -kX flag will allow you to choose which key/column you wish to do the actual sort on. -k 4 would choose the 4th. -r would allow you to reverse the output to get the largest numbers first instead of last.

      I would suggest "head -X <filename>" to extract the first X lines, although i imagine grep could also do it relatively easily.

      Comment


      • #4
        This worked very well! Thank you!

        Comment


        • #5
          Originally posted by annavilborg View Post
          Hi,
          ... provided that there are more than a certain number of reads mapping to that gene...
          Thank you.
          Hi annavilborg,
          you can additionally pipe your sorted output to awk to get only those lines (genes) with read count above your threshold. Figure out which column (field) stores the read counts and then e.g. for field 4 and Threshold 100 apply something like:
          Code:
          bedtools coverage ... | awk '{if ($4>100) print $0}' | sort -nr -k 4
          $4 is the 4th field
          100 the read count threshold
          print $0 prints the content of the whole line
          sort -nr -k 4 will then numerically, reverse sort all these gene lines with readcount above threshold based on the corresponding column (4).

          Comment


          • #6
            Neat! Thank you!

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM
            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            24 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            25 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            21 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-04-2024, 09:00 AM
            0 responses
            52 views
            0 likes
            Last Post seqadmin  
            Working...
            X