Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mapping reads to reference genome + count reads of genes

    Hello,

    I have environmental metatranscriptome data (Illumina HiSeq reads), that contains transcript reads of a mixture of prokaryotic organisms. I now want to identify all reads that belong to a certain bacterial genome via mapping (e.g. via MUMmer promer) the reads from the metatranscriptome to a reference genome. The mapped reads I want to count (works well with MUMmer results for the total number) specifically for every gene.

    The problem is now the annotation part to get the number of reads for the specific genes. I would like to not only have the information on which part of the genome (bases 12345-12377) the read is mapped but also which gene (locus: gene_1, gene_2 etc.) that is.

    In the end I basically want to end up with a result like this:
    Code:
    gene_1 23 hits
    gene_3 0 hits
    gene_4 237 hits
    ...
    MUMer output is producing a table looking like this:

    Code:
        [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  [% SIM]  [% STP]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [FRM]  [TAGS]
    ========================================================================================================================================================
       17690    17770  |       19       99  |       81       81  |    88.89    88.89     0.00  |  1645259       99  |     0.00    81.82  |  2  1  REF_genome	HWI-ST908_0052:3:1101:2512:19931#CGATGG/1
       27104    27190  |       98       12  |       87       87  |    89.66    89.66     0.00  |  1645259      100  |     0.01    87.00  |  2 -3  REF_genome	HWI-ST908_0052:3:1101:21043:110201#CGATGT/1
       27158    27238  |       18       98  |       81       81  |    85.19    88.89     0.00  |  1645259      100  |     0.00    81.00  |  2  3  REF_genome	HWI-ST908_0052:3:1101:20301:105330#CGATGT/1
       72735    72640  |       99        4  |       96       96  |    78.12    84.38     0.00  |  1645259      100  |     0.01    96.00  | -3 -2  REF_genome	HWI-ST908_0052:3:1101:11083:19486#CGATGT/1
    
    ...

    I know that there is GFF files that basically give you the information on which region on the genome there what gene (gene_1, gene_2 etc.) - now I would need a script or tool that brings both information together.
    Since I am more a user and not so advanced at scripting it would be great to find a tool that already exits doing this. Maybe someone knows of such a tool, this would be great!

    Thank you very much!
    cumulonimbus

  • #2
    If you can get your reads and your genes into UCSC BED format, then you can use the BEDOPS bedmap tool to map reads to genes.

    http://code.google.com/p/bedops/wiki/bedmap

    The bedmap tool comes with a --count operator, which you can use here to count the number of reads that map to a gene.

    http://code.google.com/p/bedops/wiki...lap_statistics

    If your inputs are not in BED format, BEDOPS offers conversion scripts for converting common genomic formats to sorted BED files, which you can use as inputs to bedmap.

    http://code.google.com/p/bedops/wiki/conversion

    To briefly demo how this might work for you, let's say your genes are in GFF format and your reads are in BAM format. We can convert like so:

    $ gff2bed < genes.gff > genes.bed
    $ bam2bed < reads.bam > reads.bed


    Now we can map the reads to the genes and count them:

    $ bedmap --echo --count genes.bed reads.bed > answer.bed

    The file answer.bed is a BED-formatted list of genes. Each line contains a gene and the number of reads that overlap that gene:

    $ more answer.bed
    chr1 1000 2000 gene-1 ... | 8
    chr1 4000 5000 gene-2 ... | 5
    ...


    (In other words, eight reads overlap gene-1, five reads overlap gene-2 — and so on.)

    If you want more information than just read counts, there are several operators that bedmap offers. For example, you might add the --echo-map-id operand if you want the IDs of all overlapping reads. The bedmap documentation describes various statistical and element operators in more detail.

    The default overlap criterion between read and gene is one or more bases. You can set this to be more stringent with appropriate overlap settings:

    http://code.google.com/p/bedops/wiki...erlap_criteria
    Last edited by AlexReynolds; 06-05-2013, 04:44 AM.

    Comment


    • #3
      I think HTSeq will do what you want.

      Comment


      • #4
        Dear Alex,

        thank you very much, this is exactly what I was looking for (I tried already and it works great), highly appreciated, very nice program!

        Jeremy: Thanks, this looks also good, I will have a look, too!

        Thank you for this fast help
        cumulonimbus

        Comment


        • #5
          Dear Alex,

          I am now working through my datasets with bedmap for counting the reads and for some of my files I get this error when I use

          bedmap --echo --count genes.bed reads.bed > hits.tab

          Code:
          dyld: lazy symbol binding failed: Symbol not found: __ZNSt8__detail15_List_node_base7_M_hookEPS0_
            Referenced from: /usr/local/bin/bedmap
            Expected in: /usr/lib/libstdc++.6.dylib
          
          dyld: Symbol not found: __ZNSt8__detail15_List_node_base7_M_hookEPS0_
            Referenced from: /usr/local/bin/bedmap
            Expected in: /usr/lib/libstdc++.6.dylib
          I checked the data and it looks normal. I also split up the files into smaller parts to see whether there is something wrong with the format in the file, but when I find the line where the error disappears, I can see no difference in the format.
          For some data files (up to 120 MB in size) it works fine, for some not, do you have a solution why?

          I am using Mac OS X 10.8.3

          Thank you for your help!

          Comment


          • #6
            Can you indicate what version of bedmap you are running? This is an error usually seen with an older version of BEDOPS for OS X, and upgrading to a current version may help resolve this.

            Comment


            • #7
              Oh, I forgot to mention this, it is: version: 2.2.0
              which is the current version, right?

              Comment


              • #8
                That's the current version. Can you post the results from the following command?

                $ otool -L /usr/local/bin/bedmap

                You may need to install otool via Xcode or Apple's command-line developer tools installer.

                Also, do you have MacPorts and GCC installed?

                Comment


                • #9
                  $ otool -L /usr/local/bin/bedmap:
                  Code:
                  /usr/local/bin/bedmap:
                  	/Library/Application Support/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.17.0)
                  	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)
                  	/Library/Application Support/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)
                  $ gcc --version:
                  Code:
                  i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
                  Copyright (C) 2007 Free Software Foundation, Inc.
                  This is free software; see the source for copying conditions.  There is NO
                  warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
                  XCode Command Line tools are installed, GCC as well and I just installed MacPorts (MacPorts-2.1.3-10.8-MountainLion). XCode version 4.6.2.

                  So far the error is still occurring.

                  Comment


                  • #10
                    Thanks, the paths that otool are reporting are wrong. I'm researching why that is.

                    In the meantime, to resolve this issue please run the following commands:

                    $ sudo install_name_tool -change /Library/Application\ Support/libstdc++.6.dylib /Library/Application\ Support/BEDOPS/libstdc++.6.dylib /usr/local/bin/bedmap

                    $ sudo install_name_tool -change /Library/Application\ Support/libgcc_s.1.dylib /Library/Application\ Support/BEDOPS/libgcc_s.1.dylib /usr/local/bin/bedmap

                    These two commands tell these binaries where to find the required library files.

                    Assuming this is a problem with the 2.2 installer, you will likely want to repeat these commands for the following binaries, replacing /usr/local/bin/bedmap with:

                    /usr/local/bin/bedops
                    /usr/local/bin/bedextract
                    /usr/local/bin/closest-features
                    /usr/local/bin/sort-bed
                    /usr/local/bin/starch
                    /usr/local/bin/unstarch
                    /usr/local/bin/starchcat

                    Alternatively, you could wait until I put out a 2.2.1 installer, once I find the cause of this issue.

                    Thanks for the report!

                    Comment


                    • #11
                      Thanks a lot for your fast support, this solved the problem!

                      Comment


                      • #12
                        For others, I posted a new OS X installer ("2.2.0b") that fixes this issue:

                        http://code.google.com/p/bedops/down....2.0b.mpkg.zip

                        This patches some scripts that do the equivalent of the aforementioned fix.

                        Comment


                        • #13
                          We have posted new builds of BEDOPS v2.3:

                          Downloads are available at the bottom of this page. Please read the BEDOPS v2.3.0 revision history, which summarizes new features and fixes in this release. Linux bedops_linux_x86_64-v2.3.0.tar.bz...


                          A more complete revision history is available here:



                          Feel free to send us feedback at: [email protected]

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          22 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          24 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          20 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          52 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X