Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    ge_SF, thanks for the guide, i've been working up from zero R familiarity, so it's taken me a bit of reading to get to this point...

    the farthest i've gotten is trying to align my reads to my gff file:

    my_bamfile_.io <- interval_overlap(my_bamfile, my_gff)

    the error i get when i try this is:

    Error in newAGIoverlap(from, to) :
    NA(s) present in the strand of 'to' or 'from'.

    Is this what you were referring to when you said you ran into problems because your gff file had formatting differences from what the program was expecting?

    I'm a bit confused by the sample they provide in the girafe package (from the prepareAnnotation.R file) from miRBase (mouse microRNA sequences "mmu_miR13.gff") because it appears that it's a GFF2 version and they're using it with readGff3. I know that the gff I'm using is GFF3, but I'm not sure how to make my Genome_intervals_stranded object readable by interval_overlap.

    Any hints/help you could offer on this?

    Comment


    • #32
      Possible gff issues

      I am attaching a pdf to illustrate a bit better. The first GIS_object is mm.gi (from the sample data). You can see that the "annotation" slot has 10 columns ("seq_name" "strand" "inter_base" "source" "type" "score" "phase" "gffAttributes" "family" "class" ). The second GIS_object is created from my gff as first read in. It is missing those last 2 columns.

      So to create a GIS_object that worked I had to generate those- I pulled out the gene ID's (outside of R- using grep in plain old unix-land), reimported that into R, generated a "class" column (basically just writing "mRNA" ~10,000 times), and added those two columns back to the GIS_object, creating the file as is stands in the third GIS_object in my pdf (the one just called "Ei.gff").

      I would try typing head(my_gff) to see if you are haivng an issue like this. If you need more detailed instructions let me know (or someone else can chime in - I probably am NOT doing it the best way). If your GIS_object looks like the mm.gi, you may be having another issue.

      One more thing you might try: The isRightOpen parameter. I actually didn't fool around much with that, but it may make a difference depending on what your gff looks like. You could try it both ways (T or F) and see if one works.
      Attached Files

      Comment


      • #33
        ok, i was able to add the "family" and "class" columns to my gff file (I assume it doesn't matter whether the "family" column is called "ID" or "family")?

        and to actually add something to this thread instead of constantly ask basic questions, i added the new "family" column containing the "ID" tag from the attributes column using this command:

        my_gff$family <- getGffAttribute(my_gff, "ID")

        and that seemed to work fine, though it looks like that generates a column called "ID" rather than "family" (as is seen in mm.gi). I found this on the prepareAnnotation.R script included with girafe.

        and now for some more errors....

        i'm not getting past the command to change the column names. quick question re: some lines from your non-model orgs pdf.

        "my_count_df <-* data.frame(my_counts)
        col.names(my_count.df) <-* c("gene_ID", "sample_name")"

        Was the "my_count.df" above supposed to be "my_count_df"? I assumed it was just a typo, but I've done that before at my own peril.

        Specifically, I'm getting the error:

        Error in col.names(my_count_df) <- c("gene_ID", "sample_name") :
        could not find function "col.names<-"

        If I use:

        names(my_count_df) <-* c("gene_ID", "sample_name")

        I get the error:

        Error in names(my_count.df) <- c("gene_ID", "sample_name") :
        'names' attribute [2] must be the same length as the vector [1]

        So... getting closer!

        Comment


        • #34
          Actually, that is 100% my fault- both are typos.
          Yes my_count.df was supposed to be my_count_df (or rather, whatever you actually named your data frame).

          And the function should be colnames(my_count_df) with no dot in the middle. The function as (correctly) written assigns those names to the columns in your data frame. The "names" function should work though- you may have some earlier problem. It is a good idea to check as you go along to make sure you are getting something that makes sense. You can peek into your R objects by using head(object) or for a data frame you can use object[1:10, ] to look at the first 10 rows. You can also check to make sure you have the right class and size of object using class(object) and dim(object). In this case you should get as output: "data.frame" and [1] some_large_number 2

          btw- the R help pages can be useful for some of this. The page for the base package should have links to help pages for functions like data.frame, and colnames so you can get a better idea what they are trying to do.

          EDIT: A couple more things I thought of (sorry if I am a little scattered about this)

          - You may want to change the "ID" column name to family - it actually is important for some functions, such as the plotting. I don't think it matters for what you are doing here, but just in case. If you don't change it make sure you change my_gff$family in the table function to my_gff$ID. If you do change it this should work:
          colnames(my_gff@annotation) <- colnames(mm.gi@annotation)

          - Looking over your error message it really seems like you haven't made a data frame with the correct dimensions. I would definitely look at the data before proceeding.

          btw- nice trick with the getGffAttributes. I tend just work until I get something usable and move on, so I seldom do things the most elegant way....
          Last edited by ge_SF; 03-29-2011, 09:24 PM. Reason: Adding some info

          Comment


          • #35
            gff format in girafe / interval_overlap issues

            I've identified what I think is a problem: I am getting ZERO overlaps between my aligned read intervals (from BAM file) and my annotation intervals from the interval_overlap function in girafe.

            I'm pretty sure that the BAM file is ok, it inputs fine and is recognized as an "AlignedGenomeIntervals" object using class(my_bamfile). Although I'm using a subset of my dataset for troubleshooting, I've checked (with my own eyes) and there are definitely intervals in this list that do overlap with intervals from my gff file.

            I'll keep searching, but if anyone has an idea as to what could be causing this discrepancy, it would be super.

            Comment


            • #36
              A couple of ideas:
              - Double check that your my_bamfile really has the data you think it does. Try head(my_bamfile@annotation). It should give you something like this:

              seq_name strand inter_base
              psu|EINV_1049 psu|EINV_1049 + FALSE
              psu|EINV_882.1 psu|EINV_882 + FALSE
              psu|EINV_882.2 psu|EINV_882 - FALSE
              psu|EINV_882.3 psu|EINV_882 - FALSE
              psu|EINV_882.4 psu|EINV_882 + FALSE
              psu|EINV_882.5 psu|EINV_882 + FALSE

              you can also try: length(my_bamfile@reads)- this should give you the # of reads the file contains.

              - Make sure that your chromosomes (or contigs or whatever you aligned to making your bam file) are named the same as they are in the gff. For instance, in this example it could be that the seq_names in the gff are just formatted as "EINV_1049" while in the bam file they are formatted like "psu|EINV_1049"

              Hope that helps!
              I really hope you get this to work, since it will show it is applicable to other systems and not just my data.

              Comment


              • #37
                Ah, so if the seq_name formats must match between the two, then that is probably the issue.

                here's what i get from head(my_bamfile@annotations):

                seq_name strand inter_base
                chromosome:AGPv2:1:1:301354135:1.1 chromosome:AGPv2:1:1:301354135:1 - FALSE
                chromosome:AGPv2:1:1:301354135:1.2 chromosome:AGPv2:1:1:301354135:1 - FALSE
                chromosome:AGPv2:1:1:301354135:1.3 chromosome:AGPv2:1:1:301354135:1 - FALSE
                chromosome:AGPv2:1:1:301354135:1.4 chromosome:AGPv2:1:1:301354135:1 - FALSE
                chromosome:AGPv2:1:1:301354135:1.5 chromosome:AGPv2:1:1:301354135:1 - FALSE
                chromosome:AGPv2:1:1:301354135:1.6 chromosome:AGPv2:1:1:301354135:1 - FALSE


                and here's what i get from head(my_gff@annotations):

                seq_name strand inter_base source type score phase
                1 chr9 - FALSE CoGe gene NA NA
                2 chr9 - FALSE CoGe mRNA NA NA
                3 chr9 - FALSE CoGe CDS NA 0
                4 chr9 - FALSE CoGe CDS NA 0
                5 chr9 - FALSE CoGe CDS NA 0
                6 chr9 - FALSE CoGe CDS NA 0
                gffAttributes
                1 ID=GRMZM2G354611;Name=GRMZM2G354611 Alias=GRMZM2G354611
                2 ID=GRMZM2G354611_T01;Name=GRMZM2G354611_T01;Parent=GRMZM2G354611 Alias=GRMZM2G354611_T01
                3 Parent=GRMZM2G354611_T01 Alias=NA
                4 Parent=GRMZM2G354611_T01 Alias=NA
                5 Parent=GRMZM2G354611_T01 Alias=NA
                6 Parent=GRMZM2G354611_T01 Alias=NA
                ID class
                1 GRMZM2G354611 mRNA
                2 GRMZM2G354611_T01 mRNA
                3 <NA> mRNA
                4 <NA> mRNA
                5 <NA> mRNA
                6 <NA> mRNA

                I suppose I just need a way to change that strange "chromosome:AGPv2:<int>" format in my BAM-derived G_is object (where <int> is the chromosome number) to the simpler "chr<int>" format. I will try...

                Comment


                • #38
                  Yes, that is likely the problem.

                  Honestly, I think the easiest is to use sed
                  (good tutorial here in case you need it: http://www.grymoire.com/Unix/Sed.html)
                  And then re-read into R.

                  There is a "replace" function in R but it is a bit of a PITA and not very intuitive, esp for an R novice (though I guess you won't be soon!)

                  Comment


                  • #39
                    more progress...

                    matching up the seq_name formatting between the bamfiles and the gff files definitely worked! i'm getting a data.frame that has something in it, so that's good.

                    however, when i try to generate row names from the gene_IDs column of the counts_merged object, i get the following error:

                    > gene_IDs <- counts_merged[,1]
                    > row.names(counts_merged) <- gene_IDs
                    Error in `row.names<-.data.frame`(`*tmp*`, value = c(1L, 2L, 3L, 4L, 4L, :
                    duplicate 'row.names' are not allowed
                    In addition: Warning message:
                    non-unique values when setting 'row.names': ‚ÄòAC149475.2_FG005‚Äô, ‚ÄòAC149475.2_FGT005‚Äô, ‚ÄòAC149829.2_FG006‚Äô, ‚ÄòAC149829.2_FGT006‚Äô, ‚ÄòAC159612.1_FG007‚Äô, ‚ÄòAC159612.1_FGT007‚Äô, ‚ÄòAC183932.3_FG007‚Äô, ‚ÄòAC183932.3_FGT007‚Äô, ‚ÄòAC190628.4_FG007‚Äô, ‚ÄòAC190628.4_FGT007‚Äô, ‚ÄòAC196475.3_FG004‚Äô, ‚ÄòAC196475.3_FG005‚Äô, ‚ÄòAC196475.3_FGT004‚Äô, ‚ÄòAC196475.3_FGT005‚Äô, ‚ÄòAC197705.4_FG006‚Äô, ‚ÄòAC197705.4_FGT006‚Äô, ‚ÄòAC198418.3_FG005‚Äô, ‚ÄòAC198418.3_FGT005‚Äô, ‚ÄòAC198725.4_FG007‚Äô, ‚ÄòAC198725.4_FGT007‚Äô, ‚ÄòAC199526.5_FG002‚Äô, ‚ÄòAC199526.5_FGT002‚Äô, ‚ÄòAC205419.3_FG001‚Äô, ‚ÄòAC205419.3_FGT001‚Äô, ‚ÄòAC207722.2_FG009‚Äô, ‚ÄòAC207722.2_FGT009‚Äô, ‚ÄòAC207890.3_FG002‚Äô, ‚ÄòAC207890.3_FGT002‚Äô, ‚ÄòAC208110.2_FG007‚Äô, ‚ÄòAC208110.2_FGT007‚Äô, ‚ÄòAC208201.3_FG002‚Äô, ‚ÄòAC208201.3_FGT002‚Äô, ‚ÄòAC208327.4_FG003‚Äô, ‚ÄòAC208327.4_FGT003‚Äô, ‚ÄòAC208897.3_FG004‚Äô, ‚ÄòAC208897.3_FGT004‚Äô, ‚ÄòAC210003.2_FG004‚Äô, ‚ÄòAC210003.2_FGT004‚Äô, ‚ÄòAC217050.4_FG006‚Äô, ‚ [... truncated]


                    when you look at the actual counts_merged object, it's apparent that there are duplicated gene_IDs:

                    > counts_merged
                    gene_ID sample_name
                    1 AC147602.5_FG004 1
                    2 AC147602.5_FGT004 1
                    3 AC149475.2_FG002 1
                    4 AC149475.2_FG005 2
                    5 AC149475.2_FG005 3
                    6 AC149475.2_FGT002 1
                    7 AC149475.2_FGT005 2
                    8 AC149475.2_FGT005 3

                    (sorry the formatting is funky here, but the integers at the end of each line are actually the data in the sample_name column)

                    I'm not sure what is wrong, but obviously row.names does not tolerate duplications. When the two counts_df files are merged, wouldn't this automatically create duplicates?

                    Closer still....

                    Comment


                    • #40
                      Merging should not create duplicates. Assuming you have two files:

                      f1:

                      geneID sample_1
                      A 10
                      B 70
                      D 2

                      f2:
                      geneID sample_2
                      A 15
                      C 20
                      D 80

                      You should get:

                      merged file:

                      geneID sample_1 sample_2
                      A 10 15
                      B 70 NA
                      C NA 20
                      D 2 80

                      note that NAs are created- you get rid of them in the next step with:
                      counts_merged <-replace(counts_merged,is.na(counts_merged),0)

                      Merging could get screwed up if:
                      - data columns have the same name (ie just sample instead of sample_1 and sample_2)
                      - The column you are trying to merge on (geneID in this ex.) has a different name in each sample
                      - you forgot to choose all=T (doesn't sound like your problem- this would just eliminate any row that doesn't appear in both)
                      - they are not both data frames

                      I am guessing your issue is one of the first two.
                      Last edited by ge_SF; 04-01-2011, 02:33 PM. Reason: cleaning up weird characters

                      Comment


                      • #41
                        success

                        Originally posted by ge_SF View Post
                        Merging could get screwed up if:
                        - data columns have the same name (ie just sample instead of sample_1 and sample_2)
                        - The column you are trying to merge on (geneID in this ex.) has a different name in each sample
                        - you forgot to choose all=T (doesn't sound like your problem- this would just eliminate any row that doesn't appear in both)
                        - they are not both data frames

                        I am guessing your issue is one of the first two.
                        Right you were, I had named both of the data columns the same thing, woops. After fixing that, I was able to make a nice counts table for my data. So yes, your method is certainly applicable to other systems.

                        Thanks so much for the troubleshooting help. A very good learning experience for working in R in general, I certainly have a ways to go. Though now at least I can apply the basics I learned while working with other packages - bring on edgeR!

                        Comment


                        • #42
                          Great! I am glad this worked for you.

                          Actually, thank you for helping to troubleshoot my protocol. I will update it at least to remove the typos, and hopefully make it a little more clear.

                          Happy R adventures...

                          Comment


                          • #43
                            Hi Matt, it's a great introduction. May I ask why "the scaling factor normalization method ... preserves the count nature of the data"? I read the Robinson's paper but my understanding is that it still may produce non-integer numbers after scaling. Could you elaborate on this a little bit? Thanks

                            Comment


                            • #44
                              Hi RNAer,
                              The scaling factor normalization preserves the count data because in the statistical tests for DE the data is not actually multiplied by the scaling factor. In edgeR (where TMM is implemented) the scaling factor is used as an "offset" in the negative binomial model and the original count data is entered into the model. In this way we can preserve the count data but include the appropriate normalization between samples. So we never actually multiply the counts by the scaling factor. Hope this helps but it may become clearer when you implement the whole edgeR testing procedure.

                              Comment


                              • #45
                                Thanks, Alicia.In your paper, it's said in Methods section "Their analysis utilizes an offset to account for the library size and a likelihood ratio (LR) statistic to test for differences in expression between libraries ... In order to use TMM normalization, we augment the original offset with the estimated normalization factor". Do you mean this?

                                But I guess that augmenting with the original offset with the scaling factor would produce a non-integer number, right? Is it rounded to an integer? Will it cause a problem especially for the genes with few reads.

                                Originally posted by A Oshlack View Post
                                Hi RNAer,
                                The scaling factor normalization preserves the count data because in the statistical tests for DE the data is not actually multiplied by the scaling factor. In edgeR (where TMM is implemented) the scaling factor is used as an "offset" in the negative binomial model and the original count data is entered into the model. In this way we can preserve the count data but include the appropriate normalization between samples. So we never actually multiply the counts by the scaling factor. Hope this helps but it may become clearer when you implement the whole edgeR testing procedure.
                                Last edited by RNAer; 04-11-2011, 05:59 AM.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-27-2024, 06:37 PM
                                0 responses
                                12 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-27-2024, 06:07 PM
                                0 responses
                                11 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                53 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                69 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X