Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Homer / getDiffExpression.pl / edgeR

    To complement our cuffdiff analysis we thought to take a parallel route using Homer:

    To do so we basically followed this protocol:

    makeTagDirectory pairend.sam -format sam -fragLength 250 -removeSpikes 10000 5 -genome /data//projects/mm9/ref.fasta -checkGC -tbp 1 -sspe -flip



    Then, we run analyzeRepeats.pl to make read summaries across genes:
    analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 > Scramble_vs_Target_analyzeRepeats.txt

    We run this step as recommended by the Homer tutorial with the unnormalized data so the edgeR analysis, which is the following step, would be able to use an unnormalized data.


    Next step we run:
    getDiffExpression.pl Scramble_vs_Target_analyzeRepeats.txt MIa9_1 MIa9_2 -repeats -edgeR > Scramble_vs_Target_analyzeRepeats_edgeR.txt


    After testing the execution of this command and checking closely the perl script for getDiffExpression we run into the following understanding:
    The getDiffExpression step takes unnormalized data and transmits it to edgeR so edgeR can calculate log2FDR, p-value, and FDR. off the original unnormalized data. edgeR creates a model for the data and uses pseudo values in cases of zero values. edgeR finally outputs the log2FDR, p-value, and FDR as calculated after transferring the original data into a modeled space.

    Now, the output of getDiffExpression.pl is composed of two parts:
    The three columns at the very right end, which contain the edgeR's output (Log2FC; p-value, FDR).
    AND the output also contains - the rest of the table (all the left side-end columns) which forms a matrix. The matrix is of normalized values. BUT these are not normalized values created by edgeR, but instead - normalized values that were created as part of getDiffExpression.pl script execution.
    This data is only presented in the final output table next to the edgeR output (three right columns) and is aimed only for the user to see, but this normalized values were NOT actually been used as part of the edgeR analysis (unlike what I originally thought...).

    My issue is that we are currently not sure in what manner the data which presented in this table was actually normalized.

    Please see this link for relevant indications:

    Look for explanatory indication such as the following which exactly states what I've described above:
    "This command will basically add 3 columns to the end of your original file per pairwise comparison. Most important are the log2 fold change and the p-value and FDR values. The raw read counts will be normalized in this file based on the total reads in each column so that the values are more easily to compare when looking through the results."

    However, our attempts to re-run the previous step again - (run analyzeRepeats.pl) - using this time the normalization options in order to obtain the normalized values that appear in the output from getDiffExpression.pl did not prove to be successful...
    We tried:
    analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 -normMatrix 1e6

    And also,
    analyzeRepeats.pl rna mm9 -noadj -count genes -strand + -d MIa9_1 MIa9_2 -norm 1e7

    but the normalized values that we obtained were quite different from the ones that were obtained after running the getDiffExpression.pl script...

    We think that it would become important to know how the data that appears in the output table from getDiffExpression.pl was actually normalized. Was it per all reads? was it per all reads in each individual library? ...

    If any of the users have some more information and ideas regarding this matter we'll be glad to hear!
    Alternatively, we could ignore the normalized values content of the output table, and refer only to the results of edgeR that show the statistics.

    It would be interesting to hear how other users decided to normalize their rna-seq data using this approach. Which selections were finally chosen.

    Thanks a lot!

    Roy

Latest Articles

Collapse

  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM
  • seqadmin
    Techniques and Challenges in Conservation Genomics
    by seqadmin



    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

    Avian Conservation
    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
    03-08-2024, 10:41 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 06:37 PM
0 responses
10 views
0 likes
Last Post seqadmin  
Started by seqadmin, Yesterday, 06:07 PM
0 responses
9 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-22-2024, 10:03 AM
0 responses
51 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-21-2024, 07:32 AM
0 responses
67 views
0 likes
Last Post seqadmin  
Working...
X