
  • #16
    Originally posted by rkk View Post
    command has to identify min and max value from col1 values.. and then bin that into 100bp regions...
    I am afraid then your bins would be like this

    Code:
    10175-10275 8
    10276-10375 1
    10376-10475 1
    10476-10575 2



    • #17
      Once the minimum value is identified, it should be rounded down to the nearest 100. For example, in this case the minimum value is 10175, so the bins should start at 10100. Hope this helps.
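The rounding described here can be sketched in awk. This is only a minimal sketch, assuming a whitespace-separated file with positions in column 1; the function name `bin_start` is hypothetical.

```shell
# Hypothetical helper: print the column-1 minimum rounded down to a multiple of 100.
bin_start() {
    awk 'NR == 1 || $1 < min { min = $1 + 0 }   # track the smallest position
         END { print int(min / 100) * 100 }     # floor it to the nearest 100
    ' "$@"
}

printf '10175 1\n10499 1\n10313 1\n' | bin_start   # prints 10100
```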



      • #18
        Originally posted by rkk View Post
        I should use that command in LINUX...

        Now, I have another issue

        I have a file like the following. I need to bin the first column into 100bp regions and sum the second column values for each bin
        10175 1
        10179 1
        10189 1
        10191 1
        10201 1
        10243 1
        10249 1
        10262 1
        10313 1
        10414 1
        10485 1
        10499 1

        The output should be something like this..

        10101-10200 4
        10201-10300 4
        10301-10400 1
        10401-10500 3

        Can someone help with this..

        Thanks in advance..
        @rkk,

        Your input can have two solutions

        Code:
        Solution 1 (considering the minimum and maximum values from col1):
        
        cat input
        10175	1
        10179	1
        10189	1
        10191	1
        10201	1
        10243	1
        10249	1
        10262	1
        10313	1
        10414	1
        10485	1
        10499	1


        Code:
        awk 'NR == 1 {max=$1 ; min=$1} $1 >= max {max = $1} $1 <= min {min = $1} END { print min"\t"max}' input | awk '{ print $1, i=$1+100;while(i++<$2) print i, i+=99}' > intermediate
        Code:
        cat intermediate
        
        10175 10275
        10276 10375
        10376 10475
        10476 10575
        Now, consider the above intermediate file and run the following code

        Code:
        awk 'NR==FNR{
           C[NR]=$1 " " $2
           L[C[NR]]=0
           next
        }
        {
         for (t in C) {
            split(C[t],v," ")
            if($1>=v[1] && $1<=v[2])
               L[C[t]]+=$2
         }
        }
        END {
           for(i=1;i in C;i++)
               print C[i] " " L[C[i]]
        }' intermediate input > output

        Code:
        cat output
        
        10175 10275 8
        10276 10375 1
        10376 10475 1
        10476 10575 2


        ###########################################


        Code:
        Solution 2 (rounding the minimum of column 1 down, and the maximum up, to the nearest 100):
        
        cat input
        10175	1
        10179	1
        10189	1
        10191	1
        10201	1
        10243	1
        10249	1
        10262	1
        10313	1
        10414	1
        10485	1
        10499	1
        Code:
        awk '{       
            min=$1<min||!min?$1:min
            max=$1>max||!max?$1:max
        }      
        END {
          s=int(min/100)*100
          e=int(max/100)*100+100
          print s " " s+100
          for(i=s+101;i<e;i+=100)
             print i " " i+99
        }' input > intermediate
        Code:
        cat intermediate
        10100 10200
        10201 10300
        10301 10400
        10401 10500

        Now, consider the above intermediate file and run the following code

        Code:
        awk 'NR==FNR{
           C[NR]=$1 " " $2
           L[C[NR]]=0
           next
        }
        {
         for (t in C) {
            split(C[t],v," ")
            if($1>=v[1] && $1<=v[2])
               L[C[t]]+=$2
         }
        }
        END {
           for(i=1;i in C;i++)
               print C[i] " " L[C[i]]
        }' intermediate input > output
        Code:
        cat output
        
        10100 10200 4
        10201 10300 4
        10301 10400 1
        10401 10500 3



        • #19
          Originally posted by rkk View Post
          Hello,

          I have a file like the following

          chr1 1234
          chr1 2345
          chr2 94837
          chr2 73457

          how can I split this data into two files

          chr1.txt

          chr1 1234
          chr1 2345

          chr2.txt

          chr2 94837
          chr2 73457

          Thanks in advance.
          What about a simple grep ?

          grep 'chr1' FILE > chr1.txt
          grep 'chr2' FILE > chr2.txt
          Francois Sabot, PhD

          Be realistic. Demand the Impossible.
          www.wikiposon.org



          • #20
            Originally posted by francois.sabot View Post
            What about a simple grep ?

            grep 'chr1' FILE > chr1.txt
            grep 'chr2' FILE > chr2.txt
            Francois,

            grep is a handy tool, but you have to repeat that command for each chromosome in your first column. With awk, a single command does the whole task.

            After all, it's a life worth counting on the clock. No one wants to sit there typing each chromosome, at least not me.



            • #21
              Originally posted by gokhulkrishnakilaru View Post
              Code:
              awk '{print > $1".txt"}' input
              This is the correct and best answer to the original question of the thread. The other awk command that was posted at almost the same time has a space in the output file name after "$1". That should not change anything, but if you got an error, try it as quoted here.
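One possible cause of an error with either command, offered only as a guess: the gawk manual notes that an unparenthesized concatenation after `>` is ambiguous, and some awk implementations parse it differently. A minimal sketch that sidesteps this by building the file name in a variable first; the function name `split_by_col1` and the variable `out` are hypothetical, and the input is assumed to be whitespace-separated with the chromosome in column 1.

```shell
# Hypothetical helper: write each line to a file named after its first column.
split_by_col1() {
    awk '{
        out = $1 ".txt"   # build the output name first, e.g. chr1.txt
        print > out       # the file stays open, so later lines are appended
    }' "$@"
}

# Usage in a scratch directory, so no existing files are overwritten:
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'chr1 1234\nchr1 2345\nchr2 94837\nchr2 73457\n' > input
split_by_col1 input
cat chr1.txt
```

With a very large number of distinct values in column 1, some awk implementations may run out of open files; in that case the usual workaround is to append with `>>` and call `close(out)` after each print.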

              As for the second problem, since you already know the resolution you want you don't need to compute min and max. Everything in one step:

              Code:
              awk '{bin[int($1/100)]+=$2}END{for (i in bin)print i*100+1"-"(i+1)*100,bin[i]}' input
              This line should give exactly the output you want. Pipe it into sort -n if needed (the order of a for-in loop in awk is not guaranteed) and/or change the separator "-".
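A sketch of the same one-pass idea as a reusable function, with two assumptions that are mine rather than the thread's: positions are positive integers, and a position that is an exact multiple of 100 (e.g. 10200) should land in the bin that ends on it (10101-10200), whereas `int($1/100)` alone would put it in the next bin. The function name `bin_counts` and the `width` variable are hypothetical.

```shell
# Hypothetical helper: sum column 2 over fixed-width bins of column 1.
bin_counts() {
    awk -v width=100 '
        { bin[int(($1 - 1) / width)] += $2 }    # 10101..10200 -> bin 101
        END {
            for (i in bin)                       # for-in order is unspecified
                printf "%d-%d %s\n", i * width + 1, (i + 1) * width, bin[i]
        }' "$@" | sort -n                        # order bins by start position
}

printf '10175 1\n10200 1\n10201 1\n' | bin_counts
# prints:
# 10101-10200 2
# 10201-10300 1
```

Only bins that contain at least one position are printed, so empty bins inside the range do not appear in the output.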



              • #22
                Originally posted by syfo View Post
                The other awk command that was posted at almost the same time has a space in the output file name after "$1". That should not change anything, but if you got an error, try it as quoted here.
                The space in $1 ".txt" is perfectly valid and cannot cause any problems. When you concatenate strings in awk, you separate them by spaces in the right-hand side: http://www.gnu.org/software/gawk/man...atenation.html
                Leaving the space out in this case does not cause a problem, but it is better practice to put a space between concatenated strings. For example, if you concatenate several awk variables, you must put a space between them: v3=v1 v2. Of course, v3=v1v2 will not work, because awk reads v1v2 as the name of a single variable.
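A short illustration of the point about spaces. The variable names are made up for the example, and the wrapper function `concat_demo` is hypothetical.

```shell
concat_demo() {
    awk 'BEGIN {
        v1 = "chr"; v2 = "1"
        v3 = v1 v2        # concatenation: v3 is "chr1"
        v4 = v1v2         # a different, uninitialized variable: v4 is ""
        f  = v1".txt"     # a string literal needs no space: f is "chr.txt"
        print v3, "[" v4 "]", f
    }'
}

concat_demo   # prints: chr1 [] chr.txt
```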



                • #23
                  OK good, thanks Alex for the clarification. Both commands should work then; I don't see any reason for an error either. Maybe try \awk instead of awk, in case of some alias or shortcut?

                  Rkk, let me know if there is any issue with my one-liner for your second task.



                  • #24
                    Originally posted by francois.sabot View Post
                    What about a simple grep ?

                    grep 'chr1' FILE > chr1.txt
                    grep 'chr2' FILE > chr2.txt
                    A more generic grep solution could be something like
                    Code:
                    for i in `cut -d" " -f1 input | sort -u`; do grep -w $i input > $i.txt ; done
                    But the awk alternative is better.
