SEQanswers

Old 06-13-2012, 01:32 PM   #1
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Question about editing a tabbed text file

I have a tabbed text file with several rows and several columns.

If x rows have the same contents in columns A, B, and G, I want to delete (x-1) of those rows entirely and keep only one (a sort of duplicate removal).

Is there a command in Linux or Windows to do that?

Old 06-13-2012, 02:00 PM   #2
Heisman
Senior Member

Location: St. Louis

Join Date: Dec 2010
Posts: 534

You might have to move some fields around, but look into using "sort" and "uniq". Is this a homework question?

Old 06-13-2012, 02:31 PM   #3
shyam_la

No. It's for an annotated mutation list I generated with SnpEff. Most mutation loci have been assigned four or five lines each, because multiple transcripts are known to occur at that locus. I want to get rid of the redundancy when the effect of the mutation is the same irrespective of transcript.

Old 06-13-2012, 02:39 PM   #4
Heisman

There's probably a better way but you could use awk like this:

awk '{print $0"\t"$A","$B","$G}' [input_file] | sort

and then pipe that into uniq and use the -f option. That might work although I'm not sure you could easily specify which transcript you keep.

Old 06-13-2012, 04:15 PM   #5
shyam_la

I am sorry, but I don't code. I am not a bioinformatician; I am an MD working as a research associate, and I have no specialised computer training.

I did this:

$ awk '{print $0"\t"$A","$B","$J","$P}' out.txt | sort > out_mod.txt

$ uniq --skip-fields=1 out_mod.txt out2.txt

awk did appear to have sorted the file by A, then B, then J, then P, but it also messed things up by copying all the columns A to U over and over, six times, side by side.

uniq didn't do anything.

Is my syntax correct?

Thank you.
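
A note on the awk attempt above: awk addresses fields by number ($1, $2, and so on), so $A is read as $ applied to an unset variable named A, which evaluates to $0, the whole line. That would explain the repeated columns. The sort-then-uniq idea itself can be sketched in Python as below. This is only an illustration: the function name dedupe_sorted and the toy rows are made up, and the key columns are given as zero-based positions.

```python
from itertools import groupby

def dedupe_sorted(lines, key_columns):
    """Sort lines by the chosen tab-separated columns and keep one line per key."""
    def key(line):
        fields = line.rstrip("\r\n").split("\t")
        # Columns missing from a short row are treated as empty strings.
        return tuple(fields[i] if i < len(fields) else "" for i in key_columns)

    ordered = sorted(lines, key=key)  # like `sort`, but only on the key columns
    # groupby yields runs of adjacent lines with equal keys, much like `uniq`;
    # the first line of each run is kept.
    return [next(group) for _, group in groupby(ordered, key=key)]

# Toy rows (invented): the first two share the same first three columns.
rows = [
    "1\t100\tx\tNM_1\n",
    "1\t100\tx\tNM_2\n",
    "1\t200\ty\tNM_3\n",
]
print(dedupe_sorted(rows, key_columns=[0, 1, 2]))
```

Because Python's sort is stable, the line kept for each key is the one that came first in the input, which gives some control over which transcript survives.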

Old 06-13-2012, 04:24 PM   #6
Heisman

I see. Is it possible for you to do this in Excel? If you have a bunch of different files, you can concatenate them in Linux and then have one file to work with in Excel. If you want to add a column to each file with a sample-specific ID, that can be done pretty easily in Linux before concatenating the files and loading them into Excel.

Otherwise, could you provide a few sample lines from one of your files that has lines you want to keep and get rid of? And then for those sample lines also provide a smaller subset of lines with the desired output? I could probably write a quick bash script to do it.
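
The "add a sample ID column, then concatenate" step could look roughly like this in Python. It is a sketch under assumptions: the function name tag_and_concatenate and the sample IDs are invented, and the ID is simply prepended as a new first column.

```python
import sys

def tag_and_concatenate(samples):
    """samples maps a sample ID to an iterable of tab-separated lines."""
    combined = []
    for sample_id in sorted(samples):
        for line in samples[sample_id]:
            # Prepend the sample ID as a new first column.
            combined.append(sample_id + "\t" + line.rstrip("\r\n") + "\n")
    return combined

# Toy input (invented values), one list of lines per sample.
demo = {
    "sampleA": ["1\t100\tG\tA\n"],
    "sampleB": ["2\t200\tC\tT\n"],
}
for line in tag_and_concatenate(demo):
    sys.stdout.write(line)
```

Note that a prepended column shifts every existing column one position to the right, so any later duplicate check has to use the shifted positions.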

Old 06-13-2012, 04:28 PM   #7
ucpete
Member

Location: San Francisco Bay Area

Join Date: Dec 2008
Posts: 35

I'd just do it in Python. Let's say you have 10 fields on each line and you care about the 1st, 2nd, and 7th in terms of defining uniqueness.

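Code:
# Keep the first line seen for each combination of the 1st, 2nd, and 7th fields.
inf = open("yourfile.txt")
outf = open("yourfile_unique.txt",'w')
uniqueValues = {}
for line in inf:
    fields = line.strip().split('\t')
    keyTuple = (fields[0],fields[1],fields[6])
    if keyTuple not in uniqueValues:
        uniqueValues[keyTuple] = None
        outf.write(line)

# Close both files so the output is flushed to disk.
inf.close()
outf.close()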
Yup. That'll do it.

EDIT: The indentations were all off-- I cleaned up a bit.

Last edited by ucpete; 06-13-2012 at 04:39 PM.

Old 06-13-2012, 04:45 PM   #8
shyam_la

Yes, I have been using excel to view my results. I have only one sample in so far. So, there aren't multiple files to merge.. Just one.

Just one list of mutations. I am experimenting with the different tools and callers to get a pipeline at the moment. Using the Exome manual here for pre processing and MuTect from Broad gave excellent mutation calls. After annotation, the type of mutations expected (UV signature) were found in huge amounts and also some of the genes to be mutated in this type of tumor were found mutated. I think I have a viable pipeline to run things through, once more sequences start coming in..

Anyway, story aside - few lines as you asked..

1 1653142 G A SNP Hom CDK11B.1 CDK11B mRNA NM_033493 NM_033493.ex.18 3 SYNONYMOUS_CODING D/D gaC/gaT 40 1 2310
1 1653142 G A SNP Hom CDK11B.1 CDK11B mRNA NM_033492 NM_033492.ex.18 3 SYNONYMOUS_CODING D/D gaC/gaT 40 1 2337
1 1653142 G A SNP Hom CDK11B.1 CDK11B mRNA NM_033486 NM_033486.ex.18 3 SYNONYMOUS_CODING D/D gaC/gaT 40 1 2343
1 1653142 G A SNP Hom CDK11B.1 CDK11B mRNA NM_033487 NM_033487.ex.16 3 UTR_5_PRIME: 380 bases from TSS


There are columns A to U in there. If columns A, B, J, O, P, S, and T are the same, as in the first three lines of the example above, I want only one line to be retained and the remaining two to be discarded.

Thank you.

PS: Three columns are mostly empty; that's why you see fewer than U columns there.

Last edited by shyam_la; 06-13-2012 at 04:53 PM. Reason: To make it clear, why there were fewer columns,..

Old 06-13-2012, 04:48 PM   #9
shyam_la

I don't code; I'm not a programmer. I just installed Python 2.7.
I can do it only if you are willing to take me through it step by step!! :P

Old 06-13-2012, 04:55 PM   #10
ucpete

Type "python" from within the directory containing your file. Enter the above line-by-line, replacing "yourfile.txt" with whatever your file name is, and give it a descriptive output file name as well (not "yourfile_unique.txt"). Hit ctrl-d to exit python and boom, you got what you wanted. Then go to the python website and look for the basic tutorials. Once you've read a little bit there, come back and try to understand the code above. I'm a biologist and it took me about one week of practice to be able to write code like that above to accomplish simple tasks quickly. If you're not using computers to do your research, you're probably doing it wrong.
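
For repeated use, the same steps can be saved in a file and called as a function instead of being typed at the prompt. A minimal sketch, assuming the 1st, 2nd, and 7th fields as the key; the function name dedupe_file and the file names in the usage note are hypothetical.

```python
def dedupe_file(in_path, out_path, key_columns=(0, 1, 6)):
    """Copy in_path to out_path, keeping the first line seen for each key.

    Returns the number of lines written.
    """
    seen = set()
    kept = 0
    with open(in_path) as inf, open(out_path, "w") as outf:
        for line in inf:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) <= max(key_columns):
                continue  # blank or short line: skip it instead of crashing
            key = tuple(fields[i] for i in key_columns)
            if key not in seen:
                seen.add(key)
                outf.write(line)
                kept += 1
    return kept
```

With this saved and loaded, a call such as dedupe_file("out.txt", "out_unique.txt") would do the whole job in one step. Skipping short lines is a design choice made here for robustness; it silently drops them from the output, which may or may not be what you want.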

Old 06-13-2012, 05:20 PM   #11
shyam_la

I believe it's Ctrl-Z in Windows. Anyway, I did what you said, line by line, and got this:

Traceback (most recent call last):
File "<stdin>", line 3, in <module>
IndexError: list index out of range

"If you're not using computers to do your research, you're probably doing it wrong" - agreed. I am using them; I am just not coding yet.

Old 06-13-2012, 05:24 PM   #12
ucpete

You'll be coding soon! It seems like your file might have a header line. Try this instead:

Code:
inf = open("yourfile.txt")
outf = open("yourfile_unique.txt",'w')
uniqueValues = {}
# Lines with fewer than 7 fields (such as a short header) are skipped.
for line in inf:
    fields = line.strip().split('\t')
    if len(fields) > 6:
        keyTuple = (fields[0],fields[1],fields[6])
        if keyTuple not in uniqueValues:
            uniqueValues[keyTuple] = None
            outf.write(line)

inf.close()
outf.close()

Old 06-13-2012, 05:26 PM   #13
shyam_la

There is no header line. At least none that is visible in Excel.

Old 06-13-2012, 05:29 PM   #14
shyam_la

This is the error now:

Traceback (most recent call last):
File "<stdin>", line 4, in <module>
IndexError: list index out of range

Old 06-13-2012, 05:33 PM   #15
ucpete

I'm guessing now that your file isn't tab-delimited but space-delimited. Try changing the split('\t') call to just split(). I can't really help without much more information, sorry.
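
To see what each variant does to empty columns, here is a small illustration on a made-up line (the values are invented, not taken from the file in question):

```python
# Toy line with 5 tab-separated columns; the 3rd and 5th are empty.
line = "chr1\t100\t\tA\t\n"

# Only the line ending is removed: all 5 columns survive, empties included.
print(line.rstrip("\r\n").split("\t"))   # ['chr1', '100', '', 'A', '']

# strip() also removes the trailing tab, so the last empty column is lost.
print(line.strip().split("\t"))          # ['chr1', '100', '', 'A']

# split() with no argument treats any run of whitespace as one separator,
# so every empty column disappears and later columns shift left.
print(line.split())                      # ['chr1', '100', 'A']
```

For a tab-delimited file with empty cells, the first form is the only one of the three that keeps column positions stable from row to row.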

Old 06-13-2012, 05:43 PM   #16
shyam_la

Did that too:

Traceback (most recent call last):
File "<stdin>", line 4, in <module>
IndexError: list index out of range

It's 100% still tab-delimited.

Old 06-13-2012, 05:45 PM   #17
shyam_la

I:\Exome\Annotations>C:\Python27\Python
Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> inf = open("out2.txt")
>>> outf = open("out2mod.txt",'w')
>>> for line in inf:
... fields = line.strip().split()
... if len(fields) > 6:
... keyTuple = (fields[1],fields[2],fields[7],fields[12],fields[13],fields[16],fields[17])
... if keyTuple not in uniqueValues:
... uniq[keyTuple] = None
... outf.write(line)
... ^Z

Traceback (most recent call last):
File "<stdin>", line 4, in <module>
IndexError: list index out of range

PS: I typed the indentations correctly, even though they aren't showing here..
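
One possible reading of the session above, offered as a guess rather than a diagnosis: the len(fields) > 6 check does not protect fields[17], so any row that splits into fewer than 18 fields still raises IndexError, and split() with no argument collapses empty columns, which makes rows shorter still. The session also never defines uniqueValues and assigns to the undefined name uniq, either of which would raise a NameError once a row got past the indexing step. Finally, Python counts from zero, so spreadsheet columns A, B, J, O, P, S, T correspond to positions 0, 1, 9, 14, 15, 18, 19. A sketch that takes these points into account follows; the helper names column_index and dedupe_lines are hypothetical.

```python
def column_index(letter):
    """Convert a spreadsheet column letter (A, B, ..., Z, AA, ...) to a 0-based index."""
    index = 0
    for ch in letter.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1

# Columns named in the thread: A, B, J, O, P, S, T
KEY_COLUMNS = [column_index(c) for c in "ABJOPST"]

def dedupe_lines(lines, key_columns=KEY_COLUMNS):
    """Keep the first line for each key; short rows are padded with empty fields."""
    width = max(key_columns) + 1
    seen = set()
    kept = []
    for line in lines:
        # Remove only the line ending so that empty columns keep their place.
        fields = line.rstrip("\r\n").split("\t")
        fields += [""] * (width - len(fields))  # no-op when the row is long enough
        key = tuple(fields[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

print(KEY_COLUMNS)  # [0, 1, 9, 14, 15, 18, 19]
```

Padding short rows means a missing column is treated the same as an empty one. That is an assumption made here for convenience; if short rows indicate a damaged file, it would be safer to report them than to pad them.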

Old 06-13-2012, 09:02 PM   #18
Heisman

Quote:
Originally Posted by shyam_la View Post
PS: Three columns are mostly empty; thats why you see fewer than U columns there..
So if a column is empty it doesn't give a delimiter? If you use ANNOVAR to annotate variants (which I highly recommend), it will create a .csv file that at least will have a comma for an empty column. Without delimiters it's trickier as column "O" in one row may correspond to column "N" in another, for example.
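
A quick way to check whether every row really carries the same number of delimiters is to count fields per line. The sketch below assumes tab as the delimiter; the function name field_count_histogram and the toy lines are invented.

```python
from collections import Counter

def field_count_histogram(lines, delimiter="\t"):
    """Count how many lines have each number of delimiter-separated fields."""
    return Counter(len(line.rstrip("\r\n").split(delimiter)) for line in lines)

# Toy lines: two with 3 fields (one of them has an empty middle field), one with 2.
toy = ["a\tb\tc\n", "a\t\tc\n", "a\tb\n"]
print(field_count_histogram(toy))
```

If more than one field count shows up for a real file, column positions cannot be trusted across rows, which is exactly the situation where column "O" in one row may line up with column "N" in another.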

Old 06-13-2012, 09:39 PM   #19
shyam_la

It gives a delimiter: Excel and Notepad display it correctly. But when I copy the four lines and paste them here, the gap vanishes.

But there is no character to represent a null entry, if that's what you mean.

Last edited by shyam_la; 06-13-2012 at 09:43 PM.

Old 06-13-2012, 09:46 PM   #20
Heisman

Then I'm curious why you don't just use Excel? It has a Remove Duplicates function where you can select which columns it considers.

All times are GMT -8.