Good morning everyone.
I am a newb and this is my first time touching any sequencing data. It just so happens that the best way to learn is to dive right in. Incidentally I have hit a bit of a stumbling block and was hoping someone here could shed some light on it.
I am trying to extract the raw read counts for transcripts mapped only to the forward strand. The mapping was done using Gsnap (I did not do the mapping.) We did paired-end sequencing using Illumina's HiSeq platform. Again, this was my first time working with sam/bam files and it took some time to understand the formats. Here is what I did:
To get only the forward read counts I took the sam file I was supplied and converted it back to a bam file. Then using the 0x0040 flag I used a command to get just the forward reads and redirected that output to another sam file. Then, looking at column 3 in the sam file, I saw either the name of the transcript or a *. It is my understanding that the "*" is for those reads that did not align. So I wrote a little bash script to filter those out. I was told by a co-worker that using column 3 I could get the raw counts of the mapped transcripts. So, I wrote another little script to do that. I know my script is working correctly, however, my numbers do not match the numbers my co-worker has. In fact, I have counts for some transcripts that my co-worker reported as not detected at all. I know there are tools to automatically get counts, but I want to actually understand these file formats before using some black box of tools.
This is how one line looks in the sam file:
[anitabow@kirby2 spikeIn]$ head -n 35 DR38spikeInForward.sam|tail -n 1
FCC1WG7ACXX:5:1101:3115:1979#GGACCCAT 99 ERCC-00130 851 40 100M = 958 207 CTTGAAGCAGATGATATCCCGATGCTGACGGCTTACAATAAACGTGATCAAAAACTGCCTGATTTTATACCGACCGCCGGAAGGGATCACATTATGGTCA CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJIFFHHDGIJIJDEHJIIJJJJIHHIJJHHHHHHHFFFFFDDDDDDDDDDDDDDDDDDDDDDEDDCCD MD:Z:100 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
As you can see, the third column has the name of one transcript. This is what I have been using to extract my counts. So basically, my script will count every time it sees ERCC-00130 for example.
Thanks in advance for any help.
I am a newb and this is my first time touching any sequencing data. It just so happens that the best way to learn is to dive right in. Incidentally I have hit a bit of a stumbling block and was hoping someone here could shed some light on it.
I am trying to extract the raw read counts for transcripts mapped only to the forward strand. The mapping was done using Gsnap (I did not do the mapping.) We did paired-end sequencing using Illumina's HiSeq platform. Again, this was my first time working with sam/bam files and it took some time to understand the formats. Here is what I did:
To get only the forward read counts I took the sam file I was supplied and converted it back to a bam file. Then using the 0x0040 flag I used a command to get just the forward reads and redirected that output to another sam file. Then, looking at column 3 in the sam file, I saw either the name of the transcript or a *. It is my understanding that the "*" is for those reads that did not align. So I wrote a little bash script to filter those out. I was told by a co-worker that using column 3 I could get the raw counts of the mapped transcripts. So, I wrote another little script to do that. I know my script is working correctly, however, my numbers do not match the numbers my co-worker has. In fact, I have counts for some transcripts that my co-worker reported as not detected at all. I know there are tools to automatically get counts, but I want to actually understand these file formats before using some black box of tools.
This is how one line looks in the sam file:
[anitabow@kirby2 spikeIn]$ head -n 35 DR38spikeInForward.sam|tail -n 1
FCC1WG7ACXX:5:1101:3115:1979#GGACCCAT 99 ERCC-00130 851 40 100M = 958 207 CTTGAAGCAGATGATATCCCGATGCTGACGGCTTACAATAAACGTGATCAAAAACTGCCTGATTTTATACCGACCGCCGGAAGGGATCACATTATGGTCA CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJIFFHHDGIJIJDEHJIIJJJJIHHIJJHHHHHHHFFFFFDDDDDDDDDDDDDDDDDDDDDDEDDCCD MD:Z:100 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
As you can see, the third column has the name of one transcript. This is what I have been using to extract my counts. So basically, my script will count every time it sees ERCC-00130 for example.
Thanks in advance for any help.
Comment