Hello to all! I have a little expirience in bioinformatics but I want to create coverage histogram after mapping. I have: F3.unique.25.2.coverage.unique and R3.unique.25.2.coverage.unique for several runs, also I have F3.unique.csfasta.ma.25.2 and R3.unique.csfasta.ma.25.2 files, but I haven't a some ideas what I can do next Please say to me what i need to do?
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
It depends on what tools you have available, what organism you are sequencing and what you want to accomplish. Those 'coverage' files should be in the format of one line per base with each line being the coverage count. Thus you could import these into a spreadsheet program and draw graphs. Or toss the data into a genome browser that is loaded with all sorts of good stuff about your organism. Or ... well, there are lots of possibilities. The ma (match) files can be used for SNP calling, InDel detection, or transcriptome counting. What you need to do depends on what you want to get done.
-
Originally posted by nedoluzhko View PostHello to all! I have a little expirience in bioinformatics but I want to create coverage histogram after mapping. I have: F3.unique.25.2.coverage.unique and R3.unique.25.2.coverage.unique for several runs, also I have F3.unique.csfasta.ma.25.2 and R3.unique.csfasta.ma.25.2 files, but I haven't a some ideas what I can do next Please say to me what i need to do?
reads and the coordinates of the alignment:
Use that to keep track of how much coverage you have per each base.
Once you are done, dump the data in a fasta file:
>chr1
0
2
.
.
Each line represent the coverage you have for that genome at that position.
Once you have that file you can display that easily. I would use either gnuplot
or circos.
-drd-drd
Comment
-
SOLiD number of reads mapped in whole exome target locations
Hi,
I wrote a simple perl code to get the number of reads were mapped (SOLiD Corona_lite) in the target location. The perl code is working perfectly, but its taking long time to run with whole exome target locations (more than 5 days with 17million reads). If anybody can you help me to improve this code or easy way to find out the number of reads were mapped into the target location, then it will be great?
I copied the example input files for the perl program,,
*********example of whole exome target location SureSelect_All_Exon_G3362_with_names.v2.bed**********
chr exon_start exon_end totalBase ID
1 20138 20258 120 gb|BC048429
1 58932 59892 960 entg|OR4F5:ccds|CCDS30547.1
1 860956 861196 240 entg|SAMD11:ccds|CCDS2.2
1 864267 864387 120 entg|SAMD11:ccds|CCDS2.2
1 864490 864730 240 entg|SAMD11:ccds|CCDS2.2
2 9016029 9016269 240 entg|MBOAT2:ccds|CCDS1660.1
2 9061053 9061173 120 entg|MBOAT2:ccds|CCDS1660.1
3 15591480 15591600 120 entg|HACL1:ccds|CCDS2627.1
3 15596368 15596608 240 entg|HACL1:ccds|CCDS2627.1
3 15599383 15599503 120 entg|HACL1:ccds|CCDS2627.1
4 47648088 47648304 216 entg|CNGA1:ccds|CCDS43226.1
4 47713576 47713696 120 entg|NIPAL1:ccds|CCDS3479.1
4 47721794 47722154 360 entg|NIPAL1:ccds|CCDS3479.1
************************************************
******head of SureSelect_20091218_4_F3.unique.csfasta.ma.50.6*****************
>1_48_337_F3,3_137353925.5
T00120201000200020311112100111100230002023120222022
>1_49_100_F3,2_-86591224.5
T31202300221201300132312132112130032010203103221031
>1_50_326_F3,20_-43849935.6
T01311332010321311110222301111321200112222200011020
>1_51_297_F3,17_-26283180.4
T33003203132010100002232223221303000113022300222002
>1_54_263_F3,5_-102923763.3
T31120203312212121300220132222022210312001202012020
******************************************************
********targetMaps.pl**********
#usr/bin/perl
#AGRV[0] is the whole exome target locations file
open(FILE,$ARGV[0]) || die "unable to open file $ARGV[0]";
while (<FILE>)
{
$line = $_;
chomp $line;
$a[$i] =$line ;
$i++;
}
open (FILE2,$ARGV[1]) || die " unable to open the file $ARGV[1]";
#ARGV[1] is the .unique.csfasta.ma file
$count = 0; while (<FILE2>)
{
$line = $_;
chomp $line;
#>1303_25_81_F3,12_2661894.5
if($line =~ /^>\S+\_\S+\_\S+\_\S+\,(\S+)\_(\S+)\.\d+/)
{
$chr = $1;
$pos = $2;
if($pos =~ /^-(\d+)/){$pos = $1};
# print "$chr\t$pos\n";
for($j=0; $j<=$i;$j++)
{
$capture = $a[$j];
@d = split (/\t/, $capture);
if(($chr eq $d[0]) && ($pos >= $d[1] && $pos <= $d[2]))
{
# print "$chr\t$pos\n";
# print "$d[0]\t$d[1]\t$d[2]\n";
$count++;
}
}
}
}
print "\n Total reads mapped to target locations : $count \n";
*******************************
Perl program run :
perl targetMaps.pl SureSelect_All_Exon_G3362_with_names.v2.bed SureSelect_20091218_4_F3.unique.csfasta.ma.50.6
Thanks in advance.
Thanks,
ShibuLast edited by shibujohn; 03-31-2010, 03:07 AM.
Comment
-
17million reads compared to even a 1000 feature target file yields 17 billion comparisons in your loop in the worst case. 10000 features = 170 billion comparisons, etc.
You could implement a binning approach (see Kent et al 2002, the SAMTools paper, etc. for details). Alternatively, you could use intersectBed or coverageBed in BEDTools or use Galaxy.
This should run in a minute or so with BEDTools.
Comment
-
Help with .csfasta.ma parsing
Hi,
I understand from this thread that you're pretty familiar with .csfasta.ma format.
I'd like to convert it to a simple .bed.
Can someone help me with that. I see how you get the position in the perl script, but how do I understand where that position is in the reference?
Any help would be appreciated.
Thanks!
Comment
Latest Articles
Collapse
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
-
by seqadmin
Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...-
Channel: Articles
03-22-2024, 06:39 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 04-11-2024, 12:08 PM
|
0 responses
30 views
0 likes
|
Last Post
by seqadmin
04-11-2024, 12:08 PM
|
||
Started by seqadmin, 04-10-2024, 10:19 PM
|
0 responses
32 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 10:19 PM
|
||
Started by seqadmin, 04-10-2024, 09:21 AM
|
0 responses
28 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 09:21 AM
|
||
Started by seqadmin, 04-04-2024, 09:00 AM
|
0 responses
53 views
0 likes
|
Last Post
by seqadmin
04-04-2024, 09:00 AM
|
Comment