
**wilhelml** · 01-07-2011, 11:13 AM

Hi,

I can't help you regarding your first and third question, however this is how we handle question 2...

Split the soap output file into one file for each scaffold like so:

awk '{ print $0 >> ($8 ".soap") }' alignment

(in this example the soap alignment file is titled 'alignment' and you'll wind up with a file for each scaffold such as scaffold_1.soap).

Then you could run this shell script (requires tcsh I believe)

foreach file (*.soap)
    sort -k9,9n $file > $file.sorted
end
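
For anyone without tcsh, the same loop can be sketched in plain POSIX sh. This is only an illustration, not part of the original post: `sort_soap_dir` is a made-up helper name, and it assumes the SOAP files are tab-separated (as the Perl script below also assumes) with the position in column 9.

```shell
#!/bin/sh
# Sort every per-scaffold *.soap file in a directory by position (column 9).
# sort_soap_dir is a hypothetical helper name, not part of SOAP.
sort_soap_dir() {
    dir=${1:-.}
    tab=$(printf '\t')
    for file in "$dir"/*.soap; do
        # Skip the unexpanded pattern when the directory has no .soap files.
        [ -e "$file" ] || continue
        # -t restricts field splitting to tabs; -k9,9n compares column 9 numerically.
        sort -t "$tab" -k9,9n "$file" > "$file.sorted"
    done
}
```

Calling `sort_soap_dir .` in the directory that holds the split files should leave a `scaffold_N.soap.sorted` next to each `scaffold_N.soap`.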

Or this perl script...
(no promises that this is the simplest most concisely written piece of code in the world but I know it works)

----------------------------
#!/usr/bin/perl
use strict;
use warnings;

# Sort a SOAP alignment output file by position (9th column).
# Input lines are tab-separated and look like this:
# 4_100_10086_4044\1 TAACGGTAATTTTTTTTAAAAAACAAAATCATATTTTCCT ffddcffff`dfffffefffffffffffafdeffcdffff 1 a 40 - scaffold_9 1593868 0 40M 40

print STDERR "start at: ".localtime()."\n";

my $f = $ARGV[0] or die "usage: sort_soap.pl <soap alignment file>\n";

# Keep one [position, line] pair per alignment. Keying a hash on the read ID
# would silently drop lines whenever an ID occurs more than once.
my @rows;
open(my $fh, '<', $f) or die "can't open input file: $f $!\n";
while (my $line = <$fh>){
    my @l = split(/\t/, $line);
    push @rows, [ $l[8], $line ];
}
close $fh;
print STDERR scalar(@rows)." lines read from $f\n";

# Numeric sort on the stored position, then print the original lines.
foreach my $row (sort { $a->[0] <=> $b->[0] } @rows){
    print $row->[1];
}

print STDERR "sort_soap.pl completed sorting $f successfully at ".localtime()."\n";

---------------------------

Larry Wilhelm
Bioinformatics
Oregon State University

**superligang** · 01-07-2011, 11:31 AM

Interesting, but I don't quite understand what scaffold means in this context.
I hope that soapsnp handles the overlapping part of the two reads of a pair reasonably.
Thank you.

**wilhelml** · 01-07-2011, 11:47 AM

'Scaffold' doesn't mean too much in this context actually. It just happens to be how the sequences to which I aligned my reads were named.

After reading your original post more carefully I realize I'm not really addressing your issue as this was all done with single end reads. I suspect it should still work though.

Your situation with a 35bp insert and read lengths of apparently > 35bp is not something I've ever confronted before. Is this really the case? I don't quite understand why a library would be constructed like this.

Larry W.

**superligang** · 01-07-2011, 11:57 AM

Originally posted by wilhelml:

Your situation with a 35bp insert and read lengths of apparently > 35bp is not something I've ever confronted before. Is this really the case? I don't quite understand why a library would be constructed like this.
Larry W.

I don't know if I understand the insert size correctly, but I use insert size to mean the distance between the two outermost bases of a pair of reads.
For reads 35bp long, the minimal insert size would be 35, when the two reads of a pair overlap completely. The insert size is larger than 70bp if there is a gap between the two reads.
Thanks.
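
Under the definition used in this post (insert size = distance between the outermost bases of the pair), the overlap between the two reads follows directly from the read length. A rough sketch, where `pair_overlap` is a hypothetical helper and both reads are assumed to have the same length:

```shell
#!/bin/sh
# pair_overlap SPAN READ_LEN
# Print how many bases the two reads of a pair share, given the outer
# distance spanned by the pair and the (common) read length.
# Hypothetical helper for illustration only; assumes SPAN >= READ_LEN.
pair_overlap() {
    span=$1
    read_len=$2
    overlap=$(( 2 * read_len - span ))
    # A negative value means there is a gap between the reads, so nothing overlaps.
    if [ "$overlap" -lt 0 ]; then
        overlap=0
    fi
    echo "$overlap"
}
```

With 35bp reads, a span of 35 gives a 35-base overlap (complete overlap), a span of 50 gives 20, and anything above 70 gives 0.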

**wilhelml** · 01-07-2011, 12:15 PM

I don't think that is correct actually, regarding insert size.

The DNA is fragmented and then adapters are ligated. The size of the fragment to which the adapters are ligated is your total size. So if you cut 400Kb fragments from a gel and attach adapters to that, then sequence 80mers, you'll have an actual insert size of 400-(80*2)=240. Though, Soap wants the insert size as the total size so you'd use the 400 number.
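
The subtraction above can be written as a tiny shell sketch, with all lengths in nucleotides. `inner_distance` is just an illustrative name, and it assumes both reads have the same length:

```shell
#!/bin/sh
# inner_distance FRAGMENT_LEN READ_LEN
# Unsequenced stretch between the two reads of a pair: the fragment length
# minus the two read lengths. Hypothetical helper, not part of SOAP.
inner_distance() {
    fragment_len=$1
    read_len=$2
    echo $(( fragment_len - 2 * read_len ))
}

inner_distance 400 80   # prints 240
```

The number handed to Soap would still be the full fragment size (400 here), as described above.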

So the question becomes: To what size of DNA fragment were the adapters ligated? And.. What is the length that was sequenced?

Also, what technology was used for sequencing?

This article may shed light...

http://www.454.com/downloads/protocols/1_paired_end.pdf

Disclaimer: I've never actually made libraries, I work entirely on the software side of things, just trying to help a little.

**wilhelml** · 01-07-2011, 12:18 PM

oops... I didn't mean 400Kb fragments, I meant just 400nt.

And I guess I can assume your read length is 35.

**superligang** · 01-07-2011, 01:24 PM

I don't run experiments in biology either.

I downloaded this DNA data, and I guess RNA-seq is probably different from DNA-seq. When I aligned some RNA-seq data, the two reads of a pair frequently overlapped with each other.
soapsnp uses 400 as the default minimal insert size, so the overlapping part is not a problem in their DNA data.
Thank you for your clarification.


Questions about soapaligner and soapsnp
