So my sequence data looks something like this. What I want to do is split up each end of 100 bp paired end read into two files and trick the aligner into thinking it is a paired end read. The only problem is all I don't know how to program very well.
So I have one file which looks like this
@HWI-ST375_119:5:2308:19782:7202/1
AGATCACCTGATCTTGACGAAATTTCCAACTTATAACTATTACATTGTTCTTAGCCGCTATTTGTTACACTTTTTTTTTACTTTTCTTATCTTTAATTA
+
7@;DD>DDHD>FABGH<F>F<G=H4A?+4CFE4BDA@CFC>B9FGHEF<<F@D>4B66:5;@GHEHGIGIEEHCE?@BB95>C>@CCCCA@>>>ACC:>
@HWI-ST375_119:5:2305:20392:171561/1
CTTAATGTTTATATACTATGTTGCTATTATGAACTGAACTTATGTATATGTACCTCTAGCCTACTACAATTCGACCATCCTGACATTTAAGATCACCTG
+
@@@FDFFFGHDHFFGGIEHHEAFEHEHIHBHHIFHEHIGI@DE@FF?DD9?H@DFHEHIJJIGGCJGIJE>BFIF@@ADCAEFHEHGADCDDFECECCC
@HWI-ST375_119:5:2105:14681:149697/1
GAGAAGGTATTGAAAAATGATAGGTTTTTAATTAAAGATGTTGAAGGTTTTCAATTGGCGCGAAATCCATATCAAGGTGTTTGGAGTAGTCAAAATAAT
+
B@CFFFFDHHHHHJJJJJJIJJJJHJJJJJIJJIJJJIJJJJJJJJJFIIJJIJJJJJJJJJJHHHFFFFFFFEECECCDDDDDDD@CDFDDDDDDDDE
@HWI-ST375_119:5:1202:2525:15121/1
CTAGGGAGTAGAAAGTAAATTTGTATAATTAAAGATAAGAAAAGTAAAAAAAAAGGGTAACAAATAGCGGCTAAGAACAATGTAATAGTTATAAGTTGG
+
Basically what you see is the name of the read the thing that starts with HWI
Then there's the sequence that it decided the DNA was, the big block of CATG
Then there is a +
Then there is a bunch of characters which tell the computer how good looking each of those bases are so it can decide to use the data or ignore it if it is bad
What I want to do is split the data I have in half and align each of them separately. Another thing I want to do is rename the reads after I split them
What I really want to do is rename the read generated by the last 50 bases and you see how there is a /1 at the end of the name? I think that needs to become a /2 or vice versa if I use the second read as the input file.
Then I can do a paired end alignment and if all goes well I should have the data displayed so I can see indel containing reads as a paired end read on IGV and count them on there.
So what I think needs to happen is
@HWI-ST375_119:5:2308:19782:7202/1
AGATCACCTGATCTTGACGAAATTTCCAACTTATAACTATTACATTGTTCTTAGCCGCTATTTGTTACACTTTTTTTTTACTTTTCTTATCTTTAATTA
+
7@;DD>DDHD>FABGH<F>F<G=H4A?+4CFE4BDA@CFC>B9FGHEF<<F@D>4B66:5;@GHEHGIGIEEHCE?@BB95>C>@CCCCA@>>>ACC:>
Needs to become
@HWI-ST375_119:5:2308:19782:7202/1
AGATCACCTGATCTTGACGAAATTTCCAACTTATAACTATTACATTGTT
+
7@;DD>DDHD>FABGH<F>F<G=H4A?+4CFE4BDA@CFC>B9FGHEF<
and
@HWI-ST375_119:5:2308:19782:7202/2
CTTAGCCGCTATTTGTTACACTTTTTTTTTACTTTTCTTATCTTTAATTA
+
<F@D>4B66:5;@GHEHGIGIEEHCE?@BB95>C>@CCCCA@>>>ACC:>
In two separate output files. I cannot do this by hand obviously.
If anyone could help me out, that would be awesome.
My sample is DNA and my indels are hundreds of kilobases and rare (but they are real, it isn't noise). If anyone could suggest a structural variant detector that is suitable for these kinds of indels I will be happy to try those as well.
So I have one file which looks like this
@HWI-ST375_119:5:2308:19782:7202/1
AGATCACCTGATCTTGACGAAATTTCCAACTTATAACTATTACATTGTTCTTAGCCGCTATTTGTTACACTTTTTTTTTACTTTTCTTATCTTTAATTA
+
7@;DD>DDHD>FABGH<F>F<G=H4A?+4CFE4BDA@CFC>B9FGHEF<<F@D>4B66:5;@GHEHGIGIEEHCE?@BB95>C>@CCCCA@>>>ACC:>
@HWI-ST375_119:5:2305:20392:171561/1
CTTAATGTTTATATACTATGTTGCTATTATGAACTGAACTTATGTATATGTACCTCTAGCCTACTACAATTCGACCATCCTGACATTTAAGATCACCTG
+
@@@FDFFFGHDHFFGGIEHHEAFEHEHIHBHHIFHEHIGI@DE@FF?DD9?H@DFHEHIJJIGGCJGIJE>BFIF@@ADCAEFHEHGADCDDFECECCC
@HWI-ST375_119:5:2105:14681:149697/1
GAGAAGGTATTGAAAAATGATAGGTTTTTAATTAAAGATGTTGAAGGTTTTCAATTGGCGCGAAATCCATATCAAGGTGTTTGGAGTAGTCAAAATAAT
+
B@CFFFFDHHHHHJJJJJJIJJJJHJJJJJIJJIJJJIJJJJJJJJJFIIJJIJJJJJJJJJJHHHFFFFFFFEECECCDDDDDDD@CDFDDDDDDDDE
@HWI-ST375_119:5:1202:2525:15121/1
CTAGGGAGTAGAAAGTAAATTTGTATAATTAAAGATAAGAAAAGTAAAAAAAAAGGGTAACAAATAGCGGCTAAGAACAATGTAATAGTTATAAGTTGG
+
Basically what you see is the name of the read the thing that starts with HWI
Then there's the sequence that it decided the DNA was, the big block of CATG
Then there is a +
Then there is a bunch of characters which tell the computer how good looking each of those bases are so it can decide to use the data or ignore it if it is bad
What I want to do is split the data I have in half and align each of them separately. Another thing I want to do is rename the reads after I split them
What I really want to do is rename the read generated by the last 50 bases and you see how there is a /1 at the end of the name? I think that needs to become a /2 or vice versa if I use the second read as the input file.
Then I can do a paired end alignment and if all goes well I should have the data displayed so I can see indel containing reads as a paired end read on IGV and count them on there.
So what I think needs to happen is
@HWI-ST375_119:5:2308:19782:7202/1
AGATCACCTGATCTTGACGAAATTTCCAACTTATAACTATTACATTGTTCTTAGCCGCTATTTGTTACACTTTTTTTTTACTTTTCTTATCTTTAATTA
+
7@;DD>DDHD>FABGH<F>F<G=H4A?+4CFE4BDA@CFC>B9FGHEF<<F@D>4B66:5;@GHEHGIGIEEHCE?@BB95>C>@CCCCA@>>>ACC:>
Needs to become
@HWI-ST375_119:5:2308:19782:7202/1
AGATCACCTGATCTTGACGAAATTTCCAACTTATAACTATTACATTGTT
+
7@;DD>DDHD>FABGH<F>F<G=H4A?+4CFE4BDA@CFC>B9FGHEF<
and
@HWI-ST375_119:5:2308:19782:7202/2
CTTAGCCGCTATTTGTTACACTTTTTTTTTACTTTTCTTATCTTTAATTA
+
<F@D>4B66:5;@GHEHGIGIEEHCE?@BB95>C>@CCCCA@>>>ACC:>
In two separate output files. I cannot do this by hand obviously.
If anyone could help me out, that would be awesome.
My sample is DNA and my indels are hundreds of kilobases and rare (but they are real, it isn't noise). If anyone could suggest a structural variant detector that is suitable for these kinds of indels I will be happy to try those as well.
Comment