Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Splitting Sequencing Files

    So my sequence data looks something like this. What I want to do is split up each end of 100 bp paired end read into two files and trick the aligner into thinking it is a paired end read. The only problem is all I don't know how to program very well.

    So I have one file which looks like this

    @HWI-ST375_119:5:2308:19782:7202/1
    AGATCACCTGATCTTGACGAAATTTCCAACTTATAACTATTACATTGTTCTTAGCCGCTATTTGTTACACTTTTTTTTTACTTTTCTTATCTTTAATTA
    +
    7@;DD>DDHD>FABGH<F>F<G=H4A?+4CFE4BDA@CFC>B9FGHEF<<F@D>4B66:5;@GHEHGIGIEEHCE?@BB95>C>@CCCCA@>>>ACC:>
    @HWI-ST375_119:5:2305:20392:171561/1
    CTTAATGTTTATATACTATGTTGCTATTATGAACTGAACTTATGTATATGTACCTCTAGCCTACTACAATTCGACCATCCTGACATTTAAGATCACCTG
    +
    @@@FDFFFGHDHFFGGIEHHEAFEHEHIHBHHIFHEHIGI@DE@FF?DD9?H@DFHEHIJJIGGCJGIJE>BFIF@@ADCAEFHEHGADCDDFECECCC
    @HWI-ST375_119:5:2105:14681:149697/1
    GAGAAGGTATTGAAAAATGATAGGTTTTTAATTAAAGATGTTGAAGGTTTTCAATTGGCGCGAAATCCATATCAAGGTGTTTGGAGTAGTCAAAATAAT
    +
    B@CFFFFDHHHHHJJJJJJIJJJJHJJJJJIJJIJJJIJJJJJJJJJFIIJJIJJJJJJJJJJHHHFFFFFFFEECECCDDDDDDD@CDFDDDDDDDDE
    @HWI-ST375_119:5:1202:2525:15121/1
    CTAGGGAGTAGAAAGTAAATTTGTATAATTAAAGATAAGAAAAGTAAAAAAAAAGGGTAACAAATAGCGGCTAAGAACAATGTAATAGTTATAAGTTGG
    +

    Basically what you see is the name of the read the thing that starts with HWI

    Then there's the sequence that it decided the DNA was, the big block of CATG

    Then there is a +

    Then there is a bunch of characters which tell the computer how good looking each of those bases are so it can decide to use the data or ignore it if it is bad

    What I want to do is split the data I have in half and align each of them separately. Another thing I want to do is rename the reads after I split them

    What I really want to do is rename the read generated by the last 50 bases and you see how there is a /1 at the end of the name? I think that needs to become a /2 or vice versa if I use the second read as the input file.

    Then I can do a paired end alignment and if all goes well I should have the data displayed so I can see indel containing reads as a paired end read on IGV and count them on there.

    So what I think needs to happen is

    @HWI-ST375_119:5:2308:19782:7202/1
    AGATCACCTGATCTTGACGAAATTTCCAACTTATAACTATTACATTGTTCTTAGCCGCTATTTGTTACACTTTTTTTTTACTTTTCTTATCTTTAATTA
    +
    7@;DD>DDHD>FABGH<F>F<G=H4A?+4CFE4BDA@CFC>B9FGHEF<<F@D>4B66:5;@GHEHGIGIEEHCE?@BB95>C>@CCCCA@>>>ACC:>

    Needs to become

    @HWI-ST375_119:5:2308:19782:7202/1
    AGATCACCTGATCTTGACGAAATTTCCAACTTATAACTATTACATTGTT
    +
    7@;DD>DDHD>FABGH<F>F<G=H4A?+4CFE4BDA@CFC>B9FGHEF<

    and

    @HWI-ST375_119:5:2308:19782:7202/2

    CTTAGCCGCTATTTGTTACACTTTTTTTTTACTTTTCTTATCTTTAATTA
    +
    <F@D>4B66:5;@GHEHGIGIEEHCE?@BB95>C>@CCCCA@>>>ACC:>

    In two separate output files. I cannot do this by hand obviously.

    If anyone could help me out, that would be awesome.

    My sample is DNA and my indels are hundreds of kilobases and rare (but they are real, it isn't noise). If anyone could suggest a structural variant detector that is suitable for these kinds of indels I will be happy to try those as well.

  • #2
    Hi YOLO69SWAG,

    I assume you are staring with single end reads, not paired end as stated. Trimmomatic may be able to split 100bp single end reads into two 50bp read files.



    First run 'crop' with 50 for the 'forward' reads:

    CROP:<length>
    length: The number of bases to keep, from the start of the read.

    Then run headcrop with 50 for the reverse reads:

    HEADCROP:<length>
    length: The number of bases to remove from the start of the read.

    What I am not so sure about is turning the /1 into /2 in the 'headcrop' reads. perhaps try:

    sed 's/\/1$/\/2/g' headcrop.fastq >ending_in_2_headcrop.fastq

    which should turn all lines ending in /1 into /2

    One other consideration; does the headcropped sequence need to be reverse complimented? And another comment is that finding 'real' indels of hundreds of kilobases will be difficult - perhaps look into pindel or breakdancer or both.

    Comment


    • #3
      Splitting Sequencing Files

      yes, you would need to reverse complement the sequences for R2, and reverse the string with the base qualities.

      Comment


      • #4
        Thanks

        Thanks,

        I did start with a paired end experiment, but I'm just showing one of the output files (the one from the first reaction). I've got another file where it is the second reaction and the names end in /2. I hope I'm not missing something.....

        I will try out those programs now and try to write something to reverse compliment and rearrange the confidence scores. If I can't do it, there's always fivver

        Thanks again.

        If anyone really fluent in code sees this, feel free to whip out something for me to try.

        Comment


        • #5
          Gaaaaarrrrrhhhh, STOP!!!

          You already have paired end read data. The first file contains the forward reads and the second file contains the reverse reads.

          Comment


          • #6
            Splitting the files like that should not help much and may even result in less accurate mapping, if you do expect an indel in the middle of a read you should be able to easily identify it by mapping the 100 bp paired end reads anyway, one pair will partially map and the paired read will map a large distance away.

            Comment


            • #7
              Yes I do have the paired end data, but I need to find the junction reads to finish quantification of this phenomena. The reads containing the real indels are high enough quality to map correctly since I can blast them manually on my own.

              Comment


              • #8
                Maybe you should look at a viewer program such as IGV.

                Comment


                • #9
                  Also that sed command above will change some of the quality scores, as "/1" is a legitimate quality string.

                  Comment


                  • #10
                    Maybe I should just write my own programs to do what I need since none of these suggestions or structural variant programs are any use. Oh wait, I did. Message me if you want it.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Recent Advances in Sequencing Analysis Tools
                      by seqadmin


                      The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                      05-06-2024, 07:48 AM
                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:35 AM
                    0 responses
                    15 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 05-09-2024, 02:46 PM
                    0 responses
                    21 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 05-07-2024, 06:57 AM
                    0 responses
                    18 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 05-06-2024, 07:17 AM
                    0 responses
                    19 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X