Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using Bfast to align paired end Illumina reads

    Hi,

    I am attempting to align paired end Illumina reads to the human genome with Bfast.

    The problem is that most aligners expect paired end reads in 2 fastq files while Bfast expects 1 file. Having searched this and other forums, the most commonly suggested/accepted solution appears to be conversion of the 2 files to 1 file using the ill2fastq script supplied with Bfast.

    I have now converted the files and run the alignment however there appear to be problems with the results. I have aligned the same sequences with Stampy and BWA (separately) and I get completely different results. The output of samtools flagstat is shown below.

    Does anyone have any suggestions on how best to utilise Bfast under such circumstances?

    BWA
    200000 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    191942 + 0 mapped (95.97%:nan%)
    200000 + 0 paired in sequencing
    100000 + 0 read1
    100000 + 0 read2
    190714 + 0 properly paired (95.36%:nan%)
    191364 + 0 with itself and mate mapped
    578 + 0 singletons (0.29%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    Stampy
    200000 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    199376 + 0 mapped (99.69%:nan%)
    200000 + 0 paired in sequencing
    100000 + 0 read1
    100000 + 0 read2
    197774 + 0 properly paired (98.89%:nan%)
    198768 + 0 with itself and mate mapped
    608 + 0 singletons (0.30%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    BFAST
    125396 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    123238 + 0 mapped (98.28%:nan%)
    50792 + 0 paired in sequencing
    25396 + 0 read1
    25396 + 0 read2
    0 + 0 properly paired (0.00%:nan%)
    49516 + 0 with itself and mate mapped
    326 + 0 singletons (0.64%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

  • #2
    If you have two files, then you need to create one file that is interleaved so that the second read of a pair is directly after the first. Then specify the pairing orientation on the command line.

    Comment


    • #3
      Thanks Nils - I'll write something to interleave the two files.

      Comment


      • #4
        Nils,

        I interleaved the two files and have run them using Bfast. However, whenever I reach the postprocess stage, I am not getting successful pairing of the reads regardless of what I specify on the command line with the -P/-S or -Y options.

        Here is a small selection of the reads:

        Code:
        @chr17:95861:F:237/1
        GTTGGCGAACACATCCATGTGCCGGGAGGATGGTGCACCCCAACTCCACAAGGACCCTTCCAGACCGCGGCCGCTCCAGCTCTCAAAGCC
        +
        @<@?DDDD>FF:DAF9FFFCAGF<F3AAFD>2ACEF?CFC@?;FB:?@?;D@>86';EE;AE376?########################
        GCCCCAGACTCCACAGGTTAAGGGCTCGCATCTCTTGAACAGGGATCTTGATTGCCCCGCGACCTACTGACAATCTGAATTCTGGGCGCT
        +
        +:+=+ADDFCFBF3C:BE+@@;1):C?###############################################################
        @chr17:42735:F:247/1
        TTAAAACTGGATATCACCCAGTGTTGGCAAGGTACAGGAAAATGGGAACTATCATATACCACAGGGGCTGGAAGAGCATAAACTGGTTTA
        +
        CCCFFFFFHHHHHJJJJJJJJIJJJJJJJJJIJGIJJJJJJJIJJJJIJJJJJJIJJJJJIJIJIJJJIJIIJJJJIIJJIJHGFHHFFF
        @chr17:42735:F:247/2
        TTAAGTGACTTCATTTTTAATTACTATATGGGATTCTATCTTTCCAGTGTATCATGATTTATTTGACCTATTGCTGAATGTTGGAGGTTT
        +
        @CCFFFDDHHHHGJJJJJHIJJJJIGJJJJCHJJJIHIJIJJJJJIIIIJIIIGIIJJJHHHHIIGEE@HC=?BE>;>CEEC>A=ACC;A
        @chr17:85755:R:-161/1
        AAAATAATAAACCAGTCATTAGAACCATAAACCTGTACTGTTTTTGACAATGTAATGCACTGCCCTGTAAAGCACTACAATAAAGACGTT
        +
        1:1+4??=DCFC>+A)++3+3CACG4??:*::?DCDEAGB1?DB?<?B<B@DEHFE=)8B8=77=@((6).=CC7=;@>D@3)7;>B###
        @chr17:85755:R:-161/2
        AATTACATGTCATATGAATGAATACTTGACGTCAGCAGGACTGCGTTTTGGTGGTGAACTTGGTTCTAGGTAGAAACAAAGAATGGAGAA
        +
        @@@DAD:DFDFFDF9A4CB<:A?:<FC)ACF6?<CFFF;GABAB93BDGCE<DB?*8=@=B@=FFGIF<1?@>?BB@;AC>;(5-;;>A>
        @chr17:76132:F:140/1
        AATGTAAAACACCATAAAAATTAATCTTAAGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGTGGGCGGA
        +
        CCCFFFFFHHHHHJJJJIJJIJJJJJJJJJJJJJJJJJJJJIJIJJJJIJJJJGIJJIJHIGJJJHGIJJJJJJJIHGIIFIHHHHHHFF
        @chr17:76132:F:140/2
        ACGCCATTCTCCTGCCTCGGCCCCCCAAGTAGCTGGGACTACAGGCGTCCACCACCACGCCCGGCTAATTTTTTGTATTTTTAGTAGAGA
        +
        @C@FFFFFFHHHGJJJJJJJJJJJJJJIJJJJEHIIJJJJJJJJJIJDGHGGIIGHIIIJJJJJJJJGIIJJJIHGHHHHFFFFFDEEEE
        @chr17:72239:F:208/1
        CAGACTGGCATGCAATGGTGCGATCTCGGCTCACTGCAACCTCTGCCTCCCAAGTTCAAGTGATTCTCCTCCCTCAGCCTCCCAGGTAGT
        +
        CCCFFFFFHGHHHJJJJJJIJJGIGEHGIIJJHIJJJJJJJJJJJGJJJJIJIJJIIJIJEGCGHIIIHHDHFHHHFFFFFFFDCEEECC
        @chr17:72239:F:208/2
        AGAAAAAGTAAGAGACACCTATAGATCAGAAGACACTTGGGGCTGGGCATGGTGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCAA
        +
        @CCFFDDFHHHHHJJJFHIJJGJIJJJJJEGGIIJJJJJGIIJJJJGEIIFHGHIFHIIJJJJJEIJIIJIIGHHFFEEHFFFFFFEFCC
        @chr17:124386:R:-215/1
        AAGGGAAAAGACCCAAAGGGTTGGAAGCAATATGTGAAAAAATACAGAATTTATATTGTCTAATTACAAAAAGCAACTTCTAGAACCTTT
        +
        CC@FFFFFHHHHGJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJIJHJJIJHIJJJJJJJJJJJIIIJJIIGIJIJIJGJJJIHHHHHHFF
        @chr17:124386:R:-215/2
        AGCCAGGCATGGTGGTGCATGCGTGTAATCCCAGCTACTCGGGAGCTGAGGCAGGAGAATCGCTTGAACCCACGAGTCAGAGGTTGCGGT
        +
        CCCFFFFFGHHHHJJJJJJHIJJJJJJIJIIIJJJIJJJJGIJIJJJJJJIJJIJJJJJJIIJJJIJJJHHHHHHHHFFFFFFFDFEEED
        @chr17:43455:F:112/1
        TTTTTGAGACAGGGTCTCACCATATCACCCAGGATGGAGTGCAGTGGCACCATCATGGCTCACCACGGCCTCAACTTGCTGGGATCAAGC
        +
        CCCFFFFFHHHHHJJIJGIJJJIJGIJJJJJJJJJJIIIJJJJIIJJJJJIJFGHGIJJJJEIIGHHHGHFFFFFEEEEEEEDDDDDDDC
        @chr17:43455:F:112/2
        ACGCTACTGCACTCTGTTCTAGGCAACCCCTGTCTGGGAAAAAAAAAAAAATTAGTGAGGCTTAGTGGTGCACACCTGTAGTCTCAGCTA
        +
        CCCFFFFFHHHHHJJJJIEIJJJJHGIJJIIEHGIIJJIIJ@FHIJJJIJGHIJIHHBGHGI=FIEDEGGIJHEHG=>AE?EHCEFDFE;
        @chr17:89085:R:-274/1
        TTTAAAGCACTACCCAGAGTTATTCAAAGCCAGGCAGGAGAACTGCAGGCATCAGAATGCCCTGGGGACGGGTCCAAAATGCAGAATCCT
        +
        @@BFFFFDHDHHHFGIJIGJJGJIJIHIEGJJJ)?@FHHJJI;FDHIIHH=>DFEDECEEECDDECCD55<@B>CDCBDD@CDCCDACCC
        @chr17:89085:R:-274/2
        GCTTGGTAAGGAATATTAGGTGAACACAGGTGCCTATGTGGGTTCCACCGTTTTCCATAAATGTTAATTTTCGGCACTGAGGACGGACAG
        +
        CCCFFFFFHHHHGIJJJJJIIIIJIJHF>CCE@3;(5>;?B?3:,5,3?((+3@ADCD::8:::4>@CD#####################
        Is there anything obviously problematic to be seen?

        Comment


        • #5
          That doesn't look like a valid FASTQ file (where's the read name for the second read). Also, the read names should be the same for multi-end reads. Please read the manual on the topic of the proper input format in the manual accompanying BFAST. There should be an example paired end read FASTQ (etc.).

          Comment


          • #6
            Nils, RE the second read name - it must have been a copy/paste error as the file actually does contain it:

            Code:
            @chr17:95861:F:237/1
            GTTGGCGAACACATCCATGTGCCGGGAGGATGGTGCACCCCAACTCCACAAGGACCCTTCCAGACCGCGGCCGCTCCAGCTCTCAAAGCC
            +
            @<@?DDDD>FF:DAF9FFFCAGF<F3AAFD>2ACEF?CFC@?;FB:?@?;D@>86';EE;AE376?########################
            @chr17:95861:F:237/2
            GCCCCAGACTCCACAGGTTAAGGGCTCGCATCTCTTGAACAGGGATCTTGATTGCCCCGCGACCTACTGACAATCTGAATTCTGGGCGCT
            +
            +:+=+ADDFCFBF3C:BE+@@;1):C?###############################################################
            I'll have a look at the FASTQ format in the Bfast manual and reattempt. I hadn't scrutinised the format too deeply as the same sequences have already been successfully aligned with a few other programs.

            EDIT: I have now rerun and the alignments look good. The problem seems to have been the presence of the /1 and /2 suffixes in the read names.
            Last edited by gavin.oliver; 01-09-2012, 01:56 AM.

            Comment


            • #7
              Great, I am glad it is working for you!

              Comment


              • #8
                Hi,

                I have actually the same problem. I have each pair end of the reads in separate files, but when they look as the following:

                --- pair_1 file:

                Code:
                @HWUSI-EAS1692_0001:1:1:1050:4451#0/1
                CAGATTCACANTCCTGAATATCATGTTTTCTTTCCAAGGNATGACATAACGTCTTGGGATCATCCCTTGCTTTAATGAAAATCGTGGCAAATGAA
                +HWUSI-EAS1692_0001:1:1:1050:4451#0/1
                Ybaac][T^YB[ZZ[SKVZT`bcYbccaccaaa_cZZ[ZB[Z[T_c`cYcc\bcccc^T\a`TcccbL\ac\^a\Ybb`^bY]bb_BBBBB
                -- pair_2 file:

                Code:
                @HWUSI-EAS1692_0001:1:1:1050:4451#0/2
                CATGATAATGCACTCCATCTCATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCACTAAAAAGCGGACCTTGGTGTGAAAACATAACACACAC
                +HWUSI-EAS1692_0001:1:1:1050:4451#0/2
                M_M^ZM\YL]U^L\^VQJIU\a__\``c\cW_aaaaa_R[_\_`W][__BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
                I use the ill2fastq.pl script contained in the bfast package to merge them and everything works fine. However, when the files are like:

                -- pair_1 file:

                Code:
                @ILLUMINA-A16956_0001:4:1:1099:4197#0/1
                CTGTTTTCAAAATAATATNCTTCTTGATTCCTTAGCTCTGTGCCTCAGAC
                +
                CCACCCACCCCCCC@<BA#@;:<<<CA=CCA=BBB@@C@ACCCCBCCCBC
                -- pair_2 file:

                @ILLUMINA-A16956_0001:4:1:1099:4197#0/2
                NNTATTTTCTAAAGTGAGGCGGCANGNANATNNTNGTAGGTANCNNNGNN
                +
                ###########################################################
                which seems more 'fastq - like' (with Phred+33 quality and '+' in the third line), then I have to merge them by my own script? I mean, I can't not merge them by using some argument of the bfast commands used for alignement?

                Thank you very much for your feedback,

                best
                David

                Comment


                • #9
                  The read names should be identical; they are not.

                  Comment


                  • #10
                    The de novo assembler velvet also comes with a little perl script to interleveave two separate fastqs into one.

                    Comment


                    • #11
                      Nils, sorry for that, but I would say that the reads name are the same. In the first example, they are 'illumina-like'. In the second example, they are 'fastq-like'.

                      So my guess is that in the first case they must be interleaved + converted, I did it by using the ill2fastq script contained in the bfast package, and everything went fine.

                      In the second example, they have to be interleaved (but not converted, since the quality is already +33).

                      So, sorry if I am missing something obvious, but my questions are the following:

                      - I noticed that the ill2fastq script also complements + reverts the sequence of the second pair. Is it necessary?

                      - is there any way to input the two fastq files to the bfast command(s), without having to interleave them before?

                      thanks,
                      david

                      Comment


                      • #12
                        No, BFAST requires the reads to be interleaved with the read names exactly the same. Supporting platform specific vendors input formats becomes difficult (look at all of these conversion scripts!), and so a generic format is better.

                        I haven't looked at ILMN data in a while, so I am not sure if the ill2fastq scripts needs updating. I would be immensely grateful for a patch.

                        Comment


                        • #13
                          Yes, you're right.
                          Last question: when interleaving, the second end must be reversed and complemented?

                          I am trying to figure out whether bfast requires all the pairs to be listed in order of 5' to 3' and on the same strand. I am using bfast+bwa-0.7.0a.

                          Comment


                          • #14
                            Prior versions required that, but now you can specify their orientation during the "postprocess" step.

                            Comment


                            • #15
                              ok, I will download the last version to simplify the interleave step.

                              Many thanks for your help, Nils.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              50 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X