  • FASTX Toolkit barcode splitter issue

    Hello All,

    I've been trying to use the FASTX Toolkit barcode splitter to demultiplex my Illumina reads. The following command runs without any errors:

    [cat /home/johnathon/jda_ev_extended.txt | fastx_barcode_splitter.pl --bcfile /home/johnathon/mybarcodes.txt --bol --mismatches 1 --prefix /home/johnathon/split_bc/jda_ev_split_bc --suffix ".txt"]

    But none of the output files contain any reads except for the mismatched file.

    The following is my mybarcodes.txt file:

    [#I hope the following is the appropriate format for this txt file, it should contain the barcode identifier and the barcode sequence itself in a tab delimited fashion--Johnathon David Anderson
    BC1 ACCC
    BC2 CGTA
    BC3 GAGT
    BC4 TTAG]


    However, when I look at the extended.txt file I can see the right barcodes on the 5' end. I have also tried to use the export.txt file, to no avail; apparently it is not formatted appropriately. I get an error message saying that the first character is an "S" instead of a ">" or an "@".

    I have not converted these files from Solexa to Sanger Fastq. Could this be the issue?

    For my first data set, which was not barcoded, I used the export2std command of the MAQ fq_all2std.pl script to convert the export.txt file. It worked just fine and I was able to visualize the data in IGV. I haven't had much success with MAQ patch ill2sanger. If the missing conversion is the issue with the FASTX Toolkit, can anyone recommend a user-friendly conversion script? I am using Solexa pipeline 1.6.

    Is anyone familiar with the FASTX Toolkit? Is the problem probably that the Illumina files need to be converted to Sanger FASTQ first?

    Any guidance would be most appreciated.

    Regards,
    Johnathon
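A minimal Python sketch for sanity-checking a barcode file before running the splitter, assuming the layout described in the post above: one barcode per line, with the identifier and the sequence separated by a single tab. The function name `parse_barcode_lines` is hypothetical, and skipping lines that start with `#` is an assumption based on the comment line in the posted file, not a statement about how the toolkit parses it.

```python
def parse_barcode_lines(lines):
    """Return (barcodes, problems) for an iterable of barcode-file lines.

    Assumptions (not taken from the toolkit's source): blank lines and lines
    starting with '#' are skipped, and every other line must be
    '<identifier><TAB><sequence>' using only A, C, G, T or N.
    """
    barcodes = {}
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                f"line {number}: expected 2 tab-separated fields, found {len(fields)}"
            )
            continue
        name, sequence = fields
        # A stray carriage return or space in the sequence is reported here too.
        if not sequence or set(sequence.upper()) - set("ACGTN"):
            problems.append(f"line {number}: unexpected characters in {sequence!r}")
            continue
        barcodes[name] = sequence.upper()
    return barcodes, problems


if __name__ == "__main__":
    example = ["# identifier<TAB>sequence\n", "BC1\tACCC\n", "BC2 CGTA\n"]
    print(parse_barcode_lines(example))
```

A file whose columns are separated by spaces instead of a tab is reported as a problem line rather than silently accepted.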

  • #2
    Hello All,

    I am updating my progress in case this may help someone in the future.

    As previously mentioned, I used the FASTX Toolkit on the export.txt and extended.txt files from Illumina pipeline 1.6 with minimal success, and I suspected a formatting error in these files. I just tried using the same Barcode Splitting module on the sequence.txt file (prior to reformatting to Sanger FASTQ) and it seems to have worked fine, with the caveat that there appear to be more reads in the unmatched file than I had expected (199,524 out of 28,223,602, or 0.7%), but perhaps this is normal. For reference, I had used the NuGen Ovation and Encore Kits for library prep.

    Regards,
    Johnathon
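The unmatched fraction quoted above can be checked with a couple of lines of Python. This only reproduces the arithmetic from the two read counts given in the post; the helper name `unmatched_percent` is made up for illustration.

```python
def unmatched_percent(unmatched, total):
    """Percentage of reads that ended up in the unmatched file."""
    return 100.0 * unmatched / total


# Read counts taken from the post above.
print(f"{unmatched_percent(199_524, 28_223_602):.2f}%")
```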


    • #3
      Sorry to hijack your thread, but would the FASTX Toolkit be able to demultiplex SOLiD reads as well?
      http://kevin-gattaca.blogspot.com/


      • #4
        The FASTX Toolkit does not work for SOLiD data. I wrote some Perl scripts to demultiplex some SOLiD data a few months back. The code and the syntax weren't pretty. If you're interested, I can dig the scripts up and post them.


        • #5
          Hello Kevin,

          I am not sure. I cannot tell directly from the documentation; however, I don't see any mention of color-space reads. Maybe you could query the Hannon Lab if you don't get an immediate answer on here ([email protected]).

          -
          Johnathon


          • #6
            Bump for the SOLiD part of this thread.
            Once I run solid2fastq.pl to convert my csfasta and qual files to a fastq.gz file, can I use FASTX to do QC on my SOLiD PE reads?


            • #7
              Hi jdanderson,
              I think your command looks good, and I suspect the problem is with the barcode file. Try opening the barcode file with vi and see if there is anything weird going on. Sometimes you see ^M at the end of each line; if so, you can fix this manually and re-run the command. Good luck.
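The ^M that vi shows is a carriage return, typically left behind when a file was saved with Windows-style line endings. As a rough sketch of the check suggested above (the function names are hypothetical), the following Python reports which lines end in a carriage return and returns a cleaned copy of the data:

```python
def find_carriage_returns(data: bytes):
    """Return the 1-based numbers of lines that end with a carriage return."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if line.endswith(b"\r")
    ]


def strip_carriage_returns(data: bytes) -> bytes:
    """Convert Windows-style CRLF line endings to plain LF."""
    return data.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    sample = b"BC1\tACCC\r\nBC2\tCGTA\r\n"
    print(find_carriage_returns(sample))
    print(find_carriage_returns(strip_carriage_returns(sample)))
```

To apply it to a real file, read the file in binary mode (`open(path, "rb").read()`) so that Python does not translate the line endings before the check.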


              • #8
                Hi everyone,

                I've been using the FastX Barcode Splitter successfully, but regarding the --partial option, I have realized I'm losing some reads with a particular problem:

                With --partial 1

                The barcode

                Code:
                CGCGTCAGCATTGTTCATAC
                will pick up the read

                Code:
                GCGTCAGCATTGTTCATACAAAGCTACTTAGTTGCTACGAAGCAATACATTGTTAGTTGTTAACTACT
                since it is missing just one base at the left end to match the barcode exactly.

                However, the read:

                Code:
                CCGCGTCAGCATTGTTCATACAAAGCTACTTAGTTGCTACGAAGCAATACATTGTTAGTTGTTAACTACT
                will not be taken as matching the barcode, since it has one extra base at the beginning. Unfortunately, there are many reads that fall into this category, but not all of them begin with the extra 'G'.

                Do you use anything else to get around this?

                Thanks!
                Carmen
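The asymmetry described in this post (a read missing the first barcode base versus a read with one extra base in front of the barcode) can be illustrated with a small Python sketch. This is only a simplified model of the two situations, not the toolkit's matching code; `barcode_offset` and `FLANK` are names invented for the example, and the sequences are the ones quoted in the post.

```python
def barcode_offset(read, barcode, max_shift=1):
    """Classify how the start of `read` lines up with `barcode`.

    Returns 0 for an exact match at the read start, +k if the read has k
    extra bases in front of the full barcode, -k if the first k barcode
    bases are missing from the read, and None if nothing fits.
    """
    if read.startswith(barcode):
        return 0
    for k in range(1, min(max_shift, len(barcode) - 1) + 1):
        if read[k:k + len(barcode)] == barcode:
            return k
        if read.startswith(barcode[k:]):
            return -k
    return None


BARCODE = "CGCGTCAGCATTGTTCATAC"
FLANK = "AAAGCTACTTAGTTGCTACG"  # first bases following the barcode in the post

print(barcode_offset(BARCODE[1:] + FLANK, BARCODE))    # read missing one base
print(barcode_offset("C" + BARCODE + FLANK, BARCODE))  # read with one extra base
```

The first read is reported as -1 (one barcode base missing) and the second as +1 (one extra leading base), which is the case that goes unmatched according to the post.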


                • #9
                  A quick and dirty solution would be to trim off the first base of all your reads and then just use the FastX barcode splitter with --partial.
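A minimal sketch of the trimming step in Python, assuming a plain four-line-per-record FASTQ file (no wrapped sequence lines). The generator name `trim_first_base` is hypothetical; the toolkit's own trimming tool can presumably do the same job, so this is only meant to show what the operation does to each record.

```python
def trim_first_base(fastq_lines):
    """Yield FASTQ lines with the first base and first quality score removed.

    Assumes strict 4-line records: header, sequence, '+' line, qualities.
    """
    for index, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if index % 4 in (1, 3):  # sequence line and quality line
            line = line[1:]
        yield line


if __name__ == "__main__":
    record = ["@read1", "ACGCGTCAGC", "+", "IIIIIIIIII"]
    print(list(trim_first_base(record)))
```

The sequence and quality lines must be trimmed together, otherwise their lengths no longer agree and downstream tools will reject the file.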


                  • #10
                    Thank you, chadn!

                    Of course this was the easiest solution.

                    The barcode is:
                    Code:
                    REVERSEPRIMER	CGCGTCAGCATTGTTCATAC
                    Read 1 begins with a perfect match to the barcode.

                    Code:
                    @HWI-M00149:16:000000000-A12VK:1:2114:17873:29127 2:N:0:
                    CGCGTCAGCATTGTTCATACAAAGCTACTTAGTTGCTACGAAGCAATACATTGTTAGTTGTTAACTACTCCCCCCTCTTGTTTTNNNCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNN
                    Read 2 has an extra base at the beginning, followed by a perfect match to the barcode.

                    Code:
                    @HWI-M00149:16:000000000-A12VK:1:2114:17873:29128 2:N:0:
                    ACGCGTCAGCATTGTTCATACAAAGCTACTTAGTTGCTACGAAGCAATACATTGTTAGTTGTTAACTACTCCCCCCTCTTGTTTTNNNCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNN
                    Read 3 is missing the first base of the barcode.

                    Code:
                    @HWI-M00149:16:000000000-A12VK:1:2114:17873:29129 2:N:0:
                    GCGTCAGCATTGTTCATACAAAGCTACTTAGTTGCTACGAAGCAATACATTGTTAGTTGTTAACTACTCCCCCCTCTTGTTTTNNNCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNN
                    By trimming the first base of every read,

                    we are left with

                    Code:
                    Read 1 [now missing 1 base at the beginning]
                    GCGTCAGCATTGTTCATACAAAGCTACTTAGTTGCTACGAAGCAATACATTGTTAGTTGTTAACTACTCCCCCCTCTTGTTTTNNNCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNN

                    Read 2 [now perfect match]
                    CGCGTCAGCATTGTTCATACAAAGCTACTTAGTTGCTACGAAGCAATACATTGTTAGTTGTTAACTACTCCCCCCTCTTGTTTTNNNCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNN

                    Read 3 [now missing 2 bases at the beginning]
                    CGTCAGCATTGTTCATACAAAGCTACTTAGTTGCTACGAAGCAATACATTGTTAGTTGTTAACTACTCCCCCCTCTTGTTTTNNNCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNN
                    and by using

                    Code:
                    --mismatch 4 --partial 4
                    all reads will be matched to the barcode.

                    The value of 4 doesn't make sense to me, as I thought 2 would be enough, but this is the only thing that gets it to work, so...

                    Thanks a lot!

                    Carmen
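The worked example in this post can be double-checked with a short Python sketch that trims one base from each of the three reads and counts how many leading barcode bases are then missing. It only verifies the bracketed annotations above ("missing 1", "perfect match", "missing 2"); it does not model how the splitter scores mismatches, and `missing_leading_bases` is a name made up for this illustration. The reads are shortened to the barcode plus the first eight following bases.

```python
BARCODE = "CGCGTCAGCATTGTTCATAC"
TAIL = "AAAGCTAC"  # first bases after the barcode in the reads above


def missing_leading_bases(read, barcode, max_missing=4):
    """Return how many leading barcode bases are absent from the read start.

    0 means the read starts with the full barcode; None means no fit was
    found within `max_missing` missing bases.
    """
    for missing in range(min(max_missing, len(barcode) - 1) + 1):
        if read.startswith(barcode[missing:]):
            return missing
    return None


reads = {
    "Read 1": BARCODE + TAIL,        # perfect match before trimming
    "Read 2": "A" + BARCODE + TAIL,  # one extra base before trimming
    "Read 3": BARCODE[1:] + TAIL,    # one base missing before trimming
}

# Trim the first base of every read, then measure what is missing.
after_trim = {
    name: missing_leading_bases(sequence[1:], BARCODE)
    for name, sequence in reads.items()
}
print(after_trim)
```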


                    • #11
                      fastx_barcodes_splitter issue with run

                      Hi,

                      I saw the post and I hope some of you can help me.

                      When I run fastx_barcode_splitter.pl with this command

                      /usr/local/bin/fastx_barcode_splitter.pl --bcfile ./Barcodes9nt.txt --prefix ./Rescued9nt --suffix .fq –bol

                      On the command line it looks like it is running (no error message, no > sign); see the attachment for a screenshot.
                      However, it is not running at all. I can see with top that it is not using any memory or CPU, and it has been ‘running’ for days on a very small file without producing any results.
                      The input file is in the STDIN folder, as it is supposed to be.

                      I would be very grateful if you could suggest what might be wrong.
                      Thanks in advance
                      Vivi


                      • #12
                        Hi vivi7,
                        I guess you need to provide your FASTQ or FASTA file. You haven't provided that.
                        Use it like this:
                        Code:
                        cat File.fastq | /usr/local/bin/fastx_barcode_splitter.pl --bcfile mybarcodes.txt ...other options if you want.
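As the reply above indicates, the splitter expects the reads on standard input, so a command started without piped input simply waits. When the step is driven from a script, the same pipe can be set up by handing the open FASTQ file to the process as its standard input. The following Python sketch assumes the script is on the PATH and uses only options that appear in this thread; the helper names are hypothetical.

```python
import shutil
import subprocess


def build_splitter_command(bcfile, prefix, suffix=".fq"):
    """Build the argument list using only options quoted in this thread."""
    return [
        "fastx_barcode_splitter.pl",
        "--bcfile", bcfile,
        "--prefix", prefix,
        "--suffix", suffix,
        "--bol",
    ]


def run_splitter(fastq_path, bcfile, prefix, suffix=".fq"):
    """Feed the FASTQ file to the splitter on stdin, like `cat file | ...`."""
    command = build_splitter_command(bcfile, prefix, suffix)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on the PATH")
    with open(fastq_path, "rb") as reads:
        return subprocess.run(command, stdin=reads, check=True)


if __name__ == "__main__":
    print(" ".join(build_splitter_command("Barcodes9nt.txt", "./Rescued9nt")))
```

Calling `run_splitter("File.fastq", "Barcodes9nt.txt", "./Rescued9nt")` would then be equivalent to the piped command shown above, with the file contents arriving on standard input.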


                        • #13
                          Hi Everybody,
                          I came back to this thread because I am getting a very similar problem to the one in the first post by jdanderson.

                          My command runs without errors:
                          cat test_R1.fastq | fastx_barcode_splitter.pl --bcfile mapping2_bcfile.txt --prefix /Volumes/Cristina/Mr.DNA_2016/fastq_files/testdata/ --bol --mismatches 1
                          But none of the output files contain any reads except for the mismatched file.

                          We got this data from Mr.DNA as a single raw FASTQ file containing 10 samples together, which I need to split. Johnathon's later suggestion didn't help.
                          Can anybody help please?
                          Thanks,
                          smitra


                          • #14
                            Can you post a few lines of your FASTQ file and the mapping file?


                            • #15
                              Thanks for replying GenoMax

                              Code:
                              #SampleID	BarcodeSequence
                              AP1E	CGTAACCA
                              AP25E	CGTACCCA
                              AP5D	CGTAAGAA
                              AP8C	CGTAGATA
                              P29F	CGTAGGCT
                              P30N	CGTATTCA
                              P31B	CGTCAAGA
                              P35C	CGTATTTC
                              V2A	CGTCCAGG
                              V3J	CGTCACAG
                              But the FASTQ file looks like this (I assume the bold red part is the barcode, with one N):

                              mitras$ less test_R1.fastq

                              Code:
                              @M02542:124:000000000-AKFBJ:1:1101:13841:1000 1:N:0:5
                              
                              NGTACCCAAGGGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACANNCNNGTCGAACGGTAGCNCAGAGAGCTTGCTCTNGGNTGACGAGTGGCGGACGGGNGANTAATGTCTGGGAAACTGCCCGATGGAGGGGGATANCTACTGGANANNGNNGCTAATACCGCATAACGNCGCAAGACCAAAGAGGGNGANNTCAGGGCCTCTTGNCATCGGATGNNCCCAGATGGGATNGGCTTGTAGGTGAGGTAAGNGCTCACGCNGGCGACGATCCCTAGCTTGGNNGNGAGG
                              
                              +
                              
                              #8ABCFGGGGGGGGEEGGGGGGGG<FGGGFFGFGFGFGGEG@FGEEGGCFGGGGG?##:##6:CFFGGGDG<CG#:CCFFGEGGGGFAFG#:<#:BBFF7FFGDGGGGGGGD#8+#+:BFGGGGGGGCFFGDGG<FGGGECCGDEGGGF@#611:D,>>#6##6##66<1CF@7FFFGEGF7E#41=8=EGFFG7*?CF>>#22##2*2;@;8C8CFC<#/2AC=E*:5##/2:CFCG+8**+#*1*1552<+*+0+8D6D4+#1**)**)*#*15/*//7>5:5<.*,*)0)##1#..73
                              
                              @M02542:124:000000000-AKFBJ:1:1101:12174:1002 1:N:0:5
                              
                              NGTAACCAAGGGTTTGATCCTGGCTCAGGATGAACGCTAGCTACAGGCTTAACACANNCNNGTCGAGGGGCAGCATTTCAGTTTGCTTGCNAANTGGAGATGGCGACCGGCGNACNGGTGAGTAACACGTATCCAACCTGCCGATAACTCNGGGATAGCNTNNCNNAAGAAAGATTGATACCCNATGGTATAATCAGACCGNATGGTCTTATTATTAAANAATTTCGGTNNTCGATGGGGATGNGTTCCATTAGGCAGTTGGTGTGTTAATGNCGCACCAAACCTTCCTGTGANNGNGTTT
                              
                              +
                              
                              #8ACCGGGGGGGCFGGGGGGGGGGGGGGGGGGGFGGGGDGGGGGGGGGFGGGGGGG##:##6:CFGFDEGGGGDGGGFGGGFGGGGGGGG#:C#66=,CFFFGGGG@FGEE7#++#:BBFFGGGFCFGGGGGGCGDGGGFGGGGGGGGC=#8@<<<FGG#5##8##86DCF<FCCC:BFCFFF#6>F>FGG92;@CFFGF@#116*=CF<CG?@CFFFG#3;5375:CG##212**<5C5/::#11:91A>+<>C6CE<FC:*****0:FB<#1*)//75<F30762*-2)**##1#0)0.


                              As you can see, the reads start with an N, so maybe I need to allow 1 mismatch in the barcode.
                              So I tried this command:
                              cat test_R1.fastq | fastx_barcode_splitter.pl --bcfile mapping2_bcfile.txt --prefix /Volumes/Cristina/Mr.DNA_2016/fastq_files/testdata/ --bol --mismatches 1
                              Thanks for helping
                              smitra
                              Last edited by GenoMax; 01-25-2016, 09:10 AM. Reason: added CODE tags to improve readability
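To see which sample each of the two posted reads would be closest to, the first eight bases can be compared against every barcode in the mapping file by counting mismatches. The Python below is a rough sketch under stated assumptions: the barcode sits at the very start of the read, an N in the read counts as one mismatch, and ties are broken by sample name. It is not the splitter's own matching code, and `hamming` and `best_barcode` are illustrative names.

```python
BARCODES = {
    "AP1E": "CGTAACCA", "AP25E": "CGTACCCA", "AP5D": "CGTAAGAA",
    "AP8C": "CGTAGATA", "P29F": "CGTAGGCT", "P30N": "CGTATTCA",
    "P31B": "CGTCAAGA", "P35C": "CGTATTTC", "V2A": "CGTCCAGG",
    "V3J": "CGTCACAG",
}


def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_barcode(read, barcodes, max_mismatches=1):
    """Return (sample, mismatches) for the closest barcode at the read start.

    Assumes the read is at least as long as each barcode. Returns
    ('unmatched', mismatches) when the best candidate exceeds the limit.
    """
    scored = sorted(
        (hamming(read[:len(sequence)], sequence), name)
        for name, sequence in barcodes.items()
    )
    distance, name = scored[0]
    if distance > max_mismatches:
        return "unmatched", distance
    return name, distance


# First bases of the two reads posted above.
print(best_barcode("NGTACCCAAGGGTTTGATCA", BARCODES))
print(best_barcode("NGTAACCAAGGGTTTGATCC", BARCODES))
```

Under these assumptions the first read is one mismatch away from AP25E and the second is one mismatch away from AP1E, so allowing a single mismatch should in principle be enough for the leading N.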
