  • Sorting fasta file according to header

    Hi there,
    I have a fasta file like this:
    Code:
    [zillur@genomics filter]$ head new_12.fasta 
    >000000M00365:7:000000000-A48JK:1:1110:10044:9619
    TACGGAGGGTGCAAGCGTTATCCGGAATCACTGGGTTTAAAGGGTGCGTAGGCGGATATATAAGTCAGAGGTGAAAGCTCGCAGCTTAACTGCGGAATTGCCTTTGATACTGTTTATCTTGAATTATGTTGAGGTTAGCGGAATGAGTCAT
    >000000M00365:7:000000000-A48JK:1:2105:14983:8496
    TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTGGAAAGTATGGGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCCTGTAAACTATCAGTCTAGAGTTCTGGAGAGGTGAGTGGAATTGCTAGG
    >000000M00365:7:000000000-A48JK:1:2113:12381:28279
    TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGATAAGTCAGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTTGATACTGTCAGACTAGAGTATGTTAGAGGAATGCGGAATTCCGGGT
    >000001M00365:7:000000000-A48JK:1:1110:15899:9619
    TACGAACTGTGCAAACGTTATTCGGAATCACTGGGCTTAAAGGGTGCGTAGGCGGGTTTGTAAGTCAGAGGTGAAAGTTTGCAGCTTAACTGTAAAATTGCCTTTGAAACTGTAGAACTTGAGTAGCGTTGAGGTCAGCGGAATGTGACAT
    >000001M00365:7:000000000-A48JK:1:2105:15157:8497
    TACGAAGGTCCCAAGCGTTATTCGGAATCACTGGGCGTAAAGGGAGCGTAGGCGGCGTGGAAAGTCAGATGTGAAATCTCAAGGCTCAACCTTGAAACTGCATCCGATACTTCCATGCTAGAGGACTGGAGAGGTGTTTGGAATTATCGGT
    I want to sort this file according to the header information. How can I do this?

    Best Regards
    Zillur

  • #2
    Can you be more specific about which header information? Alphabetical sorting?



    • #3
      Thank you very much. Alphabetically or numerically, whichever is more convenient.

      Best Regards
      Zillur



      • #4
        And the reason you want to do this, if I may ask?



        • #5
          Thanks.
          And the reason you want to do this, if I may ask?
          Yeah, sure. I wanted to create a fastq file from my .qual and fasta files using qiime, but it gave me:
          Code:
          KeyError: 'QUAL header (M00365:7:000000000-A48JK:1:1101:14885:1320) does not match FASTA header (M00365:7:000000000-A48JK:1:1101:16466:1388)
          My qual file contains many other sequences in addition to the ones in my fasta file. So I think sorting may resolve the issue. I would appreciate your suggestions.

          Best Regards
          Zillur
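Before sorting anything, it may help to check how the two files actually differ. The following is a minimal sketch, assuming both files use `>` header lines and that the record name is the first whitespace-delimited token of the header; the function names `headers` and `compare_headers` are hypothetical and not part of QIIME or any other tool.

```python
def headers(text):
    """Return record names (first token after '>') in file order."""
    names = []
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def compare_headers(fasta_text, qual_text):
    """Summarise how the FASTA and QUAL header lists differ."""
    fasta_ids = headers(fasta_text)
    qual_ids = headers(qual_text)
    # Index of the first position where the two files name different records.
    first_mismatch = next(
        (i for i, (f, q) in enumerate(zip(fasta_ids, qual_ids)) if f != q),
        None,
    )
    return {
        "only_in_fasta": sorted(set(fasta_ids) - set(qual_ids)),
        "only_in_qual": sorted(set(qual_ids) - set(fasta_ids)),
        "first_mismatch": first_mismatch,
    }
```

If `only_in_qual` is not empty, the two files will still be misaligned after both are sorted, so the extra QUAL records would need to be dropped or skipped first.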



          • #6
            I guess sort on Linux will work, provided each record is exactly two lines (a header and a single sequence line):
            cat file.fasta|paste - -|sort|sed 's/\t/\n/g'
            Try this.
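For files where a sequence wraps over several lines, the `paste - -` pairing no longer lines up. A rough Python equivalent that keeps each record together could look like the sketch below; `read_records` and `sort_fasta_text` are hypothetical names, and the ordering is plain string order on the header.

```python
def read_records(lines):
    """Yield (header, body_lines) pairs; a body may span several lines."""
    header, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, body
            header, body = line[1:], []
        elif header is not None and line:
            body.append(line)
    if header is not None:
        yield header, body


def sort_fasta_text(text):
    """Return the FASTA text with records ordered by header string."""
    records = sorted(read_records(text.splitlines()), key=lambda rec: rec[0])
    out = []
    for header, body in records:
        out.append(">" + header)
        out.extend(body)
    return "".join(line + "\n" for line in out)
```

The same function can be applied to a .qual file, since it uses the same `>` header convention.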
            Persistent LABS



            • #7
              The following is untested, but you could give it a try and see if it works. It may avoid the need to sort. You will find reformat.sh in the BBMap suite.

              Code:
              reformat.sh in=your_fasta_file.fa qfin=your_qual_file.qual out=fastq_format_file.fq
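If the two files do not list the same records in the same order, another option is to look up each quality record by name instead of relying on position. This is only a sketch under stated assumptions: the .qual file holds space-separated integer Phred scores, record names are the first token of each header, and the output uses the Phred+33 encoding. The names `parse` and `fasta_qual_to_fastq` are hypothetical and do not describe how reformat.sh or QIIME work internally.

```python
def parse(text):
    """Map record name -> list of body lines for FASTA or QUAL text."""
    records, name = {}, None
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else None
            if name is not None:
                records[name] = []
        elif name is not None and line.strip():
            records[name].append(line.strip())
    return records


def fasta_qual_to_fastq(fasta_text, qual_text, offset=33):
    """Build FASTQ text for every FASTA record, in FASTA order.

    QUAL records without a matching FASTA record are ignored.
    """
    seqs = parse(fasta_text)
    quals = parse(qual_text)
    out = []
    for name, seq_lines in seqs.items():
        if name not in quals:
            raise KeyError(f"no quality scores for {name}")
        seq = "".join(seq_lines)
        scores = [int(s) for s in " ".join(quals[name]).split()]
        if len(scores) != len(seq):
            raise ValueError(f"length mismatch for {name}")
        # Assumed encoding: one ASCII character per score, shifted by `offset`.
        qual_str = "".join(chr(q + offset) for q in scores)
        out.append(f"@{name}\n{seq}\n+\n{qual_str}\n")
    return "".join(out)
```

Because both files are read fully into memory, this is only suitable when they fit in RAM.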



              • #8
                Thank you very much. I have tried this:
                cat file.fasta|paste - -|sort|sed 's/\t/\n/g'
                But it doesn't resolve everything:
                Code:
                (qiime191) [zillur@genomics final]$ head new_sorted_1.fasta 
                >M00365:7:000000000-A48JK:1:1101:10000:14343
                TACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTGCGTAGGCGGATTATTAAGTTAGGGGTGAAATCCCGAGGCTCAACCTCGGAACTGCCCTTAAAACTGTTGGTCTTGAGTTCTGGAGAGGTGAGTGGAATTGCTAGT
                >M00365:7:000000000-A48JK:1:1101:10000:18084
                TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTAGGTCAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGATACTGCCTAGCTAGAGTATGTTAGAGGAATGCGGAATTCCAGGT
                >M00365:7:000000000-A48JK:1:1101:10000:25105
                TACGAAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGTTCGTAGGCGGGTTATTAAGTCAGATGTGAAATCCCAGGGCTCAACCTTGGAACTGCATTTGAAACTGGTAACCTAGAGACTAGGAGAGGTCAGTGGAATACCGAGT
                >M00365:7:000000000-A48JK:1:1101:10000:5055
                CACGTAGGGGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCTGTTCAGTAAGTCAGGTGTGAAAATCCAAGGCTCAACCTTGGGACGCCACCTGATACCGCTGTGACTAGAGTCCGGTAGAGGAGATTGGAATTCCTGG
                >M00365:7:000000000-A48JK:1:1101:10001:16084
                TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGCTAGGTCAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGATACTGCCTAGCTAGAGTATGTTAGAGGATTGCGGAATTCCAGGT
                reformat.sh gives me:
                Code:
                [zillur@genomics final]$ ./bbmap/reformat.sh in=new_15.fasta qfin=qual_.1.qual out=f_nw_15_ql_.1.fq
                java -ea -Xmx111g -cp /home/zillur/Desktop/zillur/yadira/study_1799_split_library_seqs_and_mapping/filter/final/bbmap/current/ jgi.ReformatReads in=new_15.fasta qfin=qual_.1.qual out=f_nw_15_ql_.1.fq
                Executing jgi.ReformatReads [in=new_15.fasta, qfin=qual_.1.qual, out=f_nw_15_ql_.1.fq]
                
                Input is being processed as unpaired
                Exception in thread "Thread-1" java.lang.AssertionError: Quality and Base headers differ for read 0
                	at stream.FastaQualReadInputStream.toReadList(FastaQualReadInputStream.java:128)
                	at stream.FastaQualReadInputStream.toReads(FastaQualReadInputStream.java:110)
                	at stream.FastaQualReadInputStream.fillBuffer(FastaQualReadInputStream.java:94)
                	at stream.FastaQualReadInputStream.hasMore(FastaQualReadInputStream.java:54)
                	at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:643)
                	at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
                What should I do now?

                Best Regards
                Zillur



                • #9
                  When you sorted the fasta file, did you also sort the qual file?

                  Originally posted by zillur
                  In my qual file I have many other sequences including my fasta.
                  What do you mean by having other sequences in your qual file?



                  • #10
                    If you have BioPerl ≥ 1.6.922 and Sort::Naturally, then

                    https://github.com/douglasgscofield/...ipts/fastaSort

                    shows how to sort on sequence name, using natural sort as it seems you require.
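To illustrate what natural sorting means for these read names, here is a small Python sketch. It is similar in spirit to a natural sort but is not a port of the linked script or of Sort::Naturally; `natural_key` is a hypothetical helper that splits a header into text and integer chunks so that numeric fields compare by value.

```python
import re


def natural_key(header):
    """Return a sort key where runs of ASCII digits compare as integers."""
    parts = re.split(r"([0-9]+)", header)
    # re.split with a capturing group alternates text, digits, text, ...
    # so every odd-indexed chunk is a digit run.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


names = [
    "M00365:7:000000000-A48JK:1:1101:10000:5055",
    "M00365:7:000000000-A48JK:1:1101:10000:14343",
    "M00365:7:000000000-A48JK:1:1101:10001:16084",
    "M00365:7:000000000-A48JK:1:1101:10000:25105",
]
natural_order = sorted(names, key=natural_key)
plain_order = sorted(names)
```

With plain string sorting, `...:10000:5055` lands after `...:10000:25105`; with the natural key it comes first among the `10000` reads.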



                    • #11
                      Originally posted by zillur
                      Thank you very much. I have tried this: But it doesn't resolve everything:
                      The sort example has sorted your data alphabetically. If you try to sort your qual file, I think you will get the same order of headers.
                      Persistent LABS
