Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • tool to find how many counts for each sequence

    Hi all,

    It is my first time to do gene sequencing. I am sequencing short PCR products and want to get the counts for each sequence (I know the sequences in the sample for PCR). Is there anyone knew any tool for this?

    Thank you for the help.

    Zheng

  • #2
    How long are the sequences, and do you expect 100% identity, or what? Do you already know the sequences you are expecting?

    Comment


    • #3
      30 to 90 bp.

      Yes, I expect 100% identity and know the sequences.

      Comment


      • #4
        In that case, just map the reads to a reference fasta file you make of the sequences. For example, with BBMap, you can output a file containing the counts of reads that mapped to each reference sequence:

        bbmap.sh -Xmx2g ref=sequences.fasta in=reads.fastq scafstats=stats.txt

        What kind of sequencing are you doing? You may need to adapter-trim the input reads first, for example, if you have a fixed read length with NGS reads.

        Comment


        • #5
          Thanks a lot. I use MiSeq.

          Comment


          • #6
            In that case, you will need to either trim the adapters first, or use a different approach such as matching substrings, if the reference sequences are sufficiently different.

            You can trim adapters like this:

            bbduk.sh -Xmx1g in=reads.fq out=clean.fq ktrim=r ref=nextera.fa.gz,truseq.fa.gz k=25 mink=12 hdist=1

            ...where those nextera and truseq adapter files are included with the BBTools package (which contains BBMap and BBDuk) in the /resources/ directory.

            Alternately, you can generate a report directly from BBDuk like this:

            bbduk.sh -Xmx1g in=reads.fastq ref=sequences.fasta stats=stats.txt k=29 fbm

            ...if the reference sequences are sufficiently different that they do not share any 29-mers.

            Comment


            • #7
              If I add a few samples that include inserts with 30 bp unknown sequences, is it possible for me to count reads for 50 sequences that appear most frequently?

              Comment


              • #8
                The above tools require known sequences. If some are unknown, you could assemble the adapter-trimmed reads (with, say, Velvet) which should result in a fasta file containing one copy of each sequence present in your library, then map to that assembly to get the counts, as in post 4.

                Comment


                • #9
                  Brian, it is very helpful. Thanks again.

                  Comment


                  • #10
                    Oh, I should probably add, this may be difficult to assemble since the coverage is likely to be extremely high. You may need to normalize first. Also, what length are the reads, and are they paired or single?

                    Comment


                    • #11
                      The reads are from both sides and 30 for each side.

                      Comment


                      • #12
                        In that case, there's an easier solution, if all of the reads of the same product are expected to start at the same location.

                        kmercountexact.sh in=reads.fq outk=counts.txt mincount=4 k=30

                        That will produce a file "counts.txt" containing the count of every 30bp string in the file, excluding strings present fewer than 4 times.

                        Comment


                        • #13
                          It works good

                          until

                          zheng@zheng-XPS-8500:~/Desktop/bbmap/20140916ngs$ kmercountexact.sh in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16
                          java -ea -Xmx11628m -cp /home/zheng/Desktop/bbmap/current/ jgi.CountKmersExact in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16
                          Executing jgi.CountKmersExact [in=ERa-05RNS160m_S6_L001_R2_001.fastq, outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt, mincount=3, k=16]

                          ways=37
                          Initial:
                          Memory: free=187m, used=64m

                          Final:
                          Memory: free=125m, used=193m

                          Input: 396838 reads 11111464 bases.
                          Result: 0 reads (0.00%) 0 bases (0.00%)
                          Unique Kmers: 4345841
                          Exception in thread "main" java.lang.AssertionError: ACCTTTTACCTAGTTT 3

                          at kmer.KmerNode.dumpKmersAsText(KmerNode.java:155)
                          at kmer.HashForest.dumpKmersAsText(HashForest.java:211)
                          at kmer.HashArray.dumpKmersAsText(HashArray.java:249)
                          at jgi.CountKmersExact.dumpKmersAsText(CountKmersExact.java:749)
                          at jgi.CountKmersExact.process(CountKmersExact.java:383)
                          at jgi.CountKmersExact.main(CountKmersExact.java:58)

                          Here I tried k=25, 20, 16. It works for k=25 or 20,

                          Some sequences at both ends may show "N". Do this command search short sequence from different start positions? Or always start from a fixed position? thanks,

                          Zheng

                          Comment


                          • #14
                            Oh, sorry about that; I left a debug assertion in. I will remove that from the next version. Until then, add the "-da" flag to your command and it will work correctly.

                            Any subsequence containing an 'N' will be ignored. Otherwise, it looks at all valid subsequences. So, for example:

                            NNTTANGNNCT

                            for K=2 would produce:
                            TT 1
                            TA 1
                            CT 1

                            but nothing with the G that is surrounded by Ns. Thus, the reason I suggest K=30 for length-30 reads is that you will produce exactly one substring per read, while K=20 would produce 11 substrings per read. But if a 30bp read contains even a single N, it will yield no substrings.

                            Comment


                            • #15
                              Thanks.

                              Seems still have a problem after with flag -da.

                              zheng@zheng-XPS-8500:~/Desktop/bbmap/20140916ngs$ kmercountexact.sh in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16 -da
                              java -da -Xmx11773m -cp /home/zheng/Desktop/bbmap/current/ jgi.CountKmersExact in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16 -da
                              Executing jgi.CountKmersExact [in=ERa-05RNS160m_S6_L001_R2_001.fastq, outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt, mincount=3, k=16, -da]

                              ways=37
                              Exception in thread "main" java.lang.RuntimeException: Output file ERa-05RNS160m_S6_L001_R2_001_counts16.txt already exists, and overwrite=false
                              at jgi.CountKmersExact.<init>(CountKmersExact.java:335)
                              at jgi.CountKmersExact.main(CountKmersExact.java:55)

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              27 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              31 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              27 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X