Thursday, October 1, 2020

Repeated DNA Sequences

Problem: All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a method to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Example (taken from leetcode):

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC", "CCCCCAAAAA"]


Approach: Sliding window with hashing. Simply take each possible 10 character substring and put it in the hash. If a substring coming for the second time, just add it into result.


Implementation in C#:

        public IList<string> FindRepeatedDnaSequences(string s)

        {

            List<string> result = new List<string>();

            Dictionary<string, int> hash = new Dictionary<string, int>();

            for (int i = 0; i + 10 <= s.Length; ++i)

            {

                string key = s.Substring(i, 10);

                hash[key] =  hash.ContainsKey(key) ? hash[key] + 1 : 1;

                if (hash[key] == 2)

                {

                    result.Add(key);

                }

            }

            return result;

        } 


Complexity: O(10*n) => O(n)

No comments:

Post a Comment