Problem: All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a method to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Example (taken from leetcode):
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Approach: Sliding window with hashing. Simply take each possible 10 character substring and put it in the hash. If a substring coming for the second time, just add it into result.
Implementation in C#:
public IList<string> FindRepeatedDnaSequences(string s)
{
List<string> result = new List<string>();
Dictionary<string, int> hash = new Dictionary<string, int>();
for (int i = 0; i + 10 <= s.Length; ++i)
{
string key = s.Substring(i, 10);
hash[key] = hash.ContainsKey(key) ? hash[key] + 1 : 1;
if (hash[key] == 2)
{
result.Add(key);
}
}
return result;
}
Complexity: O(10*n) => O(n)
No comments:
Post a Comment