The Repeated DNA Sequences problem asks you to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. The DNA sequence is provided as a string s
consisting of only the characters 'A', 'C', 'G', and 'T'. Your task is to return a list of all such repeated sequences, without duplicates.
At first glance, you might think to check every possible 10-letter substring and count how many times it appears. However, this approach would be very slow for large strings, since there are many overlapping substrings and checking each one against all others would be inefficient.
Instead, we want a way to efficiently track which 10-letter substrings we've seen before, and quickly check if we've seen them more than once. This is a classic use case for a hash set: we can store each substring as we see it, and if we encounter it again, we know it's a repeat.
By using a hash set, we avoid the need to compare every substring with every other substring, and instead check for repeats in constant time.
The solution can be broken down into the following steps:
n
, there are n - 9
possible 10-letter substrings.Why use sets? Sets allow for constant-time lookups and insertions, making this approach efficient even for large input strings.
Let's use the sample input s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
.
"AAAAACCCCC"
. Add it to the seen set.
"AAAACCCCCA"
. Add it to the seen set.
"AAAAACCCCC"
again. This one is already in the seen set, so add it to the result set.
"AAAAACCCCC"
and "CCCCCAAAAA"
each appear more than once.
["AAAAACCCCC", "CCCCCAAAAA"]
.
At each step, we only need to check if the substring is in the seen set, and if so, whether we've already added it to the result.
Brute-force approach:
The optimized approach is much more efficient and suitable for large inputs.
class Solution:
def findRepeatedDnaSequences(self, s: str) -> list[str]:
seen = set()
repeated = set()
for i in range(len(s) - 9):
seq = s[i:i+10]
if seq in seen:
repeated.add(seq)
else:
seen.add(seq)
return list(repeated)
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
unordered_set<string> seen, repeated;
for (int i = 0; i + 9 < s.length(); ++i) {
string seq = s.substr(i, 10);
if (seen.count(seq))
repeated.insert(seq);
else
seen.insert(seq);
}
return vector<string>(repeated.begin(), repeated.end());
}
};
class Solution {
public List<String> findRepeatedDnaSequences(String s) {
Set<String> seen = new HashSet<>();
Set<String> repeated = new HashSet<>();
for (int i = 0; i <= s.length() - 10; i++) {
String seq = s.substring(i, i + 10);
if (seen.contains(seq)) {
repeated.add(seq);
} else {
seen.add(seq);
}
}
return new ArrayList<>(repeated);
}
}
var findRepeatedDnaSequences = function(s) {
const seen = new Set();
const repeated = new Set();
for (let i = 0; i <= s.length - 10; i++) {
const seq = s.substring(i, i + 10);
if (seen.has(seq)) {
repeated.add(seq);
} else {
seen.add(seq);
}
}
return Array.from(repeated);
};
The key to solving the Repeated DNA Sequences problem efficiently is to use hash sets to track seen and repeated substrings. By iterating through the string only once and leveraging the constant-time lookups provided by sets, we can find all repeated 10-letter sequences quickly and without unnecessary duplication. This approach is both simple and powerful, demonstrating how the right data structures can turn a brute-force problem into an elegant solution.