187. Repeated DNA Sequences - Leetcode Solution

Problem Description

The Repeated DNA Sequences problem asks you to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. The DNA sequence is provided as a string s consisting of only the characters 'A', 'C', 'G', and 'T'. Your task is to return a list of all such repeated sequences, without duplicates.

Each substring must be exactly 10 characters long.
Return each repeated sequence only once, even if it appears more than twice.
The input string may be very long (up to 10⁵ characters), so efficiency is important.

Thought Process

At first glance, you might think to check every possible 10-letter substring and count how many times it appears. However, this approach would be very slow for large strings, since there are many overlapping substrings and checking each one against all others would be inefficient.

Instead, we want a way to efficiently track which 10-letter substrings we've seen before, and quickly check if we've seen them more than once. This is a classic use case for a hash set: we can store each substring as we see it, and if we encounter it again, we know it's a repeat.

By using a hash set, we avoid the need to compare every substring with every other substring, and instead check for repeats in constant time.

Solution Approach

The solution can be broken down into the following steps:

Iterate over all possible 10-letter substrings:
- For a string of length n, there are n - 9 possible 10-letter substrings.
Track seen substrings:
- Use a hash set to store each 10-letter substring as you see it for the first time.
Track repeated substrings:
- Use another hash set (or list) to store substrings that have already been seen and repeated, so you only add them to the result once.
Build the result:
- If a substring has been seen before, and not already added to the result, add it to the result set.
Return the result:
- Convert the result set to a list (if needed) and return it.

Why use sets? Sets allow for constant-time lookups and insertions, making this approach efficient even for large input strings.

Example Walkthrough

Let's use the sample input s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT".

Start at index 0. The first 10-letter substring is "AAAAACCCCC". Add it to the seen set.
Move to index 1. The next substring is "AAAACCCCCA". Add it to the seen set.
Continue this process, adding each new substring to the seen set.
When you reach index 10, the substring is "AAAAACCCCC" again. This one is already in the seen set, so add it to the result set.
Continue until the end of the string. You will find that "AAAAACCCCC" and "CCCCCAAAAA" each appear more than once.
The final result is ["AAAAACCCCC", "CCCCCAAAAA"].

At each step, we only need to check if the substring is in the seen set, and if so, whether we've already added it to the result.

Time and Space Complexity

Brute-force approach:

Time: O(n^2), since for each substring, you might have to compare it to all others.
Space: O(n), storing all substrings.

Optimized approach (using hash sets):

Time: O(n), where n is the length of the string. We only iterate through the string once, and set operations are O(1).
Space: O(n), since in the worst case, all substrings are unique and stored in the set.

The optimized approach is much more efficient and suitable for large inputs.

Code Implementation

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> list[str]:
        seen = set()
        repeated = set()
        for i in range(len(s) - 9):
            seq = s[i:i+10]
            if seq in seen:
                repeated.add(seq)
            else:
                seen.add(seq)
        return list(repeated)

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        unordered_set<string> seen, repeated;
        for (int i = 0; i + 9 < s.length(); ++i) {
            string seq = s.substr(i, 10);
            if (seen.count(seq))
                repeated.insert(seq);
            else
                seen.insert(seq);
        }
        return vector<string>(repeated.begin(), repeated.end());
    }
};

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new HashSet<>();
        for (int i = 0; i <= s.length() - 10; i++) {
            String seq = s.substring(i, i + 10);
            if (seen.contains(seq)) {
                repeated.add(seq);
            } else {
                seen.add(seq);
            }
        }
        return new ArrayList<>(repeated);
    }
}

var findRepeatedDnaSequences = function(s) {
    const seen = new Set();
    const repeated = new Set();
    for (let i = 0; i <= s.length - 10; i++) {
        const seq = s.substring(i, i + 10);
        if (seen.has(seq)) {
            repeated.add(seq);
        } else {
            seen.add(seq);
        }
    }
    return Array.from(repeated);
};

Summary

The key to solving the Repeated DNA Sequences problem efficiently is to use hash sets to track seen and repeated substrings. By iterating through the string only once and leveraging the constant-time lookups provided by sets, we can find all repeated 10-letter sequences quickly and without unnecessary duplication. This approach is both simple and powerful, demonstrating how the right data structures can turn a brute-force problem into an elegant solution.

AlgoMap.io

Want Help Cracking FAANG?