609. Find Duplicate File in System - Leetcode Solution

Code Implementation

Python:

from collections import defaultdict

class Solution:
    def findDuplicate(self, paths):
        # Map each file content string to the list of full paths that contain it.
        content_map = defaultdict(list)
        for path in paths:
            tokens = path.split()
            root = tokens[0]                # first token: directory path
            for file in tokens[1:]:         # remaining tokens: "name.txt(content)"
                name, content = file.split('(')
                content = content[:-1]      # remove trailing ')'
                full_path = root + '/' + name
                content_map[content].append(full_path)
        # Only contents shared by two or more paths are duplicates.
        return [group for group in content_map.values() if len(group) > 1]
      
C++:

#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        // Map each file content string to the full paths that contain it.
        unordered_map<string, vector<string>> content_map;
        for (auto &path : paths) {
            istringstream iss(path);
            string root;
            iss >> root;                    // first token: directory path
            string file;
            while (iss >> file) {           // remaining tokens: "name(content)"
                size_t l = file.find('(');
                size_t r = file.find(')');
                string name = file.substr(0, l);
                string content = file.substr(l + 1, r - l - 1);
                string full_path = root + "/" + name;
                content_map[content].push_back(full_path);
            }
        }
        vector<vector<string>> res;
        for (auto &p : content_map) {
            if (p.second.size() > 1)        // duplicates: content shared by 2+ files
                res.push_back(p.second);
        }
        return res;
    }
};
      
Java:

import java.util.*;

class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        // Map each file content string to the full paths that contain it.
        Map<String, List<String>> contentMap = new HashMap<>();
        for (String path : paths) {
            String[] tokens = path.split(" ");
            String root = tokens[0];                     // first token: directory path
            for (int i = 1; i < tokens.length; ++i) {    // remaining tokens: "name(content)"
                String file = tokens[i];
                int l = file.indexOf('(');
                int r = file.indexOf(')');
                String name = file.substring(0, l);
                String content = file.substring(l + 1, r);
                String fullPath = root + "/" + name;
                contentMap.computeIfAbsent(content, k -> new ArrayList<>()).add(fullPath);
            }
        }
        List<List<String>> res = new ArrayList<>();
        for (List<String> group : contentMap.values()) {
            if (group.size() > 1)                        // duplicates: content shared by 2+ files
                res.add(group);
        }
        return res;
    }
}
      
JavaScript:

var findDuplicate = function(paths) {
    // Map each file content string to the full paths that contain it.
    const contentMap = {};
    for (let path of paths) {
        let tokens = path.split(' ');
        let root = tokens[0];                       // first token: directory path
        for (let i = 1; i < tokens.length; ++i) {   // remaining tokens: "name(content)"
            let file = tokens[i];
            let l = file.indexOf('(');
            let r = file.indexOf(')');
            let name = file.substring(0, l);
            let content = file.substring(l + 1, r);
            let fullPath = root + '/' + name;
            if (!contentMap[content]) contentMap[content] = [];
            contentMap[content].push(fullPath);
        }
    }
    // Keep only groups with two or more paths (duplicates).
    return Object.values(contentMap).filter(group => group.length > 1);
};
      

Problem Description

Given a list of directory information in the form of strings, where each string represents a directory and its files in the format "root/d1/d2/.../dn file1.txt(content1) file2.txt(content2) ...", you are to find all groups of duplicate files in the file system. Files are considered duplicates if they have exactly the same content (not just the same name).

Your task is to return a list of groups, where each group contains the file paths of files that share the same content. Each group must have at least two files. The file path should be formed by combining the directory path and the file name, separated by a slash (/).

Constraints:

  • Each input string contains one directory followed by one or more files with content in parentheses.
  • All file names are unique within a directory but may appear in different directories.
  • Only duplicate files (with identical content) should be grouped; unique files are ignored.
  • Each file path appears in at most one group; files are never reused across groups.

Thought Process

The problem revolves around detecting duplicate file contents across a simulated file system. At first glance, a brute-force approach would be to compare every file with every other file, but this quickly becomes inefficient as the number of files grows. Instead, we should look for a way to group files by their content efficiently.

The key insight is that the file content (the string inside the parentheses) can serve as a grouping key. If we map each content string to the list of file paths that contain it, duplicates are easy to spot: any content with more than one associated file path marks a group of duplicate files.

This leads us to consider using a hash map (dictionary) where the keys are file contents and the values are lists of file paths. By processing each input string, extracting file names and their content, and populating this map, we can efficiently group duplicate files.
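
To make this mapping concrete, here is what the map could look like for a small, hypothetical input (the directories and file names below are invented purely for illustration and are not part of the problem statement):

# Hypothetical input: ["root/a 1.txt(abcd)", "root/b 2.txt(abcd) 3.txt(xyz)"]
# After grouping by content, the map would hold:
content_map = {
    "abcd": ["root/a/1.txt", "root/b/2.txt"],  # two paths -> a duplicate group
    "xyz":  ["root/b/3.txt"],                  # one path  -> not reported
}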

Solution Approach

Let's break down the solution step-by-step:

  1. Initialize a map: Create a hash map (dictionary) to associate file contents with a list of file paths.
  2. Iterate through each directory string: For each string in the input, split it by spaces. The first token is the directory path, and the remaining tokens are files.
  3. Process each file: For every file token, separate the file name and its content. The content is always between the first '(' and the closing ')'. Build the full file path by joining the directory path and the file name.
  4. Populate the map: Add the full file path to the list corresponding to its content in the map.
  5. Collect results: After processing all input, iterate over the map and collect only those lists that have more than one file path, indicating duplicates.

This approach leverages the efficiency of hash maps for grouping and lookups, ensuring that we only need to process each file once.
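
As a minimal sketch of step 3 above, a single token such as "1.txt(abcd)" can be parsed by splitting at the first '(' and dropping the trailing ')'. The helper name parse_file_token is hypothetical; it only isolates the string handling that the solutions above perform inside their loops:

def parse_file_token(root, token):
    # token has the form "name(content)", e.g. "1.txt(abcd)"
    name, content = token.split('(', 1)   # split at the first '('
    content = content[:-1]                # drop the trailing ')'
    return root + '/' + name, content

# parse_file_token("root/a", "1.txt(abcd)") returns ("root/a/1.txt", "abcd")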

Example Walkthrough

Let's consider the following input:

["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]

  1. The first string is "root/a 1.txt(abcd) 2.txt(efgh)":
    • Directory: root/a
    • Files: 1.txt(abcd), 2.txt(efgh)
    • Map updates:
      • abcd → [root/a/1.txt]
      • efgh → [root/a/2.txt]
  2. The second string is "root/c 3.txt(abcd)":
    • Directory: root/c
    • File: 3.txt(abcd)
    • Map update:
      • abcd → [root/a/1.txt, root/c/3.txt]
  3. The third string is "root/c/d 4.txt(efgh)":
    • Directory: root/c/d
    • File: 4.txt(efgh)
    • Map update:
      • efgh → [root/a/2.txt, root/c/d/4.txt]
  4. The fourth string is "root 4.txt(efgh)":
    • Directory: root
    • File: 4.txt(efgh)
    • Map update:
      • efgh → [root/a/2.txt, root/c/d/4.txt, root/4.txt]

After processing, the map is:

  • abcd → [root/a/1.txt, root/c/3.txt]
  • efgh → [root/a/2.txt, root/c/d/4.txt, root/4.txt]

Both groups have more than one file, so they are returned as the result (a quick run of the Python solution after this list confirms it):
  • [root/a/1.txt, root/c/3.txt]
  • [root/a/2.txt, root/c/d/4.txt, root/4.txt]
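
As that quick check, the Python solution above can be run directly on this input; the group order (and the order of paths within each group) may differ, since it follows the map's iteration order:

paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
         "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
print(Solution().findDuplicate(paths))
# Expected groups (order may vary):
# [['root/a/1.txt', 'root/c/3.txt'],
#  ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]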

Time and Space Complexity

Brute-force approach:

  • Comparing every file with every other file would result in O(N^2 * L) time, where N is the number of files and L is the average length of file content.
  • This is inefficient for large inputs.

Optimized approach (hash map):

  • Time complexity: O(N * L), where N is the number of files and L is the average content length. Each file is processed once, and string operations are O(L).
  • Space complexity: O(N * L), since we store each file's content and path in the map.

The hash-map approach is significantly faster than the brute-force comparison; the extra O(N * L) space for the map is a modest price for avoiding pairwise content comparisons, and the solution scales well as the number of files increases.
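
For contrast, a rough sketch of the brute-force idea shows where the O(N^2 * L) cost comes from: each file's content is compared against every other file's. This is illustrative only (it assumes the (path, content) pairs have already been parsed) and is not a recommended implementation:

def find_duplicates_bruteforce(files):
    # files: list of (full_path, content) pairs already parsed from the input
    n = len(files)
    used = [False] * n
    groups = []
    for i in range(n):                      # O(N) outer loop
        if used[i]:
            continue
        group = [files[i][0]]
        for j in range(i + 1, n):           # O(N) inner loop
            if not used[j] and files[j][1] == files[i][1]:   # O(L) string comparison
                group.append(files[j][0])
                used[j] = True
        if len(group) > 1:
            groups.append(group)
    return groups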

Summary

The problem is elegantly solved by leveraging a hash map to group file paths by their content. By processing each file only once and grouping by content, we efficiently identify all groups of duplicate files. This approach avoids unnecessary comparisons and scales well with input size. The key insight is to treat file content as a unique identifier, making grouping and lookup operations fast and straightforward.