class Solution:
    def findDuplicate(self, paths):
        from collections import defaultdict
        content_map = defaultdict(list)
        for path in paths:
            tokens = path.split()
            root = tokens[0]
            for file in tokens[1:]:
                name, content = file.split('(')
                content = content[:-1]  # remove trailing ')'
                full_path = root + '/' + name
                content_map[content].append(full_path)
        return [group for group in content_map.values() if len(group) > 1]
class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        unordered_map<string, vector<string>> content_map;  // content -> full file paths
        for (auto &path : paths) {
            istringstream iss(path);
            string root;
            iss >> root;                       // first token is the directory
            string file;
            while (iss >> file) {              // remaining tokens look like "name(content)"
                int l = file.find('(');
                int r = file.find(')');
                string name = file.substr(0, l);
                string content = file.substr(l + 1, r - l - 1);
                string full_path = root + "/" + name;
                content_map[content].push_back(full_path);
            }
        }
        vector<vector<string>> res;
        for (auto &p : content_map) {
            if (p.second.size() > 1)           // keep only contents shared by 2+ files
                res.push_back(p.second);
        }
        return res;
    }
};
class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        Map<String, List<String>> contentMap = new HashMap<>();  // content -> full file paths
        for (String path : paths) {
            String[] tokens = path.split(" ");
            String root = tokens[0];                             // first token is the directory
            for (int i = 1; i < tokens.length; ++i) {
                String file = tokens[i];                         // "name(content)"
                int l = file.indexOf('(');
                int r = file.indexOf(')');
                String name = file.substring(0, l);
                String content = file.substring(l + 1, r);
                String fullPath = root + "/" + name;
                contentMap.computeIfAbsent(content, k -> new ArrayList<>()).add(fullPath);
            }
        }
        List<List<String>> res = new ArrayList<>();
        for (List<String> group : contentMap.values()) {
            if (group.size() > 1)                                // keep only contents shared by 2+ files
                res.add(group);
        }
        return res;
    }
}
var findDuplicate = function(paths) {
    const contentMap = {};                      // content -> full file paths
    for (let path of paths) {
        let tokens = path.split(' ');
        let root = tokens[0];                   // first token is the directory
        for (let i = 1; i < tokens.length; ++i) {
            let file = tokens[i];               // "name(content)"
            let l = file.indexOf('(');
            let r = file.indexOf(')');
            let name = file.substring(0, l);
            let content = file.substring(l + 1, r);
            let fullPath = root + '/' + name;
            if (!contentMap[content]) contentMap[content] = [];
            contentMap[content].push(fullPath);
        }
    }
    return Object.values(contentMap).filter(group => group.length > 1);
};
Given a list of directory information strings, where each string represents a directory and its files in the format "root/d1/d2/.../dn file1.txt(content1) file2.txt(content2) ...", find all groups of duplicate files in the file system. Files are considered duplicates if they have exactly the same content (not just the same name).
Your task is to return a list of groups, where each group contains the file paths of files that share the same content, and each group must contain at least two files. A file path is formed by joining the directory path and the file name with a slash (/).
The problem revolves around detecting duplicate file contents across a simulated file system. At first glance, a brute-force approach would be to compare every file with every other file, but this quickly becomes inefficient as the number of files grows. Instead, we should look for a way to group files by their content efficiently.
The key insight is that the file content (the string inside parentheses) serves as a unique identifier for the file's data. If we map this content to the list of file paths that contain it, we can easily identify duplicates: any content with more than one associated file path indicates duplicate files.
This leads us to consider using a hash map (dictionary) where the keys are file contents and the values are lists of file paths. By processing each input string, extracting file names and their content, and populating this map, we can efficiently group duplicate files.
Let's break down the solution step-by-step (a small parsing sketch follows the list):
1. Split each directory string on whitespace; the first token is the directory path and the remaining tokens are file entries of the form name(content).
2. For each file entry, split at the opening parenthesis to separate the file name from its content, and drop the trailing ')'.
3. Build the full path as directory + "/" + name and append it to the map entry keyed by that content.
4. After all entries are processed, collect every list of paths containing more than one element; these are the duplicate groups.
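As a minimal illustration of steps 1-3, the sketch below parses a single directory string into (full path, content) pairs; the helper name parse_entry is purely illustrative and is not part of the solutions above.

from collections import defaultdict

def parse_entry(entry):
    # Split "root/a 1.txt(abcd) 2.txt(efgh)" into the directory and its file tokens.
    directory, *files = entry.split()
    for token in files:
        name, content = token.split('(')              # "1.txt(abcd)" -> "1.txt", "abcd)"
        yield directory + '/' + name, content[:-1]    # strip the trailing ')'

content_map = defaultdict(list)
for full_path, content in parse_entry("root/a 1.txt(abcd) 2.txt(efgh)"):
    content_map[content].append(full_path)
print(dict(content_map))  # {'abcd': ['root/a/1.txt'], 'efgh': ['root/a/2.txt']}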
This approach leverages the efficiency of hash maps for grouping and lookups, ensuring that we only need to process each file once.
Let's consider the following input:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
"root/a 1.txt(abcd) 2.txt(efgh)"
:
root/a
1.txt(abcd)
, 2.txt(efgh)
abcd
→ [root/a/1.txt
]efgh
→ [root/a/2.txt
]"root/c 3.txt(abcd)"
:
root/c
3.txt(abcd)
abcd
→ [root/a/1.txt
, root/c/3.txt
]"root/c/d 4.txt(efgh)"
:
root/c/d
4.txt(efgh)
efgh
→ [root/a/2.txt
, root/c/d/4.txt
]"root 4.txt(efgh)"
:
root
4.txt(efgh)
efgh
→ [root/a/2.txt
, root/c/d/4.txt
, root/4.txt
]After processing, the map is:
abcd
→ [root/a/1.txt
, root/c/3.txt
]efgh
→ [root/a/2.txt
, root/c/d/4.txt
, root/4.txt
]root/a/1.txt
, root/c/3.txt
]root/a/2.txt
, root/c/d/4.txt
, root/4.txt
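For reference, running the Python solution above on this input reproduces the result; group order follows the insertion order of contents in the map.

paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
         "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
print(Solution().findDuplicate(paths))
# [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]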
Brute-force approach: O(N^2 * L) time, where N is the number of files and L is the average length of file content, since every pair of files must have its contents compared.
Optimized time complexity: O(N * L), where N is the number of files and L is the average content length. Each file is processed once, and the string operations on it are O(L).
Space complexity: O(N * L), since we store each file's content and path in the map.
The optimized approach is significantly faster than brute force as the number of files increases, at the cost of the O(N * L) space used by the map.
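For contrast, here is a hypothetical brute-force version, shown only to illustrate where the O(N^2 * L) cost comes from; it is not one of the solutions above.

def find_duplicate_brute_force(paths):
    # Flatten the input into (full_path, content) records.
    records = []
    for entry in paths:
        directory, *files = entry.split()
        for token in files:
            name, content = token.split('(')
            records.append((directory + '/' + name, content[:-1]))

    groups, seen = [], set()
    for i, (path_i, content_i) in enumerate(records):
        if i in seen:
            continue
        # Compare against every later record: O(N^2) pairs, O(L) per content comparison.
        group = [path_i]
        for j in range(i + 1, len(records)):
            if j not in seen and records[j][1] == content_i:
                group.append(records[j][0])
                seen.add(j)
        if len(group) > 1:
            groups.append(group)
    return groups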
The problem is elegantly solved by leveraging a hash map to group file paths by their content. By processing each file only once and grouping by content, we efficiently identify all groups of duplicate files. This approach avoids unnecessary comparisons and scales well with input size. The key insight is to treat file content as a unique identifier, making grouping and lookup operations fast and straightforward.