The "Statistics from a Large Sample" problem asks you to compute several statistical values from a large dataset. You are given an array count
of length 256, where count[i]
represents the number of times the integer i
appears in the sample. The possible values in the sample are all integers from 0 to 255.
Your task is to compute and return an array containing the following statistics, in order:
Constraints:
count.length == 256
0 <= count[i] <= 10^9
count[i]
is at least 1 (there is at least one value in the sample)
At first glance, you might think to reconstruct the full sample array and then use standard functions to compute statistics. However, since the sample could be extremely large (with counts up to 10^9
per value), this is not feasible in terms of time or memory.
Instead, we realize that all the necessary information is encoded in the count
array. We can process the count
array directly in a single pass to compute the minimum, maximum, total sum, total count, and mode. The only slightly tricky part is computing the median, because it depends on the total number of elements and their distribution.
We shift from a brute-force approach (building the full sample) to an optimized approach (processing the histogram/counts directly).
Let's break down the steps to compute each statistic efficiently:
i
where count[i] > 0
; that's the minimum.i
where count[i] > 0
; that's the maximum.i
, compute count[i] * i
and sum these values to get the total sum.count[i]
to get the total number of elements.i
with the highest count[i]
as you iterate.(total_count // 2) + 1
.
(total_count // 2)
and (total_count // 2) + 1
.
count
array, keeping a running total, and record the value(s) where the cumulative count reaches the needed positions.
This approach only requires a few passes through the count
array and uses constant extra space.
Let's consider a small example:
count = [0, 2, 3, 0, 1, 0, ..., 0] # Only indices 1, 2, and 4 are non-zero # Sample: [1, 1, 2, 2, 2, 4]
Brute-force approach:
count
array a constant number of times. 256 is a constant, so this is effectively O(1).The optimized approach is extremely efficient and suitable for very large samples.
By leveraging the histogram representation in the count
array, we can compute all required statistics with a few simple passes, without ever reconstructing the large sample. This approach is both time and space efficient, and illustrates the power of thinking in terms of counts and positions rather than individual elements. The most subtle part is finding the median, which requires careful tracking of cumulative counts, but all other statistics follow directly from the counts.
class Solution:
def sampleStats(self, count):
# Find min and max
min_val = None
max_val = None
total_count = 0
total_sum = 0
mode = 0
max_freq = 0
for i in range(256):
if count[i]:
if min_val is None:
min_val = i
max_val = i
if count[i] > max_freq:
max_freq = count[i]
mode = i
total_count += count[i]
total_sum += count[i] * i
# Find median
median = 0.0
left = (total_count + 1) // 2
right = (total_count // 2) + 1
cnt = 0
m1 = None
m2 = None
for i in range(256):
cnt += count[i]
if m1 is None and cnt >= left:
m1 = i
if m2 is None and cnt >= right:
m2 = i
break
median = (m1 + m2) / 2
mean = total_sum / total_count
return [float(min_val), float(max_val), mean, median, float(mode)]
class Solution {
public:
vector<double> sampleStats(vector<int>& count) {
int minVal = -1, maxVal = -1, mode = 0;
long long total = 0, sum = 0;
int maxFreq = 0;
for (int i = 0; i < 256; ++i) {
if (count[i]) {
if (minVal == -1) minVal = i;
maxVal = i;
if (count[i] > maxFreq) {
maxFreq = count[i];
mode = i;
}
}
total += count[i];
sum += (long long)count[i] * i;
}
// Median
double median = 0.0;
int left = (total + 1) / 2, right = (total / 2) + 1;
int cnt = 0;
int m1 = -1, m2 = -1;
for (int i = 0; i < 256; ++i) {
cnt += count[i];
if (m1 == -1 && cnt >= left) m1 = i;
if (m2 == -1 && cnt >= right) {
m2 = i;
break;
}
}
median = (m1 + m2) / 2.0;
double mean = (double)sum / total;
return { (double)minVal, (double)maxVal, mean, median, (double)mode };
}
};
class Solution {
public double[] sampleStats(int[] count) {
int min = -1, max = -1, mode = 0;
long total = 0, sum = 0;
int maxFreq = 0;
for (int i = 0; i < 256; i++) {
if (count[i] > 0) {
if (min == -1) min = i;
max = i;
if (count[i] > maxFreq) {
maxFreq = count[i];
mode = i;
}
}
total += count[i];
sum += (long)count[i] * i;
}
// Median
double median = 0.0;
int left = (int)((total + 1) / 2);
int right = (int)(total / 2 + 1);
int cnt = 0, m1 = -1, m2 = -1;
for (int i = 0; i < 256; i++) {
cnt += count[i];
if (m1 == -1 && cnt >= left) m1 = i;
if (m2 == -1 && cnt >= right) {
m2 = i;
break;
}
}
median = (m1 + m2) / 2.0;
double mean = (double)sum / total;
return new double[]{(double)min, (double)max, mean, median, (double)mode};
}
}
var sampleStats = function(count) {
let min = null, max = null, mode = 0, maxFreq = 0;
let total = 0, sum = 0;
for (let i = 0; i < 256; i++) {
if (count[i]) {
if (min === null) min = i;
max = i;
if (count[i] > maxFreq) {
maxFreq = count[i];
mode = i;
}
}
total += count[i];
sum += count[i] * i;
}
// Median
let left = Math.floor((total + 1) / 2);
let right = Math.floor(total / 2) + 1;
let cnt = 0, m1 = null, m2 = null;
for (let i = 0; i < 256; i++) {
cnt += count[i];
if (m1 === null && cnt >= left) m1 = i;
if (m2 === null && cnt >= right) {
m2 = i;
break;
}
}
let median = (m1 + m2) / 2;
let mean = sum / total;
return [min * 1.0, max * 1.0, mean, median, mode * 1.0];
};