Want Help Cracking FAANG?

(Then click this)

×
Back to Question Bank

1093. Statistics from a Large Sample - Leetcode Solution

Problem Description

The "Statistics from a Large Sample" problem asks you to compute several statistical values from a large dataset. You are given an array count of length 256, where count[i] represents the number of times the integer i appears in the sample. The possible values in the sample are all integers from 0 to 255.

Your task is to compute and return an array containing the following statistics, in order:

  • The minimum value in the sample
  • The maximum value in the sample
  • The mean (average) value of the sample
  • The median value of the sample
  • The mode (the value that appears most frequently) of the sample
Each value should be returned as a floating-point number, except for the minimum, maximum, and mode, which are integers.

Constraints:

  • count.length == 256
  • 0 <= count[i] <= 10^9
  • The sum of all count[i] is at least 1 (there is at least one value in the sample)
You must process the input efficiently, as the sample can be very large.

Thought Process

At first glance, you might think to reconstruct the full sample array and then use standard functions to compute statistics. However, since the sample could be extremely large (with counts up to 10^9 per value), this is not feasible in terms of time or memory.

Instead, we realize that all the necessary information is encoded in the count array. We can process the count array directly in a single pass to compute the minimum, maximum, total sum, total count, and mode. The only slightly tricky part is computing the median, because it depends on the total number of elements and their distribution.

We shift from a brute-force approach (building the full sample) to an optimized approach (processing the histogram/counts directly).

Solution Approach

Let's break down the steps to compute each statistic efficiently:

  1. Minimum and Maximum:
    • Scan from left to right to find the first index i where count[i] > 0; that's the minimum.
    • Scan from right to left to find the first index i where count[i] > 0; that's the maximum.
  2. Mean (Average):
    • For each i, compute count[i] * i and sum these values to get the total sum.
    • Sum all count[i] to get the total number of elements.
    • Divide total sum by total count for the mean.
  3. Mode:
    • Keep track of the index i with the highest count[i] as you iterate.
  4. Median:
    • The median is the middle value (or average of two middle values) in the sorted sample.
    • Since we only have counts, we can simulate walking through the sorted sample by accumulating counts until we reach the middle element(s).
    • If the total number of elements is odd, the median is the value at position (total_count // 2) + 1.
    • If even, the median is the average of the values at positions (total_count // 2) and (total_count // 2) + 1.
    • We walk through the count array, keeping a running total, and record the value(s) where the cumulative count reaches the needed positions.

This approach only requires a few passes through the count array and uses constant extra space.

Example Walkthrough

Let's consider a small example:

    count = [0, 2, 3, 0, 1, 0, ..., 0]
    # Only indices 1, 2, and 4 are non-zero
    # Sample: [1, 1, 2, 2, 2, 4]
  
  • Minimum: The first non-zero index is 1.
  • Maximum: The last non-zero index is 4.
  • Mean: (1*2 + 2*3 + 4*1) / 6 = (2 + 6 + 4) / 6 = 12 / 6 = 2.0
  • Mode: Index 2 has count 3, which is the highest. Mode is 2.
  • Median:
    • Total count is 6 (even).
    • Median is the average of the 3rd and 4th elements.
    • Walking through the counts:
      • At index 1: count = 2 (covers positions 1,2)
      • At index 2: count = 3 (covers positions 3,4,5)
    • So, both the 3rd and 4th elements are 2. Median is (2+2)/2 = 2.0

Time and Space Complexity

Brute-force approach:

  • Time: O(N), where N is the total number of elements in the sample. This is not practical for large N.
  • Space: O(N), since we'd have to reconstruct the entire sample array.
Optimized approach (using counts):
  • Time: O(256), since we scan the count array a constant number of times. 256 is a constant, so this is effectively O(1).
  • Space: O(1), only a few variables for sums and indices.

The optimized approach is extremely efficient and suitable for very large samples.

Summary

By leveraging the histogram representation in the count array, we can compute all required statistics with a few simple passes, without ever reconstructing the large sample. This approach is both time and space efficient, and illustrates the power of thinking in terms of counts and positions rather than individual elements. The most subtle part is finding the median, which requires careful tracking of cumulative counts, but all other statistics follow directly from the counts.

Code Implementation

class Solution:
    def sampleStats(self, count):
        # Find min and max
        min_val = None
        max_val = None
        total_count = 0
        total_sum = 0
        mode = 0
        max_freq = 0

        for i in range(256):
            if count[i]:
                if min_val is None:
                    min_val = i
                max_val = i
                if count[i] > max_freq:
                    max_freq = count[i]
                    mode = i
            total_count += count[i]
            total_sum += count[i] * i

        # Find median
        median = 0.0
        left = (total_count + 1) // 2
        right = (total_count // 2) + 1
        cnt = 0
        m1 = None
        m2 = None
        for i in range(256):
            cnt += count[i]
            if m1 is None and cnt >= left:
                m1 = i
            if m2 is None and cnt >= right:
                m2 = i
                break
        median = (m1 + m2) / 2

        mean = total_sum / total_count
        return [float(min_val), float(max_val), mean, median, float(mode)]
      
class Solution {
public:
    vector<double> sampleStats(vector<int>& count) {
        int minVal = -1, maxVal = -1, mode = 0;
        long long total = 0, sum = 0;
        int maxFreq = 0;
        for (int i = 0; i < 256; ++i) {
            if (count[i]) {
                if (minVal == -1) minVal = i;
                maxVal = i;
                if (count[i] > maxFreq) {
                    maxFreq = count[i];
                    mode = i;
                }
            }
            total += count[i];
            sum += (long long)count[i] * i;
        }
        // Median
        double median = 0.0;
        int left = (total + 1) / 2, right = (total / 2) + 1;
        int cnt = 0;
        int m1 = -1, m2 = -1;
        for (int i = 0; i < 256; ++i) {
            cnt += count[i];
            if (m1 == -1 && cnt >= left) m1 = i;
            if (m2 == -1 && cnt >= right) {
                m2 = i;
                break;
            }
        }
        median = (m1 + m2) / 2.0;
        double mean = (double)sum / total;
        return { (double)minVal, (double)maxVal, mean, median, (double)mode };
    }
};
      
class Solution {
    public double[] sampleStats(int[] count) {
        int min = -1, max = -1, mode = 0;
        long total = 0, sum = 0;
        int maxFreq = 0;
        for (int i = 0; i < 256; i++) {
            if (count[i] > 0) {
                if (min == -1) min = i;
                max = i;
                if (count[i] > maxFreq) {
                    maxFreq = count[i];
                    mode = i;
                }
            }
            total += count[i];
            sum += (long)count[i] * i;
        }
        // Median
        double median = 0.0;
        int left = (int)((total + 1) / 2);
        int right = (int)(total / 2 + 1);
        int cnt = 0, m1 = -1, m2 = -1;
        for (int i = 0; i < 256; i++) {
            cnt += count[i];
            if (m1 == -1 && cnt >= left) m1 = i;
            if (m2 == -1 && cnt >= right) {
                m2 = i;
                break;
            }
        }
        median = (m1 + m2) / 2.0;
        double mean = (double)sum / total;
        return new double[]{(double)min, (double)max, mean, median, (double)mode};
    }
}
      
var sampleStats = function(count) {
    let min = null, max = null, mode = 0, maxFreq = 0;
    let total = 0, sum = 0;
    for (let i = 0; i < 256; i++) {
        if (count[i]) {
            if (min === null) min = i;
            max = i;
            if (count[i] > maxFreq) {
                maxFreq = count[i];
                mode = i;
            }
        }
        total += count[i];
        sum += count[i] * i;
    }
    // Median
    let left = Math.floor((total + 1) / 2);
    let right = Math.floor(total / 2) + 1;
    let cnt = 0, m1 = null, m2 = null;
    for (let i = 0; i < 256; i++) {
        cnt += count[i];
        if (m1 === null && cnt >= left) m1 = i;
        if (m2 === null && cnt >= right) {
            m2 = i;
            break;
        }
    }
    let median = (m1 + m2) / 2;
    let mean = sum / total;
    return [min * 1.0, max * 1.0, mean, median, mode * 1.0];
};