Simple sorted string compression

AS
Ackshaey Singh
Difficulty Medium
Time Complexity
O(n)
Space Complexity
O(n)
String
Microsoft

You're sitting in your Microsoft interview and the interviewer slides a string problem across the table. "Given a sorted string, compress it by collapsing repeated characters into the character followed by its count." At first glance it feels straightforward, but the details matter: when do you append the count? How do you detect group boundaries? And can you do it in a single pass? This problem, also known as "String Compression" on other platforms, tests your ability to handle string manipulation cleanly and efficiently.

TL;DR

Scan the sorted string left to right with a counter that tracks consecutive identical characters. When the current character differs from the next (or you reach the end), append the character to a StringBuilder. If the count exceeds 1, also append the count. Reset the counter and continue. This runs in O(n) time with a single pass and uses O(n) space for the output.

Why This Problem Matters

String compression shows up regularly in interviews because it tests several core skills at once: loop boundary handling, conditional logic, and efficient string building. It is a building block for more advanced topics like run-length encoding, data serialization, and text processing pipelines. Getting comfortable with single-pass string algorithms pays off across a wide range of interview questions.

Understanding the Problem

Let's pin down exactly what we need to do.

Given: A sorted string (all identical characters are adjacent) Task: Replace runs of 2+ identical characters with the character followed by the count Return: The compressed string Key detail: Single occurrences are left as-is (no "a1", just "a")

Here are the examples from the problem:

compress("AAAAaaabbbbbcccc") --> "A4a3b5c4"
compress("aabbbbccc")       --> "a2b4c3"
compress("abc")             --> "abc"

Notice in the third example, no character repeats, so the output matches the input. The compression only helps when characters appear 3 or more times, since "aa" and "a2" consume the same space.

Visualizing Character Groups

The sorted input guarantees that identical characters cluster together. For the input "aabbbcddddeeeeef", the groups look like this:

Loading visualization...

Each group becomes either the character alone (if it appears once) or the character followed by its count. The output for this input is "a2b3cd4e5f".

Edge Cases to Consider

  1. Empty string: Return an empty string
  2. Single character: Return it unchanged
  3. No repeats ("abc"): Return the input as-is
  4. All same characters ("aaaa"): Return character + count ("a4")
  5. Case sensitivity: "A" and "a" are different characters

Solution Approach

Since the string is already sorted, we can solve this in a single left-to-right scan. The core idea is to count how many times each character repeats consecutively, then decide what to write to the output.

The Counting Strategy

Walk through the string one character at a time. Keep a counter (we'll call it accumulator) that tracks the current run length. At each position, ask two questions:

  1. Have I reached the end of the string?
  2. Is the next character different from the current one?

If either answer is yes, we've found a group boundary. Write the character to the output, and if the accumulator is greater than 1, also write the count. Then reset the accumulator for the next group.

The Decision at Each Boundary

Loading visualization...

When count == 1, we skip the number entirely. This is the detail that trips up many candidates. Appending "a1" would actually expand the string rather than compress it.

Building the Algorithm Step by Step

Let's trace through "aabbbcddddeeeeef" to see how the accumulator and output evolve:

Loading visualization...

At each group boundary (where the current character differs from the next), we flush the accumulated count to the output and start fresh. The full trace produces "a2b3cd4e5f".

Implementation

Prefer a different language? Jump to solutions in other languages.

Here's the Java solution with the single-pass approach:

public class Solution {
  public String compress(String input) {
    StringBuilder sb = new StringBuilder();
    int accumulator = 0;

    for (int i = 0; i < input.length(); i++) {
      accumulator++;

      if (i + 1 == input.length() || input.charAt(i) != input.charAt(i + 1)) {
        sb.append(input.charAt(i));
        if (accumulator > 1) {
          sb.append(accumulator);
        }
        accumulator = 0;
      }
    }

    return sb.toString();
  }
}

Let's walk through the key decisions in this code:

  1. StringBuilder over string concatenation: Java strings are immutable. Each += would create a new String object, turning our O(n) algorithm into O(n^2). StringBuilder's internal buffer avoids this.

  2. Boundary detection with i + 1 == input.length(): This handles the last character in the string. Without this check, comparing input.charAt(i) with input.charAt(i + 1) would throw an IndexOutOfBoundsException.

  3. Conditional count appending: The if (accumulator > 1) guard is what makes this "simple" compression work correctly. A single occurrence just gets the character, not "a1".

  4. Reset after flush: Setting accumulator = 0 (rather than 1) works because the loop immediately increments it at the top of the next iteration.

Dry Run with "abc"

icharaccumulatorboundary?output
0a1yes (a != b)"a"
1b1yes (b != c)"ab"
2c1yes (end)"abc"

No counts appended, so "abc" passes through unchanged.

Dry Run with "aabbbbccc"

icharaccumulatorboundary?output
0a1no""
1a2yes"a2"
2b1no"a2"
3b2no"a2"
4b3no"a2"
5b4yes"a2b4"
6c1no"a2b4"
7c2no"a2b4"
8c3yes (end)"a2b4c3"

Complexity Analysis

Time Complexity: O(n)

We visit each character exactly once. The operations inside the loop (incrementing, comparing, appending) are all O(1). StringBuilder's append method runs in amortized O(1) time because it doubles its internal buffer when needed, similar to how ArrayList grows.

Space Complexity: O(n)

The StringBuilder stores the output string, which in the worst case (no repeats) is the same length as the input. In the best case (all identical characters), the output is O(log n) characters (the character plus the digits of the count), but we report worst-case space as O(n).

Why Not Constant Space?

The problem description mentions "shoot for constant space complexity." This is achievable if you're allowed to modify the input string in place (or use a character array), but in Java, strings are immutable. The StringBuilder is the minimum overhead needed to produce the output.

Common Pitfalls

  1. Off-by-one on the boundary check: Forgetting i + 1 == input.length() causes an ArrayIndexOutOfBoundsException on the last character. Always handle the end-of-string case.

  2. Appending count for single characters: Writing "a1" instead of "a" when the character appears only once. The accumulator > 1 guard prevents this.

  3. Using string concatenation in a loop: This silently degrades performance to O(n^2). In an interview, mentioning why you chose StringBuilder demonstrates awareness of performance characteristics.

  4. Forgetting to reset the counter: If you skip accumulator = 0, the counts will accumulate across groups, producing wildly wrong results.

  5. Ignoring case sensitivity: 'A' and 'a' have different ASCII values and must be treated as separate characters. The sorted input places uppercase before lowercase, so "AAAaaa" has two distinct groups.

Interview Tips

When you encounter this problem in a Microsoft or similar interview:

  1. Clarify the constraints upfront: "Is the string already sorted? Should I handle empty input? Is case significant?" These questions show thoroughness.

  2. Start with examples: Walk through "aabbbcddddeeeeef" on the whiteboard before writing any code. Identify the groups visually.

  3. Mention the StringBuilder choice: Briefly explain why you're not using += with strings. This is a small detail that separates experienced Java developers from beginners.

  4. Test your boundary condition: The i + 1 == input.length() check is the most common source of bugs. Trace through the last character explicitly.

  5. Discuss variations: If time allows, mention how you'd handle unsorted strings (sort first or use a frequency map) or how standard run-length encoding differs (always appends the count).

Related Variations

This problem opens the door to several follow-up questions an interviewer might ask:

  • What if the string isn't sorted? You'd need a HashMap to track character frequencies, then iterate through the map. The output order might differ.
  • What if you need to decompress too? Write the inverse function that reads character-count pairs and expands them back.
  • What about run-length encoding for binary data? Same concept, but applied to byte streams where runs can be very long.

Solutions in Other Languages

Python

class Solution:
    def compress(self, input_str: str) -> str:
        builder = ""
        accumulator = 0

        for i in range(len(input_str)):
            accumulator += 1
            if i + 1 == len(input_str) or input_str[i] != input_str[i + 1]:
                builder += input_str[i]
                if accumulator > 1:
                    builder += str(accumulator)
                accumulator = 0

        return builder

Python strings are also immutable, so repeated concatenation is technically O(n^2). For interview purposes this is acceptable, but in production code you might use a list and ''.join() at the end.

JavaScript

class Solution {
  compress(input) {
    let builder = "";
    let accumulator = 0;

    for (let i = 0; i < input.length; i++) {
      accumulator += 1;
      if (i + 1 === input.length || input[i] !== input[i + 1]) {
        builder += input[i];
        if (accumulator > 1) {
          builder += accumulator;
        }
        accumulator = 0;
      }
    }

    return builder;
  }
}

TypeScript

class Solution {
  compress(input: string): string {
    let builder = "";
    let accumulator = 0;

    for (let i = 0; i < input.length; i++) {
      accumulator += 1;
      if (i + 1 === input.length || input[i] !== input[i + 1]) {
        builder += input[i];
        if (accumulator > 1) {
          builder += accumulator;
        }
        accumulator = 0;
      }
    }

    return builder;
  }
}

C++

#include <string>

class Solution {
public:
  std::string compress(std::string input) {
    std::string builder;
    int accumulator = 0;

    for (int i = 0; i < input.length(); i++) {
      accumulator++;
      if (i + 1 == input.length() || input.at(i) != input.at(i + 1)) {
        builder.push_back(input.at(i));
        if (accumulator > 1) {
          builder.append(std::to_string(accumulator));
        }
        accumulator = 0;
      }
    }

    return builder;
  }
};

Go

import (
  "strconv"
  "strings"
)

func Compress(input string) string {
  var builder strings.Builder
  accumulator := 0

  for i := range input {
    accumulator++
    if i+1 == len(input) || input[i] != input[i+1] {
      builder.WriteByte(input[i])
      if accumulator > 1 {
        builder.WriteString(strconv.Itoa(accumulator))
      }
      accumulator = 0
    }
  }

  return builder.String()
}

Go's strings.Builder is the idiomatic choice for efficient string construction, similar to Java's StringBuilder.

Scala

import scala.collection.mutable

class Solution {
  def compress(input: String): String = {
    val sb = new mutable.StringBuilder()
    var accumulator = 0

    for (i <- 0 until input.length) {
      accumulator += 1
      if (i + 1 == input.length || input(i) != input(i + 1)) {
        sb.append(input(i))
        if (accumulator > 1) {
          sb.append(accumulator)
        }
        accumulator = 0
      }
    }

    sb.toString
  }
}

Kotlin

class Solution {
    fun compress(input: String): String {
        val sb = StringBuilder()
        var accumulator = 0

        for (i in 0 until input.length) {
            accumulator++
            if (i + 1 == input.length || input[i] != input[i + 1]) {
                sb.append(input[i])
                if (accumulator > 1) {
                    sb.append(accumulator)
                }
                accumulator = 0
            }
        }

        return sb.toString()
    }
}

Ruby

class Solution
  def compress(input)
    result = ""
    accumulator = 0

    (0...input.length).each do |i|
      accumulator += 1
      if i + 1 == input.length || input[i] != input[i + 1]
        result += input[i]
        result += accumulator.to_s if accumulator > 1
        accumulator = 0
      end
    end

    result
  end
end

Rust

pub fn compress(input: String) -> String {
    let mut result = String::new();
    let mut accumulator = 0;
    let chars: Vec<char> = input.chars().collect();

    for i in 0..chars.len() {
        accumulator += 1;
        if i + 1 == chars.len() || chars[i] != chars[i + 1] {
            result.push(chars[i]);
            if accumulator > 1 {
                result.push_str(&accumulator.to_string());
            }
            accumulator = 0;
        }
    }

    result
}

Rust requires converting the string to a Vec<char> for indexed access since Rust strings are UTF-8 encoded and don't support O(1) indexing natively.

Swift

func compress(_ input: String) -> String {
    let chars = Array(input)
    var result = ""
    var accumulator = 0

    for i in 0..<chars.count {
        accumulator += 1
        if i + 1 == chars.count || chars[i] != chars[i + 1] {
            result.append(chars[i])
            if accumulator > 1 {
                result += "\(accumulator)"
            }
            accumulator = 0
        }
    }

    return result
}

Dart

class Solution {
  String compress(String input) {
    StringBuffer sb = StringBuffer();
    int accumulator = 0;

    for (int i = 0; i < input.length; i++) {
      accumulator++;
      if (i + 1 == input.length || input[i] != input[i + 1]) {
        sb.write(input[i]);
        if (accumulator > 1) {
          sb.write(accumulator);
        }
        accumulator = 0;
      }
    }

    return sb.toString();
  }
}

C#

using System.Text;

public class Solution {
    public string compress(string input) {
        var sb = new StringBuilder();
        int accumulator = 0;

        for (int i = 0; i < input.Length; i++) {
            accumulator++;
            if (i + 1 == input.Length || input[i] != input[i + 1]) {
                sb.Append(input[i]);
                if (accumulator > 1) {
                    sb.Append(accumulator);
                }
                accumulator = 0;
            }
        }

        return sb.ToString();
    }
}

PHP

class Solution {
    public function compress(string $input): string {
        $result = "";
        $accumulator = 0;

        for ($i = 0; $i < strlen($input); $i++) {
            $accumulator++;
            if ($i + 1 === strlen($input) || $input[$i] !== $input[$i + 1]) {
                $result .= $input[$i];
                if ($accumulator > 1) {
                    $result .= $accumulator;
                }
                $accumulator = 0;
            }
        }

        return $result;
    }
}

Practice Makes Perfect

Once you're comfortable with sorted string compression, try these related problems to strengthen your string manipulation skills:

  • Find the first non-duplicate character in a string
  • Check if two strings are anagrams
  • Reverse an integer (similar boundary-handling patterns)

This problem and thousands of others are available on Firecode, where over 50,000 users have prepared for technical interviews and landed roles at top tech companies. Whether you're targeting Microsoft or your next big opportunity, building a habit of daily practice on fundamentals like this will make the difference.

Frequently Asked Questions

What is the time complexity of sorted string compression?
The time complexity is O(n) where n is the length of the input string. Each character is visited exactly once in a single pass. The algorithm tracks consecutive identical characters using a counter and appends to a StringBuilder when a group boundary is detected.
What is the space complexity of sorted string compression?
The space complexity is O(n) because the output string can be at most the same length as the input. In the worst case where no character repeats, the compressed string equals the original. A StringBuilder is used to build the result efficiently.
Why does the problem specify the string is already sorted?
A sorted string guarantees that all identical characters are adjacent. This allows a single-pass algorithm that counts consecutive duplicates without needing a hash map or extra data structure. Without sorting, you would need to either sort first or use a frequency map.
When does this compression actually save space?
The compression only saves space when a character repeats 3 or more times. A single character like 'a' becomes 'a' (no change). Two characters 'aa' becomes 'a2' (same length). Three characters 'aaa' becomes 'a3' (saves 1 character). The problem description notes that 'a1' takes more space than 'a', which is why single occurrences skip the count.
How does this differ from run-length encoding?
This is a simplified variant of run-length encoding (RLE). Standard RLE always emits the count after each character, producing pairs like 'a1b3c2'. This problem's version omits the count when it equals 1, producing 'ab3c2' instead. The sorted input constraint also simplifies the algorithm since identical characters are guaranteed to be adjacent.
How should case sensitivity be handled?
Uppercase and lowercase letters are treated as distinct characters. Since the input is sorted, uppercase letters (which have lower ASCII values) appear before their lowercase counterparts. For example, 'AAAaaa' compresses to 'A3a3' because 'A' and 'a' are different characters.
What edge cases should you handle in an interview?
Handle the empty string (return empty), a single character (return as-is), a string with no repeats (return unchanged), and numeric characters in the input. The algorithm naturally handles all these cases because the counter only appends when greater than 1, and boundary detection works correctly at string boundaries.
Why use StringBuilder instead of string concatenation in Java?
Java strings are immutable, so each concatenation creates a new String object. For a string of length n, repeated concatenation produces O(n^2) time complexity due to copying. StringBuilder maintains a mutable character buffer and appends in amortized O(1) time, keeping the overall algorithm at O(n).