Sorting Algorithms Lab: From Bubble Sort to Timsort
Learning Objectives
- ✓ Implement bubble sort, merge sort, and quicksort from scratch and empirically measure their performance differences across random, sorted, reverse-sorted, and nearly-sorted input patterns
- ✓ Explain why Timsort is Python's default sorting algorithm by identifying how it detects and exploits natural runs in real-world data
- ✓ Apply custom key functions using lambda, operator.attrgetter, and tuple keys to sort complex objects by multiple criteria in a single pass
- ✓ Evaluate the trade-offs between sorting algorithms in terms of time complexity, space complexity, stability, and cache performance
- ✓ Integrate multi-key sorting into a production-style leaderboard system using Python's sort stability guarantees
Last lesson, we broke recursion into a mechanical process: base case, reduction step, trust the function. We also learned the hard way that Python's 1000-frame recursion limit is a real constraint, not a suggestion. Today, we are going to lean on that divide-and-conquer pattern to build merge sort and quicksort from scratch. But first, let me show you something that made me mass-revert a commit at 2 AM.
Look at these two snippets. Both attempt to sort a list of 10,000 integers that happens to already be sorted.
Snippet A:
import sys
sys.setrecursionlimit(20000)
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]  # Always pick the first element
    left = [x for x in arr[1:] if x <= pivot]
    right = [x for x in arr[1:] if x > pivot]
    return quicksort(left) + [pivot] + quicksort(right)

data = list(range(10000))
result = quicksort(data)  # Blows past the default 1000-frame limit; even raised, it is an O(n^2) crawl that can crash the interpreter
Snippet B:
data = list(range(10000))
result = sorted(data) # Returns instantly
Snippet A crashes. Snippet B finishes in microseconds. Both are "sorting." Why does the first one explode? If you remember our discussion of the call stack from Lesson 5, you already have part of the answer. Picking the first element as the pivot on already-sorted data gives you partitions of size 0 and n-1 every single time. That is not divide-and-conquer. That is divide-and-do-nothing. You get n recursive calls, each adding a frame to the stack, and Python's default limit says "no" at frame 1000. Forcing the limit higher does not save you: you trade the clean RecursionError for an O(n squared) crawl and, on many platforms, a hard interpreter crash when the C stack overflows under ~10,000 nested frames. Snippet B uses Timsort, Python's built-in hybrid algorithm, which is optimized specifically for this kind of pre-sorted input. By the end of this lab, you will understand every layer of that difference.
💡 Key Insight: A sorting algorithm's performance on real-world data matters more than its theoretical average case. Real data is rarely random. It has patterns, and the best algorithms exploit them.
The Quadratic Trap: Why Simple Sorts Hurt at Scale
Bubble sort and insertion sort are the first algorithms most people learn, and they are also the first algorithms most people should stop using. I say this not to be dismissive. These algorithms are beautiful in their simplicity, and understanding them deeply teaches you to reason about algorithmic complexity. But I have seen production systems brought to their knees because someone thought "it is just a small list" and did not anticipate that list growing.
Bubble sort works by repeatedly comparing adjacent pairs and swapping them if they are out of order. You walk through the list from left to right, and after one full pass, the largest element has "bubbled" to the end. After two passes, the two largest are in place. After n passes, everything is sorted. The cost compounds fast: for a list of n elements, you might need up to n passes, each examining up to n elements. That is O(n squared) comparisons and swaps in the worst case. On 100 elements, that is 10,000 operations. Tolerable. On 10,000 elements, that is 100 million. Not tolerable.
Insertion sort is the smarter cousin. Instead of blindly swapping neighbors, it maintains a sorted portion at the front of the list and inserts each new element into its correct position within that sorted portion. Think of it like sorting a hand of playing cards: you pick up one card at a time and slide it into the right spot among the cards you are already holding. The worst case is still O(n squared), when the input is in reverse order, because each insertion requires shifting all previously sorted elements. But here is the thing that matters in practice: insertion sort is extremely fast on nearly-sorted data. If the list only has a few elements out of place, each insertion only shifts a couple of elements, and the whole thing runs in nearly O(n) time.
⚠️ Common Pitfall: Students often dismiss insertion sort entirely after learning merge sort. Do not make that mistake. Timsort, the algorithm Python actually uses, relies on insertion sort for small sub-arrays because its low overhead beats merge sort when n is small (typically under 32-64 elements).
Keep this image in mind. Imagine you are organizing books on a shelf. Bubble sort is like going back and forth along the shelf, swapping any two adjacent books that are out of order, over and over until nothing moves. Insertion sort is like taking each book off the shelf one at a time and putting it back in the right spot. Both get the job done on a small bookshelf. Neither scales to a library.
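The hand-of-cards description maps directly to code. Here is a minimal sketch (the name insertion_sort_counting is my own, added for illustration) that also counts element shifts, so you can watch the near-linear behavior on nearly-sorted input against the quadratic wall on reversed input:

```python
def insertion_sort_counting(arr):
    """Insertion sort that counts element shifts -- a proxy for total work done."""
    shifts = 0
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]  # shift right to make room
            j -= 1
            shifts += 1
        arr[j + 1] = key  # insert into its correct position
    return shifts

nearly_sorted = list(range(1000))
nearly_sorted[10], nearly_sorted[900] = nearly_sorted[900], nearly_sorted[10]
print(insertion_sort_counting(nearly_sorted))             # ~1,800 shifts: near-linear
print(insertion_sort_counting(list(range(1000, 0, -1))))  # 499500 shifts: quadratic
```

Two out-of-place elements cost a couple thousand shifts; a fully reversed list costs n(n-1)/2, the full quadratic price.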
| Algorithm | Best Case | Average Case | Worst Case | Space | Stable? | When It Shines |
|---|---|---|---|---|---|---|
| Bubble Sort | O(n) | O(n^2) | O(n^2) | O(1) | Yes | Teaching tool only |
| Insertion Sort | O(n) | O(n^2) | O(n^2) | O(1) | Yes | Small n, nearly sorted data |
The "Stable" column deserves attention. A stable sort preserves the relative order of elements that compare as equal. If you have two users with the same activity score, a stable sort keeps them in whatever order they were in before sorting. This connects directly to the dict ordering guarantee we discussed in Lesson 2. Remember how Python 3.7 made dictionary insertion order an official guarantee? Stability in sorting is the same idea applied to sequences. It turns out to be critical for multi-key sorting, which we will use in our NetProbe project later.
🤔 Think about it: If insertion sort is O(n) on nearly-sorted data, and bubble sort is also O(n) on already-sorted data (with an early termination flag), why do we prefer insertion sort?
View Answer
Because "nearly sorted" and "already sorted" are different things. Bubble sort's O(n) best case requires zero swaps on a complete pass, meaning the data is perfectly sorted. Insertion sort's near-linear behavior kicks in whenever elements are close to their final positions, even if they are not perfectly placed. In practice, data is "mostly sorted" far more often than "perfectly sorted." Insertion sort also has better cache locality because it shifts elements in a contiguous block rather than hopping around with swaps.
Quadratic time has a hard ceiling. Now let us build an algorithm that breaks through it.
Merge Sort: Divide-and-Conquer with a Guarantee
Merge sort is the algorithm that convinced me recursion is not just an academic exercise. In Lesson 5, we learned to trust the recursive function by focusing on the base case and the reduction step. Merge sort is that pattern in its purest form: split the list in half, recursively sort each half, then merge the two sorted halves back together. The base case is a list of length 0 or 1 (already sorted by definition). The reduction step is the split. The "trust the function" part is believing that sorting half the data recursively will actually work.
The payoff is O(n log n), no matter the input. Whether random, sorted, reverse-sorted, or adversarially constructed, merge sort never deviates from that bound. The "log n" comes from the depth of the recursion tree. Each level of recursion halves the problem, so you get log2(n) levels. At each level, the merge step touches every element exactly once, contributing O(n) work. Multiply those together: O(n log n). For 10,000 elements, that is roughly 130,000 operations instead of bubble sort's 100 million. That is not a small improvement. That is the difference between your API responding in milliseconds and timing out.
The trade-off is space. Every time you merge two halves, you need a temporary array to hold the merged result. This means merge sort uses O(n) additional memory. On a machine with 16 GB of RAM sorting a list of a million integers, that is barely noticeable. On an embedded device sorting sensor readings with 256 KB of RAM, it matters. At Google, I once had to switch from merge sort to an in-place variant for a data pipeline that was processing tens of millions of records on machines with tight memory budgets. The O(n) space overhead was actually the bottleneck, not the time complexity.
Let me show you the core merge operation. This is where the real work happens:
Merging two sorted lists into one sorted list:
def merge(left, right):
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= preserves stability
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result
Notice the <= in the comparison, not <. This is what makes merge sort stable. When two elements are equal, we take from the left half first, preserving their original relative order. Change that to < and you have lost stability. I have seen this bug in interview code more times than I can count.
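You can see the stability difference directly by merging (key, label) pairs and comparing keys only. This is an illustrative sketch; merge_by_key and its stable flag are names I am introducing here, not part of the lesson's merge:

```python
def merge_by_key(left, right, stable=True):
    """Merge two lists of (key, label) pairs, comparing keys only.

    stable=True takes from the left on ties (the <= behavior);
    stable=False reproduces the < bug described above.
    """
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        ties_go_left = left[i][0] <= right[j][0] if stable else left[i][0] < right[j][0]
        if ties_go_left:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result

left = [(1, "left-1"), (3, "left-3")]
right = [(1, "right-1"), (2, "right-2")]
print(merge_by_key(left, right, stable=True))
# [(1, 'left-1'), (1, 'right-1'), (2, 'right-2'), (3, 'left-3')]  -- left wins ties
print(merge_by_key(left, right, stable=False))
# [(1, 'right-1'), (1, 'left-1'), (2, 'right-2'), (3, 'left-3')]  -- tie order flipped
```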
💡 Key Insight: The merge step is O(n) because each element is touched exactly once. Combined with log n levels of recursion, you get the O(n log n) guarantee. This is provably optimal for comparison-based sorting. You literally cannot do better.
Deep Dive: Why O(n log n) is the theoretical floor
Any comparison-based sorting algorithm must be able to distinguish between all n! (factorial) possible orderings of n elements. Each comparison gives you one bit of information (yes or no), so you need at least log2(n!) comparisons. By Stirling's approximation, log2(n!) is approximately n log2(n). This means no comparison-based sort can beat O(n log n) in the worst case. Non-comparison sorts like counting sort and radix sort can beat this bound, but they require specific constraints on the input (like integer keys within a known range). Python's Timsort, being comparison-based, is bounded by this same floor, but it gets very close to it in practice.
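If you want to sanity-check the Stirling claim numerically, math.lgamma gives ln(n!) without ever computing the giant factorial:

```python
import math

# log2(n!) is the information-theoretic lower bound on comparisons;
# math.lgamma(n + 1) == ln(n!), so divide by ln(2) to convert to base 2.
for n in (10, 100, 10_000, 1_000_000):
    lower_bound = math.lgamma(n + 1) / math.log(2)
    print(f"n={n:>9}  log2(n!) ~ {lower_bound:,.0f}   n*log2(n) = {n * math.log2(n):,.0f}")
```

The two columns converge as n grows, which is exactly what Stirling's approximation predicts.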
Picture this: you are organizing a messy deck of cards by splitting it in half, handing each half to a friend, and telling them "sort your half." They each do the same thing recursively. Eventually, everyone has a pile of one card, trivially sorted. Then you merge piles back together: compare the top card of each pile, take the smaller one. That is exactly what the algorithm does, and the analogy holds perfectly. Where it breaks down is space. In the card analogy, you need table space for the merge. In the algorithm, you need memory.
Remember in Lesson 3 when we used deques as bounded queues? The merge operation has the same feel: you are pulling elements from the front of two sequences in order. If you squint, the merge step is a two-queue merge into a single output stream. That connection between queues and merge sort is not a coincidence. Merge-based algorithms appear everywhere in systems that process ordered streams of data, from database query plans to external sorting of files too large to fit in memory.
🤔 Think about it: Merge sort creates new lists at every level of recursion. If you are sorting a list of 1 million elements, roughly how many total elements are allocated across all temporary merge buffers?
View Answer
At each level of recursion, the total number of elements across all merge buffers is n (one million). There are log2(n) levels, so naively you might think the total allocation is n * log(n). But in practice, since Python's garbage collector reclaims temporary lists after each merge returns, the peak additional memory usage at any point is O(n), not O(n log n). The buffers from deeper levels are freed before the top-level merge allocates its buffer.
Quicksort: Brilliance with a Dark Side
Quicksort is the algorithm I have the most complicated relationship with. On average, it is the fastest comparison-based sort in practice. Its O(n log n) average case has small constant factors because it sorts in-place (no extra arrays) and has excellent cache locality. But its O(n squared) worst case is not theoretical. It will bite you in production, as we saw in the opening snippet.
The algorithm works by choosing a pivot element, then partitioning the array into elements less than the pivot and elements greater than the pivot. The pivot ends up in its correct final position, and you recursively sort the two partitions. When the pivot is chosen well (near the median), the partitions are roughly equal, and you get O(n log n) behavior. When the pivot is chosen poorly (the smallest or largest element), one partition has n-1 elements and the other has 0, giving you O(n squared) behavior identical to bubble sort.
Pivot selection is everything. The naive approach of always picking the first element (like our crashing snippet A) fails catastrophically on sorted or nearly-sorted input, which is extremely common in practice. Think about it: database query results are often ordered by ID or timestamp. Log files are ordered chronologically. User lists come back from APIs sorted alphabetically. "Already sorted" is one of the most common input patterns in real software, and it is quicksort's kryptonite with a naive pivot.
There are three main pivot strategies worth knowing:
| Strategy | How It Works | Worst Case Trigger | Used By |
|---|---|---|---|
| First element | Pick arr[0] | Sorted/reverse input | Nobody serious |
| Random | Pick random index | Astronomically unlikely | Many implementations |
| Median-of-three | Median of first, middle, last | Rare, adversarial | C stdlib qsort |
I strongly prefer median-of-three for teaching and random for production. Median-of-three is deterministic and easy to reason about: you look at the first, middle, and last elements, and use the median of those three as your pivot. This avoids the sorted-input disaster because the middle element of sorted data is a good pivot. Random pivot selection is even more robust because it makes adversarial inputs impossible without knowing your random seed. Java's Arrays.sort() uses dual-pivot quicksort with a variant of this approach for primitive arrays.
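To make the strategy concrete, here is a median-of-three variant of the list-comprehension quicksort from the opening snippet. It is a teaching sketch, not production code; note that it sails through the sorted 10,000-element input that crashed Snippet A, because the middle element of sorted data splits the list evenly:

```python
def quicksort_mo3(arr):
    """Quicksort with a median-of-three pivot -- a teaching sketch.

    On sorted input the pivot is the midpoint, so recursion depth
    stays around log2(n) instead of n.
    """
    if len(arr) <= 1:
        return arr
    first, mid, last = arr[0], arr[len(arr) // 2], arr[-1]
    pivot = sorted([first, mid, last])[1]  # median of the three candidates
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]  # group equals: helps with duplicates
    right = [x for x in arr if x > pivot]
    return quicksort_mo3(left) + middle + quicksort_mo3(right)

data = list(range(10_000))                 # the input that crashed Snippet A
assert quicksort_mo3(data) == data         # completes without RecursionError
assert quicksort_mo3([3, 1, 2, 3, 1]) == [1, 1, 2, 3, 3]
```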
⚠️ Common Pitfall: Even with good pivot selection, quicksort can still degrade on inputs with many duplicate values. The classic Lomuto partition scheme puts all elements equal to the pivot on one side, creating unbalanced partitions. The three-way partition (Dutch National Flag) variant solves this by grouping equals together. If you ever implement quicksort for production use, handle duplicates.
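For the curious, here is a sketch of that three-way (Dutch National Flag) partition. The name partition3 is my own; this is the in-place building block, not a full quicksort:

```python
def partition3(arr, lo, hi):
    """Dutch National Flag partition of arr[lo:hi+1] around arr[lo] as pivot.

    Returns (lt, gt): arr[lt:gt+1] holds values equal to the pivot,
    arr[lo:lt] holds smaller values, arr[gt+1:hi+1] holds larger ones.
    """
    pivot = arr[lo]
    lt, i, gt = lo, lo, hi
    while i <= gt:
        if arr[i] < pivot:
            arr[lt], arr[i] = arr[i], arr[lt]
            lt += 1
            i += 1
        elif arr[i] > pivot:
            arr[i], arr[gt] = arr[gt], arr[i]
            gt -= 1  # do not advance i: the swapped-in value is unexamined
        else:
            i += 1
    return lt, gt

data = [2, 1, 2, 0, 2, 3, 2]
lt, gt = partition3(data, 0, len(data) - 1)
print(data, lt, gt)  # all four 2s end up in one contiguous block data[lt:gt+1]
```

Because every duplicate of the pivot lands in the middle block, neither recursive call ever sees them again, which is what prevents the degradation on low-cardinality data.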
Let me be direct: unless you are working on a systems language runtime or an embedded system, you should never implement quicksort yourself for production use. Python's sorted() and list.sort() use Timsort, which beats naive quicksort on real-world data. Java uses dual-pivot quicksort for primitives and Timsort for objects. C++ uses Introsort (quicksort that falls back to heap sort if recursion depth gets too high). All of these are carefully engineered by people who have spent years optimizing them. Implement quicksort to understand it. Use the standard library for everything else.
Deep Dive: The Introsort trick - best of both worlds
Introsort, invented by David Musser in 1997 and used in C++'s std::sort, starts with quicksort but monitors the recursion depth. If the depth exceeds 2 * log2(n), it switches to heap sort, which guarantees O(n log n) worst case. This gives you quicksort's excellent average-case performance with a guaranteed worst-case bound. Python does not use introsort because Timsort is specifically designed for the kinds of data Python programs encounter: partially sorted sequences with existing "runs" of ordered elements. Different languages optimize for different usage patterns.
This connects back to something we built in Lesson 4. Remember the pain of pointer manipulation in linked lists? Quicksort is actually elegant on linked lists because partitioning only requires relinking nodes, not moving data. Merge sort, on the other hand, is harder on linked lists because finding the midpoint requires a full traversal (no random access). The choice between these algorithms depends on your data structure, not just your data.
🤔 Think about it: Why does quicksort have better cache performance than merge sort on arrays, despite both being O(n log n) on average?
View Answer
Quicksort sorts in-place, meaning it operates on the same contiguous block of memory throughout. The partition step scans through elements sequentially, which is cache-friendly. Merge sort allocates new temporary arrays at each level, which may land in different parts of memory, causing cache misses. On modern CPUs where cache misses are 100x more expensive than L1 cache hits, this constant-factor difference can make quicksort 2-3x faster in practice despite having the same big-O complexity.
Timsort: Why Python's Built-in Sort Is Smarter Than You Think
Timsort is the reason you should almost always use sorted() or list.sort() instead of rolling your own sort. It was invented by Tim Peters in 2002 specifically for CPython, and it has since been adopted by Java (for object arrays), Android, and Swift. That is an impressive resume for an algorithm that started as a CPython optimization.
The key insight behind Timsort is that real-world data is not random. It contains "runs": consecutive sequences of elements that are already in ascending or descending order. A database dump sorted by timestamp has one giant run. A list of transactions with occasional out-of-order entries has many long runs with short disruptions. Even "random" data tends to have short runs of 2-3 elements by chance. Timsort detects these natural runs and exploits them.
Here is how Timsort works at a high level. First, it scans the input looking for natural runs (ascending or strictly descending sequences). Descending runs are reversed in-place to become ascending. If a natural run is shorter than a minimum threshold (called "minrun," typically 32-64), Timsort extends it using insertion sort. Remember our earlier discussion about insertion sort being efficient on small inputs? This is exactly where it pays off. Once it has collected enough runs, it merges them with a carefully designed stack-based strategy that keeps the merges balanced.
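A rough way to see run structure is to count maximal non-decreasing runs yourself. This toy counter is far simpler than Timsort's actual run detection (no descending-run handling, no minrun extension), but it shows why sorted and clustered data is cheap:

```python
import random

def count_natural_runs(arr):
    """Count maximal non-decreasing runs -- a crude proxy for what Timsort detects."""
    if not arr:
        return 0
    runs = 1
    for prev, cur in zip(arr, arr[1:]):
        if cur < prev:  # a descent ends the current run
            runs += 1
    return runs

print(count_natural_runs(list(range(100))))         # 1: one giant run
print(count_natural_runs(list(range(100, 0, -1))))  # 100 by this crude metric
                                                    # (Timsort sees 1 descending run)
random.seed(0)
print(count_natural_runs(random.sample(range(100), 100)))  # many short runs
```

Fewer runs means fewer merges: one run means the sort is essentially a single O(n) scan.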
The result is stunning performance on partially sorted data. If the input is already sorted, Timsort detects one giant run and does nothing beyond the initial scan: O(n) time. If the input has a few sorted chunks, Timsort merges only those chunks: much faster than O(n log n). On truly random data, Timsort performs like a well-implemented merge sort: O(n log n) with good constants. It never degrades to O(n squared) like quicksort can.
| Input Pattern | Bubble Sort | Quicksort (random pivot) | Merge Sort | Timsort |
|---|---|---|---|---|
| Already sorted | O(n) with flag | O(n log n) | O(n log n) | O(n) |
| Reverse sorted | O(n^2) | O(n log n) | O(n log n) | O(n) |
| Random | O(n^2) | O(n log n) | O(n log n) | O(n log n) |
| Few unique values | O(n^2) | O(n^2) with bad partition | O(n log n) | O(n log n) |
| Nearly sorted | O(n^2) | O(n log n) | O(n log n) | O(n) |
Look at that table carefully. Timsort matches or beats every other algorithm on every input pattern. It is never worse than O(n log n), and on the patterns that appear most frequently in real software (sorted, nearly sorted, reverse sorted), it drops to O(n). This is why Tim Peters is a legend in the Python community, and this is why I will never write my own sort for production Python code.
💡 Key Insight: Timsort's genius is not in being faster than merge sort on random data. It is in being dramatically faster on the kinds of data that real programs actually process. Algorithm design is not just about worst-case bounds. It is about matching the algorithm to the distribution of inputs you will encounter.
Here is a concrete example of the difference this makes in production. In 2015, a team at Spotify was processing user listening history to generate year-end "Wrapped" statistics. Their initial pipeline used a custom merge sort to rank songs by play count per user. It worked fine in testing with random data, but production data had a pattern: most users' listening history was roughly chronological, with songs appearing in clusters (albums, playlists). When they switched to Python's built-in sorted() with a key function, the sorting step became 4x faster because Timsort detected the natural runs in chronological listening data. The lesson is simple: benchmark with realistic data, not random data.
What went wrong in the Spotify case?
The custom merge sort treated every input identically, always performing O(n log n) work regardless of existing order. Timsort's run detection meant it could skip most of the merging work when songs were already roughly grouped by listening session. The engineers had benchmarked their custom sort on shuffled test data, where it performed comparably to Timsort. But production data's natural ordering was a free optimization that only Timsort could exploit. The takeaway: always benchmark on data that looks like your actual production data.
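You can reproduce the shape of that result on your own machine. This quick benchmark (exact times vary by hardware; the list size here is my choice) times sorted() on the three input patterns from the table above:

```python
import random
import time

def time_sorted(data, label):
    """Time one call to sorted() on data and print the elapsed seconds."""
    start = time.perf_counter()
    sorted(data)
    print(f"{label:>15}: {time.perf_counter() - start:.4f}s")

n = 500_000
random.seed(42)
time_sorted(list(range(n)), "already sorted")         # one ascending run: ~O(n)
time_sorted(list(range(n, 0, -1)), "reverse sorted")  # one descending run, reversed
time_sorted([random.random() for _ in range(n)], "random")  # full O(n log n) merging
```

On most machines the sorted and reverse-sorted cases finish several times faster than the random case, even though all three are "the same" n.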
Custom Sorting: key Functions, Stability, and Multi-Key Ranking
Python's sorting functions accept a key parameter that transforms elements before comparison, and mastering this parameter is more valuable than implementing sort algorithms yourself. In production Python, you will write sorted(users, key=lambda u: u.score, reverse=True) a thousand times for every once you implement merge sort. The key function is called exactly once per element (not once per comparison), and the results are cached internally. This is called the "decorate-sort-undecorate" pattern, and it is far more efficient than the old cmp function approach from Python 2.
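A quick way to verify the once-per-element claim is a key function that counts its own calls. Sorting seven elements takes well over seven comparisons, yet the key runs exactly seven times:

```python
calls = 0

def loud_key(x):
    """Key function that records how many times it is invoked."""
    global calls
    calls += 1
    return -x  # negate to sort descending

data = [5, 3, 8, 1, 9, 2, 7]
result = sorted(data, key=loud_key)
print(result)  # [9, 8, 7, 5, 3, 2, 1]
print(calls)   # 7: one call per element, never one per comparison
```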
Stability is what makes multi-key sorting elegant. Since Timsort is stable (equal elements keep their original order), you can sort by multiple keys by sorting multiple times, from least significant key to most significant. Want to sort users by score descending, then by name ascending for ties? Sort by name first, then by score. The stable sort preserves the name ordering within each score group. This is the same stability property we discussed with merge sort earlier: the <= in the merge comparison.
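Here is that two-pass pattern in miniature, using (name, score) tuples:

```python
users = [("carol", 90), ("alice", 90), ("bob", 75), ("dave", 90)]

# Pass 1: least-significant key -- name ascending.
users.sort(key=lambda u: u[0])
# Pass 2: most-significant key -- score descending. Stability keeps
# the alphabetical order from pass 1 within each score group.
users.sort(key=lambda u: u[1], reverse=True)

print(users)
# [('alice', 90), ('carol', 90), ('dave', 90), ('bob', 75)]
```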
Multiple passes work, but tuple keys do it in one. Python compares tuples element by element, so key=lambda u: (-u.score, u.name) sorts by score descending (note the negation) and name ascending in a single pass. This is my preferred approach because it is explicit, efficient, and does not rely on understanding stability to get the right result. For attribute access, operator.attrgetter is faster than lambda because it avoids the overhead of creating a Python function object:
from operator import attrgetter
sorted(users, key=attrgetter('score')) # Faster than lambda u: u.score
Let me connect this back to hash tables from Lesson 2. When you use a dict to count occurrences and then sort by count, you are combining two data structures in a pattern I use almost daily. Count with a dict (O(n) to build), then sort the items (O(k log k) where k is the number of unique keys). This is faster than sorting the raw list and counting runs, and it is the foundation of the "Top K" pattern we will explore in Lesson 8 with heaps.
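That count-then-sort pattern looks like this, with collections.Counter standing in for a hand-rolled dict (the request-method data is invented for illustration):

```python
from collections import Counter

words = ["get", "post", "get", "put", "get", "post", "delete"]
counts = Counter(words)  # O(n) to build the frequency dict

# Sort the k unique keys: count descending, then name ascending for ties.
by_freq = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(by_freq)
# [('get', 3), ('post', 2), ('delete', 1), ('put', 1)]
```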
📌 Remember:
list.sort() sorts in-place and returns None. sorted() returns a new list and leaves the original unchanged. Mixing these up is a classic bug: my_list = my_list.sort() sets my_list to None.
Peer Teaching Prompt: Explain sort stability to a friend who has never heard of it, in exactly 3 sentences.
Model answer
A stable sort guarantees that when two elements compare as equal, they stay in the same relative order they were in before sorting. This matters when you sort data that has already been sorted by a different criterion, because a stable sort preserves that earlier ordering for tied elements. Python's built-in sort is stable, which means you can layer multiple sort criteria by sorting from least important to most important.
🤔 Think about it: You have a list of (name, department, salary) tuples. You want them sorted by department alphabetically, then by salary descending within each department. Write the sorted() call.
View Answer
sorted(employees, key=lambda e: (e[1], -e[2]))
The tuple key (e[1], -e[2]) sorts by department ascending (strings compare alphabetically by default) and salary descending (negating an integer reverses numeric order). This single-pass approach is cleaner and faster than sorting twice. Note: negation only works for numeric types. For strings, you would need to sort twice using stability, or use a custom comparison with functools.cmp_to_key.
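Since negation does not work on strings, here is a sketch of the cmp_to_key fallback for a string key that must sort descending (the employee data and function name are invented for illustration):

```python
from functools import cmp_to_key

employees = [("amy", "eng", 95), ("bob", "hr", 80), ("cal", "eng", 90), ("dan", "hr", 85)]

def dept_desc_salary_asc(a, b):
    """Department descending (strings cannot be negated), then salary ascending."""
    if a[1] != b[1]:
        return -1 if a[1] > b[1] else 1  # reverse the natural string order
    return a[2] - b[2]                   # negative/zero/positive as cmp expects

print(sorted(employees, key=cmp_to_key(dept_desc_salary_asc)))
# [('bob', 'hr', 80), ('dan', 'hr', 85), ('cal', 'eng', 90), ('amy', 'eng', 95)]
```

The equivalent stability-based version is two passes: sort by salary first, then by department with reverse=True.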
Challenge 1: Benchmark the Quadratic Wall (Guided)
Your task is to implement bubble sort and measure how long it takes on increasing input sizes. Form a hypothesis first: if bubble sort is O(n squared), doubling the input size should roughly quadruple the time. Let us verify that empirically.
Problem: Implement bubble_sort(arr) that sorts a list in-place. Then time it on lists of size 100, 1000, 5000, and 10000 random integers. Compare with Python's built-in sorted().
Hints:
- The outer loop runs n times. The inner loop compares adjacent pairs.
- Add an early_exit flag: if no swaps occur in a pass, the list is sorted.
- Use time.perf_counter() for precise timing, not time.time().
Step-by-step solution
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:
            break
The early exit check (if not swapped: break) makes bubble sort O(n) on already-sorted input. Without it, bubble sort always does O(n squared) work regardless of input. Time this on random data and watch the quadratic growth: 1000 elements might take 0.05s, but 10000 elements takes roughly 5s (100x slower for 10x more data).
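A minimal timing harness for the challenge might look like the following (bubble_sort is repeated so the snippet runs standalone; sizes are kept small here so it finishes quickly, and you can raise them toward 10,000 to watch the quadratic wall):

```python
import random
import time

def bubble_sort(arr):
    """Bubble sort with the early-exit flag, as in the solution above."""
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:
            break
    return arr

def benchmark(sort_fn, n):
    """Time sort_fn on n random integers with perf_counter; returns seconds."""
    data = [random.randint(0, 1_000_000) for _ in range(n)]
    start = time.perf_counter()
    sort_fn(data)
    return time.perf_counter() - start

random.seed(1)
for n in (100, 500, 1000, 2000):
    print(f"n={n:>5}  bubble={benchmark(bubble_sort, n):.4f}s  "
          f"sorted={benchmark(sorted, n):.4f}s")
```

Doubling n should roughly quadruple the bubble column while the sorted column barely moves: the hypothesis test in numbers.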
Challenge 2: Merge Sort from Scratch (Semi-Guided)
Implement merge sort using the divide-and-conquer pattern from Lesson 5. You need two functions: merge(left, right) that merges two sorted lists, and merge_sort(arr) that recursively splits and merges.
Approach hints:
- Base case: list of length 0 or 1.
- Split at the midpoint: mid = len(arr) // 2.
- The merge step must use <= (not <) to maintain stability.
- Think about what happens to the call stack. For 10,000 elements, recursion depth is about log2(10000), which is approximately 14. Well within Python's limit.
Full solution
def merge(left, right):
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    return merge(left, right)
Notice the recursion depth is O(log n), not O(n). This is why merge sort does not hit Python's recursion limit on reasonable inputs. For a million elements, the recursion depth is only about 20.
Challenge 3: The Leaderboard Problem (Independent)
You are building a user ranking system for a social platform. Users have a username, a connection_count, an activity_score, and a join_date. The leaderboard should rank users by activity score descending. For ties, rank by connection count descending. For double ties, rank by join date ascending (earlier joiners rank higher). Implement this using sorted() with a key function. No solution provided up front. This is the exact pattern you will use in the project update below.
Hint
Use a tuple key with negation for descending numeric sorts. Dates as strings in ISO format (YYYY-MM-DD) sort correctly in ascending order without conversion.
Solution
class User:
    def __init__(self, username, connections, activity, join_date):
        self.username = username
        self.connection_count = connections
        self.activity_score = activity
        self.join_date = join_date

users = [
    User("alice", 45, 920, "2024-01-15"),
    User("bob", 120, 920, "2024-03-01"),
    User("carol", 89, 1050, "2023-11-20"),
    User("dave", 120, 920, "2024-01-15"),
]

leaderboard = sorted(
    users,
    key=lambda u: (-u.activity_score, -u.connection_count, u.join_date)
)

for rank, u in enumerate(leaderboard, 1):
    print(f"{rank}. {u.username}: score={u.activity_score}, "
          f"connections={u.connection_count}, joined={u.join_date}")
Carol ranks first (highest score). Among the three tied at 920, bob and dave have 120 connections (beating alice's 45). Between bob and dave, dave joined earlier (2024-01-15 vs 2024-03-01), so dave ranks second.
Bonus: Optimization Challenge
Can you make bubble sort 10x faster without changing the algorithm? Here is the constraint: you must still use pairwise adjacent swaps with the same logic. But you can change the implementation language.
Hint
The bottleneck is Python's interpreter overhead per comparison and swap. Each iteration of the inner loop involves Python bytecode dispatch, object comparisons, and tuple packing for the swap. What if the inner loop ran in C instead of Python?
Solution
Use numpy to pre-sort with C-level operations, or use ctypes to call a C implementation. But the real "10x" trick is simpler: just use list.sort(). Seriously. The point of this challenge is to internalize that algorithmic improvement (switching from O(n squared) to O(n log n)) always beats constant-factor optimization (making O(n squared) run faster). A faster bubble sort is still quadratic. A slower merge sort still beats it at scale. Optimize the algorithm, not the implementation.
Solution Review: Complexity Comparison
Let us put all the algorithms side by side and look at both the theoretical and practical picture. Theory tells you the shape of the curve. Practice tells you where the curve actually matters.
| Algorithm | Time (Best) | Time (Average) | Time (Worst) | Space | Stable | In-Place | Python Use Case |
|---|---|---|---|---|---|---|---|
| Bubble Sort | O(n) | O(n^2) | O(n^2) | O(1) | Yes | Yes | Never in production |
| Insertion Sort | O(n) | O(n^2) | O(n^2) | O(1) | Yes | Yes | Timsort's small-array subroutine |
| Merge Sort | O(n log n) | O(n log n) | O(n log n) | O(n) | Yes | No | External sorting, linked lists |
| Quicksort | O(n log n) | O(n log n) | O(n^2) | O(log n) | No | Yes | Not used in CPython |
| Timsort | O(n) | O(n log n) | O(n log n) | O(n) | Yes | No | sorted(), list.sort() |
Three things stand out from this table. First, Timsort is the only algorithm with an O(n) best case and an O(n log n) worst case. It literally combines the best properties of insertion sort and merge sort. Second, quicksort is the only unstable algorithm in this list, which is why Python does not use it for sorted(). Stability matters for multi-key sorting, and Python prioritizes correctness and predictability over raw speed. Third, the space column reveals a real trade-off: the algorithms with guaranteed O(n log n) time (merge sort, Timsort) both require O(n) extra space. You cannot have it all.
💡 Key Takeaway: In Python, your sorting hierarchy should be: (1) use sorted() or list.sort() for everything, (2) understand merge sort and quicksort for interviews and system design, (3) know bubble sort and insertion sort exist so you can recognize them in legacy code and explain why they should be replaced.
Analogy Bridge: Sorting algorithms are like transportation options for a cross-country trip. Bubble sort is walking: simple, works for short distances, absurd for anything real. Insertion sort is biking: still muscle-powered, but surprisingly efficient for short trips around the neighborhood. Merge sort is a train: reliable schedule, guaranteed arrival time, but you need infrastructure (extra memory). Quicksort is driving: usually fastest door-to-door, but one traffic jam (bad pivot) and you are stuck for hours. Timsort is a self-driving car that checks traffic before departure and chooses the best route: it takes the highway (merge sort) for long stretches but cuts through neighborhoods (insertion sort) when that is faster.
Where the analogy breaks down: in real transportation, you cannot combine methods mid-trip. Timsort literally combines insertion sort and merge sort within a single sort operation, switching strategies based on the data it encounters. That adaptivity has no clean transportation equivalent.
Comprehensive Code: Sorting Lab
This is the single runnable script that ties together everything from this lesson. It implements all four algorithms, benchmarks them on different input patterns, and demonstrates custom key sorting. Copy this, run it, and study the output.
"""
Sorting Algorithms Lab: From Bubble Sort to Timsort
Dr. James Park - Lesson 6
Run this script to see empirical performance comparisons
across different algorithms and input patterns.
"""
import time
import random
from operator import attrgetter
# --- Algorithm Implementations ---
def bubble_sort(arr):
    """Sort a list in-place using bubble sort with early termination."""
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:
            break  # List is sorted, no need to continue
    return arr
def insertion_sort(arr):
    """Sort a list in-place using insertion sort."""
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]  # Shift element right
            j -= 1
        arr[j + 1] = key  # Insert into correct position
    return arr
def merge(left, right):
    """Merge two sorted lists into one sorted list (stable)."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= ensures stability
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])   # Append remaining from left
    result.extend(right[j:])  # Append remaining from right
    return result
def merge_sort(arr):
    """Sort a list using recursive merge sort. Returns new sorted list."""
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])   # Recursively sort left half
    right = merge_sort(arr[mid:])  # Recursively sort right half
    return merge(left, right)
def quicksort(arr):
    """Sort using quicksort with median-of-three pivot selection."""
    if len(arr) <= 1:
        return arr
    # --- Median-of-three pivot selection ---
    first, middle, last = arr[0], arr[len(arr) // 2], arr[-1]
    pivot = sorted([first, middle, last])[1]  # Pick the median value
    # --- Three-way partition (handles duplicates well) ---
    less = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    greater = [x for x in arr if x > pivot]
    return quicksort(less) + equal + quicksort(greater)
# --- Benchmarking Utilities ---
def benchmark(sort_func, data, label):
    """Time a sort function and return elapsed time in seconds."""
    # label is accepted for symmetry with the calling loop
    test_data = data.copy()  # Don't modify the original
    start = time.perf_counter()
    sort_func(test_data)  # In-place sorts mutate test_data; the rest return new lists
    elapsed = time.perf_counter() - start
    return elapsed

def builtin_sort(arr):
    """Wrapper for Python's built-in sorted() (Timsort) to use in benchmarks."""
    return sorted(arr)
# --- Generate Test Data Patterns ---
SIZE = 3000       # Adjust for your machine (larger = slower bubble sort)
random.seed(42)   # Reproducible results
random_data = [random.randint(0, 100000) for _ in range(SIZE)]
sorted_data = list(range(SIZE))
reverse_data = list(range(SIZE, 0, -1))
nearly_sorted = list(range(SIZE))
# Introduce 5% disorder into nearly_sorted
num_swaps = SIZE // 20
for _ in range(num_swaps):
    i = random.randint(0, SIZE - 1)
    j = random.randint(0, SIZE - 1)
    nearly_sorted[i], nearly_sorted[j] = nearly_sorted[j], nearly_sorted[i]
# --- Run Benchmarks ---
print("=" * 65)
print(f"SORTING BENCHMARK: {SIZE} elements")
print("=" * 65)
algorithms = [(bubble_sort, "Bubble Sort"),
              (insertion_sort, "Insertion Sort"),
              (merge_sort, "Merge Sort"),
              (quicksort, "Quicksort (median-of-3)"),
              (builtin_sort, "Timsort (built-in)")]
patterns = [(random_data, "Random"),
            (sorted_data, "Already Sorted"),
            (reverse_data, "Reverse Sorted"),
            (nearly_sorted, "Nearly Sorted (5% swaps)")]
# Print header
print(f"\n{'Algorithm':<25}", end="")
for _, pattern_name in patterns:
    print(f"{pattern_name:<22}", end="")
print()
print("-" * 113)
# Benchmark each algorithm on each pattern
for sort_func, algo_name in algorithms:
    print(f"{algo_name:<25}", end="")
    for data, _ in patterns:
        elapsed = benchmark(sort_func, data, algo_name)
        print(f"{elapsed:<22.4f}", end="")
    print()
print()
# --- Custom Key Sorting Demo ---
print("=" * 65)
print("CUSTOM KEY SORTING: NetProbe User Leaderboard")
print("=" * 65)

class User:
    """Represents a network user with ranking attributes."""
    def __init__(self, username, connection_count, activity_score, join_date):
        self.username = username
        self.connection_count = connection_count
        self.activity_score = activity_score
        self.join_date = join_date

    def __repr__(self):
        return (f"User({self.username}, conns={self.connection_count}, "
                f"score={self.activity_score}, joined={self.join_date})")

# --- Create sample users ---
users = [User("alice", 45, 920, "2024-01-15"),
         User("bob", 120, 920, "2024-03-01"),
         User("carol", 89, 1050, "2023-11-20"),
         User("dave", 120, 920, "2024-01-15"),
         User("eve", 200, 780, "2025-06-10"),
         User("frank", 15, 1050, "2024-08-22"),
         User("grace", 67, 780, "2023-05-01")]
# --- Method 1: Tuple key (preferred) ---
print("\n--- Leaderboard (tuple key: score desc, connections desc, date asc) ---")
leaderboard = sorted(
    users,
    key=lambda u: (-u.activity_score, -u.connection_count, u.join_date)
)
for rank, user in enumerate(leaderboard, 1):
    print(f"  {rank}. {user.username:<10} score={user.activity_score:>5} "
          f"conns={user.connection_count:>4} joined={user.join_date}")
# Output:
#   1. carol      score= 1050 conns=  89 joined=2023-11-20
#   2. frank      score= 1050 conns=  15 joined=2024-08-22
#   3. dave       score=  920 conns= 120 joined=2024-01-15
#   4. bob        score=  920 conns= 120 joined=2024-03-01
#   5. alice      score=  920 conns=  45 joined=2024-01-15
#   6. eve        score=  780 conns= 200 joined=2025-06-10
#   7. grace      score=  780 conns=  67 joined=2023-05-01
# --- Method 2: attrgetter (faster for single-key) ---
print("\n--- Sorted by connection count (attrgetter, descending) ---")
by_connections = sorted(users, key=attrgetter('connection_count'), reverse=True)
for user in by_connections:
    print(f"  {user.username:<10} connections={user.connection_count}")
# Output:
#   eve        connections=200
#   bob        connections=120
#   dave       connections=120
#   carol      connections=89
#   grace      connections=67
#   alice      connections=45
#   frank      connections=15
# --- Method 3: Multiple sorts using stability ---
print("\n--- Two-pass stable sort (sort by name, then by score) ---")
stable_sorted = sorted(users, key=attrgetter('username'))  # Secondary key first
stable_sorted = sorted(stable_sorted, key=attrgetter('activity_score'), reverse=True)
for user in stable_sorted:
    print(f"  {user.username:<10} score={user.activity_score}")
# Output:
#   carol      score=1050
#   frank      score=1050
#   alice      score=920
#   bob        score=920
#   dave       score=920
#   eve        score=780
#   grace      score=780
# --- Verify sort correctness ---
print("\n--- Verification ---")
test = [random.randint(0, 999) for _ in range(500)]
assert bubble_sort(test.copy()) == sorted(test), "Bubble sort FAILED"
assert insertion_sort(test.copy()) == sorted(test), "Insertion sort FAILED"
assert merge_sort(test) == sorted(test), "Merge sort FAILED"
assert quicksort(test) == sorted(test), "Quicksort FAILED"
print("All sort implementations verified correct!")
What to look for in the output. The benchmark table will show you the quadratic wall in real numbers. Bubble sort and insertion sort will be orders of magnitude slower than merge sort, quicksort, and Timsort on random data. But watch the "Already Sorted" column: insertion sort (and even bubble sort, thanks to its early-termination flag) should be fast there, and Timsort should be nearly instant. The "Nearly Sorted" column is where Timsort really flexes, often matching or beating pure merge sort because it detects the existing runs. The leaderboard section demonstrates all three approaches to multi-key sorting: tuple keys, attrgetter, and multi-pass stable sorting.
🔨 Project Update
We are adding user ranking and a leaderboard command to our NetProbe project. In previous lessons, we built a network analyzer with user connections (Lesson 2 hash tables), a bounded activity queue (Lesson 3 deques), linked list connection chains (Lesson 4), and recursive friend-of-friend discovery with memoization (Lesson 5). Now we add the ability to sort and rank users by multiple criteria.
Here is the cumulative project code with the new additions highlighted:
"""
NetProbe - Network Analysis Tool
Cumulative project: Lessons 1-6
"""
import collections
import functools
from operator import attrgetter
from datetime import datetime
# --- Core Data Structures (Lessons 1-4) ---
class User:
    """Represents a network user with connection and activity data."""
    def __init__(self, username, join_date=None):
        self.username = username
        self.join_date = join_date or datetime.now().strftime("%Y-%m-%d")
        self.connections = set()  # Lesson 2: hash set for O(1) lookup
        self.activity_score = 0
        self.activity_log = collections.deque(maxlen=50)  # Lesson 3: bounded queue

    def __repr__(self):
        return f"User({self.username})"
class Network:
    """Social network graph with analysis capabilities."""
    def __init__(self):
        self.users = {}  # Lesson 2: dict for O(1) user lookup

    def add_user(self, username, join_date=None):
        """Add a user to the network."""
        if username not in self.users:
            self.users[username] = User(username, join_date)
        return self.users[username]

    def add_connection(self, user_a, user_b):
        """Create a bidirectional connection between two users."""
        if user_a in self.users and user_b in self.users:
            self.users[user_a].connections.add(user_b)
            self.users[user_b].connections.add(user_a)

    def log_activity(self, username, action, points=10):
        """Log user activity and update score."""
        if username in self.users:
            user = self.users[username]
            user.activity_log.append(action)  # Lesson 3: deque auto-evicts old
            user.activity_score += points

    # --- Lesson 5: Recursive friend-of-friend discovery ---
    def discover_connections(self, username, depth=2, visited=None):
        """Find friends-of-friends up to a given depth using recursion."""
        if visited is None:
            visited = set()
        if depth == 0 or username not in self.users:
            return set()  # Base case
        visited.add(username)
        discovered = set()
        for friend in self.users[username].connections:
            if friend not in visited:
                discovered.add(friend)
                # Recursive case: go deeper
                deeper = self.discover_connections(friend, depth - 1, visited)
                discovered.update(deeper)
        return discovered
    # --- NEW IN LESSON 6: Sorting and Leaderboard ---
    def leaderboard(self, sort_by="activity", limit=None):
        """
        Generate a ranked leaderboard of users.
        sort_by options:
            "activity"    - by activity_score descending
            "connections" - by connection count descending
            "combined"    - by score desc, then connections desc, then join_date asc
        """
        user_list = list(self.users.values())
        if sort_by == "activity":
            ranked = sorted(user_list,
                            key=lambda u: -u.activity_score)
        elif sort_by == "connections":
            ranked = sorted(user_list,
                            key=lambda u: -len(u.connections))
        elif sort_by == "combined":
            # Multi-key sort using tuple (primary, secondary, tertiary)
            ranked = sorted(
                user_list,
                key=lambda u: (-u.activity_score,
                               -len(u.connections),
                               u.join_date)
            )
        else:
            raise ValueError(f"Unknown sort_by: {sort_by}")
        if limit:
            ranked = ranked[:limit]
        return ranked

    def cli_leaderboard(self, sort_by="combined", limit=None):
        """Display formatted leaderboard to console."""
        ranked = self.leaderboard(sort_by=sort_by, limit=limit)
        print(f"\n{'='*60}")
        print(f"  NETPROBE LEADERBOARD (sorted by: {sort_by})")
        print(f"{'='*60}")
        print(f"  {'Rank':<6}{'User':<12}{'Score':<10}"
              f"{'Conns':<10}{'Joined':<12}")
        print(f"  {'-'*54}")
        for rank, user in enumerate(ranked, 1):
            print(f"  {rank:<6}{user.username:<12}"
                  f"{user.activity_score:<10}"
                  f"{len(user.connections):<10}"
                  f"{user.join_date:<12}")
        print()

    def cli_discover(self, username, depth=2):
        """Display friend-of-friend discovery results. (Lesson 5)"""
        discovered = self.discover_connections(username, depth)
        direct = self.users[username].connections if username in self.users else set()
        print(f"\nDiscovery from {username} (depth={depth}):")
        for person in sorted(discovered):  # sorted() for consistent display
            tag = "(direct)" if person in direct else "(indirect)"
            print(f"  -> {person} {tag}")
# --- Demo: Build and Analyze a Network ---
if __name__ == "__main__":
    net = Network()
    # Add users with join dates
    net.add_user("alice", "2023-06-15")
    net.add_user("bob", "2023-09-01")
    net.add_user("carol", "2024-01-10")
    net.add_user("dave", "2024-03-22")
    net.add_user("eve", "2023-12-05")
    net.add_user("frank", "2024-07-18")
    # Build connections
    net.add_connection("alice", "bob")
    net.add_connection("alice", "carol")
    net.add_connection("bob", "carol")
    net.add_connection("bob", "dave")
    net.add_connection("carol", "eve")
    net.add_connection("dave", "eve")
    net.add_connection("eve", "frank")
    # Simulate activity
    for action in ["post", "comment", "share", "like", "post"]:
        net.log_activity("alice", action, points=15)  # 75 total
    for action in ["post", "comment", "post", "post"]:
        net.log_activity("bob", action, points=20)  # 80 total
    for action in ["comment", "like"]:
        net.log_activity("carol", action, points=10)  # 20 total
    for action in ["post", "share", "comment", "like", "post", "share"]:
        net.log_activity("dave", action, points=12)  # 72 total
    for action in ["post", "comment", "share"]:
        net.log_activity("eve", action, points=25)  # 75 total
    for action in ["like"]:
        net.log_activity("frank", action, points=5)  # 5 total
    # Lesson 5: Friend-of-friend discovery
    net.cli_discover("alice", depth=2)
    # NEW - Lesson 6: Leaderboard commands
    net.cli_leaderboard(sort_by="activity")
    net.cli_leaderboard(sort_by="connections")
    net.cli_leaderboard(sort_by="combined", limit=3)
What we added this lesson:
- leaderboard() method with three sort modes: activity, connections, and combined multi-key
- cli_leaderboard() for formatted console output
- Multi-key sorting using tuple keys with negation for descending order
- The limit parameter for "Top K" style results (foreshadowing Lesson 8 on heaps)
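That limit parameter slices a fully sorted list, which costs O(n log n) no matter how small the limit is. When you only need the top K of a much larger population, a heap can skip the full sort, which is a preview of Lesson 8. A small sketch using heapq.nlargest on made-up (score, username) pairs:

```python
import heapq

# Hypothetical (activity_score, username) pairs -- not the NetProbe demo data
scores = [(75, "alice"), (80, "bob"), (20, "carol"), (72, "dave"), (75, "eve")]

# Top 3 by score without sorting the whole list: O(n log k) instead of O(n log n)
top3 = heapq.nlargest(3, scores)
print(top3)  # [(80, 'bob'), (75, 'eve'), (75, 'alice')]
```

With tuples, ties on score fall through to username comparison, so (75, 'eve') outranks (75, 'alice'); pass key= to nlargest if you want different tiebreaking.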
Run the project you've built so far. You should see output like:
Discovery from alice (depth=2):
-> bob (direct)
-> carol (direct)
-> dave (indirect)
-> eve (indirect)
============================================================
NETPROBE LEADERBOARD (sorted by: activity)
============================================================
Rank User Score Conns Joined
------------------------------------------------------
1 bob 80 3 2023-09-01
2 alice 75 2 2023-06-15
3 eve 75 3 2023-12-05
4 dave 72 2 2024-03-22
5 carol 20 3 2024-01-10
6 frank 5 1 2024-07-18
============================================================
NETPROBE LEADERBOARD (sorted by: connections)
============================================================
Rank User Score Conns Joined
------------------------------------------------------
1 bob 80 3 2023-09-01
2 carol 20 3 2024-01-10
3 eve 75 3 2023-12-05
4 alice 75 2 2023-06-15
5 dave 72 2 2024-03-22
6 frank 5 1 2024-07-18
============================================================
NETPROBE LEADERBOARD (sorted by: combined)
============================================================
Rank User Score Conns Joined
------------------------------------------------------
1 bob 80 3 2023-09-01
2 eve 75 3 2023-12-05
3 alice 75 2 2023-06-15
Notice how the combined leaderboard breaks ties correctly. Alice and Eve both have 75 activity points, but Eve has 3 connections to Alice's 2, so Eve ranks higher. This is the tuple key (-activity_score, -len(connections), join_date) doing its work. If they also tied on connections, the earlier join date would win.
Summary Table
| Concept | What It Does | When to Use | Watch Out For |
|---|---|---|---|
| Bubble Sort | Repeatedly swaps adjacent pairs until no swaps remain, O(n^2) | Never in production | Quadratic cost makes it unusable past a few hundred elements |
| Insertion Sort | Maintains a sorted prefix and inserts each new element into its correct position, O(n^2) | Small arrays, nearly sorted data, Timsort's sub-array subroutine | Still quadratic on random or reverse-sorted input |
| Merge Sort | Divides the list in half, sorts each half recursively, merges the sorted halves, O(n log n) | When you need a guaranteed time bound, external sorting, linked lists | Requires O(n) additional memory for the merge buffers |
| Quicksort | Partitions around a pivot element, recursively sorts each partition, O(n log n) average | Systems-level code in C, C++, or Java (where it is used by the runtime) | O(n^2) worst case with a bad pivot, especially on sorted input |
| Timsort | Detects natural runs, extends short runs with insertion sort, merges runs adaptively | Python's default: always use sorted() or list.sort() | Do not reimplement it; understand it so you can trust it |
| key functions | Transform each element once before comparison, caching the result internally | Any custom sort on objects or computed values | Called once per element, not once per comparison |
| Tuple keys | Pack multiple sort criteria into a single key for one-pass multi-key sorting | Leaderboards, rankings, any sort with tiebreaker logic | Negation for descending order only works on numeric types, not strings |
| Sort stability | Guarantees that equal elements keep their original relative order after sorting | Multi-pass sorting, layered sort criteria, preserving prior sort order | Only valid for stable algorithms; quicksort (Lomuto) is not stable |
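The last two rows of the table interact: negation cannot make a string key descending, but stability can. A sketch with hypothetical (username, score) pairs, sorting by score descending with username descending as the tiebreaker, something a single tuple key cannot express:

```python
records = [("alice", 920), ("bob", 920), ("carol", 1050), ("dave", 780)]

# Goal: score descending, username DESCENDING on ties.
# (-score, -name) is impossible -- you cannot negate a string.
# Two-pass stable sort: secondary key first, then primary key.
by_name_desc = sorted(records, key=lambda r: r[0], reverse=True)
result = sorted(by_name_desc, key=lambda r: r[1], reverse=True)
print(result)
# [('carol', 1050), ('bob', 920), ('alice', 920), ('dave', 780)]
```

Note that reverse=True still preserves stability in Python: tied elements keep the order they had going into the sort, which is what makes the second pass safe.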
Next lesson, we shift from sorting to searching. Lesson 7 covers binary search and binary search trees, where the sorted order we established today becomes the foundation for O(log n) lookups without hashing. If you understand why Timsort works hard to maintain sorted runs, you will appreciate why binary search trees work hard to maintain sorted structure.
Difficulty Fork
Too Easy
You have a solid handle on sorting. Key takeaways to carry forward: (1) always use Python's built-in sort in production, (2) Timsort exploits existing order in data, (3) tuple keys with negation solve multi-criteria ranking in a single pass. Next lesson introduces binary search, which requires sorted data as a precondition, so today's work is directly foundational.
Extra challenge: Implement a "stability test" that creates a list of (value, original_index) tuples, sorts by value only, and verifies that tied values maintain their original index order. Confirm that merge sort is stable and quicksort (Lomuto partition) is not.
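One caveat before you start: the comprehension-based quicksort in the lab script keeps equal elements in their original relative order, so it happens to be stable; to observe instability you need an in-place partition. A hedched sketch of the harness with a minimal Lomuto quicksort written just for this test (the helper names are invented for this example):

```python
def lomuto_quicksort(arr, lo=0, hi=None):
    """In-place quicksort with Lomuto partition (NOT stable)."""
    if hi is None:
        hi = len(arr) - 1
    if lo < hi:
        pivot = arr[hi]
        i = lo - 1
        for j in range(lo, hi):
            if arr[j][0] <= pivot[0]:  # compare by value only, ignore the tag
                i += 1
                arr[i], arr[j] = arr[j], arr[i]
        arr[i + 1], arr[hi] = arr[hi], arr[i + 1]
        lomuto_quicksort(arr, lo, i)
        lomuto_quicksort(arr, i + 2, hi)
    return arr

def is_stable(result):
    """Tied values must keep ascending original indices."""
    return all(a[1] < b[1] for a, b in zip(result, result[1:]) if a[0] == b[0])

data = [(5, 0), (3, 1), (5, 2), (3, 3), (5, 4)]  # (value, original_index)
print(is_stable(sorted(data, key=lambda t: t[0])))  # True  -- Timsort is stable
print(is_stable(lomuto_quicksort(data.copy())))     # False -- the 5s get reordered
```

The adjacency check in is_stable is sufficient because the list is already sorted by value, so tied values sit next to each other.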
Just Right
You are building the right intuitions. Try this reinforcement exercise: create a list of 10,000 elements where the first 9,900 are sorted and the last 100 are random. Time insertion sort, merge sort, and sorted() on this data. You should see insertion sort and Timsort outperform merge sort here because they exploit the existing order in the first 9,900 elements.
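A quick sketch of building that test input (sizes as stated in the exercise; the seed is an arbitrary choice for reproducibility). Feed the same list to the benchmark harness from the lab script to compare all three:

```python
import random
import time

random.seed(0)  # reproducible disorder
data = list(range(9900)) + [random.randint(0, 9999) for _ in range(100)]

# Timsort detects the long initial run and only has to fold in the short tail
start = time.perf_counter()
sorted(data)
print(f"Timsort: {time.perf_counter() - start:.6f}s")
```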
Alternative analogy: Think of merge sort as assembling a jigsaw puzzle by first sorting pieces into edge pieces and interior pieces (divide), sorting each group (recurse), then combining them (merge). The "extra space" is your table surface. You need room to lay pieces out before joining them.
Challenge
Interview Problem (Amazon, Meta): You are given a list of log entries, each with a timestamp and a server ID. The logs from each individual server are already in chronological order, but the combined list is not. Design the most efficient sorting approach.
Hint: This is a k-way merge problem. You have k sorted subsequences interleaved together. A naive sort ignores the existing order and runs in O(n log n). A k-way merge using a min-heap (Lesson 8 preview) runs in O(n log k), which is significantly faster when k is much smaller than n. This is exactly the pattern Timsort uses internally: detect sorted runs, then merge them efficiently.
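Python's standard library already ships this k-way merge: heapq.merge consumes k sorted iterables through a heap of size k and yields results lazily. A sketch with three made-up server logs as (timestamp, server_id) pairs:

```python
import heapq

# Each server's log is already sorted by timestamp (hypothetical data)
server_a = [(1, "a"), (4, "a"), (9, "a")]
server_b = [(2, "b"), (3, "b"), (8, "b")]
server_c = [(5, "c"), (6, "c"), (7, "c")]

# O(n log k) merge; entries come out in global timestamp order
merged = list(heapq.merge(server_a, server_b, server_c))
print(merged)
# [(1, 'a'), (2, 'b'), (3, 'b'), (4, 'a'), (5, 'c'), (6, 'c'), (7, 'c'), (8, 'b'), (9, 'a')]
```

Because heapq.merge is lazy, it also works on log streams too large to hold in memory at once.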
Production Problem: You are at a fintech company processing 50 million transactions per day. Transactions arrive mostly in order (by timestamp) but occasionally arrive late due to network delays. Design a sorting strategy that handles this "mostly sorted with late arrivals" pattern efficiently.
Answer: Use an insertion-sort-like approach with a bounded buffer. Keep a sorted buffer of the last N transactions. Each new transaction is inserted into its correct position in the buffer (O(N) worst case per insert, but usually O(1) because most arrive in order). Flush the oldest transactions from the buffer once they are past the maximum expected delay window. This is essentially what Timsort does at a micro level: handle small disorders with insertion sort, handle large blocks with merge sort.
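One way to sketch that buffer with the standard bisect module. The window size, flush policy, and class name here are illustrative assumptions, not a production design; a real system would window by timestamp age rather than buffer length:

```python
import bisect

class LateArrivalBuffer:
    """Keep a sorted buffer of recent transactions; flush the oldest
    once they are past the maximum expected delay window."""
    def __init__(self, window=3):
        self.window = window  # max buffered transactions (illustrative)
        self.buffer = []      # always kept sorted by timestamp
        self.flushed = []     # final, globally sorted output

    def add(self, timestamp):
        # O(window) insert in the worst case, near-append when data arrives in order
        bisect.insort(self.buffer, timestamp)
        while len(self.buffer) > self.window:
            self.flushed.append(self.buffer.pop(0))  # oldest entry is safe to emit

stream = [1, 2, 3, 5, 4, 6, 8, 7, 9, 10, 11, 12]  # mostly ordered, two late arrivals
buf = LateArrivalBuffer(window=3)
for t in stream:
    buf.add(t)
buf.flushed.extend(buf.buffer)  # drain the buffer at end of stream
print(buf.flushed)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

The output is correctly sorted as long as no transaction arrives more than window positions late, which is exactly the "bounded disorder" assumption the answer describes.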