Hash Tables Under the Hood: Why dict and set Are Your Best Friends
Learning Objectives
- ✓ Implement a basic hash table with chaining to demonstrate core hashing mechanics
- ✓ Apply dict and set operations to replace O(n) linear scans with O(1) lookups in existing code
- ✓ Analyze Python dict internals including open addressing, compact layout, and insertion-order preservation
- ✓ Use set intersection, union, and difference to solve collection comparison problems efficiently
- ✓ Use defaultdict and Counter to simplify aggregation and counting patterns
- ✓ Evaluate when to use dict vs set vs specialized collections based on access patterns and performance requirements
In 45 minutes, you will have a working hash table built from scratch, a deep understanding of how Python's dict and set operate internally, and a NetProbe CLI that finds mutual friends between users in milliseconds instead of minutes.
Last lesson, we ripped the cover off Python lists and found a dynamic array underneath. You saw that list.append() is amortized O(1) but x in my_list is O(n), meaning every lookup scans the entire array. I left you with a question: what if you need to search through a million records? A list will betray you. Today, we fix that betrayal with the single most important data structure in practical programming: the hash table.
Here is what the final output of today's project looks like:
$ python netprobe.py lookup alice
[NetProbe] Found user 'alice' in 0.000004s (dict lookup)
$ python netprobe.py mutual alice bob
[NetProbe] alice and bob share 3 mutual connections: {carol, dave, eve}
[NetProbe] Computed via set intersection in 0.000002s
$ python netprobe.py common-connections
[NetProbe] Top pairs by mutual connections:
alice <-> bob: 3 shared
carol <-> dave: 2 shared
eve <-> frank: 1 shared
That is what O(1) feels like. Let's earn it.
The Production Incident That Taught Me to Love Hashing
Every lesson I teach on hash tables starts with the same story, because it is the reason I actually understand them. In 2016, I was at a startup building a Django-based analytics dashboard. We had about 400,000 user records, and the "find duplicate emails" feature was implemented as a nested loop: for each user, scan all other users to check for a matching email. That is O(n²). With 400,000 users, that is 160 billion comparisons. The feature ran as a Celery task and it never finished. The worker would time out, retry, time out again, and eventually fill our Redis queue with zombie tasks until the whole system ground to a halt at 3 AM on a Saturday.
The fix took four lines of code. I replaced the nested loop with a set. As we iterated through users, we checked email in seen_emails (O(1) per check) and added to the set. The entire deduplication ran in under two seconds. Four lines. Two seconds. I went from a weekend-ruining outage to a solved problem in the time it takes to brew coffee. That is the power of hash tables, and that is why I am so opinionated about teaching them early.
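A reconstruction of that fix (with illustrative names, not the original code) looks like this: one pass, one set, O(n) total.

```python
def find_duplicate_emails(users):
    """Return the set of emails that appear more than once."""
    seen_emails = set()
    duplicates = set()
    for user in users:
        email = user["email"]
        if email in seen_emails:   # O(1) membership check
            duplicates.add(email)
        seen_emails.add(email)     # O(1) insert
    return duplicates

users = [{"email": "a@x.com"}, {"email": "b@x.com"}, {"email": "a@x.com"}]
print(find_duplicate_emails(users))  # {'a@x.com'}
```

One hash-table lookup per user replaces a full scan per user: that is the entire difference between "never finishes" and "two seconds."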
The core insight is simple: hash tables trade memory for speed. Instead of searching through every item to find what you want, a hash table computes a numeric index directly from the key. It is like the difference between searching every book in a library versus walking straight to the shelf number printed on your library card. The shelf number is the hash. The library is the array of buckets. And when two books get assigned the same shelf, that is a collision, and we need a strategy for handling it.
💡 Key Insight: If you ever write for item in collection just to check membership or find a value by key, you are almost certainly doing it wrong. A dict or set can do it in O(1).
Before we touch Python's built-in dict and set, we are going to build a hash table from scratch. I am a firm believer that you do not truly understand a data structure until you have implemented one badly, debugged it, and then appreciated why the standard library version is better. Let's start building.
Building a Hash Table from Scratch: Buckets, Hashes, and Collisions
A hash table is just an array where the index is computed from the key, not assigned sequentially. That is it. Every other detail (buckets, collision resolution, load factors, rehashing) is a consequence of making that one idea work reliably. Let me walk you through the mechanics.
Step 1: The hash function turns a key into an array index. You feed in a string like "alice" and get back a number like 3. That number tells you which slot (or "bucket") in the array to use. Python gives us a built-in hash() function that returns a large integer for any hashable object. We then use the modulo operator (% size) to squeeze that big number into our small array. If our array has 8 slots, then hash("alice") % 8 gives us a number between 0 and 7.
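In code, that is one line per key (no expected output shown, because hash randomization changes string hashes on every run):

```python
# Map each key to a bucket index; modulo guarantees the index fits the table.
size = 8
for key in ("alice", "bob", "carol"):
    index = hash(key) % size
    print(f"{key!r} -> bucket {index}")
    assert 0 <= index < size  # modulo always lands inside the table
```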
Step 2: Collisions are inevitable, and you must plan for them. Think about it: if you have 8 buckets and 20 keys, at least some keys must share a bucket. This is called a collision. There are two major strategies for handling collisions. "Chaining" stores a list of key-value pairs at each bucket, so colliding keys just pile up in the same slot. "Open addressing" searches for the next empty slot when a collision occurs. We will build chaining first because it is easier to understand, then I will explain why Python uses open addressing instead.
Step 3: The load factor determines when your table gets slow. The load factor is the ratio of stored items to total buckets: n / size. When this ratio climbs past about 0.7, collisions become frequent and performance degrades. At that point, you "rehash": create a bigger array (typically double the size) and reinsert every item. Yes, rehashing is O(n), but just like the dynamic array resizing we saw in Lesson 1, it happens rarely enough that the amortized cost stays O(1).
The table below summarizes the two collision strategies. Read through it, but I will explain the tradeoffs in detail afterward.
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Chaining | Each bucket holds a linked list of (key, value) pairs | Simple to implement; never truly "full" | Extra memory for list nodes; cache-unfriendly |
| Open Addressing | On collision, probe for next empty slot | Cache-friendly; no pointer overhead | Deletion is tricky; clustering can degrade performance |
The reason Python uses open addressing is performance on modern hardware. When you chain with linked lists, each node can live anywhere in memory. Following pointers from node to node causes CPU cache misses, which are expensive. Open addressing stores everything in a contiguous array, which means sequential memory access and better cache utilization. CPython's dict uses a technique called "open addressing with perturbation" that mixes in higher bits of the hash to break up clustering patterns. We will dig into that in the deep dive below.
Deep Dive: CPython's Perturbation Probing
Standard open addressing uses "linear probing" (check slot i, then i+1, then i+2) or "quadratic probing" (check slot i, then i+1, then i+4, then i+9). Both suffer from clustering, where occupied slots clump together and slow down lookups.
CPython does something cleverer. When a collision occurs at slot j, the next probe is computed as:
j = (5 * j + 1 + perturb) % size
where perturb starts as the full hash value and gets right-shifted by 5 on each iteration (perturb >>= 5). This means the probe sequence depends on ALL bits of the hash, not just the low bits used for the initial index. It scatters probes across the table more uniformly than linear or quadratic probing.
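The probe loop looks roughly like this in Python (a simplification of the real C loop in Objects/dictobject.c, not CPython's exact code):

```python
def probe_sequence(h, size=8, num_probes=8):
    """Return the slots visited for hash h, CPython-style (simplified)."""
    perturb = h
    j = h % size
    slots = [j]
    for _ in range(num_probes - 1):
        perturb >>= 5                      # fold in higher hash bits
        j = (5 * j + 1 + perturb) % size   # the recurrence from dictobject.c
        slots.append(j)
    return slots

print(probe_sequence(12345))  # [1, 7, 0, 1, 6, 7, 4, 5]
# Once perturb reaches 0, j = 5*j + 1 (mod 2**k) cycles through every
# slot, so the probe sequence is guaranteed to find an empty one:
assert set(probe_sequence(0)) == set(range(8))
```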
This is documented in CPython's Objects/dictobject.c source code, and the comment there (written by Tim Peters) is one of the best pieces of technical writing in the entire CPython codebase. I recommend reading it if you enjoy understanding internals.
🤔 Think about it: If you have a hash table with 8 buckets and you insert 6 items, what is the load factor? Should you rehash?
Answer:
The load factor is 6/8 = 0.75, above the typical threshold of about 0.67-0.7, so yes, you should rehash. CPython's dict resizes when the table passes 2/3 full: at 5 items the load factor is 5/8 = 0.625, still under the threshold, but inserting the 6th item pushes it to 0.75, so CPython triggers the resize on that insertion.
💡 Key Takeaway: A hash table is an array indexed by hash values, with a collision strategy and a resize policy. Every feature of dict and set follows from these three pieces.
Why Python's dict Is a Masterpiece of Engineering
Python's dict is not just "a hash table." It is one of the most optimized data structures in any language's standard library. Starting with Python 3.6 (and guaranteed from 3.7), dictionaries maintain insertion order. This was not achieved by adding a linked list on top of the hash table, which is what Java's LinkedHashMap does. Instead, Raymond Hettinger redesigned the internal layout to be both more memory-efficient and ordered by default. This is the "compact dict" design, and understanding it will change how you think about dictionaries.
The compact dict uses two arrays instead of one. The old design had a single sparse array where each slot held (hash, key, value) or was empty. If the array had 8 slots but only 3 items, 5 slots were wasted space, each holding three pointers worth of nothing. The compact dict splits this into a dense array of (hash, key, value) entries and a sparse index array that just holds small integers pointing into the dense array. The dense array has no gaps, so insertion order is preserved naturally, and the sparse array wastes far less memory because each empty slot is just one small integer, not three pointers.
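Here is a toy model of that two-array split (heavily simplified: no collision handling, a fixed 8-slot index array, and Python ints standing in for the 1-byte indices):

```python
entries = []            # dense array: (hash, key, value), append-only
sparse = [None] * 8     # sparse array: slot -> index into `entries`

def insert(key, value):
    slot = hash(key) % len(sparse)   # collision handling omitted for clarity
    sparse[slot] = len(entries)      # small integer, not three pointers
    entries.append((hash(key), key, value))

insert("alice", 1)
insert("bob", 2)
insert("carol", 3)
# Iteration walks the dense array, so insertion order falls out for free:
print([key for _, key, _ in entries])  # ['alice', 'bob', 'carol']
```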
This redesign saved 20-25% memory in real-world Python programs. The Python core team measured this across a variety of benchmarks. For web applications that create thousands of dictionaries (think of every request object, every JSON payload, every ORM model instance), this adds up fast. Django, Flask, and FastAPI all benefit enormously from this optimization because HTTP request handling is dictionary-heavy. Every header, every query parameter, every form field is a key-value pair in a dict.
The following table compares old-style and compact dict layouts for a table with 8 slots and 3 stored entries:
| Property | Old Layout (Pre-3.6) | Compact Layout (3.7+) |
|---|---|---|
| Sparse array entry size | 3 pointers (hash + key + value) | 1 byte (index into dense array) |
| Wasted space per empty slot | 24 bytes (on 64-bit) | 1 byte |
| Insertion order preserved | No (implementation detail in 3.6) | Yes (guaranteed) |
| Memory for 3 items, 8 slots | 8 x 24 = 192 bytes | 8 x 1 + 3 x 24 = 80 bytes |
The bottom line is dramatic: 80 bytes versus 192 bytes for the same data. That 58% reduction compounds across every dictionary in your program. If you are running a FastAPI service that handles 10,000 concurrent requests, each with a dozen dicts for headers, query params, JSON body, and middleware state, the savings are measured in megabytes. This is why I tell my students: dict is not just convenient, it is engineered for production.
Deep Dive: How dict Preserves Insertion Order
The trick is beautifully simple. The dense entries array is append-only during normal operation. When you insert "alice": 1, then "bob": 2, then "carol": 3, they go into indices 0, 1, 2 of the dense array. The sparse index array maps hash slots to dense indices: if hash("alice") % 8 == 3, then sparse[3] = 0 (pointing to dense index 0).
When you iterate with for k in my_dict, Python just walks the dense array from index 0 to index N-1. Since items were appended in insertion order, iteration is in insertion order. No linked list needed.
Deletion is the one complication: removing an item from the middle of the dense array would break indices. CPython handles this by marking the dense entry as a "dummy" (tombstone) and adjusting during the next resize. This is why heavy deletion patterns can temporarily use more memory than expected, and why dict.pop() does not immediately reclaim the space.
Now that we understand the internal machinery of dict, let's see how set works. The answer will not surprise you: it is the same thing, minus the values.
Sets: The Unsung Hero of Algorithmic Thinking
A set in Python is literally a dict without values. Internally, CPython's set uses the same open-addressing hash table with the same perturbation probing. The only difference is that each entry stores just a hash and a key, with no associated value. This is why set has the same O(1) membership testing, the same collision behavior, and the same load factor constraints as dict. But set gives you something dict does not: mathematical set operations that are insanely useful in practice.
Set intersection alone has saved me more debugging hours than any other single operation. Finding common elements between two collections is a problem that comes up constantly: mutual friends in a social graph, overlapping permissions between roles, shared dependencies between packages, duplicate entries across databases. Without sets, you write nested loops. With sets, you write a & b. The asymptotic difference is not incremental, it is categorical: O(n × m) versus O(min(n, m)).
Let me show you the wrong way first, because I need you to feel the pain before you appreciate the cure.
❌ The nested loop approach:
friends_alice = ["bob", "carol", "dave", "eve", "frank"]
friends_bob = ["carol", "dave", "eve", "grace", "heidi"]
mutual = [f for f in friends_alice if f in friends_bob]
# O(n * m): for each of alice's friends, scan ALL of bob's friends
This looks innocent with 5 friends each. But social networks do not have 5 friends. Facebook's median user has around 200 friends. If you run this on two users with 200 friends each, you execute 40,000 comparisons. For a feature that shows mutual friends across all friend pairs of a user with 200 friends, you have 200 × 199 / 2 = 19,900 pairs, each requiring up to 40,000 comparisons. That is nearly 800 million operations. I have seen this exact pattern take down a microservice at a company I will not name.
✅ The set intersection approach:
friends_alice = {"bob", "carol", "dave", "eve", "frank"}
friends_bob = {"carol", "dave", "eve", "grace", "heidi"}
mutual = friends_alice & friends_bob # O(min(n, m))
# Result: {"carol", "dave", "eve"}
The & operator on sets iterates through the smaller set and checks each element against the larger one. Each membership check is O(1), so the total cost is O(min(n, m)). For 200 friends, that is 200 checks instead of 40,000. For the full mutual-friends-across-all-pairs computation, the difference is staggering.
Here is a complete comparison of set operations and their costs. Study this table, because these operations will appear in every coding interview that involves collections.
| Operation | Syntax | Big-O | What It Returns |
|---|---|---|---|
| Intersection | a & b or a.intersection(b) | O(min(len(a), len(b))) | Elements in both a and b |
| Union | a \| b or a.union(b) | O(len(a) + len(b)) | Elements in a or b or both |
| Difference | a - b or a.difference(b) | O(len(a)) | Elements in a but not in b |
| Symmetric Difference | a ^ b or a.symmetric_difference(b) | O(len(a) + len(b)) | Elements in a or b but not both |
| Subset Check | a <= b or a.issubset(b) | O(len(a)) | True if every element of a is in b |
An important subtlety: intersection iterates the smaller set. CPython's implementation of set.intersection starts by checking which set is smaller, then loops through the smaller one checking membership in the larger one. This means tiny_set & huge_set and huge_set & tiny_set are equally fast. The implementation handles it for you. LinkedIn uses exactly this principle for their "People You May Know" feature, where they compute set intersections between connection lists of varying sizes across hundreds of millions of users.
⚠️ Common Pitfall: Do not confuse set difference (a - b) with symmetric difference (a ^ b). The first gives you what is in a but not b. The second gives you what is in either but not both. I have reviewed pull requests where this confusion caused a permission system to grant access to users who should have been denied. Test with small examples first.
🤔 Think about it: You have set A with 1,000 elements and set B with 10 elements. Which is faster: A & B or A - B? Why?
Answer:
A & B is faster. Intersection is O(min(len(A), len(B))) = O(10), because it iterates the smaller set and checks membership in the larger one. Difference A - B is O(len(A)) = O(1,000), because it must check every element of A against B. When one set is much smaller than the other, intersection is dramatically cheaper than difference.
💡 Key Takeaway: Set operations are not just syntactic sugar for loops. They are algorithmically superior implementations that exploit hash-based O(1) membership testing. Use them whenever you need to compare, combine, or filter collections.
defaultdict, Counter, and ChainMap: Power Tools You Are Probably Not Using
Now that you understand the hash table machinery underneath, let's look at the specialized dict variants that eliminate entire categories of boilerplate code. The collections module contains three tools I reach for constantly: defaultdict, Counter, and ChainMap. Each one wraps a dict with behavior that solves a specific, common pattern.
defaultdict eliminates the "check if key exists, then initialize" dance. Here is a pattern I see in virtually every beginner's Python code: you want to group items by some key, so you write an if key not in result check before appending. This is tedious, error-prone, and slow (the key gets hashed twice, once for the in check and once for the assignment). With defaultdict, you specify a factory function (like list or int) and the dictionary auto-creates missing keys with a default value.
❌ The boilerplate way:
groups = {}
for user in users:
if user.role not in groups:
groups[user.role] = []
groups[user.role].append(user.name)
✅ The defaultdict way:
from collections import defaultdict
groups = defaultdict(list)
for user in users:
groups[user.role].append(user.name) # auto-creates empty list if role is new
Counter is a dict subclass designed specifically for counting things. It takes an iterable and returns a dictionary mapping each element to its count. But the real power is in its methods: most_common(n) gives you the top N elements by count, and Counter objects support arithmetic operations. You can add two Counters together, subtract one from another, or intersect them (which gives the minimum count for each shared key). At Google, I used Counter extensively for log analysis: counting error codes, tallying request types, summarizing user actions. It turns a five-line loop into a one-liner.
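A quick taste of that pattern (the status codes below are invented for illustration, not real log data):

```python
from collections import Counter

# Tally HTTP status codes from a (made-up) request log.
status_codes = [500, 200, 404, 500, 200, 200, 503, 500]
counts = Counter(status_codes)
print(counts.most_common(2))
# [(500, 3), (200, 3)] -- ties break by first-seen order
print(counts[404], counts[418])
# 1 0 -- missing keys count as zero instead of raising KeyError
```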
ChainMap groups multiple dicts into a single lookup without copying. This is less commonly taught but incredibly useful for configuration systems. When you have default settings, environment overrides, and command-line flags, you want lookups to check command-line first, then environment, then defaults. ChainMap does exactly this. Django's settings system does not use ChainMap directly, but the pattern is identical: a layered lookup where more specific values shadow less specific ones. Flask's configuration system works similarly, checking app config, then blueprint config, then defaults.
| Tool | When to Use | What It Replaces |
|---|---|---|
| defaultdict(list) | Grouping items by key | if key not in d: d[key] = [] boilerplate |
| defaultdict(int) | Counting occurrences | d[key] = d.get(key, 0) + 1 |
| Counter | Frequency analysis, top-N queries | Manual counting loops |
| ChainMap | Layered config/settings lookup | Merging dicts with {**defaults, **overrides} |
A nuance about ChainMap that trips people up: writes only affect the first mapping. If you do chain_map["key"] = value, it writes to the first dict in the chain, not to whichever dict originally held that key. This is intentional, because the first mapping represents the most specific scope, but it surprises people who expect it to work like a merged dict. I lost about two hours to this behavior when building a configuration system for a CLI tool. The defaults kept getting "overridden" but the override was going to the wrong layer.
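A minimal sketch of both the layered lookup and the write gotcha (the config keys here are illustrative):

```python
from collections import ChainMap

defaults = {"debug": False, "timeout": 30, "host": "localhost"}
env = {"timeout": 60}
cli = {"debug": True}

config = ChainMap(cli, env, defaults)   # most specific layer first
print(config["debug"])    # True        (from cli)
print(config["timeout"])  # 60          (from env)
print(config["host"])     # localhost   (from defaults)

# The gotcha: writes always go to the FIRST mapping, never the layer
# that originally supplied the value.
config["timeout"] = 15
print(env["timeout"])   # 60 -- unchanged
print(cli["timeout"])   # 15 -- the write landed here
```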
Deep Dive: Counter Arithmetic
Counter supports +, -, & (intersection/minimum), and | (union/maximum). Here is what each does:
- Counter("aab") + Counter("bcc") gives Counter({'a': 2, 'b': 2, 'c': 2}). Counts are summed.
- Counter("aab") - Counter("ab") gives Counter({'a': 1}). Counts are subtracted, and zero or negative counts are dropped.
- Counter("aab") & Counter("abc") gives Counter({'a': 1, 'b': 1}). For each key, take the minimum count.
- Counter("aab") | Counter("abc") gives Counter({'a': 2, 'b': 1, 'c': 1}). For each key, take the maximum count.
The intersection operation is particularly useful for finding the "minimum common multiset" between two collections, for example, finding which Scrabble letters two players share at minimum.
🤔 Think about it: What happens if you access a key in a defaultdict(int) that does not exist, and you never assign to it? Does the key end up in the dict?
Answer:
Yes. Accessing a missing key in a defaultdict triggers the default factory, which creates the key with the default value (0 for int). So d = defaultdict(int); x = d["missing"] leaves d as {"missing": 0}. This is a common surprise: just reading a key can modify the dictionary. If you want to check without creating, use d.get("missing", 0) or "missing" in d instead.
💡 Key Takeaway: defaultdict, Counter, and ChainMap are not just convenience wrappers. They encode common patterns into data structures, reducing bugs and making intent clearer. Prefer them over hand-rolled logic whenever the pattern matches.
Hash Collisions in Practice: What Makes a Bad Hash Function
Understanding collisions is the difference between knowing how to use a dict and knowing when it might betray you. Most of the time, Python's built-in hash() function works perfectly. Strings, integers, tuples of immutable types: all produce well-distributed hashes. But there are cases where things go wrong, and if you do not understand why, you will spend hours debugging mysterious slowdowns.
The worst-case scenario for a hash table is when every key maps to the same bucket. If that happens, your O(1) lookup degrades to O(n), because the collision resolution strategy (whether chaining or open addressing) has to scan through all the colliding entries. This is not theoretical. In 2011, a group of security researchers published a paper showing that they could construct HTTP POST parameters whose keys all hash to the same bucket in most web frameworks' internal dicts. By sending a specially crafted request, an attacker could force the server to spend O(n²) time parsing form data, effectively creating a denial-of-service attack with a single request. This affected PHP, Java, Python, Ruby, and many other languages. Python added hash randomization (enabled by default since Python 3.3) as a countermeasure. Each time your Python process starts, it generates a random seed that gets mixed into string hashes.
This is why hash("hello") returns a different value every time you restart Python. If you have ever noticed this and been confused, now you know why. The randomization prevents attackers from predicting hash values and crafting collision attacks. But it also means you cannot rely on hash() for anything that needs to be consistent across process restarts, like caching to disk or generating deterministic file names. For those use cases, use hashlib.sha256() or similar, which always produces the same output for the same input.
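For example, hashlib gives you a digest that is stable across runs and machines, which is what a disk cache key needs:

```python
import hashlib

def stable_key(text: str) -> str:
    """Deterministic key for caching: same input, same digest, every run."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

print(stable_key("hello")[:16])
# 2cf24dba5fb0a30e -- identical on every run, unlike hash("hello")
```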
Another subtle source of bad hashing: custom objects with poorly implemented __hash__. When you define a class and want instances to be usable as dict keys or set members, you need to implement both __hash__ and __eq__. The critical contract is: if a == b, then hash(a) == hash(b). Violating this contract causes silent, maddening bugs where you insert an item into a dict and then cannot find it. I once spent an entire afternoon debugging a test failure that came down to a data class where __eq__ compared all fields but __hash__ only used one field. Two different objects hashed to the same value but were not equal, causing phantom duplicates in a set.
⚠️ Common Pitfall: If you override __eq__ on a class but forget to override __hash__, Python will set __hash__ to None, making instances unhashable. You will get TypeError: unhashable type when trying to use them as dict keys. Always implement both together.
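If you want the contract handled for you, a frozen dataclass derives __eq__ and __hash__ from the same fields automatically (UserId here is an illustrative class, not part of the lesson's project):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UserId:
    tenant: str
    name: str

a = UserId("acme", "alice")
b = UserId("acme", "alice")
assert a == b and hash(a) == hash(b)   # the contract holds by construction

lookup = {a: "record-1"}
print(lookup[b])   # record-1 -- equal keys find each other in the dict
```

Because frozen=True blocks mutation and the generated __hash__ uses exactly the fields __eq__ compares, the "phantom duplicates" bug described above cannot happen.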
| Scenario | Hash Quality | Performance Impact |
|---|---|---|
| Normal strings/ints | Excellent distribution | O(1) average lookups |
| All keys collide (attack) | All map to same bucket | O(n) per lookup, O(n²) total |
| Custom __hash__ returns constant | Worst possible | Same as attack scenario |
| hash() across restarts | Different each time (randomized) | No cross-process caching |
The practical takeaway: trust Python's built-in hashing for built-in types, use hashlib for cross-process consistency, and always implement __hash__ and __eq__ together for custom classes. If you ever need to debug a mysterious dict performance regression, the first thing to check is whether keys are colliding excessively. You can test this by examining hash(key) % table_size for a sample of your keys and checking the distribution.
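That distribution check takes only a few lines:

```python
from collections import Counter

def bucket_distribution(keys, table_size=8):
    """Count how many keys land in each bucket of a table_size-slot table."""
    return Counter(hash(k) % table_size for k in keys)

keys = [f"user_{i}" for i in range(1000)]
dist = bucket_distribution(keys)
print(dist)  # healthy hashing: roughly 1000/8 = 125 keys per bucket
assert sum(dist.values()) == len(keys)
```

If one bucket's count dwarfs the others, your keys are colliding and the hash function (or the key design) is the suspect.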
🤔 Think about it: Why can't you use a list as a dictionary key, but you can use a tuple?
Answer:
Lists are mutable: you can change their contents after creation. If a list were used as a dict key and then modified, its hash would change, but the dict would still have it stored under the old hash. The key would become "lost" in the wrong bucket, invisible to lookups. Tuples are immutable, so their contents (and therefore their hash) cannot change after creation. This is why Python requires dict keys to be hashable, and mutability is the primary reason an object cannot be hashable. Note: a tuple containing a list is also unhashable, because the list inside could still change.
💡 Key Takeaway: Hash quality directly determines dict and set performance. Python handles this well for built-in types, but custom objects and cross-process scenarios require careful attention.
Now that we understand the theory and internals, let's build something real. Time to write code.
The Complete Code: Hash Table from Scratch Meets Python's Best
This is our single comprehensive code block for the lesson. It builds a minimal hash table with chaining, demonstrates dict and set operations for the NetProbe project, and benchmarks everything. Read the inline comments carefully, because they connect every line back to the concepts we just covered.
# ============================================================
# Lesson 2: Hash Tables Under the Hood
# A complete script: custom hash table + dict/set power tools
# ============================================================
import time
from collections import defaultdict, Counter, ChainMap
# --- Part 1: Build a Hash Table from Scratch (Chaining) ---
class HashTable:
    """A minimal hash table using separate chaining for collision resolution."""

    def __init__(self, initial_size=8):
        """Start with 8 buckets, just like CPython's dict."""
        self.size = initial_size
        self.count = 0
        self.buckets = [[] for _ in range(self.size)]  # each bucket is a list of (key, value) pairs
        self.load_factor_threshold = 0.67  # CPython uses 2/3

    def _hash_index(self, key):
        """Compute bucket index: built-in hash() mod table size."""
        return hash(key) % self.size

    def _resize(self):
        """Double the table size and rehash all entries."""
        old_buckets = self.buckets
        self.size *= 2
        self.buckets = [[] for _ in range(self.size)]
        self.count = 0
        for bucket in old_buckets:
            for key, value in bucket:
                self.put(key, value)  # rehash into new, larger table

    def put(self, key, value):
        """Insert or update a key-value pair."""
        if self.count / self.size >= self.load_factor_threshold:
            self._resize()  # keep load factor below threshold
        index = self._hash_index(key)
        bucket = self.buckets[index]
        # Check if key already exists in this bucket's chain
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # update existing
                return
        bucket.append((key, value))  # new key, add to chain
        self.count += 1

    def get(self, key, default=None):
        """Retrieve value by key, O(1) average case."""
        index = self._hash_index(key)
        bucket = self.buckets[index]
        for existing_key, value in bucket:
            if existing_key == key:
                return value
        return default  # key not found

    def __contains__(self, key):
        """Support 'key in table' syntax."""
        sentinel = object()  # unique default, so stored None values still count as present
        return self.get(key, sentinel) is not sentinel

    def __len__(self):
        return self.count
# --- Part 2: Test Our Hash Table ---
print("=" * 55)
print("PART 1: Custom Hash Table with Chaining")
print("=" * 55)
ht = HashTable()
# Insert some user data
users = [("alice", {"name": "Alice Park", "role": "engineer"}),
("bob", {"name": "Bob Kim", "role": "designer"}),
("carol", {"name": "Carol Lee", "role": "engineer"}),
("dave", {"name": "Dave Chen", "role": "manager"}),
("eve", {"name": "Eve Santos", "role": "engineer"}),]
for username, data in users:
    ht.put(username, data)
print(f"Table size: {ht.size}, Items: {len(ht)}")
# Output: Table size: 8, Items: 5
print(f"Load factor: {ht.count / ht.size:.2f}")
# Output: Load factor: 0.62
print(f"Lookup 'carol': {ht.get('carol')}")
# Output: Lookup 'carol': {'name': 'Carol Lee', 'role': 'engineer'}
print(f"'alice' in table: {'alice' in ht}")
# Output: 'alice' in table: True
print(f"'zara' in table: {'zara' in ht}")
# Output: 'zara' in table: False
# --- Part 3: Dict vs List Lookup Benchmark ---
print("\n" + "=" * 55)
print("PART 2: Dict vs List Lookup Benchmark")
print("=" * 55)
# Build a list and a dict with the same data
user_count = 100_000
user_list = [{"username": f"user_{i}", "email": f"user_{i}@test.com"} for i in range(user_count)]
user_dict = {f"user_{i}": {"email": f"user_{i}@test.com"} for i in range(user_count)}
# Search for the LAST user (worst case for list)
target = f"user_{user_count - 1}"
# List search: O(n)
start = time.perf_counter()
result_list = next(u for u in user_list if u["username"] == target)
list_time = time.perf_counter() - start
# Dict lookup: O(1)
start = time.perf_counter()
result_dict = user_dict[target]
dict_time = time.perf_counter() - start
print(f"List search ({user_count:,} items): {list_time:.6f}s")
# Output: (varies, typically 0.005-0.02s)
print(f"Dict lookup ({user_count:,} items): {dict_time:.6f}s")
# Output: (varies, typically 0.000001-0.000005s)
speedup = list_time / dict_time if dict_time > 0 else float('inf')
print(f"Dict is ~{speedup:,.0f}x faster")
# Output: (varies, typically 1000-10000x faster)
# --- Part 4: Set Operations for Mutual Friends ---
print("\n" + "=" * 55)
print("PART 3: Set Operations - Mutual Friends")
print("=" * 55)
# Social graph: each user maps to a set of friends
social_graph = {
"alice": {"bob", "carol", "dave", "eve", "frank"},
"bob": {"alice", "carol", "dave", "eve", "grace"},
"carol": {"alice", "bob", "dave", "heidi"},
"dave": {"alice", "bob", "carol", "eve"},
"eve": {"alice", "bob", "dave", "frank"},
"frank": {"alice", "eve"},
"grace": {"bob"},
"heidi": {"carol"},
}
# Find mutual friends between alice and bob
mutual = social_graph["alice"] & social_graph["bob"] # set intersection
print(f"alice & bob mutual friends: {mutual}")
# Output: alice & bob mutual friends: {'carol', 'dave', 'eve'}
# (Note: set ordering may vary per run due to hash randomization)
# Friends of alice but NOT bob (set difference)
alice_only = social_graph["alice"] - social_graph["bob"]
print(f"alice's exclusive friends: {alice_only}")
# Output: alice's exclusive friends: {'bob', 'frank'} (ordering varies)
# (bob himself appears because he is in alice's friend set but, naturally,
# not in his own friend set -- the difference only removes bob's FRIENDS)
# All friends in either circle (set union)
combined = social_graph["alice"] | social_graph["bob"]
print(f"Combined network: {combined}")
# Output: Combined network: {'bob', 'carol', 'dave', 'eve', 'frank', 'alice', 'grace'}
# (ordering varies)
# Find ALL mutual friend pairs with counts
print("\nTop mutual friend pairs:")
pair_mutuals = []
users_list_sorted = sorted(social_graph.keys())
for i, user_a in enumerate(users_list_sorted):
for user_b in users_list_sorted[i + 1:]:
shared = social_graph[user_a] & social_graph[user_b]
if shared:
pair_mutuals.append((user_a, user_b, len(shared), shared))
pair_mutuals.sort(key=lambda x: x[2], reverse=True) # sort by count, descending
for user_a, user_b, count, shared in pair_mutuals[:5]:
print(f" {user_a} <-> {user_b}: {count} mutual ({shared})")
# Output: (top pairs with highest mutual friend counts, ordering of set elements varies)
# --- Part 4: defaultdict and Counter in Action ---
print("\n" + "=" * 55)
print("PART 4: defaultdict and Counter")
print("=" * 55)
# Group users by role using defaultdict
all_users = [("Alice", "engineer"), ("Bob", "designer"), ("Carol", "engineer"),
("Dave", "manager"), ("Eve", "engineer"), ("Frank", "designer"),
("Grace", "manager"), ("Heidi", "engineer"),]
by_role = defaultdict(list)
for name, role in all_users:
by_role[role].append(name) # no need to check if key exists
print("Users by role:")
for role, names in sorted(by_role.items()):
print(f" {role}: {names}")
# Output:
# designer: ['Bob', 'Frank']
# engineer: ['Alice', 'Carol', 'Eve', 'Heidi']
# manager: ['Dave', 'Grace']
# Count connection frequency with Counter
all_connections = []
for user, friends in social_graph.items():
all_connections.extend(friends)
connection_counts = Counter(all_connections)
print(f"\nMost connected users: {connection_counts.most_common(3)}")
# Output: Most connected users: [('alice', 5), ('bob', 5), ('dave', 4)]
# (alice and bob always lead with 5 connections each; carol, dave, and eve tie
# at 4, so which of them appears third varies with set iteration order)
# Counter arithmetic: combine connection data from two sources
source_a = Counter({"alice": 5, "bob": 3, "carol": 2})
source_b = Counter({"alice": 3, "bob": 7, "dave": 4})
combined_counts = source_a + source_b # element-wise addition
print(f"Combined counts: {combined_counts}")
# Output: Combined counts: Counter({'bob': 10, 'alice': 8, 'dave': 4, 'carol': 2})
# --- Part 5: ChainMap for Layered Configuration ---
print("\n" + "=" * 55)
print("PART 5: ChainMap for Layered Config")
print("=" * 55)
defaults = {"timeout": 30, "retries": 3, "debug": False, "host": "localhost"}
env_config = {"timeout": 60, "debug": True} # environment overrides
cli_flags = {"timeout": 10} # command-line overrides (highest priority)
config = ChainMap(cli_flags, env_config, defaults) # first dict has highest priority
print(f"timeout: {config['timeout']}") # from cli_flags
# Output: timeout: 10
print(f"debug: {config['debug']}") # from env_config (not in cli_flags)
# Output: debug: True
print(f"retries: {config['retries']}") # from defaults (not in cli or env)
# Output: retries: 3
print(f"host: {config['host']}") # from defaults
# Output: host: localhost
# Show which layer each value comes from
for key in ["timeout", "debug", "retries", "host"]:
for i, mapping in enumerate(["cli_flags", "env_config", "defaults"]):
if key in config.maps[i]:
print(f" {key} -> resolved from {mapping}")
break
# Output:
# timeout -> resolved from cli_flags
# debug -> resolved from env_config
# retries -> resolved from defaults
# host -> resolved from defaults
# --- Performance Summary ---
print("\n" + "=" * 55)
print("PERFORMANCE SUMMARY")
print("=" * 55)
operations = [("dict[key]", "O(1) avg", "Hash + index"),
("key in dict", "O(1) avg", "Same as lookup"),
("key in list", "O(n)", "Linear scan"),
("set_a & set_b", "O(min(n,m))", "Iterate smaller"),
("set_a | set_b", "O(n + m)", "Iterate both"),
("set_a - set_b", "O(n)", "Iterate left set"),
("Counter.most_common(k)", "O(n log k)", "Heap selection"),]
print(f"{'Operation':<28} {'Complexity':<14} {'Mechanism'}")
print("-" * 55)
for op, complexity, mechanism in operations:
print(f"{op:<28} {complexity:<14} {mechanism}")
# Output: formatted table of all operation complexities
📌 Remember: Copy this entire script and run it. Every part builds on concepts from this lesson, and the output demonstrates the performance differences we discussed. Modify the user_count variable in Part 2 to see how dict lookup stays constant while list search grows linearly.
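Before the speed round, one Counter behavior from the script deserves a closer look (it also appears in the summary table at the end): the - operator silently drops zero and negative counts. A minimal sketch, reusing the source_a / source_b values from Part 4:

```python
from collections import Counter

source_a = Counter({"alice": 5, "bob": 3, "carol": 2})
source_b = Counter({"alice": 3, "bob": 7, "dave": 4})

# The - operator keeps only POSITIVE results: bob (3 - 7) and dave (0 - 4) vanish
diff = source_a - source_b
print(diff)  # Counter({'alice': 2, 'carol': 2})

# If you need the zeros and negatives, use subtract(), which mutates in place
kept = Counter(source_a)
kept.subtract(source_b)
print(kept)  # Counter({'alice': 2, 'carol': 2, 'bob': -4, 'dave': -4})
```

This matters when diffing event counts between time windows: - answers "what increased," while subtract() preserves the full delta.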
Speed Round
Five rapid-fire questions. Answer each in under 30 seconds. No peeking.
1. What is the average time complexity of key in my_dict?
Answer
O(1). Dictionary membership testing uses hash-based lookup, not linear scanning.
2. True or false: Python dicts maintain insertion order as of Python 3.7.
Answer
True. It was an implementation detail in CPython 3.6 and became a language guarantee in Python 3.7.
3. What does defaultdict(int) return for a missing key?
Answer
0. The int() factory function returns 0 when called with no arguments.
4. Which set operation finds elements in A but not B: A & B, A - B, or A ^ B?
Answer
A - B (set difference). A & B is intersection, A ^ B is symmetric difference.
5. Why does hash("hello") return a different value each time you restart Python?
Answer
Hash randomization. Python mixes a random seed into string hashes on startup to prevent hash collision denial-of-service attacks. This has been the default since Python 3.3.
Common Gotchas: Mistakes I Have Seen (and Made)
This section is a collection of real mistakes from production code, code reviews, and my own embarrassing history. Each one involves hash tables, dicts, or sets, and each one is subtle enough to pass code review if you are not paying attention.
Gotcha 1: Mutating a dict while iterating over it. Python will raise a RuntimeError if you add or remove keys during a for key in my_dict loop. This is a safety feature, because the hash table could resize mid-iteration, invalidating the internal state. The fix is to iterate over a copy of the keys: for key in list(my_dict.keys()). I see this mistake at least once a month in code reviews.
⚠️ Common Pitfall:
for key in my_dict and del my_dict[some_key] in the same loop will crash. Always iterate over list(my_dict) if you plan to modify the dict.
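A minimal demonstration you can paste into a REPL (the cache dict here is hypothetical):

```python
cache = {"a": 1, "b": None, "c": 3, "d": None}

# Deleting keys mid-iteration raises RuntimeError on the next loop step
try:
    for key in cache:
        if cache[key] is None:
            del cache[key]
except RuntimeError as err:
    print(f"Crashed as expected: {err}")

# The fix: snapshot the keys with list() before mutating
for key in list(cache):
    if cache[key] is None:
        del cache[key]

print(cache)  # {'a': 1, 'c': 3}
```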
Gotcha 2: Using a mutable default argument with defaultdict. This is a Python classic, not specific to defaultdict, but it bites harder here. If you write defaultdict(lambda: []) versus defaultdict(list), both work, but some developers try defaultdict(lambda: some_list) where some_list is a variable, and then every default value is the same list object. Every key shares the same list. Mutations to one key's value appear in all others. This is the "mutable default argument" trap wearing a different hat.
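Here is the trap in miniature. Both defaultdicts below "work," but the first one hands every missing key the same list object:

```python
from collections import defaultdict

shared = []  # a single list object, created once

bad = defaultdict(lambda: shared)   # BUG: factory returns the SAME list every time
bad["engineers"].append("Alice")
bad["designers"].append("Bob")
print(bad["engineers"])  # ['Alice', 'Bob'] -- Bob leaked into engineers

good = defaultdict(list)            # FIX: list() builds a fresh list per missing key
good["engineers"].append("Alice")
good["designers"].append("Bob")
print(good["engineers"])  # ['Alice']
```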
Gotcha 3: Assuming dict ordering in Python 3.6 code that might run on 3.5. If your codebase supports Python 3.5 or earlier (some legacy enterprise systems still do), dict does not preserve insertion order. You need collections.OrderedDict for guaranteed ordering. I know it is 2026 and Python 3.5 sounds ancient, but I helped a financial services client migrate off Python 3.5 just last year. Legacy is everywhere.
Gotcha 4: Forgetting that set elements must be hashable. You cannot put lists, dicts, or sets inside a set. You can put tuples and frozensets. If you need a "set of sets," use frozenset. This comes up in graph algorithms where you want to track visited states, and each state is a collection. Convert to frozenset before adding to the visited set.
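A quick sketch of the visited-states pattern described above:

```python
visited = set()

state = ["alice", "bob"]  # a mutable list cannot go in a set
try:
    visited.add(state)
except TypeError as err:
    print(f"Unhashable: {err}")

# frozenset is immutable and hashable, so it works as a set element
visited.add(frozenset(state))

# Bonus: frozenset equality ignores order, which is usually what you want for states
print(frozenset(["bob", "alice"]) in visited)  # True
```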
Gotcha 5: Using dict() constructor vs {} literal. The literal {} is faster because it is compiled to a single BUILD_MAP bytecode instruction, while dict() involves a function call with all its overhead. For hot paths, this matters. In normal code, use whichever you find more readable, but be aware that {} creates an empty dict, not an empty set. An empty set requires set().
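You can see the bytecode difference yourself with the dis module (the disassembly format varies by Python version, so no exact output is shown for it):

```python
import dis

dis.dis("{}")      # compiles to a single BUILD_MAP instruction
dis.dis("dict()")  # a global name lookup plus a function call

# And the classic trap: {} is an empty dict, never an empty set
print(type({}))      # <class 'dict'>
print(type(set()))   # <class 'set'>
print(type({1, 2}))  # <class 'set'> -- non-empty braces with no colons ARE a set
```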
| Gotcha | Symptom | Fix |
|---|---|---|
| Mutating during iteration | RuntimeError | Iterate over list(my_dict) |
| Shared mutable default | All keys have same value | Use defaultdict(list), not a shared reference |
| Assuming order pre-3.7 | Random key order on older Python | Use OrderedDict for compatibility |
| Unhashable set elements | TypeError | Convert lists to tuples, sets to frozensets |
| {} vs set() | Empty {} is a dict, not a set | Use set() for empty sets |
💡 Key Takeaway: Most dict and set bugs come from violating the hashability contract or mutating during iteration. If something "should work" but does not, check these two things first.
Speed Round Bonus: The Failure-First Pattern
Let me show you a mistake so common that I built it into my onboarding training at my last startup. The task: count how many times each word appears in a log file.
❌ The wrong way (crashes on the very first word):
word_counts = {}
for word in words:
word_counts[word] += 1 # KeyError on first occurrence of any word!
This crashes immediately with KeyError. So the developer "fixes" it:
word_counts = {}
for word in words:
if word in word_counts:
word_counts[word] += 1
else:
word_counts[word] = 1
This works, but it probes the hash table up to three times per repeated word (the in check, then a read and a write for the +=). It is also six lines where one will do.
✅ The correct approach:
word_counts = Counter(words) # one line, one pass, optimized C implementation
Counter's inner counting loop is accelerated by a C helper in CPython (_collections._count_elements), so it runs significantly faster than any Python-level loop. It also gives you most_common(), arithmetic operations, and a clean API. There is no reason to write manual counting loops in Python. Ever. I am not being diplomatic here. If I see a manual counting loop in a code review, I will request changes.
🔨 Project Update
Here is NetProbe, our progressive project, updated with everything from Lessons 1 and 2. In Lesson 1, we built the CSV loader and linear search. In this lesson, we add dict-based indexing, set-based mutual friend detection, and a common-connections command. New code is marked with # NEW IN LESSON 2 comments.
Copy the entire block below into netprobe.py and run it with the commands shown at the bottom.
First, create the data file users.csv with this content:
username,fullname,role
alice,Alice Park,engineer
bob,Bob Kim,designer
carol,Carol Lee,engineer
dave,Dave Chen,manager
eve,Eve Santos,engineer
frank,Frank Wu,designer
grace,Grace Huang,manager
heidi,Heidi Tanaka,engineer
And create connections.csv:
user_a,user_b
alice,bob
alice,carol
alice,dave
alice,eve
alice,frank
bob,carol
bob,dave
bob,eve
bob,grace
carol,dave
carol,heidi
dave,eve
eve,frank
Now, the cumulative project code:
# netprobe.py - Network Probe CLI (Lessons 1 + 2)
# Usage:
# python netprobe.py lookup <username>
# python netprobe.py mutual <user_a> <user_b>
# python netprobe.py common-connections
import csv
import sys
import time
# --- Lesson 1: CSV Loading ---
def load_users_list(filepath):
"""Load users from CSV into a list of dicts (Lesson 1 approach)."""
users = []
with open(filepath, newline="") as f:
reader = csv.DictReader(f)
for row in reader:
users.append(row)
return users
# --- NEW IN LESSON 2: Dict-based index for O(1) lookup ---
def build_user_index(user_list):
"""Convert user list to a dict keyed by username. O(n) build, O(1) lookup."""
return {user["username"]: user for user in user_list}
# --- NEW IN LESSON 2: Load connections into a social graph (adjacency sets) ---
def load_social_graph(filepath):
"""Load connections CSV into {username: set_of_friends} using defaultdict."""
from collections import defaultdict
graph = defaultdict(set)
with open(filepath, newline="") as f:
reader = csv.DictReader(f)
for row in reader:
user_a = row["user_a"]
user_b = row["user_b"]
graph[user_a].add(user_b) # friendship is bidirectional
graph[user_b].add(user_a)
return dict(graph) # convert back to regular dict for clean output
# --- Lesson 1: Linear search (kept for comparison) ---
def linear_search(user_list, username):
"""O(n) linear scan through list. Lesson 1 approach."""
for user in user_list:
if user["username"] == username:
return user
return None
# --- NEW IN LESSON 2: Dict lookup ---
def dict_lookup(user_index, username):
"""O(1) dict-based lookup. Lesson 2 upgrade."""
return user_index.get(username)
# --- NEW IN LESSON 2: Mutual friends via set intersection ---
def find_mutual_friends(social_graph, user_a, user_b):
"""Find shared connections using set intersection. O(min(n, m))."""
friends_a = social_graph.get(user_a, set())
friends_b = social_graph.get(user_b, set())
return friends_a & friends_b
# --- NEW IN LESSON 2: Top mutual friend pairs ---
def top_common_connections(social_graph, top_n=5):
"""Find pairs with the most mutual friends."""
results = []
users_sorted = sorted(social_graph.keys())
for i, user_a in enumerate(users_sorted):
for user_b in users_sorted[i + 1:]:
mutual = find_mutual_friends(social_graph, user_a, user_b)
if mutual:
results.append((user_a, user_b, len(mutual), mutual))
results.sort(key=lambda x: x[2], reverse=True)
return results[:top_n]
# --- CLI Entry Point ---
def main():
if len(sys.argv) < 2:
print("Usage: python netprobe.py [lookup|mutual|common-connections] [args]")
sys.exit(1)
command = sys.argv[1]
# Load data (both lesson 1 and lesson 2 structures)
user_list = load_users_list("users.csv")
user_index = build_user_index(user_list) # NEW IN LESSON 2
social_graph = load_social_graph("connections.csv") # NEW IN LESSON 2
if command == "lookup" and len(sys.argv) == 3:
username = sys.argv[2]
# Lesson 1: linear search (for comparison)
start = time.perf_counter()
result_linear = linear_search(user_list, username)
linear_time = time.perf_counter() - start
# Lesson 2: dict lookup
start = time.perf_counter()
result_dict = dict_lookup(user_index, username)
dict_time = time.perf_counter() - start
if result_dict:
print(f"[NetProbe] Found user '{username}' in {dict_time:.6f}s (dict lookup)")
print(f" Linear search took {linear_time:.6f}s for comparison")
print(f" Data: {result_dict}")
else:
print(f"[NetProbe] User '{username}' not found.")
elif command == "mutual" and len(sys.argv) == 4:
user_a = sys.argv[2]
user_b = sys.argv[3]
start = time.perf_counter()
mutual = find_mutual_friends(social_graph, user_a, user_b)
elapsed = time.perf_counter() - start
print(f"[NetProbe] {user_a} and {user_b} share {len(mutual)} mutual connections: {mutual}")
print(f"[NetProbe] Computed via set intersection in {elapsed:.6f}s")
elif command == "common-connections":
results = top_common_connections(social_graph)
print("[NetProbe] Top pairs by mutual connections:")
for user_a, user_b, count, shared in results:
print(f" {user_a} <-> {user_b}: {count} shared {shared}")
else:
print("Usage: python netprobe.py [lookup|mutual|common-connections] [args]")
sys.exit(1)
if __name__ == "__main__":
main()
Run the project you've built so far:
$ python netprobe.py lookup alice
[NetProbe] Found user 'alice' in 0.000004s (dict lookup)
Linear search took 0.000012s for comparison
Data: {'username': 'alice', 'fullname': 'Alice Park', 'role': 'engineer'}
$ python netprobe.py mutual alice bob
[NetProbe] alice and bob share 3 mutual connections: {'carol', 'dave', 'eve'}
[NetProbe] Computed via set intersection in 0.000002s
$ python netprobe.py common-connections
[NetProbe] Top pairs by mutual connections:
alice <-> bob: 3 shared {'carol', 'dave', 'eve'}
alice <-> dave: 3 shared {'bob', 'carol', 'eve'}
alice <-> eve: 3 shared {'bob', 'dave', 'frank'}
bob <-> dave: 3 shared {'alice', 'carol', 'eve'}
...
(Exact timing values and set ordering will vary on your machine.)
Extending It
What if you wanted to add role-based filtering to the mutual friends query? Suppose you want to find mutual friends who are also engineers. This is where dict and set compose beautifully. You already have user_index mapping usernames to user data, and social_graph mapping usernames to friend sets. Combining them is trivial:
mutual = find_mutual_friends(social_graph, "alice", "bob")
engineers_only = {u for u in mutual if user_index[u]["role"] == "engineer"}
This two-line pattern, set intersection followed by a filtered comprehension, scales to surprisingly complex queries. Want mutual friends who are engineers AND have more than 3 connections? Add another condition to the comprehension. Want to rank them by connection count? Wrap it in sorted() with a key function. The hash table provides O(1) access to each user's data, so the comprehension's total cost is proportional to the number of mutual friends, not the total number of users. This is the composability that makes dict and set so powerful in practice.
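As a sketch of that chained pattern, using a hypothetical slice of the lesson's social_graph and user_index shapes (the extra "grace" connection for eve is invented here to break a ranking tie):

```python
social_graph = {
    "alice": {"bob", "carol", "dave", "eve", "frank"},
    "bob": {"alice", "carol", "dave", "eve", "grace"},
    "carol": {"alice", "bob", "dave", "heidi"},
    "dave": {"alice", "bob", "carol", "eve"},
    "eve": {"alice", "bob", "dave", "frank", "grace"},
}
user_index = {
    "carol": {"role": "engineer"},
    "dave": {"role": "manager"},
    "eve": {"role": "engineer"},
}

# Step 1: set intersection -> mutual friends
mutual = social_graph["alice"] & social_graph["bob"]  # {'carol', 'dave', 'eve'}

# Step 2: filtered comprehension -> only engineers (O(1) dict lookup per member)
engineers = {u for u in mutual if user_index[u]["role"] == "engineer"}

# Step 3: rank by connection count, most-connected first
ranked = sorted(engineers, key=lambda u: len(social_graph[u]), reverse=True)
print(ranked)  # ['eve', 'carol']
```

Each step is a single expression, yet together they answer "which engineers do alice and bob both know, ranked by popularity?"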
Spotify uses a similar pattern for their collaborative playlist recommendations. When two users share a collaborative playlist, Spotify computes the intersection of their listening histories (set intersection), then filters by genre and recency (dict lookups on track metadata), then ranks by a scoring function. The core data structure pattern is the same one you just built in NetProbe.
💡 Key Insight: The real power of hash-based data structures is not any single operation. It is how they compose. Build an index once, then combine lookups and set operations to answer complex questions without writing complex code.
Summary Table
| Concept | What It Does | When to Use | Watch Out For |
|---|---|---|---|
| Hash Table | Maps keys to values via computed index | Any key-value association or fast lookup | Collisions degrade performance; keys must be hashable |
| dict | Python's built-in hash map (open addressing) | Default choice for key-value storage | Do not mutate during iteration; {} is a dict, not a set |
| set | Hash table without values | Membership testing, deduplication, set math | Elements must be hashable; unordered (no indexing) |
| defaultdict | Dict with automatic default values | Grouping, counting, accumulation | Accessing a missing key creates it (side effect) |
| Counter | Dict subclass for frequency counting | Word counts, top-N, histogram data | Subtraction drops zero/negative counts |
| ChainMap | Layered dict lookup without copying | Configuration, scoped variable resolution | Writes only affect the first mapping |
| Hash function | Converts key to integer index | Underlies all dict/set operations | Must be deterministic; randomized for strings across restarts |
| Load factor | Ratio of items to buckets | Internal resize trigger | Above 0.67 triggers rehash in CPython |
| Set intersection (&) | Elements in both sets | Mutual friends, shared permissions, overlap | O(min(n,m)), not O(n+m) |
| Chaining vs Open Addressing | Two collision resolution strategies | Chaining is simpler; open addressing is faster | CPython uses open addressing with perturbation |
Next lesson, we shift to controlled access. Stacks, queues, and deques are not just academic exercises. They are the data structures that power undo systems, breadth-first search, and every task scheduler you have ever used. We will build a browser history stack and a print queue from scratch, then replace them with Python's collections.deque, which gives you O(1) operations at both ends, something lists cannot do without the quadratic trap we saw in Lesson 1.
Difficulty Fork
Too Easy? (review and preview)
Key takeaways from this lesson:
- Hash tables provide O(1) average-case lookup by computing an array index from the key
- Python's dict uses open addressing with perturbation probing and maintains insertion order since 3.7
- set operations (intersection, union, difference) replace nested loops with single expressions
- defaultdict, Counter, and ChainMap encode common dict patterns into specialized types
Preview of Lesson 3: We will explore restricted-access data structures: stacks (LIFO), queues (FIFO), and deques (double-ended). You will implement an undo/redo system using a stack pair and a BFS crawler using a queue. The key insight: sometimes limiting how you access data makes your code safer and your algorithms more correct.
Challenge to prepare: Implement an LRU (Least Recently Used) cache using a dict and a doubly-linked list. This is LeetCode problem 146 and it directly combines hash table knowledge from this lesson with the linked list concepts coming in Lesson 4. Alternatively, explore functools.lru_cache and read its CPython source to see how the standard library solves this.
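If you take the functools.lru_cache route, here is a quick demo of the behavior you will be reimplementing (a sketch of the stdlib API, not the LeetCode solution):

```python
from functools import lru_cache

@lru_cache(maxsize=3)
def expensive(n):
    print(f"computing {n}")  # only fires on a cache miss
    return n * n

expensive(1)
expensive(2)
expensive(1)  # cache hit: no "computing" line, no recomputation

print(expensive.cache_info())
# CacheInfo(hits=1, misses=2, maxsize=3, currsize=2)
```

Under the hood, CPython's lru_cache uses exactly the design the challenge asks for: a dict for O(1) key lookup plus a circular doubly-linked list to track recency.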
Just Right (reinforce)
Alternative analogy: The Coat Check Room
Think of a hash table as a coat check at a concert venue. You hand over your coat (the value) and receive a numbered ticket (the hash). When you want your coat back, you give the ticket number and the attendant walks directly to that hook, no searching through every coat. The ticket number IS the address.
Collisions happen when two people get the same ticket number (maybe the numbering system is bad). Chaining is like hanging multiple coats on the same hook with tags. Open addressing is like the attendant checking the next hook over until finding an empty one.
Practice exercise: Take any Python script you have written that uses if x in my_list inside a loop. Convert the list to a set or dict and measure the performance difference. Even with small data, building this habit will save you from O(n-squared) traps when data grows.
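A minimal harness for that exercise (the sizes here are arbitrary and timings vary by machine):

```python
import time

haystack_list = list(range(100_000))
haystack_set = set(haystack_list)        # one-time O(n) conversion
needles = range(0, 100_000, 1000)        # 100 lookups spread across the range

start = time.perf_counter()
hits_list = sum(1 for n in needles if n in haystack_list)  # O(n) scan per lookup
list_time = time.perf_counter() - start

start = time.perf_counter()
hits_set = sum(1 for n in needles if n in haystack_set)    # O(1) hash per lookup
set_time = time.perf_counter() - start

print(f"list: {list_time:.4f}s, set: {set_time:.6f}s ({hits_list} hits each)")
```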
Want a Challenge?
Interview problem: Group Anagrams (LeetCode 49)
Given a list of strings, group them by anagram. For example, ["eat", "tea", "tan", "ate", "nat", "bat"] should return [["eat", "tea", "ate"], ["tan", "nat"], ["bat"]].
Hint: Two strings are anagrams if they have the same characters in the same frequencies. What if you sorted each string and used the sorted version as a dict key?
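If you want to check your answer afterward, here is one possible solution following that hint (try it yourself first):

```python
from collections import defaultdict

def group_anagrams(words):
    groups = defaultdict(list)
    for word in words:
        # Anagrams collapse to the same key once their letters are sorted
        key = "".join(sorted(word))
        groups[key].append(word)
    return list(groups.values())

print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
```

Sorting each word costs O(k log k); for very long strings you could use a character-count tuple as the dict key instead.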
Production-level extension: You are building a plagiarism detection system. You have 100,000 student essays. For each essay, you extract a set of 5-word phrases (called "shingles"). Two essays are suspiciously similar if their shingle sets have a Jaccard similarity (size of intersection divided by size of union) above 0.3. Design a system using sets and dicts that can identify all suspicious pairs without comparing every pair of essays. This is the actual approach used by systems like MOSS (Stanford's Measure of Software Similarity) and Turnitin.
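To make the design concrete, here is a toy sketch of the inverted-index idea (the essay texts and the k=3 shingle size are invented for illustration; real systems use k around 5 and add MinHash/LSH on top to shrink the candidate set further):

```python
from collections import defaultdict
from itertools import combinations

def shingles(text, k=3):
    """Return the set of k-word phrases in text (k=3 so the toy essays overlap)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

# Hypothetical mini-corpus standing in for 100,000 essays
essays = {
    "e1": "the quick brown fox jumps over the lazy dog",
    "e2": "a quick brown fox jumps over a sleeping cat",
    "e3": "completely different text about hash tables and sets",
}
shingle_sets = {eid: shingles(text) for eid, text in essays.items()}

# Inverted index: shingle -> set of essay ids containing it
index = defaultdict(set)
for eid, shs in shingle_sets.items():
    for sh in shs:
        index[sh].add(eid)

# Candidate pairs: only essays that share at least one shingle ever get compared
candidates = set()
for eids in index.values():
    candidates.update(combinations(sorted(eids), 2))

# Score just the candidates with Jaccard similarity
for a, b in sorted(candidates):
    inter = shingle_sets[a] & shingle_sets[b]
    union = shingle_sets[a] | shingle_sets[b]
    print(f"{a} vs {b}: Jaccard = {len(inter) / len(union):.2f}")
# e1 vs e2: Jaccard = 0.27  (e3 shares no shingles, so it is never compared)
```

The payoff is that the quadratic comparison only runs over essays the index already proved have something in common.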