Weighted Graphs, Dijkstra's Algorithm, and the Complete NetProbe

Lesson 10 (71 min, 8,624 chars)

Learning Objectives

  • Implement Dijkstra's shortest path algorithm using heapq and explain its O((V+E) log V) complexity
  • Integrate multiple data structures (hash maps, heaps, graphs, caches, sorted indexes) into a cohesive application
  • Write pytest test cases that verify correctness of graph algorithms including edge cases like disconnected nodes and self-loops


This is the lesson where everything clicks together. Over the past nine lessons, you have built up an arsenal of data structures: dynamic arrays, hash maps, stacks, queues, heaps, trees, and graphs. Each one solved a specific class of problem. Today, we combine them all into a single, polished application. More importantly, we tackle the one massive limitation of the graph work we did in Lesson 9: every edge was treated equally, and in the real world, that is almost never true.

Last lesson, our NetProbe could find the shortest path between two people and detect communities. BFS counted hops, DFS found connected components, and we learned from a real LinkedIn incident that graph bugs usually live in data construction, not traversal logic. But our graph was unweighted. Alice talking to Bob once a year and Carol messaging Dave fifty times a month looked identical to BFS. That is like saying every road on a map is the same length. Technically you can navigate, but the directions will be terrible.

Here is what NetProbe produces when you run the final version today:

Text
=======================================================
  NetProbe v1.0 - Social Network Analyzer
=======================================================

[stats]
  Nodes: 10  Edges: 13
  Communities: 1  Largest: 10 members

[shortest-path] Alice -> Judy (fewest hops)
  Alice -> Bob -> Frank -> Grace -> Heidi -> Ivan -> Judy
  Hops: 6

[strongest-path] Alice -> Judy (strongest connections)
  Alice -> Carol -> Dave -> Eve -> Frank -> Grace -> Heidi -> Ivan -> Judy
  Cost: 0.5686

[compare] Alice -> Frank
  BFS (fewest hops):    Alice -> Bob -> Frank  (2 hops)
  Dijkstra (strongest): Alice -> Carol -> Dave -> Eve -> Frank  (cost 0.2983)

Notice how the two algorithms disagree. BFS sends Alice through Bob's weak connection to Frank (they exchanged 2 messages total). Dijkstra routes her through Carol, Dave, and Eve, where every link is a real, active relationship. More hops, but stronger connections all the way through. That distinction is what you build today, and it is the difference between a toy project and production-grade network analysis.


Project Brief

You are completing NetProbe, a social network analysis CLI that has been growing since Lesson 9. This final milestone adds weighted edges, Dijkstra's shortest-path algorithm, a heap-based "top connectors" feature, a comprehensive pytest suite, and polished output formatting. When you are done, you will have a portfolio-worthy project that integrates hash maps, heaps, graphs, queues, and sorted data structures in one cohesive system.

Acceptance criteria for this lesson are concrete. First, every edge in the graph must carry a numeric weight representing interaction frequency. Second, a strongest_path command must use Dijkstra's algorithm to route through the highest-quality connections. Third, a top_connectors command must use a min-heap (exactly like the Top-K pattern from Lesson 8) to rank users by total interaction weight. Fourth, a pytest test suite must cover BFS, Dijkstra, community detection, disconnected nodes, self-paths, and validation. Fifth, running the script should produce clean, labeled output that a non-engineer could read.

| Feature | Data Structure | Algorithm | Lesson Origin |
| --- | --- | --- | --- |
| User lookup | dict (hash map) | O(1) key access | Lesson 2 |
| Shortest path (hops) | deque + set | BFS | Lesson 9 |
| Strongest path (weighted) | heapq + dict | Dijkstra | Lesson 10 |
| Community detection | list (stack) + set | DFS | Lesson 9 |
| Top connectors | heapq (min-heap) | Top-K selection | Lesson 8 |
| Graph validation | defaultdict(dict) | Symmetric edge check | Lesson 9 |

That table is the architectural blueprint for everything NetProbe does. Every row connects a user-facing feature to the data structure and algorithm powering it, and every single one traces back to a lesson in this course. This is deliberate. Real systems are not built from a single clever algorithm. They are built from ten ordinary data structures working in concert. Google's original PageRank combined graph traversal with linear algebra. Spotify's recommendation engine mixes graph-based collaborative filtering with heap-ranked playlists. The power is always in the integration.

💡 Key Insight: The hardest part of building a real system is not implementing any single algorithm. It is choosing the right data structure for each sub-problem and making them work together cleanly.


Architecture: From Unweighted Edges to Weighted Reality

In Lesson 9, we stored our graph as defaultdict(set). Each key was a user, each value was a set of neighbors. This representation was clean and fast: adding an edge was O(1), checking adjacency was O(1), and iterating neighbors was O(degree). But it could not answer the question "how strong is this connection?" because sets do not carry metadata. A set just says "connected" or "not connected." Binary. On or off.

Weighted graphs need a richer representation. Instead of {user: set_of_neighbors}, we need {user: {neighbor: weight}}. That inner container changes from a set to a dict. This is a small syntactic shift but a large conceptual one. Every edge now carries a number. In our social network, that number is interaction frequency: how many messages two people exchanged in the past month. In a road network, it would be distance in kilometers. In a computer network, it would be latency in milliseconds.

I want to keep both representations in NetProbe. The unweighted adjacency list (defaultdict(set)) still powers BFS, which only cares about hop count. The weighted adjacency list (defaultdict(dict)) powers Dijkstra and top-K calculations. Keeping both might seem redundant, but it reflects a real design pattern: Google Maps stores both a simplified road graph for quick routing and a detailed graph with turn restrictions, speed limits, and toll costs for accurate navigation. Different algorithms need different views of the same data.
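As a concrete sketch of this dual-representation idea (the class and method names mirror the lesson's description, but this is illustrative code, not the final NetProbe source):

```python
from collections import defaultdict

class NetProbe:
    def __init__(self):
        self.adj = defaultdict(set)       # unweighted view: feeds BFS/DFS
        self.weights = defaultdict(dict)  # weighted view: feeds Dijkstra, top-K

    def add_connection(self, a, b, weight):
        # Undirected edge: every add updates both views, in both directions
        self.adj[a].add(b)
        self.adj[b].add(a)
        self.weights[a][b] = weight
        self.weights[b][a] = weight

net = NetProbe()
net.add_connection("Alice", "Bob", 2)
print(net.adj["Alice"])      # {'Bob'}
print(net.weights["Alice"])  # {'Bob': 2}
```

Because both structures are written in one place, they cannot drift out of sync, which is the main risk of keeping two views of the same data.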

| Representation | Python Type | Stores Weight? | Best For |
| --- | --- | --- | --- |
| Adjacency set | defaultdict(set) | No | BFS, DFS, connectivity checks |
| Adjacency dict | defaultdict(dict) | Yes | Dijkstra, weight-based ranking |
| Adjacency matrix | 2D list/numpy array | Yes | Dense graphs, matrix operations |

For NetProbe's social network of hundreds or thousands of users, adjacency lists win decisively over matrices. Social graphs are sparse. The average Facebook user has roughly 300 friends out of billions of possible connections. An adjacency matrix for a million users would consume a trillion entries, most of them zero. Our adjacency dict stores only the edges that exist, which is proportional to the number of friendships, not the square of the user count. This is the same space-versus-time trade-off we discussed in Lesson 6 when comparing sort algorithms: the right choice depends on the shape of your data, not just the theoretical complexity.

Deep Dive: Why not use a single dict-of-dicts and drop the set-based graph?

You could. If every edge has a weight, you can use defaultdict(dict) exclusively and treat self.weights[user].keys() as the neighbor set. BFS would iterate over self.weights[current].items() and ignore the values. This simplifies the code but couples BFS to the weighted representation. If you later add directed edges (where A follows B but B does not follow A) or multi-graphs (multiple edge types between the same pair), having separate representations gives you flexibility. In production systems like Neo4j's graph database, different query engines read from different index structures over the same underlying data. The principle is the same.

🤔 Think about it: If you have N users and E friendships in a social network, what is the space complexity of defaultdict(dict) for the weighted graph?

View Answer

O(N + E). We store N keys in the outer dict (one per user) and each edge appears twice (once per direction), so the inner dicts collectively hold 2E entries. Since E is typically much smaller than N-squared in a sparse social graph, this is vastly more efficient than an N-by-N adjacency matrix at O(N-squared).


The Wrong Way First: Why BFS Breaks on Weighted Graphs

Before we build Dijkstra, let me show you the bug I shipped at a startup in 2016. We were building an internal tool that routed customer support tickets to the "closest" available agent. "Closest" was defined by a weighted graph of expertise overlap: agents who shared more skills with a ticket's topic had lower routing cost. I used BFS because it was already in our codebase from a different feature. The routing worked great in testing, where every agent had similar expertise. In production, it was a disaster.

BFS finds the path with the fewest edges, period. It does not look at weights. If Agent A is one hop away but has zero relevant expertise (weight 1000), and Agent B is three hops away through a chain of highly qualified agents (weights 1, 1, 1), BFS picks Agent A every time. The customer gets the wrong person, the ticket bounces, and resolution time doubles.

Here is the core problem in code:

Do not do this:

Python
# BFS on a weighted graph - ignores weights entirely
path = net.shortest_path("Alice", "Frank")
# Returns: Alice -> Bob -> Frank (2 hops)
# But Bob-Frank weight is 2 (almost no interaction!)

That path looks efficient because it is short. But the Bob-Frank edge has a weight of 2 messages per month. These two people barely know each other. If you are trying to route a message through people who actually talk to each other, this path is useless. It is like GPS routing you down a dirt road because it is geometrically shorter than the highway.

Use Dijkstra for weighted graphs:

Python
# Dijkstra considers edge weights
path, cost = net.strongest_path("Alice", "Frank")
# Returns: Alice -> Carol -> Dave -> Eve -> Frank (4 hops)
# Every edge on this path has high interaction frequency

The Dijkstra path is twice as long in hops but passes through real, active relationships. Alice talks to Carol 30 times a month. Carol talks to Dave 25 times. Dave talks to Eve 5 times. Eve talks to Frank 40 times. Every link is a genuine connection. The BFS path went through a link where two people exchanged 2 messages in a month, which is barely an acquaintance.

⚠️ Common Pitfall: Using BFS on weighted graphs is one of the most common mistakes in production code. It is not wrong in the sense of crashing. It is wrong in the sense of giving subtly incorrect results that look plausible, which is far more dangerous. At my startup, it took us three weeks to realize the routing was bad because the tickets were still getting resolved, just slowly.

| Algorithm | Considers Weights? | Finds | Time Complexity |
| --- | --- | --- | --- |
| BFS | No | Fewest hops | O(V + E) |
| Dijkstra | Yes | Lowest total cost | O((V + E) log V) |
| Bellman-Ford | Yes (including negative) | Lowest total cost | O(V * E) |

The complexity difference matters less than the correctness difference. Dijkstra is slower than BFS by a log V factor due to the heap operations (remember from Lesson 8, each heappush and heappop is O(log n)). For a social network with 10,000 users and 50,000 edges, that is the difference between 60,000 operations and roughly 800,000 operations. Both finish in milliseconds. But one gives you the right answer and the other does not.

💡 Key Takeaway: BFS and Dijkstra answer fundamentally different questions. BFS answers "what is the fewest edges to cross?" Dijkstra answers "what is the cheapest total path?" Never substitute one for the other.


Dijkstra's Algorithm: Greedy Shortest Paths with a Min-Heap

Dijkstra's algorithm is a greedy algorithm that builds shortest paths outward from a source node, one node at a time, always picking the cheapest next step. It was invented by Edsger Dijkstra in 1956 and published in 1959. The story goes that he designed it in about twenty minutes while sitting at a cafe in Amsterdam, without pen and paper, because he wanted to demonstrate that you could solve complex problems through careful thinking alone. Whether or not that story is entirely true, the algorithm itself is elegant in its simplicity.

The mental model that works best here is a spreading ripple. Imagine dropping a stone into a pond where the water flows at different speeds in different directions, depending on depth. The first ripple to reach any given point took the fastest available route. Dijkstra works the same way: it expands outward from the start node, always extending the shortest known path, and the first time it reaches any node, that path is guaranteed to be optimal. This greedy property only holds when all edge weights are non-negative, which is a critical constraint we will examine shortly.

The algorithm needs three pieces of state. First, a distance table: a dictionary mapping each node to the shortest known distance from the source. Initially, the source has distance 0 and everything else is infinity. Second, a predecessor table: a dictionary mapping each node to the previous node on its shortest path, used to reconstruct the path at the end. Third, a priority queue (min-heap) that always gives us the unvisited node with the smallest tentative distance.

Here is the step-by-step logic:

  1. Push the source onto the min-heap with distance 0.
  2. Pop the node with the smallest distance from the heap.
  3. If this node has already been visited, skip it (stale entry).
  4. Mark the node as visited.
  5. For each unvisited neighbor, calculate the tentative distance through the current node.
  6. If this tentative distance is smaller than the neighbor's current best, update the distance table and push the new entry onto the heap.
  7. Repeat from step 2 until the heap is empty or you pop the target node.
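Assuming a {node: {neighbor: cost}} adjacency dict, the seven steps translate into a compact function. This is a minimal sketch, not the full strongest_path method (which also converts interaction strength to cost before relaxing edges):

```python
import heapq

def dijkstra(weights, source, target):
    """Shortest path over non-negative edge costs, with lazy deletion.
    `weights` maps each node to {neighbor: cost}."""
    dist = {source: 0.0}
    prev = {}
    visited = set()
    heap = [(0.0, source)]                    # step 1: source at distance 0
    while heap:
        d, node = heapq.heappop(heap)         # step 2: cheapest tentative node
        if node in visited:
            continue                          # step 3: stale duplicate entry
        visited.add(node)                     # step 4: finalize this node
        if node == target:
            break
        for neighbor, cost in weights.get(node, {}).items():
            if neighbor in visited:
                continue
            tentative = d + cost              # step 5: distance through node
            if tentative < dist.get(neighbor, float("inf")):
                dist[neighbor] = tentative    # step 6: better path found
                prev[neighbor] = node
                heapq.heappush(heap, (tentative, neighbor))
    if target not in visited:
        return None, float("inf")             # no path exists
    path = [target]                           # walk predecessors backward
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]

g = {"A": {"B": 5, "C": 2}, "B": {"A": 5, "C": 1}, "C": {"A": 2, "B": 1}}
print(dijkstra(g, "A", "B"))  # (['A', 'C', 'B'], 3.0)
```

The (cost, node) tuples work here because strings are comparable; for non-comparable node objects you would add a counter as a tiebreaker, as discussed later in the lesson.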

The min-heap is doing all the heavy lifting here. Remember from Lesson 8 that Python's heapq module gives us O(log n) pushes and pops. Without a heap, you would need to scan all unvisited nodes to find the minimum distance, which is O(V) per step and O(V-squared) overall. With the heap, each node is pushed and popped at most a small number of times (proportional to its degree), giving us O((V + E) log V) total. This is the same complexity insight that made our Top-K solution in Lesson 8 efficient: the heap lets you maintain sorted order without sorting everything.

There is a subtlety with Python's heapq that trips people up. The heapq module does not support a "decrease-key" operation. In textbook Dijkstra, when you find a shorter path to a node, you update its priority in the heap. Python's heap does not let you do that. Instead, we use "lazy deletion": we push a new entry with the updated distance and leave the old entry in the heap. When we pop a stale entry later, we check if the node has already been visited and skip it. This is why step 3 exists. It wastes a bit of memory (the heap can have duplicate entries), but it is simple and fast in practice. Spotify's backend uses the same lazy-deletion pattern in their graph-based recommendation engine.

Deep Dive: Why is O((V + E) log V) the right complexity?

Each vertex is popped from the heap at most once (after which it is marked visited and all future pops are skipped). Each edge is examined at most once when its source vertex is processed, potentially causing one heap push. So we have at most V pops and E pushes. Each heap operation is O(log H) where H is the heap size, and H is at most E (since we push at most once per edge relaxation). So the total is O(V log E + E log E). Since E is at most V-squared, log E is at most 2 log V, which simplifies to O((V + E) log V). For sparse graphs where E is proportional to V, this becomes O(V log V), which is phenomenally fast. Compare this to Lesson 6's sorting discussion: merge sort is O(n log n), and Dijkstra on a sparse graph has the same shape, which is not a coincidence. Both are fundamentally about maintaining order over a changing dataset.

Now let me connect this to the "strongest path" concept in NetProbe. Our edge weights represent interaction frequency: higher means stronger. But Dijkstra finds the path with the minimum total cost, not the maximum. We need to convert "strength" to "cost." The simplest conversion is cost = 1 / strength. A connection where two people exchange 50 messages per month has a cost of 0.02, while a connection with 2 messages has a cost of 0.5. Dijkstra naturally prefers the high-frequency paths because they have lower cost. This inversion trick is standard in network analysis. Cisco's OSPF routing protocol uses a similar formula: link cost = reference bandwidth / actual bandwidth.
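The conversion itself is a one-liner; edge_cost is a hypothetical helper name used only to show the numbers from the text:

```python
def edge_cost(strength):
    # Higher interaction frequency -> lower traversal cost for Dijkstra
    return 1.0 / strength

print(edge_cost(50))  # 0.02: 50 messages/month is a cheap edge to cross
print(edge_cost(2))   # 0.5:  2 messages/month is an expensive one
```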

Peer Teaching Prompt: Explain Dijkstra's algorithm to a friend who has never heard of it, in exactly 3 sentences.

Model Answer

Dijkstra's algorithm finds the cheapest path through a weighted graph by always extending the shortest known path first. It uses a priority queue to pick the next node to explore, guaranteeing that once a node is finalized, no cheaper route to it exists. The algorithm works only when all edge weights are non-negative, because it relies on the principle that adding more edges to a path can never make it cheaper.

🤔 Think about it: In our NetProbe implementation, if two users are connected by an edge with weight 100 and also by a two-hop path through a third user where both edges have weight 10, which path does Dijkstra's strongest-path choose?

View Answer

The direct edge with weight 100 has cost 1/100 = 0.01. The two-hop path has cost 1/10 + 1/10 = 0.2. Dijkstra picks the direct edge because 0.01 is less than 0.2. This makes intuitive sense: one strong connection is "closer" than two medium connections. The cost function 1/weight naturally prefers concentrated strength over diffuse paths.


Negative Weights: Where Dijkstra Falls Apart

Dijkstra is the right tool in the vast majority of cases, but it has one non-negotiable constraint: all edge weights must be non-negative. This is not a minor footnote. It is a fundamental limitation that will bite you in production if you ignore it. The greedy property, that the first time you visit a node is optimal, depends on the assumption that adding more edges to a path can never reduce its total cost. With negative weights, a longer path could actually be cheaper, which violates the invariant.

Here is a concrete example. Imagine three users: A, B, and C. A to B costs 1, A to C costs 2, and C to B costs negative 10. Dijkstra pops B first because its tentative distance of 1 is the smallest, marks B as done, and never reconsiders it. It therefore never discovers that A-C-B has cost 2 + (-10) = -8, which is much cheaper. The algorithm confidently returns the wrong answer with no error or warning.
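A minimal sketch makes the failure tangible. This toy distance-only implementation follows the step list from the previous section (illustrative names, not NetProbe's actual code); on a graph with a negative edge it settles a node too early and returns a wrong answer without complaint:

```python
import heapq

def dijkstra_dist(weights, source):
    """Minimal distance-only Dijkstra. Correct ONLY for non-negative costs."""
    dist = {source: 0}
    visited = set()
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)  # node is now "done" and never revisited
        for nb, cost in weights.get(node, {}).items():
            if nb in visited:
                continue   # finalized neighbors are never relaxed again
            if d + cost < dist.get(nb, float("inf")):
                dist[nb] = d + cost
                heapq.heappush(heap, (d + cost, nb))
    return dist

# A->B = 1, A->C = 2, C->B = -10: the true cheapest A-to-B cost is -8,
# but B (tentative distance 1) is finalized before the negative edge is seen.
g = {"A": {"B": 1, "C": 2}, "C": {"B": -10}}
print(dijkstra_dist(g, "A")["B"])  # 1 -- silently wrong, no error raised
```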

If you ever need negative weights, use the Bellman-Ford algorithm instead. Bellman-Ford relaxes every edge V-1 times, handling negative weights correctly at the cost of O(V * E) complexity instead of O((V + E) log V). It can even detect negative-weight cycles, situations where you can reduce cost infinitely by going in circles. In financial systems, negative-weight cycles represent arbitrage opportunities: currency exchange rates where converting A to B to C to A yields a profit. Bloomberg's trading systems use Bellman-Ford variants for exactly this purpose.

For NetProbe, negative weights do not apply. Interaction frequency is always positive. You cannot exchange negative-three messages with someone. But I mention this because I have seen engineers reach for Dijkstra in contexts where weights represent things like profit minus cost, which can go negative. If you remember one thing: Dijkstra requires non-negative weights. If that constraint does not hold, switch algorithms.

📌 Remember: Dijkstra = non-negative weights only. Bellman-Ford = handles negatives but is slower. This is not a preference. It is a correctness requirement.

| Algorithm | Negative Weights? | Negative Cycles? | Complexity |
| --- | --- | --- | --- |
| Dijkstra | No | No | O((V + E) log V) |
| Bellman-Ford | Yes | Detects them | O(V * E) |
| Floyd-Warshall | Yes | Detects them | O(V-cubed) |

That table captures the complete landscape of shortest-path algorithms you will encounter in practice. Dijkstra is the workhorse for the common case. Bellman-Ford is the safety net for tricky weight functions. Floyd-Warshall computes all pairs at once, which is useful for small, dense graphs but impractical for anything at scale.

💡 Key Takeaway: Always verify that your edge weights are non-negative before using Dijkstra. If they are not, your algorithm will produce wrong answers silently, which is the worst kind of bug.


Implementation: Building Dijkstra into NetProbe

Let me walk through the key implementation decisions before you see the complete code. The NetProbe class maintains two parallel graph representations: self.adj (a defaultdict(set) for unweighted traversal) and self.weights (a defaultdict(dict) for weighted operations). Every call to add_connection updates both. This dual representation costs a small amount of extra memory but keeps BFS and Dijkstra cleanly separated.

The strongest_path method is the heart of this lesson. It implements Dijkstra's algorithm using three local variables: dist (the distance dictionary, keyed by node name), prev (the predecessor dictionary for path reconstruction), and heap (Python's heapq-managed list of (cost, node) tuples). When we pop a node from the heap, we first check if it has been visited. If yes, it is a stale entry from a previous, non-optimal push, and we skip it. This lazy deletion pattern, which we first saw implicitly in Lesson 8's heap discussion, is the standard Python approach since heapq lacks a decrease-key operation.

The cost inversion 1.0 / weight is applied at edge relaxation time, not at storage time. I store raw interaction frequencies in self.weights because those numbers are meaningful to users. Converting to cost only happens inside the algorithm. This separation of concerns matters: if you later want a different cost function (say, 1.0 / log(weight) for diminishing returns), you change one line in strongest_path instead of migrating all your stored data.

Path reconstruction uses the predecessor dictionary. When Dijkstra reaches the target node, we walk backward through prev from the target to the source, collecting nodes into a list, then reverse it. Virtually every shortest-path algorithm uses this same technique. The alternative, storing the full path at each node, wastes O(V-squared) memory and is unnecessary.
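The walk-backward idea in isolation, with a hypothetical prev map standing in for what a Dijkstra run would produce:

```python
def reconstruct_path(prev, source, target):
    """Walk a predecessor map backward from target to source, then reverse.
    `prev` maps each node to the previous node on its shortest path."""
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path

# Hypothetical predecessor map, as a Dijkstra run might fill it in
prev = {"Judy": "Ivan", "Ivan": "Heidi", "Heidi": "Alice"}
print(reconstruct_path(prev, "Alice", "Judy"))
# ['Alice', 'Heidi', 'Ivan', 'Judy']
```

Storing one predecessor per node costs O(V) memory total, versus O(V-squared) if every node carried its full path.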

The top_connectors method is a direct application of Lesson 8's Top-K pattern. We maintain a min-heap of size K. For each user, we compute their total interaction weight (sum of all edge weights). If the heap has fewer than K items, we push. If the user's total exceeds the heap's minimum (heap[0][0]), we use heapreplace to swap out the smallest. At the end, we sort the heap in descending order. This is O(N log K) instead of O(N log N) for a full sort, which matters when N is millions of users and K is 10.
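A sketch of that bounded-heap pattern; the function name and the toy weight map are illustrative, not NetProbe's exact API:

```python
import heapq

def top_connectors(weights, k):
    """Top-K users by total interaction weight, via a min-heap of size K.
    `weights` is a {user: {neighbor: weight}} map."""
    heap = []  # holds (total_weight, user); the smallest total sits at heap[0]
    for user, edges in weights.items():
        total = sum(edges.values())
        if len(heap) < k:
            heapq.heappush(heap, (total, user))
        elif total > heap[0][0]:
            heapq.heapreplace(heap, (total, user))  # evict the current minimum
    return sorted(heap, reverse=True)  # descending by total weight

w = {
    "Alice": {"Bob": 2, "Carol": 30},
    "Bob": {"Alice": 2},
    "Carol": {"Alice": 30, "Dave": 25},
}
print(top_connectors(w, 2))  # [(55, 'Carol'), (32, 'Alice')]
```

Only the final sort touches more than O(log K) work per user, and it runs over just K items, which is where the O(N log K) bound comes from.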

The validate method checks two invariants. First, every edge must be symmetric: if A's weight dict contains B, then B's weight dict must contain A with the same value. Second, all weights must be positive. These checks catch the kind of data construction bugs that Lesson 9 warned us about. At Google, graph validation was a mandatory step in any pipeline that constructed a graph from raw data. I once spent two days debugging a recommendation system that worked perfectly on 99.7% of users but crashed on the rest. The bug was a handful of asymmetric edges from a race condition in the data ingestion pipeline. A simple validation pass would have caught it in seconds.
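A sketch of such a validation pass over the weighted adjacency dict described above (the function shape is assumed, not NetProbe's exact signature):

```python
def validate(weights):
    """Check the two invariants: symmetric edges and positive weights.
    Returns a list of problem descriptions; empty means the graph is clean."""
    problems = []
    for user, edges in weights.items():
        for neighbor, w in edges.items():
            if w <= 0:
                problems.append(f"non-positive weight {user}-{neighbor}: {w}")
            if weights.get(neighbor, {}).get(user) != w:
                problems.append(f"asymmetric edge {user}-{neighbor}")
    return problems

good = {"A": {"B": 5}, "B": {"A": 5}}
bad = {"A": {"B": 5}, "B": {}}  # B is missing the reverse edge
print(validate(good))  # []
print(validate(bad))   # ['asymmetric edge A-B']
```

The pass iterates edges only, so it is O(V + E), matching the "Validate graph" row in the architecture review below.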

Deep Dive: Tie-breaking in the heap

When two nodes have the same distance, Python's heapq compares the second element of the tuple. In our case, that is a string (the node name), and strings are comparable, so this works. But if your nodes were custom objects without a __lt__ method, you would get a TypeError. The production-safe pattern is to include a counter as a tiebreaker:

Python
import heapq
from itertools import count

tiebreak = count()  # yields 0, 1, 2, ...: a unique, increasing sequence number
heapq.heappush(heap, (dist, next(tiebreak), node))

This guarantees a unique ordering even when distances are equal, without requiring nodes to be comparable. Python's official heapq documentation recommends this exact pattern.

🤔 Think about it: Our strongest_path returns (None, float('inf')) when no path exists. Why float('inf') instead of None or -1 for the cost?

View Answer

Using float('inf') is consistent with the algorithm's semantics: an unreachable node has infinite distance. It also allows callers to compare costs without special-casing: cost_a < cost_b works correctly even when one path does not exist, because any finite number is less than infinity. Returning -1 would require callers to check for the sentinel value before comparing, which is error-prone. This convention is used throughout NetworkX, scipy's sparse graph module, and most graph algorithm libraries.


Testing the Complete System: A pytest Suite

I have a rule: if your algorithm does not have a test for disconnected nodes, you do not actually have a tested algorithm. The happy path, finding a path that exists, is what everyone tests first. The edge cases are what catch you in production. Disconnected nodes, self-loops, single-node graphs, graphs where BFS and Dijkstra disagree, and graphs with uniform weights where they should agree. These are the tests that save you at 2 AM.

Our test suite covers seven scenarios. Each one targets a specific behavior that could break silently. The first test, test_bfs_shortest_path, verifies that BFS picks the fewest hops regardless of weights. The second, test_dijkstra_prefers_strong_connections, builds a three-node graph where the direct path is weak and the two-hop path is strong, confirming Dijkstra takes the longer but stronger route. The third tests disconnected nodes. The fourth tests self-paths (start equals end). The fifth verifies community detection with multiple components. The sixth checks that the heap-based top-K ranking returns the correct order. The seventh validates that our graph validation catches non-positive weights.

Notice the structure of the Dijkstra-versus-BFS test. We build a tiny graph with just three nodes: A, B, and C. Edge A-B has weight 1 (very weak), while A-C and C-B both have weight 100 (very strong). BFS returns [A, B] because it is one hop. Dijkstra returns [A, C, B] because 1/100 + 1/100 = 0.02 is much less than 1/1 = 1.0. This three-node test captures the entire conceptual difference between the two algorithms. Every time I write tests for a new algorithm, I start with the smallest possible input that demonstrates the core property. Remember from Lesson 6 when we discussed sorting algorithms: a four-element array was enough to expose the difference between stable and unstable sorts. The same principle applies here.
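Here is how that three-node test looks as a standalone pytest-style function. The bfs_path and dijkstra_path helpers below are minimal stand-ins for NetProbe's methods, written so the example runs on its own:

```python
import heapq
from collections import deque

def bfs_path(adj, s, t):
    """Fewest-hops path over an adjacency-set graph (minimal sketch)."""
    prev, seen, q = {}, {s}, deque([s])
    while q:
        node = q.popleft()
        if node == t:
            break
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                prev[nb] = node
                q.append(nb)
    path = [t]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return path[::-1]

def dijkstra_path(weights, s, t):
    """Lowest-cost path with cost = 1/weight (minimal sketch)."""
    dist, prev, seen = {s: 0.0}, {}, set()
    heap = [(0.0, s)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in seen:
            continue
        seen.add(node)
        for nb, w in weights[node].items():
            nd = d + 1.0 / w  # invert strength into cost at relaxation time
            if nd < dist.get(nb, float("inf")):
                dist[nb], prev[nb] = nd, node
                heapq.heappush(heap, (nd, nb))
    path = [t]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return path[::-1]

def test_dijkstra_prefers_strong_connections():
    # A-B is weak (weight 1); A-C and C-B are strong (weight 100)
    weights = {"A": {"B": 1, "C": 100}, "B": {"A": 1, "C": 100},
               "C": {"A": 100, "B": 100}}
    adj = {u: set(nbrs) for u, nbrs in weights.items()}
    assert bfs_path(adj, "A", "B") == ["A", "B"]                # fewest hops
    assert dijkstra_path(weights, "A", "B") == ["A", "C", "B"]  # strongest

test_dijkstra_prefers_strong_connections()
print("ok")
```

Run under pytest, the test_ prefix is all the discovery machinery needs; the explicit call at the bottom just lets the sketch double as a plain script.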

Running the tests is simple: pytest netprobe.py -v. Because the test functions are defined in the same file as the class (for this course's single-file format), pytest discovers them automatically. In a production codebase, you would typically put tests in a separate tests/ directory, but the pattern is identical. Each test function starts with test_, creates a small NetProbe instance, performs operations, and asserts expected results.

⚠️ Common Pitfall: Testing only the happy path. I have reviewed hundreds of graph algorithm implementations in code reviews, and the most common gap is missing tests for disconnected components. If your BFS or Dijkstra does not handle "no path exists" cleanly, you will get an infinite loop, an index error, or a silently wrong answer. Test it.

One testing pattern worth highlighting is the "BFS and Dijkstra agree" check. When all edges have the same weight, both algorithms should return paths of the same length (though not necessarily the same path, since there may be ties). This is a great property-based test: generate a random unweighted graph, assign uniform weights, and verify that shortest_path and strongest_path produce paths with identical hop counts. If they diverge, something is wrong. This kind of cross-validation between algorithms is a technique I learned at Google, where we would run the same query through two different ranking systems and flag discrepancies.

💡 Key Takeaway: Your test suite should include at least one test where two different algorithms are expected to agree. Disagreements reveal bugs faster than checking either algorithm in isolation.


Architecture Review: Every Data Structure in One System

Look at what NetProbe actually contains, because this is the entire course in a single file. The __init__ method creates three data structures: a defaultdict(set) (the adjacency list from Lesson 9), a defaultdict(dict) (the weighted graph from today), and a plain dict (the hash map from Lesson 2 for user profiles). Each serves a different purpose, and choosing the wrong one for any sub-problem would degrade performance or correctness.

The shortest_path method uses a deque as a queue and a set for visited tracking. The deque gives us O(1) append and popleft, which is the entire reason BFS is O(V + E). If we used a list and called pop(0), we would get O(V) for each dequeue and O(V-squared + V * E) overall. We learned this in Lesson 3 when comparing stacks, queues, and deques. The visited set uses hash-based membership testing at O(1). Together, they keep BFS fast.

The strongest_path method uses heapq as a priority queue and a dict for the distance table. The heap gives us O(log n) for push and pop, which drives Dijkstra's O((V + E) log V) complexity. The distance dictionary gives us O(1) lookups to check if a new tentative distance is better than the current best. If we used a sorted list instead of a heap (like inserting into a BST from Lesson 7), we would get O(log n) lookups but O(n) insertions, which is worse for this use case.

The detect_communities method uses a list as a stack for iterative DFS. We chose iterative over recursive DFS in Lesson 9 to avoid Python's recursion limit, and that decision carries forward. The stack gives O(1) push and pop. The visited set, shared with the outer loop, ensures each node is processed exactly once across all components.

The top_connectors method is a textbook application of the min-heap Top-K pattern from Lesson 8. Remember heapq.nlargest? Our manual implementation with heappush and heapreplace does the same thing but gives us more control over the comparison key. We sum each user's edge weights and keep only the K largest in the heap. The heap's minimum is always at index 0, so checking whether a new candidate beats the current worst is O(1).

| Sub-Problem | Naive Approach | Our Approach | Speedup |
| --- | --- | --- | --- |
| Find shortest path | Try all paths (exponential) | BFS with deque (O(V+E)) | Exponential to linear |
| Find strongest path | Try all paths | Dijkstra with heapq (O((V+E) log V)) | Exponential to near-linear |
| Find top K connectors | Sort all users (O(N log N)) | Min-heap of size K (O(N log K)) | log N / log K factor |
| Check connectivity | Repeated BFS from each node | Single DFS pass (O(V+E)) | V factor |
| Validate graph | Nested loop all pairs (O(V-squared)) | Iterate edges only (O(V+E)) | V/degree factor |

That table summarizes every algorithmic decision in NetProbe. Each row shows what you would do naively, what we actually do, and why it matters. For a graph with 100,000 users and 500,000 edges, the difference between exponential and linear is the difference between "does not finish before the heat death of the universe" and "finishes in under a second." Even the smaller speedups matter at scale. If you are ranking a million users to find the top 10, the heap-based approach is roughly 6x faster, since the speedup factor is log 1,000,000 / log 10, which is about 6. At Spotify's scale of 600 million users, that difference translates directly to server costs.

💡 Key Insight: Real applications are rarely about one data structure. They are about the ensemble. Hash maps for fast lookup, heaps for priority, graphs for relationships, queues for traversal. Knowing when to reach for each one is what separates an engineer from someone who just knows syntax.


Ship It: The Final NetProbe

Your NetProbe is portfolio-worthy. It combines six data structures, four algorithms, and a test suite. If you put this on GitHub with a README that explains the architecture, it demonstrates more practical understanding than most coding bootcamp capstone projects. Here is how I would present it.

First, the README should lead with the "compare" output. The moment where BFS and Dijkstra disagree is the most compelling proof that you understand graph algorithms. Show the output, explain why the paths differ, and link to the relevant lesson. Hiring managers and interviewers notice when a candidate can explain trade-offs, not just write code.

Second, expand the test suite. The seven tests we wrote cover core behavior, but a production-grade suite would add property-based tests (generate random graphs, verify invariants), performance benchmarks (time BFS and Dijkstra on graphs of increasing size), and integration tests (build a graph from a CSV file, run all commands). If you are using pytest, the pytest-benchmark plugin makes performance testing trivial.
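To sketch the property-based idea without extra dependencies, you can generate random weighted graphs with the standard library and assert a Dijkstra invariant: once the algorithm finishes, no single edge can improve the distance table. The dijkstra helper below is a hypothetical standalone reduction of strongest_path, not the NetProbe method itself; a library like hypothesis would generate the graphs for you.

```python
import heapq
import random

def dijkstra(weights, start):
    """Minimal Dijkstra over an inverted-weight graph; returns the distance table."""
    dist, heap, visited = {start: 0.0}, [(0.0, start)], set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:          # lazy deletion: skip stale entries
            continue
        visited.add(node)
        for nbr, w in weights.get(node, {}).items():
            nd = d + 1.0 / w         # cost inversion: high weight = low cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

random.seed(42)
for trial in range(50):                      # 50 random graphs
    weights = {}
    for _ in range(12):                      # 12 random undirected edges on 8 nodes
        u, v = random.sample(range(8), 2)
        w = random.randint(1, 100)
        weights.setdefault(u, {})[v] = w
        weights.setdefault(v, {})[u] = w
    dist = dijkstra(weights, 0)
    # Invariant: the finished table cannot be relaxed by any edge
    for u in dist:
        for v, w in weights.get(u, {}).items():
            if v in dist:
                assert dist[v] <= dist[u] + 1.0 / w + 1e-9
```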

Third, consider adding a real CLI with argparse. Right now our main() function runs a fixed demo. A real tool would accept commands like netprobe shortest-path Alice Judy or netprobe top-connectors 10. This is a straightforward extension that turns the script into an actual utility.
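A minimal sketch of what that CLI skeleton could look like with argparse subcommands (command names and defaults are suggestions; you would wire each handler to the corresponding NetProbe method):

```python
import argparse

def build_parser():
    """CLI skeleton: netprobe <command> [args]."""
    parser = argparse.ArgumentParser(prog="netprobe")
    sub = parser.add_subparsers(dest="command", required=True)

    sp = sub.add_parser("shortest-path", help="fewest hops between two users")
    sp.add_argument("start")
    sp.add_argument("end")

    tc = sub.add_parser("top-connectors", help="K most connected users")
    tc.add_argument("k", type=int, nargs="?", default=5)

    return parser

# Parsing a sample command line (normally argparse reads sys.argv)
args = build_parser().parse_args(["shortest-path", "Alice", "Judy"])
assert (args.command, args.start, args.end) == ("shortest-path", "Alice", "Judy")
```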

📌 Remember: Shipping is a skill separate from coding. A project that runs locally on your machine with hardcoded data is a prototype. A project with tests, documentation, and a clean interface is a product. The gap between them is smaller than most people think.


🔨 Project Update

This is the complete NetProbe application. Copy, paste, and run it. The code below includes the full class from Lessons 9 and 10, the pytest test suite, and a demonstration with a sample social network. Everything new in this lesson is marked with comments.

Python
"""
NetProbe v1.0 - Social Network Analyzer (Complete)
Python Data Structures and Algorithms - Lesson 10 (Final)

Run demo:   python netprobe.py
Run tests:  pytest netprobe.py -v
"""

import heapq
from collections import defaultdict, deque


# --- NetProbe Core ---

class NetProbe:
    """Social network analyzer: BFS, Dijkstra, DFS communities, Top-K."""

    def __init__(self):
        self.adj = defaultdict(set)        # unweighted adjacency list (Lesson 9)
        self.weights = defaultdict(dict)   # weighted adjacency list (NEW Lesson 10)
        self.profiles = {}                 # user metadata cache (hash map)

    def add_user(self, name, profile=None):
        """Register a user with optional profile metadata."""
        self.profiles[name] = profile or {}
        self.adj[name]                     # ensure node exists even with no edges
        self.weights[name]

    def add_connection(self, user_a, user_b, strength=1):
        """Add a weighted, undirected edge between two users."""
        self.adj[user_a].add(user_b)
        self.adj[user_b].add(user_a)
        self.weights[user_a][user_b] = strength   # NEW: store interaction weight
        self.weights[user_b][user_a] = strength

    # --- BFS: Shortest Path by Hop Count (Lesson 9) ---

    def shortest_path(self, start, end):
        """Find the path with fewest hops, ignoring edge weights."""
        if start not in self.adj or end not in self.adj:
            return None
        if start == end:
            return [start]

        visited = {start}                          # visited-on-enqueue pattern
        queue = deque([(start, [start])])

        while queue:
            current, path = queue.popleft()
            for neighbor in sorted(self.adj[current]):  # sorted for deterministic output
                if neighbor == end:
                    return path + [neighbor]
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append((neighbor, path + [neighbor]))
        return None                                # no path exists

    # --- Dijkstra: Strongest Path by Connection Quality (NEW Lesson 10) ---

    def strongest_path(self, start, end):
        """Find the path through strongest connections using Dijkstra.

        Higher strength = stronger bond. Internally uses cost = 1/strength
        so that min-heap naturally prefers high-strength edges.
        Returns (path_list, total_cost) or (None, inf).
        """
        if start not in self.weights or end not in self.weights:
            return None, float("inf")
        if start == end:
            return [start], 0.0

        dist = {start: 0.0}                       # shortest distance from start
        prev = {start: None}                       # predecessor map for path rebuild
        heap = [(0.0, start)]                      # min-heap: (cost, node)
        visited = set()

        while heap:
            d, node = heapq.heappop(heap)          # pop cheapest unvisited node

            if node in visited:                    # lazy deletion: skip stale entries
                continue
            visited.add(node)

            if node == end:                        # reached target, rebuild path
                path = []
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1], d

            for nbr, w in self.weights[node].items():
                if nbr not in visited and w > 0:
                    cost = 1.0 / w                 # invert: high weight = low cost
                    new_d = d + cost
                    if new_d < dist.get(nbr, float("inf")):
                        dist[nbr] = new_d          # update best known distance
                        prev[nbr] = node
                        heapq.heappush(heap, (new_d, nbr))

        return None, float("inf")                  # target unreachable

    # --- DFS: Community Detection (Lesson 9) ---

    def detect_communities(self):
        """Find all connected components using iterative DFS."""
        visited = set()
        communities = []

        for user in sorted(self.adj):              # sorted for deterministic order
            if user not in visited:
                component = []
                stack = [user]
                while stack:
                    node = stack.pop()
                    if node not in visited:
                        visited.add(node)
                        component.append(node)
                        for nbr in sorted(self.adj[node]):
                            if nbr not in visited:
                                stack.append(nbr)
                communities.append(sorted(component))

        return sorted(communities, key=len, reverse=True)

    # --- Top-K Connectors using Min-Heap (NEW Lesson 10, pattern from Lesson 8) ---

    def top_connectors(self, k=5):
        """Find K users with highest total interaction weight using a min-heap."""
        heap = []
        for user in self.weights:
            total = sum(self.weights[user].values())
            if len(heap) < k:
                heapq.heappush(heap, (total, user))
            elif total > heap[0][0]:               # beat the current minimum?
                heapq.heapreplace(heap, (total, user))  # O(log k) swap
        return sorted(heap, reverse=True)

    # --- Graph Validation (Lesson 9, extended for weights) ---

    def validate(self):
        """Check symmetry and positive weights across all edges."""
        errors = []
        for u in self.weights:
            for v, w in self.weights[u].items():
                if u not in self.weights.get(v, {}):
                    errors.append(f"Asymmetric edge: {u} -> {v}")
                if w <= 0:
                    errors.append(f"Non-positive weight: {u}-{v} = {w}")
        return errors

    # --- Network Statistics ---

    def stats(self):
        """Return summary statistics for the network."""
        all_nodes = set(self.adj) | set(self.weights)
        edge_count = sum(len(v) for v in self.adj.values()) // 2
        comms = self.detect_communities()
        return {
            "nodes": len(all_nodes),
            "edges": edge_count,
            "communities": len(comms),
            "largest": len(comms[0]) if comms else 0,
        }


# --- pytest Test Suite (run: pytest netprobe.py -v) ---

def test_bfs_shortest_path():
    """BFS finds fewest hops regardless of weight."""
    net = NetProbe()
    net.add_connection("A", "B", 1)
    net.add_connection("A", "C", 100)
    net.add_connection("C", "B", 100)
    path = net.shortest_path("A", "B")
    assert path == ["A", "B"]
    assert len(path) - 1 == 1

def test_dijkstra_prefers_strong_connections():
    """Dijkstra routes through stronger edges even with more hops."""
    net = NetProbe()
    net.add_connection("A", "B", 1)
    net.add_connection("A", "C", 100)
    net.add_connection("C", "B", 100)
    path, cost = net.strongest_path("A", "B")
    assert path == ["A", "C", "B"]
    assert cost < 1.0

def test_disconnected_returns_none():
    """Both algorithms return None for unreachable nodes."""
    net = NetProbe()
    net.add_user("X")
    net.add_user("Y")
    assert net.shortest_path("X", "Y") is None
    path, cost = net.strongest_path("X", "Y")
    assert path is None

def test_self_path():
    """Path from a node to itself."""
    net = NetProbe()
    net.add_user("A")
    assert net.shortest_path("A", "A") == ["A"]
    path, cost = net.strongest_path("A", "A")
    assert path == ["A"] and cost == 0.0

def test_community_detection():
    """DFS finds separate connected components."""
    net = NetProbe()
    net.add_connection("A", "B", 10)
    net.add_connection("C", "D", 20)
    comms = net.detect_communities()
    assert len(comms) == 2

def test_top_connectors_ranking():
    """Heap returns users ranked by total interaction weight."""
    net = NetProbe()
    net.add_connection("A", "B", 50)
    net.add_connection("A", "C", 30)
    net.add_connection("B", "C", 10)
    top = net.top_connectors(2)
    assert top[0][1] == "A"
    assert top[1][1] == "B"

def test_validate_catches_bad_weight():
    """Validation flags non-positive weights."""
    net = NetProbe()
    net.weights["A"]["B"] = -5
    net.weights["B"]["A"] = -5
    errors = net.validate()
    assert any("Non-positive" in e for e in errors)


# --- Demonstration ---

def build_sample_network():
    """Create a sample social network with 10 users and weighted edges."""
    net = NetProbe()

    for name in ["Alice", "Bob", "Carol", "Dave", "Eve",
                 "Frank", "Grace", "Heidi", "Ivan", "Judy"]:
        net.add_user(name)

    # Weighted connections: (user_a, user_b, messages_per_month)
    connections = [
        ("Alice", "Bob", 50),   ("Alice", "Carol", 30),
        ("Bob",   "Carol", 45), ("Bob",   "Dave", 10),
        ("Carol", "Dave", 25),  ("Dave",  "Eve", 5),
        ("Eve",   "Frank", 40), ("Frank", "Grace", 35),
        ("Grace", "Heidi", 20), ("Heidi", "Ivan", 15),
        ("Ivan",  "Judy", 8),   ("Alice", "Eve", 3),
        ("Bob",   "Frank", 2),
    ]
    for a, b, w in connections:
        net.add_connection(a, b, w)

    return net


def main():
    """Run the full NetProbe demonstration."""
    net = build_sample_network()

    print("=" * 55)
    print("  NetProbe v1.0 - Social Network Analyzer")
    print("=" * 55)

    # --- Network Statistics ---
    s = net.stats()
    print(f"\n[stats]")
    print(f"  Nodes: {s['nodes']}  Edges: {s['edges']}")
    print(f"  Communities: {s['communities']}  Largest: {s['largest']} members")

    # --- BFS: Shortest Path (fewest hops) ---
    print(f"\n[shortest-path] Alice -> Judy (fewest hops)")
    path = net.shortest_path("Alice", "Judy")
    if path:
        print(f"  {' -> '.join(path)}")
        print(f"  Hops: {len(path) - 1}")

    # --- Dijkstra: Strongest Path (highest connection quality) ---
    print(f"\n[strongest-path] Alice -> Judy (strongest connections)")
    path, cost = net.strongest_path("Alice", "Judy")
    if path:
        print(f"  {' -> '.join(path)}")
        print(f"  Cost: {cost:.4f}")

    # --- Compare BFS vs Dijkstra ---
    print(f"\n[compare] Alice -> Frank")
    bp = net.shortest_path("Alice", "Frank")
    dp, dc = net.strongest_path("Alice", "Frank")
    print(f"  BFS (fewest hops):    {' -> '.join(bp)}  ({len(bp)-1} hops)")
    print(f"  Dijkstra (strongest): {' -> '.join(dp)}  (cost {dc:.4f})")

    # --- Community Detection ---
    print(f"\n[communities]")
    for i, c in enumerate(net.detect_communities(), 1):
        print(f"  {i}. {', '.join(c)}")

    # --- Top Connectors ---
    print(f"\n[top-connectors]")
    for weight, user in net.top_connectors(5):
        print(f"  {user}: {weight} total interactions")

    # --- Validation ---
    errors = net.validate()
    print(f"\n[validate] {'All checks passed.' if not errors else errors}")

    print("\n" + "=" * 55)


if __name__ == "__main__":
    main()

Run the project you have built so far. Save the code above as netprobe.py and execute it:

Bash
python netprobe.py

Expected output:

Text
=======================================================
  NetProbe v1.0 - Social Network Analyzer
=======================================================

[stats]
  Nodes: 10  Edges: 13
  Communities: 1  Largest: 10 members

[shortest-path] Alice -> Judy (fewest hops)
  Alice -> Bob -> Frank -> Grace -> Heidi -> Ivan -> Judy
  Hops: 6

[strongest-path] Alice -> Judy (strongest connections)
  Alice -> Carol -> Dave -> Eve -> Frank -> Grace -> Heidi -> Ivan -> Judy
  Cost: 0.5686

[compare] Alice -> Frank
  BFS (fewest hops):    Alice -> Bob -> Frank  (2 hops)
  Dijkstra (strongest): Alice -> Carol -> Dave -> Eve -> Frank  (cost 0.2983)

[communities]
  1. Alice, Bob, Carol, Dave, Eve, Frank, Grace, Heidi, Ivan, Judy

[top-connectors]
  Bob: 107 total interactions
  Carol: 100 total interactions
  Alice: 83 total interactions
  Frank: 77 total interactions
  Grace: 55 total interactions

[validate] All checks passed.

=======================================================

Then run the tests:

Bash
pytest netprobe.py -v

All seven tests should pass. If any fail, check that you copied the entire file, including the imports at the top.


Course Integration: What You Built Across Ten Lessons

Let me take you through the full journey one last time. In Lesson 1, you learned that list.append() is amortized O(1) because Python over-allocates memory. That understanding is why you can trust component.append(node) inside DFS without worrying about performance. In Lesson 2, you learned that dictionary lookups are O(1) because of hash tables. That is why dist.get(nbr, float("inf")) in Dijkstra is instant, not a linear scan. In Lesson 3, you learned why deque exists: O(1) popleft, which makes BFS O(V + E) instead of the O(V² + E) you would get from a plain list's O(V) pop(0).

Lessons 4 and 5 gave you the foundation for graph traversal. Linked list pointer logic in Lesson 4 taught you to think carefully about references, which is the same skill you need when managing prev[node] in Dijkstra's path reconstruction. Recursion and memoization in Lesson 5 taught you that recomputing is wasteful, which is why Dijkstra uses a dist dictionary to remember the best known distance instead of exploring from scratch.

Lesson 6's sorting discussion shows up in top_connectors. We could sort all users by total weight and take the first K, which is O(N log N) using Timsort. Instead, we use a heap of size K, which is O(N log K). When K is small relative to N, the difference is significant. Lesson 7's binary search and BST work prepared you for the idea that maintaining sorted order during insertion is powerful, which is exactly what the heap does (in a weaker form: the heap maintains the minimum, not full sort order, which is why it is faster).

Lesson 8 is the most direct ancestor of today's Dijkstra implementation. The heapq module, heappush, heappop, heapreplace, the min-heap property, the array-based representation: all of these concepts from Lesson 8 are now load-bearing infrastructure inside strongest_path. Without the heap, Dijkstra would need O(V) to find the minimum each step, making the total O(V²). With it, we get O((V + E) log V). That single data structure choice is the entire difference.

Lesson 9's BFS and DFS carry forward unchanged. The shortest_path and detect_communities methods are nearly identical to what you wrote last lesson. The visited-on-enqueue pattern for BFS, the iterative stack-based DFS, the defaultdict(set) adjacency list: all still here, all still working. New features layer on top of a proven foundation. This is how real codebases grow.

💡 Key Insight: You did not learn ten isolated topics. You learned ten components of a system. The course was designed so that each lesson's output becomes the next lesson's input. That is how professional software engineering works too: every feature you ship becomes infrastructure for the next team.

🤔 Think about it: If you replaced the defaultdict(dict) weighted graph with a 2D numpy array (adjacency matrix), how would the space complexity change for a network of 1 million users with 5 million edges?

Answer:

The adjacency dict uses O(N + 2E) space: 1 million dict entries plus 10 million stored weight values (each undirected edge appears twice), roughly 11 million entries in total. A 2D numpy array of float64 for 1 million users would need 1,000,000 × 1,000,000 = 1 trillion entries, consuming approximately 8 terabytes of memory. By entry count, the adjacency dict is roughly 90,000x more space-efficient for this sparse graph. This is why adjacency matrices are only practical for small or very dense graphs.
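The back-of-envelope numbers can be checked in a few lines (a standalone calculation, counting matrix cells against stored dict entries):

```python
n, e = 1_000_000, 5_000_000

matrix_entries = n * n              # dense adjacency matrix: one cell per pair
matrix_bytes = matrix_entries * 8   # float64 = 8 bytes per cell
dict_entries = n + 2 * e            # node slots + each undirected edge stored twice

print(f"matrix: {matrix_bytes / 1e12:.0f} TB")                 # → matrix: 8 TB
print(f"entry ratio: {matrix_entries / dict_entries:,.0f}x")   # → entry ratio: 90,909x
```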


Summary Table

| Concept | What It Does | When to Use | Watch Out For |
| --- | --- | --- | --- |
| BFS (deque + visited set) | Finds the path with the fewest hops between two nodes | Unweighted graphs, hop-count routing, reachability checks | Does not consider edge weights; a short path may cross weak connections |
| DFS (iterative stack) | Finds all nodes reachable from a source, identifying connected components | Community detection, cycle detection, flood fill | Use iterative (not recursive) in Python to avoid hitting the recursion limit |
| Unweighted adjacency list (defaultdict(set)) | Stores neighbors without metadata for fast traversal | BFS, DFS, any algorithm that only needs connectivity | Cannot represent edge strength, cost, or capacity |
| Weighted adjacency list (defaultdict(dict)) | Stores neighbors with numeric edge weights | Dijkstra, weight-based ranking, any cost-sensitive traversal | Keep alongside the unweighted list if BFS and Dijkstra both need to run |
| Dijkstra's algorithm | Finds the minimum-cost path from source to target using a min-heap | Non-negative weighted graphs where total path cost matters | Fails silently with negative weights; does not crash, just gives wrong answers |
| Cost inversion (1/weight) | Converts "higher is better" weights into "lower is better" costs | Any time your natural metric is strength or frequency but you need min-cost pathfinding | Zero weights cause division by zero; always validate that w > 0 |
| Lazy deletion in heapq | Pushes duplicate entries instead of updating priorities, skips stale pops | All Python heapq usage where priorities change after insertion | Heap can grow larger than V entries; memory is wasted but correctness is preserved |
| Predecessor dictionary | Maps each node to the prior node on its shortest path, enabling path reconstruction | Any shortest-path algorithm (BFS or Dijkstra) | Walk backward from target to source, then reverse the list |
| Top-K with min-heap | Maintains a heap of size K, evicting the minimum when a larger item arrives | Finding the K most connected users, top-K in a stream or large dataset | Only beats O(N log N) sorting when K is much smaller than N |
| Graph validation | Checks edge symmetry and positive weights after construction | After building any graph from external data (files, APIs, databases) | Run once after construction, not on every query; catches race conditions in data ingestion |
| pytest edge case coverage | Tests disconnected nodes, self-paths, and algorithm agreement in addition to happy paths | Every algorithm you implement, without exception | Happy path tests alone are not enough; the bugs live at the boundaries |

This table is your decision framework. Print it, bookmark it, or keep it open during interviews. When a weighted graph problem appears, start at the top and work down: do you need hops or cost? Are weights always positive? Is K small? Each question points to a row.


What Comes Next

You have completed the entire course. From dynamic arrays to weighted graphs, from O(n) list searches to O((V+E) log V) Dijkstra, from a single list.append() to a multi-algorithm network analyzer. The data structures and algorithms you learned are the same ones powering Google Maps (Dijkstra for routing), Twitter's follower graph (BFS for suggestions), Spotify's recommendations (graph + heap-based ranking), and Netflix's content delivery network (shortest-path optimization).

My parting advice: build things. Every data structure in this course becomes real the moment you use it in a project that matters to you. Build a pathfinder for a game. Build a dependency resolver for a build tool. Build a social graph analyzer for your favorite API. The concepts stick when they solve your problems, not mine.


Difficulty Fork

Too Easy? (review and extend)

Key takeaways from the full course: You can now implement and analyze arrays, hash maps, stacks, queues, linked lists, trees, heaps, and graphs. You understand BFS, DFS, Dijkstra, binary search, and multiple sorting algorithms. You know when each structure is appropriate and what trade-offs to expect.

Extension challenge: Implement the A-star (A*) algorithm, which extends Dijkstra with a heuristic function to guide the search toward the target. A* is what video game pathfinding (Unity, Unreal Engine) actually uses. The heuristic must be admissible (never overestimate), and with a zero heuristic, A* degrades to Dijkstra. Add it to NetProbe as a guided_path method that accepts a heuristic function parameter.

Just Right (reinforce)

Alternative analogy for Dijkstra: Think of it as a flood filling a landscape. Water starts at one point and spreads outward, always flowing to the lowest available ground first. The water level at any point when the flood first reaches it is the shortest distance from the source. The min-heap is the "frontier" of the flood, always picking the lowest edge of the water surface to expand next.

Extra practice: Modify NetProbe so that add_connection can update an existing edge's weight (for example, if Alice and Bob exchange more messages next month). Verify that both the adj set and weights dict stay consistent, and write a test that updates a weight and confirms Dijkstra recalculates correctly.

Want a Challenge?

Interview problem (commonly asked at Amazon and Google): Given a weighted graph of cities and flight costs, and a constraint K (maximum number of stops), find the cheapest flight from source to destination with at most K stops. This is Dijkstra with a twist: the state in your heap is not just (cost, city) but (cost, city, stops_remaining). You can visit the same city multiple times if you arrive with different stop budgets. This problem is LeetCode 787 ("Cheapest Flights Within K Stops") and tests whether you truly understand how to modify Dijkstra's state space.

Hint: The key insight is that a node is not just a city. It is a (city, stops_left) pair. Your visited set must track this compound state.
