Building a Merkle Tree: Summarizing Thousands of Transactions into a Single Hash

Lesson 823min4,662 chars

Learning Objectives

  • 머클 트리의 구조와 머클 루트 계산 과정을 단계별로 설명할 수 있다
  • 임의 개수의 트랜잭션으로부터 머클 루트를 계산하는 함수를 구현할 수 있다
  • 머클 증명을 생성하고 검증하는 코드를 작성하여 특정 트랜잭션의 포함 여부를 O(log n)으로 확인할 수 있다

Building a Merkle Tree: Summarizing Thousands of Transactions into a Single Hash

From Lesson 7 to Here

Last time, we implemented the UTXO model and built a structure where transactions accumulate inside a block. We got as far as creating coins via coinbase, spending UTXOs, and blocking double spends. But let's ask one question.

A block contains 2,000 transactions. How do you verify that your transaction is actually included in this block?

The brute-force approach is to download all 2,000 and compare them one by one. Back in Lesson 4 when we covered block anatomy, we said a block header has 6 fields. One of them is the Merkle Root. At the time we glossed over it as "a summary of transactions," but today we fully dissect its internals.

Today's Mission

📦 What we'll build: merkle.py module
├── build_merkle_tree()    — transaction list → build full tree
├── get_merkle_root()      — return merkle root hash
├── generate_proof()       — generate inclusion proof for a specific transaction
└── verify_proof()         — verify proof (core of SPV clients)
🔗 Integration: add merkle_root field to Block class

Once done, you'll be able to verify one transaction out of 2,000 with exactly 11 hash operations. The moment O(n) becomes O(log n).


1. What is a Merkle Tree — The 'Tournament Bracket' Analogy

Think of the World Cup. 32 teams enter, go through Round of 16 → Quarterfinals → Semifinals → Finals, and exactly one champion emerges. A Merkle tree has exactly the same structure.

  • Leaves: The hash of each transaction — the SHA-256 we learned in Lesson 1 is used here
  • Intermediate nodes: Hash the concatenation of two children
  • Root: The final single hash — this goes into the block header

Key property: If even 1 bit changes in any transaction at the bottom, all hashes above it change in a cascade, and ultimately the root becomes completely different. It's the same principle as when we "chained previous block hashes" to verify chain integrity in Lesson 6 — if one changes, everything collapses.

🤔 Think about it: If there are 1,024 transactions, what is the height (number of levels) of the Merkle tree?

See answer

log₂(1024) = 10 levels. 1,024 leaf nodes → 512 → 256 → … → 1 (root). It goes through 10 "rounds" in total. This is also the number of hash operations needed for verification.


2. Computing the Merkle Root — Implementing from Scratch

Now that we understand the structure, let's build it in code.

2-1. The Simplest Merkle Root

import hashlib

def hash_data(data: str) -> str:
    """SHA-256 hash (same function built in Lesson 1)"""
    return hashlib.sha256(data.encode()).hexdigest()

def hash_pair(left: str, right: str) -> str:
    """Concatenate two hashes and hash again"""
    return hash_data(left + right)

# 4 transactions
txs = ["Alice->Bob:1", "Bob->Carol:0.5", "Carol->Dave:0.3", "Dave->Eve:0.2"]

# Step 1: Leaf nodes — hash each transaction
leaves = [hash_data(tx) for tx in txs]
print("Number of leaf nodes:", len(leaves))
print("First leaf:", leaves[0][:16] + "...")

# Step 2: Pair up and work upward
level_1 = [
    hash_pair(leaves[0], leaves[1]),
    hash_pair(leaves[2], leaves[3])
]
print("Level 1 node count:", len(level_1))

# Step 3: Root
root = hash_pair(level_1[0], level_1[1])
print("Merkle root:", root[:16] + "...")

# Output:
# Number of leaf nodes: 4
# First leaf: 73e68cb8de3f2e8f...
# Level 1 node count: 2
# Merkle root: 6b34e2c180f82b54...

This code hashes 4 transactions as leaf nodes, then pairs them up to work toward a single root. It's exactly the process of filling in a tournament bracket from bottom to top.

Simple. But there's a problem — what if the number of transactions is odd?

2-2. Handling Odd Numbers: Duplicate the Last Node

Bitcoin's actual implementation also duplicates the last hash when the count is odd to make it even. I'll be honest — this duplication approach can create a second preimage attack vulnerability in Ethereum, which is why it takes a different approach. But since learning Bitcoin is our goal, we'll go the Bitcoin way.

❌ → 🤔 → ✅ How Odd Transaction Handling Evolves

Odd-number handling is where beginners make the most mistakes with Merkle root calculation. Let's see how it improves step by step.

❌ WRONG WAY — Ignore odd: just drop the leftover node

def get_merkle_root_wrong(transactions):
    current_level = [hash_data(tx) for tx in transactions]
    while len(current_level) > 1:
        next_level = []
        # If odd, the last one gets dropped!
        for i in range(0, len(current_level) - 1, 2):
            next_level.append(hash_pair(current_level[i], current_level[i+1]))
        current_level = next_level
    return current_level[0]

# Problem: with 5 transactions, the last Tx5 is not reflected in the root!
txs = ["Tx1", "Tx2", "Tx3", "Tx4", "Tx5"]
root_wrong = get_merkle_root_wrong(txs)

# Tampering with Tx5 produces the same root — a critical security hole
txs_tampered = ["Tx1", "Tx2", "Tx3", "Tx4", "Tx5_FAKE"]
root_tampered = get_merkle_root_wrong(txs_tampered)
print(f"Original root: {root_wrong[:16]}...")
print(f"Tampered root: {root_tampered[:16]}...")
print(f"Same?          {root_wrong == root_tampered}")  # True — disaster!
# Output:
# Original root: f1c8a2e9b7d34a01...
# Tampered root: f1c8a2e9b7d34a01...
# Same?          True

Since the last transaction isn't reflected in the root, an attacker can freely alter odd-indexed transactions without detection.

🤔 BETTER — Promote unpaired: carry the leftover node to the next level as-is

def get_merkle_root_better(transactions):
    current_level = [hash_data(tx) for tx in transactions]
    while len(current_level) > 1:
        next_level = []
        for i in range(0, len(current_level), 2):
            if i + 1 < len(current_level):
                next_level.append(hash_pair(current_level[i], current_level[i+1]))
            else:
                # Carry the leftover node up without hashing
                next_level.append(current_level[i])
        current_level = next_level
    return current_level[0]

# Tampering is detected
txs = ["Tx1", "Tx2", "Tx3", "Tx4", "Tx5"]
root_better = get_merkle_root_better(txs)
txs_tampered = ["Tx1", "Tx2", "Tx3", "Tx4", "Tx5_FAKE"]
root_better_t = get_merkle_root_better(txs_tampered)
print(f"Same? {root_better == root_better_t}")  # False — tampering detected
# Output:
# Same? False

Tampering is detected, but there's a problem. When an odd node is promoted without hashing, the tree shape becomes asymmetric. The "duplicated sibling" rule can't be applied when generating proofs, which complicates the logic in generate_proof() and verify_proof(). Also, this produces a different Merkle root than other Bitcoin network nodes, causing incompatibility.

✅ BEST — Duplicate the last node: Bitcoin's actual approach

def get_merkle_root_best(transactions):
    current_level = [hash_data(tx) for tx in transactions]
    while len(current_level) > 1:
        # Key: if odd, duplicate the last node to make it even
        if len(current_level) % 2 == 1:
            current_level.append(current_level[-1])
        next_level = []
        for i in range(0, len(current_level), 2):
            next_level.append(hash_pair(current_level[i], current_level[i+1]))
        current_level = next_level
    return current_level[0]

# Tamper detection + symmetric tree + Bitcoin compatible
txs = ["Tx1", "Tx2", "Tx3", "Tx4", "Tx5"]
root_best = get_merkle_root_best(txs)
txs_tampered = ["Tx1", "Tx2", "Tx3", "Tx4", "Tx5_FAKE"]
root_best_t = get_merkle_root_best(txs_tampered)
print(f"Same? {root_best == root_best_t}")  # False
# Output:
# Same? False

The duplication approach has three advantages: (1) all transactions are reflected in the root, (2) the tree is always a complete binary tree form so proof logic stays simple, (3) it produces the same root as the Bitcoin network. The implementation below uses this approach directly.


The code below is the complete form applying this ✅ BEST approach.

def get_merkle_root(transactions: list) -> str:
    """Compute the Merkle root from a list of transactions."""
    if not transactions:
        return hash_data("")  # Handle empty block

    # Generate leaf nodes
    current_level = [hash_data(tx) for tx in transactions]

    # Repeat until one node remains
    while len(current_level) > 1:
        next_level = []
        # Duplicate the last node if odd
        if len(current_level) % 2 == 1:
            current_level.append(current_level[-1])

        # Pair up and hash
        for i in range(0, len(current_level), 2):
            next_level.append(hash_pair(current_level[i], current_level[i+1]))

        current_level = next_level

    return current_level[0]

# Test with odd count
txs_odd = ["Tx1", "Tx2", "Tx3"]  # 3 — odd!
root = get_merkle_root(txs_odd)
print(f"Merkle root for 3 transactions: {root[:16]}...")
# Output:
# Merkle root for 3 transactions: a92f4b7c3d1e8a06...

# Even count
txs_even = ["Tx1", "Tx2", "Tx3", "Tx4"]
root2 = get_merkle_root(txs_even)
print(f"Merkle root for 4 transactions: {root2[:16]}...")
# Output:
# Merkle root for 4 transactions: 9c1d5f8e2a7b4093...

# Change just one transaction
txs_tampered = ["Tx1", "Tx2_TAMPERED", "Tx3", "Tx4"]
root3 = get_merkle_root(txs_tampered)
print(f"Tampered merkle root:           {root3[:16]}...")
print(f"Same root? {root2 == root3}")  # False!
# Output:
# Tampered merkle root:           e7a2bc091f4d6358...
# Same root? False

This function converges any number of transactions — even or odd — into a single root. The while loop halves the node count each round, and when odd, it duplicates the last node to make a pair.

We only slightly touched one transaction, yet the root changed completely. The avalanche effect we learned in Lesson 1 was amplified through the tree structure.

🤔 Think about it: In the proof-of-work (PoW) we implemented in Lesson 5, we repeatedly calculated hashes while changing the nonce. If one new transaction is added during mining, what happens to the Merkle root, and does mining have to start over?

See answer

Yes. When a transaction is added or changed, the Merkle root changes, and since the Merkle root is part of the block header, the block hash itself changes. The nonce that previously satisfied the difficulty target is no longer valid, so mining must start over from scratch. This is why miners finalize their transaction selection before starting to iterate the nonce.


3. Building the Full Tree — build_merkle_tree()

Computing only the Merkle root is only half the job. We need to store all intermediate nodes in order to build Merkle proofs later.

def build_merkle_tree(transactions: list) -> list:
    """
    Returns the full Merkle tree as a list of levels.
    tree[0] = leaf nodes, tree[-1] = [root]
    """
    if not transactions:
        return [[hash_data("")]]

    tree = []

    # Leaf node level
    current_level = [hash_data(tx) for tx in transactions]
    tree.append(current_level[:])  # Copy and store

    while len(current_level) > 1:
        # Handle odd
        if len(current_level) % 2 == 1:
            current_level.append(current_level[-1])

        next_level = []
        for i in range(0, len(current_level), 2):
            next_level.append(hash_pair(current_level[i], current_level[i+1]))

        tree.append(next_level[:])
        current_level = next_level

    return tree

# Test
txs = ["Tx1", "Tx2", "Tx3", "Tx4", "Tx5"]
tree = build_merkle_tree(txs)

print(f"Number of tree levels: {len(tree)}")
for i, level in enumerate(tree):
    short = [h[:8] + "..." for h in level]
    print(f"  Level {i} ({len(level)} nodes): {short}")

# Output:
# Number of tree levels: 4
#   Level 0 (5 nodes): ['a4e624d2...', '2e7d2c03...', 'b2f5ff47...', 'c0a5d62e...', '1b645389...']
#   Level 1 (3 nodes): ['f1c8a2e9...', '3d7b41a0...', '94e5c167...']
#   Level 2 (2 nodes): ['7a29c3f1...', 'db5e2871...']
#   Level 3 (1 node): ['e82f1a09...']

The logic is identical to get_merkle_root(), with the difference that it accumulates the result of each level into the tree list. These stored intermediate nodes are the material for the Merkle proofs we'll build in the next section.

Level 0 has 5 nodes but Level 1 has 3 — because 5 is odd, the last is duplicated to make 6, then paired down to 3. Level 1 also has 3 (odd), so the same process yields 2 nodes at Level 2.

The red nodes are the parts duplicated due to odd counts.


4. Merkle Proof — The Magic of O(log n)

This is the real core. When I was building an airdrop system for an Ethereum DeFi protocol, I couldn't store 100,000 whitelist addresses on-chain. The gas fees would be astronomical. The solution was to store just one Merkle root in the contract, and have each user submit their own proof. That's exactly Merkle proof.

How Proofs Work

To prove "TxB is included in this block," you don't need the whole tree. All you need are the sibling nodes on the path up to the root.

To verify TxB:

  1. I compute H(TxB)
  2. I receive sibling H(TxA) → compute H(TxA) + H(TxB) = H(AB)
  3. I receive sibling H(CD) → compute H(AB) + H(CD) = root
  4. If this root matches the Merkle root in the block header → inclusion proof complete!

Data needed: 2 sibling nodes. Even with 1,024 transactions, you only need 10 sibling nodes (log₂ 1024 = 10). Compared to iterating the entire UTXO pool in Lesson 7, this is a different dimension of efficiency.

Transaction countFull scan (O(n))Merkle proof (O(log n))
1616 times4 times
1,0241,024 times10 times
65,53665,536 times16 times
1,000,0001,000,000 times20 times

Verify one out of 1 million with 20 hash operations. This is why Merkle trees are beautiful.

Implementation: generate_proof() and verify_proof()

def generate_proof(tree: list, tx_index: int) -> list:
    """
    Generate a Merkle proof for a specific transaction.
    Returns: list of (sibling_hash, direction) — direction is 'left' or 'right'
    """
    proof = []
    index = tx_index

    for level in tree[:-1]:  # Exclude the root level
        # Handle odd: apply padding if needed
        working_level = level[:]
        if len(working_level) % 2 == 1:
            working_level.append(working_level[-1])

        # Find sibling node
        if index % 2 == 0:  # I'm on the left → sibling is on the right
            sibling = working_level[index + 1]
            proof.append((sibling, 'right'))
        else:  # I'm on the right → sibling is on the left
            sibling = working_level[index - 1]
            proof.append((sibling, 'left'))

        # Index at the next level
        index = index // 2

    return proof


def verify_proof(tx_data: str, proof: list, expected_root: str) -> bool:
    """Verify a Merkle proof."""
    current_hash = hash_data(tx_data)

    for sibling_hash, direction in proof:
        if direction == 'left':
            # Sibling is on the left → sibling + me
            current_hash = hash_pair(sibling_hash, current_hash)
        else:
            # Sibling is on the right → me + sibling
            current_hash = hash_pair(current_hash, sibling_hash)

    return current_hash == expected_root

# Test
txs = ["Tx1", "Tx2", "Tx3", "Tx4"]
tree = build_merkle_tree(txs)
root = tree[-1][0]

# Generate inclusion proof for Tx2 (index 1)
proof = generate_proof(tree, 1)
print("Proof data:")
for sibling, direction in proof:
    print(f"  sibling: {sibling[:16]}... direction: {direction}")

# Verify
is_valid = verify_proof("Tx2", proof, root)
print(f"\nTx2 inclusion verification: {is_valid}")  # True

# Verify a fake transaction
is_fake = verify_proof("Tx2_FAKE", proof, root)
print(f"Tx2_FAKE inclusion verification: {is_fake}")  # False

# Output:
# Proof data:
#   sibling: a4e624d264de159e... direction: left
#   sibling: c01e7b2e17488aab... direction: right
#
# Tx2 inclusion verification: True
# Tx2_FAKE inclusion verification: False

generate_proof() traverses the tree from bottom to top, collecting the sibling node at each level. verify_proof() recombines those sibling nodes in order to recompute up to the root. If the recomputed root matches the Merkle root in the block header, it means the transaction is included in the block.

The proof requires exactly 2 pieces of data (4 transactions → log₂4 = 2). Fake transactions are rejected immediately.

🤔 Think about it: Why do we need to store direction in generate_proof()? Are the results of hash_pair(A, B) and hash_pair(B, A) the same?

See answer

They're different! hash_data("AB") and hash_data("BA") produce completely different results. It's a property of hash functions we learned in Lesson 1 — even a 1-bit difference in input completely changes the output. So we must record whether the sibling goes on the left or right, so we can combine them in the same order during verification.


5. SPV Client — Verifying Without a Full Node

This is where Merkle proofs shine brightest. SPV (Simplified Payment Verification) was the part that impressed me most when I first studied Bitcoin. It's also a concept that Satoshi Nakamoto explained directly in Section 8 of the Bitcoin whitepaper.

Full node: Stores every transaction in every block (approximately 600GB+ as of 2026).

SPV client: Stores only block headers. Since one header is 80 bytes, even hundreds of thousands of block headers total only a few dozen MB.

This is how Bitcoin mobile wallets work. Since you can't store 600GB on a phone, you sync only block headers and verify transactions with Merkle proofs. In Lesson 4 we said "the block header is the block's ID card" — with just the ID card (header), you can prove what's inside (transactions).

🔍 Deep Dive: SPV Security Limitations

SPV only proves "is this transaction included in a block." Whether that block belongs to the longest chain requires separate verification. In the 51% attack we'll learn about in Lesson 9, an attacker could put a fake transaction in a shorter chain and fool an SPV client. To prevent this, you'd need to cross-verify block headers against multiple full nodes. But for typical transactions, SPV is sufficient — because the proof-of-work difficulty we learned in Lesson 5 is itself the basis of security.


6. Integrating Merkle Root into the Block Class

Now that we have the theory and functions, it's time to attach the Merkle tree to our blockchain. We add the merkle_root field to the block header we designed in Lesson 4.

import hashlib
import time

def hash_data(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def hash_pair(left: str, right: str) -> str:
    return hash_data(left + right)

def get_merkle_root(transactions: list) -> str:
    """transaction list → merkle root"""
    if not transactions:
        return hash_data("")
    current_level = [hash_data(str(tx)) for tx in transactions]
    while len(current_level) > 1:
        if len(current_level) % 2 == 1:
            current_level.append(current_level[-1])
        next_level = []
        for i in range(0, len(current_level), 2):
            next_level.append(hash_pair(current_level[i], current_level[i+1]))
        current_level = next_level
    return current_level[0]


class Block:
    def __init__(self, index, transactions, previous_hash, difficulty=4):
        self.index = index
        self.timestamp = time.time()
        self.transactions = transactions
        self.previous_hash = previous_hash
        self.difficulty = difficulty
        # ⬇️ Newly added: Merkle root!
        self.merkle_root = get_merkle_root(
            [str(tx) for tx in transactions]
        )
        self.nonce = 0
        self.hash = self.mine_block()

    def calculate_hash(self):
        # Include Merkle root in block hash calculation
        block_string = (
            f"{self.index}{self.timestamp}{self.merkle_root}"
            f"{self.previous_hash}{self.nonce}"
        )
        return hash_data(block_string)

    def mine_block(self):
        target = "0" * self.difficulty
        while True:
            h = self.calculate_hash()
            if h[:self.difficulty] == target:
                return h
            self.nonce += 1

# Test — difficulty 2 for speed
block = Block(
    index=1,
    transactions=["Alice->Bob:1BTC", "Bob->Carol:0.5BTC", "Coinbase->Alice:6.25BTC"],
    previous_hash="0" * 64,
    difficulty=2
)

print(f"Block #{block.index}")
print(f"  Transaction count: {len(block.transactions)}")
print(f"  Merkle root:       {block.merkle_root[:24]}...")
print(f"  Block hash:        {block.hash[:24]}...")
print(f"  Nonce:             {block.nonce}")

# Output:
# Block #1
#   Transaction count: 3
#   Merkle root:   8f4a2c1e9b7d3f05a628e4...
#   Block hash:    00a7c3f1d829e5b461...
#   Nonce:         142

Note: calculate_hash() no longer lists individual transactions directly. A single Merkle root represents all transactions. Actual Bitcoin works the same way.

Don't do this:

# Include all transactions in block hash — inefficient
block_string = f"{self.index}{self.transactions}{self.previous_hash}{self.nonce}"

With 2,000 transactions, the hash input string becomes enormous, and you'd have to hash that giant string millions of times while iterating the nonce.

With Merkle root:

# Include only the Merkle root (fixed 64 chars) — constant regardless of transaction count
block_string = f"{self.index}{self.merkle_root}{self.previous_hash}{self.nonce}"

The Merkle root is always 64 characters (SHA-256), so the hash input size is the same whether there's 1 or 100,000 transactions.


7. Practical Intuition: Integrity Verification Scenario

We've built individual functions and integrated them into the block. Now let's connect everything with a real attack scenario.

# Scenario: a malicious miner tries to secretly alter a transaction

import hashlib, time

def hash_data(data):
    return hashlib.sha256(data.encode()).hexdigest()

def hash_pair(left, right):
    return hash_data(left + right)

def build_merkle_tree(transactions):
    if not transactions:
        return [[hash_data("")]]
    tree = []
    current_level = [hash_data(tx) for tx in transactions]
    tree.append(current_level[:])
    while len(current_level) > 1:
        if len(current_level) % 2 == 1:
            current_level.append(current_level[-1])
        next_level = []
        for i in range(0, len(current_level), 2):
            next_level.append(hash_pair(current_level[i], current_level[i+1]))
        tree.append(next_level[:])
        current_level = next_level
    return tree

def generate_proof(tree, tx_index):
    proof = []
    index = tx_index
    for level in tree[:-1]:
        working_level = level[:]
        if len(working_level) % 2 == 1:
            working_level.append(working_level[-1])
        if index % 2 == 0:
            sibling = working_level[index + 1]
            proof.append((sibling, 'right'))
        else:
            sibling = working_level[index - 1]
            proof.append((sibling, 'left'))
        index = index // 2
    return proof

def verify_proof(tx_data, proof, expected_root):
    current_hash = hash_data(tx_data)
    for sibling_hash, direction in proof:
        if direction == 'left':
            current_hash = hash_pair(sibling_hash, current_hash)
        else:
            current_hash = hash_pair(current_hash, sibling_hash)
    return current_hash == expected_root


# Original transactions
original_txs = [
    "Alice->Bob:1BTC",
    "Bob->Carol:0.5BTC",
    "Carol->Dave:0.3BTC",
    "Dave->Eve:0.2BTC"
]

# Build tree & save Merkle root
tree = build_merkle_tree(original_txs)
saved_root = tree[-1][0]
print(f"[Merkle root saved in block header]: {saved_root[:24]}...")

# Generate proof for Alice's Tx (index 0)
proof = generate_proof(tree, 0)
print(f"\n[Alice verification] Legitimate transaction:")
result = verify_proof("Alice->Bob:1BTC", proof, saved_root)
print(f"  Result: {'✅ Inclusion confirmed' if result else '❌ Not included'}")

# Attacker tampers with the amount
print(f"\n[Attacker attempt] Tampered transaction:")
fake_result = verify_proof("Alice->Bob:100BTC", proof, saved_root)
print(f"  Result: {'✅ Inclusion confirmed' if fake_result else '❌ Forgery detected!'}")

# What if the attacker tampers with the tree itself?
tampered_txs = [
    "Alice->Bob:100BTC",  # Tampered!
    "Bob->Carol:0.5BTC",
    "Carol->Dave:0.3BTC",
    "Dave->Eve:0.2BTC"
]
tampered_tree = build_merkle_tree(tampered_txs)
tampered_root = tampered_tree[-1][0]
print(f"\n[Tampered tree root]: {tampered_root[:24]}...")
print(f"[Same as saved root?]: {saved_root == tampered_root}")

# Output:
# [Merkle root saved in block header]: 6b34e2c180f82b5418c7d8...
#
# [Alice verification] Legitimate transaction:
#   Result: ✅ Inclusion confirmed
#
# [Attacker attempt] Tampered transaction:
#   Result: ❌ Forgery detected!
#
# [Tampered tree root]: a91c4f7e38b20d6923f158...
# [Same as saved root?]: False

The attacker has two remaining options: change a transaction, or reconstruct the entire tree. In both cases, it won't match the root stored in the block header. What about changing the block header itself? As we learned in Lesson 5, they'd have to redo PoW, and as we learned in Lesson 6, they'd have to re-mine all subsequent blocks. Merkle tree, proof-of-work, chain linking — these three layers of defense are blockchain security.


📝 Self-Review Checklist

ItemCheck
Does build_merkle_tree() correctly handle odd transaction counts?
Is the result of get_merkle_root() the same as build_merkle_tree()[-1][0]?
Is the number of sibling nodes returned by generate_proof() = tree height - 1?
Does verify_proof() correctly reject tampered transactions?
Is merkle_root included in calculate_hash() of the Block class?

Top 3 Common Mistakes:

  1. Wrong hash_pair order: hash_pair(left, right)hash_pair(right, left) — ignoring direction in the proof breaks verification
  2. Missing odd handling: Always test with 3, 5, 7 transactions etc.
  3. Not handling empty transactions: A block with 0 transactions must still have a Merkle root

🚀 Next Level — How Senior Devs Do It

Things I learned when implementing Merkle proof verification in Solidity:

1. Sorted Hash Pair Above we stored direction, but OpenZeppelin's MerkleProof library takes a different approach. It always sorts the two hashes before combining them:

def hash_pair_sorted(a: str, b: str) -> str:
    """Always put the smaller value on the left — no direction storage needed"""
    if a < b:
        return hash_pair(a, b)
    return hash_pair(b, a)

This eliminates the need for direction information in the proof, reducing data size. On-chain, gas savings matter, so this approach is preferred.

2. Double Hashing Bitcoin actually applies SHA-256 twice: SHA256(SHA256(data)). This defends against length extension attacks. Once is sufficient for our educational code, but production implementations must use double hashing.


🔨 Project Update

Here's the full code with Merkle tree integrated into the PyChain project built so far. The Merkle tree sits on top of the UTXO model, PoW, and chain structure from previous lessons.

"""
PyChain — Cumulative project through Lesson 8
Merkle tree integrated version
"""
import hashlib
import time

# ========== Hash Utilities ==========
def hash_data(data: str) -> str:
    """SHA-256 hash (Lesson 1)"""
    return hashlib.sha256(data.encode()).hexdigest()

def hash_pair(left: str, right: str) -> str:
    """Concatenate two hashes and hash (Lesson 8)"""
    return hash_data(left + right)

# ========== Merkle Tree (Lesson 8 — New) ==========
def build_merkle_tree(transactions: list) -> list:
    """Build the full Merkle tree as a list of levels"""
    if not transactions:
        return [[hash_data("")]]
    tree = []
    current_level = [hash_data(str(tx)) for tx in transactions]
    tree.append(current_level[:])
    while len(current_level) > 1:
        if len(current_level) % 2 == 1:
            current_level.append(current_level[-1])
        next_level = []
        for i in range(0, len(current_level), 2):
            next_level.append(hash_pair(current_level[i], current_level[i+1]))
        tree.append(next_level[:])
        current_level = next_level
    return tree

def get_merkle_root(transactions: list) -> str:
    """Return the Merkle root of a transaction list"""
    tree = build_merkle_tree(transactions)
    return tree[-1][0]

def generate_proof(tree: list, tx_index: int) -> list:
    """Generate a Merkle proof for a specific transaction"""
    proof = []
    index = tx_index
    for level in tree[:-1]:
        working_level = level[:]
        if len(working_level) % 2 == 1:
            working_level.append(working_level[-1])
        if index % 2 == 0:
            sibling = working_level[index + 1]
            proof.append((sibling, 'right'))
        else:
            sibling = working_level[index - 1]
            proof.append((sibling, 'left'))
        index = index // 2
    return proof

def verify_proof(tx_data: str, proof: list, expected_root: str) -> bool:
    """Verify a Merkle proof"""
    current_hash = hash_data(tx_data)
    for sibling_hash, direction in proof:
        if direction == 'left':
            current_hash = hash_pair(sibling_hash, current_hash)
        else:
            current_hash = hash_pair(current_hash, sibling_hash)
    return current_hash == expected_root

# ========== UTXO Model (Lesson 7) ==========
class TxOutput:
    def __init__(self, recipient: str, amount: float):
        self.recipient = recipient
        self.amount = amount
    def __repr__(self):
        return f"{self.recipient}:{self.amount}"

class TxInput:
    def __init__(self, tx_id: str, output_index: int):
        self.tx_id = tx_id
        self.output_index = output_index
    def __repr__(self):
        return f"{self.tx_id[:8]}...[{self.output_index}]"

class Transaction:
    def __init__(self, inputs: list, outputs: list, is_coinbase=False):
        self.inputs = inputs
        self.outputs = outputs
        self.is_coinbase = is_coinbase
        self.tx_id = self.calculate_id()
    def calculate_id(self):
        data = str([(inp.tx_id, inp.output_index) for inp in self.inputs])
        data += str([(out.recipient, out.amount) for out in self.outputs])
        return hash_data(data)
    def __repr__(self):
        return f"Tx({self.tx_id[:8]}...)"

# ========== Block (Lesson 4 + Lesson 8 integrated) ==========
class Block:
    def __init__(self, index, transactions, previous_hash, difficulty=4):
        self.index = index
        self.timestamp = time.time()
        self.transactions = transactions
        self.previous_hash = previous_hash
        self.difficulty = difficulty
        # Compute Merkle root (Lesson 8 integration)
        self.merkle_root = get_merkle_root(
            [str(tx) for tx in transactions]
        )
        self.nonce = 0
        self.hash = self.mine_block()

    def calculate_hash(self):
        block_string = (
            f"{self.index}{self.timestamp}{self.merkle_root}"
            f"{self.previous_hash}{self.nonce}"
        )
        return hash_data(block_string)

    def mine_block(self):
        target = "0" * self.difficulty
        while True:
            h = self.calculate_hash()
            if h[:self.difficulty] == target:
                return h
            self.nonce += 1

# ========== Blockchain (Lesson 6 + 7) ==========
class Blockchain:
    def __init__(self, difficulty=2):
        self.difficulty = difficulty
        self.chain = []
        self.utxo_pool = {}
        self.pending_transactions = []
        self._create_genesis_block()

    def _create_genesis_block(self):
        coinbase = Transaction(
            inputs=[TxInput("0" * 64, 0)],
            outputs=[TxOutput("genesis", 50.0)],
            is_coinbase=True
        )
        self.utxo_pool[(coinbase.tx_id, 0)] = coinbase.outputs[0]
        genesis = Block(0, [coinbase], "0" * 64, self.difficulty)
        self.chain.append(genesis)

    def add_transaction(self, tx: Transaction):
        if tx.is_coinbase or self.validate_transaction(tx):
            self.pending_transactions.append(tx)
            return True
        return False

    def validate_transaction(self, tx: Transaction) -> bool:
        input_sum = 0
        for inp in tx.inputs:
            key = (inp.tx_id, inp.output_index)
            if key not in self.utxo_pool:
                return False
            input_sum += self.utxo_pool[key].amount
        output_sum = sum(out.amount for out in tx.outputs)
        return input_sum >= output_sum

    def mine_pending(self, miner_address: str):
        coinbase = Transaction(
            inputs=[TxInput("0" * 64, 0)],
            outputs=[TxOutput(miner_address, 6.25)],
            is_coinbase=True
        )
        txs = [coinbase] + self.pending_transactions

        # Update UTXO pool
        for tx in txs:
            if not tx.is_coinbase:
                for inp in tx.inputs:
                    del self.utxo_pool[(inp.tx_id, inp.output_index)]
            for i, out in enumerate(tx.outputs):
                self.utxo_pool[(tx.tx_id, i)] = out

        block = Block(
            len(self.chain), txs,
            self.chain[-1].hash, self.difficulty
        )
        self.chain.append(block)
        self.pending_transactions = []
        return block

    def get_balance(self, address: str) -> float:
        return sum(
            utxo.amount for utxo in self.utxo_pool.values()
            if utxo.recipient == address
        )

    def verify_chain(self) -> bool:
        for i in range(1, len(self.chain)):
            block = self.chain[i]
            prev = self.chain[i - 1]
            if block.previous_hash != prev.hash:
                return False
            # Re-verify Merkle root (Lesson 8 addition)
            recalculated_root = get_merkle_root(
                [str(tx) for tx in block.transactions]
            )
            if block.merkle_root != recalculated_root:
                return False
        return True


# ========== Run Tests ==========
if __name__ == "__main__":
    print("=" * 50)
    print("PyChain v0.8 — Merkle Tree Integration")
    print("=" * 50)

    bc = Blockchain(difficulty=2)
    print(f"\nGenesis block created")
    print(f"  genesis balance: {bc.get_balance('genesis')} BTC")

    # Mine rewards to Alice
    block1 = bc.mine_pending("Alice")
    print(f"\nBlock #{block1.index} mined")
    print(f"  Merkle root: {block1.merkle_root[:24]}...")
    print(f"  Block hash:  {block1.hash[:24]}...")
    print(f"  Alice balance: {bc.get_balance('Alice')} BTC")

    # Alice → Bob transfer
    alice_utxo = None
    for key, utxo in bc.utxo_pool.items():
        if utxo.recipient == "Alice":
            alice_utxo = key
            break

    if alice_utxo:
        tx = Transaction(
            inputs=[TxInput(alice_utxo[0], alice_utxo[1])],
            outputs=[
                TxOutput("Bob", 2.0),
                TxOutput("Alice", 4.25)  # Change
            ]
        )
        bc.add_transaction(tx)
        block2 = bc.mine_pending("Alice")
        print(f"\nBlock #{block2.index} mined (Alice→Bob 2BTC)")
        print(f"  Transaction count: {len(block2.transactions)}")
        print(f"  Merkle root: {block2.merkle_root[:24]}...")

    # Merkle proof test
    print(f"\n--- Merkle Proof Test ---")
    tx_strings = [str(tx) for tx in block2.transactions]
    tree = build_merkle_tree(tx_strings)
    proof = generate_proof(tree, 1)  # Second transaction
    is_valid = verify_proof(tx_strings[1], proof, block2.merkle_root)
    print(f"  Tx inclusion proof: {'✅ Success' if is_valid else '❌ Failed'}")
    print(f"  Proof size: {len(proof)} sibling hashes")

    # Chain integrity
    print(f"\n--- Chain Verification ---")
    print(f"  Chain length: {len(bc.chain)} blocks")
    print(f"  Integrity: {'✅ Valid' if bc.verify_chain() else '❌ Corrupted'}")

    # Balance check
    print(f"\n--- Final Balances ---")
    print(f"  Alice: {bc.get_balance('Alice')} BTC")
    print(f"  Bob:   {bc.get_balance('Bob')} BTC")

Expected output:

==================================================
PyChain v0.8 — Merkle Tree Integration
==================================================

Genesis block created
  genesis balance: 50.0 BTC

Block #1 mined
  Merkle root: 8a3c2f1e7b4d9058a6...
  Block hash:  00b7c3f1d829e5b461...
  Alice balance: 6.25 BTC

Block #2 mined (Alice→Bob 2BTC)
  Transaction count: 2
  Merkle root: f2a9c7e3b814d605...
  
--- Merkle Proof Test ---
  Tx inclusion proof: ✅ Success
  Proof size: 1 sibling hashes

--- Chain Verification ---
  Chain length: 3 blocks
  Integrity: ✅ Valid

--- Final Balances ---
  Alice: 10.5 BTC
  Bob:   2.0 BTC

Run it yourself — you can observe the flow where the Merkle root is included in the block header and mined, and where chain verification re-validates the Merkle root.


Merkle Tree at a Glance

Coming up in the next lesson: In Lesson 9, we'll analyze and simulate attempts to break all these security mechanisms — 51% attacks, selfish mining, Sybil attacks. We'll directly attack the PyChain we've built to see just how secure it is.


Difficulty Fork

🟢 If it was easy

Key takeaways:

  • Merkle tree = repeatedly combine transactions in hash pairs to summarize them into one root
  • Merkle proof = with only the sibling nodes on the path to the root, verify in O(log n)
  • SPV = a lightweight client that verifies transaction inclusion using only block headers (containing the Merkle root)

The next lesson analyzes the blockchain security model from an attacker's perspective. We'll simulate what becomes possible when mining hash power exceeds 50%.

🟡 If it was difficult

Let me explain the Merkle tree with a different analogy — think of a team competition.

A company has 8 teams, each submitting a report. The CEO can't read all 8, so:

  1. Pair up every 2 teams and summarize → 4 summaries
  2. Pair up those 4 summaries and summarize again → 2 summaries
  3. Combine the last 2 → a 1-page core summary

The CEO keeps only this 1 page (= Merkle root). Later, to verify "did Team 3 really submit this report?", you only need Team 3's original + Team 4's summary + the summary of the 1-2 pair + the summary of the 5-8 group.

Additional practice:

# Observe the tree shape by changing transaction count to 1, 2, 3, 5, 8
for n in [1, 2, 3, 5, 8]:
    txs = [f"Tx{i}" for i in range(n)]
    tree = build_merkle_tree(txs)
    print(f"Tx {n} → tree height: {len(tree)}, root: {tree[-1][0][:12]}...")
🔴 Challenge

Challenge 1 — Batch multi-proof verification: Verify multiple transactions from a single block simultaneously, removing duplicate sibling nodes to minimize proof size. (This is the optimization actual Bitcoin nodes perform.)

Challenge 2 — Merkle Patricia Trie (MPT): Ethereum doesn't use a simple Merkle tree but a Merkle Patricia Trie. It's a structure that combines a key-value store with a Merkle tree — implement a simplified version. Hint: add the hash of each node to a trie data structure.

Interview question: "If you change the order of transactions in a Merkle tree, does the root change? If so, what are the security implications?" — You should be able to answer within 3 minutes.

Code Playground

Python
Python**❌ WRONG WAY — 홀수 무시: 남는 노드를 그냥 버린다**
Python**🤔 BETTER — 단독 승격: 남는 노드를 그대로 다음 레벨로 올린다**
Python**✅ BEST — 마지막 노드 복제: 비트코인의 실제 방식**
Python아래 코드가 바로 이 ✅ BEST 방식을 적용한 완성형이다.
Python머클 루트만 계산하면 절반밖에 못 한 것이다. **중간 노드까지 전부 저장해야** 나중에 머클 증명을 만들 수 있다.
Python
Python이론과 함수가 갖춰졌으니, 이제 우리 블록체인에 머클 트리를 붙일 차례다. 레슨 4에서 설계한 블록 헤더에 `merkle_root` 필드를 추가한다.
Python❌ **이렇게 하면 안 된다:**
Python✅ **머클 루트를 쓰면:**

Q&A