CUBRID Page Buffer Module - Code Analysis Report

File: src/storage/page_buffer.c (16,935 lines) + src/storage/page_buffer.h (499 lines) Date: 2026-03-18 Purpose: Team code review and presentation


Table of Contents

  1. Executive Summary
  2. Architecture Overview
  3. Core Data Structures
  4. Page Fix/Unfix Lifecycle
  5. LRU & Page Replacement
  6. Concurrency & Locking
  7. Flush Mechanisms
  8. Module Integration
  9. Key Design Patterns
  10. Observations & Discussion Points

1. Executive Summary

The page buffer module is CUBRID’s buffer pool manager – the critical layer between disk I/O and all upper-level storage modules (heap, B-tree, overflow, etc.). It manages a fixed-size pool of in-memory page frames, providing:

  • Page caching with hash-based VPID lookup
  • Latch-based concurrency control (read/write/flush modes)
  • 3-zone LRU replacement with private/shared list partitioning
  • Write-Ahead Logging (WAL) enforcement before page flush
  • Background daemon threads for flush, maintenance, and post-flush processing
  • Direct victim assignment for high-throughput page replacement
  • 2Q-style Aout history for improved replacement decisions

Key metrics: ~17K lines of C/C++, 24+ consumer modules, 4 background daemons, atomic CAS-based latching.


2. Architecture Overview

+------------------------------------------------------------------+
|                     Consumer Modules                              |
|  (heap_file, btree, overflow, vacuum, log_recovery, file_mgr)    |
+------------------------------------------------------------------+
          |  pgbuf_fix() / pgbuf_unfix() / pgbuf_set_dirty()
          v
+------------------------------------------------------------------+
|                    PAGE BUFFER MODULE                             |
|                                                                  |
|  +------------------+  +------------------+  +----------------+  |
|  |   Hash Table     |  |   BCB Table      |  |  IO Page Table |  |
|  | (VPID -> BCB)    |  | (metadata array) |  | (page frames)  |  |
|  +------------------+  +------------------+  +----------------+  |
|                                                                  |
|  +------------------+  +------------------+  +----------------+  |
|  | LRU Lists        |  | Invalid List     |  | Aout List      |  |
|  | (shared+private) |  | (free BCBs)      |  | (2Q history)   |  |
|  +------------------+  +------------------+  +----------------+  |
|                                                                  |
|  +------------------+  +------------------+  +----------------+  |
|  | Direct Victims   |  | Flush Daemons    |  | Page Quota     |  |
|  | (lock-free queue) |  | (4 daemons)      |  | (per-thread)   |  |
|  +------------------+  +------------------+  +----------------+  |
+------------------------------------------------------------------+
          |  fileio_read() / fileio_write() / dwb_add_page()
          v
+------------------------------------------------------------------+
|               File I/O Layer + Double Write Buffer                |
+------------------------------------------------------------------+

Memory Layout

The buffer pool allocates three parallel arrays at initialization:

Array Element Purpose
BCB_table[num_buffers] PGBUF_BCB Per-page metadata (VPID, latch state, flags, LRU links)
iopage_table[num_buffers] PGBUF_IOPAGE_BUFFER Actual page data (IO_PAGESIZE bytes + header)
buf_hash_table[PGBUF_HASH_SIZE] PGBUF_BUFFER_HASH Hash buckets for VPID lookup

BCB index i corresponds to iopage index i. The PGBUF_FIND_BCB_PTR(i) and PGBUF_FIND_IOPAGE_PTR(i) macros compute addresses via pointer arithmetic.


3. Core Data Structures

3.1 PGBUF_BCB (Buffer Control Block)

struct pgbuf_bcb {
    pthread_mutex_t mutex;          // Per-BCB mutex (SERVER_MODE only)
    int owner_mutex;                // Mutex owner thread
    VPID vpid;                      // Volume + Page ID of resident page
    PGBUF_ATOMIC_LATCH atomic_latch; // Atomic 64-bit: {latch_mode, waiter_exists, fix_count}
    volatile int flags;             // Dirty, flushing, victim, vacuum flags + zone + LRU index
    THREAD_ENTRY *next_wait_thrd;   // Wait queue for latch conflicts
    THREAD_ENTRY *latch_last_thread; // Debug: last thread to acquire latch
    PGBUF_BCB *hash_next;           // Hash chain link
    PGBUF_BCB *prev_BCB, *next_BCB; // Doubly-linked LRU chain
    int tick_lru_list;              // Age tracking for boost decisions
    int tick_lru3;                  // Position in victim zone
    volatile int count_fix_and_avoid_dealloc; // Dual-purpose: fix count (upper 16) + avoid dealloc (lower 16)
    int hit_age;                    // Last hit age for quota/activity tracking
    LOG_LSA oldest_unflush_lsa;     // Oldest unflushed LSA (WAL constraint)
    PGBUF_IOPAGE_BUFFER *iopage_buffer; // Pointer to actual page data
};

Key design choice: The atomic_latch packs latch mode, waiter flag, and fix count into a single 64-bit atomic, enabling CAS-based operations without holding the BCB mutex for read-only fast paths.

union pgbuf_atomic_latch_impl {
    uint64_t raw;           // For atomic CAS
    struct {
        PGBUF_LATCH_MODE latch_mode;  // 2 bytes
        uint16_t waiter_exists;       // 2 bytes
        int32_t fcnt;                 // 4 bytes (fix count)
    } impl;
};

3.2 BCB Flags (Packed in volatile int flags)

Flag Bit Description
PGBUF_BCB_DIRTY_FLAG 0x80000000 Page modified, needs flush
PGBUF_BCB_FLUSHING_TO_DISK_FLAG 0x40000000 Flush in progress
PGBUF_BCB_VICTIM_DIRECT_FLAG 0x20000000 Assigned as direct victim
PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG 0x10000000 Direct victim was re-fixed
PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG 0x08000000 Move to LRU bottom on unfix
PGBUF_BCB_TO_VACUUM_FLAG 0x04000000 Needs vacuuming
PGBUF_BCB_ASYNC_FLUSH_REQ 0x02000000 Async flush requested

The lower bits encode the zone (LRU_1, LRU_2, LRU_3, INVALID, VOID) and LRU list index (16 bits).

3.3 PGBUF_BUFFER_HASH

struct pgbuf_buffer_hash {
    pthread_mutex_t hash_mutex;     // Per-bucket mutex
    PGBUF_BCB *hash_next;          // Hash chain anchor (linked list of BCBs)
    PGBUF_BUFFER_LOCK *lock_next;  // Buffer lock chain (for page-level locks)
};

Hash function: Mirror-bit hash on VPID, table size = 2^20 (1M buckets).

#define HASH_SIZE_BITS 20
#define PGBUF_HASH_SIZE (1 << HASH_SIZE_BITS)  // 1,048,576 buckets

3.4 PGBUF_LRU_LIST

struct pgbuf_lru_list {
    pthread_mutex_t mutex;          // Per-LRU mutex
    PGBUF_BCB *top, *bottom;       // Doubly-linked list endpoints
    PGBUF_BCB *bottom_1, *bottom_2; // Zone boundary markers
    PGBUF_BCB *volatile victim_hint; // Hint for victim search start
    int count_lru1, count_lru2, count_lru3; // Per-zone counts
    int count_vict_cand;            // Victim candidate count
    int threshold_lru1, threshold_lru2;     // Zone size thresholds
    int quota;                      // Private list quota
    int tick_list, tick_lru3;       // Age ticks
    volatile int flags;
    int index;
};

3.5 PGBUF_BUFFER_POOL (The Global Pool)

struct pgbuf_buffer_pool {
    int num_buffers;                // Total buffer count
    PGBUF_BCB *BCB_table;          // BCB array
    PGBUF_BUFFER_HASH *buf_hash_table;  // Hash table (1M buckets)
    PGBUF_BUFFER_LOCK *buf_lock_table;  // Lock table (1 per thread)
    PGBUF_IOPAGE_BUFFER *iopage_table;  // IO page array
    int num_LRU_list;              // Shared LRU count
    PGBUF_LRU_LIST *buf_LRU_list;  // All LRU lists (shared + private)
    PGBUF_AOUT_LIST buf_AOUT_list; // 2Q Aout history
    PGBUF_INVALID_LIST buf_invalid_list; // Free BCB list
    PGBUF_PAGE_MONITOR monitor;    // Stats & monitoring
    PGBUF_PAGE_QUOTA quota;        // Quota management
    PGBUF_HOLDER_ANCHOR *thrd_holder_info; // Per-thread holder tracking
    // SERVER_MODE only:
    PGBUF_DIRECT_VICTIM direct_victims;           // Direct victim assignment queues
    lockfree::circular_queue<PGBUF_BCB *> *flushed_bcbs;  // Post-flush queue
    lockfree::circular_queue<int> *private_lrus_with_victims; // LFCQ for victim search
    lockfree::circular_queue<int> *shared_lrus_with_victims;
};

4. Page Fix/Unfix Lifecycle

4.1 pgbuf_fix() – The Main Entry Point

This is the most critical function. Every page access in CUBRID goes through pgbuf_fix().

pgbuf_fix(thread_p, vpid, fetch_mode, request_mode, condition)
  |
  |-- 1. Parameter validation (latch mode, condition)
  |-- 2. Interrupt check
  |-- 3. FAST PATH: lock-free read-only fix (pgbuf_lockfree_fix_ro)
  |     |-- Atomic increment of fix count (no BCB mutex!)
  |     |-- Only for: OLD_PAGE + PGBUF_LATCH_READ + UNCONDITIONAL
  |     |-- Returns immediately on success
  |
  |-- 4. NORMAL PATH: hash chain search
  |     |-- hash_anchor = buf_hash_table[PGBUF_HASH_VALUE(vpid)]
  |     |-- Lock hash_mutex
  |     |-- Walk hash chain to find matching VPID
  |     |
  |     |-- HIT: BCB found in hash
  |     |   |-- Lock BCB mutex, unlock hash mutex
  |     |   |-- Handle direct victim conflicts
  |     |   |-- Proceed to latch acquisition
  |     |
  |     |-- MISS: BCB not found
  |         |-- If OLD_PAGE_IF_IN_BUFFER: return NULL
  |         |-- pgbuf_claim_bcb_for_fix():
  |             |-- pgbuf_lock_page() (prevent duplicate loads)
  |             |-- pgbuf_allocate_bcb() -> pgbuf_get_victim()
  |             |-- Read page from disk (fileio_read)
  |             |-- Insert BCB into hash chain
  |             |-- pgbuf_unlock_page()
  |
  |-- 5. Register fix count (pgbuf_bcb_register_fix)
  |-- 6. Validate page VPID
  |-- 7. Handle OLD_PAGE_PREVENT_DEALLOC
  |-- 8. LATCH ACQUISITION (pgbuf_latch_bcb_upon_fix):
  |     |-- If no conflict: set latch atomically, return
  |     |-- If conflict: pgbuf_block_bcb() -> thread sleeps
  |     |-- Conditional latch: return NULL on conflict
  |
  |-- 9. Handle deallocated pages per fetch_mode
  |-- 10. Performance tracking
  |-- 11. Return page pointer

4.2 Page Fetch Modes

Mode Description
OLD_PAGE Standard fetch – page must exist on disk or in buffer
NEW_PAGE Newly allocated page – can be created in buffer without disk read
OLD_PAGE_IF_IN_BUFFER Opportunistic – return NULL if not in buffer (no disk I/O)
OLD_PAGE_PREVENT_DEALLOC Fix + prevent concurrent deallocation
OLD_PAGE_DEALLOCATED Fix a deallocated page (expected)
OLD_PAGE_MAYBE_DEALLOCATED Fix page that may have been deallocated (no error if so)
RECOVERY_PAGE Recovery context – no validation constraints

4.3 Latch Modes

Mode Value Description
PGBUF_NO_LATCH 0 No latch held
PGBUF_LATCH_READ 1 Shared read latch (multiple concurrent readers)
PGBUF_LATCH_WRITE 2 Exclusive write latch
PGBUF_LATCH_FLUSH 3 Used only as block mode during flush

4.4 pgbuf_unfix() Flow

pgbuf_unfix(thread_p, pgptr)
  |
  |-- 1. Convert page pointer to BCB pointer (CAST_PGPTR_TO_BFPTR)
  |-- 2. Validation checks (debug mode)
  |-- 3. Unlatch thread holder (pgbuf_unlatch_thrd_holder)
  |     |-- Decrement holder fix_count
  |     |-- If fix_count reaches 0: remove holder from thread's list
  |-- 4. FAST PATH: lock-free read-only unfix (pgbuf_lockfree_unfix_ro)
  |     |-- Atomic decrement fix count
  |     |-- If read latch + no other state changes needed: return
  |-- 5. NORMAL PATH: lock BCB mutex
  |     |-- pgbuf_unlatch_bcb_upon_unfix():
  |         |-- Decrement fix count atomically
  |         |-- If fix_count > 0: just release mutex
  |         |-- If fix_count == 0:
  |             |-- LRU boost decision (if bcb is "old enough")
  |             |-- Wake blocked waiters
  |             |-- Handle move-to-bottom flag
  |             |-- Handle direct victim assignment
  |-- 6. Performance tracking

4.5 Lock-Free Fast Path (Critical Optimization)

For the common case of read-only page access (OLD_PAGE + PGBUF_LATCH_READ + UNCONDITIONAL):

Fix: pgbuf_lockfree_fix_ro() searches hash chain without BCB mutex, uses atomic CAS to increment fix count only if latch mode is already READ. No mutex acquired at all.

Unfix: pgbuf_lockfree_unfix_ro() atomically decrements fix count if no state changes are pending (no dirty marking, no waiters, etc.).

This is a major performance optimization for read-heavy workloads.

4.6 Latch Promotion

pgbuf_promote_read_latch() upgrades READ -> WRITE:

  • In-place promotion: If the caller is the only holder (fix_count == holder’s fix_count), directly change latch mode via CAS.
  • Blocking promotion: If other readers exist and condition is PGBUF_PROMOTE_SHARED_READER, temporarily unfix, queue as first blocker with wait_for_latch_promote = true, and re-acquire as WRITE.
  • Fail: If another promoter is already waiting, or condition is PGBUF_PROMOTE_ONLY_READER with multiple readers -> return ER_PAGE_LATCH_PROMOTE_FAIL.

4.7 Ordered Fix (Deadlock Prevention)

pgbuf_ordered_fix() / pgbuf_ordered_unfix() implement VPID-ordered latch acquisition to prevent deadlocks when multiple pages must be fixed simultaneously:

  • Uses PGBUF_WATCHER to track fix order and group (HFID-based grouping for heap pages)
  • When a page with lower VPID needs to be fixed while a higher VPID is held, the higher one is unfixed first, then both are fixed in VPID order
  • Ranks: PGBUF_ORDERED_HEAP_HDR > PGBUF_ORDERED_HEAP_NORMAL > PGBUF_ORDERED_HEAP_OVERFLOW

5. LRU & Page Replacement

5.1 Three-Zone LRU Design

Each LRU list is divided into three zones:

  TOP (most recently used)
  +---------------------------+
  |        LRU Zone 1         |  HOT zone - no victimization
  |    (hottest pages)        |  No boost on unfix (already hot)
  +---------------------------+  <-- bottom_1
  |        LRU Zone 2         |  BUFFER zone - no victimization
  |    (warm pages)           |  Pages can be boosted back to top
  +---------------------------+  <-- bottom_2
  |        LRU Zone 3         |  VICTIM zone - candidates for replacement
  |    (cold pages)           |  victim_hint starts search here
  +---------------------------+
  BOTTOM (least recently used)
  • Zone 1: Hottest pages. No boost needed (they’re already at the top). Cannot be victimized.
  • Zone 2: Buffer/warm zone. Pages falling from zone 1 get a chance to be boosted back. Not victimizable.
  • Zone 3: Victim zone. Pages here are candidates for replacement. Can still be boosted if accessed.

Zone sizes are controlled by configurable ratios (ratio_lru1, ratio_lru2) with min/max bounds (5%-90%).

5.2 Shared vs Private LRU Lists

LRU List Array:
  [0..num_LRU_list-1]                              : Shared LRU lists (all threads)
  [num_LRU_list..num_LRU_list+num_garbage-1]       : Shared garbage LRU lists
  [num_LRU_list+num_garbage..TOTAL_LRU-1]          : Private LRU lists (per-transaction)
  • Shared LRUs: Any thread can add/victim from these. New pages go to a round-robin selected shared list.
  • Shared Garbage LRUs: Hold shared pages pending vacuum or relocated from destroyed private LRUs. BCBs here are prioritized for victimization.
  • Private LRUs: Each transaction (thread) has its own LRU list. Pages fixed by a thread with a private LRU go to that list.
  • Quota system: Private LRUs have quotas based on transaction activity. Transactions exceeding quota have their BCBs moved to shared lists.

5.3 Victim Selection (pgbuf_get_victim())

pgbuf_get_victim()
  |
  |-- 1. Try direct victim (pgbuf_get_direct_victim)
  |     |-- Check if another thread has pre-assigned a victim BCB
  |     |-- Lock-free circular queue based
  |
  |-- 2. Try LFCQ (Lock-Free Circular Queue) victim search
  |     |-- Check private_lrus_with_victims queue
  |     |-- Check big_private_lrus_with_victims queue
  |     |-- Check shared_lrus_with_victims queue
  |     |-- For each LRU: scan zone 3 for non-dirty, non-fixed BCBs
  |
  |-- 3. Try invalid list (pgbuf_get_bcb_from_invalid_list)
  |     |-- Free BCBs that haven't been assigned to any page yet
  |
  |-- 4. If all fail: flush victim candidates and retry
  |     |-- pgbuf_wakeup_page_flush_daemon()
  |     |-- Thread waits for direct victim assignment

5.4 Aout List (2Q History)

The Aout list maintains a FIFO history of recently evicted VPIDs:

  • When a BCB is victimized, its VPID is added to the Aout list
  • When a page is fetched and found in Aout (recently evicted), it’s placed at the top of the LRU (hot position)
  • When a page is fetched and NOT in Aout, it’s placed in the middle (zone 2 boundary)

This implements the “second chance” aspect of the 2Q algorithm, preventing scan-resistant pages from being immediately evicted after a single access.

5.5 Direct Victim Assignment (SERVER_MODE)

An optimization to avoid victim search overhead:

struct pgbuf_direct_victim {
    PGBUF_BCB **bcb_victims;                    // Pre-assigned victim BCBs
    circular_queue<THREAD_ENTRY *> *waiter_threads_high_priority;
    circular_queue<THREAD_ENTRY *> *waiter_threads_low_priority;
};

When a flush thread cleans a dirty page, it can directly assign the now-clean BCB to a waiting thread via a lock-free queue, bypassing the normal victim search entirely.


6. Concurrency & Locking

6.1 Lock Hierarchy (Coarse to Fine)

Lock Type Granularity Protects
Hash mutex pthread_mutex_t Per hash bucket (1M buckets) Hash chain integrity
BCB mutex pthread_mutex_t Per BCB BCB metadata, wait queue, flags
LRU mutex pthread_mutex_t Per LRU list LRU list structure
Invalid list mutex pthread_mutex_t Global (1) Free BCB list
Aout mutex pthread_mutex_t Global (1) Aout history list
Atomic latch std::atomic<uint64_t> Per BCB Latch mode + fix count + waiter flag
Free holder set mutex pthread_mutex_t Global (1) BCB holder allocation

6.2 Lock Ordering (Inferred)

To prevent deadlocks, the following ordering is observed:

hash_mutex  -->  BCB mutex  -->  LRU mutex
    (never hold LRU mutex while acquiring hash_mutex or BCB mutex)
  • Hash mutex is acquired first during page lookup
  • BCB mutex is acquired while holding hash mutex, then hash mutex is released
  • LRU mutex is acquired independently (never while holding hash or BCB mutex in the fix path)

6.3 Atomic Latch Operations

All latch state changes use CAS loops on the 64-bit atomic_latch:

// Example: set_latch_and_add_fcnt
do {
    impl.raw = latch->load(std::memory_order_acquire);
    new_impl = impl;
    new_impl.impl.latch_mode = latch_mode;
    new_impl.impl.fcnt += cnt;
} while (!latch->compare_exchange_weak(impl.raw, new_impl.raw,
         std::memory_order_acq_rel, std::memory_order_acquire));

This allows the lock-free fast path to modify latch state without the BCB mutex.

6.4 Thread Wait Queue

When a latch conflict occurs, the requesting thread is placed on the BCB’s wait queue (next_wait_thrd). The pgbuf_block_bcb() function:

  1. Adds the thread to the wait queue (linked list via THREAD_ENTRY)
  2. Sets waiter_exists flag on the atomic latch
  3. Releases BCB mutex
  4. Calls pgbuf_timed_sleep() – thread blocks with timeout (default 300 seconds)

When a latch is released and waiters exist, pgbuf_wakeup_reader_writer() wakes the next compatible waiter.

6.5 BCB Mutex Monitoring

Debug infrastructure for detecting mutex leaks:

struct pgbuf_monitor_bcb_mutex {
    PGBUF_BCB *bcb;
    PGBUF_BCB *bcb_second;
    int line, line_second;
};

When pgbuf_Monitor_locks is enabled, all BCB lock/unlock operations are tracked per-thread to detect leaks.

6.6 SA_MODE (Single-Thread) Stubs

In standalone mode, all mutex operations are no-ops:

#if !defined(SERVER_MODE)
#define pthread_mutex_lock(a)   0
#define pthread_mutex_unlock(a)
#define PGBUF_BCB_LOCK(bcb)
#define PGBUF_BCB_UNLOCK(bcb)
#endif

7. Flush Mechanisms

7.1 Background Daemons (SERVER_MODE)

Daemon Purpose
pgbuf_Page_flush_daemon Flushes dirty victim candidates
pgbuf_Page_maintenance_daemon LRU maintenance, quota adjustment
pgbuf_Page_post_flush_daemon Post-flush processing (assigns flushed BCBs to waiters)
pgbuf_Flush_control_daemon Adaptive flush rate control

7.2 Flush Paths

Individual flush (pgbuf_flush_with_wal): - Called by threads holding WRITE latch - Acquires BCB mutex -> pgbuf_bcb_safe_flush_force_unlock() - Ensures WAL constraint: log must be flushed up to oldest_unflush_lsa before page write

Bulk flush (pgbuf_flush_all_helper): - Scans entire BCB table - For each dirty BCB: lock, check conditions, flush, unlock - Used for pgbuf_flush_all() and pgbuf_flush_all_unfixed()

Checkpoint flush (pgbuf_flush_checkpoint): - Flushes pages with LSA <= flush_upto_lsa - Uses sequential flusher with rate control - Tracks chkpt_smallest_lsa for next checkpoint

Victim candidate flush (pgbuf_flush_victim_candidates): - Collects dirty BCBs from LRU zone 3 - Sorts by VPID for sequential I/O - Flushes with neighbor page batching

7.3 Neighbor Flush Optimization

When flushing a page, the system also flushes neighbor pages (pages with adjacent page IDs on the same volume) to improve sequential I/O:

#define PGBUF_MAX_NEIGHBOR_PAGES 32  // max neighbors in one batch

pgbuf_flush_page_and_neighbors_fb() collects dirty neighbors and flushes them together.

7.4 WAL Enforcement

Before any page write to disk:

pgbuf_bcb_flush_with_wal():
    if (oldest_unflush_lsa > log_Gl.append.prev_lsa)
        log_flush_up_to(oldest_unflush_lsa)  // Ensure log is flushed first
    fileio_write() or dwb_add_page()  // Then write page

7.5 Flush Rate Control

The sequential flusher uses interval-based rate control: - Each 1-second period is divided into sub-intervals - Pages are flushed equally across intervals - Burst mode vs spread mode configurable - Compensation applied across intervals to meet overall target rate


8. Module Integration

8.1 Consumer Usage Pattern

All upper modules follow the same pattern:

// Standard read pattern
pgptr = pgbuf_fix(thread_p, &vpid, OLD_PAGE, PGBUF_LATCH_READ, PGBUF_UNCONDITIONAL_LATCH);
// ... read page content ...
pgbuf_unfix(thread_p, pgptr);

// Standard modify pattern
pgptr = pgbuf_fix(thread_p, &vpid, OLD_PAGE, PGBUF_LATCH_WRITE, PGBUF_UNCONDITIONAL_LATCH);
// ... modify page content ...
// ... log the change (WAL) ...
pgbuf_set_dirty(thread_p, pgptr, FREE);  // marks dirty + unfixes

8.2 Key Consumers (24+ modules)

Module File Usage
Heap File heap_file.c Record storage, page allocation/dealloc
B-tree btree_load.c Index page management
Overflow overflow_file.c Large record overflow pages
Log Manager log_manager.c WAL coordination, checkpoint
Log Recovery log_recovery.c Page restoration during recovery
Vacuum vacuum.c MVCC garbage collection
Disk Manager disk_manager.c Volume/page allocation
File I/O file_io.c Actual disk read/write
File Manager file_manager.c File/page mapping
External Sort external_sort.c Sort temp pages
Lock Manager lock_manager.c Lock table pages
Slotted Page slotted_page.c Slotted page operations
Hash Scan query_hash_scan.c Hash join pages

8.3 WAL Protocol Integration

[Modify page]
     |
     v
[pgbuf_set_dirty() -> sets DIRTY flag, records oldest_unflush_lsa]
     |
     v
[log_append_*() -> writes log record to log buffer]
     |
     v
[Eventually: flush daemon or checkpoint]
     |
     v
[pgbuf_bcb_flush_with_wal() -> ensures log is flushed FIRST, then writes page]

The oldest_unflush_lsa on each BCB tracks the earliest log record that must be on disk before the page can be flushed. This is the core WAL guarantee.

8.4 Double Write Buffer Integration

Pages are written through the Double Write Buffer (DWB) to prevent torn page writes:

pgbuf_bcb_flush_with_wal()
  -> dwb_add_page()          // Add to DWB
  -> When DWB is full:
     -> Write entire DWB block to DWB file (sequential I/O)
     -> Then write individual pages to their actual locations

8.5 Transparent Data Encryption (TDE)

Pages can be encrypted/decrypted transparently: - pgbuf_set_tde_algorithm() – set encryption algorithm per page - pgbuf_get_tde_algorithm() – get current algorithm - Encryption/decryption happens during I/O operations


9. Key Design Patterns

9.1 BCB Pointer <-> Page Pointer Conversion

CAST_PGPTR_TO_BFPTR(bufptr, pgptr)   // PAGE_PTR -> PGBUF_BCB*
CAST_BFPTR_TO_PGPTR(pgptr, bufptr)   // PGBUF_BCB* -> PAGE_PTR

The page pointer points to iopage_buffer->iopage.page, and the BCB is recovered via pointer arithmetic using offsetof.

9.2 Holder Tracking

Each thread maintains a holder list tracking which BCBs it has fixed: - thrd_holder_info[thread_index] -> linked list of PGBUF_HOLDER - Each holder has: fix_count, bufptr, performance stats, watcher list - Default 7 pre-allocated holders per thread (PGBUF_DEFAULT_FIX_COUNT) - Overflow allocation from shared free_holder_set (one PGBUF_HOLDER_SET at a time, each containing PGBUF_NUM_ALLOC_HOLDER entries)

9.3 Watcher System (Ordered Fix Support)

PGBUF_WATCHER structs track page fixes with ordering metadata: - group_id: HFID-based group for heap pages - initial_rank / curr_rank: Priority ordering - page_was_unfixed: Flag indicating refix occurred - Attached to holders via doubly-linked list

9.4 Performance Monitoring Hooks

Extensive perfmon_* calls track: - Fix counts by page type, latch mode, conditional type - Latch wait times (lock acquisition, holder wait, total fix time) - Promote success/failure counts - Unfix statistics by dirty state and latch mode - Page type distribution snapshots


10. Observations & Discussion Points

10.1 Strengths

  1. Lock-free read fast path: The pgbuf_lockfree_fix_ro() / pgbuf_lockfree_unfix_ro() optimization is excellent for read-heavy workloads – no mutex acquisition at all for the common case.

  2. 3-zone LRU with 2Q Aout: Sophisticated replacement policy that handles both frequency and recency, with scan resistance through the Aout history.

  3. Private/Shared LRU partitioning: Reduces contention between transactions with different working sets and provides per-transaction page quota management.

  4. Direct victim assignment: Avoids wasted CPU cycles on victim search by pre-assigning flushed BCBs to waiting threads via lock-free queues.

  5. Comprehensive debugging: BCB mutex monitoring, page validation levels, fixed-at tracking, watcher magic numbers, and resource tracking in debug builds.

10.2 Complexity Concerns

  1. File size (17K lines): The module is very large for a single file. Could benefit from splitting into sub-modules (lru.c, flush.c, hash.c, etc.).

  2. Flag packing complexity: BCB flags, zone, and LRU index are all packed into a single volatile int. While space-efficient, this creates complex bitwise operations and potential for subtle bugs.

  3. count_fix_and_avoid_dealloc dual-purpose field: Using a single int for two independent counters (fix count in upper 16 bits, avoid dealloc in lower 16 bits) is fragile and the comment acknowledges the reason (need for atomic operations) is somewhat forced.

  4. TODO comment in LRU list (line ~589-591): ``` /* TODO: I have noticed while investigating core files from TPCC that hint is

    • sometimes before first bcb that can be victimized. this means there is
    • a logic error somewhere. */ ``` This is an acknowledged bug in victim hint management.

10.3 Concurrency Risk Areas

  1. BCB flags are volatile int with non-atomic read-modify-write: While BCB mutex should be held for most flag changes, the volatile qualifier alone doesn’t prevent torn reads/writes on some architectures. The code uses pgbuf_bcb_update_flags() which does CAS on the flags, but there are also direct flag reads without the mutex.

  2. Hash chain search with BCB trylock: pgbuf_search_hash_chain() walks the hash chain under hash_mutex and tries to lock BCB mutex with trylock. On failure, it must restart the search, which can lead to livelock under extreme contention.

  3. Lock ordering between BCB mutex and LRU mutex: While generally well-ordered, the code has complex paths where LRU operations happen after BCB unlock, creating windows where BCB state can change between operations.

10.4 Potential Improvements to Discuss

  1. Partitioned hash table locks: Current 1M hash buckets with per-bucket mutex is good, but consider lock-free hash table for the read path.

  2. NUMA awareness: The current design doesn’t appear to have NUMA-aware buffer placement, which could matter for large buffer pools.

  3. Adaptive zone sizing: Zone thresholds are currently ratio-based. Could benefit from adaptive sizing based on workload characteristics.

  4. Metric-driven flush tuning: The flush rate control system (PGBUF_FLUSH_VICTIM_BOOST_MULT) uses a fixed multiplier. Adaptive algorithms could improve I/O scheduling.

10.5 Key Configuration Parameters

Parameter Description Impact
PRM_ID_PB_NBUFFERS Total buffer count Memory vs hit ratio tradeoff
PRM_ID_PB_NEIGHBOR_FLUSH_PAGES Neighbor flush count Sequential I/O vs flush overhead
PRM_ID_PB_NEIGHBOR_FLUSH_NONDIRTY Flush non-dirty neighbors too I/O pattern optimization
LRU zone ratios Zone 1/2 size ratios Hot page protection vs victim availability

Appendix: Function Index (Key Functions)

Function Line Purpose
pgbuf_initialize 1515 Initialize buffer pool
pgbuf_fix (debug/release) 2034 Fix (pin) a page in buffer
pgbuf_lockfree_fix_ro ~2132 Lock-free read-only fix fast path
pgbuf_unfix 2847 Unfix (unpin) a page
pgbuf_promote_read_latch 2620 Upgrade READ latch to WRITE
pgbuf_ordered_fix (macro) VPID-ordered fix for deadlock prevention
pgbuf_set_dirty (macro) Mark page as dirty
pgbuf_flush_with_wal 3361 Flush single page with WAL
pgbuf_flush_checkpoint 3957 Checkpoint flush
pgbuf_flush_victim_candidates (h:346) Flush dirty victim candidates
pgbuf_get_victim ~1065 Find a victim BCB for replacement
pgbuf_search_hash_chain ~1035 Look up VPID in hash table
pgbuf_latch_bcb_upon_fix ~1031 Acquire latch during fix
pgbuf_block_bcb ~1029 Block thread on latch conflict
pgbuf_adjust_quotas (h:480) Adjust private LRU quotas
pgbuf_direct_victims_maintenance (h:484) Maintain direct victim assignment

Report generated for CUBRID page buffer code review. Based on analysis of src/storage/page_buffer.c (develop branch).