CUBRID Page Buffer Module - Code Analysis Report¶
File: src/storage/page_buffer.c (16,935 lines) + src/storage/page_buffer.h (499 lines)
Date: 2026-03-18
Purpose: Team code review and presentation
Table of Contents¶
- Executive Summary
- Architecture Overview
- Core Data Structures
- Page Fix/Unfix Lifecycle
- LRU & Page Replacement
- Concurrency & Locking
- Flush Mechanisms
- Module Integration
- Key Design Patterns
- Observations & Discussion Points
1. Executive Summary¶
The page buffer module is CUBRID’s buffer pool manager – the critical layer between disk I/O and all upper-level storage modules (heap, B-tree, overflow, etc.). It manages a fixed-size pool of in-memory page frames, providing:
- Page caching with hash-based VPID lookup
- Latch-based concurrency control (read/write/flush modes)
- 3-zone LRU replacement with private/shared list partitioning
- Write-Ahead Logging (WAL) enforcement before page flush
- Background daemon threads for flush, maintenance, and post-flush processing
- Direct victim assignment for high-throughput page replacement
- 2Q-style Aout history for improved replacement decisions
Key metrics: ~17K lines of C/C++, 24+ consumer modules, 4 background daemons, atomic CAS-based latching.
2. Architecture Overview¶
+------------------------------------------------------------------+
| Consumer Modules |
| (heap_file, btree, overflow, vacuum, log_recovery, file_mgr) |
+------------------------------------------------------------------+
| pgbuf_fix() / pgbuf_unfix() / pgbuf_set_dirty()
v
+------------------------------------------------------------------+
| PAGE BUFFER MODULE |
| |
| +------------------+ +------------------+ +----------------+ |
| | Hash Table | | BCB Table | | IO Page Table | |
| | (VPID -> BCB) | | (metadata array) | | (page frames) | |
| +------------------+ +------------------+ +----------------+ |
| |
| +------------------+ +------------------+ +----------------+ |
| | LRU Lists | | Invalid List | | Aout List | |
| | (shared+private) | | (free BCBs) | | (2Q history) | |
| +------------------+ +------------------+ +----------------+ |
| |
| +------------------+ +------------------+ +----------------+ |
| | Direct Victims | | Flush Daemons | | Page Quota | |
| | (lock-free queue) | | (4 daemons) | | (per-thread) | |
| +------------------+ +------------------+ +----------------+ |
+------------------------------------------------------------------+
| fileio_read() / fileio_write() / dwb_add_page()
v
+------------------------------------------------------------------+
| File I/O Layer + Double Write Buffer |
+------------------------------------------------------------------+
Memory Layout¶
The buffer pool allocates three parallel arrays at initialization:
| Array | Element | Purpose |
|---|---|---|
| `BCB_table[num_buffers]` | `PGBUF_BCB` | Per-page metadata (VPID, latch state, flags, LRU links) |
| `iopage_table[num_buffers]` | `PGBUF_IOPAGE_BUFFER` | Actual page data (IO_PAGESIZE bytes + header) |
| `buf_hash_table[PGBUF_HASH_SIZE]` | `PGBUF_BUFFER_HASH` | Hash buckets for VPID lookup |
BCB index i corresponds to iopage index i. The PGBUF_FIND_BCB_PTR(i) and PGBUF_FIND_IOPAGE_PTR(i) macros compute addresses via pointer arithmetic.
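The parallel-array correspondence can be sketched as follows; the struct contents, sizes, and macro names are simplified placeholders, not the real CUBRID definitions:

```c
#include <stddef.h>

#define NUM_BUFFERS 8
#define IO_PAGESIZE 64              /* real size is configurable (e.g. 16KB) */

typedef struct { short volid; int pageid; } VPID;
typedef struct { VPID vpid; int flags; } PGBUF_BCB_SKETCH;
typedef struct { char page[IO_PAGESIZE]; } PGBUF_IOPAGE_SKETCH;

static PGBUF_BCB_SKETCH bcb_table[NUM_BUFFERS];
static PGBUF_IOPAGE_SKETCH iopage_table[NUM_BUFFERS];

/* Index i addresses both parallel arrays, as the real
 * PGBUF_FIND_BCB_PTR / PGBUF_FIND_IOPAGE_PTR macros do via
 * pointer arithmetic on the array bases. */
#define FIND_BCB_PTR(i)    (&bcb_table[(i)])
#define FIND_IOPAGE_PTR(i) (&iopage_table[(i)])

/* Recover the index from a BCB pointer by pointer subtraction. */
static int bcb_index (const PGBUF_BCB_SKETCH *bcb)
{
  return (int) (bcb - bcb_table);
}
```

Because the arrays are parallel, no per-BCB pointer to the frame index is needed; the index itself links metadata to data.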
3. Core Data Structures¶
3.1 PGBUF_BCB (Buffer Control Block)¶
struct pgbuf_bcb {
pthread_mutex_t mutex; // Per-BCB mutex (SERVER_MODE only)
int owner_mutex; // Mutex owner thread
VPID vpid; // Volume + Page ID of resident page
PGBUF_ATOMIC_LATCH atomic_latch; // Atomic 64-bit: {latch_mode, waiter_exists, fix_count}
volatile int flags; // Dirty, flushing, victim, vacuum flags + zone + LRU index
THREAD_ENTRY *next_wait_thrd; // Wait queue for latch conflicts
THREAD_ENTRY *latch_last_thread; // Debug: last thread to acquire latch
PGBUF_BCB *hash_next; // Hash chain link
PGBUF_BCB *prev_BCB, *next_BCB; // Doubly-linked LRU chain
int tick_lru_list; // Age tracking for boost decisions
int tick_lru3; // Position in victim zone
volatile int count_fix_and_avoid_dealloc; // Dual-purpose: fix count (upper 16) + avoid dealloc (lower 16)
int hit_age; // Last hit age for quota/activity tracking
LOG_LSA oldest_unflush_lsa; // Oldest unflushed LSA (WAL constraint)
PGBUF_IOPAGE_BUFFER *iopage_buffer; // Pointer to actual page data
};
Key design choice: The atomic_latch packs latch mode, waiter flag, and fix count into a single 64-bit atomic, enabling CAS-based operations without holding the BCB mutex for read-only fast paths.
union pgbuf_atomic_latch_impl {
uint64_t raw; // For atomic CAS
struct {
PGBUF_LATCH_MODE latch_mode; // 2 bytes
uint16_t waiter_exists; // 2 bytes
int32_t fcnt; // 4 bytes (fix count)
} impl;
};
3.2 BCB Flags (Packed in volatile int flags)¶
| Flag | Bit | Description |
|---|---|---|
| `PGBUF_BCB_DIRTY_FLAG` | 0x80000000 | Page modified, needs flush |
| `PGBUF_BCB_FLUSHING_TO_DISK_FLAG` | 0x40000000 | Flush in progress |
| `PGBUF_BCB_VICTIM_DIRECT_FLAG` | 0x20000000 | Assigned as direct victim |
| `PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG` | 0x10000000 | Direct victim was re-fixed |
| `PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG` | 0x08000000 | Move to LRU bottom on unfix |
| `PGBUF_BCB_TO_VACUUM_FLAG` | 0x04000000 | Needs vacuuming |
| `PGBUF_BCB_ASYNC_FLUSH_REQ` | 0x02000000 | Async flush requested |
The lower bits encode the zone (LRU_1, LRU_2, LRU_3, INVALID, VOID) and LRU list index (16 bits).
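A hedged sketch of this packing scheme; the flag constant comes from the table above, but the exact bit positions of the zone field are an assumption for illustration:

```c
#include <stdint.h>

/* High bits: status flags (see the flag table above). */
#define BCB_DIRTY_FLAG 0x80000000u

/* Assumed layout for the low bits: a small zone field above a
 * 16-bit LRU list index (positions are illustrative). */
#define ZONE_SHIFT     16
#define ZONE_MASK      0x000f0000u
#define LRU_IDX_MASK   0x0000ffffu

static uint32_t flags_pack (uint32_t flag_bits, int zone, int lru_idx)
{
  return flag_bits | ((uint32_t) zone << ZONE_SHIFT) | (uint32_t) lru_idx;
}

static int flags_zone (uint32_t f)     { return (int) ((f & ZONE_MASK) >> ZONE_SHIFT); }
static int flags_lru_idx (uint32_t f)  { return (int) (f & LRU_IDX_MASK); }
static int flags_is_dirty (uint32_t f) { return (f & BCB_DIRTY_FLAG) != 0; }
```

The space efficiency comes at the cost of masking/shifting on every access, which is one of the complexity concerns raised in section 10.2.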
3.3 PGBUF_BUFFER_HASH¶
struct pgbuf_buffer_hash {
pthread_mutex_t hash_mutex; // Per-bucket mutex
PGBUF_BCB *hash_next; // Hash chain anchor (linked list of BCBs)
PGBUF_BUFFER_LOCK *lock_next; // Buffer lock chain (for page-level locks)
};
Hash function: Mirror-bit hash on VPID, table size = 2^20 (1M buckets).
#define HASH_SIZE_BITS 20
#define PGBUF_HASH_SIZE (1 << HASH_SIZE_BITS) // 1,048,576 buckets
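An illustrative "mirror-bit" (bit-reversal) hash is sketched below; the real pgbuf hash also folds in the volume id, so treat this as a sketch of the idea only. Reversing the bits makes consecutive page ids land in distant buckets, spreading sequential scans across the table:

```c
#include <stdint.h>

#define HASH_SIZE_BITS 20
#define PGBUF_HASH_SIZE (1u << HASH_SIZE_BITS)   /* 1,048,576 buckets */

/* Reverse the low HASH_SIZE_BITS bits of the page id. */
static unsigned int hash_mirror_bits (uint32_t pageid)
{
  uint32_t rev = 0;
  for (int i = 0; i < HASH_SIZE_BITS; i++)
    {
      rev = (rev << 1) | ((pageid >> i) & 1u);
    }
  return rev & (PGBUF_HASH_SIZE - 1);
}
```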
3.4 PGBUF_LRU_LIST¶
struct pgbuf_lru_list {
pthread_mutex_t mutex; // Per-LRU mutex
PGBUF_BCB *top, *bottom; // Doubly-linked list endpoints
PGBUF_BCB *bottom_1, *bottom_2; // Zone boundary markers
PGBUF_BCB *volatile victim_hint; // Hint for victim search start
int count_lru1, count_lru2, count_lru3; // Per-zone counts
int count_vict_cand; // Victim candidate count
int threshold_lru1, threshold_lru2; // Zone size thresholds
int quota; // Private list quota
int tick_list, tick_lru3; // Age ticks
volatile int flags;
int index;
};
3.5 PGBUF_BUFFER_POOL (The Global Pool)¶
struct pgbuf_buffer_pool {
int num_buffers; // Total buffer count
PGBUF_BCB *BCB_table; // BCB array
PGBUF_BUFFER_HASH *buf_hash_table; // Hash table (1M buckets)
PGBUF_BUFFER_LOCK *buf_lock_table; // Lock table (1 per thread)
PGBUF_IOPAGE_BUFFER *iopage_table; // IO page array
int num_LRU_list; // Shared LRU count
PGBUF_LRU_LIST *buf_LRU_list; // All LRU lists (shared + private)
PGBUF_AOUT_LIST buf_AOUT_list; // 2Q Aout history
PGBUF_INVALID_LIST buf_invalid_list; // Free BCB list
PGBUF_PAGE_MONITOR monitor; // Stats & monitoring
PGBUF_PAGE_QUOTA quota; // Quota management
PGBUF_HOLDER_ANCHOR *thrd_holder_info; // Per-thread holder tracking
// SERVER_MODE only:
PGBUF_DIRECT_VICTIM direct_victims; // Direct victim assignment queues
lockfree::circular_queue<PGBUF_BCB *> *flushed_bcbs; // Post-flush queue
lockfree::circular_queue<int> *private_lrus_with_victims; // LFCQ for victim search
lockfree::circular_queue<int> *shared_lrus_with_victims;
};
4. Page Fix/Unfix Lifecycle¶
4.1 pgbuf_fix() – The Main Entry Point¶
This is the most critical function. Every page access in CUBRID goes through pgbuf_fix().
pgbuf_fix(thread_p, vpid, fetch_mode, request_mode, condition)
|
|-- 1. Parameter validation (latch mode, condition)
|-- 2. Interrupt check
|-- 3. FAST PATH: lock-free read-only fix (pgbuf_lockfree_fix_ro)
| |-- Atomic increment of fix count (no BCB mutex!)
| |-- Only for: OLD_PAGE + PGBUF_LATCH_READ + UNCONDITIONAL
| |-- Returns immediately on success
|
|-- 4. NORMAL PATH: hash chain search
| |-- hash_anchor = buf_hash_table[PGBUF_HASH_VALUE(vpid)]
| |-- Lock hash_mutex
| |-- Walk hash chain to find matching VPID
| |
| |-- HIT: BCB found in hash
| | |-- Lock BCB mutex, unlock hash mutex
| | |-- Handle direct victim conflicts
| | |-- Proceed to latch acquisition
| |
| |-- MISS: BCB not found
| |-- If OLD_PAGE_IF_IN_BUFFER: return NULL
| |-- pgbuf_claim_bcb_for_fix():
| |-- pgbuf_lock_page() (prevent duplicate loads)
| |-- pgbuf_allocate_bcb() -> pgbuf_get_victim()
| |-- Read page from disk (fileio_read)
| |-- Insert BCB into hash chain
| |-- pgbuf_unlock_page()
|
|-- 5. Register fix count (pgbuf_bcb_register_fix)
|-- 6. Validate page VPID
|-- 7. Handle OLD_PAGE_PREVENT_DEALLOC
|-- 8. LATCH ACQUISITION (pgbuf_latch_bcb_upon_fix):
| |-- If no conflict: set latch atomically, return
| |-- If conflict: pgbuf_block_bcb() -> thread sleeps
| |-- Conditional latch: return NULL on conflict
|
|-- 9. Handle deallocated pages per fetch_mode
|-- 10. Performance tracking
|-- 11. Return page pointer
4.2 Page Fetch Modes¶
| Mode | Description |
|---|---|
| `OLD_PAGE` | Standard fetch – page must exist on disk or in buffer |
| `NEW_PAGE` | Newly allocated page – can be created in buffer without disk read |
| `OLD_PAGE_IF_IN_BUFFER` | Opportunistic – return NULL if not in buffer (no disk I/O) |
| `OLD_PAGE_PREVENT_DEALLOC` | Fix + prevent concurrent deallocation |
| `OLD_PAGE_DEALLOCATED` | Fix a deallocated page (expected) |
| `OLD_PAGE_MAYBE_DEALLOCATED` | Fix a page that may have been deallocated (no error if so) |
| `RECOVERY_PAGE` | Recovery context – no validation constraints |
4.3 Latch Modes¶
| Mode | Value | Description |
|---|---|---|
| `PGBUF_NO_LATCH` | 0 | No latch held |
| `PGBUF_LATCH_READ` | 1 | Shared read latch (multiple concurrent readers) |
| `PGBUF_LATCH_WRITE` | 2 | Exclusive write latch |
| `PGBUF_LATCH_FLUSH` | 3 | Used only as a block mode during flush |
4.4 pgbuf_unfix() Flow¶
pgbuf_unfix(thread_p, pgptr)
|
|-- 1. Convert page pointer to BCB pointer (CAST_PGPTR_TO_BFPTR)
|-- 2. Validation checks (debug mode)
|-- 3. Unlatch thread holder (pgbuf_unlatch_thrd_holder)
| |-- Decrement holder fix_count
| |-- If fix_count reaches 0: remove holder from thread's list
|-- 4. FAST PATH: lock-free read-only unfix (pgbuf_lockfree_unfix_ro)
| |-- Atomic decrement fix count
| |-- If read latch + no other state changes needed: return
|-- 5. NORMAL PATH: lock BCB mutex
| |-- pgbuf_unlatch_bcb_upon_unfix():
| |-- Decrement fix count atomically
| |-- If fix_count > 0: just release mutex
| |-- If fix_count == 0:
| |-- LRU boost decision (if bcb is "old enough")
| |-- Wake blocked waiters
| |-- Handle move-to-bottom flag
| |-- Handle direct victim assignment
|-- 6. Performance tracking
4.5 Lock-Free Fast Path (Critical Optimization)¶
For the common case of read-only page access (OLD_PAGE + PGBUF_LATCH_READ + UNCONDITIONAL):
Fix: pgbuf_lockfree_fix_ro() searches hash chain without BCB mutex, uses atomic CAS to increment fix count only if latch mode is already READ. No mutex acquired at all.
Unfix: pgbuf_lockfree_unfix_ro() atomically decrements fix count if no state changes are pending (no dirty marking, no waiters, etc.).
This is a major performance optimization for read-heavy workloads.
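A minimal sketch of the fast-path fix, assuming the bit layout from the union above (mode in the low 16 bits, waiter flag next, fix count in the high 32); names and packing are illustrative, not the real CUBRID code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { LATCH_NO = 0, LATCH_READ = 1, LATCH_WRITE = 2 };

/* Assumed packing for illustration (mirrors the union shown above). */
static inline uint64_t pack (int mode, int waiters, int fcnt)
{
  return (uint64_t) mode | ((uint64_t) waiters << 16) | ((uint64_t) fcnt << 32);
}

/* Lock-free read-only fix: bump the fix count only when the latch is
 * free or already held in READ mode and no waiter is queued; any other
 * state makes the caller fall back to the mutex-protected slow path. */
static bool lockfree_fix_ro (_Atomic uint64_t *latch)
{
  uint64_t old = atomic_load_explicit (latch, memory_order_acquire);
  for (;;)
    {
      int mode = (int) (old & 0xffff);
      int waiters = (int) ((old >> 16) & 0xffff);
      int fcnt = (int) (old >> 32);
      if ((mode != LATCH_NO && mode != LATCH_READ) || waiters != 0)
        {
          return false;         /* conflict: caller takes the slow path */
        }
      uint64_t desired = pack (LATCH_READ, 0, fcnt + 1);
      /* On failure the CAS reloads `old`, so the loop re-validates. */
      if (atomic_compare_exchange_weak_explicit (latch, &old, desired,
                                                 memory_order_acq_rel,
                                                 memory_order_acquire))
        {
          return true;
        }
    }
}
```

The key property is that readers never touch the BCB mutex: all state validation and the increment happen in a single CAS.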
4.6 Latch Promotion¶
pgbuf_promote_read_latch() upgrades READ -> WRITE:
- In-place promotion: If the caller is the only holder (fix_count == holder’s fix_count), directly change latch mode via CAS.
- Blocking promotion: If other readers exist and the condition is `PGBUF_PROMOTE_SHARED_READER`, temporarily unfix, queue as the first blocker with `wait_for_latch_promote = true`, and re-acquire as WRITE.
- Fail: If another promoter is already waiting, or the condition is `PGBUF_PROMOTE_ONLY_READER` with multiple readers, return `ER_PAGE_LATCH_PROMOTE_FAIL`.
4.7 Ordered Fix (Deadlock Prevention)¶
pgbuf_ordered_fix() / pgbuf_ordered_unfix() implement VPID-ordered latch acquisition to prevent deadlocks when multiple pages must be fixed simultaneously:
- Uses `PGBUF_WATCHER` to track fix order and group (HFID-based grouping for heap pages)
- When a page with a lower VPID needs to be fixed while a higher VPID is held, the higher one is unfixed first, then both are fixed in VPID order
- Ranks: `PGBUF_ORDERED_HEAP_HDR` > `PGBUF_ORDERED_HEAP_NORMAL` > `PGBUF_ORDERED_HEAP_OVERFLOW`
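The ordering rule itself reduces to a VPID comparison; the following sketch shows the decision of whether a held latch must be released before fixing the wanted page (helper names are hypothetical):

```c
#include <stdbool.h>

typedef struct { short volid; int pageid; } VPID;

/* Total order on VPIDs: by volume id, then by page id. */
static int vpid_compare (const VPID *a, const VPID *b)
{
  if (a->volid != b->volid)
    return a->volid < b->volid ? -1 : 1;
  if (a->pageid != b->pageid)
    return a->pageid < b->pageid ? -1 : 1;
  return 0;
}

/* Latches must be acquired in ascending VPID order. If the wanted page
 * sorts before the held one, the held latch is released first and both
 * pages are then fixed in order. */
static bool must_release_held_first (const VPID *held, const VPID *wanted)
{
  return vpid_compare (wanted, held) < 0;
}
```

Because every thread acquires latches in the same global order, a cycle of waiting threads cannot form, which is the deadlock-prevention guarantee.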
5. LRU & Page Replacement¶
5.1 Three-Zone LRU Design¶
Each LRU list is divided into three zones:
TOP (most recently used)
+---------------------------+
| LRU Zone 1 | HOT zone - no victimization
| (hottest pages) | No boost on unfix (already hot)
+---------------------------+ <-- bottom_1
| LRU Zone 2 | BUFFER zone - no victimization
| (warm pages) | Pages can be boosted back to top
+---------------------------+ <-- bottom_2
| LRU Zone 3 | VICTIM zone - candidates for replacement
| (cold pages) | victim_hint starts search here
+---------------------------+
BOTTOM (least recently used)
- Zone 1: Hottest pages. No boost needed (they’re already at the top). Cannot be victimized.
- Zone 2: Buffer/warm zone. Pages falling from zone 1 get a chance to be boosted back. Not victimizable.
- Zone 3: Victim zone. Pages here are candidates for replacement. Can still be boosted if accessed.
Zone sizes are controlled by configurable ratios (ratio_lru1, ratio_lru2) with min/max bounds (5%-90%).
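A minimal sketch of this ratio-based sizing with the 5%-90% clamping; function and parameter names are illustrative (the real code keeps per-list thresholds inside PGBUF_LRU_LIST):

```c
/* Clamp a configured zone ratio to the documented 5%-90% bounds. */
static double clamp_ratio (double r)
{
  if (r < 0.05) return 0.05;
  if (r > 0.90) return 0.90;
  return r;
}

/* Derive zone-1 and zone-2 capacities for a list of bcb_count pages;
 * zone 3 implicitly holds the remainder (the victim candidates). */
static void compute_zone_thresholds (int bcb_count, double ratio_lru1,
                                     double ratio_lru2, int *th1, int *th2)
{
  *th1 = (int) (bcb_count * clamp_ratio (ratio_lru1));
  *th2 = (int) (bcb_count * clamp_ratio (ratio_lru2));
}
```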
5.2 Shared vs Private LRU Lists¶
LRU List Array:
[0..num_LRU_list-1] : Shared LRU lists (all threads)
[num_LRU_list..num_LRU_list+num_garbage-1] : Shared garbage LRU lists
[num_LRU_list+num_garbage..TOTAL_LRU-1] : Private LRU lists (per-transaction)
- Shared LRUs: Any thread can add/victim from these. New pages go to a round-robin selected shared list.
- Shared Garbage LRUs: Hold shared pages pending vacuum or relocated from destroyed private LRUs. BCBs here are prioritized for victimization.
- Private LRUs: Each transaction (thread) has its own LRU list. Pages fixed by a thread with a private LRU go to that list.
- Quota system: Private LRUs have quotas based on transaction activity. Transactions exceeding quota have their BCBs moved to shared lists.
5.3 Victim Selection (pgbuf_get_victim())¶
pgbuf_get_victim()
|
|-- 1. Try direct victim (pgbuf_get_direct_victim)
| |-- Check if another thread has pre-assigned a victim BCB
| |-- Lock-free circular queue based
|
|-- 2. Try LFCQ (Lock-Free Circular Queue) victim search
| |-- Check private_lrus_with_victims queue
| |-- Check big_private_lrus_with_victims queue
| |-- Check shared_lrus_with_victims queue
| |-- For each LRU: scan zone 3 for non-dirty, non-fixed BCBs
|
|-- 3. Try invalid list (pgbuf_get_bcb_from_invalid_list)
| |-- Free BCBs that haven't been assigned to any page yet
|
|-- 4. If all fail: flush victim candidates and retry
| |-- pgbuf_wakeup_page_flush_daemon()
| |-- Thread waits for direct victim assignment
5.4 Aout List (2Q History)¶
The Aout list maintains a FIFO history of recently evicted VPIDs:
- When a BCB is victimized, its VPID is added to the Aout list
- When a page is fetched and found in Aout (recently evicted), it’s placed at the top of the LRU (hot position)
- When a page is fetched and NOT in Aout, it’s placed in the middle (zone 2 boundary)
This implements the “second chance” aspect of the 2Q algorithm, preventing scan-resistant pages from being immediately evicted after a single access.
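The Aout history is essentially a fixed-capacity FIFO of VPIDs with membership lookup; the sketch below shows the behavior described above (capacity, names, and the linear lookup are simplifications — the real list is hash-indexed):

```c
#include <stdbool.h>

#define AOUT_CAPACITY 4          /* real capacity derives from a config ratio */

typedef struct { short volid; int pageid; } VPID;

static VPID aout[AOUT_CAPACITY];
static int aout_head = 0, aout_count = 0;

/* Record an evicted page's VPID; at capacity the oldest entry drops out. */
static void aout_push (VPID v)
{
  aout[(aout_head + aout_count) % AOUT_CAPACITY] = v;
  if (aout_count < AOUT_CAPACITY)
    aout_count++;
  else
    aout_head = (aout_head + 1) % AOUT_CAPACITY;
}

/* On fetch: true -> recently evicted, place at LRU top ("second chance");
 * false -> first sighting, place at the zone-2 boundary. */
static bool aout_contains (VPID v)
{
  for (int i = 0; i < aout_count; i++)
    {
      VPID *e = &aout[(aout_head + i) % AOUT_CAPACITY];
      if (e->volid == v.volid && e->pageid == v.pageid)
        return true;
    }
  return false;
}
```

A one-pass scan touches each page once, so its VPIDs age out of Aout before a second access; only genuinely re-referenced pages earn the hot position.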
5.5 Direct Victim Assignment (SERVER_MODE)¶
An optimization to avoid victim search overhead:
struct pgbuf_direct_victim {
PGBUF_BCB **bcb_victims; // Pre-assigned victim BCBs
circular_queue<THREAD_ENTRY *> *waiter_threads_high_priority;
circular_queue<THREAD_ENTRY *> *waiter_threads_low_priority;
};
When a flush thread cleans a dirty page, it can directly assign the now-clean BCB to a waiting thread via a lock-free queue, bypassing the normal victim search entirely.
6. Concurrency & Locking¶
6.1 Lock Hierarchy (Coarse to Fine)¶
| Lock | Type | Granularity | Protects |
|---|---|---|---|
| Hash mutex | `pthread_mutex_t` | Per hash bucket (1M buckets) | Hash chain integrity |
| BCB mutex | `pthread_mutex_t` | Per BCB | BCB metadata, wait queue, flags |
| LRU mutex | `pthread_mutex_t` | Per LRU list | LRU list structure |
| Invalid list mutex | `pthread_mutex_t` | Global (1) | Free BCB list |
| Aout mutex | `pthread_mutex_t` | Global (1) | Aout history list |
| Atomic latch | `std::atomic<uint64_t>` | Per BCB | Latch mode + fix count + waiter flag |
| Free holder set mutex | `pthread_mutex_t` | Global (1) | BCB holder allocation |
6.2 Lock Ordering (Inferred)¶
To prevent deadlocks, the following ordering is observed:
hash_mutex --> BCB mutex --> LRU mutex
(never hold LRU mutex while acquiring hash_mutex or BCB mutex)
- Hash mutex is acquired first during page lookup
- BCB mutex is acquired while holding hash mutex, then hash mutex is released
- LRU mutex is acquired independently (never while holding hash or BCB mutex in the fix path)
6.3 Atomic Latch Operations¶
All latch state changes use CAS loops on the 64-bit atomic_latch:
// Example: set_latch_and_add_fcnt
do {
impl.raw = latch->load(std::memory_order_acquire);
new_impl = impl;
new_impl.impl.latch_mode = latch_mode;
new_impl.impl.fcnt += cnt;
} while (!latch->compare_exchange_weak(impl.raw, new_impl.raw,
std::memory_order_acq_rel, std::memory_order_acquire));
This allows the lock-free fast path to modify latch state without the BCB mutex.
6.4 Thread Wait Queue¶
When a latch conflict occurs, the requesting thread is placed on the BCB’s wait queue (next_wait_thrd). The pgbuf_block_bcb() function:
- Adds the thread to the wait queue (linked list via `THREAD_ENTRY`)
- Sets the `waiter_exists` flag on the atomic latch
- Releases the BCB mutex
- Calls `pgbuf_timed_sleep()` – the thread blocks with a timeout (default 300 seconds)
When a latch is released and waiters exist, pgbuf_wakeup_reader_writer() wakes the next compatible waiter.
6.5 BCB Mutex Monitoring¶
Debug infrastructure for detecting mutex leaks:
struct pgbuf_monitor_bcb_mutex {
PGBUF_BCB *bcb;
PGBUF_BCB *bcb_second;
int line, line_second;
};
When pgbuf_Monitor_locks is enabled, all BCB lock/unlock operations are tracked per-thread to detect leaks.
6.6 SA_MODE (Single-Thread) Stubs¶
In standalone mode, all mutex operations are no-ops:
#if !defined(SERVER_MODE)
#define pthread_mutex_lock(a) 0
#define pthread_mutex_unlock(a)
#define PGBUF_BCB_LOCK(bcb)
#define PGBUF_BCB_UNLOCK(bcb)
#endif
7. Flush Mechanisms¶
7.1 Background Daemons (SERVER_MODE)¶
| Daemon | Purpose |
|---|---|
| `pgbuf_Page_flush_daemon` | Flushes dirty victim candidates |
| `pgbuf_Page_maintenance_daemon` | LRU maintenance, quota adjustment |
| `pgbuf_Page_post_flush_daemon` | Post-flush processing (assigns flushed BCBs to waiters) |
| `pgbuf_Flush_control_daemon` | Adaptive flush rate control |
7.2 Flush Paths¶
Individual flush (pgbuf_flush_with_wal):
- Called by threads holding WRITE latch
- Acquires BCB mutex -> pgbuf_bcb_safe_flush_force_unlock()
- Ensures WAL constraint: log must be flushed up to oldest_unflush_lsa before page write
Bulk flush (pgbuf_flush_all_helper):
- Scans entire BCB table
- For each dirty BCB: lock, check conditions, flush, unlock
- Used for pgbuf_flush_all() and pgbuf_flush_all_unfixed()
Checkpoint flush (pgbuf_flush_checkpoint):
- Flushes pages with LSA <= flush_upto_lsa
- Uses sequential flusher with rate control
- Tracks chkpt_smallest_lsa for next checkpoint
Victim candidate flush (pgbuf_flush_victim_candidates):
- Collects dirty BCBs from LRU zone 3
- Sorts by VPID for sequential I/O
- Flushes with neighbor page batching
7.3 Neighbor Flush Optimization¶
When flushing a page, the system also flushes neighbor pages (pages with adjacent page IDs on the same volume) to improve sequential I/O:
#define PGBUF_MAX_NEIGHBOR_PAGES 32 // max neighbors in one batch
pgbuf_flush_page_and_neighbors_fb() collects dirty neighbors and flushes them together.
7.4 WAL Enforcement¶
Before any page write to disk:
pgbuf_bcb_flush_with_wal():
if (oldest_unflush_lsa > log_Gl.append.prev_lsa)
log_flush_up_to(oldest_unflush_lsa) // Ensure log is flushed first
fileio_write() or dwb_add_page() // Then write page
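The LSA comparison behind this check can be sketched as follows; the struct shape mirrors LOG_LSA (log page id + offset) but the names and helpers are simplified placeholders:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int64_t pageid; int16_t offset; } LSA_SKETCH;

/* Lexicographic order on (pageid, offset): a <= b. */
static bool lsa_le (const LSA_SKETCH *a, const LSA_SKETCH *b)
{
  return a->pageid < b->pageid
    || (a->pageid == b->pageid && a->offset <= b->offset);
}

/* WAL rule: the page may be written only once the log is durable up to
 * its oldest_unflush_lsa; otherwise a log flush must happen first. */
static bool needs_log_flush (const LSA_SKETCH *oldest_unflush,
                             const LSA_SKETCH *log_flushed_upto)
{
  return !lsa_le (oldest_unflush, log_flushed_upto);
}
```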
7.5 Flush Rate Control¶
The sequential flusher uses interval-based rate control:
- Each 1-second period is divided into sub-intervals
- Pages are flushed equally across intervals
- Burst mode vs spread mode is configurable
- Compensation is applied across intervals to meet the overall target rate
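The compensation idea can be sketched as: each sub-interval's quota is the cumulative share of the per-second budget minus what has already been flushed, so a shortfall in one interval is carried into the next (function and parameter names are illustrative, not the CUBRID implementation):

```c
/* Per-interval flush quota with cross-interval compensation. */
static int interval_flush_target (int pages_per_sec, int num_intervals,
                                  int interval_idx, int flushed_so_far)
{
  /* Pages that should be flushed by the end of this interval. */
  int cumulative = pages_per_sec * (interval_idx + 1) / num_intervals;
  int target = cumulative - flushed_so_far;
  return target > 0 ? target : 0;   /* never a negative quota */
}
```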
8. Module Integration¶
8.1 Consumer Usage Pattern¶
All upper modules follow the same pattern:
// Standard read pattern
pgptr = pgbuf_fix(thread_p, &vpid, OLD_PAGE, PGBUF_LATCH_READ, PGBUF_UNCONDITIONAL_LATCH);
// ... read page content ...
pgbuf_unfix(thread_p, pgptr);
// Standard modify pattern
pgptr = pgbuf_fix(thread_p, &vpid, OLD_PAGE, PGBUF_LATCH_WRITE, PGBUF_UNCONDITIONAL_LATCH);
// ... modify page content ...
// ... log the change (WAL) ...
pgbuf_set_dirty(thread_p, pgptr, FREE); // marks dirty + unfixes
8.2 Key Consumers (24+ modules)¶
| Module | File | Usage |
|---|---|---|
| Heap File | `heap_file.c` | Record storage, page allocation/dealloc |
| B-tree | `btree_load.c` | Index page management |
| Overflow | `overflow_file.c` | Large record overflow pages |
| Log Manager | `log_manager.c` | WAL coordination, checkpoint |
| Log Recovery | `log_recovery.c` | Page restoration during recovery |
| Vacuum | `vacuum.c` | MVCC garbage collection |
| Disk Manager | `disk_manager.c` | Volume/page allocation |
| File I/O | `file_io.c` | Actual disk read/write |
| File Manager | `file_manager.c` | File/page mapping |
| External Sort | `external_sort.c` | Sort temp pages |
| Lock Manager | `lock_manager.c` | Lock table pages |
| Slotted Page | `slotted_page.c` | Slotted page operations |
| Hash Scan | `query_hash_scan.c` | Hash join pages |
8.3 WAL Protocol Integration¶
[Modify page]
|
v
[pgbuf_set_dirty() -> sets DIRTY flag, records oldest_unflush_lsa]
|
v
[log_append_*() -> writes log record to log buffer]
|
v
[Eventually: flush daemon or checkpoint]
|
v
[pgbuf_bcb_flush_with_wal() -> ensures log is flushed FIRST, then writes page]
The oldest_unflush_lsa on each BCB tracks the earliest log record that must be on disk before the page can be flushed. This is the core WAL guarantee.
8.4 Double Write Buffer Integration¶
Pages are written through the Double Write Buffer (DWB) to prevent torn page writes:
pgbuf_bcb_flush_with_wal()
-> dwb_add_page() // Add to DWB
-> When DWB is full:
-> Write entire DWB block to DWB file (sequential I/O)
-> Then write individual pages to their actual locations
8.5 Transparent Data Encryption (TDE)¶
Pages can be encrypted/decrypted transparently:
- pgbuf_set_tde_algorithm() – set encryption algorithm per page
- pgbuf_get_tde_algorithm() – get current algorithm
- Encryption/decryption happens during I/O operations
9. Key Design Patterns¶
9.1 BCB Pointer <-> Page Pointer Conversion¶
CAST_PGPTR_TO_BFPTR(bufptr, pgptr) // PAGE_PTR -> PGBUF_BCB*
CAST_BFPTR_TO_PGPTR(pgptr, bufptr) // PGBUF_BCB* -> PAGE_PTR
The page pointer points to iopage_buffer->iopage.page, and the BCB is recovered via pointer arithmetic using offsetof.
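A minimal sketch of that recovery via offsetof; the struct contents are illustrative placeholders for the real PGBUF_IOPAGE_BUFFER layout:

```c
#include <stddef.h>

typedef char *PAGE_PTR;

struct pgbuf_bcb_sk;

typedef struct pgbuf_iopage_sk
{
  struct pgbuf_bcb_sk *bcb;     /* back pointer stored in the frame header */
  char page[64];                /* page body handed out as PAGE_PTR */
} PGBUF_IOPAGE_SK;

typedef struct pgbuf_bcb_sk
{
  PGBUF_IOPAGE_SK *iopage_buffer;
} PGBUF_BCB_SK;

/* Mirrors CAST_PGPTR_TO_BFPTR: step back from the page body to the
 * frame header, then follow the back pointer to the owning BCB. */
static PGBUF_BCB_SK *pgptr_to_bcb (PAGE_PTR pgptr)
{
  PGBUF_IOPAGE_SK *io =
    (PGBUF_IOPAGE_SK *) (pgptr - offsetof (PGBUF_IOPAGE_SK, page));
  return io->bcb;
}
```

The conversion is O(1) pointer arithmetic, which is why consumers can hand around bare PAGE_PTRs without carrying BCB references.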
9.2 Holder Tracking¶
Each thread maintains a holder list tracking which BCBs it has fixed:
- thrd_holder_info[thread_index] -> linked list of PGBUF_HOLDER
- Each holder has: fix_count, bufptr, performance stats, watcher list
- Default 7 pre-allocated holders per thread (PGBUF_DEFAULT_FIX_COUNT)
- Overflow allocation from shared free_holder_set (one PGBUF_HOLDER_SET at a time, each containing PGBUF_NUM_ALLOC_HOLDER entries)
9.3 Watcher System (Ordered Fix Support)¶
PGBUF_WATCHER structs track page fixes with ordering metadata:
- group_id: HFID-based group for heap pages
- initial_rank / curr_rank: Priority ordering
- page_was_unfixed: Flag indicating refix occurred
- Attached to holders via doubly-linked list
9.4 Performance Monitoring Hooks¶
Extensive perfmon_* calls track:
- Fix counts by page type, latch mode, conditional type
- Latch wait times (lock acquisition, holder wait, total fix time)
- Promote success/failure counts
- Unfix statistics by dirty state and latch mode
- Page type distribution snapshots
10. Observations & Discussion Points¶
10.1 Strengths¶
- Lock-free read fast path: The `pgbuf_lockfree_fix_ro()` / `pgbuf_lockfree_unfix_ro()` optimization is excellent for read-heavy workloads – no mutex acquisition at all for the common case.
- 3-zone LRU with 2Q Aout: Sophisticated replacement policy that handles both frequency and recency, with scan resistance through the Aout history.
- Private/Shared LRU partitioning: Reduces contention between transactions with different working sets and provides per-transaction page quota management.
- Direct victim assignment: Avoids wasted CPU cycles on victim search by pre-assigning flushed BCBs to waiting threads via lock-free queues.
- Comprehensive debugging: BCB mutex monitoring, page validation levels, fixed-at tracking, watcher magic numbers, and resource tracking in debug builds.
10.2 Complexity Concerns¶
- File size (17K lines): The module is very large for a single file. It could benefit from splitting into sub-modules (lru.c, flush.c, hash.c, etc.).
- Flag packing complexity: BCB flags, zone, and LRU index are all packed into a single `volatile int`. While space-efficient, this creates complex bitwise operations and potential for subtle bugs.
- `count_fix_and_avoid_dealloc` dual-purpose field: Using a single int for two independent counters (fix count in the upper 16 bits, avoid-dealloc count in the lower 16 bits) is fragile, and the comment acknowledges that the reason (the need for atomic operations) is somewhat forced.
- TODO comment in LRU list handling (around source lines 589-591):

  ```
  /* TODO: I have noticed while investigating core files from TPCC that hint is
   * sometimes before first bcb that can be victimized. this means there is
   * a logic error somewhere. */
  ```

  This is an acknowledged bug in victim hint management.
10.3 Concurrency Risk Areas¶
- BCB flags are a `volatile int` with non-atomic read-modify-write: While the BCB mutex should be held for most flag changes, the `volatile` qualifier alone doesn’t prevent torn reads/writes on some architectures. The code uses `pgbuf_bcb_update_flags()`, which does a CAS on the flags, but there are also direct flag reads without the mutex.
- Hash chain search with BCB trylock: `pgbuf_search_hash_chain()` walks the hash chain under the hash mutex and tries to lock the BCB mutex with a trylock. On failure, it must restart the search, which can lead to livelock under extreme contention.
- Lock ordering between BCB mutex and LRU mutex: While generally well-ordered, the code has complex paths where LRU operations happen after BCB unlock, creating windows where BCB state can change between operations.
10.4 Potential Improvements to Discuss¶
- Partitioned hash table locks: The current 1M hash buckets with a per-bucket mutex are good, but consider a lock-free hash table for the read path.
- NUMA awareness: The current design doesn’t appear to have NUMA-aware buffer placement, which could matter for large buffer pools.
- Adaptive zone sizing: Zone thresholds are currently ratio-based and could benefit from adaptive sizing based on workload characteristics.
- Metric-driven flush tuning: The flush rate control system (`PGBUF_FLUSH_VICTIM_BOOST_MULT`) uses a fixed multiplier. Adaptive algorithms could improve I/O scheduling.
10.5 Key Configuration Parameters¶
| Parameter | Description | Impact |
|---|---|---|
| `PRM_ID_PB_NBUFFERS` | Total buffer count | Memory vs hit ratio tradeoff |
| `PRM_ID_PB_NEIGHBOR_FLUSH_PAGES` | Neighbor flush count | Sequential I/O vs flush overhead |
| `PRM_ID_PB_NEIGHBOR_FLUSH_NONDIRTY` | Flush non-dirty neighbors too | I/O pattern optimization |
| LRU zone ratios | Zone 1/2 size ratios | Hot page protection vs victim availability |
Appendix: Function Index (Key Functions)¶
| Function | Line | Purpose |
|---|---|---|
| `pgbuf_initialize` | 1515 | Initialize buffer pool |
| `pgbuf_fix` (debug/release) | 2034 | Fix (pin) a page in buffer |
| `pgbuf_lockfree_fix_ro` | ~2132 | Lock-free read-only fix fast path |
| `pgbuf_unfix` | 2847 | Unfix (unpin) a page |
| `pgbuf_promote_read_latch` | 2620 | Upgrade READ latch to WRITE |
| `pgbuf_ordered_fix` | (macro) | VPID-ordered fix for deadlock prevention |
| `pgbuf_set_dirty` | (macro) | Mark page as dirty |
| `pgbuf_flush_with_wal` | 3361 | Flush single page with WAL |
| `pgbuf_flush_checkpoint` | 3957 | Checkpoint flush |
| `pgbuf_flush_victim_candidates` | (h:346) | Flush dirty victim candidates |
| `pgbuf_get_victim` | ~1065 | Find a victim BCB for replacement |
| `pgbuf_search_hash_chain` | ~1035 | Look up a VPID in the hash table |
| `pgbuf_latch_bcb_upon_fix` | ~1031 | Acquire latch during fix |
| `pgbuf_block_bcb` | ~1029 | Block thread on latch conflict |
| `pgbuf_adjust_quotas` | (h:480) | Adjust private LRU quotas |
| `pgbuf_direct_victims_maintenance` | (h:484) | Maintain direct victim assignment |
Report generated for CUBRID page buffer code review. Based on analysis of src/storage/page_buffer.c (develop branch).