CUBRID Page Buffer Module - Code Analysis Report¶
File: src/storage/page_buffer.c (16,935 lines) + src/storage/page_buffer.h (499 lines)
Date: 2026-03-18
Purpose: Team code review and presentation
Table of Contents¶
- Executive Summary
- Architecture Overview
- Core Data Structures
- Page Fix/Unfix Lifecycle
- LRU & Page Replacement
- Concurrency & Locking
- Flush Mechanisms
- Module Integration
- Key Design Patterns
- Observations & Discussion Points
1. Executive Summary¶
The page buffer module is CUBRID’s buffer pool manager – the critical layer between disk I/O and all upper-level storage modules (heap, B-tree, overflow, etc.). It manages a fixed-size pool of in-memory page frames, providing:
- Page caching with hash-based VPID lookup
- Latch-based concurrency control (read/write/flush modes)
- 3-zone LRU replacement with private/shared list partitioning
- Write-Ahead Logging (WAL) enforcement before page flush
- Background daemon threads for flush, maintenance, and post-flush processing
- Direct victim assignment for high-throughput page replacement
- 2Q-style Aout history for improved replacement decisions
Key metrics: ~17K lines of C/C++, 24+ consumer modules, 4 background daemons, atomic CAS-based latching.
2. Architecture Overview¶
+------------------------------------------------------------------+
| Consumer Modules |
| (heap_file, btree, overflow, vacuum, log_recovery, file_mgr) |
+------------------------------------------------------------------+
| pgbuf_fix() / pgbuf_unfix() / pgbuf_set_dirty()
v
+------------------------------------------------------------------+
| PAGE BUFFER MODULE |
| |
| +------------------+ +------------------+ +----------------+ |
| | Hash Table | | BCB Table | | IO Page Table | |
| | (VPID -> BCB) | | (metadata array) | | (page frames) | |
| +------------------+ +------------------+ +----------------+ |
| |
| +------------------+ +------------------+ +----------------+ |
| | LRU Lists | | Invalid List | | Aout List | |
| | (shared+private) | | (free BCBs) | | (2Q history) | |
| +------------------+ +------------------+ +----------------+ |
| |
| +------------------+ +------------------+ +----------------+ |
| | Direct Victims | | Flush Daemons | | Page Quota | |
| | (lock-free queue) | | (4 daemons) | | (per-thread) | |
| +------------------+ +------------------+ +----------------+ |
+------------------------------------------------------------------+
| fileio_read() / fileio_write() / dwb_add_page()
v
+------------------------------------------------------------------+
| File I/O Layer + Double Write Buffer |
+------------------------------------------------------------------+
Memory Layout¶
The buffer pool allocates three parallel arrays at initialization:
| Array | Element | Purpose |
|---|---|---|
| `BCB_table[num_buffers]` | `PGBUF_BCB` | Per-page metadata (VPID, latch state, flags, LRU links) |
| `iopage_table[num_buffers]` | `PGBUF_IOPAGE_BUFFER` | Actual page data (IO_PAGESIZE bytes + header) |
| `buf_hash_table[PGBUF_HASH_SIZE]` | `PGBUF_BUFFER_HASH` | Hash buckets for VPID lookup |
BCB index i corresponds to iopage index i. The PGBUF_FIND_BCB_PTR(i) and PGBUF_FIND_IOPAGE_PTR(i) macros compute addresses via pointer arithmetic.
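The parallel-array correspondence can be sketched as follows; the struct contents, sizes, and macro names are simplified placeholders, not the real CUBRID definitions:

```c
#include <stddef.h>

#define NUM_BUFFERS 8
#define IO_PAGESIZE 64              /* real size is configurable (e.g. 16KB) */

typedef struct { short volid; int pageid; } VPID;
typedef struct { VPID vpid; int flags; } PGBUF_BCB_SKETCH;
typedef struct { char page[IO_PAGESIZE]; } PGBUF_IOPAGE_SKETCH;

static PGBUF_BCB_SKETCH bcb_table[NUM_BUFFERS];
static PGBUF_IOPAGE_SKETCH iopage_table[NUM_BUFFERS];

/* Index i addresses both parallel arrays, as the real
 * PGBUF_FIND_BCB_PTR / PGBUF_FIND_IOPAGE_PTR macros do via
 * pointer arithmetic on the array bases. */
#define FIND_BCB_PTR(i)    (&bcb_table[(i)])
#define FIND_IOPAGE_PTR(i) (&iopage_table[(i)])

/* Recover the index from a BCB pointer by pointer subtraction. */
static int bcb_index (const PGBUF_BCB_SKETCH *bcb)
{
  return (int) (bcb - bcb_table);
}
```

Because the arrays are parallel, no per-BCB pointer to the frame index is needed; the index itself links metadata to data.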
3. Core Data Structures¶
3.1 PGBUF_BCB (Buffer Control Block)¶
struct pgbuf_bcb {
pthread_mutex_t mutex; // Per-BCB mutex (SERVER_MODE only)
int owner_mutex; // Mutex owner thread
VPID vpid; // Volume + Page ID of resident page
PGBUF_ATOMIC_LATCH atomic_latch; // Atomic 64-bit: {latch_mode, waiter_exists, fix_count}
volatile int flags; // Dirty, flushing, victim, vacuum flags + zone + LRU index
THREAD_ENTRY *next_wait_thrd; // Wait queue for latch conflicts
THREAD_ENTRY *latch_last_thread; // Debug: last thread to acquire latch
PGBUF_BCB *hash_next; // Hash chain link
PGBUF_BCB *prev_BCB, *next_BCB; // Doubly-linked LRU chain
int tick_lru_list; // Age tracking for boost decisions
int tick_lru3; // Position in victim zone
volatile int count_fix_and_avoid_dealloc; // Dual-purpose: fix count (upper 16) + avoid dealloc (lower 16)
int hit_age; // Last hit age for quota/activity tracking
LOG_LSA oldest_unflush_lsa; // Oldest unflushed LSA (WAL constraint)
PGBUF_IOPAGE_BUFFER *iopage_buffer; // Pointer to actual page data
};
Key design choice: The atomic_latch packs latch mode, waiter flag, and fix count into a single 64-bit atomic, enabling CAS-based operations without holding the BCB mutex for read-only fast paths.
union pgbuf_atomic_latch_impl {
uint64_t raw; // For atomic CAS
struct {
PGBUF_LATCH_MODE latch_mode; // 2 bytes
uint16_t waiter_exists; // 2 bytes
int32_t fcnt; // 4 bytes (fix count)
} impl;
};
3.2 BCB Flags (Packed in volatile int flags)¶
| Flag | Bit | Description |
|---|---|---|
| `PGBUF_BCB_DIRTY_FLAG` | 0x80000000 | Page modified, needs flush |
| `PGBUF_BCB_FLUSHING_TO_DISK_FLAG` | 0x40000000 | Flush in progress |
| `PGBUF_BCB_VICTIM_DIRECT_FLAG` | 0x20000000 | Assigned as direct victim |
| `PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG` | 0x10000000 | Direct victim was re-fixed |
| `PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG` | 0x08000000 | Move to LRU bottom on unfix |
| `PGBUF_BCB_TO_VACUUM_FLAG` | 0x04000000 | Needs vacuuming |
| `PGBUF_BCB_ASYNC_FLUSH_REQ` | 0x02000000 | Async flush requested |
The lower bits encode the zone (LRU_1, LRU_2, LRU_3, INVALID, VOID) and LRU list index (16 bits).
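A hedged sketch of this packing scheme; the flag constant comes from the table above, but the exact bit positions of the zone field are an assumption for illustration:

```c
#include <stdint.h>

/* High bits: status flags (see the flag table above). */
#define BCB_DIRTY_FLAG 0x80000000u

/* Assumed layout for the low bits: a small zone field above a
 * 16-bit LRU list index (positions are illustrative). */
#define ZONE_SHIFT     16
#define ZONE_MASK      0x000f0000u
#define LRU_IDX_MASK   0x0000ffffu

static uint32_t flags_pack (uint32_t flag_bits, int zone, int lru_idx)
{
  return flag_bits | ((uint32_t) zone << ZONE_SHIFT) | (uint32_t) lru_idx;
}

static int flags_zone (uint32_t f)     { return (int) ((f & ZONE_MASK) >> ZONE_SHIFT); }
static int flags_lru_idx (uint32_t f)  { return (int) (f & LRU_IDX_MASK); }
static int flags_is_dirty (uint32_t f) { return (f & BCB_DIRTY_FLAG) != 0; }
```

The space efficiency comes at the cost of masking/shifting on every access, which is one of the complexity concerns raised in section 10.2.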
3.3 PGBUF_BUFFER_HASH¶
struct pgbuf_buffer_hash {
pthread_mutex_t hash_mutex; // Per-bucket mutex
PGBUF_BCB *hash_next; // Hash chain anchor (linked list of BCBs)
PGBUF_BUFFER_LOCK *lock_next; // Buffer lock chain (for page-level locks)
};
Hash function: Mirror-bit hash on VPID, table size = 2^20 (1M buckets).
#define HASH_SIZE_BITS 20
#define PGBUF_HASH_SIZE (1 << HASH_SIZE_BITS) // 1,048,576 buckets
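An illustrative "mirror-bit" (bit-reversal) hash is sketched below; the real pgbuf hash also folds in the volume id, so treat this as a sketch of the idea only. Reversing the bits makes consecutive page ids land in distant buckets, spreading sequential scans across the table:

```c
#include <stdint.h>

#define HASH_SIZE_BITS 20
#define PGBUF_HASH_SIZE (1u << HASH_SIZE_BITS)   /* 1,048,576 buckets */

/* Reverse the low HASH_SIZE_BITS bits of the page id. */
static unsigned int hash_mirror_bits (uint32_t pageid)
{
  uint32_t rev = 0;
  for (int i = 0; i < HASH_SIZE_BITS; i++)
    {
      rev = (rev << 1) | ((pageid >> i) & 1u);
    }
  return rev & (PGBUF_HASH_SIZE - 1);
}
```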
3.4 PGBUF_LRU_LIST¶
struct pgbuf_lru_list {
pthread_mutex_t mutex; // Per-LRU mutex
PGBUF_BCB *top, *bottom; // Doubly-linked list endpoints
PGBUF_BCB *bottom_1, *bottom_2; // Zone boundary markers
PGBUF_BCB *volatile victim_hint; // Hint for victim search start
int count_lru1, count_lru2, count_lru3; // Per-zone counts
int count_vict_cand; // Victim candidate count
int threshold_lru1, threshold_lru2; // Zone size thresholds
int quota; // Private list quota
int tick_list, tick_lru3; // Age ticks
volatile int flags;
int index;
};
3.5 PGBUF_BUFFER_POOL (The Global Pool)¶
struct pgbuf_buffer_pool {
int num_buffers; // Total buffer count
PGBUF_BCB *BCB_table; // BCB array
PGBUF_BUFFER_HASH *buf_hash_table; // Hash table (1M buckets)
PGBUF_BUFFER_LOCK *buf_lock_table; // Lock table (1 per thread)
PGBUF_IOPAGE_BUFFER *iopage_table; // IO page array
int num_LRU_list; // Shared LRU count
PGBUF_LRU_LIST *buf_LRU_list; // All LRU lists (shared + private)
PGBUF_AOUT_LIST buf_AOUT_list; // 2Q Aout history
PGBUF_INVALID_LIST buf_invalid_list; // Free BCB list
PGBUF_PAGE_MONITOR monitor; // Stats & monitoring
PGBUF_PAGE_QUOTA quota; // Quota management
PGBUF_HOLDER_ANCHOR *thrd_holder_info; // Per-thread holder tracking
// SERVER_MODE only:
PGBUF_DIRECT_VICTIM direct_victims; // Direct victim assignment queues
lockfree::circular_queue<PGBUF_BCB *> *flushed_bcbs; // Post-flush queue
lockfree::circular_queue<int> *private_lrus_with_victims; // LFCQ for victim search
lockfree::circular_queue<int> *shared_lrus_with_victims;
};
4. Page Fix/Unfix Lifecycle¶
4.1 pgbuf_fix() – The Main Entry Point¶
This is the most critical function. Every page access in CUBRID goes through pgbuf_fix().
pgbuf_fix(thread_p, vpid, fetch_mode, request_mode, condition)
|
|-- 1. Parameter validation (latch mode, condition)
|-- 2. Interrupt check
|-- 3. FAST PATH: lock-free read-only fix (pgbuf_lockfree_fix_ro)
| |-- Atomic increment of fix count (no BCB mutex!)
| |-- Only for: OLD_PAGE + PGBUF_LATCH_READ + UNCONDITIONAL
| |-- Returns immediately on success
|
|-- 4. NORMAL PATH: hash chain search
| |-- hash_anchor = buf_hash_table[PGBUF_HASH_VALUE(vpid)]
| |-- Lock hash_mutex
| |-- Walk hash chain to find matching VPID
| |
| |-- HIT: BCB found in hash
| | |-- Lock BCB mutex, unlock hash mutex
| | |-- Handle direct victim conflicts
| | |-- Proceed to latch acquisition
| |
| |-- MISS: BCB not found
| |-- If OLD_PAGE_IF_IN_BUFFER: return NULL
| |-- pgbuf_claim_bcb_for_fix():
| |-- pgbuf_lock_page() (prevent duplicate loads)
| |-- pgbuf_allocate_bcb() -> pgbuf_get_victim()
| |-- Read page from disk (fileio_read)
| |-- Insert BCB into hash chain
| |-- pgbuf_unlock_page()
|
|-- 5. Register fix count (pgbuf_bcb_register_fix)
|-- 6. Validate page VPID
|-- 7. Handle OLD_PAGE_PREVENT_DEALLOC
|-- 8. LATCH ACQUISITION (pgbuf_latch_bcb_upon_fix):
| |-- If no conflict: set latch atomically, return
| |-- If conflict: pgbuf_block_bcb() -> thread sleeps
| |-- Conditional latch: return NULL on conflict
|
|-- 9. Handle deallocated pages per fetch_mode
|-- 10. Performance tracking
|-- 11. Return page pointer
4.2 Page Fetch Modes¶
| Mode | Description |
|---|---|
| `OLD_PAGE` | Standard fetch – page must exist on disk or in buffer |
| `NEW_PAGE` | Newly allocated page – can be created in buffer without disk read |
| `OLD_PAGE_IF_IN_BUFFER` | Opportunistic – return NULL if not in buffer (no disk I/O) |
| `OLD_PAGE_PREVENT_DEALLOC` | Fix + prevent concurrent deallocation |
| `OLD_PAGE_DEALLOCATED` | Fix a deallocated page (expected) |
| `OLD_PAGE_MAYBE_DEALLOCATED` | Fix a page that may have been deallocated (no error if so) |
| `RECOVERY_PAGE` | Recovery context – no validation constraints |
4.3 Latch Modes¶
| Mode | Value | Description |
|---|---|---|
| `PGBUF_NO_LATCH` | 0 | No latch held |
| `PGBUF_LATCH_READ` | 1 | Shared read latch (multiple concurrent readers) |
| `PGBUF_LATCH_WRITE` | 2 | Exclusive write latch |
| `PGBUF_LATCH_FLUSH` | 3 | Used only as a block mode during flush |
4.4 pgbuf_unfix() Flow¶
pgbuf_unfix(thread_p, pgptr)
|
|-- 1. Convert page pointer to BCB pointer (CAST_PGPTR_TO_BFPTR)
|-- 2. Validation checks (debug mode)
|-- 3. Unlatch thread holder (pgbuf_unlatch_thrd_holder)
| |-- Decrement holder fix_count
| |-- If fix_count reaches 0: remove holder from thread's list
|-- 4. FAST PATH: lock-free read-only unfix (pgbuf_lockfree_unfix_ro)
| |-- Atomic decrement fix count
| |-- If read latch + no other state changes needed: return
|-- 5. NORMAL PATH: lock BCB mutex
| |-- pgbuf_unlatch_bcb_upon_unfix():
| |-- Decrement fix count atomically
| |-- If fix_count > 0: just release mutex
| |-- If fix_count == 0:
| |-- LRU boost decision (if bcb is "old enough")
| |-- Wake blocked waiters
| |-- Handle move-to-bottom flag
| |-- Handle direct victim assignment
|-- 6. Performance tracking
4.5 Lock-Free Fast Path (Critical Optimization)¶
For the common case of read-only page access (OLD_PAGE + PGBUF_LATCH_READ + UNCONDITIONAL):
Fix: pgbuf_lockfree_fix_ro() searches hash chain without BCB mutex, uses atomic CAS to increment fix count only if latch mode is already READ. No mutex acquired at all.
Unfix: pgbuf_lockfree_unfix_ro() atomically decrements fix count if no state changes are pending (no dirty marking, no waiters, etc.).
This is a major performance optimization for read-heavy workloads.
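A minimal sketch of the fast-path fix, assuming the bit layout from the union above (mode in the low 16 bits, waiter flag next, fix count in the high 32); names and packing are illustrative, not the real CUBRID code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { LATCH_NO = 0, LATCH_READ = 1, LATCH_WRITE = 2 };

/* Assumed packing for illustration (mirrors the union shown above). */
static inline uint64_t pack (int mode, int waiters, int fcnt)
{
  return (uint64_t) mode | ((uint64_t) waiters << 16) | ((uint64_t) fcnt << 32);
}

/* Lock-free read-only fix: bump the fix count only when the latch is
 * free or already held in READ mode and no waiter is queued; any other
 * state makes the caller fall back to the mutex-protected slow path. */
static bool lockfree_fix_ro (_Atomic uint64_t *latch)
{
  uint64_t old = atomic_load_explicit (latch, memory_order_acquire);
  for (;;)
    {
      int mode = (int) (old & 0xffff);
      int waiters = (int) ((old >> 16) & 0xffff);
      int fcnt = (int) (old >> 32);
      if ((mode != LATCH_NO && mode != LATCH_READ) || waiters != 0)
        {
          return false;         /* conflict: caller takes the slow path */
        }
      uint64_t desired = pack (LATCH_READ, 0, fcnt + 1);
      /* On failure the CAS reloads `old`, so the loop re-validates. */
      if (atomic_compare_exchange_weak_explicit (latch, &old, desired,
                                                 memory_order_acq_rel,
                                                 memory_order_acquire))
        {
          return true;
        }
    }
}
```

The key property is that readers never touch the BCB mutex: all state validation and the increment happen in a single CAS.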
4.6 Latch Promotion¶
pgbuf_promote_read_latch() upgrades READ -> WRITE:
- In-place promotion: If the caller is the only holder (fix_count == holder’s fix_count), directly change latch mode via CAS.
- Blocking promotion: If other readers exist and the condition is `PGBUF_PROMOTE_SHARED_READER`, temporarily unfix, queue as the first blocker with `wait_for_latch_promote = true`, and re-acquire as WRITE.
- Fail: If another promoter is already waiting, or the condition is `PGBUF_PROMOTE_ONLY_READER` with multiple readers, return `ER_PAGE_LATCH_PROMOTE_FAIL`.
4.7 Ordered Fix (Deadlock Prevention)¶
pgbuf_ordered_fix() / pgbuf_ordered_unfix() implement VPID-ordered latch acquisition to prevent deadlocks when multiple pages must be fixed simultaneously:
- Uses `PGBUF_WATCHER` to track fix order and group (HFID-based grouping for heap pages)
- When a page with a lower VPID needs to be fixed while a higher VPID is held, the higher one is unfixed first, then both are fixed in VPID order
- Ranks: `PGBUF_ORDERED_HEAP_HDR` > `PGBUF_ORDERED_HEAP_NORMAL` > `PGBUF_ORDERED_HEAP_OVERFLOW`
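The ordering rule itself reduces to a VPID comparison; the following sketch shows the decision of whether a held latch must be released before fixing the wanted page (helper names are hypothetical):

```c
#include <stdbool.h>

typedef struct { short volid; int pageid; } VPID;

/* Total order on VPIDs: by volume id, then by page id. */
static int vpid_compare (const VPID *a, const VPID *b)
{
  if (a->volid != b->volid)
    return a->volid < b->volid ? -1 : 1;
  if (a->pageid != b->pageid)
    return a->pageid < b->pageid ? -1 : 1;
  return 0;
}

/* Latches must be acquired in ascending VPID order. If the wanted page
 * sorts before the held one, the held latch is released first and both
 * pages are then fixed in order. */
static bool must_release_held_first (const VPID *held, const VPID *wanted)
{
  return vpid_compare (wanted, held) < 0;
}
```

Because every thread acquires latches in the same global order, a cycle of waiting threads cannot form, which is the deadlock-prevention guarantee.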
5. LRU & Page Replacement¶
5.1 Three-Zone LRU Design¶
Each LRU list is divided into three zones:
TOP (most recently used)
+---------------------------+
| LRU Zone 1 | HOT zone - no victimization
| (hottest pages) | No boost on unfix (already hot)
+---------------------------+ <-- bottom_1
| LRU Zone 2 | BUFFER zone - no victimization
| (warm pages) | Pages can be boosted back to top
+---------------------------+ <-- bottom_2
| LRU Zone 3 | VICTIM zone - candidates for replacement
| (cold pages) | victim_hint starts search here
+---------------------------+
BOTTOM (least recently used)
- Zone 1: Hottest pages. No boost needed (they’re already at the top). Cannot be victimized.
- Zone 2: Buffer/warm zone. Pages falling from zone 1 get a chance to be boosted back. Not victimizable.
- Zone 3: Victim zone. Pages here are candidates for replacement. Can still be boosted if accessed.
Zone sizes are controlled by configurable ratios (ratio_lru1, ratio_lru2) with min/max bounds (5%-90%).
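A minimal sketch of this ratio-based sizing with the 5%-90% clamping; function and parameter names are illustrative (the real code keeps per-list thresholds inside PGBUF_LRU_LIST):

```c
/* Clamp a configured zone ratio to the documented 5%-90% bounds. */
static double clamp_ratio (double r)
{
  if (r < 0.05) return 0.05;
  if (r > 0.90) return 0.90;
  return r;
}

/* Derive zone-1 and zone-2 capacities for a list of bcb_count pages;
 * zone 3 implicitly holds the remainder (the victim candidates). */
static void compute_zone_thresholds (int bcb_count, double ratio_lru1,
                                     double ratio_lru2, int *th1, int *th2)
{
  *th1 = (int) (bcb_count * clamp_ratio (ratio_lru1));
  *th2 = (int) (bcb_count * clamp_ratio (ratio_lru2));
}
```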
5.2 Shared vs Private LRU Lists¶
LRU List Array:
[0..num_LRU_list-1] : Shared LRU lists (all threads)
[num_LRU_list..num_LRU_list+num_garbage-1] : Shared garbage LRU lists
[num_LRU_list+num_garbage..TOTAL_LRU-1] : Private LRU lists (per-transaction)
- Shared LRUs: Any thread can add/victim from these. New pages go to a round-robin selected shared list.
- Shared Garbage LRUs: Hold shared pages pending vacuum or relocated from destroyed private LRUs. BCBs here are prioritized for victimization.
- Private LRUs: Each transaction (thread) has its own LRU list. Pages fixed by a thread with a private LRU go to that list.
- Quota system: Private LRUs have quotas based on transaction activity. Transactions exceeding quota have their BCBs moved to shared lists.
5.3 Victim Selection (pgbuf_get_victim())¶
pgbuf_get_victim()
|
|-- 1. Try direct victim (pgbuf_get_direct_victim)
| |-- Check if another thread has pre-assigned a victim BCB
| |-- Lock-free circular queue based
|
|-- 2. Try LFCQ (Lock-Free Circular Queue) victim search
| |-- Check private_lrus_with_victims queue
| |-- Check big_private_lrus_with_victims queue
| |-- Check shared_lrus_with_victims queue
| |-- For each LRU: scan zone 3 for non-dirty, non-fixed BCBs
|
|-- 3. Try invalid list (pgbuf_get_bcb_from_invalid_list)
| |-- Free BCBs that haven't been assigned to any page yet
|
|-- 4. If all fail: flush victim candidates and retry
| |-- pgbuf_wakeup_page_flush_daemon()
| |-- Thread waits for direct victim assignment
5.4 Aout List (2Q History)¶
The Aout list maintains a FIFO history of recently evicted VPIDs:
- When a BCB is victimized, its VPID is added to the Aout list
- When a page is fetched and found in Aout (recently evicted), it’s placed at the top of the LRU (hot position)
- When a page is fetched and NOT in Aout, it’s placed in the middle (zone 2 boundary)
This implements the “second chance” aspect of the 2Q algorithm, preventing scan-resistant pages from being immediately evicted after a single access.
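The Aout history is essentially a fixed-capacity FIFO of VPIDs with membership lookup; the sketch below shows the behavior described above (capacity, names, and the linear lookup are simplifications — the real list is hash-indexed):

```c
#include <stdbool.h>

#define AOUT_CAPACITY 4          /* real capacity derives from a config ratio */

typedef struct { short volid; int pageid; } VPID;

static VPID aout[AOUT_CAPACITY];
static int aout_head = 0, aout_count = 0;

/* Record an evicted page's VPID; at capacity the oldest entry drops out. */
static void aout_push (VPID v)
{
  aout[(aout_head + aout_count) % AOUT_CAPACITY] = v;
  if (aout_count < AOUT_CAPACITY)
    aout_count++;
  else
    aout_head = (aout_head + 1) % AOUT_CAPACITY;
}

/* On fetch: true -> recently evicted, place at LRU top ("second chance");
 * false -> first sighting, place at the zone-2 boundary. */
static bool aout_contains (VPID v)
{
  for (int i = 0; i < aout_count; i++)
    {
      VPID *e = &aout[(aout_head + i) % AOUT_CAPACITY];
      if (e->volid == v.volid && e->pageid == v.pageid)
        return true;
    }
  return false;
}
```

A one-pass scan touches each page once, so its VPIDs age out of Aout before a second access; only genuinely re-referenced pages earn the hot position.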
5.5 Direct Victim Assignment (SERVER_MODE)¶
An optimization to avoid victim search overhead:
struct pgbuf_direct_victim {
PGBUF_BCB **bcb_victims; // Pre-assigned victim BCBs
circular_queue<THREAD_ENTRY *> *waiter_threads_high_priority;
circular_queue<THREAD_ENTRY *> *waiter_threads_low_priority;
};
When a flush thread cleans a dirty page, it can directly assign the now-clean BCB to a waiting thread via a lock-free queue, bypassing the normal victim search entirely.
6. Concurrency & Locking¶
6.1 Lock Hierarchy (Coarse to Fine)¶
| Lock | Type | Granularity | Protects |
|---|---|---|---|
| Hash mutex | `pthread_mutex_t` | Per hash bucket (1M buckets) | Hash chain integrity |
| BCB mutex | `pthread_mutex_t` | Per BCB | BCB metadata, wait queue, flags |
| LRU mutex | `pthread_mutex_t` | Per LRU list | LRU list structure |
| Invalid list mutex | `pthread_mutex_t` | Global (1) | Free BCB list |
| Aout mutex | `pthread_mutex_t` | Global (1) | Aout history list |
| Atomic latch | `std::atomic<uint64_t>` | Per BCB | Latch mode + fix count + waiter flag |
| Free holder set mutex | `pthread_mutex_t` | Global (1) | BCB holder allocation |
6.2 Lock Ordering (Inferred)¶
To prevent deadlocks, the following ordering is observed:
hash_mutex --> BCB mutex --> LRU mutex
(never hold LRU mutex while acquiring hash_mutex or BCB mutex)
- Hash mutex is acquired first during page lookup
- BCB mutex is acquired while holding hash mutex, then hash mutex is released
- LRU mutex is acquired independently (never while holding hash or BCB mutex in the fix path)
6.3 Atomic Latch Operations¶
All latch state changes use CAS loops on the 64-bit atomic_latch:
// Example: set_latch_and_add_fcnt
do {
impl.raw = latch->load(std::memory_order_acquire);
new_impl = impl;
new_impl.impl.latch_mode = latch_mode;
new_impl.impl.fcnt += cnt;
} while (!latch->compare_exchange_weak(impl.raw, new_impl.raw,
std::memory_order_acq_rel, std::memory_order_acquire));
This allows the lock-free fast path to modify latch state without the BCB mutex.
6.4 Thread Wait Queue¶
When a latch conflict occurs, the requesting thread is placed on the BCB’s wait queue (next_wait_thrd). The pgbuf_block_bcb() function:
- Adds the thread to the wait queue (linked list via `THREAD_ENTRY`)
- Sets the `waiter_exists` flag on the atomic latch
- Releases the BCB mutex
- Calls `pgbuf_timed_sleep()` – the thread blocks with a timeout (default 300 seconds)
When a latch is released and waiters exist, pgbuf_wakeup_reader_writer() wakes the next compatible waiter.
6.5 BCB Mutex Monitoring¶
Debug infrastructure for detecting mutex leaks:
struct pgbuf_monitor_bcb_mutex {
PGBUF_BCB *bcb;
PGBUF_BCB *bcb_second;
int line, line_second;
};
When pgbuf_Monitor_locks is enabled, all BCB lock/unlock operations are tracked per-thread to detect leaks.
6.6 SA_MODE (Single-Thread) Stubs¶
In standalone mode, all mutex operations are no-ops:
#if !defined(SERVER_MODE)
#define pthread_mutex_lock(a) 0
#define pthread_mutex_unlock(a)
#define PGBUF_BCB_LOCK(bcb)
#define PGBUF_BCB_UNLOCK(bcb)
#endif
7. Flush Mechanisms¶
7.1 Background Daemons (SERVER_MODE)¶
| Daemon | Purpose |
|---|---|
| `pgbuf_Page_flush_daemon` | Flushes dirty victim candidates |
| `pgbuf_Page_maintenance_daemon` | LRU maintenance, quota adjustment |
| `pgbuf_Page_post_flush_daemon` | Post-flush processing (assigns flushed BCBs to waiters) |
| `pgbuf_Flush_control_daemon` | Adaptive flush rate control |
7.2 Flush Paths¶
Individual flush (pgbuf_flush_with_wal):
- Called by threads holding WRITE latch
- Acquires BCB mutex -> pgbuf_bcb_safe_flush_force_unlock()
- Ensures WAL constraint: log must be flushed up to oldest_unflush_lsa before page write
Bulk flush (pgbuf_flush_all_helper):
- Scans entire BCB table
- For each dirty BCB: lock, check conditions, flush, unlock
- Used for pgbuf_flush_all() and pgbuf_flush_all_unfixed()
Checkpoint flush (pgbuf_flush_checkpoint):
- Flushes pages with LSA <= flush_upto_lsa
- Uses sequential flusher with rate control
- Tracks chkpt_smallest_lsa for next checkpoint
Victim candidate flush (pgbuf_flush_victim_candidates):
- Collects dirty BCBs from LRU zone 3
- Sorts by VPID for sequential I/O
- Flushes with neighbor page batching
7.3 Neighbor Flush Optimization¶
When flushing a page, the system also flushes neighbor pages (pages with adjacent page IDs on the same volume) to improve sequential I/O:
#define PGBUF_MAX_NEIGHBOR_PAGES 32 // max neighbors in one batch
pgbuf_flush_page_and_neighbors_fb() collects dirty neighbors and flushes them together.
7.4 WAL Enforcement¶
Before any page write to disk:
pgbuf_bcb_flush_with_wal():
if (oldest_unflush_lsa > log_Gl.append.prev_lsa)
log_flush_up_to(oldest_unflush_lsa) // Ensure log is flushed first
fileio_write() or dwb_add_page() // Then write page
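The LSA comparison behind this check can be sketched as follows; the struct shape mirrors LOG_LSA (log page id + offset) but the names and helpers are simplified placeholders:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int64_t pageid; int16_t offset; } LSA_SKETCH;

/* Lexicographic order on (pageid, offset): a <= b. */
static bool lsa_le (const LSA_SKETCH *a, const LSA_SKETCH *b)
{
  return a->pageid < b->pageid
    || (a->pageid == b->pageid && a->offset <= b->offset);
}

/* WAL rule: the page may be written only once the log is durable up to
 * its oldest_unflush_lsa; otherwise a log flush must happen first. */
static bool needs_log_flush (const LSA_SKETCH *oldest_unflush,
                             const LSA_SKETCH *log_flushed_upto)
{
  return !lsa_le (oldest_unflush, log_flushed_upto);
}
```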
7.5 Flush Rate Control¶
The sequential flusher uses interval-based rate control:
- Each 1-second period is divided into sub-intervals
- Pages are flushed equally across intervals
- Burst mode vs spread mode is configurable
- Compensation is applied across intervals to meet the overall target rate
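The compensation idea can be sketched as: each sub-interval's quota is the cumulative share of the per-second budget minus what has already been flushed, so a shortfall in one interval is carried into the next (function and parameter names are illustrative, not the CUBRID implementation):

```c
/* Per-interval flush quota with cross-interval compensation. */
static int interval_flush_target (int pages_per_sec, int num_intervals,
                                  int interval_idx, int flushed_so_far)
{
  /* Pages that should be flushed by the end of this interval. */
  int cumulative = pages_per_sec * (interval_idx + 1) / num_intervals;
  int target = cumulative - flushed_so_far;
  return target > 0 ? target : 0;   /* never a negative quota */
}
```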
8. Module Integration¶
8.1 Consumer Usage Pattern¶
All upper modules follow the same pattern:
// Standard read pattern
pgptr = pgbuf_fix(thread_p, &vpid, OLD_PAGE, PGBUF_LATCH_READ, PGBUF_UNCONDITIONAL_LATCH);
// ... read page content ...
pgbuf_unfix(thread_p, pgptr);
// Standard modify pattern
pgptr = pgbuf_fix(thread_p, &vpid, OLD_PAGE, PGBUF_LATCH_WRITE, PGBUF_UNCONDITIONAL_LATCH);
// ... modify page content ...
// ... log the change (WAL) ...
pgbuf_set_dirty(thread_p, pgptr, FREE); // marks dirty + unfixes
8.2 Key Consumers (24+ modules)¶
| Module | File | Usage |
|---|---|---|
| Heap File | `heap_file.c` | Record storage, page allocation/dealloc |
| B-tree | `btree_load.c` | Index page management |
| Overflow | `overflow_file.c` | Large record overflow pages |
| Log Manager | `log_manager.c` | WAL coordination, checkpoint |
| Log Recovery | `log_recovery.c` | Page restoration during recovery |
| Vacuum | `vacuum.c` | MVCC garbage collection |
| Disk Manager | `disk_manager.c` | Volume/page allocation |
| File I/O | `file_io.c` | Actual disk read/write |
| File Manager | `file_manager.c` | File/page mapping |
| External Sort | `external_sort.c` | Sort temp pages |
| Lock Manager | `lock_manager.c` | Lock table pages |
| Slotted Page | `slotted_page.c` | Slotted page operations |
| Hash Scan | `query_hash_scan.c` | Hash join pages |
8.3 WAL Protocol Integration¶
[Modify page]
|
v
[pgbuf_set_dirty() -> sets DIRTY flag, records oldest_unflush_lsa]
|
v
[log_append_*() -> writes log record to log buffer]
|
v
[Eventually: flush daemon or checkpoint]
|
v
[pgbuf_bcb_flush_with_wal() -> ensures log is flushed FIRST, then writes page]
The oldest_unflush_lsa on each BCB tracks the earliest log record that must be on disk before the page can be flushed. This is the core WAL guarantee.
8.4 Double Write Buffer Integration¶
Pages are written through the Double Write Buffer (DWB) to prevent torn page writes:
pgbuf_bcb_flush_with_wal()
-> dwb_add_page() // Add to DWB
-> When DWB is full:
-> Write entire DWB block to DWB file (sequential I/O)
-> Then write individual pages to their actual locations
8.5 Transparent Data Encryption (TDE)¶
Pages can be encrypted/decrypted transparently:
- pgbuf_set_tde_algorithm() – set encryption algorithm per page
- pgbuf_get_tde_algorithm() – get current algorithm
- Encryption/decryption happens during I/O operations
9. Key Design Patterns¶
9.1 BCB Pointer <-> Page Pointer Conversion¶
CAST_PGPTR_TO_BFPTR(bufptr, pgptr) // PAGE_PTR -> PGBUF_BCB*
CAST_BFPTR_TO_PGPTR(pgptr, bufptr) // PGBUF_BCB* -> PAGE_PTR
The page pointer points to iopage_buffer->iopage.page, and the BCB is recovered via pointer arithmetic using offsetof.
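A minimal sketch of that recovery via offsetof; the struct contents are illustrative placeholders for the real PGBUF_IOPAGE_BUFFER layout:

```c
#include <stddef.h>

typedef char *PAGE_PTR;

struct pgbuf_bcb_sk;

typedef struct pgbuf_iopage_sk
{
  struct pgbuf_bcb_sk *bcb;     /* back pointer stored in the frame header */
  char page[64];                /* page body handed out as PAGE_PTR */
} PGBUF_IOPAGE_SK;

typedef struct pgbuf_bcb_sk
{
  PGBUF_IOPAGE_SK *iopage_buffer;
} PGBUF_BCB_SK;

/* Mirrors CAST_PGPTR_TO_BFPTR: step back from the page body to the
 * frame header, then follow the back pointer to the owning BCB. */
static PGBUF_BCB_SK *pgptr_to_bcb (PAGE_PTR pgptr)
{
  PGBUF_IOPAGE_SK *io =
    (PGBUF_IOPAGE_SK *) (pgptr - offsetof (PGBUF_IOPAGE_SK, page));
  return io->bcb;
}
```

The conversion is O(1) pointer arithmetic, which is why consumers can hand around bare PAGE_PTRs without carrying BCB references.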
9.2 Holder Tracking¶
Each thread maintains a holder list tracking which BCBs it has fixed:
- thrd_holder_info[thread_index] -> linked list of PGBUF_HOLDER
- Each holder has: fix_count, bufptr, performance stats, watcher list
- Default 7 pre-allocated holders per thread (PGBUF_DEFAULT_FIX_COUNT)
- Overflow allocation from shared free_holder_set (one PGBUF_HOLDER_SET at a time, each containing PGBUF_NUM_ALLOC_HOLDER entries)
9.3 Watcher System (Ordered Fix Support)¶
PGBUF_WATCHER structs track page fixes with ordering metadata:
- group_id: HFID-based group for heap pages
- initial_rank / curr_rank: Priority ordering
- page_was_unfixed: Flag indicating refix occurred
- Attached to holders via doubly-linked list
9.4 Performance Monitoring Hooks¶
Extensive perfmon_* calls track:
- Fix counts by page type, latch mode, conditional type
- Latch wait times (lock acquisition, holder wait, total fix time)
- Promote success/failure counts
- Unfix statistics by dirty state and latch mode
- Page type distribution snapshots
10. Observations & Discussion Points¶
10.1 Strengths¶
- Lock-free read fast path: The `pgbuf_lockfree_fix_ro()` / `pgbuf_lockfree_unfix_ro()` optimization is excellent for read-heavy workloads – no mutex acquisition at all for the common case.
- 3-zone LRU with 2Q Aout: Sophisticated replacement policy that handles both frequency and recency, with scan resistance through the Aout history.
- Private/Shared LRU partitioning: Reduces contention between transactions with different working sets and provides per-transaction page quota management.
- Direct victim assignment: Avoids wasted CPU cycles on victim search by pre-assigning flushed BCBs to waiting threads via lock-free queues.
- Comprehensive debugging: BCB mutex monitoring, page validation levels, fixed-at tracking, watcher magic numbers, and resource tracking in debug builds.
10.2 Complexity Concerns¶
- File size (17K lines): The module is very large for a single file. It could benefit from splitting into sub-modules (lru.c, flush.c, hash.c, etc.).
- Flag packing complexity: BCB flags, zone, and LRU index are all packed into a single `volatile int`. While space-efficient, this creates complex bitwise operations and potential for subtle bugs.
- `count_fix_and_avoid_dealloc` dual-purpose field: Using a single int for two independent counters (fix count in the upper 16 bits, avoid-dealloc count in the lower 16 bits) is fragile, and the comment acknowledges that the reason (the need for atomic operations) is somewhat forced.
- TODO comment in LRU list handling (around source lines 589-591):

  ```
  /* TODO: I have noticed while investigating core files from TPCC that hint is
   * sometimes before first bcb that can be victimized. this means there is
   * a logic error somewhere. */
  ```

  This is an acknowledged bug in victim hint management.
10.3 Concurrency Risk Areas¶
- BCB flags are a `volatile int` with non-atomic read-modify-write: While the BCB mutex should be held for most flag changes, the `volatile` qualifier alone doesn’t prevent torn reads/writes on some architectures. The code uses `pgbuf_bcb_update_flags()`, which does a CAS on the flags, but there are also direct flag reads without the mutex.
- Hash chain search with BCB trylock: `pgbuf_search_hash_chain()` walks the hash chain under the hash mutex and tries to lock the BCB mutex with a trylock. On failure, it must restart the search, which can lead to livelock under extreme contention.
- Lock ordering between BCB mutex and LRU mutex: While generally well-ordered, the code has complex paths where LRU operations happen after BCB unlock, creating windows where BCB state can change between operations.
10.4 Potential Improvements to Discuss¶
- Partitioned hash table locks: The current 1M hash buckets with a per-bucket mutex are good, but consider a lock-free hash table for the read path.
- NUMA awareness: The current design doesn’t appear to have NUMA-aware buffer placement, which could matter for large buffer pools.
- Adaptive zone sizing: Zone thresholds are currently ratio-based and could benefit from adaptive sizing based on workload characteristics.
- Metric-driven flush tuning: The flush rate control system (`PGBUF_FLUSH_VICTIM_BOOST_MULT`) uses a fixed multiplier. Adaptive algorithms could improve I/O scheduling.
10.5 Key Configuration Parameters¶
| Parameter | Description | Impact |
|---|---|---|
| `PRM_ID_PB_NBUFFERS` | Total buffer count | Memory vs hit ratio tradeoff |
| `PRM_ID_PB_NEIGHBOR_FLUSH_PAGES` | Neighbor flush count | Sequential I/O vs flush overhead |
| `PRM_ID_PB_NEIGHBOR_FLUSH_NONDIRTY` | Flush non-dirty neighbors too | I/O pattern optimization |
| LRU zone ratios | Zone 1/2 size ratios | Hot page protection vs victim availability |
Appendix: Function Index (Key Functions)¶
| Function | Line | Purpose |
|---|---|---|
| `pgbuf_initialize` | 1515 | Initialize buffer pool |
| `pgbuf_fix` (debug/release) | 2034 | Fix (pin) a page in buffer |
| `pgbuf_lockfree_fix_ro` | ~2132 | Lock-free read-only fix fast path |
| `pgbuf_unfix` | 2847 | Unfix (unpin) a page |
| `pgbuf_promote_read_latch` | 2620 | Upgrade READ latch to WRITE |
| `pgbuf_ordered_fix` | (macro) | VPID-ordered fix for deadlock prevention |
| `pgbuf_set_dirty` | (macro) | Mark page as dirty |
| `pgbuf_flush_with_wal` | 3361 | Flush single page with WAL |
| `pgbuf_flush_checkpoint` | 3957 | Checkpoint flush |
| `pgbuf_flush_victim_candidates` | (h:346) | Flush dirty victim candidates |
| `pgbuf_get_victim` | ~1065 | Find a victim BCB for replacement |
| `pgbuf_search_hash_chain` | ~1035 | Look up a VPID in the hash table |
| `pgbuf_latch_bcb_upon_fix` | ~1031 | Acquire latch during fix |
| `pgbuf_block_bcb` | ~1029 | Block thread on latch conflict |
| `pgbuf_adjust_quotas` | (h:480) | Adjust private LRU quotas |
| `pgbuf_direct_victims_maintenance` | (h:484) | Maintain direct victim assignment |
Report generated for CUBRID page buffer code review. Based on analysis of src/storage/page_buffer.c (develop branch).