B-tree index: structure and rules

A B-tree index is designed to store sorted data, and it uses its operator classes to organize entries in a linear order.

It's a multi-way balanced tree optimized for consistent single-entry and range data retrieval. It's balanced because all leaf nodes are located in the bottom level, ensuring every search passes through the same number of layers. It is multi-way because each node can branch out into multiple child nodes, making the tree wider and minimizing the number of levels necessary for a query search.

The B-tree entries are stored in physical disk pages, while its structure is made of logical nodes that can have multiple roles depending on their position in the tree.

The root node, located at the top of the tree, is the starting point for every query. It provides the initial pointers to the first set of internal nodes.

The internal nodes, located between the root and leaf nodes, contain key-pointer pairs that direct the search to the corresponding child node.

The leaf node, located at the bottom of the tree, contains the actual entries.

An index entry is composed of a key-pointer pair. The key is the exact column value being stored in the B-tree, and its size depends on its data type and length. The pointer is a fixed-size address that points to the location of the entry's data.

The role of a pointer depends on the node it's placed in: In an internal node, the pointer acts as a roadmap, referencing the specific child node that includes the searched entry value in its range. In a leaf node, the pointer is a specific address to the physical location of the table row containing the entry key column value; it can be a tuple identifier or a memory block notation.

In a standard B-tree, the internal-node entries can also carry pointers to the actual table data, so a lookup may end before reaching a leaf, depending on the implementation.

Each leaf node entry TID points to the table row

B-tree nodes are designed to have the same size as a disk page. This allows the entire node to be retrieved in a single I/O operation, making the query require only one disk access on each index level.

A B-tree node can be sized to match different disk page sizes (e.g., 4 KB, 8 KB, or 16 KB), allowing the index to adapt efficiently to larger databases.

Both will require a single I/O access to the disk

The maximum size of a B-tree entry is 1/3 of the node it's placed in. This rule ensures that every node can branch out to a minimum number of child nodes, which helps maintain a wide, horizontal index structure.

The order (m) of a B-tree determines the maximum number of children an internal node can have. It also defines the maximum number of keys per node (m−1) and the minimum (⌈m/2⌉−1 for every node except the root).

In an internal node, pointers act as dividers that create ranges between their key values, which are then represented by the child nodes. In a leaf node, the m−1 key rule doesn't apply because each key-pointer pair points to the single physical location of its data.

An incomplete child node will only fill its empty spaces with entries that fall within the specific range implicitly defined by its parent node's pointers.

B-tree with incomplete child nodes and m-1 pointers

Once the number of keys in a node reaches the index's order (m), the node splits. The median key is pushed up into the parent internal node (creating one if none exists), and the original node's entries are divided between the two resulting child nodes, using that median key as the divider.

The internal node contains only single values, not ranges like its child nodes. It gets updated each time one of its children splits. Once an internal node becomes full, it creates a new parent internal node with its median value. It splits the internal node's entries and their corresponding child nodes between the original node and the new sibling node.

The ranges defined by the internal node's entries have different boundaries for the left and right sides. The key-pointers for 5 and 12 will create child nodes that cover these respective ranges: x<5, 5≤x<12, and x≥12. All nodes are sorted horizontally in a linear order.

The database can perform an index-only scan on queries that use only indexed columns for all of their elements, including the SELECT list and all clauses (WHERE, ORDER BY, etc.).

It's a highly optimized database operation triggered by the query planner. It retrieves the data directly from the index structure, without using the entry TIDs to access its table row data.
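As a sketch of the idea, assuming a hypothetical orders table: when every column a query touches is covered by the index, the planner can answer it from the index alone.

```sql
-- Hypothetical schema for illustration only.
CREATE TABLE orders (id bigint, customer_id int, total numeric);
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- All referenced columns live in the index, so the planner can
-- choose an Index Only Scan (visibility map permitting):
EXPLAIN SELECT customer_id
FROM orders
WHERE customer_id BETWEEN 100 AND 200
ORDER BY customer_id;
```

Whether the plan actually shows "Index Only Scan" also depends on table statistics and how much of the visibility map is set, so results vary on a freshly loaded table.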

B-tree indexes are optimized for high-volume row operations; rather than writing directly to disk, changes are buffered in RAM and recorded sequentially in the WAL log.

B-tree RAM buffers for write optimization and index entry size properties

The index's INSERT operations are slowed down by non-linear or randomly generated data types. They can cause unpredictable page splits and lead to inefficient leaf-node entry distribution, but they won't affect SELECT operations.

The PostgreSQL SERIAL and MySQL AUTO_INCREMENT constructs create efficient sequences of column values in the index. They generate values in a linear order, allowing for an efficient leaf-node storage and reducing the node splits.
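A minimal example of the PostgreSQL construct (table and column names are illustrative):

```sql
-- SERIAL generates monotonically increasing ids, so new index
-- entries always land on the rightmost leaf page instead of
-- causing random splits across the tree.
CREATE TABLE events (
    id   SERIAL PRIMARY KEY,  -- MySQL would use AUTO_INCREMENT here
    note text
);

INSERT INTO events (note) VALUES ('first'), ('second');
```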

The database loads copies of frequently accessed data pages into its RAM cache, allowing it to update the pages without I/O operations. It then pushes the changes to the physical disk before the page split.

The database uses a sequential Write-Ahead Log (WAL) on the disk to prevent data loss. Any change to a data page is first recorded in the WAL before it's applied to the page in the RAM cache. The database will replay the WAL operations to restore the data after a system crash.

The entry's size impacts the B-tree structure and efficiency. A large key data type causes the node to contain fewer entries before it splits, making the index taller and slowing down the data retrieval. A small key data type may not provide enough unique values due to its shorter length. A 32-bit INT offers about 2.1 billion positive values, while a 64-bit BIGINT offers about 9.2 quintillion, trading a larger entry size for far more unique values.

All RAM changes are stored as binary records for recoverability

Insert and Update Procedures for Table Heap and Index Units

The database stores a table's rows in a collection of disk pages, which is called a table heap. It's optimized for size and organized into unordered tuples, each containing the column data for a single table row.

The database retrieves tuple data from the heap through either a table scan or an index scan. The index includes a TID (Tuple Identifier) for each entry of its indexed columns, which is used to locate the specific tuple in the heap.
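The heap address is visible through the ctid system column; a minimal sketch, assuming a hypothetical orders table:

```sql
-- ctid exposes a tuple's (page, offset) location in the heap --
-- the same physical address an index entry's TID stores.
SELECT ctid, id
FROM orders
WHERE id = 42;
```

The returned ctid (for example, a pair like (3,7) meaning page 3, item 7) changes whenever the tuple is moved or a new version is written, which is why indexes must be kept in sync with the heap.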

The disk pages are stored within a relation file

The table heap will append any new INSERT tuple to the end of the heap, without any page range condition or ordering like in the B-tree. On UPDATE operations, the new tuple's position depends on whether the updated tuple columns are part of a table's index.

An UPDATE value including an indexed column will be appended as a new tuple in the heap, and it creates a new entry in the index, which points to the new version of the tuple with its TID. Any outdated tuple and index entry will be kept and handled by the later maintenance operations. An UPDATE value that doesn't include an indexed column will instead result in a HOT update.

A Heap-Only-Tuple(HOT) UPDATE modifies the table heap without triggering an index update. It adds a new version of the tuple to the same disk page as the old one. It also adds a forward pointer to the old tuple, which allows its index entry TID to find the correct data. If there is no available space on the old tuple's disk page, it defaults to an indexed update. The database management system contains the rules for the HOT update. The query planner provides the execution plan for the database after analyzing a query.
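Whether an update took the HOT path can be checked in the statistics views; a sketch assuming a hypothetical accounts table with a non-indexed balance column:

```sql
-- Update a column that is not part of any index,
-- so the HOT path is possible:
UPDATE accounts SET balance = balance + 1 WHERE id = 7;

-- n_tup_hot_upd counts updates that avoided creating
-- a new index entry:
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'accounts';
```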

The t_ctid tuple property gets updated with a forward pointer on HOT update

Index and table heap maintenance operations for outdated values

Version bloat forms when the index or the table heap fills up with outdated values. A tuple is marked as dead on DELETE or UPDATE and will be removed after a visibility check during maintenance. An index entry isn't explicitly marked; its TID is checked to see if it points to a dead tuple.

Outdated values in Table heap and Index, both handled in VACUUM

The REINDEX command is a global maintenance operation. It performs a complete table scan to rebuild the index from the current tuples in the table heap, which is then used to replace and update the old index. It relies on a prior VACUUM visibility check to detect the live tuples; if not, it defaults to indexing all tuples, even outdated ones.

The BOTTOM-UP defragmentation process triggers when a full B-tree index page is about to split. It is a localized and immediate maintenance operation that cleans up outdated index entries.

The VACUUM operation is a page-by-page maintenance process that deletes dead tuples in the table heap and marks outdated index entries for removal. VACUUM is triggered manually, while AUTOVACUUM runs automatically in the background. The space from deleting dead tuples is not returned to the operating system; it is marked as free internally for new tuples in the same table heap. A table heap's size is reduced by relocating its data to a smaller set of disk pages, which requires blocking all transactions. VACUUM maintains the table heap under the MVCC conditions, preserving concurrent access during UPDATE operations.

The VACUUM FULL command removes dead tuples by replacing the old table heap with a new one with only live tuples. It includes a REINDEX for the index, rebuilding it using the live tuples from the new table heap.

The VACUUM operation uses transaction IDs to determine the visibility state of a tuple. Every transaction is assigned a unique XID, which is recorded in the XMIN property of the new tuples it creates and the XMAX property of any tuples it updates or deletes. VACUUM sets its cleanup cutoff point at the XID of the oldest currently active transaction. It removes tuples whose XMAX is lower than that cutoff, as they are no longer visible to any ongoing transaction.
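The xmin and xmax values are exposed as system columns on every table; a sketch assuming the hypothetical accounts table:

```sql
-- xmin: XID of the transaction that created this tuple version.
-- xmax: XID of the transaction that updated/deleted it
--       (0 while the version is still live and untouched).
SELECT xmin, xmax, ctid, id, balance
FROM accounts
WHERE id = 7;
```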

Accessing Index and Table Heap Statistics with Extensions and System Catalogs

The pgstattuple extension provides functions that reveal low-level details about the database's physical storage. It returns properties like bloat and unused space, which are invisible to high-level commands like SELECT. It ships with PostgreSQL but must first be activated with the CREATE EXTENSION command.

The pgstattuple() function analyzes the physical layout of its table heap:

- table_len: The total size of the table in bytes, including the empty and unassigned space.
- tuple_count/tuple_percent: The number of all tuples (both live and dead) and the percentage of the table's space they occupy.
- dead_tuple_count/dead_tuple_percent: The number of dead tuples and the percentage of the table's space they occupy.
- free_space/free_percent: The amount of empty space on the table's pages, measured in bytes, and the percentage of reusable space in the table's total disk space.
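A minimal usage sketch, again assuming a hypothetical accounts table:

```sql
-- One-time activation per database:
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Scan the table's physical pages and report the
-- properties listed above:
SELECT table_len, tuple_count, dead_tuple_count,
       dead_tuple_percent, free_percent
FROM pgstattuple('accounts');
```

Note that pgstattuple() performs a full physical scan of the relation, so it is meant for diagnostics rather than routine monitoring.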

These functions provide a low-level view of tuples and index entries by accessing their physical location within the database.

The low-level physical values of the relations

The pgstattuple extension functions access the low-level data to measure the bloat and fragmentation caused by row operations. We can use it to track the effectiveness of the maintenance operations.

Different pgstattuple properties for HOT updates on a no-AUTOVACUUM table

We used the pgstattuple extension's pgstattuple() and pgstatindex() functions to gather data on table and index bloat. This allows us to analyze the impact of VACUUM after a series of sequential updates.

We create two identical tables, one with autovacuum disabled, and populate each with 10,000 rows. We then create an index on the id column for both tables and perform five consecutive HOT updates on a non-indexed column.

On the table without autovacuum, pgstattuple shows a dead_tuple_count of 5 and a high dead_tuple_percent. This reflects the five outdated tuple versions left in the table heap. The table with autovacuum has both properties at 0, as the background process has already cleaned up the outdated tuples.

The AUTOVACUUM only affects the internal fragmentation of the table it runs on. The GENERATE_SERIES() INSERTs use all available space on the disk pages because of the default FILLFACTOR of 100. A Heap-Only Tuple (HOT) update requires free space on the page to store the new version of a row. Without any available space, the HOT optimization fails, and the system performs a standard update, creating a new index entry for every updated row.

The AUTOVACUUM removes the outdated entries from a leaf page. The resulting empty space lowers the avg_leaf_density, which remains high in a non-vacuumed table. The process doesn't affect the external fragmentation. The leaf_fragmentation represents how organized the data is within the physical index structure.

The GENERATE_SERIES() INSERTs maintain a low external fragmentation by creating an organized and contiguous series of data, while UPDATE operations increase it. String-based data types are not ordered linearly like integers: an updated entry will not be placed in a contiguous location next to its old value, even if the string is mostly the same. The external index structure can be reorganized by the REINDEX operation, which completely rebuilds the data's distribution across the index pages.

A lower FILLFACTOR increases the chances for successful HOT updates, which in turn decreases the index size. The leaf_fragmentation is also lower because fewer new pages are created and scattered across the disk, resulting in a more compact index structure. The avg_leaf_density, however, varies depending on the index's size. The FILLFACTOR also slightly influences the table heap's dead_tuple_percent. It doesn't change the number of tuples inserted or which ones are considered live or dead, but it increases the number of disk pages needed to store the initial data. The larger table size results in a lower dead_tuple_percent.
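FILLFACTOR is set at creation time; a sketch using illustrative names:

```sql
-- Reserve 30% of each heap page for future HOT updates:
CREATE TABLE accounts (
    id      int PRIMARY KEY,
    balance numeric
) WITH (fillfactor = 70);

-- Indexes accept the option too (the B-tree default is 90):
CREATE INDEX idx_accounts_balance
ON accounts (balance)
WITH (fillfactor = 70);
```

A lower heap fillfactor trades a larger initial table for a better chance that row versions stay on their original page.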

Table and index stats on different FILLFACTOR
The impact of AUTOVACUUM on a GENERATE_SERIES() of HOT updates

We can access a table's or index's structure using the pg_class system catalog view. The pg_class is a built-in feature that provides a native view of the metadata for every database relation. Unlike extensions, it requires no installation. It returns high-level estimates, which the database calculates during maintenance operations without scanning the physical storage pages.

The view includes the following key columns:

- relname: The name of the relation.
- relkind: A code indicating the type of relation.
- relpages: The estimated number of disk pages used for the relation's storage.
- reltuples: The estimated number of tuples or index entries in the relation.
- oid: The unique identifier used by the database to retrieve the relation's physical size.
- reloptions: The storage options specified when the relation was created.
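A short query over those columns, assuming the hypothetical accounts table and its primary-key index:

```sql
-- relkind: 'r' = ordinary table, 'i' = index, 't' = TOAST table...
-- relpages/reltuples are estimates refreshed by VACUUM/ANALYZE.
SELECT relname, relkind, relpages, reltuples, reloptions
FROM pg_class
WHERE relname IN ('accounts', 'accounts_pkey');
```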

The database provides many specialized system views and functions that rely on the pg_class master catalog. The unique OID stored in pg_class is used to retrieve the relation's metadata from its specific directory.

The additional pg system views and functions for database relations

We query system views or functions using the relation's name; PostgreSQL's internal machinery automatically retrieves the linked pg_class OID, which is the actual value used internally to access the metadata.

The pg_indexes system view provides logical metadata about the indexes. It returns the index's name, its table, its SQL script command, and its assigned tablespace. The TABLESPACE keyword defines the physical directory location for the table. It contains all the logical storage units used for the table and its associated relations.

The pg_stat_all_tables system view provides the table's runtime statistics. It includes the row operation counts, the number and types of scans, the number of live and dead tuples, and timestamps of the last maintenance operations. The statistics collector provides the exact table activity parameters from a dedicated runtime directory, enabling the efficient monitoring and validation of maintenance processes without requiring a full table scan.

The pg_stat_all_indexes view provides similar statistics specific to the index, like the number of index entry reads.

The pg_relation_size() function returns the exact size of a database relation. It doesn't rely on the pg_class metadata estimates; instead, it uses operating-system calls to read the relation's physical files on disk.

The pg_total_relation_size() function queries the same operating system functions as pg_relation_size() but includes the table's additional structures, like TOAST and its associated indexes. A TOAST table stores column values that are too large to fit on a standard data page.

The pg_table_size() function uses the same operating-system calls but scopes them to the table itself and its associated structures like TOAST, excluding indexes. The pg_indexes_size() function complements it, returning the combined size of all indexes associated with the specified table.
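The four size functions nest as follows; a sketch against the hypothetical accounts table:

```sql
-- heap_only       : main fork of the table itself
-- heap_plus_toast : table + TOAST (no indexes)
-- all_indexes     : every index on the table combined
-- everything      : table + TOAST + all indexes
SELECT pg_size_pretty(pg_relation_size('accounts'))       AS heap_only,
       pg_size_pretty(pg_table_size('accounts'))          AS heap_plus_toast,
       pg_size_pretty(pg_indexes_size('accounts'))        AS all_indexes,
       pg_size_pretty(pg_total_relation_size('accounts')) AS everything;
```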

The physical properties accessed by the system views and functions

The system views and functions outside of pg_class provide a better view into a table's physical storage. They can reveal properties not directly visible in pg_class, like the TABLESPACE (physical directory), and include the size of associated relations, like the TOAST (The Oversized Attribute Storage Technique) tables.

The B-tree Composite Index and its Left-Prefix Rule structure

A B-tree composite index includes multiple columns in its CREATE INDEX statement. Each composite key is a combination of the indexed column values, and it points to a single TID.

The structure of a composite index is defined by its first CREATE INDEX column. The B-tree uses the first column's value to define the ranges and distribution of its nodes. All the following columns' positions within the composite key depend on the placement of the first column value. A query must include the first column in its WHERE condition to effectively navigate a composite index. A query filtering only on a later column would have to scan every entry to find its value, resulting in an operation similar to a full table scan.

A B-tree composite index ordered by its first column

We follow the left-prefix rule when creating a composite index to optimize its queries, deciding the order of the columns based on the type of data they contain and the query operations they're involved in. It prioritizes columns with the highest cardinality (most unique values), as they allow the index to quickly filter a large number of rows. It prioritizes columns involved in more selective operations, like equality checks (=), as they allow the index to filter out more specific results. It doesn't prioritize columns used in the ORDER BY clause, as they sort results that have already been filtered at the end of the query. We can include the ASC and DESC keywords in the CREATE INDEX statement to match the sort order of an ORDER BY clause. This allows the database to skip the sorting step in the query.
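A sketch of these choices, assuming hypothetical customer_id (high-cardinality, equality checks) and order_date (range/sort) columns:

```sql
-- Equality/high-cardinality column first; DESC matches the
-- query's ORDER BY so the sort step can be skipped:
CREATE INDEX idx_orders_cust_date
ON orders (customer_id, order_date DESC);

-- Can navigate the index (left prefix present):
SELECT * FROM orders
WHERE customer_id = 42
ORDER BY order_date DESC;

-- Cannot navigate it (first column missing from WHERE):
SELECT * FROM orders
WHERE order_date > '2024-01-01';
```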

A composite index applies the left-prefix rule to its columns using their operator classes. It stores its multiple columns in the pg_index catalog as an array. Each column has its own associated operator class, defined in pg_opclass, which provides its comparison and sorting functions from pg_amop and pg_amproc, respectively. The columns' operator classes act independently of each other. They only need to be compatible with the index's access method to be indexed together. The index's sort order is defined by the first column's operator class, with subsequent columns applying their sorting functions only when needed to differentiate rows with identical values.

The oidvector data type is similar to an array structure

Comparing the MVCC method for tuple updates and the CONCURRENTLY option for table LOCK index operations.

The database applies a table lock when creating an index, which blocks write operations on the table's data for the duration of the build.

We can avoid the inaccessibility period by using the CONCURRENTLY keyword in the CREATE INDEX statement. It allows the database to create the index without locking the table, keeping it available for other operations.

The CONCURRENTLY keyword builds the index in a temporary space, requiring multiple scans to include any tuple updates and maintain data consistency with the table heap. Once it's complete, the database applies the new index to the table.
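In PostgreSQL the keyword is spelled CONCURRENTLY; a minimal sketch with an illustrative index name:

```sql
-- Builds without an exclusive lock, so writes continue;
-- cannot run inside a transaction block:
CREATE INDEX CONCURRENTLY idx_orders_total
ON orders (total);

-- If the build fails, it leaves behind an INVALID index
-- that must be dropped manually:
-- DROP INDEX CONCURRENTLY idx_orders_total;
```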

The MVCC (Multi-Version Concurrency Control) is a database method that manages concurrency. Concurrency is a property of the database management system that allows it to handle multiple transactions on the same data unit simultaneously. The MVCC provides a consistent data version for all transactions and prevents uncommitted changes from blocking other operations. Multiple operations can maintain data concurrency, but not all of them employ the MVCC method.

The UPDATE operation employs the MVCC by not physically changing a tuple in the table heap. It creates a new version of the tuple while keeping the old version available for any ongoing operation that started before the transaction was committed. The process is a local read/write operation that manages data using the xmin and xmax transaction IDs in the tuple header, leaving the outdated tuple for later maintenance operations like VACUUM.

MVCC implementation on table heap update

The CONCURRENTLY option applied to the CREATE INDEX operation doesn't use the MVCC. It's a schema-level operation that maintains the concurrency of the table by avoiding a lock during index creation, allowing other operations to proceed. It's not part of the MVCC approach because it physically creates and replaces an entire index instead of just managing local operations.

The first scan creates the initial version of the index by reading all the current table rows. The second scan, triggered after the first is complete, verifies any changes made to the table during the initial index build, adding any new or updated tuples to the temporary index. The final data consistency check validates the atomic swap, which is a single, indivisible action that replaces the old index with the new one. The deleted tuples are handled by the standard VACUUM process, which marks their space as available. An index created with the CONCURRENTLY option will still contain entries that point to outdated tuples; they will be removed by later maintenance operations.

The MVCC method on a CONCURRENTLY-built index

We can apply a constraint directly to the values within an index. The index's structure optimizes the constraint validation check on the current values, without having to scan the entire table. It applies only to constraints that implicitly create and use indexes, like UNIQUE and PRIMARY KEY.
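Two equivalent ways to get an index-backed constraint, sketched with hypothetical names:

```sql
-- Option 1: the constraint implicitly creates a unique B-tree:
ALTER TABLE accounts
    ADD CONSTRAINT accounts_email_key UNIQUE (email);

-- Option 2: build the unique index first (optionally without
-- locking the table), then attach it as the constraint:
CREATE UNIQUE INDEX CONCURRENTLY accounts_email_idx
    ON accounts (email);
ALTER TABLE accounts
    ADD CONSTRAINT accounts_email_key
    UNIQUE USING INDEX accounts_email_idx;
```

The second form is useful on busy tables, since the expensive index build happens without blocking writes.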

Indexes can't be used to create multiple logical partitions of data from the same column. The query planner has access to the index conditions but can't determine if they are mutually exclusive. It will check all relevant indexes, which adds overhead and makes the operation as slow as a table scan. Instead, we use declarative partitioning, which is designed to handle this exact scenario.

Data type indexing with Operator class and family

The B-tree assigns an operator class to each data type allowed within the index. It contains the functions that set and maintain the linear sorting order of the index values, such as sorting and operator functions.

The sorting function uses the trichotomy rule to determine the mathematical relationship between two values. It establishes the initial sorted order of the index and maintains it for every new entry added.

The ordering relations are the fundamental mathematical properties necessary for the B-tree's linear order. These rules are implicitly enforced by the sorting function.

The index value must follow the reflexive rule (A=A), the symmetric rule (if A=B, then B=A), and the transitive rule (if A=B and B=C, then A=C) for the equality operator. It must follow the irreflexive rule (A<A is always false) and the transitive rule (if A<B and B<C, then A<C) for the less than (<) and greater than (>) operators. The trichotomy rule states that for any two values, A and B, only one of three conditions can be true: A<B, A=B, or A>B. This rule ensures a single, linear order.

The operator functions handle the five comparison operations (<, >, =, ≤, and ≥) used for index queries. A single-value query uses the (<) and (>) operators to navigate the B-tree index positions, then confirms the matching value using the (=) operator. A range-value query uses the (≤) and (≥) operators to locate the start of the range and includes all entries until it finds the last key value that satisfies the condition.

The structure of a data type Operator class

Operator families group multiple operator classes for data types with equivalent sorting rules, like int4 and int8, and allow for cross-data type comparison queries on the B-tree.

The database identifies the different data types in the index query and checks if their operator classes are part of the same operator family. This allows it to implicitly convert one of the data types during the cross-data type comparison.
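The cross-type operators a family provides can be inspected directly in the catalogs; for example, the built-in integer_ops B-tree family:

```sql
-- Each row is a comparison operator the family supplies;
-- mixed left/right types (e.g. int4 = int8) are what make
-- cross-data type index comparisons possible.
SELECT amoplefttype::regtype  AS left_type,
       amoprighttype::regtype AS right_type,
       amopopr::regoperator   AS operator
FROM pg_amop a
JOIN pg_opfamily f ON a.amopfamily = f.oid
JOIN pg_am am      ON f.opfmethod  = am.oid
WHERE f.opfname = 'integer_ops'
  AND am.amname = 'btree';
```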

Some compatible data types, like NUMERIC and FLOAT, are not suitable for cross-data type comparisons. Their different precision levels would make a converted NUMERIC value differ from the original, breaking the ordering rules (a value would no longer compare equal to itself after conversion).

The metadata for the operator class functions is stored in the pg_amop and pg_amproc system catalogs.

B-tree support functions and EQUALIMAGE index deduplication

The support functions are specialized methods contained within each operator class. They optimize the B-tree's core operations, like the sorting function that sets and maintains the index's internal structure.

The ORDER support function implements the trichotomy rule for index entries. It compares two values and returns an int32 outcome representing their mathematical relationship. A specific version of this function is included for each data type allowed in the index.

The SortSupport optional function optimizes index sorting. It acts as a helper for the ORDER function, loading all the indexed column data into RAM where it's efficiently sorted using a block-focused algorithm. The query planner chooses to use it for large datasets, where the single-time overhead of loading data into RAM is less than the cost of multiple, single-comparison disk operations performed by the ORDER function. The function's availability depends on the data type's operator class.

The in_range support function handles range-offset comparisons. It determines whether a value lies within a given distance of another, which the database uses to evaluate RANGE offset clauses in window functions without a full scan.

The equalimage support function is an optional helper for index deduplication. It allows a B-tree index to store multiple table rows' TIDs under a single key, using a posting list, and avoids the use of multiple keys for the same values.

The equalimage function is triggered for all repeated values within the index keys. It applies a semantic check to validate the key deduplication, which allows the database to create a canonical image of the values. It also includes NULL entries, which, unlike in standard SQL, aren't considered unique and can be grouped together.

The database deduplicates values during the CREATE INDEX and REINDEX operations, automatically checking for equal values within the index. On INSERT and UPDATE it acts as a conditioned lazy process.
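Deduplication can be controlled per index and its effect observed with pgstatindex(); a sketch with an illustrative low-cardinality status column:

```sql
-- deduplicate_items is on by default in PostgreSQL 13+;
-- shown here explicitly for illustration:
CREATE INDEX idx_orders_status
ON orders (status)
WITH (deduplicate_items = on);

-- From the pgstattuple extension: deduplicated indexes show
-- fewer leaf pages and a higher avg_leaf_density:
SELECT * FROM pgstatindex('idx_orders_status');
```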

The deduplicated key's posting list must fit within its single node. It can't span across multiple nodes, and if its size exceeds the node's capacity, the duplicate values will be stored as separate keys.

Indexes with columns added through the INCLUDE clause can't use deduplication. The included payload values are stored alongside the key in each leaf entry, so those entries can't be merged into a posting list.

The equalimage function receives the collation assigned to the indexed data type's comparison function and checks whether its semantics are deterministic. The deterministic rule sets the equalimage boolean flag to true, which marks equality operations defined by the query condition as image-safe and allows PostgreSQL to build the deduplication structure.

How the deterministic rule ensures consistent comparisons for the equalimage

The deduplication process relies on the database's deterministic rules applied to its comparison operations.

The database includes two types of equality:
- Byte-by-byte equality, for data types like INT, where it compares the exact match between two single values.
- Logical equality, for data types like TEXT, where it compares the logical results of the values, like the strings "straße" and "strasse".
The deterministic rule ensures that any comparison will always yield the same result for the same values.

The collation rules define how operations and sorting are regulated for collatable data types, such as character-based values. They are stored in the pg_collation catalog. Their OID specifies which set of collation rules is applied by the comparison function within the indexed column's operator class; collations default to deterministic if not specified otherwise at creation.

The deterministic rules validate a comparison equality operation by confirming the integrity of the equality image, which ensures the operation remains valid even if the values change positions. The non-deterministic rules allow for case-insensitive and logic-based equality operations.

The equalimage boolean flag is stored in the metadata, allowing PostgreSQL to create a deduplication structure. It uses a single posting list entry containing multiple TIDs for equal values, enabling the lazy merging of values during INSERT operations.

The "lazy process" refers to the timing of the deduplication merging: it happens when the database anticipates a disk page split; the deduplication pass set by the equalimage boolean flag then allows the equal entry values to be merged without additional checks.

Certain compatible data types can't be deduplicated, such as FLOAT4 (REAL) and FLOAT8 (DOUBLE PRECISION), because of their different precision levels and how they handle specific values like +0 and -0. If an indexed column can hold multiple compatible data types (e.g., VARCHAR and TEXT), their values can be deduplicated on the same semantic image.

The EQUALIMAGE flag applied during the index deduplication process

The B+tree rules and structure

The B+tree is a streamlined version of the B-tree data structure.

The B+tree stores key-TID pairs only in the leaf nodes, using the internal nodes only for keys and child pointers. This is different from a standard B-tree, where the internal nodes also contain the TIDs of their single values that act as dividers.

The B+tree includes a doubly linked list that horizontally connects nodes at the same tree level. It optimizes range retrievals by avoiding repeated vertical traversals of the tree. It can slow down single-value queries because the search must always reach a leaf node, while a standard B-tree can retrieve the data from an internal node.

The B+tree logical structure is implemented differently depending on the database management system:

In MySQL, a B+tree includes previous and next pointers for all nodes at every horizontal level, which allows for efficient horizontal retrieval of data. In PostgreSQL, the B-tree has the previous and next pointers only at the last level, for the leaf nodes.

Both follow the N keys for N+1 child pointers rule for their internal nodes.

The B+tree is used for a table's primary key index, where its TID works as a unique row identifier. It's created by default on CREATE TABLE when a primary key is included in a column definition, and it optimizes data retrieval on the table's heap.

The B+tree internal nodes can contain more pointers, as they don't store the entries
