Node Storage
This article gives some insight into how XML nodes are internally stored.
Node Table
XML nodes are stored in a flat table. The node table of a database can be displayed via the INFO STORAGE
command:
$ basex -c"CREATE DB db <xml>HiThere</xml>" -c"INFO STORAGE"
PRE DIS SIZ ATS ID NS KIND CONTENT
-----------------------------------------
0 1 3 1 0 0 DOC db.xml
1 1 2 1 1 0 ELEM xml
2 1 1 1 2 0 TEXT HiThere
PRE Value
The PRE value of a node represents the order in which the XML nodes are visited by a SAX parser. It is actually not stored in the database; instead, the table position implies it. As a result, it will change whenever a node with a smaller PRE value is added to or deleted from a database.
ID Value
Each database node has a persistent ID value, which remains valid after update operations, and which is referenced by the value indexes. As long as no updates are performed on a database, the PRE and ID values are identical and need not be stored. The values will remain identical as long as new nodes are exclusively added to the end of the database. Once nodes are deleted or inserted somewhere else, the values will diverge, as shown in the next example:
$ basex -c"CREATE DB db <xml>Hi</xml>" -q"insert node <b/> before /xml" -c"INFO STORAGE"
PRE DIS SIZ ATS ID NS KIND CONTENT
-----------------------------------------
0 1 4 1 0 0 DOC db.xml
1 1 1 1 3 0 ELEM b
2 2 2 1 1 0 ELEM xml
3 1 1 1 2 0 TEXT Hi
The db:node-pre
and db:node-id
functions can be called to retrieve the PRE and ID values of a node, and db:get-pre
and db:get-id
can be used retrieve the node with the specified value. By default, ID lookups are expensive. If UPDINDEX
is turned on, an additional index will be maintained to speed up the process.
Block Storage
The node table is organized as blocks with a fixed maximum number of records. The records within a block are sorted by their PRE value (which, therefore, can be implicitly determined and need not be saved). For each block, the smallest pre value within that block is stored. Since the records are sorted, it will be the PRE value of the first record stored in the block. Since the two maps will not grow excessively large, but are accessed/changed by each read/write operation, they are kept in main memory and only flushed to disk if required. In addition, a bit map tells which physical blocks are free and which not, so that when a new block is needed, an already free one will be reused.