Difference between revisions of "Storage Layout"
m (Text replacement - "<br />" to "<br/>") |
|||
(17 intermediate revisions by 3 users not shown) | |||
Line 5: | Line 5: | ||
The following data types are used for specifying the storage layout: | The following data types are used for specifying the storage layout: | ||
− | {| class="wikitable" | + | {| class="wikitable" |
− | + | |- valign="top" | |
! Type | ! Type | ||
! Description | ! Description | ||
! Example (native → hex integers) | ! Example (native → hex integers) | ||
− | |- | + | |- valign="top" |
| {{Type|Num}} | | {{Type|Num}} | ||
− | | Compressed integer (1-5 bytes), specified in [https://github.com/BaseXdb/basex/blob/master/src/main/java/org/basex/util/Num.java Num.java] | + | | Compressed integer (1-5 bytes), specified in [https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/util/Num.java Num.java] |
− | | {{ | + | | {{Code|15}} → {{Code|0F}}; {{Code|511}} → {{Code|41 FF}}<br/> |
− | |- | + | |- valign="top" |
| {{Type|Token}} | | {{Type|Token}} | ||
| Length ({{Type|Num}}) and bytes of UTF8 byte representation | | Length ({{Type|Num}}) and bytes of UTF8 byte representation | ||
− | | {{ | + | | {{Code|Hello}} → {{Code|05 48 65 6c 6c 6f}} |
− | |- | + | |- valign="top" |
| {{Type|Double}} | | {{Type|Double}} | ||
| Number, stored as token | | Number, stored as token | ||
− | | {{ | + | | {{Code|123}} → {{Code|03 31 32 33}} |
− | |- | + | |- valign="top" |
| {{Type|Boolean}} | | {{Type|Boolean}} | ||
− | | Boolean (1 byte, {{ | + | | Boolean (1 byte, {{Code|00}} or {{Code|01}}) |
− | | {{ | + | | {{Code|true}} → {{Code|01}} |
− | |- | + | |- valign="top" |
| {{Type|Nums}}, {{Type|Tokens}}, {{Type|Doubles}} | | {{Type|Nums}}, {{Type|Tokens}}, {{Type|Doubles}} | ||
| Arrays of values, introduced with the number of entries | | Arrays of values, introduced with the number of entries | ||
− | | {{ | + | | {{Code|1,2}} → {{Code|02 01 31 01 32}} |
− | |- | + | |- valign="top" |
| {{Type|TokenSet}} | | {{Type|TokenSet}} | ||
| Key array ({{Type|Tokens}}), next/bucket/size arrays (3x {{Type|Nums}}) | | Key array ({{Type|Tokens}}), next/bucket/size arrays (3x {{Type|Nums}}) | ||
Line 38: | Line 38: | ||
=Database Files= | =Database Files= | ||
− | The following tables illustrate the layout of the BaseX database files. All files are suffixed with {{ | + | The following tables illustrate the layout of the BaseX database files. All files are suffixed with {{Code|.basex}}. |
− | == | + | ==Metadata, Name/Path/Doc Indexes: {{Code|inf}}== |
− | {| class="wikitable" | + | {| class="wikitable" |
− | + | |- valign="top" | |
! Description | ! Description | ||
! Format | ! Format | ||
− | + | |- valign="top" | |
− | |- | + | | '''1. Metadata''' |
− | + | | 1. Key/value pairs, in no particular order ({{Type|Token}}/{{Type|Token}}):<br/> • Examples: {{Code|FNAME}}, {{Code|TIME}}, {{Code|SIZE}}, ...<br/> • {{Code|PERM}} → Number of users ({{Type|Num}}), and name/password/permission values for each user ({{Type|Token}}/{{Type|Token}}/{{Type|Num}})<br/>2. Empty key as finalizer | |
− | + | |- valign="top" | |
− | | valign= | + | | '''2. Main memory indexes''' |
− | + | | 1. Key/value pairs, in no particular order ({{Type|Token}}/{{Type|Token}}):<br/> • {{Code|TAGS}} → Element Name Index<br/> • {{Code|ATTS}} → Attribute Name Index<br/> • {{Code|PATH}} → Path Index<br/> • {{Code|NS}} → Namespaces<br/> • {{Code|DOCS}} → Document Index<br/>2. Empty key as finalizer | |
− | + | |- valign="top" | |
− | | 1. Key/value pairs, in no particular order ({{Type|Token}}/{{Type|Token}}):<br /> • {{ | + | | '''2 a) Name Index'''<br/>Element/attribute names |
− | | valign= | + | | 1. Token set, storing all names ({{Type|TokenSet}})<br/>2. One StatsKey instance per entry:<br/>2.1. Content kind ({{Type|Num}}):<br/>2.1.1. Number: min/max ({{Type|Doubles}})<br/>2.1.2. Category: number of entries ({{Type|Num}}), entries ({{Type|Tokens}})<br/>2.2. Number of entries ({{Type|Num}})<br/>2.3. Leaf flag ({{Type|Boolean}})<br/>2.4. Maximum text length ({{Type|Double}}; legacy, could be {{Type|Num}}) |
− | + | |- valign="top" | |
− | + | | '''2 b) Path Index''' | |
− | | 1. Token set, storing all names ({{Type|TokenSet}})<br />2. One StatsKey instance per entry:<br/>2.1. Content kind ({{Type|Num}}):<br />2.1.1. Number: min/max ({{Type|Doubles}})<br />2.1.2. Category: number of entries ({{Type|Num}}), entries ({{Type|Tokens}})<br />2.2. Number of entries ({{Type|Num}})<br />2.3. Leaf flag ({{Type|Boolean}})<br />2.4. Maximum text length ({{Type|Double}}; legacy, could be {{Type|Num}}) | + | | 1. Flag for path definition ({{Type|Boolean}}, always {{Code|true}}; legacy)<br/>2. PathNode:<br/>2.1. Name reference ({{Type|Num}})<br/>2.2. Node kind ({{Type|Num}})<br/>2.3. Number of occurrences ({{Type|Num}})<br/>2.4. Number of children ({{Type|Num}})<br/>2.5. {{Type|Double}}; legacy, can be reused or discarded<br/>2.6. Recursive generation of child nodes (→ 2) |
− | | valign= | + | |- valign="top" |
− | + | | '''2 c) Namespaces''' | |
− | |||
− | | 1. Flag for path definition ({{Type|Boolean}}, always {{ | ||
− | | valign= | ||
− | |||
− | |||
| 1. Token set, storing prefixes ({{Type|TokenSet}})<br/>2. Token set, storing URIs ({{Type|TokenSet}})<br/>3. NSNode:<br/>3.1. pre value ({{Type|Num}})<br/>3.2. References to prefix/URI pairs ({{Type|Nums}})<br/>3.3. Number of children ({{Type|Num}})<br/>3.4. Recursive generation of child nodes (→ 3) | | 1. Token set, storing prefixes ({{Type|TokenSet}})<br/>2. Token set, storing URIs ({{Type|TokenSet}})<br/>3. NSNode:<br/>3.1. pre value ({{Type|Num}})<br/>3.2. References to prefix/URI pairs ({{Type|Nums}})<br/>3.3. Number of children ({{Type|Num}})<br/>3.4. Recursive generation of child nodes (→ 3) | ||
− | | valign= | + | |- valign="top" |
− | + | | '''2 d) Document Index''' | |
− | |||
| Array of integers, representing the distances between all document pre values ({{Type|Nums}}) | | Array of integers, representing the distances between all document pre values ({{Type|Nums}}) | ||
− | |||
|} | |} | ||
− | ==Node Table: {{ | + | ==Node Table: {{Code|tbl}}, {{Code|tbli}}== |
− | * {{ | + | * {{Code|tbl}}: Main database table, stored in blocks. |
− | * {{ | + | * {{Code|tbli}}: Database directory, organizing the database blocks. |
− | Some more information on the [[Node | + | Some more information on the [[Node Storage|node storage]] is available. |
− | ==Texts: {{ | + | ==Texts: {{Code|txt}}, {{Code|atv}}== |
− | * {{ | + | * {{Code|txt}}: Heap file for text values (document names, string values of texts, comments and processing instructions) |
− | * {{ | + | * {{Code|atv}}: Heap file for attribute values. |
− | ==Value Indexes: {{ | + | ==Value Indexes: {{Code|txtl}}, {{Code|txtr}}, {{Code|atvl}}, {{Code|atvr}}== |
'''Text Index:''' | '''Text Index:''' | ||
− | * {{ | + | * {{Code|txtl}}: Heap file with ID lists. |
− | * {{ | + | * {{Code|txtr}}: Index file with references to ID lists. |
− | The '''Attribute Index''' is contained in the files {{ | + | The '''Attribute Index''' is contained in the files {{Code|atvl}} and {{Code|atvr}}, the '''Token Index''' in {{Code|tokl}} and {{Code|tokr}}. All have the same layout. |
− | + | For a more detailed discussion and examples of these file formats please see [[Index File Structure]]. | |
− | + | ==Document Path Index: {{Code|pth}}== | |
− | + | Provides an index of all the document paths in the database. For databases with a large number of paths this file can be quite large so it is only generated the first time a function requesting a path lookup is run. For databases where path lookups are never used this file will not exist. | |
− | ... | + | '''Note:''' On Windows/Mac systems this file is case insensitive (all paths are lower case). On UNIX-like systems this file is case sensitive. The behaviour of path look ups will vary between systems. Copying this file between system types may lead to unexpected behaviour. |
+ | |||
+ | ==ID/Pre Mapping: {{Code|idp}}== | ||
+ | |||
+ | This file is only created if incremental indexing (UPDINDEX) is enabled for a database. It is used to provide a quick look up of the pre value for a database node id. | ||
+ | |||
+ | ==Full-Text Fuzzy Index: {{Code|ftxx}}, {{Code|ftxy}}, {{Code|ftxz}}== | ||
+ | |||
+ | ...may soon be reimplemented. |
Latest revision as of 10:17, 9 March 2023
This article is part of the Advanced User's Guide. It presents some low-level details on how data is stored in the database files.
Contents
Data Types[edit]
The following data types are used for specifying the storage layout:
Type | Description | Example (native → hex integers) |
---|---|---|
Num
|
Compressed integer (1-5 bytes), specified in Num.java | 15 → 0F ; 511 → 41 FF |
Token
|
Length (Num ) and bytes of UTF8 byte representation
|
Hello → 05 48 65 6c 6c 6f
|
Double
|
Number, stored as token | 123 → 03 31 32 33
|
Boolean
|
Boolean (1 byte, 00 or 01 )
|
true → 01
|
Nums , Tokens , Doubles
|
Arrays of values, introduced with the number of entries | 1,2 → 02 01 31 01 32
|
TokenSet
|
Key array (Tokens ), next/bucket/size arrays (3x Nums )
|
Database Files[edit]
The following tables illustrate the layout of the BaseX database files. All files are suffixed with .basex
.
Metadata, Name/Path/Doc Indexes: inf
[edit]
Description | Format |
---|---|
1. Metadata | 1. Key/value pairs, in no particular order (Token /Token ):• Examples: FNAME , TIME , SIZE , ...• PERM → Number of users (Num ), and name/password/permission values for each user (Token /Token /Num )2. Empty key as finalizer |
2. Main memory indexes | 1. Key/value pairs, in no particular order (Token /Token ):• TAGS → Element Name Index• ATTS → Attribute Name Index• PATH → Path Index• NS → Namespaces• DOCS → Document Index2. Empty key as finalizer |
2 a) Name Index Element/attribute names |
1. Token set, storing all names (TokenSet )2. One StatsKey instance per entry: 2.1. Content kind ( Num ):2.1.1. Number: min/max ( Doubles )2.1.2. Category: number of entries ( Num ), entries (Tokens )2.2. Number of entries ( Num )2.3. Leaf flag ( Boolean )2.4. Maximum text length ( Double ; legacy, could be Num )
|
2 b) Path Index | 1. Flag for path definition (Boolean , always true ; legacy)2. PathNode: 2.1. Name reference ( Num )2.2. Node kind ( Num )2.3. Number of occurrences ( Num )2.4. Number of children ( Num )2.5. Double ; legacy, can be reused or discarded2.6. Recursive generation of child nodes (→ 2) |
2 c) Namespaces | 1. Token set, storing prefixes (TokenSet )2. Token set, storing URIs ( TokenSet )3. NSNode: 3.1. pre value ( Num )3.2. References to prefix/URI pairs ( Nums )3.3. Number of children ( Num )3.4. Recursive generation of child nodes (→ 3) |
2 d) Document Index | Array of integers, representing the distances between all document pre values (Nums )
|
Node Table: tbl
, tbli
[edit]
tbl
: Main database table, stored in blocks.tbli
: Database directory, organizing the database blocks.
Some more information on the node storage is available.
Texts: txt
, atv
[edit]
txt
: Heap file for text values (document names, string values of texts, comments and processing instructions)atv
: Heap file for attribute values.
Value Indexes: txtl
, txtr
, atvl
, atvr
[edit]
Text Index:
txtl
: Heap file with ID lists.txtr
: Index file with references to ID lists.
The Attribute Index is contained in the files atvl
and atvr
, the Token Index in tokl
and tokr
. All have the same layout.
For a more detailed discussion and examples of these file formats please see Index File Structure.
Document Path Index: pth
[edit]
Provides an index of all the document paths in the database. For databases with a large number of paths this file can be quite large so it is only generated the first time a function requesting a path lookup is run. For databases where path lookups are never used this file will not exist.
Note: On Windows/Mac systems this file is case insensitive (all paths are lower case). On UNIX-like systems this file is case sensitive. The behaviour of path look ups will vary between systems. Copying this file between system types may lead to unexpected behaviour.
ID/Pre Mapping: idp
[edit]
This file is only created if incremental indexing (UPDINDEX) is enabled for a database. It is used to provide a quick look up of the pre value for a database node id.
Full-Text Fuzzy Index: ftxx
, ftxy
, ftxz
[edit]
...may soon be reimplemented.