Catalog Resolver

From BaseX Documentation
Latest revision as of 17:38, 1 December 2023

This article is part of the Advanced User's Guide. It explains how to map system IDs (DTD locations) and URIs to local resources when parsing and transforming XML data:

  • The Java 11: XML Catalog API is used to resolve references to external resources.
  • As an alternative, Norman Walsh’s Enhanced XML Resolver is used if it is found on the classpath.
  • The Apache-maintained XML Commons Resolver has become obsolete.
  • If enabled, a catalog is universally applied for resolving:
    • entities (when parsing XML documents);
    • URIs (for documents, module imports, XSL transformations);
    • resources (when validating documents).

Introduction

XML documents often rely on Document Type Definitions (DTDs), and the entities used in a document are resolved against its DTD. By default, the DTD is only used for entity resolution.

XHTML, for example, defines its doctype via the following line:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 

Fetching xhtml1-strict.dtd from the W3C’s server obviously involves network traffic. When dealing with single files, this may seem tolerable, but importing large collections benefits from caching these resources. Depending on the remote server, you will experience significant speed improvements when caching DTDs locally.

To address these issues, the XML Catalogs Standard defines an entity catalog that maps both external identifiers and arbitrary URI references to URI references.

Another application for XML catalogs is to provide local resources for reusable XSLT stylesheet libraries that are imported from a canonical location. This is described in greater detail in the following section.

Usage

System ID (DTD Location) Rewrites

To enable entity resolving, you have to provide a valid XML Catalog file so that the parser knows where to look for mirrored DTDs.

A simple working example for XHTML might look like this:

<catalog prefer="system" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteSystem systemIdStartString="http://www.w3.org/TR/xhtml1/DTD/" rewritePrefix="file:///path/to/dtds/" />
</catalog>

This rewrites every system ID that starts with http://www.w3.org/TR/xhtml1/DTD/ so that it starts with file:///path/to/dtds/ instead. For example, consider the following XML file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"/>

When this file is parsed, the XHTML DTD xhtml1-transitional.dtd and all its linked resources are loaded from the specified path.

The catalog file etc/w3-catalog.xml in the full distributions can be used out of the box. It defines rewrites for some common W3C DTD files.
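
When only a few DTDs are mirrored, exact mappings can be used instead of a prefix rewrite. The following sketch uses the system and public entries defined by the XML Catalogs standard. The path file:///path/to/dtds/ is the same placeholder as above, and it is assumed that the DTD has been copied there:

```xml
<catalog prefer="system" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Exact match on the system identifier given in the doctype declaration -->
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="file:///path/to/dtds/xhtml1-strict.dtd"/>
  <!-- Match on the public identifier; with prefer="system", this entry is
       only consulted if the document supplies no system identifier -->
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="file:///path/to/dtds/xhtml1-strict.dtd"/>
</catalog>
```

A prefix rewrite is usually more convenient when a whole directory of DTDs is mirrored, because no entry per file is needed.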

URI Rewrites

Consider a library of reusable XSLT stylesheets. For performance reasons, this library will be cached locally. However, the import URI for a given stylesheet should always be the same, regardless of the relative or absolute path at which it happens to be stored locally. Example:

<xsl:import href="http://acme.com/xsltlib/acme2html/1.0/acme2html.xsl"/>

The XSLT stylesheet might not even be available from this location. The URI serves as a canonical location identifier for this XSLT stylesheet. A local copy of the acme2html/1.0/ directory is expected to reside somewhere, and the location of this directory relative to the local XML catalog file is specified in an entry in this catalog, like this:

<rewriteURI uriStartString="http://acme.com/xsltlib/acme2html/1.0/" rewritePrefix="../acmehtml10/"/>

This way, XSLT import URIs don’t have to be adjusted for the relative or absolute locations of the XSLT library’s local copy.
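
Put together, a complete catalog file containing the entry above might look as follows. The uri entry is an additional, hypothetical example of an exact mapping for a single resource; its name and target path are assumptions and not part of the setup described here:

```xml
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Prefix rewrite: everything below the canonical directory is mapped
       to a directory relative to this catalog file -->
  <rewriteURI uriStartString="http://acme.com/xsltlib/acme2html/1.0/"
              rewritePrefix="../acmehtml10/"/>
  <!-- Hypothetical exact mapping of one canonical URI to one local file -->
  <uri name="http://acme.com/xsltlib/config/settings.xml"
       uri="../acmeconfig/settings.xml"/>
</catalog>
```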

The same URI rewriting works for resources retrieved by the doc() function from within an XSLT stylesheet. See XSLT Module for details on how to invoke XSLT stylesheets from within BaseX.

NOTE: This URI rewriting is currently restricted to XSLT stylesheets. It has not yet been enabled for the XQuery function doc() or for XSD schema locations. It requires Norman Walsh’s resolver mentioned above; the Java resolver apparently does not work for XSLT includes/imports in stylesheets run by Saxon using xslt:transform() et al. See the note below about setting the org.basex.catalog property for Norman Walsh’s xmlresolver.org resolver.

GUI Mode

When running BaseX in GUI mode, enable DTD parsing and provide the path to your XML Catalog file in the Parsing Tab of the Database Creation Dialog.

Console & Server Mode

To enable entity resolving in console mode, enable the DTD option and assign the path to your XML catalog file to the CATALOG option. All subsequent commands for adding documents will use the specified catalog file to resolve entities.

Paths to your catalog file and the actual DTDs are either absolute or relative to the current working directory. When using BaseX in client-server mode, they are resolved against the working directory of the server.
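
As a sketch, both options could be set in a BaseX command script written in the XML command syntax before a database is created. The database name xhtml-docs and the input path are hypothetical; etc/w3-catalog.xml is the catalog shipped with the full distributions, and the path assumes that BaseX is started from its home directory:

```xml
<commands>
  <!-- Parse DTDs so that entities can be resolved -->
  <set option='dtd'>true</set>
  <!-- Catalog path, relative to the current working directory -->
  <set option='catalog'>etc/w3-catalog.xml</set>
  <!-- Hypothetical database name and input directory -->
  <create-db name='xhtml-docs'>path/to/xhtml/</create-db>
</commands>
```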

Additional Notes

Entity resolving only works if the internal XML parser is switched off (which is the default case).

By default, an error is raised if the catalog resolution fails. The runtime properties of the catalog resolver can be changed by setting system properties, either on startup…

java -Djavax.xml.catalog.resolve=continue ... org.basex.BaseX

…or via XQuery:

Q{java:System}setProperty('javax.xml.catalog.resolve', 'continue'),
...

See Java 11: XML Catalog API for more information.

When using a catalog within an XQuery Module, the global db:catalog option may not be set in this module. You can set it via pragma instead:

(# db:catalog xmlcatalog/catalog.xml #) {
  xslt:transform(db:get('acme_content')[1], '../acmecustom/acmehtml.xsl')
}

This example assumes that the stylesheet ../acmecustom/acmehtml.xsl (location relative to the current XQuery script or module) imports acme2html/1.0/acme2html.xsl by its canonical URI, which the catalog resolver maps to a local URI.
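
A minimal sketch of what such a customization stylesheet might contain is shown below. Only the import URI is taken from this article; the overriding template and the element name title are hypothetical:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Canonical URI; the catalog rewrites it to the local copy -->
  <xsl:import href="http://acme.com/xsltlib/acme2html/1.0/acme2html.xsl"/>
  <!-- Hypothetical customization that overrides an imported template -->
  <xsl:template match="title">
    <h1 class="custom"><xsl:apply-templates/></h1>
  </xsl:template>
</xsl:stylesheet>
```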

Please note that catalog-based URI rewriting does not yet work for URIs accessed from XQuery. For this reason, you cannot pass a canonical location that needs to be catalog-resolved as the second argument of xslt:transform().

The catalog location in the pragma can be given relative to the current working directory (the directory that is returned by file:current-dir()) or as an absolute operating system path. The catalog location in the pragma is not an XQuery expression; no concatenation or other operations may occur in the pragma, and the location string must not be surrounded by quotes.

As mentioned above, the Java resolver doesn’t work for XSLT include or import URIs when executing XSLT with Saxon. You can use Norman Walsh’s xmlresolver.org resolver instead. The catalog location may be given as a path relative to the current working directory, as described above. If you want to specify it as an absolute path, this resolver expects a file URI, not a file system path. The following example sets it on Windows before executing one of the .bat scripts, such as bin\basexgui.bat:

set BASEX_JVM="-Dorg.basex.catalog=file:///c:/Users/Jane/path/to/catalog.xml"

Changelog

Version 10.0
  • Updated: CATFILE option renamed to CATALOG.