Difference between revisions of "Catalog Resolver"

From BaseX Documentation

Revision as of 12:51, 5 March 2019

This article is part of the Advanced User's Guide. It clarifies how to deal with external DTD declarations when parsing and transforming XML data.

Introduction

XML documents often rely on Document Type Definitions (DTDs). While parsing a document with BaseX, entities can be resolved with respect to that particular DTD. By default, the DTD is only used for entity resolution.

XHTML, for example, defines its doctype via the following line:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 

Fetching xhtml1-strict.dtd from the W3C server involves network traffic. For a single file this may be tolerable, but imports of large collections benefit from caching these resources. Depending on the remote server, caching DTDs locally can speed up parsing significantly.

Usage

BaseX relies on the Apache-maintained XML Commons Resolver. The xml-resolver-1.2.jar library is included in the full distributions of BaseX. If the resolver is not found in the classpath, and if Java 8 is used, Java’s built-in resolver will be applied (via com.sun.org.apache.xml.internal.resolver.*).

To enable entity resolution, you have to provide a valid XML Catalog file that tells the parser where to look for mirrored DTDs.

A simple working example for XHTML might look like this:

<catalog prefer="system" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteSystem systemIdStartString="http://www.w3.org/TR/xhtml1/DTD/" rewritePrefix="file:///path/to/dtds/" />
</catalog>

In every system identifier that starts with http://www.w3.org/TR/xhtml1/DTD/, this rule replaces that prefix with file:///path/to/dtds/. For example, consider parsing the following XML file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"/>

The XHTML DTD xhtml1-transitional.dtd and all its linked resources will now be loaded from the specified path, i.e. the DTD itself is read from file:///path/to/dtds/xhtml1-transitional.dtd.

The catalog file etc/w3-catalog.xml in the full distributions can be used out of the box. It defines rewrite rules for some common W3C DTD files.
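
A catalog can hold several rewrite rules. The following sketch is not the actual content of etc/w3-catalog.xml; it merely adds a second rule for the SVG 1.1 DTDs to the XHTML rule. The SVG prefix and both local directories are assumptions chosen for illustration:

```xml
<catalog prefer="system" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- XHTML 1.0 DTDs (local mirror directory is a placeholder) -->
  <rewriteSystem systemIdStartString="http://www.w3.org/TR/xhtml1/DTD/"
                 rewritePrefix="file:///path/to/dtds/xhtml1/"/>
  <!-- SVG 1.1 DTDs (assumed prefix and placeholder directory) -->
  <rewriteSystem systemIdStartString="http://www.w3.org/Graphics/SVG/1.1/DTD/"
                 rewritePrefix="file:///path/to/dtds/svg11/"/>
</catalog>
```

Each local directory has to contain copies of the DTDs and of all files they reference, since those linked resources are looked up in the rewritten location as well.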

GUI Mode

When running BaseX in GUI mode, provide the path to your XML Catalog file in the Parsing tab of the database creation dialog.

Console & Server Mode

To enable entity resolution in console mode, assign the path of a catalog file to the CATFILE option. All subsequent ADD commands will use the specified catalog file to resolve entities.

The paths to your catalog file and to the actual DTDs are either absolute or relative to the current working directory. When using BaseX in client/server mode, relative paths are resolved against the server's working directory.
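
The console steps might be sketched as a command script in BaseX's XML command syntax. The database name xhtml and the input path are hypothetical placeholders, and the catalog path assumes that BaseX is started from the directory of a full distribution:

```xml
<commands>
  <!-- Assign the catalog file (path relative to the working directory) -->
  <set option="CATFILE">etc/w3-catalog.xml</set>
  <!-- Create an empty database; the name is a placeholder -->
  <create-db name="xhtml"/>
  <!-- Documents added from here on are parsed with the catalog in effect -->
  <add>/path/to/input.xhtml</add>
</commands>
```

The equivalent string commands would be SET CATFILE etc/w3-catalog.xml, CREATE DB xhtml and ADD /path/to/input.xhtml.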

Additional Notes

Entity resolution via the catalog only works if the internal XML parser is switched off (which is the default). If you use the internal parser, you can manually specify whether DTDs and entities are to be parsed.
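
For the internal parser, a minimal sketch in BaseX's XML command syntax follows. It assumes the INTPARSE and DTD options as described in the BaseX options documentation; neither name is mentioned in this article, so they should be checked against the BaseX version in use:

```xml
<commands>
  <!-- Use the internal XML parser instead of the Java parser -->
  <set option="INTPARSE">true</set>
  <!-- Ask the parser to process referenced DTDs and resolve entities -->
  <set option="DTD">true</set>
</commands>
```

With the internal parser enabled, the catalog file is not consulted, in line with the note above.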

The runtime properties of the catalog resolver can be changed by setting system properties or by adding a CatalogManager.properties file to the classpath (see Controlling the Catalog Resolver for more information).

Links