Difference between revisions of "Catalog Resolver"

From BaseX Documentation
(39 intermediate revisions by 5 users not shown)
Line 1: Line 1:
==Overview==
+
This article is part of the [[Advanced User's Guide]]. It clarifies how to deal with external DTD declarations when parsing and transforming XML data.
XML documents often rely on Document Type Definitions (DTD).
 
While parsing a document with BaseX elements and entities can be checked for validity with respect to that particular DTD.
 
Currently the DTD is used only for entity resolution.
 
  
 +
==Introduction==
 +
 +
XML documents often rely on Document Type Definitions (DTDs). Entities can be resolved with respect to that particular DTD. By default, the DTD is only used for entity resolution.
 +
 +
XHTML, for example, defines its doctype via the following line:
  
XHTML for example defines its doctype via the following line:
 
 
<pre class="brush:xml">
 
<pre class="brush:xml">
 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">  
 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">  
 
</pre>
 
</pre>
  
Fetching the <code>xhtml1-strict.dtd</code> obviously involves network traffic. When dealing with single files this may seem tolerable, but  
+
Fetching <code>xhtml1-strict.dtd</code> obviously involves network traffic. When dealing with single files, this may seem tolerable, but importing large collections benefits from caching these resources. Depending on the remote server, you will experience significant speed improvements when caching DTDs locally.
importing large collections might benefit from caching these resources locally.  
+
 
Depending on your connection you will experience significant speed improvements.
+
To address these issues, the [https://www.oasis-open.org/committees/download.php/14809/xml-catalogs.html XML Catalogs Standard] defines an entity catalog that maps both external identifiers and arbitrary URI references to URI references.
  
== XML Entity and URI Resolvers in BaseX ==
+
==Usage==
BaseX comes with a default URI resolver that is usable out of the box.
+
 
 +
BaseX relies on the Apache-maintained [http://xml.apache.org/commons XML Commons Resolver]. The ''xml-resolver-1.2.jar'' library is included in the full distributions of BaseX. If the resolver is not found in the classpath, and if Java 8 is used, Java’s built-in resolver will be applied (via <code>com.sun.org.apache.xml.internal.resolver.*</code>).
 +
 
 +
To enable entity resolving you have to provide a valid XML Catalog file, so that the parser knows where to look for mirrored DTDs.
 +
 
 +
A simple working example for XHTML might look like this:
  
To enable entity resolving you have to provide a valid XML Catalog file.
 
A simple working example for  XHTML might look like this:
 
 
<pre class="brush:xml" start="0">
 
<pre class="brush:xml" start="0">
<?xml version="1.0"?>
 
 
<catalog prefer="system" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
 
<catalog prefer="system" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteSystem systemIdStartString="http://www.w3.org/TR/xhtml1/DTD/" rewritePrefix="file:///path/to/dtds/" />
+
  <rewriteSystem systemIdStartString="http://www.w3.org/TR/xhtml1/DTD/" rewritePrefix="file:///path/to/dtds/" />
 
</catalog>
 
</catalog>
 
</pre>
 
</pre>
This rewrites all SystemIds starting with: ''<nowiki>http://www.w3.org/TR/xhtml1/DTD/</nowiki>'' to ''file:///path/to/dtds/''.
 
  
The XHTML DTD <code>xhtml1-strict.dtd</code> and all its linked resources will now be loaded from the specified path.
+
This rewrites all systemIds starting with: <code><nowiki>http://www.w3.org/TR/xhtml1/DTD/</nowiki></code> to <code>file:///path/to/dtds/</code>. For example, if the following XML file is parsed:
 +
 
 +
<pre class="brush:xml" start="0">
 +
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 +
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 +
<html xmlns="http://www.w3.org/1999/xhtml"/>
 +
</pre>
 +
 
 +
The XHTML DTD <code>xhtml1-transitional.dtd</code> and all its linked resources will now be loaded from the specified path.
 +
 
 +
The catalog file ''etc/w3-catalog.xml'' in the full distributions can be used out of the box. It defines rewriting for some common W3 DTD files.
 +
 
 
===GUI Mode===
 
===GUI Mode===
[[File:catalog-file.jpg|thumb|Location for the Catalog File]]
+
 
When running BaseX in GUI mode simply provide the path to your XML Catalog file in the '''Parsing'''-Tab of the Database Creation Dialog.
+
When running BaseX in GUI mode, enable DTD parsing and provide the path to your XML Catalog file in the ''Parsing'' Tab of the Database Creation Dialog.
  
 
===Console & Server Mode===
 
===Console & Server Mode===
To enable Entity Resolving in Console Mode specify the following [[options]]:
 
* <code>SET CATFILE [path]</code>
 
Now entity resolving is active for the current session. All subsequent <code>ADD</code> commands will use the catalog file to resolve entities.
 
  
The path to your catalog file and the actual entities may be either absolute or are relative to the current working directory.  
+
To enable Entity Resolving in Console Mode, enable the {{Option|DTD}} option and assign the path to your XML catalog file to the {{Option|CATFILE}} option. All subsequent commands for adding documents will use the specified catalog file to resolve entities.
  
'''Please note''' that entity resolving only works with option: <code>SET INTPARSE false</code>. <code>INTPARSE</code> is set to ''false'' by ''default''.
+
Paths to your catalog file and the actual DTDs are either absolute or relative to the ''current working directory''. When using BaseX in client-server mode, they are resolved against the working directory of the ''server''.
  
Using the internal parser let's you specify manually whether you want to parse DTDs and entities or not.
+
===Additional Notes===
  
== Using other Resolvers ==
+
Entity resolving only works if the [[Parsers#XML Parsers|internal XML parser]] is switched off (which is the default case).
There might be some cases when you do not want to use the built-in resolver that Java provides by default (via <code>com.sun.org.apache.xml.internal.resolver.*</code>).
 
  
BaseX offers support for the Apache maintained [http://xml.apache.org/commons XML Commons Resolver] available for download [http://xerces.apache.org/mirrors.cgi here].
+
The runtime properties of the catalog resolver can be changed by setting system properties, or adding a ''CatalogManager.properties'' file to the classpath. By default, and if the system property {{Code|xml.catalog.ignoreMissing}} is not assigned, no warnings will be output to standard error if the properties file or resources linked from that file are not found. See [https://xerces.apache.org/xml-commons/components/resolver/resolver-article.html#ctrlresolver Controlling the Catalog Resolver] for more information.
  
To use it add '''resolver.jar''' to the classpath when [[Startup|starting BaseX]]:
+
==Links==
<pre class="brush:bash">
 
java -cp basex.jar:resolver.jar org.basex.BaseXServer
 
</pre>
 
  
== More Information ==
+
* [https://www.oasis-open.org/committees/download.php/14809/xml-catalogs.html XML Catalogs. OASIS Standard, Version 1.1. 07-October-2005]
*[http://xml.apache.org/commons/components/resolver/resolver-article.html Apache XML Commons Article on Entity Resolving]
+
* [http://en.wikipedia.org/wiki/Document_Type_Definition Wikipedia on Document Type Definitions]
*[http://java.sun.com/webservices/docs/1.6/jaxb/catalog.html XML Entity and URI Resolvers], Sun
+
* [http://xml.apache.org/commons/components/resolver/resolver-article.html Apache XML Commons Article on Entity Resolving]
*[http://www.oasis-open.org/committees/download.php/14810/xml-catalogs.pdf XML Catalogs. OASIS Standard, Version 1.1. 07-October-2005.]
+
* [http://java.sun.com/webservices/docs/1.6/jaxb/catalog.html XML Entity and URI Resolvers], Sun

Revision as of 11:53, 13 March 2019

This article is part of the Advanced User's Guide. It clarifies how to deal with external DTD declarations when parsing and transforming XML data.

Introduction

XML documents often rely on Document Type Definitions (DTDs). Entities in a document are resolved against the DTD that the document declares. By default, the DTD is used only for entity resolution.

XHTML, for example, defines its doctype via the following line:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 

Fetching xhtml1-strict.dtd involves network traffic. For a single file this may be tolerable, but when importing large collections it pays to cache these resources locally. Depending on the remote server, caching DTDs locally can speed up parsing significantly.

To address these issues, the XML Catalogs Standard defines an entity catalog that maps both external identifiers and arbitrary URI references to URI references.

Usage

BaseX relies on the Apache-maintained XML Commons Resolver. The xml-resolver-1.2.jar library is included in the full distributions of BaseX. If the resolver is not found in the classpath, and if Java 8 is used, Java’s built-in resolver will be applied (via com.sun.org.apache.xml.internal.resolver.*).

To enable entity resolving, you have to provide a valid XML Catalog file so that the parser knows where to look for mirrored DTDs.

A simple working example for XHTML might look like this:

<catalog prefer="system" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteSystem systemIdStartString="http://www.w3.org/TR/xhtml1/DTD/" rewritePrefix="file:///path/to/dtds/" />
</catalog>

This rewrites every system identifier that starts with http://www.w3.org/TR/xhtml1/DTD/ so that it starts with file:///path/to/dtds/ instead. For example, if the following XML file is parsed:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"/>

In this case, the XHTML DTD xhtml1-transitional.dtd and all its linked resources are loaded from the specified path.
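As a complement to the prefix rewrite above, the XML Catalogs standard also allows single identifiers to be mapped directly. The following is a minimal sketch, not taken from the BaseX distribution: it assumes that copies of the DTDs exist under the placeholder directory file:///path/to/dtds/, and it uses prefer="public" so that the public entry is still consulted when a document also supplies a system identifier.

```xml
<?xml version="1.0"?>
<!-- Sketch only: file:///path/to/dtds/ is a placeholder for a local mirror directory -->
<catalog prefer="public" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Exact match on the system identifier of a single DTD -->
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="file:///path/to/dtds/xhtml1-strict.dtd"/>
  <!-- Match on the public identifier, tried if no system entry matches -->
  <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
          uri="file:///path/to/dtds/xhtml1-transitional.dtd"/>
</catalog>
```

With prefer="system", as in the example further up, public entries are ignored whenever the document supplies a system identifier, so a rewriteSystem rule is usually the more compact choice when a whole directory of DTDs is mirrored.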

The catalog file etc/w3-catalog.xml in the full distributions can be used out of the box. It defines rewriting for some common W3 DTD files.

GUI Mode

When running BaseX in GUI mode, enable DTD parsing and provide the path to your XML Catalog file in the Parsing tab of the database creation dialog.

Console & Server Mode

To enable entity resolving in console mode, turn on the DTD option and assign the path to your XML catalog file to the CATFILE option. All subsequent commands that add documents will use the specified catalog file to resolve entities.

Paths to your catalog file and the actual DTDs are either absolute or relative to the current working directory. When using BaseX in client-server mode, they are resolved against the working directory of the server.
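The two options can also be set from a command script. The sketch below assumes the XML command syntax of BaseX. The database name xhtml and the input path /path/to/xhtml/ are hypothetical placeholders, and etc/w3-catalog.xml refers to the catalog file shipped with the full distributions.

```xml
<commands>
  <!-- Parse DTDs so that entities can be resolved -->
  <set option='DTD'>true</set>
  <!-- Catalog path, resolved against the current working directory -->
  <set option='CATFILE'>etc/w3-catalog.xml</set>
  <!-- Hypothetical database name and input path -->
  <create-db name='xhtml'>/path/to/xhtml/</create-db>
</commands>
```

Because relative paths are resolved against the working directory (that of the server in client-server mode), an absolute catalog path may be the safer choice when the script is run from different locations.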

Additional Notes

Entity resolving only works if the internal XML parser is switched off (which is the default case).

The runtime properties of the catalog resolver can be changed by setting system properties, or adding a CatalogManager.properties file to the classpath. By default, and if the system property xml.catalog.ignoreMissing is not assigned, no warnings will be output to standard error if the properties file or resources linked from that file are not found. See Controlling the Catalog Resolver for more information.

Links