XQSync

To get started using XQSync, try the tutorial. For the source code, see the project page on GitHub.

Running XQSync

XQSync is a Java command-line tool. The entry point is the main method in the com.marklogic.ps.xqsync.XQSync class. This class takes zero or more property files as its arguments. Any specified system properties will override file-based properties, and properties found in later files may override properties specified in earlier files on the command line. See src/xqsync.sh for a sample shell script.
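
The precedence rules above can be illustrated with java.util.Properties. This is only a sketch of the described behavior, not XQSync's actual loading code: the class PropertyPrecedenceSketch and its merge method are hypothetical, file contents are passed as strings to keep the example self-contained, and the overrides argument stands in for system properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of XQSync: illustrates the precedence rules only.
final class PropertyPrecedenceSketch {

    // Loads each source in command-line order, so a key in a later source
    // replaces the same key from an earlier one. The overrides (standing in
    // for system properties) are applied last, so they always win.
    static Properties merge(List<String> fileContents, Properties overrides) {
        Properties merged = new Properties();
        for (String content : fileContents) {
            try (StringReader reader = new StringReader(content)) {
                merged.load(reader);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        merged.putAll(overrides);
        return merged;
    }

    public static void main(String[] args) {
        Properties overrides = new Properties();
        overrides.setProperty("THREADS", "8");
        Properties merged = merge(
            Arrays.asList("THREADS=2\nCOPY_QUALITY=true\n", "COPY_QUALITY=false\n"),
            overrides);
        System.out.println("THREADS=" + merged.getProperty("THREADS"));
        System.out.println("COPY_QUALITY=" + merged.getProperty("COPY_QUALITY"));
    }
}
```

In this sketch THREADS resolves to 8 (the override wins over the first file) and COPY_QUALITY resolves to false (the second file wins over the first).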

Note: XQSync needs a lot of heap space for large synchronization tasks. Be prepared to increase the Java VM heap space limit, using -Xmx. Depending on the version of Java used, -Xincgc may also help.

Required libraries:

You can also download older versions of xqsync.jar.

Required properties:

Note that these requirements can be overridden by a subclass of com.marklogic.ps.xqsync.Configuration. See Customization for details.

Available properties:

Property  default value  notes
ALLOW_EMPTY_METADATA false If true, missing metadata files in INPUT_PACKAGE will be ignored.
CONFIGURATION_CLASSNAME null If non-null, this class will be used instead of com.marklogic.ps.xqsync.Configuration. For details, see Customization.
COPY_COLLECTIONS true If true, all document collections are copied.
COPY_PERMISSIONS true If true, all document permissions are copied.
COPY_PROPERTIES true If true, document properties are copied. When targeting an output connection that has CPF enabled, it is a good idea to disable this setting.
COPY_QUALITY true If true, document quality is copied.
DELETE_COLLECTION false In combination with INPUT_COLLECTION_URI and OUTPUT_CONNECTION_STRING, delete the INPUT_COLLECTION_URI on the OUTPUT_CONNECTION_STRING, before beginning synchronization.
FATAL_ERRORS true If true, all exceptions are fatal. If false, exceptions will still be logged, but in most cases XQSync will proceed.
INPUT_BATCH_SIZE 1 Process documents in batches of N documents. When exporting many small documents from an input database, increasing this setting can improve performance. Note that the right setting will vary according to document size: if the batch size is too large, poor performance or errors may result. Imports are also done in batches of N documents.
INPUT_CONNECTION_STRING null Input documents will come from this XCC connection. By default, every document in the input database will be transferred. To change this behavior, use exactly one of the related properties:
  • INPUT_DOCUMENT_URIS
  • INPUT_COLLECTION_URI
  • INPUT_DIRECTORY_URI
  • INPUT_QUERY
NB - if more than one key is specified, XQSync will use the first one it finds, using the order above.
NB - to list all input documents, or to list a collection or a directory, XQSync uses cts:uris(). If the document URI lexicon is not available, it will fall back to a slower technique.
If the connection string uses the xccs:// scheme, XQSync will attempt to use SSL for server communications. This requires MarkLogic Server 4.1 or later.
The username, password, and hostname in this connection string must not include non-URI characters. If necessary, use percent-encoding as specified in RFC 3986.
The list of URIs is stored in a temporary file by default. The location of the file can be specified directly (see the URI_QUEUE_FILE property) or by providing a temporary directory (see the TMP_DIR property). If neither property is specified, XQSync uses the platform's default temporary directory.
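
The percent-encoding requirement for INPUT_CONNECTION_STRING can be handled when the connection string is assembled. The following is a minimal sketch, not XQSync code: the class ConnectionStringSketch is hypothetical, the scheme://user:password@host:port layout is assumed from the usual XCC URI form, and the host, port, and credentials are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of XQSync.
final class ConnectionStringSketch {

    // Percent-encodes one URI component, such as a username or password.
    // URLEncoder produces HTML form encoding, so two adjustments are made:
    // '+' (its encoding of a space) becomes %20, and '*' becomes %2A.
    static String encodeComponent(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8")
                .replace("+", "%20")
                .replace("*", "%2A");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Assembles scheme://user:password@host:port (assumed layout).
    static String build(String scheme, String user, String password, String host, int port) {
        return scheme + "://" + encodeComponent(user) + ":" + encodeComponent(password)
            + "@" + host + ":" + port;
    }

    public static void main(String[] args) {
        // Made-up credentials containing characters that need encoding.
        System.out.println(build("xcc", "admin", "p@ss:w/rd 1", "localhost", 9000));
    }
}
```

With these made-up values the sketch prints xcc://admin:p%40ss%3Aw%2Frd%201@localhost:9000, which could then be used as the value of INPUT_CONNECTION_STRING or OUTPUT_CONNECTION_STRING.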
INPUT_COLLECTION_URI null In combination with INPUT_CONNECTION_STRING, all documents in the named collection(s) will be transferred. If whitespace is present, INPUT_COLLECTION_URI will be treated as a whitespace-delimited sequence; e.g., INPUT_COLLECTION_URI=a b would transfer all documents in either collection a or collection b.
INPUT_DIRECTORY_URI null In combination with INPUT_CONNECTION_STRING, all documents in the named directory will be transferred. If whitespace is present, INPUT_DIRECTORY_URI will be treated as a whitespace-delimited sequence; e.g., INPUT_DIRECTORY_URI=a/ b/ would transfer all documents whose URIs begin with a/ or b/.
INPUT_DOCUMENT_URIS null In combination with INPUT_CONNECTION_STRING, all documents named by the (whitespace-delimited) uris will be transferred.
INPUT_MODULE_URI null

If this property and INPUT_CONNECTION_STRING are both set, the named module will be used to process each document as it is read from the input database. The module must define a module variable $URI as xs:string external. The module must return a document-node.

Here is a simple example module, which recursively transforms the input document to lower-case all element names. Note that the first call to local:lc() will always receive a document-node (unless the document $URI does not exist), and so the entire module will always return a document-node.

xquery version "1.0-ml" ;

declare variable $URI as xs:string external ;

declare function local:lc($list as node()*)
 as node()*
{
  for $n in $list
  return typeswitch($n)
  case document-node() return document {
    local:lc($n/node()) }
  case element() return element { QName(namespace-uri($n), lower-case(name($n))) } {
    $n/@*,
    local:lc($n/node())
  }
  default return $n
} ;

local:lc(doc($URI))
      
INPUT_PACKAGE null Input documents will come from this zip file path. If the path is a directory, any "*.zip" children will be used.
INPUT_QUERY null

In combination with INPUT_CONNECTION_STRING, all uris returned by the query will be transferred. This sample query would transfer the first 100 documents, in document order:
for $i in doc()[1 to 100] return xdmp:node-uri($i)
If the document URI lexicon is enabled, this could be written as:
cts:uris('', 'document')[1 to 100]
to transfer the first 100 documents, sorted by document URI.

If the query contains any repeated semicolons (";;"), it will be split into multiple queries and run separately. This permits faster start-up with complex queries.

INPUT_QUERY_CACHABLE false In combination with INPUT_CONNECTION_STRING, the query which fetches the input document URIs will instruct XCC to cache or to stream the URIs. If set to true, no documents will sync until all URIs have been fetched. This is usually undesirable, so false is the default.
INPUT_QUERY_BUFFER_BYTES 0 In combination with INPUT_CONNECTION_STRING, the query which fetches the input document URIs will use this buffer size. The value 0 will cause XCC to use its default size.
INPUT_START_POSITION null Use the numeric value of this property as the starting position for the sequence of input documents.
INPUT_TIMESTAMP null If not null, and INPUT_CONNECTION_STRING is set, then all input queries will use this timestamp. The special value #AUTO will cause the first request timestamp to be used for the entire synchronization.
INPUT_RESULT_BUFFER_SIZE 0 In combination with INPUT_CONNECTION_STRING, the query which fetches each input document and its metadata will use this buffer size. The value 0 will cause XCC to use its default size.
INPUT_INDENTED true By default, MarkLogic Server pretty-indents XML elements when documents are retrieved. To turn this off, set INPUT_INDENTED to false. This is equivalent to placing declare boundary-space preserve; declare option xdmp:output "indent=no"; in the prolog of an XQuery module.
LOG_FORMATTER null By default, XQSync logs each entry on a single line. Set this property to SimpleFormatter to restore XQSync's original two-line logging format.
LOG_LEVEL INFO The java.util.logging.Level at which to log.
LOG_HANDLER CONSOLE,FILE The java.util.logging handlers with which to log.
USE_MULTI_STMT_TXN false If true, batch inserts will use MarkLogic 5.0's multi-statement transaction feature. If false, batch inserts will use XCC's insertContent([]) API.
OUTPUT_CONNECTION_STRING null Documents will be written to this XCC connection.
If the connection string uses the xccs:// scheme, XQSync will attempt to use SSL for server communications. This requires MarkLogic Server 4.1 or later.
The username, password, and hostname in this connection string must not include non-URI characters. If necessary, use percent-encoding as specified in RFC 3986.
If the property includes multiple, whitespace-delimited connection strings, XQSync will round-robin documents across the available connections.
OUTPUT_COLLECTIONS null Output documents will be added to one or more collection URIs. Collection URIs may be delimited by whitespace, commas, or colons.
OUTPUT_FILTER_FORMATS null The specified list of document types will not be copied to output.
Example: OUTPUT_FILTER_FORMATS=binary()
Example: OUTPUT_FILTER_FORMATS=text(),xml
OUTPUT_FORESTS null Permitted output forest names.
OUTPUT_PACKAGE null Output documents will be written to this zip file path.
PRINT_CURRENT_RATE false By default, XQSync prints the overall rate of document transfers every minute. If this property is set to true, XQSync also prints the current running rate (over the last minute).
QUEUE_SIZE 100,000 Maximum size of the synchronization queue, to limit memory consumption by XQSync. You may wish to use a smaller value if you encounter OutOfMemoryError, or a larger value if using many threads and loading very small documents. If you use a large value, you may also need something like -Xmx4096m to increase the Java heap size. Plan for roughly 1 GB per million queue entries (i.e., 1 kB per entry).
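
The QUEUE_SIZE sizing rule can be turned into a quick estimate. This sketch applies only the 1 kB-per-entry rule of thumb stated above; the class QueueMemorySketch is hypothetical, and the result covers the queue alone, not the documents being transferred or the rest of the JVM.

```java
// Hypothetical helper, not part of XQSync.
final class QueueMemorySketch {

    // Rule of thumb from the QUEUE_SIZE notes: roughly 1 kB of heap per entry.
    static final long BYTES_PER_ENTRY = 1024L;

    // Estimated heap, in bytes, consumed by a full queue.
    static long estimatedQueueBytes(long queueSize) {
        return queueSize * BYTES_PER_ENTRY;
    }

    // The same estimate rounded up to whole megabytes (2^20 bytes),
    // the unit used by JVM options such as -Xmx4096m.
    static long estimatedQueueMegabytes(long queueSize) {
        long bytes = estimatedQueueBytes(queueSize);
        return (bytes + (1L << 20) - 1) >> 20;
    }

    public static void main(String[] args) {
        System.out.println(estimatedQueueMegabytes(100000L) + " MB for the default queue");
        System.out.println(estimatedQueueMegabytes(1000000L) + " MB for one million entries");
    }
}
```

Under this rule the default queue of 100,000 entries accounts for about 98 MB, and a queue of one million entries for about 977 MB, so the heap limit set with -Xmx needs to leave room for that on top of everything else.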
REPAIR_MULTIPLE_DOCUMENTS_PER_URI false Normally not necessary, this property will cause XQSync to generate XQuery that prevents doc() from returning multiple documents per URI.
REPAIR_INPUT_XML false Should MarkLogic Server try to repair malformed input XML?
ROLES_EXECUTE null Names of any roles to attach to output documents as execute permissions.
ROLES_INSERT null Names of any roles to attach to output documents as insert permissions.
ROLES_READ null Names of any roles to attach to output documents as read permissions.
ROLES_UPDATE null Names of any roles to attach to output documents as update permissions.
SKIP_EXISTING false If true, documents that already exist in the database at OUTPUT_CONNECTION_STRING are not overwritten. This only affects operations when OUTPUT_CONNECTION_STRING is defined. If false, or if targeting an OUTPUT_PACKAGE, then all documents will be overwritten.
THREADS 1 Number of worker threads to spawn.
THROTTLE_BYTES_PER_SECOND 0 If non-zero, all threads will be throttled to the given number of bytes inserted per second.
THROTTLE_EVENTS_PER_SECOND 0 If non-zero, all threads will be throttled to the given number of inserts per second.
TMP_DIR null The temporary directory location. This is used for creating the temporary file that stores URIs. See also the INPUT_CONNECTION_STRING property.
URI_QUEUE_FILE null The temporary file location. This file is used for storing URIs. See also the INPUT_CONNECTION_STRING property. If both URI_QUEUE_FILE and TMP_DIR are specified, XQSync uses only URI_QUEUE_FILE.
URI_PREFIX null String to prepend to all output uris.
URI_PREFIX_STRIP null String to strip from the beginning of all output uris, when present in those uris.
URI_SUFFIX null String to append to all output uris.
URI_SUFFIX_STRIP null String to strip from the end of all output uris, when present in those uris.
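
The combined effect of URI_PREFIX, URI_PREFIX_STRIP, URI_SUFFIX, and URI_SUFFIX_STRIP might be pictured as follows. This is a sketch under an assumption: the source does not state the order in which XQSync applies these properties, so the order used here (strip first, then add) is a guess chosen because it lets a prefix be replaced. The class OutputUriSketch is hypothetical.

```java
// Hypothetical helper, not part of XQSync. The order of operations
// (strip first, then prepend and append) is an assumption.
final class OutputUriSketch {

    // A null argument means the corresponding property is unset.
    static String rewrite(String uri, String prefixStrip, String prefix,
                          String suffixStrip, String suffix) {
        String result = uri;
        // Strip only when the URI actually starts or ends with the given string.
        if (prefixStrip != null && result.startsWith(prefixStrip)) {
            result = result.substring(prefixStrip.length());
        }
        if (suffixStrip != null && result.endsWith(suffixStrip)) {
            result = result.substring(0, result.length() - suffixStrip.length());
        }
        if (prefix != null) {
            result = prefix + result;
        }
        if (suffix != null) {
            result = result + suffix;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up values: move documents from /old/ to /new/ and swap the extension.
        System.out.println(rewrite("/old/a.xml", "/old/", "/new/", ".xml", ".bak"));
    }
}
```

With the made-up values in main, /old/a.xml becomes /new/a.bak. A URI that does not begin with the strip prefix keeps its original beginning and still receives the new prefix.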
ENCODE_OUTPUT_URI false If true, XQSync will encode URIs so that they conform to the URI standard.
USE_RANDOM_OUTPUT_URI false If true, XQSync will generate random URIs for inserting documents into MarkLogic. URI_PREFIX and URI_SUFFIX will still be obeyed.
USE_IN_FOREST_EVAL false If true, use in-forest eval when inserting documents into MarkLogic. XQSync uses OUTPUT_FORESTS as the list of eval forests if specified; otherwise, it uses all of the forests in the output database. In this mode, XQSync will not delete URIs that have 0 content bytes, and SKIP_EXISTING will not work.
CHECKSUM_MODULE null The checksum module to invoke when transferring documents between two MarkLogic databases. If this is specified, XQSync will use this module to compute a checksum from the source database and the destination database, and compare them. If the checksums do not match, XQSync will print a warning. A simple checksum module might look like this:
declare variable $URI external;

(: must return 0 for empty or nonexistent URIs :)

if ($URI and fn:doc-available($URI)) then
xdmp:md5(fn:concat(
  xdmp:quote(fn:doc($URI)),
  xdmp:quote(xdmp:document-properties($URI)),
  for $i in xdmp:quote(xdmp:document-get-collections($URI))
  order by $i
  return $i
))
else 0

Customization

Some users may wish to customize XQSync more deeply than the existing configuration options allow. This can be done by supplying the property CONFIGURATION_CLASSNAME. The named class must be a subclass of com.marklogic.ps.xqsync.Configuration, and can extend or override its method implementations.

For an example subclass, see com.marklogic.ps.tests.ExampleConfiguration. A typical implementation might override one or more of these methods:

Note that this feature should be considered experimental. The implementation may change at any time, and may not be backward-compatible with existing customizations.