XQSync Tutorial

MarkLogic Server includes built-in support for online, transactional backup and restore of both databases and forests. However, the on-disk format of these backups is platform-specific: a backup from Windows IA-32 can't be restored on a Solaris/SPARC server, nor vice versa.

XQSync is an application-level synchronization tool that can copy documents and their metadata between databases. XQSync can also package documents and their metadata as zip archives, or write them directly to a filesystem. XQSync can synchronize an entire database, a collection, a directory, or the results of evaluating an XQuery expression. Finally, XQSync can make some simple changes along the way: it can add a prefix or append a suffix to every document URI, and it can add new read permissions to every document.

Setting up XQSync

XQSync is a Java program, and requires a Java runtime environment, version 1.6 or later. If you don't have a JRE, download and install one. OpenJDK may also work, but is not tested.

Since XQSync works by connecting to a MarkLogic Server, you'll need the Mark Logic XCC libraries, too. You can download the MarkXCC.Java zip archive here. Be sure to download the latest XCC for Java release! After you download the zip archive, unpack it and find the jar files. You can put xcc.jar anywhere on your disk, but for this tutorial I'll assume that all of the jar files are in the current directory.

Next, we'll need a Java library called XStream: XQSync uses XStream to serialize and deserialize XML documents as Java objects. You can download XStream here.

Finally, you'll need a copy of XPP3. You can download it from this site. Again, be sure to get the latest version: right now, that's xpp3-1.1.3_8.jar.

Running XQSync

Now that we have a Java environment and all the libraries we need, let's try it out with the simplest possible invocation.

java -cp xqsync.jar:xcc.jar:xstream.jar:xpp3.jar \
  com.marklogic.ps.xqsync.XQSync

Note that the command line above is for Linux or Unix. On Windows, separate the classpath entries with semicolons instead. The rest of this tutorial uses Linux command lines: if you're using Windows, just translate the classpath appropriately.

java -cp "xqsync.jar;xcc.jar;xstream.jar;xpp3.jar" com.marklogic.ps.xqsync.XQSync

Java command lines can get pretty ugly. I'll keep using them for this tutorial, but you might want to put all of that into a shell script (or a batch file, on Windows). Remember, I put all my jar files in the current directory. You might have used another location: if so, just change the classpath in your command-line to match.
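If you would rather keep the launcher in Java than in a shell script or batch file, here is a minimal sketch of how the command line might be assembled. The class name XQSyncCommand and its methods are my own invention for illustration; only the jar names and the main class come from the commands in this tutorial.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class XQSyncCommand {
    // Main class name, as used in the command lines of this tutorial.
    static final String MAIN_CLASS = "com.marklogic.ps.xqsync.XQSync";

    /** Joins jar paths with the given separator (":" on Linux or Unix, ";" on Windows). */
    static String classpath(String separator, List<String> jars) {
        return String.join(separator, jars);
    }

    /** Builds the argument list for launching XQSync with the given jars. */
    static List<String> command(String separator, List<String> jars) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath(separator, jars));
        cmd.add(MAIN_CLASS);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> jars = Arrays.asList("xqsync.jar", "xcc.jar", "xstream.jar", "xpp3.jar");
        // File.pathSeparator is the classpath separator of the platform running this code.
        System.out.println(String.join(" ", command(File.pathSeparator, jars)));
    }
}
```

The resulting list could be handed to java.lang.ProcessBuilder to actually start XQSync. If your jars live somewhere else, pass their full paths instead of bare file names.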

java -cp xqsync.jar:xcc.jar:xstream.jar:xpp3.jar \
  com.marklogic.ps.xqsync.XQSync
Exception in thread "main" java.lang.NoClassDefFoundError: com/marklogic/ps/xqsync/XQSync

Oh, that doesn't look good. Hmm... looks like we forgot to put xqsync.jar into the current directory. We'll do that, and try again...

java -cp xqsync.jar:xcc.jar:xstream.jar:xpp3.jar \
  com.marklogic.ps.xqsync.XQSync
added system properties
logging to CONSOLE
logging to file simplelogger-%u-%g.log
Aug 25, 2006 7:30:37 AM com.marklogic.ps.SimpleLogger configureLogger
INFO: setting up com.marklogic.ps.SimpleLogger@190d11 for: com.marklogic.ps
Aug 25, 2006 7:30:38 AM com.marklogic.ps.xqsync.Configuration setProperties
INFO: first-time setup
Exception in thread "main" java.io.IOException: missing required property: INPUT_CONNECTION_STRING
        at com.marklogic.ps.xqsync.Configuration.configureInput(Configuration.java:276)
        at com.marklogic.ps.xqsync.Configuration.configure(Configuration.java:247)
        at com.marklogic.ps.xqsync.Configuration.setProperties(Configuration.java:188)
        at com.marklogic.ps.xqsync.XQSync.main(XQSync.java:52)

That looks a little bit better: we see a startup message, at least. But what about the error? What does missing required property: INPUT_CONNECTION_STRING mean?

INPUT_CONNECTION_STRING is the name of a property. In Java, properties are simply name-value pairs, such as SIZE=1 or PATH=/lib/java. XQSync uses properties for almost everything, so let's learn how to configure those properties.

XQSync Configuration

XQSync configures itself by looking at the System properties, plus any property files that you supply on the command-line. We'll learn about property files in a minute. But for now, we just need to decide where our content will come from (that's the INPUT_CONNECTION_STRING property), and where it will end up (that will be the OUTPUT_CONNECTION_STRING property). Both of these properties will be XCC connection URIs: XQSync will use them to connect to specific MarkLogic Server instances.
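To make the idea concrete, here is a minimal sketch of how a Java program might combine a property file with System properties, and then take a connection URI apart. This is not XQSync's own code: the class ConfigSketch and its methods are hypothetical, and I am assuming that System properties override values from a file, which XQSync may or may not do.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Properties;

public class ConfigSketch {
    /**
     * Merges properties-file text with a set of overrides.
     * Assumption: overrides win over file values; XQSync's real precedence may differ.
     */
    static Properties merge(String fileText, Properties overrides) {
        Properties merged = new Properties();
        try {
            merged.load(new StringReader(fileText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        merged.putAll(overrides);
        return merged;
    }

    /** Returns the named property, or fails if it has not been set. */
    static String require(Properties props, String name) {
        String value = props.getProperty(name);
        if (value == null) {
            throw new IllegalStateException("missing required property: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        String fileText = "INPUT_CONNECTION_STRING=xcc://admin:admin@localhost:9000/Documents\n";
        // System.getProperties() includes anything passed to java as -DNAME=value.
        Properties props = merge(fileText, System.getProperties());
        URI uri = URI.create(require(props, "INPUT_CONNECTION_STRING"));
        System.out.println("scheme=" + uri.getScheme() + " host=" + uri.getHost()
                + " port=" + uri.getPort() + " path=" + uri.getPath());
    }
}
```

The connection string follows ordinary URI syntax, so java.net.URI can split it into user, host, port, and path. How XCC itself interprets each part is up to the XCC library.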

You can also tell XQSync to use zip archives or filesystem directories as its output. Once you've written the contents of a MarkLogic database to an output package, you can then use that same package as the input package, restoring it to an output database.

Let's look at an example: to keep it simple, we'll use the built-in Documents database.

Example: Synchronizing the Use-Cases

Did you know that MarkLogic Server automatically sets up the W3C XQuery Use-Cases when it installs? If your server is running on your PC or laptop, you can view these use cases using one of the following links:

[Screenshot: the Use Cases application, as seen in versions prior to 5.0]

You will also need an XDBC server. To set one up, use the MarkLogic Server admin interface, which runs on port 8001. Click on Configure > App Servers > Create XDBC, and fill in the values for the new server. The commands later in this tutorial assume that the XDBC server listens on port 9000.

Now we are ready to run XQSync. For this example, we will back up the contents of Documents to a package (a zip archive file), and then restore it to the same Documents database. Note that we could achieve the same effect by using the built-in backup and restore features, but those features don't support cross-platform backup and restore.

java -cp xqsync.jar:xcc.jar:xstream.jar:xpp3.jar \
  -DINPUT_CONNECTION_STRING=xcc://admin:admin@localhost:9000/Documents \
  -DOUTPUT_PACKAGE=documents.zip \
  com.marklogic.ps.xqsync.XQSync

In this command line, we simply tell XQSync where to look for the input (INPUT_CONNECTION_STRING), and where to place the content it finds (OUTPUT_PACKAGE). When I ran that command line, here is the output that I saw.

added system properties
logging to CONSOLE
logging to file simplelogger-%u-%g.log
Aug 29, 2006 11:23:57 AM com.marklogic.ps.SimpleLogger configureLogger
INFO: setting up com.marklogic.ps.SimpleLogger@190d11 for: com.marklogic.ps
Aug 29, 2006 11:23:57 AM com.marklogic.ps.xqsync.Configuration setProperties
INFO: first-time setup
Aug 29, 2006 11:23:57 AM com.marklogic.ps.xqsync.Configuration configureInput
INFO: input from connection: xcc://admin:admin@localhost:9000/Documents
Aug 29, 2006 11:23:57 AM com.marklogic.ps.xqsync.Configuration configureOutput
INFO: output to package: documents.zip
Aug 29, 2006 11:23:57 AM com.marklogic.ps.xqsync.XQSync main
INFO: XQSync starting: version = 2006-08-29.1
Aug 29, 2006 11:23:57 AM com.marklogic.ps.xqsync.XQSync main
INFO: default encoding is UTF-8
Aug 29, 2006 11:23:57 AM com.marklogic.ps.xqsync.XQSync main
INFO: default encoding is now UTF-8
Aug 29, 2006 11:23:57 AM com.marklogic.ps.xqsync.XQSyncManager run
INFO: starting pool of 1 threads
Aug 29, 2006 11:23:57 AM com.marklogic.ps.xqsync.XQSyncManager getRequest
INFO: listing all documents
Aug 29, 2006 11:23:58 AM com.marklogic.ps.xqsync.XQSyncManager run
INFO: queued 20 items
Aug 29, 2006 11:23:58 AM com.marklogic.ps.xqsync.OutputPackage newZipOutputStream
INFO: package output going to new zipfile /tmp/xqsync/documents.zip
Aug 29, 2006 11:23:58 AM com.marklogic.ps.xqsync.Monitor run
INFO: loaded 20 records ok
Aug 29, 2006 11:23:58 AM com.marklogic.ps.xqsync.XQSyncManager run
INFO: exiting
Aug 29, 2006 11:23:58 AM com.marklogic.ps.xqsync.XQSync main
INFO: completed 20 in 1196 ms

That was fairly easy: we didn't see any errors, and the last line tells me that 20 documents were synchronized. Looking at the current directory, I see a new documents.zip.

$ ls
documents.zip         simplelogger-0-0.log.lck  xpp3.jar    xstream.jar
simplelogger-0-0.log  xcc.jar                   xqsync.jar
$ unzip -l documents.zip
Archive:  documents.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
      491  08-29-06 11:23   http:/www.amazon.com/reviews.xml
      458  08-29-06 11:23   http:/www.amazon.com/reviews.xml.metadata
     1299  08-29-06 11:23   report1.xml
      458  08-29-06 11:23   report1.xml.metadata
     1065  08-29-06 11:23   book.xml
      458  08-29-06 11:23   book.xml.metadata
     4614  08-29-06 11:23   auctionwatchlist.xml
      458  08-29-06 11:23   auctionwatchlist.xml.metadata
      923  08-29-06 11:23   http:/www.bn.com/bib.xml
      458  08-29-06 11:23   http:/www.bn.com/bib.xml.metadata
      626  08-29-06 11:23   purchaseReport.xml
      458  08-29-06 11:23   purchaseReport.xml.metadata
      689  08-29-06 11:23   census.xml
      458  08-29-06 11:23   census.xml.metadata
      823  08-29-06 11:23   ipo.xml
      458  08-29-06 11:23   ipo.xml.metadata
     3196  08-29-06 11:23   news.xml
      458  08-29-06 11:23   news.xml.metadata
      419  08-29-06 11:23   partlist.xml
      458  08-29-06 11:23   partlist.xml.metadata
      499  08-29-06 11:23   company-data.xml
      458  08-29-06 11:23   company-data.xml.metadata
     1752  08-29-06 11:23   bids.xml
      458  08-29-06 11:23   bids.xml.metadata
      536  08-29-06 11:23   users.xml
      458  08-29-06 11:23   users.xml.metadata
      638  08-29-06 11:23   prices.xml
      458  08-29-06 11:23   prices.xml.metadata
      939  08-29-06 11:23   purchaseOrder.xml
      458  08-29-06 11:23   purchaseOrder.xml.metadata
     5211  08-29-06 11:23   report2.xml
      458  08-29-06 11:23   report2.xml.metadata
      244  08-29-06 11:23   books.xml
      458  08-29-06 11:23   books.xml.metadata
     1712  08-29-06 11:23   items.xml
      458  08-29-06 11:23   items.xml.metadata
      241  08-29-06 11:23   zips.xml
      458  08-29-06 11:23   zips.xml.metadata
      160  08-29-06 11:23   postals.xml
      458  08-29-06 11:23   postals.xml.metadata
 --------                   -------
    35237                   40 files

Every document in my Documents database is represented by two entries in the archive: one holds the document content, and the other is a matching .metadata file. The metadata file records extra information about the document: its format, collections, permissions, properties, and quality.
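If you want to sanity-check a package before restoring it, the pairing of content and .metadata entries is easy to verify. The sketch below is my own illustration (the class PackageListing is hypothetical, not part of XQSync): it lists a zip archive with java.util.zip and reports any content entry that has no .metadata partner. It assumes that no document URI itself ends in ".metadata".

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class PackageListing {
    static final String METADATA_SUFFIX = ".metadata";

    /** Reads the entry names of a zip archive, in archive order. */
    static List<String> entryNames(String zipPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    /** Returns the content entries that have no matching ".metadata" entry. */
    static List<String> missingMetadata(Collection<String> entryNames) {
        Set<String> names = new HashSet<>(entryNames);
        List<String> missing = new ArrayList<>();
        for (String name : entryNames) {
            if (!name.endsWith(METADATA_SUFFIX) && !names.contains(name + METADATA_SUFFIX)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PackageListing documents.zip
        List<String> names = entryNames(args[0]);
        System.out.println(names.size() + " entries, missing metadata: " + missingMetadata(names));
    }
}
```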

I also see a log file: that contains the output of XQSync. This can be useful for long-running XQSync tasks, and for debugging. To increase the verbosity of the log, we can add LOG_LEVEL to the properties. Valid log levels are from java.util.logging, and include INFO, FINE, FINER, and FINEST.
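If you are unsure whether a level name is spelled correctly, you can ask java.util.logging directly. This is a small sketch of my own (the class LogLevelCheck is hypothetical), and it assumes that XQSync hands the LOG_LEVEL value to java.util.logging unchanged, which this tutorial implies but does not show.

```java
import java.util.logging.Level;

public class LogLevelCheck {
    /** True if java.util.logging accepts the value as a level name or integer level. */
    static boolean isKnownLevel(String name) {
        try {
            Level.parse(name);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Check a few candidate values for LOG_LEVEL.
        for (String name : new String[] {"INFO", "FINE", "FINER", "FINEST"}) {
            System.out.println(name + " -> " + isKnownLevel(name));
        }
    }
}
```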

To restore the zip file, we simply use INPUT_PACKAGE and OUTPUT_CONNECTION_STRING.

java -cp xqsync.jar:xcc.jar:xstream.jar:xpp3.jar \
  -DOUTPUT_CONNECTION_STRING=xcc://admin:admin@localhost:9000/Documents \
  -DINPUT_PACKAGE=documents.zip \
  com.marklogic.ps.xqsync.XQSync
added system properties
logging to CONSOLE
logging to file simplelogger-%u-%g.log
Aug 29, 2006 11:59:25 AM com.marklogic.ps.SimpleLogger configureLogger
INFO: setting up com.marklogic.ps.SimpleLogger@190d11 for: com.marklogic.ps
Aug 29, 2006 11:59:25 AM com.marklogic.ps.xqsync.Configuration setProperties
INFO: first-time setup
Aug 29, 2006 11:59:25 AM com.marklogic.ps.xqsync.Configuration configureInput
INFO: input from package: documents.zip
Aug 29, 2006 11:59:25 AM com.marklogic.ps.xqsync.Configuration configureOutput
INFO: output to connection: xcc://admin:admin@localhost:9000/Documents
Aug 29, 2006 11:59:25 AM com.marklogic.ps.xqsync.XQSync main
INFO: XQSync starting: version = 2006-08-29.1
Aug 29, 2006 11:59:25 AM com.marklogic.ps.xqsync.XQSync main
INFO: default encoding is UTF-8
Aug 29, 2006 11:59:25 AM com.marklogic.ps.xqsync.XQSync main
INFO: default encoding is now UTF-8
Aug 29, 2006 11:59:25 AM com.marklogic.ps.xqsync.XQSyncManager run
INFO: starting pool of 1 threads
Aug 29, 2006 11:59:25 AM com.marklogic.ps.xqsync.XQSyncManager queueFromInputPackage
INFO: listing package documents.zip
Aug 29, 2006 11:59:25 AM com.marklogic.ps.xqsync.XQSyncManager run
INFO: queued 20 items
Aug 29, 2006 11:59:26 AM com.marklogic.ps.xqsync.Monitor run
INFO: loaded 20 records ok
Aug 29, 2006 11:59:26 AM com.marklogic.ps.xqsync.XQSyncManager run
INFO: exiting
Aug 29, 2006 11:59:26 AM com.marklogic.ps.xqsync.XQSync main
INFO: completed 20 in 1726 ms

Advanced Configuration

In our simple example, we synchronized the entire Documents database to an output package. We can also synchronize by collection, using INPUT_COLLECTION_URI, or by directory, using INPUT_DIRECTORY_URI. We can even submit an arbitrary query, using INPUT_QUERY: the query must return a sequence of document URIs, and the documents at those URIs will be synchronized.

On the output side, we can specify prefixes and suffixes for every URI. We can also specify which Forests will receive the new documents, and add new read-permissions to every document, by role.
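To picture what a prefix or suffix does to a document URI, here is a tiny sketch. It is only an illustration of the idea: the class UriRewrite is hypothetical, the "/backup/" prefix and ".bak" suffix are made-up values, and the names of the XQSync properties that control this are not shown here (see the README for those).

```java
public class UriRewrite {
    /** Adds an optional prefix and suffix to a document URI; null means "leave as-is". */
    static String rewrite(String uri, String prefix, String suffix) {
        StringBuilder sb = new StringBuilder();
        if (prefix != null) {
            sb.append(prefix);
        }
        sb.append(uri);
        if (suffix != null) {
            sb.append(suffix);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prefix only, suffix only, and both together.
        System.out.println(rewrite("book.xml", "/backup/", null));
        System.out.println(rewrite("book.xml", null, ".bak"));
        System.out.println(rewrite("book.xml", "/backup/", ".bak"));
    }
}
```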

If needed, the process may be multithreaded. Simply set THREADS to the desired number of threads.

Performance and Limitations

XQSync uses a pool of worker threads to synchronize documents. The manager thread performs the input query (or lists the contents of the input package) and queues each document for the workers. In parallel, the workers write any queued documents to the destination. Note that nothing is queued, and no work is performed, until the input query has been evaluated and begins to return results. For millions of documents, it may be better to synchronize some subset of the documents at a time. Collections, directories, and ad-hoc queries can all be useful for this purpose.
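The manager-and-workers arrangement can be sketched with the standard java.util.concurrent classes. This is not XQSync's actual implementation: the class WorkerPoolSketch is hypothetical, and the copyOne callback stands in for whatever reads one document from the input and writes it to the output.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class WorkerPoolSketch {
    /**
     * Queues every URI for a fixed pool of workers, then waits for the pool to drain.
     * Returns the number of URIs whose copy finished without an exception
     * (a partial count if the wait is interrupted or times out).
     */
    static int synchronize(List<String> uris, int threads, Consumer<String> copyOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger ok = new AtomicInteger();
        for (String uri : uris) {
            // The manager only queues work; the workers run it in parallel.
            pool.submit(() -> {
                try {
                    copyOne.accept(uri);
                    ok.incrementAndGet();
                } catch (RuntimeException e) {
                    System.err.println("failed: " + uri + ": " + e);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ok.get();
    }

    public static void main(String[] args) {
        List<String> uris = Arrays.asList("book.xml", "news.xml", "users.xml");
        int done = synchronize(uris, 2, uri -> System.out.println("copying " + uri));
        System.out.println("completed " + done);
    }
}
```

Note that the full list of URIs must exist before anything is queued here, which mirrors the point above: a very large input listing delays the start of the real work.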

Zip archives are inherently 32-bit. Thus, if you are synchronizing more than 2GB of XML, XQSync may need to split the work up into multiple archives.
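To get a feel for how work might be divided across archives, here is a rough planning sketch. It is my own illustration, not XQSync's splitting logic: the class ArchiveSplitter is hypothetical, it treats the limit as the largest signed 32-bit value, and it simply adds up document sizes in bytes without accounting for compression or zip overhead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArchiveSplitter {
    /** Largest value a signed 32-bit field can hold: 2 GB minus one byte. */
    static final long LIMIT = Integer.MAX_VALUE;

    /**
     * Greedily assigns document sizes (in bytes) to archives, in order, so that
     * no archive's total exceeds the limit. A single document larger than the
     * limit still gets an archive of its own; this sketch never splits a document.
     */
    static List<List<Long>> plan(List<Long> sizes, long limit) {
        List<List<Long>> archives = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long total = 0;
        for (long size : sizes) {
            if (!current.isEmpty() && total + size > limit) {
                // The current archive is full: start a new one.
                archives.add(current);
                current = new ArrayList<>();
                total = 0;
            }
            current.add(size);
            total += size;
        }
        if (!current.isEmpty()) {
            archives.add(current);
        }
        return archives;
    }

    public static void main(String[] args) {
        // Three documents of 1,000,000,000 bytes each cannot share one archive.
        List<Long> sizes = Arrays.asList(1_000_000_000L, 1_000_000_000L, 1_000_000_000L);
        System.out.println(plan(sizes, LIMIT).size() + " archives needed");
    }
}
```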

Conclusion

We hope this tutorial was useful. For more information about XQSync, see the README.

Troubleshooting

NoClassDefFoundError: Your classpath isn't finding one of the jar files. Examine it carefully, and make sure the paths are all accurate.

I'm running Windows, and XQSync is ignoring my INPUT_PACKAGE property: Try escaping all the back-slash characters in your path. For example, change c:\foo to c:\\foo - alternatively, change back-slashes to forward-slashes.
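If the path is set in a property file, the reason is most likely the way java.util.Properties reads backslashes: it treats them as escape characters. The sketch below demonstrates that behavior with the standard library alone. The class BackslashDemo and the path c:\data\pkg.zip are made up for illustration, and I am assuming that XQSync reads its property files with java.util.Properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class BackslashDemo {
    /** Loads one line of property-file text and returns the value of INPUT_PACKAGE. */
    static String inputPackage(String line) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(line));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("INPUT_PACKAGE");
    }

    public static void main(String[] args) {
        // Java string literals need their own escaping: "\\" below is ONE backslash in the file.
        System.out.println(inputPackage("INPUT_PACKAGE=c:\\data\\pkg.zip"));     // prints c:datapkg.zip
        System.out.println(inputPackage("INPUT_PACKAGE=c:\\\\data\\\\pkg.zip")); // prints c:\data\pkg.zip
        System.out.println(inputPackage("INPUT_PACKAGE=c:/data/pkg.zip"));       // prints c:/data/pkg.zip
    }
}
```

With single backslashes the separators silently disappear, so the path points at a file that doesn't exist. Doubled backslashes or forward slashes both survive loading.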

XQSync: a Wheelbarrow for Content