Corb - Content-Reprocessing in Bulk

Corb is a Java tool designed for bulk content-reprocessing. Essentially, it lists all the documents in a collection (or all the documents in the database), and then uses a pool of worker threads to apply an XQuery module to each document.
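The "list the URIs, then hand each one to a pool of worker threads" pattern described above can be sketched in plain Java. This is a minimal illustration of the general technique, not Corb's actual implementation: the class WorkerPoolSketch, the DocumentTask interface, and the processAll method are hypothetical names, and the task simply returns a string where Corb would invoke the XQuery module.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: these are not Corb's classes or internals.
class WorkerPoolSketch {

    // Hypothetical stand-in for "apply the XQuery module to one document".
    interface DocumentTask {
        String process(String uri);
    }

    // Submits one task per URI to a fixed-size pool and collects the
    // results in the same order as the input URIs.
    static List<String> processAll(List<String> uris, int threadCount, final DocumentTask task) {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (final String uri : uris) {
                futures.add(pool.submit(new Callable<String>() {
                    public String call() {
                        return task.process(uri);
                    }
                }));
            }
            List<String> results = new ArrayList<String>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get()); // blocks until this task finishes
                } catch (ExecutionException e) {
                    throw new IllegalStateException("task failed", e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting work; lets the JVM exit
        }
    }

    public static void main(String[] args) {
        List<String> uris = Arrays.asList("/a.xml", "/b.xml", "/c.xml");
        List<String> done = processAll(uris, 2, new DocumentTask() {
            public String process(String uri) {
                return "processed " + uri;
            }
        });
        for (String line : done) {
            System.out.println(line);
        }
    }
}
```

The thread count here plays the same role as the optional THREAD-COUNT argument: it bounds how many documents are being processed at once.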

Required JVM: Sun 1.5 or later

Required libraries:

Required configuration:

Additional configuration notes

If the target XDBC server is configured with a library location on the filesystem:

If the target XDBC server is configured with a library location pointing to a database:

Running Corb

The entry point is the main method in the com.marklogic.developer.corb.Manager class. Corb requires 3 command-line arguments, in this order: XCC-CONNECTION-URI, COLLECTION-NAME, and XQUERY-MODULE.

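There are 5 additional optional command-line arguments: THREAD-COUNT, URIS-MODULE, MODULE-ROOT, MODULES-DATABASE, and INSTALL. They are positional, so supplying any one of them requires supplying all of the ones before it.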

com.marklogic.developer.corb.Manager \
  XCC-CONNECTION-URI COLLECTION-NAME XQUERY-MODULE [ THREAD-COUNT [ URIS-MODULE [ MODULE-ROOT [ MODULES-DATABASE [ INSTALL ] ] ] ] ]
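The nested brackets in the usage line mean that the arguments are identified purely by position. A minimal sketch of that mapping follows. It is not Corb's own parser: PositionalArgsSketch and parse are hypothetical names, and rejecting more than eight arguments is an assumption made for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a hypothetical mapping of positions to argument names.
class PositionalArgsSketch {

    // Argument names in the order given by the usage line.
    static final String[] NAMES = {
        "XCC-CONNECTION-URI", "COLLECTION-NAME", "XQUERY-MODULE",
        "THREAD-COUNT", "URIS-MODULE", "MODULE-ROOT", "MODULES-DATABASE", "INSTALL"
    };

    // The first three arguments are required; the remaining five are optional.
    static final int REQUIRED = 3;

    // Maps each supplied argument to its name by position.
    // Omitted optional arguments are simply absent from the map.
    static Map<String, String> parse(String[] args) {
        if (args.length < REQUIRED || args.length > NAMES.length) {
            throw new IllegalArgumentException(
                "expected " + REQUIRED + " to " + NAMES.length
                + " arguments, got " + args.length);
        }
        Map<String, String> parsed = new LinkedHashMap<String, String>();
        for (int i = 0; i < args.length; i++) {
            parsed.put(NAMES[i], args[i]);
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = parse(new String[] {
            "xcc://admin:admin@localhost:9002/", "",
            "custom-transform.xqy", "2", "custom-uri-selection.xqy"
        });
        System.out.println(parsed);
    }
}
```

Because the mapping is positional, a URIS-MODULE can only be given if a THREAD-COUNT precedes it, and a MODULES-DATABASE only if a MODULE-ROOT precedes it.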

Writing a Custom URI Module

The URIS-MODULE must be an XQuery main module, and must return a sequence of (xs:integer, xs:string*). The first item must be the number of URIs in the sequence that follows it.

For example, this simple URIS-MODULE would return all available URIs from the URI Lexicon, behaving just as Corb normally would with COLLECTION-NAME="" and no URIS-MODULE.

(: simple URIS-MODULE example :)
let $uris := cts:uris('', 'document')
return (count($uris), $uris)

This example may be extended to intersect a cts:query, etc.
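Seen from the consuming side, the (count, URIs) contract for a URIS-MODULE result could be checked as in the sketch below. This only illustrates what the contract implies: UrisResultSketch and extractUris are hypothetical names, the result items are modeled as plain strings, and whether Corb itself verifies the count is not stated here.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: models a URIS-MODULE result as a list of strings.
class UrisResultSketch {

    // Splits a result of the form (count, uri1, uri2, ...) and checks
    // that the leading count matches the number of URIs that follow.
    static List<String> extractUris(List<String> items) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("result is empty: the count is missing");
        }
        int expected = Integer.parseInt(items.get(0));
        List<String> uris = items.subList(1, items.size());
        if (uris.size() != expected) {
            throw new IllegalArgumentException(
                "count says " + expected + " URIs, but " + uris.size() + " were returned");
        }
        return uris;
    }

    public static void main(String[] args) {
        List<String> uris = extractUris(Arrays.asList("2", "/a.xml", "/b.xml"));
        System.out.println(uris);
    }
}
```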

Sample Invocations

The following sample invocation uses a sample medline-reprocessing XQuery module, which is included in corb.jar. You can also download medline-iso8601.xqy.

java -cp $HOME/lib/java/xcc.jar:$HOME/lib/java/corb.jar \
  com.marklogic.developer.corb.Manager \
  xcc://admin:admin@localhost:9002/ "" \
  medline-iso8601.xqy

The next sample invocation uses a processing module loaded from the filesystem, an alternate URI selection module, and 2 threads.

java -cp $HOME/lib/java/xcc.jar:$HOME/lib/java/corb.jar \
  com.marklogic.developer.corb.Manager \
  xcc://admin:admin@localhost:9002/ "" \
  /home/myproject/src/custom-transform.xqy 2 \
  /home/myproject/src/custom-uri-selection.xqy

A third sample invocation uses 4 threads and custom modules pre-installed in the 'mydb' database. It processes documents with /preprocessing/custom-transform.xqy, using the URIs returned by /preprocessing/custom-uri-selection.xqy:

java -cp $HOME/lib/java/xcc.jar:$HOME/lib/java/corb.jar \
  com.marklogic.developer.corb.Manager \
  xcc://admin:admin@localhost:9002/ "" \
  custom-transform.xqy 4 \
  custom-uri-selection.xqy \
  /preprocessing/ \
  mydb \
  false

As Corb processes the documents, various progress messages will be logged.