RecordLoader

To get started with RecordLoader, try the tutorial. For the source code, see the project page on GitHub.

Running RecordLoader

The entry point is the main method in the com.marklogic.ps.RecordLoader class. The command-line arguments may be any mix of zero or more property files and zero or more input files.

Any property file names must end in .properties. Other file names on the command line will be treated as input files, not property files. Any specified system properties will override file-based properties, and properties found in later files may override properties specified in earlier files on the command line. It is also possible to specify properties as VM arguments (-DNAME=value). See src/recordloader.sh for a sample shell script. See src/config/ for sample property files.
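The precedence rules above can be sketched in a few lines of Java. This is only an illustration of the described behavior, not RecordLoader's actual code: the class `PropertyPrecedenceSketch` and its methods are hypothetical names, and the system properties are passed in explicitly (a real run would presumably use `System.getProperties()`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical illustration only; not part of RecordLoader.
final class PropertyPrecedenceSketch {

    /** Arguments ending in .properties are treated as property files. */
    static List<String> propertyFiles(String[] args) {
        List<String> result = new ArrayList<String>();
        for (String arg : args) {
            if (arg.endsWith(".properties")) {
                result.add(arg);
            }
        }
        return result;
    }

    /** Every other argument is treated as an input file. */
    static List<String> inputFiles(String[] args) {
        List<String> result = new ArrayList<String>();
        for (String arg : args) {
            if (!arg.endsWith(".properties")) {
                result.add(arg);
            }
        }
        return result;
    }

    /**
     * Merges file-based properties in command-line order, so later files
     * override earlier ones, then applies system properties last so that
     * they override anything read from files.
     */
    static Properties merge(List<Properties> fileProperties,
                            Properties systemProperties) {
        Properties merged = new Properties();
        for (Properties p : fileProperties) {
            merged.putAll(p);
        }
        merged.putAll(systemProperties);
        return merged;
    }

    public static void main(String[] args) {
        Properties first = new Properties();
        first.setProperty("THREADS", "1");
        first.setProperty("ID_NAME", "#FILENAME");

        Properties second = new Properties();
        second.setProperty("THREADS", "4");

        // Stands in for -DID_NAME=@id on the JVM command line.
        Properties system = new Properties();
        system.setProperty("ID_NAME", "@id");

        Properties merged = merge(Arrays.asList(first, second), system);
        System.out.println("THREADS=" + merged.getProperty("THREADS"));
        System.out.println("ID_NAME=" + merged.getProperty("ID_NAME"));
    }
}
```

In this sketch, `THREADS` takes its value from the second file and `ID_NAME` takes its value from the system property, matching the order described above.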

Input files will be used as input for the loader. The input file names must not end in .properties. These may be binary, text, or XML (see DOCUMENT_FORMAT, below). XML files may be single documents or superfiles containing many documents (see RECORD_NAME, below).

Required JVM: Sun 1.5 or later

Required libraries:

MarkLogic XCC
recordloader.jar
XPP3 XML Pull-Parser (use xpp3-1.1.4c.jar or later)
junit.jar
svnkit.jar

Required inputs:

None. If ID_NAME is missing, then the default value #FILENAME will be used.

Available properties:

Property	default value	notes
CONFIGURATION_CLASSNAME	com.marklogic.recordloader.xcc.XccConfiguration	This class will be used to provide configuration information. This class must be an extension of the com.marklogic.recordloader.Configuration class.
CONTENT_FACTORY_CLASSNAME	com.marklogic.recordloader.xcc.XccContentFactory	This class will be used to create new content objects, which implement com.marklogic.recordloader.ContentInterface. One alternative implementation is provided, as `com.marklogic.recordloader.xcc.XccModuleContentFactory`, which creates objects in the class `com.marklogic.recordloader.xcc.XccModuleContent`. When XccModuleContentFactory is used, new documents must fit in memory, and will be posted to the XQuery main module designated by the `CONTENT_MODULE_URI` property. If the `SKIP_EXISTING` or `ERROR_EXISTING` features are desired, the module must implement each itself (see below). When RecordLoader invokes this module, it will set these external variables:

`$URI`
`$XML-STRING`
`$NAMESPACE` (using `DEFAULT_NAMESPACE`)
`$LANGUAGE` (using `LANGUAGE`)
`$ROLES-EXECUTE` (comma-separated values, using `ROLES_EXECUTE`)
`$ROLES-INSERT` (comma-separated values, using `ROLES_INSERT`)
`$ROLES-READ` (comma-separated values, using `ROLES_READ`)
`$ROLES-UPDATE` (comma-separated values, using `ROLES_UPDATE`)
`$COLLECTIONS` (comma-separated values, using any base collections plus `OUTPUT_COLLECTIONS`)
`$SKIP-EXISTING` (using `SKIP_EXISTING`)
`$ERROR-EXISTING` (using `ERROR_EXISTING`)
`$FORESTS` (comma-separated forest ids, via `OUTPUT_FORESTS`)

The following XQuery implements an example ContentModule, which applies a simple transform to lower-case all element names. Note that the module implements its own versions of the `SKIP_EXISTING` and `ERROR_EXISTING` checks.

    xquery version "1.0-ml";

    declare variable $URI as xs:string external;
    declare variable $XML-STRING as xs:string external;
    declare variable $NAMESPACE as xs:string external;
    declare variable $LANGUAGE as xs:string external;
    declare variable $ROLES-EXECUTE as xs:string external;
    declare variable $ROLES-INSERT as xs:string external;
    declare variable $ROLES-READ as xs:string external;
    declare variable $ROLES-UPDATE as xs:string external;
    declare variable $COLLECTIONS as xs:string external;
    declare variable $SKIP-EXISTING as xs:boolean external;
    declare variable $ERROR-EXISTING as xs:boolean external;
    declare variable $FORESTS as xs:string external;

    declare function local:do($list as node()*) as node()*
    {
      for $n in $list
      return typeswitch($n)
        (: lower-case element localnames :)
        case element() return element {
          QName(namespace-uri($n), lower-case(local-name($n)))
        } {
          $n/@*,
          local:do($n/node())
        }
        case document-node() return document { local:do($n/node()) }
        default return $n
    };

    if ($SKIP-EXISTING and doc($URI)) then ()
    else if ($ERROR-EXISTING and doc($URI))
    then error((), 'DUPLICATE-URI', $URI)
    else xdmp:document-insert(
      $URI,
      local:do(xdmp:unquote(
        $XML-STRING, $NAMESPACE,
        if ($LANGUAGE) then concat('default-language=', $LANGUAGE) else ()
      )),
      (
        for $r in tokenize($ROLES-EXECUTE, ',')[. ne '']
        return xdmp:permission($r, 'execute'),
        for $r in tokenize($ROLES-INSERT, ',')[. ne '']
        return xdmp:permission($r, 'insert'),
        for $r in tokenize($ROLES-READ, ',')[. ne '']
        return xdmp:permission($r, 'read'),
        for $r in tokenize($ROLES-UPDATE, ',')[. ne '']
        return xdmp:permission($r, 'update')
      ),
      tokenize($COLLECTIONS, ',')[. ne ''],
      0,
      for $id in tokenize($FORESTS, ',')[. ne '']
      return xs:unsignedLong($id)
    )

Another included factory is `com.marklogic.recordloader.http.HttpContentFactory`. This is similar to the XCC version above, except that the implementation must use `xdmp:get-request-field()` rather than declaring `external` to populate the module variables.

Note that some variables take `xs:string*` rather than `xs:string`: these are `COLLECTIONS`, `FORESTS`, `ROLES-EXECUTE`, `ROLES-INSERT`, `ROLES-READ`, and `ROLES-UPDATE`. The `xs:boolean` variables arrive as `xs:string` values instead, so the module must convert them. Also, `FORESTS` will be passed in as forest names, so the module must be prepared to translate these into forest ids.

    declare variable $URI as xs:string
      := xdmp:get-request-field('URI');
    declare variable $XML-STRING as xs:string
      := xdmp:get-request-field('XML-STRING');
    declare variable $NAMESPACE as xs:string
      := xdmp:get-request-field('NAMESPACE');
    declare variable $LANGUAGE as xs:string
      := xdmp:get-request-field('LANGUAGE');
    declare variable $ROLES-EXECUTE as xs:string*
      := xdmp:get-request-field('ROLES-EXECUTE');
    declare variable $ROLES-INSERT as xs:string*
      := xdmp:get-request-field('ROLES-INSERT');
    declare variable $ROLES-READ as xs:string*
      := xdmp:get-request-field('ROLES-READ');
    declare variable $ROLES-UPDATE as xs:string*
      := xdmp:get-request-field('ROLES-UPDATE');
    declare variable $COLLECTIONS as xs:string*
      := xdmp:get-request-field('COLLECTIONS');
    declare variable $SKIP-EXISTING as xs:boolean
      := xs:boolean(xdmp:get-request-field('SKIP-EXISTING'));
    declare variable $ERROR-EXISTING as xs:boolean
      := xs:boolean(xdmp:get-request-field('ERROR-EXISTING'));
    declare variable $FORESTS as xs:string*
      := xdmp:get-request-field('FORESTS');

    if ($SKIP-EXISTING and doc($URI)) then ()
    else if ($ERROR-EXISTING and doc($URI))
    then error((), 'DUPLICATE-URI', $URI)
    else xdmp:document-insert(
      $URI,
      xdmp:unquote(
        $XML-STRING, $NAMESPACE,
        if ($LANGUAGE) then concat('default-language=', $LANGUAGE) else ()
      ),
      (
        (: role names are declared as xs:string*, so iterate the sequence :)
        for $r in $ROLES-EXECUTE[. ne '']
        return xdmp:permission($r, 'execute'),
        for $r in $ROLES-INSERT[. ne '']
        return xdmp:permission($r, 'insert'),
        for $r in $ROLES-READ[. ne '']
        return xdmp:permission($r, 'read'),
        for $r in $ROLES-UPDATE[. ne '']
        return xdmp:permission($r, 'update')
      ),
      $COLLECTIONS,
      0,
      (: translate forest names into forest ids :)
      for $fn in $FORESTS return xdmp:forest($fn)
    )

CONNECTION_STRING	xcc://admin:admin@localhost:9000/	XCC URI, including username, password, host, and port, to use for all queries and inserts. If desired, a database name may also be supplied. Multiple connection strings may be separated with whitespace or commas. To use SSL for encrypted communication, start the connection string with the `xccs://` scheme rather than the default `xcc://` scheme. Make sure the XDBC server is configured for SSL.
DEFAULT_NAMESPACE	null	If present, all XML will default to the supplied namespace uri.
DELETE_INPUT_FILES	false	If true, each input file will be deleted after being loaded. This setting does not affect zip archives. This setting does not delete directories. If `FATAL_ERRORS` is false, then the input document may be deleted even though errors have occurred.
DOCUMENT_FORMAT	xml	Document format for all new documents. Valid settings are `xml`, `text`, and `binary`
ENCRYPTED_PASSWORD	false	Not everyone wants to include a password in the properties file. The ENCRYPTED_PASSWORD flag adds a "little" bit of security. When this flag is enabled, RecordLoader looks for the "ciphertext" and "keyfile" files, which contain the encrypted password to use. To create these two files, run `java -cp recordloader.jar com.marklogic.ps.PasswordEncrypter PASSWORD`, replacing PASSWORD with the password to encrypt. The properties file then no longer needs to contain a password, and the encrypted password files can be kept separately, securely, and in a form that is not human-readable.
ERROR_EXISTING	false	If true, RecordLoader will throw an error if it finds itself trying to overwrite an existing document uri. This error may or may not be fatal, depending on the value of `FATAL_ERRORS`. Note that this option requires the server to perform a separate check for each document uri, which can reduce performance. Note that if using `CONTENT_FACTORY_CLASSNAME=com.marklogic.recordloader.xcc.XccModuleContentFactory`, this option requires the module to implement its own mechanism (see above).
FATAL_ERRORS	true	If true, RecordLoader will exit with an error upon encountering any non-retryable error. If set to false, RecordLoader will close the current record and continue on to the next.
ID_NAME	`#FILENAME`	Within each input document or RECORD_NAME element, the first element called ID_NAME will be used to compose the new document uri. If ID_NAME starts with '@', an attribute with this local-name will be used to compose the new document uri. Note that namespace is ignored: only the local-name is used. The named node must have a simple text value: it may not be empty, and it must not contain any non-text children. The special value `ID_NAME=#AUTO` will cause RecordLoader to automatically generate ids, in sequence, for each input record. Since RecordLoader automatically includes the base filename in each output URI, this is safe. Note that when the input is standard input, the default value is `#AUTO` - not `#FILENAME`. The special value `ID_NAME=#FILENAME` will cause RecordLoader to automatically load each input file into a single document per input file, using the file's basename to compose the new document uri. This is the default behavior. Examples: ID_NAME=MedlineID, ID_NAME=@id
IGNORE_FILE_BASENAME	false	If true, RecordLoader will omit the file or zip archive basename when composing new document uris.
IGNORE_UNKNOWN	false	If set, RecordLoader will ignore siblings of RECORD_NAME that are not RECORD_NAME elements. Otherwise, this condition causes a fatal error.
INPUT_MALFORMED_ACTION	REPORT	One of the constant values from java.nio.charset.CodingErrorAction, used to determine what happens if there are invalid character sequences in the input XML. REPORT: throws a MalformedInputException. REPLACE: replaces the invalid sequence with a '?' or similar. IGNORE: skips over the invalid sequence.
INPUT_ENCODING	UTF-8	The Java Charset encoding (codepage) to use for all input XML. If unset, RecordLoader will use null, which will default to the default Locale's character encoding. Note that MarkLogic Server must receive all XML as UTF-8, so the output encoding is always UTF-8. Example: if the input XML is encoded as `windows-1252`, use `INPUT_ENCODING=Cp1252` to ensure correct conversion.
INPUT_ESCAPE_IDS	true	If true, all input ids will be URI-escaped. Note that the default is true if `ID_NAME=#FILENAME`, which is the default. In other modes, the default is false.
INPUT_FILE_SIZE_LIMIT	0	If greater than zero, RecordLoader will skip any input files larger than `INPUT_FILE_SIZE_LIMIT` bytes. This does not apply to zip archives, nor to the size of their entries.
INPUT_HANDLER_CLASSNAME	com.marklogic.recordloader.DefaultInputHandler	The specified class will be used to marshal loader inputs. The default class handles `INPUT_PATH` as well as command-line arguments. This property is meant for plug-in classes, which must implement com.marklogic.recordloader.InputHandlerInterface, and may extend the com.marklogic.recordloader.AbstractInputHandler class. Built-in alternatives: com.marklogic.recordloader.svn.SvnInputHandler treats `INPUT_PATH` as a Subversion repository url (EXPERIMENTAL).
INPUT_PATH	null	The filesystem path in which to look for XML files or zip archives. If unset, RecordLoader will read XML directly from standard input.
INPUT_PATTERN	^.+\\.[Xx][Mm][Ll]$	Matching pattern (regex) for files found in INPUT_PATH. The default value matches all filenames ending with `.xml`
INPUT_STREAMING	false	If true, the input content will be streamed into the database. By default, content will be buffered one document at a time, per thread. The streaming option requires less memory, but is more fragile: if interrupted, a document insert cannot be retried.
INPUT_STRIP_PREFIX	null	If not null, characters matching this pattern (regex) will be removed from all input URIs. For example, Windows users may wish to set `INPUT_STRIP_PREFIX=^[A-Z]:` so that document URIs in the database do not include drive-letter prefixes.
INPUT_NORMALIZE_PATHS	false	If true, backslashes in input paths will be coalesced and replaced with slashes in all output document URIs. This is useful for Windows users, especially in combination with `INPUT_STRIP_PREFIX`. With both properties set as suggested, `C:\foo\bar\baz.xml` on the filesystem becomes `/foo/bar/baz.xml` in the database.
LANGUAGE	null	If set, the value will be passed to XCC `ContentCreateOptions.setLanguage()`, or to the `CONTENT_MODULE` external variable `$LANGUAGE`. Accepted values are documented in XML 1.0 and RFC 3066. If null, the default database language will be used.
LOG_LEVEL	INFO	java.util.logging.Level at which to log.
LOG_HANDLER	CONSOLE,FILE	java.util.logging handlers with which to log.
LOOP_FOREVER	false	If set, RecordLoader will loop forever on the arguments and properties. This is most useful when combined with `DELETE_INPUT_FILES`.
OUTPUT_COLLECTIONS	null	One or more collections to apply to every new document. Use whitespace to separate multiple collection uris. Note that the actual document collections will also include so-called "base collections". One of these is a batch marker, `com.marklogic.ps.RecordLoader.{ system milliseconds }`. This base collection can be useful for tracking how and when documents were ingested. Another base collection is derived from the input filename (see `USE_FILENAME_COLLECTION`). Another base collection is derived from the current wall-clock time (see `USE_TIMESTAMP_COLLECTION`).
OUTPUT_FORESTS	null	If set, all documents will be explicitly placed into the named forests. Use whitespace or the characters `,:;` to separate values.
OUTPUT_QUALITY	0	When using XccContentFactory for inserts, this value will be used to set document quality.
ROLES_EXECUTE	null	One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have execute permission for these roles. If any of the supplied role names do not exist, the first document insert will throw a fatal error.
ROLES_INSERT	null	One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have insert permission for these roles. If any of the supplied role names do not exist, the first document insert will throw a fatal error.
ROLES_READ	null	One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have read permission for these roles. If any of the supplied role names do not exist, the first document insert will throw a fatal error.
ROLES_UPDATE	null	One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have update permission for these roles. If any of the supplied role names do not exist, the first document insert will throw a fatal error.
RECORD_NAME	null	Element name in which each document is found. These may not nest. If no RECORD_NAME is set, the first child element of the first root element will be used for the entire RecordLoader run. If `ID_NAME` is set to an element or attribute name, or set to `#AUTO` (including when RecordLoader reads from standard input), then the special value `RECORD_NAME=#DOCUMENT` will cause RecordLoader to treat every document root element as a record. This mode is slower than `ID_NAME=#FILENAME`, but useful when the filenames are not appropriate as document URIs.
RECORD_NAMESPACE	null	Element namespace in which each document is found. If unset, but RECORD_NAME is set, then the empty namespace is assumed. If unset, and RECORD_NAME is also unset, then the namespace of the first child element of the first root element will be used for the entire RecordLoader run.
SKIP_EXISTING	false	If true, existing document uris will be skipped. This allows RecordLoader to resume after being interrupted. This option may be combined with `START_ID`, in case the known value for `START_ID` already exists. Note that one read I/O is required per skip, so SKIP_EXISTING is slower than using START_ID (below). Note that if using `CONTENT_FACTORY_CLASSNAME=com.marklogic.recordloader.xcc.XccModuleContentFactory`, this option requires the module to implement its own mechanism (see above).
SKIP_EXISTING_UNTIL_FIRST_MISS	false	If true, existing documents will be skipped until the first miss is found. This can be somewhat faster than `SKIP_EXISTING` alone, but may result in updates to some existing documents.
START_ID	null	When set, records are skipped until one with an `ID_NAME` value equal to `START_ID` is found. This can be used to resume ingestion after interruptions or fatal errors.
START_ID_MULTITHREADED	false	Normally, `START_ID` causes RecordLoader to temporarily reduce `THREADS` to 1 until the starting value is found. When this property is true, RecordLoader allows multiple threads to run even before the start value has been found. Until then, all threads will skip their input records. Once one thread has found the start value, all threads will begin processing input records normally. Enabling this property can sometimes be useful, but may result in unpredictable or non-deterministic behavior.
THREADS	1	Number of RecordLoader threads. Note that when using standard input, this value is ignored. Note that RecordLoader uses at most 1 thread per input file or zip entry.
THROTTLE_BYTES_PER_SECOND	0	If non-zero, all threads will be throttled to the given number of bytes inserted per second.
THROTTLE_EVENTS_PER_SECOND	0	If non-zero, all threads will be throttled to the given number of inserts per second.
URI_PREFIX	null	Prefix used before the ID_NAME value, to compose all document uris. If the prefix does not end in '/', RecordLoader will add a '/' to it.
URI_SUFFIX	null	Suffix used after the ID_NAME value, to compose all document uris.
USE_FILENAME_COLLECTION	true	If `ID_NAME` is not `#FILENAME`, and this property is true, RecordLoader will add an extra collection to each record, built from the filename of the current input file. This can be useful when splitting superfiles.
USE_TIMESTAMP_COLLECTION	true	If this property is true, RecordLoader will add an extra collection to each record, built from the timestamp at which RecordLoader started running. This can be useful when tracking down problems with content loaded at different times.
XML_REPAIR_LEVEL	NONE	To what degree should XPP3 and MarkLogic Server compensate for invalid XML? NONE: throw an exception (see also: `FATAL_ERRORS`). FULL: do everything reasonable to ingest the document.
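
As an illustration of how `INPUT_PATTERN`, `INPUT_STRIP_PREFIX`, and `INPUT_NORMALIZE_PATHS` could interact, here is a minimal Java sketch. It is not RecordLoader's implementation: the class `UriPathSketch`, its method names, and the order of operations (strip first, then normalize) are assumptions made for the example. Note that inside a Java string literal, as in a .properties file, a literal backslash is written as `\\`, which is why the default `INPUT_PATTERN` above shows two backslashes.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of RecordLoader.
final class UriPathSketch {

    /** Returns true if the whole file name matches the given regex. */
    static boolean matchesInputPattern(String fileName, String inputPattern) {
        return Pattern.compile(inputPattern).matcher(fileName).matches();
    }

    /**
     * Removes text matching stripPrefix (if not null), then, if normalize
     * is true, coalesces runs of backslashes into a single forward slash.
     * The order of these two steps is an assumption for this sketch.
     */
    static String toUriPath(String path, String stripPrefix, boolean normalize) {
        String result = path;
        if (stripPrefix != null) {
            result = Pattern.compile(stripPrefix).matcher(result).replaceAll("");
        }
        if (normalize) {
            // The regex \\+ matches one or more consecutive backslashes.
            result = result.replaceAll("\\\\+", "/");
        }
        return result;
    }

    public static void main(String[] args) {
        String pattern = "^.+\\.[Xx][Mm][Ll]$";
        System.out.println(matchesInputPattern("baz.XML", pattern));
        System.out.println(toUriPath("C:\\foo\\bar\\baz.xml", "^[A-Z]:", true));
    }
}
```

With `INPUT_STRIP_PREFIX=^[A-Z]:` and `INPUT_NORMALIZE_PATHS=true`, the sketch turns `C:\foo\bar\baz.xml` into `/foo/bar/baz.xml`, the same result described in the table.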

Troubleshooting

`XmlPullParserException: could not resolve entity named 'foo'`.

The XPP implementation used by RecordLoader, xpp3, does not handle unknown references, and does not process DTD-style document declarations. So if your XML includes non-XML character entities, RecordLoader is not for you. Future enhancements could include a plug-in system, allowing the user to substitute an XPP implementation that supports document declarations.

`java.util.concurrent.RejectedExecutionException`.

If you are using RecordLoader with thousands of files or zipfile entries, you may need to increase the JVM heap space. Try -Xmx256m as one of your command-line JVM arguments.

With Solaris, my UTF-8 accents and diacritics are mangled.

You should see UTF-8 in the output from locale -a:

$ locale -a | grep -i utf
en_CA.UTF-8
en_US.UTF-8
es.UTF-8
es_MX.UTF-8
fr.UTF-8
fr_CA.UTF-8

If no UTF-8 locales are available, make sure to install the correct Solaris packages:

SUNWeuluf
SUNWeu8os
SUNWicu
SUNWicud
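
When diagnosing mangled characters, it can also help to check which default charset the JVM reports, and to see how the same bytes decode under different encodings and `INPUT_MALFORMED_ACTION` values. The following Java sketch uses only standard `java.nio.charset` classes. The class `DecodeSketch` and its `decode` method are hypothetical names for this example, and the code is not taken from RecordLoader. It uses ISO-8859-1 because every JVM is required to support it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical illustration only; not part of RecordLoader.
final class DecodeSketch {

    /**
     * Decodes bytes with the named charset, applying the given action to
     * malformed or unmappable input. With CodingErrorAction.REPORT, bad
     * input raises a CharacterCodingException.
     */
    static String decode(byte[] bytes, String charsetName, CodingErrorAction action)
            throws CharacterCodingException {
        return Charset.forName(charsetName)
                .newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // This is the charset the JVM falls back to when none is specified.
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // "caf" followed by 0xE9, which is e-acute in ISO-8859-1
        // but an incomplete byte sequence in UTF-8.
        byte[] bytes = { 'c', 'a', 'f', (byte) 0xE9 };

        System.out.println(decode(bytes, "ISO-8859-1", CodingErrorAction.REPORT));
        System.out.println(decode(bytes, "UTF-8", CodingErrorAction.REPLACE));
        System.out.println(decode(bytes, "UTF-8", CodingErrorAction.IGNORE));
        try {
            decode(bytes, "UTF-8", CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            System.out.println("REPORT raised: " + e);
        }
    }
}
```

If the reported default charset is not UTF-8 and the input is UTF-8, setting `INPUT_ENCODING` explicitly, or fixing the locale as described above, is the likely remedy.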