RecordLoader

To get started with RecordLoader, try the tutorial. For the source code, see the project page on GitHub.

Running RecordLoader

The entry point is the main method in the com.marklogic.ps.RecordLoader class. The command-line arguments may be any mix of zero or more property files and zero or more input files.

Any property file names must end in .properties. Other file names on the command line will be treated as input files, not property files. Any specified system properties will override file-based properties, and properties found in later files may override properties specified in earlier files on the command line. It's also possible to specify properties as VM arguments (-DNAME=value). See src/recordloader.sh for a sample shell script. See src/config/ for sample property files.
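The argument handling and override order described above can be sketched as follows. This is only an illustration of the stated rules, not RecordLoader's actual code: the class name PropertyPrecedenceSketch and its methods are hypothetical, and the sketch merges in-memory Properties objects rather than reading files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical illustration; none of these names come from RecordLoader.
class PropertyPrecedenceSketch {

    /** Arguments ending in .properties are treated as property files. */
    static List<String> propertyFiles(String[] args) {
        List<String> out = new ArrayList<>();
        for (String arg : args) {
            if (arg.endsWith(".properties")) {
                out.add(arg);
            }
        }
        return out;
    }

    /** Every other argument is treated as an input file. */
    static List<String> inputFiles(String[] args) {
        List<String> out = new ArrayList<>();
        for (String arg : args) {
            if (!arg.endsWith(".properties")) {
                out.add(arg);
            }
        }
        return out;
    }

    /**
     * Merges properties so that later files override earlier files,
     * and system properties (-DNAME=value) override all files.
     */
    static Properties merge(List<Properties> fromFiles, Properties system) {
        Properties merged = new Properties();
        for (Properties p : fromFiles) {
            merged.putAll(p);
        }
        merged.putAll(system);
        return merged;
    }

    public static void main(String[] args) {
        System.out.println("property files: " + propertyFiles(args));
        System.out.println("input files: " + inputFiles(args));
    }
}
```

With this ordering, a value such as THREADS given in two property files takes the value from the later file, unless -DTHREADS=... is also supplied on the JVM command line.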

Input files will be used as input for the loader. The input file names must not end in .properties. These may be binary, text, or XML (see INPUT_FORMAT, below). XML files may be single documents or superfiles containing many documents (see RECORD_NAME below).

Required JVM: Sun 1.5 or later

Required libraries:

Required inputs:

None. If ID_NAME is missing, then the default value #FILENAME will be used.

Available properties:

Property    Default value    Notes
CONFIGURATION_CLASSNAME com.marklogic.recordloader.xcc.XccConfiguration This class will be used to provide configuration information. This class must be an extension of the com.marklogic.recordloader.Configuration class.
CONTENT_FACTORY_CLASSNAME com.marklogic.recordloader.xcc.XccContentFactory

This class will be used to create new content objects, which implement com.marklogic.recordloader.ContentInterface. One alternative implementation is provided, as com.marklogic.recordloader.xcc.XccModuleContentFactory, which creates objects in the class com.marklogic.recordloader.xcc.XccModuleContent.

When XccModuleContentFactory is used, new documents must fit in memory, and will be posted to the XQuery main module designated by the CONTENT_MODULE_URI property. If the SKIP_EXISTING or ERROR_EXISTING features are desired, the module must implement each itself (see below).

When RecordLoader invokes this module, it will set external variables:

  • $URI
  • $XML-STRING
  • $NAMESPACE  (using DEFAULT_NAMESPACE)
  • $LANGUAGE  (using LANGUAGE)
  • $ROLES-EXECUTE  (comma-separated values, using ROLES_EXECUTE)
  • $ROLES-INSERT  (comma-separated values, using ROLES_INSERT)
  • $ROLES-READ  (comma-separated values, using ROLES_READ)
  • $ROLES-UPDATE  (comma-separated values, using ROLES_UPDATE)
  • $COLLECTIONS  (comma-separated values, using any base collections plus OUTPUT_COLLECTIONS)
  • $SKIP-EXISTING  (using SKIP_EXISTING)
  • $ERROR-EXISTING  (using ERROR_EXISTING)
  • $FORESTS  (comma-separated forest ids, via OUTPUT_FORESTS)

The following XQuery implements an example ContentModule, which implements a simple transform to lower-case all element names. Note that the module implements its own versions of the SKIP_EXISTING and ERROR_EXISTING checks.

xquery version "1.0-ml";

declare variable $URI as xs:string external;
declare variable $XML-STRING as xs:string external;
declare variable $NAMESPACE as xs:string external;
declare variable $LANGUAGE as xs:string external;
declare variable $ROLES-EXECUTE as xs:string external;
declare variable $ROLES-INSERT as xs:string external;
declare variable $ROLES-READ as xs:string external;
declare variable $ROLES-UPDATE as xs:string external;
declare variable $COLLECTIONS as xs:string external;
declare variable $SKIP-EXISTING as xs:boolean external;
declare variable $ERROR-EXISTING as xs:boolean external;
declare variable $FORESTS as xs:string external;

declare function local:do($list as node()*)
 as node()*
{
  for $n in $list return typeswitch($n)
  (: lower-case element localnames :)
  case element() return element {
    QName(namespace-uri($n), lower-case(local-name($n)))
  } {
    $n/@*, local:do($n/node())
  }
  case document-node() return document { local:do($n/node()) }
  default return $n
};

if ($SKIP-EXISTING and doc($URI)) then ()
else if ($ERROR-EXISTING and doc($URI)) then error((), 'DUPLICATE-URI', $URI)
else xdmp:document-insert(
  $URI,
  local:do(xdmp:unquote(
    $XML-STRING,
    $NAMESPACE,
    if ($LANGUAGE) then concat('default-language=', $LANGUAGE) else ()
  )),
  (
    for $r in tokenize($ROLES-EXECUTE, ',')[. ne '']
    return xdmp:permission($r, 'execute'),
    for $r in tokenize($ROLES-INSERT, ',')[. ne '']
    return xdmp:permission($r, 'insert'),
    for $r in tokenize($ROLES-READ, ',')[. ne '']
    return xdmp:permission($r, 'read'),
    for $r in tokenize($ROLES-UPDATE, ',')[. ne '']
    return xdmp:permission($r, 'update')
  ),
  tokenize($COLLECTIONS, ',')[. ne ''],
  0,
  for $id in tokenize($FORESTS, ',')[. ne '']
  return xs:unsignedLong($id)
)

Another included factory is com.marklogic.recordloader.http.HttpContentFactory. This is similar to the XCC version above, except that the module must use xdmp:get-request-field() rather than external variable declarations to populate the module variables. Note that some variables take xs:string* rather than xs:string: these are COLLECTIONS, FORESTS, ROLES-EXECUTE, ROLES-INSERT, ROLES-READ, and ROLES-UPDATE. The xs:boolean values arrive as xs:string instead, so the module must cast them. Also, FORESTS will be passed in as forest names, so the module must be prepared to translate these into forest ids.

declare variable $URI as xs:string := xdmp:get-request-field(
  'URI');
declare variable $XML-STRING as xs:string := xdmp:get-request-field(
  'XML-STRING');
declare variable $NAMESPACE as xs:string := xdmp:get-request-field(
  'NAMESPACE');
declare variable $LANGUAGE as xs:string := xdmp:get-request-field(
  'LANGUAGE');
declare variable $ROLES-EXECUTE as xs:string* := xdmp:get-request-field(
  'ROLES-EXECUTE');
declare variable $ROLES-INSERT as xs:string* := xdmp:get-request-field(
  'ROLES-INSERT');
declare variable $ROLES-READ as xs:string* := xdmp:get-request-field(
  'ROLES-READ');
declare variable $ROLES-UPDATE as xs:string* := xdmp:get-request-field(
  'ROLES-UPDATE');
declare variable $COLLECTIONS as xs:string* := xdmp:get-request-field(
  'COLLECTIONS');
declare variable $SKIP-EXISTING as xs:boolean := xs:boolean(
  xdmp:get-request-field('SKIP-EXISTING'));
declare variable $ERROR-EXISTING as xs:boolean := xs:boolean(
  xdmp:get-request-field('ERROR-EXISTING'));
declare variable $FORESTS as xs:string* := xdmp:get-request-field(
  'FORESTS');

if ($SKIP-EXISTING and doc($URI)) then ()
else if ($ERROR-EXISTING and doc($URI)) then error((), 'DUPLICATE-URI', $URI)
else xdmp:document-insert(
  $URI,
  xdmp:unquote(
    $XML-STRING,
    $NAMESPACE,
    if ($LANGUAGE) then concat('default-language=', $LANGUAGE) else ()
  ),
  (
    for $r in tokenize($ROLES-EXECUTE, ',')[. ne '']
    return xdmp:permission($r, 'execute'),
    for $r in tokenize($ROLES-INSERT, ',')[. ne '']
    return xdmp:permission($r, 'insert'),
    for $r in tokenize($ROLES-READ, ',')[. ne '']
    return xdmp:permission($r, 'read'),
    for $r in tokenize($ROLES-UPDATE, ',')[. ne '']
    return xdmp:permission($r, 'update')
  ),
  $COLLECTIONS,
  0,
  for $fn in $FORESTS return xdmp:forest($fn)
)
CONNECTION_STRING xcc://admin:admin@localhost:9000/ XCC URI, including username, password, host, and port, to use for all queries and inserts. If desired, a database name may also be supplied. Multiple connection strings may be separated with whitespace or commas. To use SSL for encrypted communication, start the connection string with the xccs:// scheme rather than the default xcc:// scheme. Make sure the XDBC server is configured for SSL.
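As a rough illustration of the CONNECTION_STRING format, the following sketch splits a value on whitespace or commas and parses each entry with java.net.URI. The class ConnectionStringSketch, the second host name, and its credentials are made up for the example, and the sketch assumes that the optional database name is given as the URI path. It does not show how RecordLoader or XCC parse the value internally.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of RecordLoader or XCC.
class ConnectionStringSketch {

    /** Splits on whitespace or commas and parses each piece as a URI. */
    static List<URI> parse(String value) {
        List<URI> uris = new ArrayList<>();
        for (String piece : value.trim().split("[\\s,]+")) {
            if (!piece.isEmpty()) {
                uris.add(URI.create(piece));
            }
        }
        return uris;
    }

    /** True when the URI uses the SSL scheme xccs:// instead of xcc://. */
    static boolean usesSsl(URI uri) {
        return "xccs".equalsIgnoreCase(uri.getScheme());
    }

    public static void main(String[] args) {
        String value = "xcc://admin:admin@localhost:9000/, "
                + "xccs://loader:secret@db.example.com:9001/Documents";
        for (URI uri : parse(value)) {
            System.out.println(uri.getHost() + ":" + uri.getPort()
                    + " ssl=" + usesSsl(uri) + " path=" + uri.getPath());
        }
    }
}
```

A password containing characters such as '@' or '/' would need to be percent-encoded before java.net.URI could parse it this way.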
DEFAULT_NAMESPACE null If present, all XML will default to the supplied namespace uri.
DELETE_INPUT_FILES false If true, each input file will be deleted after being loaded. This setting does not affect zip archives. This setting does not delete directories. If FATAL_ERRORS is false, then the input document may be deleted even though errors have occurred.
DOCUMENT_FORMAT xml Document format for all new documents. Valid settings are xml, text, and binary.
ENCRYPTED_PASSWORD false

Not everyone wants to include their password in the properties file. The ENCRYPTED_PASSWORD flag adds a little security. When this flag is enabled, the RecordLoader process will look for the "ciphertext" and "keyfile" files that contain the encrypted password to be used. To create the two encrypted password files, execute:

java -cp recordloader.jar com.marklogic.ps.PasswordEncrypter PASSWORD

Replace PASSWORD with the password to be encrypted. The properties file then does not contain a password, and the encrypted password files can be kept separate and secure, in a form that is not human-readable.

ERROR_EXISTING false If true, RecordLoader will throw an error if it finds itself trying to overwrite an existing document uri. This error may or may not be fatal, depending on the value of FATAL_ERRORS.

Note that this option requires the server to perform a separate check for each document uri. This can reduce performance.

Note that if using CONTENT_FACTORY_CLASSNAME=com.marklogic.recordloader.xcc.XccModuleContentFactory, this option requires the module to implement its mechanism (see above).

FATAL_ERRORS true If true, RecordLoader will exit with an error upon encountering any non-retryable error. If set to false, RecordLoader will close the current record and continue on to the next.
ID_NAME #FILENAME Within each input document or RECORD_NAME element, the first element called ID_NAME will be used to compose the new document uri. If ID_NAME starts with '@', an attribute with this local-name will be used to compose the new document uri.

Note that namespace is ignored: only the local-name is used. The named node must have a simple text value: it may not be empty, and it must not contain any non-text children.

The special value ID_NAME=#AUTO will cause RecordLoader to automatically generate ids, in sequence, for each input record. Since RecordLoader automatically includes the base filename in each output URI, this is safe.

Note that when the input is standard input, the default value is #AUTO - not #FILENAME.

The special value ID_NAME=#FILENAME will cause RecordLoader to automatically load each input file into a single document per input file, using the file's basename to compose the new document uri. This is the default behavior.

Examples: ID_NAME=MedlineID, ID_NAME=@id

IGNORE_FILE_BASENAME false If true, RecordLoader will omit the file or zip archive basename when composing new document uris.
IGNORE_UNKNOWN false If set, RecordLoader will ignore siblings of RECORD_NAME that are not RECORD_NAME elements. Otherwise, this condition causes a fatal error.
INPUT_MALFORMED_ACTION REPORT Constant values from java.nio.charset.CodingErrorAction, used to determine what happens if there are invalid character sequences in the input XML.
  • REPORT: throws a MalformedInputException
  • REPLACE: replaces invalid sequence with a '?' or similar.
  • IGNORE: skips over the invalid sequence.
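The three actions listed above are standard java.nio.charset behavior and can be tried in isolation. The sketch below is not RecordLoader code: the class MalformedInputSketch and its methods are hypothetical, and it simply decodes a byte array containing one byte (0xFF) that is never valid in UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical illustration of java.nio.charset.CodingErrorAction.
class MalformedInputSketch {

    /** Decodes bytes, applying the given action to invalid sequences. */
    static String decode(byte[] bytes, String encoding, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName(encoding).newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    /** Like decode(), but reports the exception type instead of throwing. */
    static String tryDecode(byte[] bytes, String encoding, CodingErrorAction action) {
        try {
            return decode(bytes, encoding, action);
        } catch (CharacterCodingException e) {
            return "ERROR: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF is invalid in UTF-8
        System.out.println(tryDecode(bad, "UTF-8", CodingErrorAction.REPORT));
        System.out.println(tryDecode(bad, "UTF-8", CodingErrorAction.REPLACE));
        System.out.println(tryDecode(bad, "UTF-8", CodingErrorAction.IGNORE));
    }
}
```

With the JDK's UTF-8 decoder the replacement character is U+FFFD rather than a literal '?'. Other charsets may use a different replacement.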
INPUT_ENCODING UTF-8 The Java Charset encoding (codepage) to use for all input XML. If unset, RecordLoader will use null, which will default to the default Locale's character encoding.
Note that MarkLogic Server must receive all XML as UTF-8, so the output encoding is always UTF-8.
Example: if the input XML is encoded as windows-1252, use INPUT_ENCODING=Cp1252 to ensure correct conversion.
INPUT_ESCAPE_IDS true If true, all input ids will be URI-escaped. Note that the default is true if ID_NAME=#FILENAME, which is the default. In other modes, the default is false.
INPUT_FILE_SIZE_LIMIT 0 If greater than zero, RecordLoader will skip any input files larger than INPUT_FILE_SIZE_LIMIT bytes. This does not apply to zip archives, nor to the size of their entries.
INPUT_HANDLER_CLASSNAME com.marklogic.recordloader.DefaultInputHandler The specified class will be used to marshal loader inputs. The default class handles INPUT_PATH as well as command-line arguments. This property is meant for plug-in classes, which must implement com.marklogic.recordloader.InputHandlerInterface, and may extend the com.marklogic.recordloader.AbstractInputHandler class.
Built-in alternatives:
  • com.marklogic.recordloader.svn.SvnInputHandler treats INPUT_PATH as a subversion repository url (EXPERIMENTAL).
INPUT_PATH null The filesystem path in which to look for XML files or zip archives. If unset, RecordLoader will read XML directly from standard input.
INPUT_PATTERN ^.+\\.[Xx][Mm][Ll]$ Matching pattern (regex) for files found in INPUT_PATH. The default value matches all filenames ending with .xml
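The doubled backslash in the default INPUT_PATTERN is consistent with java.util.Properties escaping, where a backslash must itself be escaped in a property file. The sketch below assumes the pattern is read through java.util.Properties and applied to a file name. The class InputPatternSketch is hypothetical, and whether RecordLoader matches against the bare name or the full path is not stated here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of RecordLoader.
class InputPatternSketch {

    /** Reads INPUT_PATTERN from property-file text and compiles it. */
    static Pattern load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return Pattern.compile(props.getProperty("INPUT_PATTERN"));
    }

    /** True when the whole file name matches the pattern. */
    static boolean accepts(Pattern pattern, String fileName) {
        return pattern.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Java source needs four backslashes to produce the two
        // backslashes that appear in the property file text.
        Pattern p = load("INPUT_PATTERN=^.+\\\\.[Xx][Mm][Ll]$\n");
        System.out.println(p.pattern());
        System.out.println(accepts(p, "medline01.XML"));
        System.out.println(accepts(p, "notes.txt"));
    }
}
```

If the property file contained a single backslash instead, Properties would silently drop it, leaving an unescaped '.' that matches any character, so a name such as dataxml would also be accepted.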
INPUT_STREAMING false If true, the input content will be streamed into the database. By default, content will be buffered one document at a time, per thread. The streaming option requires less memory, but is more fragile: if interrupted, a document insert cannot be retried.
INPUT_STRIP_PREFIX null If not null, characters matching this pattern (regex) will be removed from all input URIs. For example, Windows users may wish to set INPUT_STRIP_PREFIX=^[A-Z]: so that document URIs in the database do not include drive-letter prefixes.
INPUT_NORMALIZE_PATHS false If true, backslashes in input paths will be coalesced and replaced with slashes in all output document URIs. This is useful for Windows users, especially in combination with INPUT_STRIP_PREFIX. With both properties set as suggested, C:\foo\bar\baz.xml on the filesystem becomes /foo/bar/baz.xml in the database.
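The combined effect of INPUT_STRIP_PREFIX and INPUT_NORMALIZE_PATHS on a Windows path can be approximated as follows. This is a sketch under assumptions, not RecordLoader's implementation: the class UriPathSketch is hypothetical, the prefix is removed before the backslashes are normalized, and only the first match of the prefix pattern is removed.

```java
// Hypothetical illustration; not part of RecordLoader.
class UriPathSketch {

    /**
     * Removes the first match of stripPrefix (if not null), then
     * optionally replaces each run of backslashes with a single slash.
     */
    static String toUriPath(String path, String stripPrefix, boolean normalize) {
        String result = path;
        if (stripPrefix != null) {
            result = result.replaceFirst(stripPrefix, "");
        }
        if (normalize) {
            // Regex \\+ matches one or more consecutive backslashes.
            result = result.replaceAll("\\\\+", "/");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toUriPath("C:\\foo\\bar\\baz.xml", "^[A-Z]:", true));
    }
}
```

For the example in the text, the input C:\foo\bar\baz.xml yields /foo/bar/baz.xml.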
LANGUAGE null If set, the value will be passed to XCC ContentCreateOptions.setLanguage(), or to the CONTENT_MODULE external variable $LANGUAGE. Accepted values are documented in XML 1.0 and RFC 3066.

If null, the default database language will be used.

LOG_LEVEL INFO java.util.logging.Level at which to log.
LOG_HANDLER CONSOLE,FILE java.util.logging log handlers with which to log.
LOOP_FOREVER false If set, RecordLoader will loop forever on the arguments and properties. This is most useful when combined with DELETE_INPUT_FILES.
OUTPUT_COLLECTIONS null One or more collections to apply to every new document. Use whitespace to separate multiple collection uris. Note that the actual document collections will also include so-called "base collections". One of these is a batch marker, com.marklogic.ps.RecordLoader.{ system milliseconds }. This base collection can be useful for tracking how and when documents were ingested. Another base collection is derived from the input filename (see USE_FILENAME_COLLECTION). Another base collection is derived from the current wall-clock time (see USE_TIMESTAMP_COLLECTION).
OUTPUT_FORESTS null If set, all documents will be explicitly placed into the named forests. Use whitespace or the characters ,:; to separate values.
OUTPUT_QUALITY 0 When using XccContentFactory for inserts, this value will be used to set document quality.
ROLES_EXECUTE null One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have execute permission for these roles. If any of the supplied role names do not exist, the first document insert will throw a fatal error.
ROLES_INSERT null One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have insert permission for these roles. If any of the supplied role names do not exist, the first document insert will throw a fatal error.
ROLES_READ null One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have read permission for these roles. If any of the supplied role names do not exist, the first document insert will throw a fatal error.
ROLES_UPDATE null One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have update permission for these roles. If any of the supplied role names do not exist, the first document insert will throw a fatal error.
RECORD_NAME null

Element name in which each document is found. These may not nest. If no RECORD_NAME is set, the first child element of the first root element will be used for the entire RecordLoader run.

If ID_NAME is set to an element or attribute name, or set to #AUTO (including when RecordLoader reads from standard input), then the special value RECORD_NAME=#DOCUMENT will cause RecordLoader to treat every document root element as a record. This mode is slower than ID_NAME=#FILENAME, but useful when the filenames are not appropriate as document URIs.

RECORD_NAMESPACE null Element namespace in which each document is found. If unset, but RECORD_NAME is set, then the empty namespace is assumed. If unset, and RECORD_NAME is also unset, then the namespace of the first child element of the first root element will be used for the entire RecordLoader run.
SKIP_EXISTING false

If true, existing document uris will be skipped. This allows RecordLoader to resume after being interrupted. This option may be combined with START_ID, in case the known value for START_ID already exists.

Note that one read I/O is required per skip, so SKIP_EXISTING is slower than using START_ID (below).

Note that if using CONTENT_FACTORY_CLASSNAME=com.marklogic.recordloader.xcc.XccModuleContentFactory, this option requires the module to implement its mechanism (see above).

SKIP_EXISTING_UNTIL_FIRST_MISS false If true, existing documents will be skipped until the first miss is found. This can be somewhat faster than SKIP_EXISTING alone, but may result in updates to some existing documents.
START_ID null When set, records are skipped until one with an ID_NAME value equal to START_ID is found. This can be used to resume ingestion after interruptions or fatal errors.
START_ID_MULTITHREADED false Normally, START_ID causes RecordLoader to temporarily reduce THREADS to 1 until the starting value is found. When this property is true, RecordLoader allows multiple threads to run even before the start value has been found. Before the start value has been found, all threads will skip their input records. Once the start value has been found by one thread, all threads will begin processing input records normally.
Enabling this property can sometimes be useful, but may result in unpredictable or non-deterministic behavior.
THREADS 1

Number of RecordLoader threads.

Note that when using standard input, this value is ignored.

Note that RecordLoader uses at most 1 thread per input file or zip entry.

THROTTLE_BYTES_PER_SECOND 0 If non-zero, all threads will be throttled to the given number of bytes inserted per second.
THROTTLE_EVENTS_PER_SECOND 0 If non-zero, all threads will be throttled to the given number of inserts per second.
URI_PREFIX null Prefix used before the ID_NAME value, to compose all document uris. If the prefix does not end in '/', RecordLoader will add a '/' to it.
URI_SUFFIX null Suffix used after the ID_NAME value, to compose all document uris.
USE_FILENAME_COLLECTION true If ID_NAME is not #FILENAME, and this property is true, RecordLoader will add an extra collection to each record, built from the filename of the current input file. This can be useful when splitting superfiles.
USE_TIMESTAMP_COLLECTION true If this property is true, RecordLoader will add an extra collection to each record, built from the timestamp at which RecordLoader started running. This can be useful when tracking down problems with content loaded at different times.
XML_REPAIR_LEVEL NONE To what degree should XPP3 and MarkLogic Server compensate for invalid XML?
  • NONE: throw an exception (see also: FATAL_ERRORS).
  • FULL: do everything reasonable to ingest the document.

Troubleshooting

XmlPullParserException: could not resolve entity named 'foo'.

The XPP implementation used by RecordLoader, xpp3, does not handle unknown references, and does not process DTD-style document declarations. So if your XML includes non-XML character entities, RecordLoader is not for you. Future enhancements could include a plug-in system, allowing the user to substitute an XPP implementation that supports document declarations.

java.util.concurrent.RejectedExecutionException.

If you are using RecordLoader with thousands of files or zipfile entries, you may need to increase the JVM heap space. Try -Xmx256m as one of your command-line JVM arguments.

With Solaris, my UTF-8 accents and diacritics are mangled.

You should see UTF-8 in the output from locale -a:

$ locale -a | grep -i utf
en_CA.UTF-8
en_US.UTF-8
es.UTF-8
es_MX.UTF-8
fr.UTF-8
fr_CA.UTF-8

If no UTF-8 locales are available, make sure to install the correct Solaris packages: