To get started with RecordLoader, try the tutorial. For the source code, see the project page on GitHub.
The entry point is the main method in the com.marklogic.ps.RecordLoader class. The command-line arguments may be any mix of zero or more property files and zero or more input files.
Any property file names must end in .properties.
Other file names on the command line
will be treated as input files, not property files.
Any specified system properties will override file-based properties,
and properties found in later files may override properties
specified in earlier files on the command line.
It's also possible to specify properties as VM arguments (-DNAME=value).
See src/recordloader.sh for a sample shell script.
See src/config/ for sample property files.
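The precedence rules above can be illustrated with a small Java sketch. This is not RecordLoader's actual code: the class PropertyLayers and its methods are hypothetical, and the sketch assumes only the behavior described here, where later property files override earlier ones and system properties (-DNAME=value) override any file-based value.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of RecordLoader: it only illustrates
// the precedence rules described in the text.
class PropertyLayers {

    // Merge property sources in command-line order, then apply the
    // system properties last so that they win over file-based values.
    static Properties merge(List<Properties> filesInOrder, Properties system) {
        Properties merged = new Properties();
        for (Properties file : filesInOrder) {
            merged.putAll(file); // later files override earlier ones
        }
        merged.putAll(system);   // -DNAME=value overrides every file
        return merged;
    }

    // Parse properties-file syntax from a string, standing in for a file.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        Properties first = parse("THREADS=2\nID_NAME=MedlineID\n");
        Properties second = parse("THREADS=4\n");
        // Stand-in for values given as -DID_NAME=@id on the command line.
        Properties system = parse("ID_NAME=@id\n");

        Properties merged = merge(Arrays.asList(first, second), system);
        System.out.println("THREADS=" + merged.getProperty("THREADS")); // THREADS=4
        System.out.println("ID_NAME=" + merged.getProperty("ID_NAME")); // ID_NAME=@id
    }
}
```

In a real run the last layer would come from System.getProperties() rather than from a parsed string.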
Input files will be used as input for the loader.
The input file names must not end in .properties.
These may be binary, text, or XML (see INPUT_FORMAT, below).
XML files may be single documents or superfiles containing many documents (see RECORD_NAME, below).
None. If ID_NAME is missing, then the default value #FILENAME will be used.
Property | default value | notes |
---|---|---|
CONFIGURATION_CLASSNAME | com.marklogic.recordloader.xcc.XccConfiguration | This class will be used to provide configuration information. This class must be an extension of the com.marklogic.recordloader.Configuration class. |
CONTENT_FACTORY_CLASSNAME | com.marklogic.recordloader.xcc.XccContentFactory |
This class will be used to create new content objects,
which implement com.marklogic.recordloader.ContentInterface.
One alternative implementation is provided, as
When XccModuleContentFactory is used, new documents must fit in memory,
and will be posted to the XQuery main module designated by the
When RecordLoader invokes this module, it will set external variables:
The following XQuery implements an example ContentModule,
which implements a simple transform to lower-case all element names.
Note that the module implements its own versions of the
xquery version "1.0-ml";
declare variable $URI as xs:string external;
declare variable $XML-STRING as xs:string external;
declare variable $NAMESPACE as xs:string external;
declare variable $LANGUAGE as xs:string external;
declare variable $ROLES-EXECUTE as xs:string external;
declare variable $ROLES-INSERT as xs:string external;
declare variable $ROLES-READ as xs:string external;
declare variable $ROLES-UPDATE as xs:string external;
declare variable $COLLECTIONS as xs:string external;
declare variable $SKIP-EXISTING as xs:boolean external;
declare variable $ERROR-EXISTING as xs:boolean external;
declare variable $FORESTS as xs:string external;
declare function local:do($list as node()*) as node()*
{
  for $n in $list
  return typeswitch($n)
    (: lower-case element localnames :)
    case element() return element {
      QName(namespace-uri($n), lower-case(local-name($n)))
    } { $n/@*, local:do($n/node()) }
    case document-node() return document { local:do($n/node()) }
    default return $n
};
if ($SKIP-EXISTING and doc($URI)) then ()
else if ($ERROR-EXISTING and doc($URI)) then error((), 'DUPLICATE-URI', $URI)
else xdmp:document-insert(
  $URI,
  local:do(xdmp:unquote(
    $XML-STRING, $NAMESPACE,
    if ($LANGUAGE) then concat('default-language=', $LANGUAGE) else ()
  )),
  (
    for $r in tokenize($ROLES-EXECUTE, ',')[. ne ''] return xdmp:permission($r, 'execute'),
    for $r in tokenize($ROLES-INSERT, ',')[. ne ''] return xdmp:permission($r, 'insert'),
    for $r in tokenize($ROLES-READ, ',')[. ne ''] return xdmp:permission($r, 'read'),
    for $r in tokenize($ROLES-UPDATE, ',')[. ne ''] return xdmp:permission($r, 'update')
  ),
  tokenize($COLLECTIONS, ',')[. ne ''],
  0,
  for $id in tokenize($FORESTS, ',')[. ne ''] return xs:unsignedLong($id)
)
Another included factory is
declare variable $URI as xs:string := xdmp:get-request-field('URI');
declare variable $XML-STRING as xs:string := xdmp:get-request-field('XML-STRING');
declare variable $NAMESPACE as xs:string := xdmp:get-request-field('NAMESPACE');
declare variable $LANGUAGE as xs:string := xdmp:get-request-field('LANGUAGE');
declare variable $ROLES-EXECUTE as xs:string* := xdmp:get-request-field('ROLES-EXECUTE');
declare variable $ROLES-INSERT as xs:string* := xdmp:get-request-field('ROLES-INSERT');
declare variable $ROLES-READ as xs:string* := xdmp:get-request-field('ROLES-READ');
declare variable $ROLES-UPDATE as xs:string* := xdmp:get-request-field('ROLES-UPDATE');
declare variable $COLLECTIONS as xs:string* := xdmp:get-request-field('COLLECTIONS');
declare variable $SKIP-EXISTING as xs:boolean := xs:boolean(xdmp:get-request-field('SKIP-EXISTING'));
declare variable $ERROR-EXISTING as xs:boolean := xs:boolean(xdmp:get-request-field('ERROR-EXISTING'));
declare variable $FORESTS as xs:string* := xdmp:get-request-field('FORESTS');
if ($SKIP-EXISTING and doc($URI)) then ()
else if ($ERROR-EXISTING and doc($URI)) then error((), 'DUPLICATE-URI', $URI)
else xdmp:document-insert(
  $URI,
  xdmp:unquote(
    $XML-STRING, $NAMESPACE,
    if ($LANGUAGE) then concat('default-language=', $LANGUAGE) else ()
  ),
  (
    for $r in tokenize($ROLES-EXECUTE, ',')[. ne ''] return xdmp:permission($r, 'execute'),
    for $r in tokenize($ROLES-INSERT, ',')[. ne ''] return xdmp:permission($r, 'insert'),
    for $r in tokenize($ROLES-READ, ',')[. ne ''] return xdmp:permission($r, 'read'),
    for $r in tokenize($ROLES-UPDATE, ',')[. ne ''] return xdmp:permission($r, 'update')
  ),
  $COLLECTIONS,
  0,
  for $fn in $FORESTS return xdmp:forest($fn)
) |
CONNECTION_STRING | xcc://admin:admin@localhost:9000/ |
XCC URI, including username, password, host, and port,
to use for all queries and inserts.
If desired, a database name may also be supplied.
Multiple connection strings may be separated with whitespace or commas.
To use SSL for encrypted communication, start the connection string
with the xccs:// scheme
rather than the default xcc:// scheme.
Make sure the XDBC server is configured for SSL.
|
DEFAULT_NAMESPACE | null | If present, all XML will default to the supplied namespace uri. |
DELETE_INPUT_FILES | false | If true, each input file will be deleted after being loaded.
This setting does not affect zip archives.
This setting does not delete directories.
If FATAL_ERRORS is false,
then the input document may be deleted even though errors have occurred.
|
DOCUMENT_FORMAT | xml | Document format for all new documents.
Valid settings are xml, text, and binary.
|
ENCRYPTED_PASSWORD | false |
Not everyone wants to include their password in the properties file, so the ENCRYPTED_PASSWORD flag adds a "little" bit of security. When this flag is enabled, RecordLoader will look for the "ciphertext" and "keyfile" files, which contain the encrypted password to be used. To create the two encrypted password files, execute: java -cp recordloader.jar com.marklogic.ps.PasswordEncrypter PASSWORD (change PASSWORD to the password that is to be encrypted). The properties file then no longer contains a password, and the encrypted password files can be kept separately and securely, in a form that is not human-readable. |
ERROR_EXISTING | false |
If true, RecordLoader will throw an error
if it finds itself trying to overwrite an existing document uri.
This error may or may not be fatal,
depending on the value of FATAL_ERRORS.
Note that this option requires the server to perform a separate check for each document uri. This can reduce performance.
Note that if using
|
FATAL_ERRORS | true | If true, RecordLoader will exit with an error upon encountering any non-retryable error. If set to false, RecordLoader will close the current record and continue on to the next. |
ID_NAME | #FILENAME |
Within each input document or RECORD_NAME element,
the first element called ID_NAME will be used to compose the new document uri.
If ID_NAME starts with '@', an attribute with this local-name
will be used to compose the new document uri.
Note that namespace is ignored: only the local-name is used. The named node must have a simple text value: it may not be empty, and it must not contain any non-text children.
The special value
Note that when the input is standard input,
the default value is
The special value Examples: ID_NAME=MedlineID, ID_NAME=@id |
IGNORE_FILE_BASENAME | false | If true, RecordLoader will omit the file or zip archive basename when composing new document uris. |
IGNORE_UNKNOWN | false | If set, RecordLoader will ignore siblings of RECORD_NAME that are not RECORD_NAME elements. Otherwise, this condition causes a fatal error. |
INPUT_MALFORMED_ACTION | REPORT | Constant values from java.nio.charset.CodingErrorAction,
used to determine what happens if there are
invalid character sequences in the input XML.
|
INPUT_ENCODING | UTF-8 | The Java Charset encoding (codepage) to use for all input XML.
If unset, RecordLoader will use null,
which will default to the default Locale's character encoding.
Note that MarkLogic Server must receive all XML as UTF-8, so the output encoding is always UTF-8. Example: if the input XML is encoded as windows-1252,
use INPUT_ENCODING=Cp1252 to ensure correct conversion.
|
INPUT_ESCAPE_IDS | true |
If true, all input ids will be URI-escaped.
Note that the default is true if ID_NAME=#FILENAME,
which is the default. In other modes, the default is false.
|
INPUT_FILE_SIZE_LIMIT | 0 |
If greater than zero, RecordLoader will skip any input files
larger than INPUT_FILE_SIZE_LIMIT bytes.
This does not apply to zip archives, nor to the size of their entries.
|
INPUT_HANDLER_CLASSNAME | com.marklogic.recordloader.DefaultInputHandler |
The specified class will be used to marshall loader inputs.
The default class handles INPUT_PATH
as well as command-line arguments.
This property is meant for plug-in classes, which must implement
com.marklogic.recordloader.InputHandlerInterface,
and may extend the com.marklogic.recordloader.AbstractInputHandler
class.
Built-in alternatives:
|
INPUT_PATH | null | The filesystem path in which to look for XML files or zip archives. If unset, RecordLoader will read XML directly from standard input. |
INPUT_PATTERN | ^.+\\.[Xx][Mm][Ll]$ | Matching pattern (regex) for files found in INPUT_PATH.
The default value matches all filenames ending with .xml, in any combination of upper and lower case. |
INPUT_STREAMING | false | If true, the input content will be streamed into the database. By default, content will be buffered one document at a time, per thread. The streaming option requires less memory, but is more fragile: if interrupted, a document insert cannot be retried. |
INPUT_STRIP_PREFIX | null | If not null, characters matching this pattern (regex)
will be removed from all input URIs.
For example, Windows users may wish to set
INPUT_STRIP_PREFIX=^[A-Z]:
so that document URIs in the database
do not include drive-letter prefixes.
|
INPUT_NORMALIZE_PATHS | false | If true, backslashes in input paths
will be coalesced and replaced with slashes
in all output document URIs.
This is useful for Windows users,
especially in combination with INPUT_STRIP_PREFIX.
With both properties set as suggested,
C:\foo\bar\baz.xml on the filesystem becomes
/foo/bar/baz.xml in the database.
|
LANGUAGE | null |
If set, the value will be passed
to XCC ContentCreateOptions.setLanguage(),
or to the CONTENT_MODULE
external variable $LANGUAGE.
Accepted values are documented
in XML 1.0
and RFC 3066.
If null, the default database language will be used. |
LOG_LEVEL | INFO | java.util.logging.Level at which to log. |
LOG_HANDLER | CONSOLE,FILE | java.util.logging log handlers with which to log. |
LOOP_FOREVER | false |
If set, RecordLoader will loop forever on the arguments and properties.
This is most useful when combined with DELETE_INPUT_FILES.
|
OUTPUT_COLLECTIONS | null | One or more collections to apply to every new document.
Use whitespace to separate multiple collection uris.
Note that the actual document collections will also include
so-called "base collections". One of these is a batch marker,
com.marklogic.ps.RecordLoader.{system milliseconds}.
This base collection can be useful for tracking how and when
documents were ingested.
Another base collection is derived from
the input filename (see USE_FILENAME_COLLECTION).
Another base collection is derived from
the current wall-clock time (see USE_TIMESTAMP_COLLECTION).
|
OUTPUT_FORESTS | null | If set, all documents will be explicitly placed into the named forests.
Use whitespace or the characters ,:; to separate values. |
OUTPUT_QUALITY | 0 | When using XccContentFactory for inserts, this value will be used to set document quality. |
ROLES_EXECUTE | null | One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have execute permission for these roles. If any of the supplied role names does not exist, the first document insert will throw a fatal error. |
ROLES_INSERT | null | One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have insert permission for these roles. If any of the supplied role names does not exist, the first document insert will throw a fatal error. |
ROLES_READ | null | One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have read permission for these roles. If any of the supplied role names does not exist, the first document insert will throw a fatal error. |
ROLES_UPDATE | null | One or more existing role names, separated by whitespace. If set, every document inserted by RecordLoader will have update permission for these roles. If any of the supplied role names does not exist, the first document insert will throw a fatal error. |
RECORD_NAME | null |
Element name in which each document is found. These may not nest. If no RECORD_NAME is set, the first child element of the first root element will be used for the entire RecordLoader run. If |
RECORD_NAMESPACE | null | Element namespace in which each document is found. If unset, but RECORD_NAME is set, then the empty namespace is assumed. If unset, and RECORD_NAME is also unset, then the namespace of the first child element of the first root element will be used for the entire RecordLoader run. |
SKIP_EXISTING | false |
If true, existing document uris will be skipped.
This allows RecordLoader to resume after being interrupted.
This option may be combined with Note that one read I/O is required per skip, so SKIP_EXISTING is slower than using START_ID (below).
Note that if using
|
SKIP_EXISTING_UNTIL_FIRST_MISS | false | If true, existing documents will be skipped until the first miss
is found. This can be somewhat faster than SKIP_EXISTING alone,
but may result in updates to some existing documents.
|
START_ID | null |
When set, records are skipped
until one with an ID_NAME value
equal to START_ID is found.
This can be used to resume ingestion after interruptions or fatal errors.
|
START_ID_MULTITHREADED | false |
Normally, START_ID causes RecordLoader to temporarily reduce
THREADS to 1 until the starting value is found.
When this property is true, RecordLoader allows multiple threads to run
even before the start value has been found.
Before the start value has been found, all threads will skip
their input records. Once the start value has been found by one thread,
all threads will begin processing input records normally.
Enabling this property can sometimes be useful, but may result in unpredictable or non-deterministic behavior. |
THREADS | 1 |
Number of RecordLoader threads. Note that when using standard input, this value is ignored. Note that RecordLoader uses at most 1 thread per input file or zip entry. |
THROTTLE_BYTES_PER_SECOND | 0 | If non-zero, all threads will be throttled to the given number of bytes inserted per second. |
THROTTLE_EVENTS_PER_SECOND | 0 | If non-zero, all threads will be throttled to the given number of inserts per second. |
URI_PREFIX | null | Prefix used before the ID_NAME value, to compose all document uris. If the prefix does not end in '/', RecordLoader will add a '/' to it. |
URI_SUFFIX | null | Suffix used after the ID_NAME value, to compose all document uris. |
USE_FILENAME_COLLECTION | true | If ID_NAME is not #FILENAME,
and this property is true,
RecordLoader will add an extra collection to each record,
built from the filename of the current input file.
This can be useful when splitting superfiles.
|
USE_TIMESTAMP_COLLECTION | true | If this property is true, RecordLoader will add an extra collection to each record, built from the timestamp at which RecordLoader started running. This can be useful when tracking down problems with content loaded at different times. |
XML_REPAIR_LEVEL | NONE | To what degree should XPP3 and MarkLogic Server
compensate for invalid XML?
|
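The path and URI properties in the table above (INPUT_STRIP_PREFIX, INPUT_NORMALIZE_PATHS, URI_PREFIX, and URI_SUFFIX) can be sketched in Java as follows. This is an illustration only, not RecordLoader's implementation: the class UriSketch, its method names, the order in which the two path steps are applied, and the sample values /medline and 12345 are all assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not RecordLoader's own code.
class UriSketch {

    // Apply INPUT_STRIP_PREFIX (a regex, may be null) and then
    // INPUT_NORMALIZE_PATHS to a filesystem path.
    static String cleanPath(String path, String stripPrefix, boolean normalizePaths) {
        String result = path;
        if (stripPrefix != null) {
            // Assumption: every match of the pattern is removed.
            result = Pattern.compile(stripPrefix).matcher(result).replaceAll("");
        }
        if (normalizePaths) {
            // Coalesce each run of backslashes into one forward slash.
            result = result.replaceAll("\\\\+", "/");
        }
        return result;
    }

    // Compose a document URI from URI_PREFIX, an id, and URI_SUFFIX.
    static String compose(String prefix, String id, String suffix) {
        StringBuilder uri = new StringBuilder();
        if (prefix != null && !prefix.isEmpty()) {
            uri.append(prefix);
            if (!prefix.endsWith("/")) {
                uri.append('/'); // add the '/' that the prefix lacks
            }
        }
        uri.append(id);
        if (suffix != null) {
            uri.append(suffix);
        }
        return uri.toString();
    }

    public static void main(String[] args) {
        // Prints /foo/bar/baz.xml
        System.out.println(cleanPath("C:\\foo\\bar\\baz.xml", "^[A-Z]:", true));
        // Prints /medline/12345.xml
        System.out.println(compose("/medline", "12345", ".xml"));
    }
}
```

The first call reproduces the C:\foo\bar\baz.xml example from the INPUT_NORMALIZE_PATHS row.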
XmlPullParserException: could not resolve entity named 'foo'.
The XPP implementation used by RecordLoader, xpp3, does not handle unknown references, and does not process DTD-style document declarations. So if your XML includes non-XML character entities, RecordLoader is not for you. Future enhancements could include a plug-in system, allowing the user to substitute an XPP implementation that supports document declarations.
java.util.concurrent.RejectedExecutionException.
If you are using RecordLoader with thousands of files or zipfile entries,
you may need to increase the JVM heap space. Try -Xmx256m as one of your command-line JVM arguments.
You should see UTF-8 in the output from locale -a:
$ locale -a | grep -i utf
en_CA.UTF-8
en_US.UTF-8
es.UTF-8
es_MX.UTF-8
fr.UTF-8
fr_CA.UTF-8
If no UTF-8 locales are available, make sure to install the correct Solaris packages:
SUNWeuluf
SUNWeu8os
SUNWicu
SUNWicud
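Because INPUT_ENCODING falls back to the default locale's encoding when unset, it can help to see how Java decodes bytes under an explicit charset and a java.nio.charset.CodingErrorAction, which is what INPUT_MALFORMED_ACTION selects. The following is a minimal sketch using only standard JDK classes. The class DecodeSketch is hypothetical and is not how RecordLoader itself reads input, and ISO-8859-1 is used in place of Cp1252 only because every JVM is guaranteed to support it.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not RecordLoader's own code.
class DecodeSketch {

    // Decode bytes with the named charset. The action decides what
    // happens when the bytes are not valid in that charset:
    // REPORT fails, IGNORE drops the bad bytes, REPLACE substitutes.
    static String decode(byte[] input, String charsetName, CodingErrorAction action) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
            .onMalformedInput(action)
            .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(input)).toString();
        } catch (CharacterCodingException e) {
            // Only reachable when the action is REPORT.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute, encoded as a single ISO-8859-1 byte.
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };

        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // Correct charset: four characters are decoded.
        System.out.println(decode(latin1, "ISO-8859-1", CodingErrorAction.REPORT).length());

        // Wrong charset with IGNORE: the invalid byte is dropped.
        System.out.println(decode(latin1, "UTF-8", CodingErrorAction.IGNORE));

        // Wrong charset with REPORT: decoding fails.
        try {
            decode(latin1, "UTF-8", CodingErrorAction.REPORT);
        } catch (UncheckedIOException e) {
            System.out.println("malformed input: " + e.getCause());
        }
    }
}
```

With the default REPORT setting, input declared as UTF-8 but actually encoded in another codepage fails fast instead of silently losing characters.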