Note: Starting with MarkLogic 6, MarkLogic Content Pump (mlcp) is a fully-supported tool that covers the same ground as this long-standing open-source project. Content Pump is not supported on older versions of MarkLogic Server, so stick with RecordLoader (and this tutorial) if you are running an earlier release.
When you first learned to use XQuery and MarkLogic Server,
you probably brought XML into the database with the built-in loading tools.
Or maybe you prefer to use WebDAV: just drag and drop.
But sometimes those tools aren't enough.
Sometimes you need RecordLoader.
Use RecordLoader when...
RecordLoader can do all this quickly and easily. Along the way, it can...
RecordLoader is a Java program, and requires a Java 1.5 environment. If you don't have Java 1.5, download and install it.
First, we'll need RecordLoader itself. So download recordloader.jar.
Since RecordLoader works by connecting to a MarkLogic Server,
you'll need the MarkLogic XCC libraries, too.
You can download the MarkXCC.Java zip archive
Be sure to download the latest XCC for Java release!
After you download the zip archive, unpack it and find the jar files.
You can put
xcc.jar anywhere on your disk,
but for this tutorial I'll assume
that both files are in the current directory.
Finally, you'll need a copy of XPP3. You can download it from the XPP3 project site.
Again, be sure to get the latest version.
Now that we have a Java environment and all the libraries we need, let's try it out with the simplest possible invocation.
Java command lines can get pretty ugly. I'll keep using them for this tutorial, but you might want to put all of that into a shell script (or a batch file, on Windows). Remember, I put all my jar files in the current directory. You might have used another location: if so, just change the classpath in your command-line to match.
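As a sketch, the simplest invocation might look like this. The main class name (com.marklogic.ps.RecordLoader) and the jar locations are assumptions: adjust them to match where you put your downloads.

```shell
# Run RecordLoader with no properties at all, just to see what happens.
# Jars are assumed to be in the current directory.
java -cp recordloader.jar:xcc.jar:xpp3.jar \
    com.marklogic.ps.RecordLoader
```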
OK, so we copied that command-line and pasted it into a shell. What happened?
Oh, that doesn't look good. Hmm... looks like we forgot to put recordloader.jar into the current directory. We'll do that, and try again...
That looks a little bit better: we see a startup message, which incidentally tells us which version of RecordLoader we have. That could be important to know.
But what about the error? What does
missing required property: ID_NAME mean?
ID_NAME is the name of a property. In Java, properties are
simply name-value pairs, written as name=value.
RecordLoader uses properties for almost everything:
let's learn how to configure those properties.
RecordLoader configures itself by looking at the System properties,
plus any property files that you supply on the command-line.
We'll learn about property files in a minute.
But for now, we just want to load a simple XML file.
To do this, we only need a couple of properties:
ID_NAME is the name of an element or an attribute
that RecordLoader can use as a unique ID for the input XML.
RecordLoader will use this unique ID
as part of the final document URI.
If you use an attribute, put
@ in front of its name.
For example, if you are processing Medline XML,
you might set ID_NAME=PMID.
For XML that uses an attribute named 'id',
you would set ID_NAME=@id.
There's also a special value that tells RecordLoader
to make up its own ids for the documents it inserts,
starting with 1 and counting up.
Use it only with caution, since it
gives you very little control over your final document URIs.
Next, RecordLoader needs to know how to connect to MarkLogic Server:
that's where CONNECTION_STRING comes in.
The default value will work, as long as you've set up an XDBC server
that listens on port 9000, points at the right database,
and has a user named 'admin' (password 'admin')
with enough privileges to write to the database.
I'm going to use the default value for this tutorial,
but if you need to use another username, password, host, or port,
then you'll have to set CONNECTION_STRING yourself.
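For example, a non-default connection string might look like this. The credentials, host, and port are placeholders; xcc:// is the standard XCC URL scheme.

```
CONNECTION_STRING=xcc://myuser:mypass@somehost:9015
```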
If you're looking at the required properties in the README file,
you might notice that INPUT_PATH isn't required,
and defaults to null. If it isn't set, RecordLoader
looks for XML on the standard input stream. This allows
you to uncompress a large file and pipe the resulting XML
directly to RecordLoader. You can pipe the result of
any command to RecordLoader in this way;
note that the resulting XML must look like a single document.
See the "Wikipedia" use-case, below, for an example of this technique.
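For instance, a sketch of piping uncompressed XML straight in. The file name and the main class name are assumptions; adjust them to match your content and classpath.

```shell
# Uncompress on the fly and stream the XML to RecordLoader's stdin.
# medline.xml.gz is a placeholder file name.
gunzip -c medline.xml.gz \
  | java -cp recordloader.jar:xcc.jar:xpp3.jar \
        com.marklogic.ps.RecordLoader
```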
Now that we have a working configuration,
let's try it out. We'll use Medline content,
sample data from the National Library of Medicine.
You can download a sample file from the NLM.
The sample file,
medsamp2006.xml, is about 372KB,
and contains 87 MedlineCitation elements.
We want to break it up so that every MedlineCitation is a new document
- and that's exactly what RecordLoader was designed to do.
Looking at the structure of medsamp2006.xml,
we decide that the unique element is PMID.
So here's how we can run RecordLoader:
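A sketch of that invocation, passing the two required properties as Java system properties (RecordLoader reads system properties, as noted above; the main class name is an assumption):

```shell
# ID_NAME names the unique-id element; INPUT_PATH points at the sample file.
java -cp recordloader.jar:xcc.jar:xpp3.jar \
    -DID_NAME=PMID \
    -DINPUT_PATH=medsamp2006.xml \
    com.marklogic.ps.RecordLoader
```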
That was pretty quick. What happened?
First, we saw some messages from RecordLoader, telling us that it's using UTF-8 for both input and output character encoding. Next, we saw the full path to the Medline XML sample. Toward the end, we saw a progress message every 3 seconds - and at the very end, we saw confirmation that we loaded 88 new documents into the database.
RecordLoader looks at its input XML for new "records".
Each record is simply an XML element that wants to be inserted
into the MarkLogic database as a new document.
The default behavior is to break up the input XML
based on the name of the first element it sees below the root element:
in our simple test, that was the element named
which was exactly what we wanted.
Didn't the NLM say that there were 87 sample records? It turns out that they lied: there are 88. We can confirm this with a simple XQuery:
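A minimal query along those lines, assuming the documents were loaded into an otherwise empty database (in MarkLogic, doc() with no arguments returns every document):

```
count(doc())
```

This should return 88 if the database held nothing else before the load.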
That's interesting: the new document URIs all start with 'medsamp2006/'.
That's because RecordLoader automatically prefixes every document
with the base name of the input file: in this case, that's medsamp2006.
If we set
OUTPUT_URI_PREFIX=/foo/, all the URIs
will start with '/foo/medsamp2006/'.
If we set OUTPUT_URI_SUFFIX=.xml,
every new URI will end with '.xml'.
If we set up a WebDAV server with a root of 'medsamp2006/', we can view these documents via any WebDAV client. Or we could query by URI. Either way, let's look at the first document:
There's one more wrinkle here:
if we look at the sample records,
we find that the PMID element usually appears
as a direct child of the record element,
but sometimes it's also a lower-level descendant,
two levels further down.
That's ok: the
PMID that we want, and that RecordLoader will use,
always comes immediately after the record's start tag.
What if we have more than one XML file? The full Medline distribution has hundreds of files, each with up to 30,000 MedlineCitation elements. Some are 100MB of XML, or more. But RecordLoader handles them easily, because it uses a pull-based parser to stream through the input. Let's tell RecordLoader to load everything in a directory:
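As a sketch, relying on the shell to expand the glob. This assumes RecordLoader accepts XML file names as command-line arguments, which the shell-expansion caveat below suggests; the directory path is a placeholder.

```shell
# The shell expands the glob into one argument per file.
java -cp recordloader.jar:xcc.jar:xpp3.jar \
    -DID_NAME=PMID \
    com.marklogic.ps.RecordLoader /medline/*.xml
```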
If you have too many files, though, shell expansion will fail.
In that case, we can use another property:
point INPUT_PATH at the directory, and RecordLoader will find and load every XML file in it,
as long as the filename ends with '.xml'.
We can match other filename patterns by setting
INPUT_PATTERN to a (Java-style) regular expression.
Also, we want to skip any existing documents with the same URI, in case we're resuming an interrupted load.
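Collected as a sketch, the directory-load properties might look like this. The path is a placeholder, and the exact values are assumptions based on the description above.

```
ID_NAME=PMID
# load every file under this directory...
INPUT_PATH=/medline
# ...whose name matches this Java-style regular expression
INPUT_PATTERN=.*\.xml
# skip URIs that already exist, in case we're resuming
SKIP_EXISTING=true
```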
Whew - that's getting to be too ugly. Let's clean it up.
We can put all the properties in a file.
We'll call it recordloader.properties:
we could use pretty much any name, but it should end with ".properties".
Now we might want to create a shell script to hold the java command
and that ugly classpath. Let's call it recordloader.sh.
We can use
chmod +x recordloader.sh to make it executable,
and now our command-line is simple:
Ah, that's better. Now I can move all the jar files into
the directory where I keep all my third-party jars.
Did you notice the other change?
We added THREADS=4 to the new properties file.
Now RecordLoader will start a thread for each input file,
up to 4 threads at a time.
But loading all of Medline will still take a little while: probably hours, maybe days. What happens if the power goes off? We can resume from where we left off!
At the most basic level, we can use SKIP_EXISTING
to skip past any existing documents.
But with SKIP_EXISTING, RecordLoader will query the database
for every new URI it generates.
That's better than nothing, but not as fast as we'd like.
If we look in the RecordLoader output, though,
we can find the last PMID that was successfully loaded.
Then we set
START_ID to that value:
RecordLoader will process XML, without inserting anything, until it reaches the supplied id,
and then it will start inserting documents.
It's best to resume using both
SKIP_EXISTING and START_ID, so that you minimize
the amount of work to be done.
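As a sketch, the resume-related properties. The START_ID value is purely illustrative; use the last PMID from your own log output.

```
# query the database before inserting each URI
SKIP_EXISTING=true
# fast-forward through the input until this id is seen
START_ID=16500000
```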
Note that it isn't safe to use START_ID with multiple threads: the threads won't know which file contains the START_ID record, which causes problems. If you're interested in enhancing RecordLoader to solve this problem, patches are welcome.
Let's look at an uglier case: Wikipedia.
The Wikipedia project
is one of the largest sources of freely available content on the net,
and database dumps are available for download.
The largest dump file
contains the complete history of every article,
plus discussion and user pages.
The bzip2-compressed XML download for enwiki-20060518 is 33 GB,
and the uncompressed size is over 300 GB.
That's a lot of XML.
Can RecordLoader handle it? Yes, but we need to be careful. First, let's take a quick look at the first few lines of the XML:
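The shape of the input is roughly this (a hand-written sketch, not the actual dump; the namespace URI and element content are illustrative):

```xml
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <siteinfo>
    <sitename>Wikipedia</sitename>
  </siteinfo>
  <page>
    <title>Accessible computing</title>
    <revision>...</revision>
  </page>
  <page>
    ...
  </page>
</mediawiki>
```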
We decide that each
page element will become a document,
and that the URIs will be based on the title element.
But there's also a
siteinfo element, which is a sibling of
page, and appears before it in the document.
If we don't do something about that,
RecordLoader will automatically decide to use siteinfo
as the document root, and then complain when it sees page
at the same level of the input XML.
We can fix this using two new properties:
RECORD_NAME tells RecordLoader which element to use
as the output document root element, and
IGNORE_UNKNOWN tells RecordLoader to ignore
elements that don't match RECORD_NAME.
Note that all the Wikipedia content is in a special namespace, too,
so we have to set RECORD_NAMESPACE
as well as RECORD_NAME.
Our wikipedia.properties file is:
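A sketch of what that file might contain. Property names beyond RECORD_NAME and IGNORE_UNKNOWN, and the namespace value, are assumptions based on the surrounding description.

```
# still using the default connection, so this stays commented out
#CONNECTION_STRING=xcc://admin:admin@localhost:9000
ID_NAME=title
RECORD_NAME=page
IGNORE_UNKNOWN=true
# the input namespace (value illustrative - use the one from your dump)
RECORD_NAMESPACE=http://www.mediawiki.org/xml/export-0.3/
# put every new document in this collection
OUTPUT_COLLECTIONS=wikipedia
THREADS=4
```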
We added an XCC connection string, but it's commented out because
we're still using the default value, from before.
Given this configuration, RecordLoader will insert each new document
as a member of the collection 'wikipedia', too.
And did you notice the namespace override?
That property will actually override the existing namespace in the input XML
with our own choice of namespace.
Why would we want to do that? Just to show that we can.
But what about INPUT_PATH?
How will RecordLoader find the input XML?
It turns out that if we omit any input path,
RecordLoader tries to read from standard input.
This is especially handy for this giant compressed XML file,
because I don't want to actually uncompress it onto my disk.
Instead, I can pipe the output of bzip2
directly into RecordLoader.
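A sketch of that pipeline. The dump file name is an assumption based on the enwiki-20060518 download mentioned above, and the main class name is assumed as before.

```shell
# Decompress to stdout and stream straight into RecordLoader's stdin.
bzip2 -dc enwiki-20060518-pages-meta-history.xml.bz2 \
  | java -cp recordloader.jar:xcc.jar:xpp3.jar \
        com.marklogic.ps.RecordLoader wikipedia.properties
```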
Don't try to run that command just yet, though: you'll almost certainly
hit a fatal error.
This is because our input
page elements are sometimes very large.
Since the input pages are in alphabetical order,
the load runs fine for a while before we hit the first problem.
It turns out that some pages are very controversial,
and undergo "revision wars".
This causes those page elements
to contain lots of revision elements,
which makes some pages too large to fit into MarkLogic Server's
default memory limits.
We could increase those limits, but in cases like this,
it makes more sense to break up the documents somehow.
Now, we don't want to create a separate document for every revision:
that would make it difficult to retain the page-level data.
Instead, we tell MarkLogic Server to fragment the document.
We won't go into the details of fragments here:
just remember to create a fragment root on the revision element
before you start RecordLoader.
Note that loading all of wikipedia will take a while.
This time, we can't use
THREADS to speed it up,
either. That's because RecordLoader only knows how to
spawn threads for input files, and this XML is arriving through
standard input, which is roughly the same thing as having
just one input file.
We hope this tutorial was useful. For more information about RecordLoader, see the README.
Now go load some content!
Your classpath isn't finding one
of the jar files. Examine it carefully, and make sure the paths are correct.
I'm running Windows, and RecordLoader is ignoring my INPUT_PATH.
Try escaping all the back-slash characters in your path:
for example, change c:\medline to c:\\medline.
Alternatively, change back-slashes to forward-slashes.