Note: Starting with MarkLogic 6, MarkLogic Content Pump (mlcp) is a fully-supported tool that covers the same ground as this long-standing open source project. Content Pump is not supported on older versions of MarkLogic Server, so stick with RecordLoader (and this tutorial) if you are running earlier versions of MarkLogic.
When you first learned to use XQuery and MarkLogic Server, you probably brought XML into the database using xdmp:load() or xdmp:document-load(). Or maybe you prefer to use WebDAV: just drag and drop. But sometimes those tools aren't enough. Sometimes you need RecordLoader.
Use RecordLoader when...
RecordLoader can do all this quickly and easily. Along the way, it can...
RecordLoader is a Java program, and requires a Java 1.5 environment. If you don't have Java 1.5, download and install it.
First, we'll need RecordLoader itself. So download recordloader.jar.
Since RecordLoader works by connecting to a MarkLogic Server, you'll need the MarkLogic XCC libraries, too. You can download the MarkXCC.Java zip archive here. Be sure to download the latest XCC for Java release! After you download the zip archive, unpack it and find the jar files. You can put xcc.jar anywhere on your disk, but for this tutorial I'll assume that both files are in the current directory.
Finally, you'll need a copy of XPP3. You can download it from this site. Again, be sure to get the latest version: right now, that's xpp3-1.1.3_8.jar.
Now that we have a Java environment and all the libraries we need, let's try it out with the simplest possible invocation.
Java command lines can get pretty ugly. I'll keep using them for this tutorial, but you might want to put all of that into a shell script (or a batch file, on Windows). Remember, I put all my jar files in the current directory. You might have used another location: if so, just change the classpath in your command-line to match.
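The original listing was lost in formatting; here is a sketch of what the simplest possible invocation looks like. The jar file names come from the downloads above, but the main class name (com.marklogic.ps.RecordLoader) is an assumption; check the README for the exact name in your release, and adjust the classpath to match where you put the jars.

```shell
# A minimal sketch: assumes all three jars are in the current
# directory, and that the main class is com.marklogic.ps.RecordLoader.
java -cp recordloader.jar:xcc.jar:xpp3-1.1.3_8.jar \
  com.marklogic.ps.RecordLoader
```

On Windows, use ';' instead of ':' as the classpath separator.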
OK, so we copied that command-line and pasted it into a shell. What happened?
Oh, that doesn't look good. Hmm... looks like we forgot to put recordloader.jar into the current directory. We'll do that, and try again...
That looks a little bit better: we see a startup message, which incidentally tells us which version of RecordLoader we have. That could be important to know.
But what about the error? What does missing required property: ID_NAME mean? ID_NAME is the name of a property. In Java, properties are simply name-value pairs, such as SIZE=1 or PATH=/lib/java. RecordLoader uses properties for almost everything: let's learn how to configure those properties.
RecordLoader configures itself by looking at the System properties, plus any property files that you supply on the command-line. We'll learn about property files in a minute. But for now, we just want to load a simple XML file. To do this, we only need a couple of properties: ID_NAME and CONNECTION_STRING.
ID_NAME is the name of an element or an attribute that RecordLoader can use as a unique ID for the input XML. RecordLoader will use this unique ID as part of the final document URI. If you use an attribute, put @ in front of its name. For example, if you are processing Medline XML, you might set ID_NAME=PMID. For XML that uses an attribute named 'id', you would set ID_NAME=@id.
There's also a special value, ID_NAME=#AUTO, which tells RecordLoader to make up its own IDs for the documents it inserts, starting with 1 and counting up. Use this with caution, since it gives you very little control over your final document URIs.
Next, RecordLoader needs to know how to connect to MarkLogic Server: that's where CONNECTION_STRING comes in. The default value is xcc://admin:admin@localhost:9000. That will work, as long as you've set up an XDBC server that listens on port 9000, points at the right database, and has a user named 'admin' (password 'admin') with enough privileges to write to the database. I'm going to use the default value for this tutorial, but if you need to use another username, password, host, or port, then you'll have to set CONNECTION_STRING.
If you're looking at the required properties in the README file, you might notice that INPUT_PATH isn't required, and defaults to null. If it isn't set, RecordLoader looks for XML on the standard input stream. This allows you to uncompress a large file and pipe the resulting XML directly to RecordLoader. You can pipe the result of any command to RecordLoader in this way: note that the resulting XML must look like a single document to RecordLoader. See the "Wikipedia" use-case, below, for an example of this technique.
Now that we have a working configuration, let's try it out. We'll use Medline content from the National Library of Medicine. You can download a sample file here. The sample file, medsamp2006.xml, is about 372KB, and contains 87 MedlineCitation elements. We want to break it up so that every MedlineCitation is a new document, and that's exactly what RecordLoader was designed to do. Looking at the structure of medsamp2006.xml, we decide that the unique element is PMID. So here's how we can run RecordLoader:
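The command listing was lost; a sketch of that invocation follows. Properties are passed as Java system properties with -D, and the sample file is named as a command-line argument (the main class name is an assumption; check your README).

```shell
# Sketch: set ID_NAME with a -D system property; CONNECTION_STRING
# keeps its default value, so only the input file needs naming.
java -cp recordloader.jar:xcc.jar:xpp3-1.1.3_8.jar \
  -DID_NAME=PMID \
  com.marklogic.ps.RecordLoader medsamp2006.xml
```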
That was pretty quick. What happened?
First, we saw some messages from RecordLoader, telling us that it's using UTF-8 for both input and output character encoding. Next, we saw the full path to the Medline XML sample. Toward the end, we saw a progress message every 3 seconds - and at the very end, we saw confirmation that we loaded 88 new documents into the database.
RecordLoader looks at its input XML for new "records". Each record is simply an XML element that wants to be inserted into the MarkLogic database as a new document. The default behavior is to break up the input XML based on the name of the first element it sees below the root element: in our simple test, that was the element named MedlineCitation, which was exactly what we wanted.
Didn't the NLM say that there were 87 sample records? It turns out that they lied: there are 88. We can confirm this with a simple XQuery:
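The query itself was lost; a minimal sketch of such a confirmation query, run in an XQuery session against the target database (and assuming the database held no other documents before the load), might look like this:

```xquery
(: Count the loaded documents, then list their URIs :)
count(doc()),
for $d in doc()
return xdmp:node-uri($d)
```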
That's interesting: the new document URIs all start with 'medsamp2006/'. That's because RecordLoader automatically prefixes every document with the base name of the input file: in this case, that's medsamp2006. If we set OUTPUT_URI_PREFIX=/foo/, all the URIs will start with '/foo/medsamp2006/'. If we set OUTPUT_URI_SUFFIX=.xml, every new URI will end with '.xml'.
If we set up a WebDAV server with a root of 'medsamp2006/', we can view these documents via any WebDAV client. Or we could query by URI. Either way, let's look at the first document:
There's one more wrinkle here: if we look at medsamp2006.xml, we find that the PMID element appears as a child of MedlineCitation, but sometimes it's also a lower-level descendant, two levels underneath MedlineCitation/CommentsCorrections. That's OK: the PMID that we want, and that RecordLoader will use, always comes immediately after the MedlineCitation element.
What if we have more than one XML file? The full Medline distribution has hundreds of files, each with up to 30,000 MedlineCitation elements. Some are 100MB of XML, or more. But RecordLoader handles them easily, because it uses a pull-based parser to stream through the input. Let's tell RecordLoader to load everything in a directory:
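A sketch of loading a whole directory, relying on shell expansion (the directory name is hypothetical; the main class name is an assumption, as above):

```shell
# Sketch: the shell expands medline/*.xml into one argument per file.
java -cp recordloader.jar:xcc.jar:xpp3-1.1.3_8.jar \
  -DID_NAME=PMID \
  com.marklogic.ps.RecordLoader medline/*.xml
```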
If you have too many files, though, shell expansion will fail. In that case, we can use another property, INPUT_PATH. RecordLoader will find and load every XML file in INPUT_PATH, as long as it ends with .xml. We can match other filename patterns by setting INPUT_PATTERN to a (Java-style) regular expression. Also, we want to skip any existing documents with the same URI, in case we're resuming an interrupted load.
Whew - that's getting to be too ugly. Let's clean it up. We can put all the properties in a file. We'll call it medline.properties: we could use pretty much any name, but it should end with ".properties".
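The file listing was lost; a sketch of medline.properties follows. The INPUT_PATH value is hypothetical, and remember that backslash is an escape character in Java properties files, so regular expressions need doubling.

```properties
# medline.properties - a sketch; INPUT_PATH is a hypothetical location
ID_NAME=PMID
CONNECTION_STRING=xcc://admin:admin@localhost:9000
INPUT_PATH=/data/medline
# optional: match filenames other than *.xml (note the doubled backslash)
#INPUT_PATTERN=.+\\.xml
SKIP_EXISTING=true
THREADS=4
```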
Now we might want to create a shell script to hold the java command and that ugly classpath. Let's call it recordloader.sh:
We can use chmod +x recordloader.sh to make it executable, and now our command-line is simple:
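Something like this, assuming the script and the properties file sketched above:

```shell
./recordloader.sh medline.properties
```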
Ah, that's better. Now I can move all the jar files into a common $HOME/lib/java directory, where I keep all my third-party jars. Did you notice the other change? We added THREADS=4 to the new properties file. Now RecordLoader will start a thread for each input file, up to 4 threads at a time.
But loading all of Medline will still take a little while: probably hours, maybe days. What happens if the power goes off? We can resume from where we left off!
At the most basic level, we can use SKIP_EXISTING=true to skip past any existing documents. But using SKIP_EXISTING, RecordLoader will query the database for every new URI it generates. That's better than nothing, but not as fast as we'd like. If we look in the RecordLoader output, though, we can find out the last PMID that was successfully loaded. Then we set START_ID to that value: RecordLoader will process XML until it reaches the supplied value, and then it will start inserting documents. It's best to resume using both SKIP_EXISTING=true and START_ID, so that you minimize the amount of work to be done.
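For example, we might add two lines to the properties file (the PMID value here is purely hypothetical; use the last ID reported in your own RecordLoader output):

```properties
# Resume an interrupted load: skip records until this hypothetical
# PMID is reached, and double-check URIs against the database.
START_ID=16403722
SKIP_EXISTING=true
```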
Note that it isn't safe to use START_ID with multiple threads: the threads won't know which file contains the START_ID record, which causes problems. If you're interested in enhancing RecordLoader to solve this problem, patches are welcome.
Let's look at an uglier case: Wikipedia. The Wikipedia project is one of the largest sources of freely available content on the net, and database dumps can be downloaded from http://download.wikimedia.org/. The largest file is pages-meta-history.xml: it contains the complete history of every article, plus discussion and user pages. The bzip2-compressed XML download for enwiki-20060518 is 33 GB, and the uncompressed size is over 300 GB. That's a lot of XML.
Can RecordLoader handle it? Yes, but we need to be careful. First, let's take a quick look at the first few lines of the XML:
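The excerpt was lost in formatting; the top of a MediaWiki export file looks roughly like this (the namespace version, 0.3 here, varies by dump vintage, and the title and id values are illustrative):

```xml
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    ...
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <id>10</id>
    <revision>...</revision>
    ...
  </page>
  ...
</mediawiki>
```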
We decide that each page element will become a document, and that the URIs will be based on the page/id element. But there's also a siteinfo element, which is a sibling of page, and appears before it in the document. If we don't do something about that, RecordLoader will automatically decide to use siteinfo as the document root, and then complain when it sees page at the same level of the input XML.
We can fix this using two new properties: RECORD_NAME tells RecordLoader which element to use as the output document root element, and IGNORE_UNKNOWN tells RecordLoader to ignore elements that don't match RECORD_NAME. Note that all the Wikipedia content is in a special namespace, too, so we have to set RECORD_NAMESPACE as well as RECORD_NAME.
So our wikipedia.properties file is:
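The listing was lost; a sketch follows. The export namespace version and the collections property name (OUTPUT_COLLECTIONS) are assumptions; check your dump's root element and the RecordLoader README.

```properties
# wikipedia.properties - a sketch; namespace version and the
# OUTPUT_COLLECTIONS property name are assumptions
ID_NAME=id
RECORD_NAME=page
RECORD_NAMESPACE=http://www.mediawiki.org/xml/export-0.3/
IGNORE_UNKNOWN=true
DEFAULT_NAMESPACE=org.wikipedia.content
OUTPUT_COLLECTIONS=wikipedia
#CONNECTION_STRING=xcc://admin:admin@localhost:9000
```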
We added an XCC connection string, but it's commented out because we're still using the default value, from before. Given this configuration, RecordLoader will insert each new document as a member of the collection 'wikipedia', too. And did you notice the DEFAULT_NAMESPACE? That will actually override the existing namespace in the input XML with our own choice of namespace. Why would we want to do that? Just to show that we can.
But what about INPUT_PATH and INPUT_PATTERN? How will RecordLoader find the input XML? It turns out that if we omit any input path, RecordLoader tries to read from standard input. This is especially handy for this giant compressed XML file, because I don't want to actually uncompress it onto my disk. Instead, I can pipe the output of bunzip2 directly into RecordLoader.
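The pipeline was lost; a sketch follows, using the jar names and (assumed) main class from earlier. With no INPUT_PATH set, RecordLoader reads the decompressed stream from standard input.

```shell
# Sketch: decompress to stdout and stream straight into RecordLoader.
bunzip2 -c pages-meta-history.xml.bz2 | \
  java -cp recordloader.jar:xcc.jar:xpp3-1.1.3_8.jar \
  com.marklogic.ps.RecordLoader wikipedia.properties
```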
Don't try to run that command just yet, though: you'll almost certainly hit a fatal XDMP-FRAGTOOLARGE error. This is because our input page elements are sometimes very large. Since the input pages are in alphabetical order, we see the first problem when we reach the entry for Anarchy. It turns out that some pages are very controversial, and undergo "revision wars". This causes these page elements to contain lots of revision children, which makes some pages too large to fit into MarkLogic Server's default memory limits. We could increase those limits, but in cases like this, it makes more sense to break up the documents somehow.
Now, we don't want to create a separate document for every revision: that would make it difficult to retain the page-level data. Instead, we tell MarkLogic Server to fragment the document. We won't go into the details of fragments here: just remember to create a fragment root on namespace org.wikipedia.content and local-name revision, before you start RecordLoader.
Note that loading all of Wikipedia will take a while. This time, we can't use THREADS to speed it up, either. That's because RecordLoader only knows how to spawn threads for input files. This XML is arriving through standard input, which is roughly the same thing as having just one input file.
We hope this tutorial was useful. For more information about RecordLoader, see the README.
Now go load some content!
NoClassDefFoundError: Your classpath isn't finding one of the jar files. Examine it carefully, and make sure the paths are all accurate.
I'm running Windows, and RecordLoader is ignoring my INPUT_PATH property: Try escaping all the back-slash characters in your path. For example, change c:\foo to c:\\foo; alternatively, change back-slashes to forward-slashes.