RecordLoader: Ingesting XML with Knife, Fork, and Shovel

Michael Blakeley
Last updated 2010-08-25

Note: Starting with MarkLogic 6, MarkLogic Content Pump (mlcp) is a fully supported tool that covers the same ground as this long-standing open source project. Content Pump is not supported on older versions of MarkLogic Server. Stick with RecordLoader (and this tutorial) if you are running an earlier version of MarkLogic.

When you first learned to use XQuery and MarkLogic Server, you probably brought XML into the database using xdmp:load() or xdmp:document-load(). Or maybe you prefer to use WebDAV - just drag and drop. But sometimes those tools aren't enough. Sometimes you need RecordLoader.

Use RecordLoader when...

RecordLoader can do all this quickly and easily. Along the way, it can...

Setting up RecordLoader

RecordLoader is a Java program, and requires a Java 1.5 environment. If you don't have Java 1.5, download and install it.

First, we'll need RecordLoader itself. So download recordloader.jar.

Since RecordLoader works by connecting to a MarkLogic Server, you'll need the Mark Logic XCC libraries, too. You can download the MarkXCC.Java zip archive here. Be sure to download the latest XCC for Java release! After you download the zip archive, unpack it and find the jar files. You can put xcc.jar anywhere on your disk, but for this tutorial I'll assume that both files are in the current directory.

Finally, you'll need a copy of XPP3. You can download it from this site. Again, be sure to get the latest version: right now, that's xpp3-1.1.3_8.jar.

Running RecordLoader

Now that we have a Java environment and all the libraries we need, let's try it out with the simplest possible invocation.

java -cp recordloader.jar:xcc.jar:xpp3.jar \
  com.marklogic.ps.RecordLoader

Java command lines can get pretty ugly. I'll keep using them for this tutorial, but you might want to put all of that into a shell script (or a batch file, on Windows). Remember, I put all my jar files in the current directory. You might have used another location: if so, just change the classpath in your command-line to match.

OK, so we copied that command-line and pasted it into a shell. What happened?

java -cp recordloader.jar:xcc.jar:xpp3.jar \
  com.marklogic.ps.RecordLoader
Exception in thread "main" java.lang.NoClassDefFoundError: com/marklogic/ps/RecordLoader

Oh, that doesn't look good. Hmm... looks like we forgot to put recordloader.jar into the current directory. We'll do that, and try again...

java -cp recordloader.jar:xcc.jar:xpp3.jar \
  com.marklogic.ps.RecordLoader
Jun 7, 2006 2:06:40 PM com.marklogic.ps.RecordLoader main
INFO: RecordLoader starting, version 2006-06-07.1
Exception in thread "main" java.io.IOException: missing required property: ID_NAME
        at com.marklogic.ps.RecordLoader.main(RecordLoader.java:465)

That looks a little bit better: we see a startup message, which incidentally tells us which version of RecordLoader we have. That could be important to know.

But what about the error? What does missing required property: ID_NAME mean? ID_NAME is the name of a property. In Java, properties are simply name-value pairs, such as SIZE=1 or PATH=/lib/java. RecordLoader uses properties for almost everything: let's learn how to configure those properties.
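To make the idea of properties concrete, here is a minimal sketch of how a Java program can read a value passed with -D on the command line. The class name PropertyDemo and the helper require() are made up for illustration and are not part of RecordLoader; only the error text is modeled on the message we just saw.

```java
import java.util.Properties;

// Hypothetical demo class; not part of RecordLoader.
class PropertyDemo {
    // Return the named property, or fail with a message modeled on the one above.
    static String require(Properties props, String name) {
        String value = props.getProperty(name);
        if (value == null) {
            throw new IllegalStateException("missing required property: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        // A -DID_NAME=PMID option on the java command line lands in the System properties.
        Properties props = System.getProperties();
        System.out.println("ID_NAME = " + props.getProperty("ID_NAME", "(not set)"));
    }
}
```

Running this sketch with java -DID_NAME=PMID PropertyDemo would print the value PMID; without the -D option it prints the fallback text.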

RecordLoader Configuration

RecordLoader configures itself by looking at the System properties, plus any property files that you supply on the command-line. We'll learn about property files in a minute. But for now, we just want to load a simple XML file. To do this, we only need a couple of properties: ID_NAME and CONNECTION_STRING.

ID_NAME is the name of an element or an attribute that RecordLoader can use as a unique ID for the input XML. RecordLoader will use this unique ID as part of the final document URI. If you use an attribute, put @ in front of its name. For example, if you are processing Medline XML, you might set ID_NAME=PMID. For XML that uses an attribute named 'id', you would set ID_NAME=@id.

There's also a special value, ID_NAME=#AUTO, which tells RecordLoader to make up its own ids for the documents it inserts, starting with 1 and counting up. Use this with caution, since it gives you very little control over your final document URIs.

Next, RecordLoader needs to know how to connect to MarkLogic Server: that's where CONNECTION_STRING comes in. The default value is xcc://admin:admin@localhost:9000. That will work, as long as you've set up an XDBC server that listens on port 9000, points at the right database, and has a user named 'admin' (password 'admin') with enough privileges to write to the database. I'm going to use the default value for this tutorial, but if you need to use another username, password, host, or port, then you'll have to set CONNECTION_STRING.
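If you're unsure which part of a connection string is which, the standard java.net.URI class can pull it apart. This is only an illustrative sketch (the class ConnectionStringDemo is hypothetical); it says nothing about how RecordLoader or XCC parse the value internally.

```java
import java.net.URI;

// Hypothetical demo class; not part of RecordLoader or XCC.
class ConnectionStringDemo {
    // Split a connection string into its user, host, and port parts.
    static String describe(String connectionString) {
        URI uri = URI.create(connectionString);
        String userInfo = uri.getUserInfo();   // "user:password", or null if absent
        String user = (userInfo == null) ? "(none)" : userInfo.split(":", 2)[0];
        return "user=" + user + " host=" + uri.getHost() + " port=" + uri.getPort();
    }

    public static void main(String[] args) {
        // prints: user=admin host=localhost port=9000
        System.out.println(describe("xcc://admin:admin@localhost:9000"));
    }
}
```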

If you're looking at the required properties in the README file, you might notice that INPUT_PATH isn't required, and defaults to null. If it isn't set, RecordLoader looks for XML on the standard input stream. This allows you to uncompress a large file and pipe the resulting XML directly to RecordLoader. You can pipe the result of any command to RecordLoader in this way: note that the resulting XML must look like a single document to RecordLoader. See the "Wikipedia" use-case, below, for an example of this technique.

Use Case: Medline

Now that we have a working configuration, let's try it out. We'll use Medline content, from the National Library of Medicine. You can download a sample file here. The sample file, medsamp2006.xml is about 372KB, and contains 87 MedlineCitation elements. We want to break it up so that every MedlineCitation is a new document - and that's exactly what RecordLoader was designed to do.

Looking at the structure of medsamp2006.xml, we decide that the unique element is PMID. So here's how we can run RecordLoader:

java -cp recordloader.jar:xcc.jar:xpp3.jar \
  -DID_NAME=PMID \
  com.marklogic.ps.RecordLoader medsamp2006.xml
Jun 7, 2006 3:06:39 PM com.marklogic.ps.RecordLoader main
INFO: processing argument: medsamp2006.xml
Jun 7, 2006 3:06:39 PM com.marklogic.ps.RecordLoader main
INFO: RecordLoader starting, version 2006-06-07.1
logging to CONSOLE
logging to file simplelogger-%u-%g.log
Jun 7, 2006 3:06:40 PM com.marklogic.ps.SimpleLogger configureLogger
INFO: setting up logging for: com.marklogic.ps
Jun 7, 2006 3:06:40 PM com.marklogic.ps.RecordLoader getDecoder
INFO: using input encoding UTF-8
Jun 7, 2006 3:06:40 PM com.marklogic.ps.RecordLoader getDecoder
INFO: using malformed input action REPORT
Jun 7, 2006 3:06:40 PM com.marklogic.ps.RecordLoader main
INFO: using output encoding UTF-8
Jun 7, 2006 3:06:40 PM com.marklogic.ps.RecordLoader main
INFO: connecting to admin:admin@localhost:9000
Jun 7, 2006 3:06:40 PM com.marklogic.ps.RecordLoader handleFileInput
INFO: loading from /tmp/medsamp2006.xml
Jun 7, 2006 3:06:40 PM com.marklogic.ps.RecordLoader initialize
INFO: adding extra collection: com.marklogic.ps.RecordLoader.1149718000087
Jun 7, 2006 3:06:40 PM com.marklogic.ps.RecordLoader setFileBasename
INFO: using fileBasename = medsamp2006
Jun 7, 2006 3:06:40 PM com.marklogic.ps.RecordLoader displayProgress
INFO: inserted record 1 as medsamp2006/10540283 (631 ms, 7874 B, 2 tps, 12 kB/s)
Jun 7, 2006 3:06:41 PM com.marklogic.ps.RecordLoader handleFileInput
INFO: no files remaining
Jun 7, 2006 3:06:41 PM com.marklogic.ps.RecordLoader handleFileInput
INFO: waiting for thread 0
Jun 7, 2006 3:06:43 PM com.marklogic.ps.RecordLoader displayProgress
INFO: inserted record 40 as medsamp2006/14729922 (3634 ms, 174693 B, 11 tps, 47 kB/s)
Jun 7, 2006 3:06:44 PM com.marklogic.ps.RecordLoader finishMain
INFO: loaded 88 records ok (4842 ms, 368375 B, 18 tps, 74 kB/s)

That was pretty quick. What happened?

First, we saw some messages from RecordLoader, telling us that it's using UTF-8 for both input and output character encoding. Next, we saw the full path to the Medline XML sample. Toward the end, we saw a progress message every 3 seconds, and at the very end, we saw confirmation that we loaded 88 new documents into the database.

RecordLoader looks at its input XML for new "records". Each record is simply an XML element that wants to be inserted into the MarkLogic database as a new document. The default behavior is to break up the input XML based on the name of the first element it sees below the root element: in our simple test, that was the element named MedlineCitation, which was exactly what we wanted.
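The default splitting rule described above can be sketched in a few lines. This is not RecordLoader's code: it uses the JDK's built-in StAX pull parser (javax.xml.stream) instead of XPP3 so that it runs without extra jars, and the class name RecordSplitSketch and the tiny input document are made up. It only counts records; a real loader would also serialize and insert each one.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch; not RecordLoader's implementation.
class RecordSplitSketch {
    // Count elements directly below the root that share the name of the
    // first element seen at that level.
    static int countRecords(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int depth = 0;
            int count = 0;
            String recordName = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        if (recordName == null) {
                            recordName = reader.getLocalName();
                        }
                        if (recordName.equals(reader.getLocalName())) {
                            count++;
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<set>"
                + "<MedlineCitation><PMID>1</PMID></MedlineCitation>"
                + "<MedlineCitation><PMID>2</PMID></MedlineCitation>"
                + "</set>";
        System.out.println(countRecords(xml)); // prints 2
    }
}
```

Because the parser is pull-based, the sketch never holds more than the current event in memory, which is the property that lets this style of loader stream through very large files.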

Didn't the NLM say that there were 87 sample records? It turns out that they lied: there are 88. We can confirm this with a simple XQuery:

for $i at $x in /MedlineCitation
return text { $x, xdmp:node-uri($i) }
=>
1 medsamp2006/12230355
2 medsamp2006/12230384
3 medsamp2006/12742516
4 medsamp2006/12742518
5 medsamp2006/12742519
[ ...skipping a few... ]
84 medsamp2006/15206831
85 medsamp2006/10389168
86 medsamp2006/9634358
87 medsamp2006/11731716
88 medsamp2006/15278624

That's interesting: the new document URIs all start with 'medsamp2006/'. That's because RecordLoader automatically prefixes every document with the base name of the input file: in this case, that's medsamp2006. If we set OUTPUT_URI_PREFIX=/foo/, all the URIs will start with '/foo/medsamp2006/'. If we set OUTPUT_URI_SUFFIX=.xml, every new URI will end with '.xml'.
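Put together, the URI pattern we've observed is prefix, then file base name, then a slash, then the record id, then suffix. Here is a small sketch of that rule; the class UriSketch and its buildUri() method are hypothetical names that simply reproduce the URIs shown in this tutorial.

```java
// Hypothetical sketch of the URI pattern seen in the output above.
class UriSketch {
    // prefix + file base name + "/" + record id + suffix
    static String buildUri(String prefix, String fileBasename, String id, String suffix) {
        return prefix + fileBasename + "/" + id + suffix;
    }

    public static void main(String[] args) {
        // No prefix or suffix: prints medsamp2006/12230355
        System.out.println(buildUri("", "medsamp2006", "12230355", ""));
        // With both: prints /foo/medsamp2006/12230355.xml
        System.out.println(buildUri("/foo/", "medsamp2006", "12230355", ".xml"));
    }
}
```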

If we set up a WebDAV server with a root of 'medsamp2006/', we can view these documents via any WebDAV client. Or we could query by URI. Either way, let's look at the first document:

<MedlineCitation Owner="NLM" Status="MEDLINE">
  <PMID>12230355</PMID>
  <DateCreated>
    <Year>2002</Year>
    <Month>09</Month>
    <Day>16</Day>
  </DateCreated>
  <DateCompleted>
    <Year>2002</Year>
    <Month>09</Month>
    <Day>25</Day>
  </DateCompleted>
  <DateRevised>
    <Year>2005</Year>
    <Month>11</Month>
    <Day>17</Day>
  </DateRevised>
  <Article PubModel="Print">
    <Journal>
      <ISSN IssnType="Electronic">1539-3704</ISSN>
[ ...etc ]

There's one more wrinkle here: if we look at medsamp2006.xml, we find that the PMID element appears as a child of MedlineCitation, but sometimes it's also a lower-level descendant, two levels underneath MedlineCitation/CommentsCorrections. That's OK: the PMID that we want, and that RecordLoader will use, always comes first, immediately after the opening MedlineCitation tag.

What if we have more than one XML file? The full Medline distribution has hundreds of files, each with up to 30,000 MedlineCitation elements. Some are 100MB of XML, or more. But RecordLoader handles them easily, because it uses a pull-based parser to stream through the input. Let's tell RecordLoader to load everything in a directory:

java -cp recordloader.jar:xcc.jar:xpp3.jar \
  -DID_NAME=PMID \
  com.marklogic.ps.RecordLoader /tmp/medline/*.xml

If you have too many files, though, shell expansion will fail. In that case, we can use another property, INPUT_PATH. RecordLoader will find and load every XML file in INPUT_PATH, as long as it ends with .xml. We can match other filename patterns by setting INPUT_PATTERN to a (Java-style) regular expression.
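If Java-style regular expressions are new to you, this sketch shows how a pattern can select file names. It assumes the pattern must match the whole file name (Matcher.matches()); whether RecordLoader applies INPUT_PATTERN exactly this way is an assumption, and the class InputPatternSketch and the file names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; the whole-name matching rule is an assumption.
class InputPatternSketch {
    // Keep only the names that the regular expression matches in full.
    static List<String> select(List<String> names, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> selected = new ArrayList<String>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("medline1.xml", "notes.txt", "medline2.xml.gz");
        // In Java source the backslash is doubled; the regex itself is .+\.xml
        System.out.println(select(names, ".+\\.xml")); // prints [medline1.xml]
    }
}
```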

Also, we want to skip any existing documents with the same URI, in case we're resuming an interrupted load.

java -cp recordloader.jar:xcc.jar:xpp3.jar \
  -DID_NAME=PMID -DINPUT_PATH=/tmp/medline -DSKIP_EXISTING=true \
  com.marklogic.ps.RecordLoader

Whew - that's getting to be too ugly. Let's clean it up. We can put all the properties in a file. We'll call it medline.properties: we could use pretty much any name, but it should end with ".properties".

ID_NAME=PMID
INPUT_PATH=/tmp/medline
SKIP_EXISTING=true
THREADS=4

Now we might want to create a shell script to hold the java command and that ugly classpath. Let's call it recordloader.sh:

#!/bin/sh
#
CP=$HOME/lib/java/recordloader.jar
CP=$CP:$HOME/lib/java/xcc.jar
CP=$CP:$HOME/lib/java/xpp3.jar
# "$@" passes each argument through intact, even if it contains spaces
java -cp "$CP" com.marklogic.ps.RecordLoader "$@"
# end recordloader.sh

We can use chmod +x recordloader.sh to make it executable, and now our command-line is simple:

./recordloader.sh medline.properties

Ah, that's better. Now I can move all the jar files into a common $HOME/lib/java directory, where I keep all my third-party jars.

Did you notice the other change? We added THREADS=4 to the new properties file. Now RecordLoader will start a thread for each input file, up to 4 threads at a time.
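The "one thread per input file, at most four at once" behavior is the classic fixed-size thread pool. The sketch below shows that pattern with the standard java.util.concurrent classes; it is an illustration of the idea, not RecordLoader's own threading code, and the class ThreadsSketch and the placeholder "loaded" result are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of one task per input file, with a bounded number of threads.
class ThreadsSketch {
    static List<String> loadAll(List<String> files, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Queue one task per file; the pool runs at most 'threads' at a time.
            List<Future<String>> pending = new ArrayList<Future<String>>();
            for (final String file : files) {
                pending.add(pool.submit(new Callable<String>() {
                    public String call() {
                        // A real loader would parse the file and insert records here.
                        return "loaded " + file;
                    }
                }));
            }
            // Collect results in submission order, waiting for each task to finish.
            List<String> results = new ArrayList<String>();
            for (Future<String> task : pending) {
                results.add(task.get());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("load failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(loadAll(Arrays.asList("a.xml", "b.xml", "c.xml"), 4));
    }
}
```

Note that with this design a single input file never uses more than one thread, so extra threads only help when there are several files to load.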

But loading all of Medline will still take a little while: probably hours, maybe days. What happens if the power goes off? We can resume from where we left off!

At the most basic level, we can use SKIP_EXISTING=true to skip past any existing documents. But using SKIP_EXISTING, RecordLoader will query the database for every new URI it generates. That's better than nothing, but not as fast as we'd like.

If we look in the RecordLoader output, though, we can find out the last PMID that was successfully loaded. Then we set START_ID to that value: RecordLoader will process XML until it reaches the supplied value, and then it will start inserting documents. It's best to resume using both SKIP_EXISTING=true and START_ID, so that you minimize the amount of work to be done.
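Here is a small sketch of the START_ID idea: scan past records until the given id turns up, then start inserting. It assumes the record carrying START_ID is itself included (harmless when combined with SKIP_EXISTING=true); that detail, the class StartIdSketch, and the sample ids are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; inclusive handling of the start id is an assumption.
class StartIdSketch {
    // Return the ids that would be inserted, given the ids in input order.
    static List<String> idsToInsert(List<String> ids, String startId) {
        List<String> toInsert = new ArrayList<String>();
        boolean started = (startId == null); // no START_ID: insert everything
        for (String id : ids) {
            if (!started && id.equals(startId)) {
                started = true;
            }
            if (started) {
                toInsert.add(id);
            }
        }
        return toInsert;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("101", "102", "103", "104");
        System.out.println(idsToInsert(ids, "103")); // prints [103, 104]
    }
}
```

The sketch also hints at why several threads are a problem here: each thread would need to know whether the START_ID record lives in its own file or in someone else's.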

Note that it isn't safe to use START_ID with multiple threads: the threads won't know which file contains the START_ID record, which causes problems. If you're interested in enhancing RecordLoader to solve this problem, patches are welcome.

Use Case: Wikipedia

Let's look at an uglier case: Wikipedia. The Wikipedia project is one of the largest sources of freely available content on the net, and database dumps can be downloaded from http://download.wikimedia.org/. The largest file is pages-meta-history.xml: it contains the complete history of every article, plus discussion and user pages. The bzip2-compressed XML download for enwiki-20060518 is 33 GB, and the uncompressed size is over 300 GB. That's a lot of XML.

Can RecordLoader handle it? Yes, but we need to be careful. First, let's take a quick look at the first few lines of the XML:

$ bunzip2 --stdout enwiki-20060518-pages-meta-history.xml.bz2 | head -40
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd"
  version="0.3" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.7alpha</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2">Media</namespace>
      [ ...etc... ]
    </namespaces>
  </siteinfo>
  <page>
    <title>AaA</title>
    <id>1</id>
    <revision>
      <id>233181</id>
      <timestamp>2001-02-06T20:07:40Z</timestamp>
      <contributor>
        <username>JimboWales</username>
        <id>479</id>
      </contributor>
      <comment>*</comment>
[ ...etc... ]

We decide that each page element will become a document, and that the URIs will be based on the page/id element. But there's also a siteinfo element, which is a sibling of page, and appears before it in the document. If we don't do something about that, RecordLoader will automatically decide to use siteinfo as the document root, and then complain when it sees page at the same level of the input XML.

We can fix this using two new properties: RECORD_NAME tells RecordLoader which element to use as the output document root element, and IGNORE_UNKNOWN tells RecordLoader to ignore elements that don't match RECORD_NAME. Note that all the Wikipedia content is in a special namespace, too, so we have to set RECORD_NAMESPACE as well as RECORD_NAME. So our wikipedia.properties file is:

#CONNECTION_STRING=xcc://user:password@hostname:portnumber
ID_NAME=id
RECORD_NAME=page
RECORD_NAMESPACE=http://www.mediawiki.org/xml/export-0.3/
IGNORE_UNKNOWN=true
OUTPUT_COLLECTIONS=wikipedia
URI_PREFIX=/wikipedia/
URI_SUFFIX=.xml
DEFAULT_NAMESPACE=org.wikipedia.content

We added an XCC connection string, but it's commented out because we're still using the default value, from before. Given this configuration, RecordLoader will insert each new document as a member of the collection 'wikipedia', too. And did you notice the DEFAULT_NAMESPACE? That will actually override the existing namespace in the input XML with our own choice of namespace. Why would we want to do that? Just to show that we can.
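To see what the RECORD_NAME, RECORD_NAMESPACE, IGNORE_UNKNOWN, and ID_NAME settings are asking for, here is a sketch that walks a tiny MediaWiki-style document, skips siteinfo, and picks the first id inside each page (so page/id wins over revision/id). It uses the JDK's StAX parser rather than XPP3 and is not RecordLoader's code; the class RecordNameSketch and the sample document are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch; not RecordLoader's implementation.
class RecordNameSketch {
    // For each matching record below the root, collect the text of the
    // first idName element found inside it. Other siblings are ignored.
    static List<String> recordIds(String xml, String ns, String recordName, String idName) {
        List<String> ids = new ArrayList<String>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int depth = 0;
            boolean needId = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    boolean nsMatch = ns.equals(reader.getNamespaceURI());
                    if (depth == 2) {
                        // A new element below the root: is it a record?
                        needId = nsMatch && recordName.equals(reader.getLocalName());
                    } else if (needId && nsMatch && idName.equals(reader.getLocalName())) {
                        // getElementText() reads through the matching end tag,
                        // so the depth counter is adjusted by hand.
                        ids.add(reader.getElementText());
                        depth--;
                        needId = false;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String ns = "http://www.mediawiki.org/xml/export-0.3/";
        String xml = "<mediawiki xmlns=\"" + ns + "\">"
                + "<siteinfo><sitename>Wikipedia</sitename></siteinfo>"
                + "<page><title>AaA</title><id>1</id>"
                + "<revision><id>233181</id></revision></page>"
                + "</mediawiki>";
        System.out.println(recordIds(xml, ns, "page", "id")); // prints [1]
    }
}
```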

But what about INPUT_PATH and INPUT_PATTERN? How will RecordLoader find the input XML? It turns out that if we omit any input path, RecordLoader tries to read from standard input. This is especially handy for this giant compressed XML file, because I don't want to actually uncompress it onto my disk. Instead, I can pipe the output of bunzip2 directly into RecordLoader.

bunzip2 --stdout enwiki-20060518-pages-meta-history.xml.bz2 \
  | ./recordloader.sh wikipedia.properties

Don't try to run that command just yet, though: you'll almost certainly hit a fatal XDMP-FRAGTOOLARGE error. This is because our input page elements are sometimes very large. Since the input pages are in alphabetical order, we see the first problem when we reach the entry for Anarchy.

It turns out that some pages are very controversial, and undergo "revision wars". This causes these page elements to contain lots of revision children, which makes some pages too large to fit into MarkLogic Server's default memory limits. We could increase those limits, but in cases like this, it makes more sense to break up the documents somehow. Now, we don't want to create a separate document for every revision: that would make it difficult to retain the page-level data.

Instead, we tell MarkLogic Server to fragment the document. We won't go into the details of fragments here: just remember to create a fragment root on namespace org.wikipedia.content and local-name revision, before you start RecordLoader.

Note that loading all of Wikipedia will take a while. This time, we can't use THREADS to speed it up, either. That's because RecordLoader only knows how to spawn threads for input files. This XML is arriving through standard input, which is roughly the same thing as having just one input file.

Conclusion

We hope this tutorial was useful. For more information about RecordLoader, see the README.

Now go load some content!

Troubleshooting

NoClassDefFoundError: Your classpath isn't finding one of the jar files. Examine it carefully, and make sure the paths are all accurate.

I'm running Windows, and RecordLoader is ignoring my INPUT_PATH property: Try escaping all the back-slash characters in your path. For example, change c:\foo to c:\\foo - alternatively, change back-slashes to forward-slashes.
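Why does escaping help? In the Java property-file format, a backslash is an escape character, so a lone backslash in front of an ordinary letter is silently dropped. The sketch below shows this with the standard java.util.Properties class; it assumes the property file is read with Properties.load(), and the class BackslashDemo and the path c:\data are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical demo of backslash handling in Java property files.
class BackslashDemo {
    // Parse property-file text and return the INPUT_PATH value Java sees.
    static String loadInputPath(String propertyFileText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertyFileText));
        } catch (IOException e) {
            throw new IllegalStateException("could not read properties", e);
        }
        return props.getProperty("INPUT_PATH");
    }

    public static void main(String[] args) {
        // Java source doubles each backslash: "\\" here is one backslash in the file text.
        System.out.println(loadInputPath("INPUT_PATH=c:\\data"));   // file says c:\data,  prints c:data
        System.out.println(loadInputPath("INPUT_PATH=c:\\\\data")); // file says c:\\data, prints c:\data
        System.out.println(loadInputPath("INPUT_PATH=c:/data"));    // prints c:/data
    }
}
```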