Semantic Storage and Retrieval with MarkLogic Server

Note: MarkLogic 7 introduces support for semantic queries, which makes this toolkit obsolete. That support covers all of the capabilities in SPARQL 1.0 and many of the capabilities in SPARQL 1.1. MarkLogic includes rich support for triples and integrates triple queries with MarkLogic search, so you can build semantic applications that also take full advantage of search. For details, see the Semantics Developer's Guide.

If you are using MarkLogic 6 or earlier, you should stick with this toolkit.

This project includes XQuery and Java code for storage and retrieval of semantic content with MarkLogic Server, so that MarkLogic Server can be a triple store. This library also supports quads. The source code is available on the project page.

Loading Semantic Data in MarkLogic Server

This project includes a plugin class, NQLoader, which uses NXParser to extend RecordLoader. You will need a working installation of RecordLoader, plus the NQLoader classes and related XQuery code. You can download a jar file including the NQLoader class here.

Input Format

NQLoader expects input data as N-Triples or N-Quads. If you have RDF/XML, you can use the Raptor project's rapper tool to convert it to N-Triples.

RecordLoader can handle input from standard input, files, or zip archives. It is often convenient to use zip archives, since they take very little space on disk. With modern CPUs, RecordLoader can uncompress triples or quads from the zip archives very quickly.

Configuring MarkLogic Server

The semantic queries supported in this project's XQuery code require several range indexes.

As we will see below, querying semantic information in MarkLogic Server does not usually rely on full-text queries. It is therefore possible to save both disk space and memory by disabling all full-text indexes.

The next section discusses ingestion via NQLoader, and will also touch on application server configuration.

Running NQLoader

Here is a sample configuration file for RecordLoader and NQLoader.

CONFIGURATION_CLASSNAME=com.marklogic.semantic.Configuration
INPUT_PATTERN=^.+\\.nt$
THREADS=32
CONNECTION_STRING=http://admin:admin@host-1:8012/insert.xqy http://admin:admin@host-2:8012/insert.xqy
BATCH_SIZE=10

The first property, CONFIGURATION_CLASSNAME, tells RecordLoader to use the semantic plug-in for all configuration. This also overrides the built-in Loader class with the NQLoader from semantic.jar.

The next three properties are generic to any use of RecordLoader. INPUT_PATTERN tells RecordLoader to look for files or zip entries whose names end in .nt, the standard extension for N-Triples files, which this configuration also expects for N-Quads input. THREADS tells RecordLoader to start a pool of 32 worker threads. You may want more or fewer threads, depending on your server resources.
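To see which names a pattern like this accepts, here is a small Java sketch. It assumes the property value is used as a java.util.regex pattern and matched against the whole file or entry name; that is an assumption about RecordLoader, and the class InputPatternDemo is hypothetical. Note that the doubled backslash in the properties file is read as a single backslash.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only illustrates the regular expression itself.
class InputPatternDemo {

    // The value of INPUT_PATTERN once the properties file has been parsed:
    // the doubled backslash in the file collapses to one, giving ^.+\.nt$
    static final Pattern INPUT_PATTERN = Pattern.compile("^.+\\.nt$");

    // Assumption: the pattern is tested against the whole file or entry name.
    static boolean accepts(String name) {
        return INPUT_PATTERN.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("univ0.nt"));   // true
        System.out.println(accepts("univ0.nq"));   // false
        System.out.println(accepts("univ0.zip"));  // false
    }
}
```

If your input uses a different extension, the pattern in the properties file would need to change accordingly.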

The next property, CONNECTION_STRING, is also a generic part of RecordLoader, and supports automatic load-balancing across multiple hosts. However, NQLoader uses CONNECTION_STRING somewhat differently than the built-in loaders do. Instead of an XCC connection string, each value is the URL of an HTTP service to which NQLoader will POST an XML representation of the input semantic triples or quads (see Document Format, below). The insert.xqy module referred to by this connection string must be available to the application server, in this case the one running on port 8012 on host-1 and host-2. This code is available from GitHub: insert.xqy.
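The sample value above holds two URLs separated by a space. As a rough illustration of how such a value breaks down, the following Java sketch splits it on whitespace and parses each part with java.net.URI. The class ConnectionStrings is hypothetical, and splitting on whitespace is an assumption based on the sample; RecordLoader's own parsing may differ.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of RecordLoader or NQLoader.
class ConnectionStrings {

    // Splits a CONNECTION_STRING value on whitespace and parses each URL.
    static List<URI> parse(String value) {
        List<URI> uris = new ArrayList<>();
        for (String part : value.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                uris.add(URI.create(part));
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        String value = "http://admin:admin@host-1:8012/insert.xqy "
                + "http://admin:admin@host-2:8012/insert.xqy";
        for (URI uri : parse(value)) {
            // Each URL carries the credentials, host, port, and module path.
            System.out.println(uri.getHost() + " " + uri.getPort() + " " + uri.getPath());
        }
    }
}
```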

Finally, NQLoader can buffer an arbitrary number of triples or quads before sending them to the insert.xqy module. Since each triple or quad is small, this buffering helps to improve ingestion performance. The BATCH_SIZE property sets the number of triples or quads in each buffer. NQLoader also attempts to de-duplicate tuples on the fly, using an LRU-like cache of recently seen tuple data, and to send the same tuples deterministically to the same CONNECTION_STRING URLs, as long as the property remains constant.
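The de-duplication and routing ideas can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not NQLoader's implementation: the class TupleRouter, the cache size, and the use of String.hashCode() to pick an endpoint are all hypothetical choices.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; NQLoader's real classes and algorithms may differ.
// Not thread-safe: a loader with many worker threads would need locking.
class TupleRouter {

    private final List<String> endpoints;
    private final Map<String, Boolean> recent;

    TupleRouter(List<String> endpoints, final int cacheSize) {
        this.endpoints = endpoints;
        // Access-ordered LinkedHashMap that evicts its least recently used key.
        this.recent = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    // Returns false if the tuple is still in the cache, true otherwise.
    boolean isNew(String tuple) {
        // On a hit, get() also refreshes the entry's position in the access order.
        if (recent.get(tuple) != null) {
            return false;
        }
        recent.put(tuple, Boolean.TRUE);
        return true;
    }

    // The same tuple maps to the same endpoint while the endpoint list is unchanged.
    String endpointFor(String tuple) {
        int index = Math.floorMod(tuple.hashCode(), endpoints.size());
        return endpoints.get(index);
    }

    public static void main(String[] args) {
        TupleRouter router = new TupleRouter(Arrays.asList(
                "http://host-1:8012/insert.xqy",
                "http://host-2:8012/insert.xqy"), 1000);
        String tuple = "<s> <p> <o> .";
        System.out.println(router.isNew(tuple));  // true
        System.out.println(router.isNew(tuple));  // false
        System.out.println(router.endpointFor(tuple));
    }
}
```

Because the cache is bounded, a duplicate that arrives after its first copy has been evicted is treated as new, which is why this kind of de-duplication is only best effort.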

Once we have assembled the necessary tools for RecordLoader and NQLoader, and have configured MarkLogic Server with a database and an application server, we can load content. Here is a sample Java command line.

java -cp recordloader.jar:semantic.jar:nxparser.jar:xpp3.jar \
  com.marklogic.ps.RecordLoader \
  nqloader.properties \
  *.zip

Document Format

NQLoader sends each triple or quad to MarkLogic Server using a simplified XML representation.

Here is a sample element, based on a triple from the Lehigh University Benchmark (LUBM) corpus.

<t>
  <s>http://www.Department17.University3.edu/UndergraduateStudent276</s>
  <p>http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress</p>
  <o>UndergraduateStudent276@Department17.University3.edu</o>
</t>
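As a rough sketch of how a loader might produce the element above, the following Java builds it from three strings and escapes characters that are special in XML. The class TripleXml is hypothetical and is not NQLoader's serialization code. Quads are left out because the element that would hold the fourth value is not shown here.

```java
// Hypothetical sketch; NQLoader's own serialization code may differ.
class TripleXml {

    // Escapes &, <, and > so the value is safe as XML text content.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    // Builds a <t> element from a subject, a predicate, and an object.
    static String toXml(String subject, String predicate, String object) {
        return "<t>"
                + "<s>" + escape(subject) + "</s>"
                + "<p>" + escape(predicate) + "</p>"
                + "<o>" + escape(object) + "</o>"
                + "</t>";
    }

    public static void main(String[] args) {
        System.out.println(toXml(
                "http://www.Department17.University3.edu/UndergraduateStudent276",
                "http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress",
                "UndergraduateStudent276@Department17.University3.edu"));
    }
}
```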

By default, insert.xqy stores this element as properties of an empty document, a layout also known as "naked properties". You can use xdmp:document-properties() to inspect the properties after loading. For the sample element above, xdmp:document-properties() would return something like this:

<prop:properties xmlns:prop="http://marklogic.com/xdmp/property">
  <s>http://www.Department17.University3.edu/UndergraduateStudent276</s>
  <p>http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress</p>
  <o>UndergraduateStudent276@Department17.University3.edu</o>
  <prop:last-modified>2010-12-03T19:12:53-08:00</prop:last-modified>
</prop:properties>

You can change this default and insert triples as documents instead by modifying insert.xqy, but this is not recommended, because inserting triples as naked properties saves a fragment.

You may also insert triples or quads manually, using the sem:tuple-insert() function. The library function will automatically generate the document URI for each triple, and will generate XML similar to the example above.

Querying Semantic Data in MarkLogic Server

Now that we have some semantic data in the server, we can write simple queries that use XPath. For example:

/t
  [s eq 'http://www.Department17.University3.edu/UndergraduateStudent276']

/t
  [s eq 'http://www.Department17.University3.edu/UndergraduateStudent276']
  [p eq 'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress']
  /o

However, these XPath queries are not very interesting from a semantic point of view. This is where the semantic.xqy library becomes useful. The next sections provide a brief overview of this library. For more information about the join API, see the function reference.

Transitive Closure - Friend of a Friend

One common task in semantic applications is to perform transitive closure over an edge type. This is especially useful for queries involving a "friend of a friend" relationship. The sem:transitive-closure() function can follow a relationship for a number of generations, building a map that represents the matching network.

xquery version "1.0-ml";

import module namespace sem="http://marklogic.com/semantic"
 at "semantic.xqy";

let $m := map:map()
let $seeds := xdmp:get-request-field('seed')
let $filters as xs:string* := xdmp:get-request-field('filter')
let $gen := 6
return sem:transitive-closure(
  $m, $seeds, $gen,
  'http://xmlns.com/foaf/0.1/knows',
  true(), $filters
)
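The idea of following one edge type for a bounded number of generations can be illustrated in plain Java over an in-memory map of edges. This is only a breadth-first sketch of the concept: the class FoafClosure and its data are hypothetical, and it does not reproduce the behavior of sem:transitive-closure(), its filters, or its result map.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; it does not call or mirror the semantic.xqy code.
class FoafClosure {

    // Follows "knows" edges breadth-first from the seeds for at most
    // maxGenerations hops and returns everyone reached, seeds excluded.
    static Set<String> closure(Map<String, List<String>> knows,
                               Set<String> seeds, int maxGenerations) {
        Set<String> seen = new LinkedHashSet<>(seeds);
        Set<String> frontier = new LinkedHashSet<>(seeds);
        for (int gen = 0; gen < maxGenerations && !frontier.isEmpty(); gen++) {
            Set<String> next = new LinkedHashSet<>();
            for (String person : frontier) {
                List<String> friends =
                        knows.getOrDefault(person, Collections.<String>emptyList());
                for (String friend : friends) {
                    // add() returns false for people already visited.
                    if (seen.add(friend)) {
                        next.add(friend);
                    }
                }
            }
            frontier = next;
        }
        seen.removeAll(seeds);
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> knows = new HashMap<>();
        knows.put("alice", Arrays.asList("bob"));
        knows.put("bob", Arrays.asList("carol", "alice"));
        knows.put("carol", Arrays.asList("dave"));
        Set<String> seeds = Collections.singleton("alice");
        System.out.println(closure(knows, seeds, 2));  // [bob, carol]
    }
}
```

Tracking visited people keeps cycles such as alice and bob knowing each other from causing repeated work, and the generation limit bounds how far the search spreads.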

The semantic library also provides a sem:serialize() function, which can serialize the FOAF map to a flat text format.

Joins on Subject, Object, and Predicate

Let's look at a slightly more complex problem: this is query 1 from the above-mentioned LUBM benchmark workload. It returns all subjects that match a set of joins. Each matching subject must be of type GraduateStudent and must take a specific course.

xquery version "1.0-ml";

import module namespace sem="http://marklogic.com/semantic"
 at "semantic.xqy";

sem:subject-for-join(
  (sem:object-predicate-join(
    'http://www.Department0.University0.edu/GraduateCourse0',
    'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse'),
  sem:type-join(
    'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateStudent')))
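Conceptually, this query intersects two sets of subjects: those that take the course and those that have the required type. The Java sketch below shows that intersection over a small in-memory list of triples. It illustrates the join semantics only, not how semantic.xqy evaluates joins. The class SubjectJoin and the shortened sample values are hypothetical, and using rdf:type as the type predicate is an assumption about what sem:type-join() matches.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the join semantics; not the semantic.xqy code.
class SubjectJoin {

    // Assumption: type triples use the standard rdf:type predicate.
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Each triple is a String[3] holding subject, predicate, and object.
    static Set<String> subjectsWith(List<String[]> triples, String predicate, String object) {
        Set<String> subjects = new LinkedHashSet<>();
        for (String[] t : triples) {
            if (t[1].equals(predicate) && t[2].equals(object)) {
                subjects.add(t[0]);
            }
        }
        return subjects;
    }

    // Subjects that take the given course and also have the given type.
    static Set<String> studentsTaking(List<String[]> triples, String takesCourse,
                                      String course, String type) {
        Set<String> result = subjectsWith(triples, takesCourse, course);
        result.retainAll(subjectsWith(triples, RDF_TYPE, type));
        return result;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
                new String[] {"s1", "takesCourse", "course0"},
                new String[] {"s1", RDF_TYPE, "GraduateStudent"},
                new String[] {"s2", "takesCourse", "course0"},
                new String[] {"s2", RDF_TYPE, "UndergraduateStudent"});
        System.out.println(
                studentsTaking(triples, "takesCourse", "course0", "GraduateStudent"));  // [s1]
    }
}
```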