Note: MarkLogic 7 introduces support for semantic queries that makes this toolkit obsolete, including support for all of the capabilities in SPARQL 1.0 and many of the capabilities in SPARQL 1.1. MarkLogic includes rich support for triples, and integrates querying triples with MarkLogic search in ways that allow you to not only create interesting semantic applications, but to integrate them with search in a rich way. For details, see the Semantics Developer's Guide.
If you are using MarkLogic 6 or earlier, you should stick with this toolkit.
This project includes XQuery and Java code for storage and retrieval of semantic content with MarkLogic Server, so that MarkLogic Server can be a triple store. This library also supports quads. The source code is available on the project page.
This project includes a plugin class, NQLoader, which uses NXParser to extend RecordLoader. You will need a working installation of RecordLoader, plus the NQLoader classes and related XQuery code. You can download a jar file including the NQLoader class here.
NQLoader expects input data as N-Triples or N-Quads. If you have RDF/XML, you can use the Raptor project's rapper tool to convert it to N-Triples.
RecordLoader can handle input from standard input, files, or zip archives. It is often convenient to use zip archives, since they take very little space on disk. With modern CPUs, RecordLoader can uncompress triples or quads from the zip archives very quickly.
The semantic queries supported in this project's XQuery code require several range indexes.
s, as string, in the Unicode codepoint collation.
o, as string, in the Unicode codepoint collation.
p, as string, in the Unicode codepoint collation.
c, as string, in the Unicode codepoint collation.
As we will see below, querying semantic information in MarkLogic Server does not usually rely on full-text queries. It is therefore possible to save both disk space and memory by disabling all full-text indexes.
The next section discusses ingestion via NQLoader, and will also touch on application server configuration.
Here is a sample configuration file for RecordLoader and NQLoader.
CONFIGURATION_CLASSNAME=com.marklogic.semantic.Configuration
INPUT_PATTERN=^.+\\.nt$
THREADS=32
CONNECTION_STRING=http://admin:admin@host-1:8012/insert.xqy http://admin:admin@host-2:8012/insert.xqy
BATCH_SIZE=10
The first property, CONFIGURATION_CLASSNAME, tells RecordLoader to use the semantic plug-in for all configuration. This also overrides the built-in Loader class with the NQLoader from semantic.jar.
The next three properties are generic to any use of RecordLoader. INPUT_PATTERN tells RecordLoader to look for *.nt files or zip entries; this is the standard extension for N-Triple and N-Quad files. Next, THREADS tells RecordLoader to start a pool of 32 worker threads. You may want more threads, or fewer, depending on your server resources.
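To make the INPUT_PATTERN value concrete, here is a minimal Java sketch. It assumes the configuration file is read with java.util.Properties, which turns the doubled backslash in the file into a single backslash before the value is compiled as a regular expression. The class and method names are hypothetical and are not part of RecordLoader.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical helper, not RecordLoader code: shows how the INPUT_PATTERN
// line from a properties file becomes a compiled regular expression.
class InputPatternDemo {

    /** Parses properties-file text and compiles its INPUT_PATTERN value. */
    static Pattern inputPattern(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Pattern.compile(props.getProperty("INPUT_PATTERN"));
    }

    /** True if the whole file or zip entry name matches the pattern. */
    static boolean accepts(Pattern pattern, String name) {
        return pattern.matcher(name).matches();
    }

    public static void main(String[] args) {
        // The Java literal below holds the text exactly as it appears in
        // the properties file: ^.+\\.nt$ (two backslashes).
        Pattern p = inputPattern("INPUT_PATTERN=^.+\\\\.nt$\n");
        System.out.println(p.pattern());                    // one backslash remains
        System.out.println(accepts(p, "University0.nt"));   // true
        System.out.println(accepts(p, "University0.ntx"));  // false
    }
}
```

Whether RecordLoader applies the pattern to the full path or only to the entry name is not covered here; the anchored pattern above behaves the same either way.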
The next property, CONNECTION_STRING, is also a generic part of RecordLoader, and supports automatic load-balancing across multiple hosts. However, NQLoader uses CONNECTION_STRING somewhat differently than the built-in loaders do. Instead of being an XCC connection string, this is an HTTP service to which NQLoader will POST an XML representation of the input semantic triples or quads (see Document Format, below). The insert.xqy referred to by this connection string must be available to the application server, in this case the one running on port 8012 on host-1 and host-2. This code is available from GitHub: insert.xqy.
Finally, NQLoader can buffer up an arbitrary number of triples or quads to send to the insert.xqy module. Since each triple or quad is small, this buffering helps to improve ingestion performance. The number of triples or quads in each buffer is governed by the BATCH_SIZE property.
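The buffering idea can be sketched in a few lines of Java. This is an illustration only, not NQLoader's code: the TupleBatcher class is hypothetical, and the sink callback stands in for whatever actually POSTs a batch to insert.xqy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of BATCH_SIZE-style buffering, not NQLoader code.
class TupleBatcher {
    private final int batchSize;
    private final Consumer<List<String>> sink; // e.g. something that POSTs a batch
    private List<String> buffer = new ArrayList<>();

    TupleBatcher(int batchSize, Consumer<List<String>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one serialized tuple and flushes when the batch is full. */
    void add(String tupleXml) {
        buffer.add(tupleXml);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Hands any buffered tuples to the sink; call once more at end of input. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> full = buffer;
        buffer = new ArrayList<>();
        sink.accept(full);
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        TupleBatcher batcher = new TupleBatcher(10, batch -> sizes.add(batch.size()));
        for (int i = 0; i < 25; i++) {
            batcher.add("<t/>");
        }
        batcher.flush();
        System.out.println(sizes); // [10, 10, 5]
    }
}
```

With a batch size of 10, 25 tuples cost three round trips instead of 25, which is where the ingestion gain comes from.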
NQLoader also attempts to de-duplicate tuples on the fly, using an LRU-like cache for recently-seen tuple data.
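One common way to build such a cache in Java is an access-ordered LinkedHashMap with a size cap. The sketch below assumes that approach; the RecentTupleFilter class and its capacity are hypothetical, and NQLoader's actual cache may differ.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of LRU-style de-duplication, not NQLoader code.
class RecentTupleFilter {
    private final Set<String> recent;

    RecentTupleFilter(final int capacity) {
        // An access-ordered LinkedHashMap keeps the least recently used key
        // first, so removeEldestEntry evicts it once the cap is exceeded.
        this.recent = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > capacity;
                }
            });
    }

    /**
     * Returns true if the tuple was not seen recently and should be sent.
     * Synchronized because several loader threads may share one filter.
     */
    synchronized boolean firstSighting(String tupleKey) {
        return recent.add(tupleKey);
    }

    public static void main(String[] args) {
        RecentTupleFilter filter = new RecentTupleFilter(2);
        System.out.println(filter.firstSighting("a")); // true
        System.out.println(filter.firstSighting("a")); // false, duplicate
    }
}
```

Because the cache is bounded, a duplicate that reappears after its key has been evicted is sent again, which matches the "attempts to" wording above.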
NQLoader also attempts to deterministically send the same tuples to the same CONNECTION_STRING URLs, as long as the property remains constant.
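A simple way to get this kind of stable routing is to hash a tuple key and take the result modulo the number of URLs. The Java sketch below assumes that scheme and assumes the URLs in CONNECTION_STRING are separated by whitespace, as in the sample file; the TupleRouter class is hypothetical and the real NQLoader logic may use a different key or hash.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of deterministic routing, not NQLoader code.
class TupleRouter {
    private final List<String> urls;

    /** Splits a CONNECTION_STRING value into its whitespace-separated URLs. */
    TupleRouter(String connectionString) {
        this.urls = Arrays.asList(connectionString.trim().split("\\s+"));
    }

    /** The same key always maps to the same URL while the URL list is unchanged. */
    String urlFor(String tupleKey) {
        // String.hashCode() is specified by the language, so it is stable
        // across runs; floorMod keeps the index non-negative.
        int index = Math.floorMod(tupleKey.hashCode(), urls.size());
        return urls.get(index);
    }

    public static void main(String[] args) {
        TupleRouter router = new TupleRouter(
            "http://admin:admin@host-1:8012/insert.xqy "
            + "http://admin:admin@host-2:8012/insert.xqy");
        String key = "http://www.Department17.University3.edu/UndergraduateStudent276";
        System.out.println(router.urlFor(key).equals(router.urlFor(key))); // true
    }
}
```

Adding or removing a URL changes the modulus, so the mapping is only stable while the property stays the same.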
Once we have assembled the necessary tools for RecordLoader and NQLoader, and have configured MarkLogic Server with a database and an application server, we can load content. Here is a sample Java command-line.
java -cp recordloader.jar:semantic.jar:nxparser.jar:xpp3.jar \
  com.marklogic.ps.RecordLoader \
  nqloader.properties \
  *.zip
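NQLoader sends each triple or quad to MarkLogic Server using a simplified XML representation. The element names are t, s, o, p, and c: t wraps a single tuple, and s, o, p, and c hold its subject, object, predicate, and (for quads) context.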
Here is a sample element, based on a triple from the Lehigh University Benchmark (LUBM) corpus.
<t>
  <s>http://www.Department17.University3.edu/UndergraduateStudent276</s>
  <p>http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress</p>
  <o>UndergraduateStudent276@Department17.University3.edu</o>
</t>
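For readers who want to produce this element themselves, here is a minimal Java sketch that serializes one tuple in the shape shown above. The TupleXml class is hypothetical, and placing the optional c element after o is an assumption; only the s, p, o order comes from the sample.

```java
// Hypothetical sketch of the <t> serialization, not NQLoader code.
class TupleXml {

    /** Escapes the characters that are not allowed in XML element content. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /**
     * Builds a <t> element for one tuple. Pass null as the context for a
     * plain triple; the position of <c> after <o> is an assumption.
     */
    static String toXml(String subject, String predicate, String object, String context) {
        StringBuilder xml = new StringBuilder("<t>");
        xml.append("<s>").append(escape(subject)).append("</s>");
        xml.append("<p>").append(escape(predicate)).append("</p>");
        xml.append("<o>").append(escape(object)).append("</o>");
        if (context != null) {
            xml.append("<c>").append(escape(context)).append("</c>");
        }
        return xml.append("</t>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml(
            "http://www.Department17.University3.edu/UndergraduateStudent276",
            "http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress",
            "UndergraduateStudent276@Department17.University3.edu",
            null));
    }
}
```

Escaping matters mostly for literal objects, which can contain ampersands or angle brackets that would otherwise break the XML.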
By default, this document is inserted as properties of an empty document. This is also known as "naked properties". You can use xdmp:document-properties() to inspect them after loading. The above example would appear like this if you run xdmp:document-properties():
<prop:properties xmlns:prop="http://marklogic.com/xdmp/property">
  <s>http://www.Department17.University3.edu/UndergraduateStudent276</s>
  <p>http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress</p>
  <o>UndergraduateStudent276@Department17.University3.edu</o>
  <prop:last-modified>2010-12-03T19:12:53-08:00</prop:last-modified>
</prop:properties>
You can change this default by modifying insert.xqy so that each tuple is inserted as an ordinary document instead. This is not recommended, because inserting triples as naked properties saves a fragment.
You may also insert triples or quads manually, using the sem:tuple-insert() function. The library function will automatically generate the document URI for each triple, and will generate XML similar to the example above.
Now that we have some semantic data in the server, we can write simple queries that use XPath. For example:
/t
  [s eq 'http://www.Department17.University3.edu/UndergraduateStudent276']

/t
  [s eq 'http://www.Department17.University3.edu/UndergraduateStudent276']
  [p eq 'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#emailAddress']
  /o
However, these XPath queries are not very interesting from a semantic point of view. This is where the semantic.xqy library becomes useful. The next sections provide a brief overview of this library. For more information about the join API, see the function reference.
One common task in semantic applications is to perform transitive closure over an edge type. This is especially useful for queries involving a "friend of a friend" relationship. The sem:transitive-closure() function can follow a relationship for a number of generations, building a map that represents the matching network.
xquery version "1.0-ml";
import module namespace sem="http://marklogic.com/semantic"
  at "semantic.xqy";

let $m := map:map()
let $seeds := xdmp:get-request-field('seed')
let $filters as xs:string* := xdmp:get-request-field('filter')
let $gen := 6
return sem:transitive-closure(
  $m, $seeds, $gen,
  'http://xmlns.com/foaf/0.1/knows',
  true(), $filters
)
The semantic library also provides a sem:serialize() function, which can serialize the FOAF map to a flat text format.
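To illustrate the idea of a generation-limited transitive closure outside the server, here is a small in-memory Java sketch. It is not how sem:transitive-closure() is implemented: the ClosureSketch class is hypothetical, the edges are held in a plain map rather than looked up through range indexes, and the filter and boolean arguments of the library function are ignored.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch, not the semantic.xqy implementation.
class ClosureSketch {

    /**
     * Breadth-first walk from the seeds along one edge type.
     * Returns each reached node mapped to the generation that first reached
     * it; seeds are generation 0.
     */
    static Map<String, Integer> closure(Map<String, List<String>> edges,
                                        Collection<String> seeds,
                                        int maxGenerations) {
        Map<String, Integer> seen = new LinkedHashMap<>();
        List<String> frontier = new ArrayList<>();
        for (String seed : seeds) {
            if (seen.putIfAbsent(seed, 0) == null) {
                frontier.add(seed);
            }
        }
        for (int gen = 1; gen <= maxGenerations && !frontier.isEmpty(); gen++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier) {
                List<String> targets = edges.getOrDefault(node, Collections.emptyList());
                for (String target : targets) {
                    // Only nodes not seen before join the next frontier,
                    // so cycles in the graph terminate.
                    if (seen.putIfAbsent(target, gen) == null) {
                        next.add(target);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }
}
```

The generation limit bounds the work on dense social graphs, where an unbounded "knows" closure can reach most of the data set.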
Let's look at a slightly more complex problem: this is query 1 from the above-mentioned LUBM benchmark workload. It returns all subjects that match a set of joins. Each matching subject must be of type GraduateStudent and must take a specific course.
sem:subject-for-join(
  (sem:object-predicate-join(
     'http://www.Department0.University0.edu/GraduateCourse0',
     'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#takesCourse'),
   sem:type-join(
     'http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#GraduateStudent')))
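Conceptually, this query intersects two sets of subjects: those with a matching takesCourse triple and those with a matching type triple. The Java sketch below shows that intersection over an in-memory list of triples. It is an illustration under stated assumptions, not the library's implementation: the JoinSketch class is hypothetical, each triple is a three-element array of subject, predicate, and object, and the type join is assumed to use the standard rdf:type predicate.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory sketch of a subject join, not semantic.xqy code.
class JoinSketch {
    // Assumption: the type join matches on the standard rdf:type predicate.
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    /** Subjects that have a triple with the given predicate and object. */
    static Set<String> subjectsFor(List<String[]> triples, String predicate, String object) {
        Set<String> subjects = new TreeSet<>();
        for (String[] t : triples) { // t = {subject, predicate, object}
            if (t[1].equals(predicate) && t[2].equals(object)) {
                subjects.add(t[0]);
            }
        }
        return subjects;
    }

    /** Subjects that satisfy both the object-predicate join and the type join. */
    static Set<String> subjectsForJoin(List<String[]> triples,
                                       String object, String predicate, String type) {
        Set<String> result = subjectsFor(triples, predicate, object);
        result.retainAll(subjectsFor(triples, RDF_TYPE, type));
        return result;
    }
}
```

Inside MarkLogic Server the equivalent lookups would presumably be answered from the s, p, and o range indexes rather than by scanning every triple, which is why those indexes are required.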