The vector queries shown in the LangChain, langchain4j, and LangChain.js RAG examples depend on embeddings (vector representations of text) being added to documents in MarkLogic. Vector queries can then be implemented using the new vector functions in MarkLogic 12. This project demonstrates the use of a langchain4j in-process embedding model and the MarkLogic Data Movement SDK for adding embeddings to documents in MarkLogic.
## Setup
This example depends both on the main setup for all examples and on having run the "Split to multiple documents" program in the document splitting examples. That program used langchain4j to split the text in Enron email documents and write each chunk of text to a separate document. This example then uses langchain4j to generate an embedding for each chunk of text and add it to the corresponding chunk document.
## Adding embeddings to documents
To use this example, you'll need to be in this example's directory. If you are not already there, run this command:

```
cd embedding-langchain-java
```
To try the embedding example, run the following Gradle task:

```
../gradlew addEmbeddings
```
After the task completes, each document in the `enron-chunk` collection will have an `embedding` field consisting of an array of floating-point numbers. Each document will also have been added to the `enron-chunk-with-embedding` collection.
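The general approach of the `addEmbeddings` task can be sketched with the Data Movement SDK's `QueryBatcher` and langchain4j's in-process model. The sketch below is hypothetical, not the project's actual code: the connection details, the `text` field name on each chunk document, and the langchain4j package path (`onnx.allminilml6v2`, used by recent langchain4j releases) are all assumptions you should adjust for your environment.

```java
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.document.JSONDocumentManager;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.JacksonHandle;
import com.marklogic.client.query.StructuredQueryBuilder;
import dev.langchain4j.model.embedding.onnx.allminilml6v2.AllMiniLmL6V2EmbeddingModel;

public class AddEmbeddingsSketch {

    public static void main(String[] args) {
        // Connection details are placeholders; match your own configuration.
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8003,
            new DatabaseClientFactory.DigestAuthContext("admin", "admin"));
        // In-process model; no external service is required.
        AllMiniLmL6V2EmbeddingModel model = new AllMiniLmL6V2EmbeddingModel();
        JSONDocumentManager docMgr = client.newJSONDocumentManager();

        DataMovementManager dmm = client.newDataMovementManager();
        QueryBatcher batcher = dmm.newQueryBatcher(
                new StructuredQueryBuilder().collection("enron-chunk"))
            .withBatchSize(100)
            .withThreadCount(8)
            .onUrisReady(batch -> {
                for (String uri : batch.getItems()) {
                    ObjectNode doc = (ObjectNode) docMgr.read(uri, new JacksonHandle()).get();
                    // Generate a 384-dimension embedding for the chunk's text
                    // (assumes the chunk document has a "text" field).
                    float[] vector = model.embed(doc.get("text").asText()).content().vector();
                    ArrayNode embedding = doc.putArray("embedding");
                    for (float value : vector) {
                        embedding.add(value);
                    }
                    // Write the document back with the additional collection.
                    DocumentMetadataHandle metadata = new DocumentMetadataHandle()
                        .withCollections("enron-chunk", "enron-chunk-with-embedding");
                    docMgr.write(uri, metadata, new JacksonHandle(doc));
                }
            });
        dmm.startJob(batcher);
        batcher.awaitCompletion();
        dmm.stopJob(batcher);
        client.release();
    }
}
```

Because the model runs in-process, the embedding work parallelizes naturally across the batcher's threads without any rate limits from a hosted embedding service.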
As a next step, you would likely create a MarkLogic TDE view that allows you to use the MarkLogic Optic API to query for rows with similar embeddings. This is the exact approach used in the vector queries for each of the RAG examples mentioned above. Your TDE could look like the one shown below. Note that the value of `dimension` for the `embedding` column must match the dimension of the embedding model that you used; in this example, the langchain4j in-process embedding model produces 384-dimension embeddings, so `dimension` must be 384.
```json
{
  "template": {
    "context": "/",
    "collections": [
      "enron-chunk-with-embedding"
    ],
    "rows": [
      {
        "schemaName": "example",
        "viewName": "enronChunk",
        "columns": [
          {
            "name": "uri",
            "scalarType": "string",
            "val": "sourceUri"
          },
          {
            "name": "embedding",
            "scalarType": "vector",
            "val": "vec:vector(embedding)",
            "dimension": "384",
            "invalidValues": "reject"
          },
          {
            "name": "text",
            "scalarType": "string",
            "val": "text"
          }
        ]
      }
    ]
  }
}
```
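Once a TDE like this is deployed, you can verify that rows are being projected from the chunk documents, for example via the MarkLogic Java Client's Optic API. The snippet below is a hypothetical smoke test, not part of this project, and the connection details are placeholders; a real vector query would then layer MarkLogic 12's vector functions (such as cosine similarity) on top of a plan like this to rank rows.

```java
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.expression.PlanBuilder;
import com.marklogic.client.row.RowManager;

public class QueryChunkViewSketch {

    public static void main(String[] args) {
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8003,
            new DatabaseClientFactory.DigestAuthContext("admin", "admin"));
        RowManager rowManager = client.newRowManager();
        PlanBuilder op = rowManager.newPlanBuilder();
        // Read a few rows from the view defined by the TDE above.
        PlanBuilder.ModifyPlan plan = op
            .fromView("example", "enronChunk")
            .select(op.col("uri"), op.col("text"))
            .limit(5);
        rowManager.resultRows(plan).forEach(row ->
            System.out.println(row.getString("uri")));
        client.release();
    }
}
```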
When performing a vector query with MarkLogic, you need to ensure that the embedding you compare against the values in the `vector` column defined in your TDE has the same dimension. Otherwise, MarkLogic will throw an `XDMP-DIMMISMATCH` error. For example, since this example program uses an in-process langchain4j embedding model, you would want to use that same model to generate an embedding of a user's chat question. Likewise, if you changed the example program to use an Azure OpenAI embedding model, you would need to use that same model when generating an embedding of the user's chat question.
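The dimension rule can be illustrated without MarkLogic at all. The plain-Java sketch below (hypothetical helper, not a MarkLogic API) computes cosine similarity, which is conceptually what a vector query ranks by, and shows why two vectors of different lengths cannot be compared:

```java
public class DimensionCheck {

    // Cosine similarity is only defined for vectors of equal length;
    // MarkLogic raises XDMP-DIMMISMATCH in the analogous situation.
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Vector dimensions differ: " + a.length + " vs " + b.length);
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] stored = new float[384]; // dimension of the in-process model
        float[] query = new float[384];
        stored[0] = 1f;
        query[0] = 1f;
        // Same direction, same dimension: similarity is 1.0.
        System.out.println(cosineSimilarity(stored, query));
        try {
            // A 1536-dimension query vector (a typical hosted-model size)
            // cannot be compared against 384-dimension stored vectors.
            cosineSimilarity(stored, new float[1536]);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```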