Introduction
Apache ManifoldCF is an open source document ingestion framework. Using ManifoldCF (MCF), you can:
- Crawl documents from different content repositories such as Alfresco and Microsoft SharePoint. MCF repository connectors are used for crawling documents.
- Transform the documents by adjusting existing document metadata and adding new metadata using technologies like Apache Tika. This is done using MCF transformation connectors.
- Preserve source-repository security policies and document ACLs when indexing the documents, by using MCF authority connectors.
- Finally, index the documents in search indexes such as Apache Solr, OpenSearchServer or Elasticsearch via output connectors.
Apache ManifoldCF can be effectively used as a document ingestion engine to build federated search applications. There are several open source enterprise search solutions developed using Apache ManifoldCF as the document ingestion engine.
This post is about adding semantic enhancements to documents indexed by ManifoldCF. Semantic enhancements enrich the documents with additional contextual knowledge about the real-world entities that are mentioned in them.
Semantic enhancements may include real-world entities or concepts such as people, organizations and places. By tagging documents with the entities mentioned in them, documents can be connected to external knowledge bases about those entities. This concept is called linked data. Semantic data added to documents enables semantic search in search applications, where users can search documents by concept (a person, a location, an organization, etc.) rather than just by keywords.
To enhance documents in ManifoldCF, we are going to use Apache Stanbol.
Apache Stanbol is a framework for semantic content management. Using Stanbol, a traditional content management process can be enhanced with semantic knowledge by linking documents to external knowledge bases like DBpedia and Freebase, or even to a custom-developed knowledge base. Stanbol integrates components for language detection, natural language processing, named entity recognition and entity linking to external and custom knowledge bases. By combining these components in an enhancement chain, Stanbol can be used to perform semantic tagging of content.
Adding semantic enhancements to documents in ManifoldCF
In this post I will explain how to enhance ManifoldCF documents using Apache Stanbol as the semantic enhancement engine. We have developed a transformation connector for ManifoldCF which connects to Apache Stanbol and enhances documents by adding entity properties to them as document fields. The following is the high-level design of the Stanbol connector chain for ManifoldCF.
Figure 1: Stanbol connector chain architecture
To configure the Stanbol connector with ManifoldCF, you first need to build the connector from source and then register it in the ManifoldCF connectors.xml. Follow the steps below to get the connector configured in ManifoldCF.
1. Build ManifoldCF 2.3 from source, as the connector depends on MCF components.
git clone https://github.com/apache/manifoldcf.git
cd manifoldcf/
git checkout release-2.3-branch
mvn clean install
2. Build the Apache Stanbol client, which is a dependency of the Stanbol connector:
git clone https://github.com/zaizi/apache-stanbol-client.git
cd apache-stanbol-client
git checkout jaxrs-1.0
mvn clean install -DskipTests=true
3. Check out the source code of the Stanbol connector for ManifoldCF from the Git project here:
https://github.com/zaizi/sensefy-connectors/tree/feature/SENSEFY-1453-modify-stanbol-connector/transformation/mcf-stanbol-connector
4. Build the Stanbol connector using Maven:
mvn clean install
5. Copy mcf-stanbol-connector-2.3-jar-with-dependencies.jar to MANIFOLDCF_INSTALL_DIR/connectors-lib.
6. Register the Stanbol connector in connectors.xml:
<transformationconnector name="Stanbol enhancer" class="org.zaizi.manifoldcf.agents.transformation.stanbol.StanbolEnhancer"/>
7. The Stanbol connector needs a running Stanbol server. You can build and start a Stanbol server by following the instructions in the Stanbol project documentation.
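After editing connectors.xml in step 6, a quick sanity check can confirm that the Stanbol entry is well-formed and visible. The following is a minimal Python sketch, not part of ManifoldCF or the connector. The sample XML is a hypothetical, trimmed-down connectors.xml used only for illustration; in practice you would read the real file from your ManifoldCF installation.

```python
import xml.etree.ElementTree as ET

# Class name taken from step 6 above.
STANBOL_CLASS = "org.zaizi.manifoldcf.agents.transformation.stanbol.StanbolEnhancer"


def registered_transformation_connectors(connectors_xml: str) -> dict:
    """Map each transformation connector's name to its class attribute."""
    root = ET.fromstring(connectors_xml)
    return {
        element.get("name"): element.get("class")
        for element in root.iter("transformationconnector")
    }


# Hypothetical, trimmed-down connectors.xml content (illustration only).
SAMPLE = """
<connectors>
  <transformationconnector name="Stanbol enhancer"
      class="org.zaizi.manifoldcf.agents.transformation.stanbol.StanbolEnhancer"/>
</connectors>
"""

found = registered_transformation_connectors(SAMPLE)
if STANBOL_CLASS in found.values():
    print("Stanbol connector is registered")
else:
    print("Stanbol connector entry is missing")
```

A parse error here usually means the XML line from step 6 was pasted with a broken quote or a missing closing slash.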
Configuring a ManifoldCF Job with Stanbol connector
You need to have a repository connector and an output connector configured before configuring the Stanbol connector. For this demo we have configured a file-system repository connector and a Solr output connector.
The following is the ManifoldCF job configuration with three connectors.
- FileSystemRepo : File-system repository connector to ingest text documents from a folder.
- StanbolEnhancer : Stanbol transformation connector that enhances documents by adding semantic metadata to them as fields.
- solrOutput : Solr output connector to index the final documents in a Solr server.
Figure 2: ManifoldCF Job Connection
Stanbol Connector Configurations
Section 1 : Stanbol server connection configurations
In the first section of the Stanbol connector configuration, you need to provide the server URL and the name of the enhancement chain to use for enhancements.
The default values are:
- Stanbol server URL : http://localhost:8080/
- Stanbol enhancement chain : default
Figure 3: Stanbol server connection configurations
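To make the two settings concrete, here is a minimal Python sketch of how a server URL and a chain name combine into a request against Stanbol's REST enhancer endpoint (`/enhancer/chain/<name>`). It only illustrates the HTTP call; the connector itself uses the Java Stanbol client, so the function name and the choice of `Accept` header below are assumptions made for this example.

```python
import urllib.parse
import urllib.request


def build_enhancer_request(server_url: str, chain: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Stanbol enhancement chain."""
    base = server_url.rstrip("/")
    url = f"{base}/enhancer/chain/{urllib.parse.quote(chain, safe='')}"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            # Assumption: ask for RDF/XML; Stanbol can serialize results in other formats too.
            "Accept": "application/rdf+xml",
        },
        method="POST",
    )


# Default values from the connector configuration above.
request = build_enhancer_request(
    "http://localhost:8080/", "default", "Paris is the capital of France."
)
print(request.get_method(), request.full_url)

# Sending the request needs a running Stanbol server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

If the request fails with a 404, the chain name is the first thing to check, since the chain is part of the URL path.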
Section 2 : Dereference fields configurations
The dereference fields configuration of the Stanbol connector lets you define the entity properties that you want to extract from the Stanbol entities and add to the ManifoldCF document as fields.
Common entity properties that can be extracted from an entity are:
- http://www.w3.org/2000/01/rdf-schema#label
- http://www.w3.org/2000/01/rdf-schema#comment
- http://www.w3.org/1999/02/22-rdf-syntax-ns#type
Figure 4: Dereference fields
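The effect of the dereference fields can be pictured as a filter over an entity's properties: only the configured properties are carried over to the document. The Python sketch below is a simplified model of that idea, not the connector's actual code. The entity data is hypothetical, and using the property URI itself as the field name is an assumption, since this post does not show how the connector names the resulting fields.

```python
# Properties configured as dereference fields (from the list above).
DEREFERENCE_FIELDS = [
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2000/01/rdf-schema#comment",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
]


def entity_to_document_fields(entity_properties: dict, wanted=DEREFERENCE_FIELDS) -> dict:
    """Keep only the configured properties that have at least one value."""
    return {
        prop: list(entity_properties[prop])
        for prop in wanted
        if entity_properties.get(prop)
    }


# Hypothetical entity, shaped as property URI -> list of values.
entity = {
    "http://www.w3.org/2000/01/rdf-schema#label": ["Paris"],
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": ["http://dbpedia.org/ontology/Place"],
    "http://xmlns.com/foaf/0.1/depiction": ["http://example.org/paris.jpg"],
}

for field_name, values in entity_to_document_fields(entity).items():
    print(field_name, values)
```

Here the depiction property is dropped because it is not configured, and the comment property is skipped because this entity has no value for it.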
LDPath program configurations
In this section, you can define an LDPath program to select which properties to extract from the entities. You need to define the LDPath prefixes and the LDPath fields. The connector generates the LDPath program from the given prefixes and field definitions, and sends it to Stanbol along with the enhancement request.
In this example we have defined the following LDPath prefix and field definitions.
Prefix definitions
Prefix : zaizi
Namespace URI : http://zaizi.com/custom
LDPath Field definitions
field name: zaizi:label
definition : rdfs:label[@en] :: xsd:string
field name: zaizi:comment
definition : rdfs:comment[@en] :: xsd:string
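To show what a program generated from these definitions might look like, here is a minimal Python sketch that assembles LDPath text from a prefix map and a field map. It follows the general LDPath syntax (`@prefix name : <uri> ;` and `field = path :: type ;`), but the function is hypothetical and the connector's real generator may format its output differently. It also assumes that the `rdfs` and `xsd` prefixes are already known to Stanbol, so they are not declared here.

```python
def build_ldpath_program(prefixes: dict, fields: dict) -> str:
    """Assemble an LDPath program: prefix declarations first, then field rules."""
    lines = [f"@prefix {prefix} : <{namespace}> ;" for prefix, namespace in prefixes.items()]
    lines += [f"{name} = {definition} ;" for name, definition in fields.items()]
    return "\n".join(lines)


# Prefix and field definitions from the example above.
prefixes = {"zaizi": "http://zaizi.com/custom"}
fields = {
    "zaizi:label": "rdfs:label[@en] :: xsd:string",
    "zaizi:comment": "rdfs:comment[@en] :: xsd:string",
}

print(build_ldpath_program(prefixes, fields))
```

Each field rule reads as: take the English-language values of the given property and expose them as strings under the new field name.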






