Monday, January 25, 2016

Enhancing documents in Apache ManifoldCF using Apache Stanbol

Introduction


Apache ManifoldCF is an open source document ingestion framework. Using ManifoldCF (MCF) you can:

  • Crawl documents from content repositories such as Alfresco and Microsoft SharePoint. MCF repository connectors handle the crawling.
  • Transform documents by adjusting existing metadata and adding new metadata using technologies such as Apache Tika. This is done using MCF transformation connectors.
  • Preserve source-repository security policies and document ACLs when indexing documents, using MCF authority connectors.
  • Finally, index the documents in search engines such as Apache Solr, OpenSearchServer or Elasticsearch via output connectors.

Apache ManifoldCF can be effectively used as a document ingestion engine to build federated search applications. There are several open source enterprise search solutions developed using Apache ManifoldCF as the document ingestion engine.

This post is about enriching documents indexed by ManifoldCF with semantic enhancements. Semantic enhancements add contextual knowledge about the real-world entities mentioned in a document.

Semantic enhancements may include real-world entities and concepts such as people, organizations and places. Tagging documents with the entities mentioned in them connects the documents to external knowledge bases that describe those entities. This concept is called linked data. The semantic data added to documents enables semantic search, where users can search for documents by concept (a person, a location, an organization and so on) rather than just by keyword.

To enhance documents in ManifoldCF, we are going to use Apache Stanbol.
Apache Stanbol is a framework for semantic content management. Using Stanbol, a traditional content management process can be enhanced with semantic knowledge by linking documents to external knowledge bases such as DBpedia or Freebase, or even to a custom-built knowledge base. Stanbol integrates components for language detection, natural language processing, named entity recognition and entity linking to external and custom knowledge bases. By combining these components in an enhancement chain, Stanbol can be used to perform semantic tagging of content.

Adding semantic enhancements to documents in ManifoldCF

In this post I will explain how to enhance ManifoldCF documents using Apache Stanbol as the semantic enhancement engine. We have developed a transformation connector for ManifoldCF that connects to Apache Stanbol and enhances documents by adding entity properties to them as document fields.

The following figure shows the high-level design of the Stanbol connector chain for ManifoldCF.

Figure 1 : Stanbol connector chain architecture

Prerequisites


To use the Stanbol connector with ManifoldCF, you first need to build the connector from source and register it in the ManifoldCF connectors.xml. Follow the steps below to get the connector configured in ManifoldCF.

1. Build ManifoldCF 2.3 from source, as the connector depends on MCF components.
git clone https://github.com/apache/manifoldcf.git
cd manifoldcf/
git checkout release-2.3-branch 
mvn clean install

2. Build the Apache Stanbol client, which the Stanbol connector uses as a dependency.
git clone https://github.com/zaizi/apache-stanbol-client.git
cd apache-stanbol-client
git checkout jaxrs-1.0
mvn clean install -DskipTests=true

3. Check out the source code of the Stanbol connector for ManifoldCF from the following git project:
https://github.com/zaizi/sensefy-connectors/tree/feature/SENSEFY-1453-modify-stanbol-connector/transformation/mcf-stanbol-connector

4. Build the Stanbol connector using Maven:
mvn clean install

5. Copy the mcf-stanbol-connector-2.3-jar-with-dependencies.jar to MANIFOLDCF_INSTALL_DIR/connectors-lib

6. Configure the Stanbol connector in the connectors.xml
<transformationconnector name="Stanbol enhancer" class="org.zaizi.manifoldcf.agents.transformation.stanbol.StanbolEnhancer"/>

7. The Stanbol connector needs a running Stanbol server. You can build and start a Stanbol server by following the instructions in the Stanbol project documentation.

Configuring a ManifoldCF Job with Stanbol connector

You need to have a repository connector and an output connector configured before configuring the Stanbol connector. For this demo we have configured a file-system repository connector and a Solr output connector.

The ManifoldCF job is configured with the following three connectors.
  1. FileSystemRepo : File-system repository connector that ingests text documents from a folder.
  2. StanbolEnhancer : Stanbol transformation connector that enhances documents by adding semantic metadata to them as fields.
  3. solrOutput : Solr output connector that indexes the final documents in a Solr server.
Figure 2 : ManifoldCF Job Connection

Stanbol Connector Configurations

Section 1 : Stanbol server connection configurations

In the first section of the Stanbol connector configuration, you need to provide the server URL and the name of the enhancement chain to use for enhancements.

The default values are:
  1. Stanbol server URL : http://localhost:8080/
  2. Stanbol enhancement chain : default
Figure 3 : Stanbol server connection configurations
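
To make these two settings concrete, here is a minimal sketch of how a server URL and a chain name combine into an enhancement request. It talks to Stanbol's REST enhancer endpoint (/enhancer/chain/<chain name>) with plain HTTP, which is an assumption made for illustration; the connector itself goes through the Apache Stanbol client library. The function name build_enhancer_request and the sample text are hypothetical.

```python
from urllib.parse import quote, urljoin
from urllib.request import Request


def build_enhancer_request(server_url: str, chain: str, text: str) -> Request:
    """Build (but do not send) a POST request for a Stanbol enhancement chain."""
    # Make sure the base URL ends with a slash so urljoin appends to it.
    base = server_url if server_url.endswith("/") else server_url + "/"
    # Assumption: the chain is exposed at enhancer/chain/<chain name>.
    endpoint = urljoin(base, "enhancer/chain/" + quote(chain, safe=""))
    return Request(
        endpoint,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "application/rdf+xml",  # ask for the enhancements as RDF
        },
        method="POST",
    )


if __name__ == "__main__":
    # Uses the default values listed above; nothing is sent over the network.
    request = build_enhancer_request(
        "http://localhost:8080/", "default", "Some document text."
    )
    print(request.get_method(), request.full_url)
```

Passing the returned request to urllib.request.urlopen would send it, provided a Stanbol server is running at the configured URL.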


You can configure the Stanbol connector to use either dereference fields or an LDPath program to define which entity properties to extract from the entity RDF data and add to the document as semantic data.

Section 2 : Dereference fields configurations

The dereference fields configuration of the Stanbol connector lets you define the entity properties that you want to extract from the Stanbol entities and add to the ManifoldCF document as fields.

Common entity properties that can be extracted from an entity are:
  • http://www.w3.org/2000/01/rdf-schema#label 
  • http://www.w3.org/2000/01/rdf-schema#comment 
  • http://www.w3.org/1999/02/22-rdf-syntax-ns#type 
These properties will vary based on the entity dataset used for entity linking in Stanbol. The default enhancement chain uses the DBpedia dataset.

Figure 4 : Dereference fields
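
As a rough illustration of what selecting dereference fields means, the sketch below keeps only the configured properties of an entity. The entity is modelled as a plain dictionary from property URI to values, and both the function name select_dereferenced and the sample entity data are hypothetical; the connector works on the RDF returned by Stanbol.

```python
from typing import Dict, Iterable, List

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def select_dereferenced(
    entity: Dict[str, List[str]], fields: Iterable[str]
) -> Dict[str, List[str]]:
    """Return only the configured properties that the entity actually has."""
    return {prop: list(entity[prop]) for prop in fields if prop in entity}


if __name__ == "__main__":
    # Hypothetical entity data, for illustration only.
    entity = {
        RDFS_LABEL: ["Example entity"],
        RDF_TYPE: ["http://example.org/types/Thing"],
        "http://example.org/unconfigured": ["ignored"],
    }
    print(select_dereferenced(entity, [RDFS_LABEL, RDFS_COMMENT, RDF_TYPE]))
```

Properties that are configured but missing from an entity (rdfs:comment here) are simply skipped, and unconfigured properties never reach the document.
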

LDPath program configurations

In this section, the user can define an LDPath program to select which properties to extract from the entities. The user needs to define the LDPath prefixes and the LDPath fields. The connector generates the LDPath program from the given prefixes and field definitions, and sends the enhancement request to Stanbol together with the LDPath program.

In this example we have defined the following LDPath prefix and field definitions.

Prefix definitions

Prefix : zaizi
Namespace URI : http://zaizi.com/custom


LDPath Field definitions

field name : zaizi:label
definition : rdfs:label[@en] :: xsd:string

field name : zaizi:comment
definition : rdfs:comment[@en] :: xsd:string

Figure 5 : LDPath Program configurations
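
The following sketch shows one way prefix and field definitions like the ones above could be assembled into an LDPath program, using the standard LDPath "@prefix name : <uri> ;" and "field = path :: type ;" statement forms. The function name build_ldpath_program is hypothetical, and the exact program text the connector generates may differ. The namespace URI is copied as written in this post.

```python
from typing import Dict, List, Tuple


def build_ldpath_program(
    prefixes: Dict[str, str], fields: List[Tuple[str, str]]
) -> str:
    """Assemble an LDPath program from prefix and field definitions."""
    # One @prefix statement per namespace, in the order given.
    lines = [f"@prefix {name} : <{uri}> ;" for name, uri in prefixes.items()]
    # One field statement per (field name, definition) pair.
    lines += [f"{name} = {definition} ;" for name, definition in fields]
    return "\n".join(lines)


if __name__ == "__main__":
    program = build_ldpath_program(
        {"zaizi": "http://zaizi.com/custom"},
        [
            ("zaizi:label", "rdfs:label[@en] :: xsd:string"),
            ("zaizi:comment", "rdfs:comment[@en] :: xsd:string"),
        ],
    )
    print(program)
    # @prefix zaizi : <http://zaizi.com/custom> ;
    # zaizi:label = rdfs:label[@en] :: xsd:string ;
    # zaizi:comment = rdfs:comment[@en] :: xsd:string ;
```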


Section 3 : Final document field mappings configurations

In this section, the user can map the entity properties to final document fields. The same mapping can be done using a metadata-adjuster connector; we have added the mapping configuration to the Stanbol connector for the user's convenience.

For example:
entity property : http://zaizi.com/custom/comment
destination field : comments

entity property : http://zaizi.com/custom/label
destination field : entity_names

Instead of defining field mappings, the user can also choose to keep all entity properties as semantic fields in the document.

Figure 6 : Final document field mappings 
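
The mapping step can be sketched as follows. Entity properties are again modelled as a dictionary from property URI to values; the function name map_entity_fields, the keep_all flag and the sample values are hypothetical. Treating "keep all" as storing unmapped properties under their own URI is an assumption about the connector's behaviour, not something confirmed here.

```python
from typing import Dict, List


def map_entity_fields(
    entity_properties: Dict[str, List[str]],
    mappings: Dict[str, str],
    keep_all: bool = False,
) -> Dict[str, List[str]]:
    """Rename entity properties to their destination document fields."""
    document_fields: Dict[str, List[str]] = {}
    for prop, values in entity_properties.items():
        if prop in mappings:
            target = mappings[prop]  # explicit mapping wins
        elif keep_all:
            target = prop  # assumption: keep the property URI as the field name
        else:
            continue  # unmapped properties are dropped
        document_fields.setdefault(target, []).extend(values)
    return document_fields


if __name__ == "__main__":
    mappings = {
        "http://zaizi.com/custom/comment": "comments",
        "http://zaizi.com/custom/label": "entity_names",
    }
    # Hypothetical property values, for illustration only.
    properties = {
        "http://zaizi.com/custom/label": ["Example entity"],
        "http://zaizi.com/custom/comment": ["A short description."],
    }
    print(map_entity_fields(properties, mappings))
```

Values from several properties that map to the same destination field are appended to one multi-valued field.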

Semantically enhanced Solr document

After running the ManifoldCF job with the Stanbol connector, you can see the relevant semantic fields added to the final Solr document, as shown below:

Figure 7 : Final Solr document with semantic fields
