mirror of https://github.com/silverstripe/silverstripe-textextraction synced 2024-10-22 11:06:00 +02:00

Text Extraction Module

Overview

Provides an extraction API for file content, which can hook into different extractor engines based on availability and the parsed file format. The output is always a string: the file content.

Via the FileTextExtractable extension, this logic can be used to cache the extracted content on a DataObject subclass (usually File).

Note: Previously part of the sphinx module.

Requirements

SilverStripe 3.1
(optional) XPDF (pdftotext utility)
(optional) Apache Solr with ExtractingRequestHandler
(optional) Apache Tika

Supported Formats

HTML (built-in)
PDF (with XPDF or Solr)
Microsoft Word, Excel, PowerPoint (Solr)
OpenOffice (Solr)
CSV (Solr)
RTF (Solr)
EPub (Solr)
Many others (Tika)

Installation

The recommended installation is through composer. Add the following to your composer.json:

{
	"require": {
		"silverstripe/textextraction": "2.0.x-dev"
	}
}

The module depends on the Guzzle HTTP Library, which is automatically checked out by composer. Alternatively, install Guzzle through PEAR and ensure it's in your include_path.

Configuration

Basic

By default, only extraction from HTML documents is supported. No configuration is required for that, unless you want to make the content available through your DataObject subclass. In this case, add the following to mysite/_config/config.yml:

File:
  extensions:
    - FileTextExtractable

XPDF

PDFs require special handling, for example through the XPDF command-line utility. Follow its installation instructions; its presence will be detected automatically. You can optionally set the binary path in mysite/_config/config.yml:

PDFTextExtractor:
  binary_location: /my/path/pdftotext

Apache Solr

Apache Solr is a fulltext search engine that is often used alongside this module. More importantly for us, it has bindings to Apache Tika through the ExtractingRequestHandler interface. This allows Solr to inspect the contents of various file formats, such as Office documents and PDF files. The textextraction module retrieves the output of this service, rather than altering the index. With the raw text output, you can store it in a database column for fulltext search through your database driver, or pass it back to Solr as part of a full index update.

In order to use Solr, you need to configure a URL for it (in mysite/_config/config.yml):

SolrCellTextExtractor:
  base_url: 'http://localhost:8983/solr/update/extract'

Note that if you're using multiple cores, you'll need to add the core name to the URL (e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract'). The "fulltext" module uses multiple cores by default, and comes prepackaged with a Solr server. It's a stripped-down version of Solr; follow the module README on how to add Apache Tika text extraction capabilities.

You need to ensure that some indexable property on your object returns the contents, either by directly accessing FileTextExtractable->extractFileAsText(), or by writing your own method around FileTextExtractor->getContent() (see "Usage" below). The property should be listed in your SolrIndex subclass, e.g. as follows:

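class MyDocument extends DataObject {
	// Path of the file whose text should be indexed
	private static $db = array('Path' => 'Text');

	// Returns the extracted text, or null if no extractor supports the file
	public function getContent() {
		$extractor = FileTextExtractor::for_file($this->Path);
		return $extractor ? $extractor->getContent($this->Path) : null;
	}
}

class MySolrIndex extends SolrIndex {
	public function init() {
		$this->addClass('MyDocument');
		// Index the value returned by MyDocument::getContent()
		$this->addStoredField('Content', 'HTMLText');
	}
}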

Note: This isn't a terribly efficient way to process large numbers of files, since each HTTP request is run synchronously.

Tika

Support for Apache Tika (1.7 and above) is included. This can be run in one of two ways: Server or CLI.

See the Apache Tika home page for instructions on installing and configuring this.

This extension works best with the fileinfo PHP extension installed to perform mime detection. Tika validates support via mime type rather than file extension.
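
To illustrate the kind of mime detection that fileinfo provides, here is a minimal sketch using PHP's standard finfo class. The helper name detect_mime_type is hypothetical and not part of this module, and the module's own detection logic may differ; the sketch only shows that the type is derived from file content rather than from the file name.

```php
<?php
/**
 * Hypothetical helper (not part of the module): report a file's MIME type
 * via the fileinfo extension, or null if it cannot be determined.
 */
function detect_mime_type($path) {
    // Without fileinfo (or without a readable file) there is nothing to detect.
    if (!class_exists('finfo') || !is_file($path)) {
        return null;
    }
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($path);
    return $mime === false ? null : $mime;
}

// Usage: the temporary file has no meaningful extension,
// so the reported type comes from its content alone.
$sample = tempnam(sys_get_temp_dir(), 'mime');
file_put_contents($sample, "Plain text sample\n");
echo detect_mime_type($sample), "\n";
unlink($sample);
```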

Tika - CLI

Ensure that your machine has a 'tika' command available that runs the Tika CLI, for example a wrapper script like the following (adjust the jar path to your installation):

#!/bin/bash
exec java -jar /usr/local/Cellar/tika/1.7/libexec/tika-app-1.7.jar "$@"

Tika - REST Server

Tika can also be run as a server.

You can configure your server endpoint by setting the URL via config:

TikaServerTextExtractor:
  server_endpoint: 'http://localhost:9998'

Alternatively this may be specified via the SS_TIKA_ENDPOINT directive in your _ss_environment.php file, or an environment variable of the same name.
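
As a sketch of the _ss_environment.php variant, assuming the usual SilverStripe 3 convention of declaring environment settings as PHP constants with define(), the entry might look like the following. The URL is the same example endpoint used in the config above.

```php
<?php
// _ss_environment.php (sketch, assuming define()-style environment settings)
// Only define the constant if it has not been set elsewhere.
if (!defined('SS_TIKA_ENDPOINT')) {
    define('SS_TIKA_ENDPOINT', 'http://localhost:9998');
}
```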

Then start up your server as follows:

java -jar tika-server-1.7.jar --host=localhost --port=9998

Usage

Manual extraction:

$myFile = '/my/path/myfile.pdf';
$extractor = FileTextExtractor::for_file($myFile);
$content = $extractor->getContent($myFile);
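
FileTextExtractor::for_file() may not find an extractor for a given file (the Solr example above checks for this), so a more defensive variant is sketched below. The function name extract_text_or_null is hypothetical and not part of the module; the class_exists() guard is only there so the snippet can be loaded outside a SilverStripe project.

```php
<?php
/**
 * Hypothetical helper (not part of the module): return the extracted text,
 * or null when no extractor is available for the given file.
 */
function extract_text_or_null($path) {
    // Outside a SilverStripe project the module class is not loaded.
    if (!class_exists('FileTextExtractor')) {
        return null;
    }
    $extractor = FileTextExtractor::for_file($path);
    // No extractor means the format is unsupported on this installation.
    return $extractor ? $extractor->getContent($path) : null;
}

$content = extract_text_or_null('/my/path/myfile.pdf');
if ($content === null) {
    echo "No text could be extracted\n";
}
```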

Extraction with FileTextExtractable extension applied:

$myFileObj = File::get()->First();
$content = $myFileObj->getFileContent();

This content can also be embedded directly within a template.

$MyFile.FileContent