# Text Extraction Module

[![Build Status](https://secure.travis-ci.org/silverstripe-labs/silverstripe-textextraction.png)](http://travis-ci.org/silverstripe-labs/silverstripe-textextraction)

## Overview

Provides an extraction API for file content, which can hook into different extractor
engines based on availability and the parsed file format.
The output is always a string: the file content.

Via the `FileTextExtractable` extension, this logic can be used to 
cache the extracted content on a `DataObject` subclass (usually `File`).

Note: Previously part of the [sphinx module](https://github.com/silverstripe/silverstripe-sphinx).

## Requirements

 * SilverStripe 3.1
 * (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)
 * (optional) [Apache Solr with ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler)
 * (optional) [Apache Tika](http://tika.apache.org/)

### Supported Formats

 * HTML (built-in)
 * PDF (with XPDF or Solr)
 * Microsoft Word, Excel, PowerPoint (Solr)
 * OpenOffice (Solr)
 * CSV (Solr)
 * RTF (Solr)
 * EPUB (Solr)
 * Many others (Tika)

## Installation

The recommended installation is through [composer](http://getcomposer.org).
Add the following to your `composer.json`:

```js
{
	"require": {
		"silverstripe/textextraction": "2.0.x-dev"
	}
}
```

The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
which Composer installs automatically. Alternatively, install Guzzle
through PEAR and ensure it's in your `include_path`.

## Configuration

### Basic

By default, only extraction from HTML documents is supported.
No configuration is required for that, unless you want to make
the content available through your `DataObject` subclass.
In this case, add the following to `mysite/_config/config.yml`:

```yaml
File:
  extensions:
    - FileTextExtractable
```

### XPDF

PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
command-line utility. Follow its installation instructions; the module detects the utility
automatically. You can optionally set the binary path in `mysite/_config/config.yml`:

```yml
PDFTextExtractor:
  binary_location: /my/path/pdftotext
```
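
To confirm that the utility is reachable before relying on auto-detection, a manual check along these lines may help. This is a minimal sketch, not the module's own detection logic: the function names `resolve_pdftotext` and `pdf_to_text` are hypothetical, and the paths are the placeholders used above. It relies only on `pdftotext` accepting `-` as the output file to write to standard output.

```bash
# Resolve the pdftotext binary: prefer an explicit, executable path if one
# is given, otherwise fall back to whatever is found on the PATH.
resolve_pdftotext() {
  local configured="$1"
  if [ -n "$configured" ] && [ -x "$configured" ]; then
    printf '%s\n' "$configured"
  else
    command -v pdftotext
  fi
}

# Extract a PDF to stdout ("-" as the output file means stdout).
# Usage: pdf_to_text <pdf> [explicit-binary-path]
pdf_to_text() {
  local pdf="$1" binary
  binary="$(resolve_pdftotext "$2")" || {
    echo "pdftotext not found" >&2
    return 1
  }
  "$binary" "$pdf" -
}

# Example (placeholder paths):
#   pdf_to_text /my/path/myfile.pdf /my/path/pdftotext
```

If the function prints readable text, the same binary path should be safe to use as `binary_location`.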

### Apache Solr

Apache Solr is a full-text search engine that is often used alongside this module.
More importantly for text extraction, it has bindings to [Apache Tika](http://tika.apache.org/)
through the [ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler) interface.
This allows Solr to inspect the contents of various file formats, such as Office documents and PDF files.
The textextraction module retrieves the output of this service, rather than altering the index.
With the raw text output, you can decide to store it in a database column for fulltext search
in your database driver, or even pass it back to Solr as part of a full index update.

In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):

```yml
SolrCellTextExtractor:
  base_url: 'http://localhost:8983/solr/update/extract'
```

Note that in case you're using multiple cores, you'll need to add the core name to the URL 
(e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
uses multiple cores by default, and comes prepackaged with a Solr server.
It's a stripped-down version of Solr, so follow the module README on how to add
Apache Tika text extraction capabilities.
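
Before wiring the URL into the module, it can be useful to check that the endpoint actually returns extracted text. The sketch below assumes `curl` is installed. The function name `solr_extract_text` and the form field name `myfile` are arbitrary choices for illustration. It uses Solr's `extractOnly=true` parameter, which returns the extracted content without adding anything to the index.

```bash
# Send a file to a Solr ExtractingRequestHandler URL and print the response.
# Usage: solr_extract_text <extract-url> <file>
solr_extract_text() {
  local base_url="$1" file="$2"
  # extractOnly=true  : return the extracted content, do not index it
  # extractFormat=text: plain text instead of XHTML
  # wt=json           : JSON response wrapper
  curl --silent --show-error --fail \
    "${base_url}?extractOnly=true&extractFormat=text&wt=json" \
    -F "myfile=@${file}"
}

# Example (placeholder URL and path; add the core name for multi-core setups):
#   solr_extract_text 'http://localhost:8983/solr/update/extract' /my/path/myfile.pdf
```

An HTTP error here usually points at the URL (for example a missing core name) rather than at the module configuration.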

You need to ensure that some indexable property on your object
returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" below).
The property should be listed in your `SolrIndex` subclass, e.g. as follows:

```php
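class MyDocument extends DataObject {
	// SilverStripe 3.1 expects configuration statics to be private
	private static $db = array('Path' => 'Text');

	// Returns the extracted text, or null if no extractor supports the file
	public function getContent() {
		$extractor = FileTextExtractor::for_file($this->Path);
		return $extractor ? $extractor->getContent($this->Path) : null;
	}
}

class MySolrIndex extends SolrIndex {
	public function init() {
		$this->addClass('MyDocument');
		// 'Content' is the indexable property backed by getContent() above
		$this->addStoredField('Content', 'HTMLText');
	}
}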
```

Note: This isn't an efficient way to process large numbers of files, since
each HTTP request is run synchronously.

### Tika

Support for Apache Tika (1.7 and above) is included. This can be run in one of two ways: Server or CLI.

See [the Apache Tika home page](http://tika.apache.org/1.7/index.html) for instructions on installing and
configuring this.

This extractor works best with the [fileinfo PHP extension](http://php.net/manual/en/book.fileinfo.php)
installed to perform MIME detection. Tika validates support via MIME type rather than file extension.

### Tika - CLI

Ensure that your machine has a `tika` command available that runs the Tika CLI application,
for example through a wrapper script like this (adjust the path to your own install):

```bash
#!/bin/bash
exec java -jar /usr/local/Cellar/tika/1.7/libexec/tika-app-1.7.jar "$@"
```
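
To check that the wrapper works, you could run an extraction by hand. This is a minimal sketch: the function name `tika_cli_text` is hypothetical, and it only assumes that the `tika` command forwards its arguments to tika-app, whose `--text` option prints plain text to standard output. How the module itself invokes the command is not shown here.

```bash
# Check that a `tika` command is available and extract plain text with it.
# Usage: tika_cli_text <file>
tika_cli_text() {
  local file="$1"
  if ! command -v tika >/dev/null 2>&1; then
    echo "tika command not found on PATH" >&2
    return 1
  fi
  # --text asks tika-app for plain text output on stdout
  tika --text "$file"
}

# Example (placeholder path):
#   tika_cli_text /my/path/myfile.docx
```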

### Tika - REST Server

Tika can also be run as a server.

Configure your server endpoint by setting the URL via config:

```yaml
TikaServerTextExtractor:
  server_endpoint: 'http://localhost:9998'
```

Alternatively, this may be specified via the `SS_TIKA_ENDPOINT` directive in your `_ss_environment.php` file, or an environment variable of the same name.

Then start up your server as shown below:

```bash
java -jar tika-server-1.7.jar --host=localhost --port=9998
```
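
Once the server is running, a quick request can confirm that it returns text. The sketch below assumes `curl` is installed and that `SS_TIKA_ENDPOINT` is exported in the current shell, falling back to the default endpoint shown above. The function name `tika_server_text` is hypothetical. The `/tika` resource with an `Accept: text/plain` header is part of the Tika server REST API. Whether the module sends exactly this request is not stated here.

```bash
# PUT a file to the Tika server and print the extracted plain text.
# Usage: tika_server_text <file>
tika_server_text() {
  local file="$1"
  local endpoint="${SS_TIKA_ENDPOINT:-http://localhost:9998}"
  # --upload-file sends the file with HTTP PUT;
  # the Accept header requests plain text rather than XHTML.
  # ${endpoint%/} drops a trailing slash so the URL has no "//".
  curl --silent --show-error --fail \
    --upload-file "$file" \
    -H 'Accept: text/plain' \
    "${endpoint%/}/tika"
}

# Example (placeholder path):
#   SS_TIKA_ENDPOINT='http://localhost:9998' tika_server_text /my/path/myfile.pdf
```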

## Usage

Manual extraction:

```php
$myFile = '/my/path/myfile.pdf';
$extractor = FileTextExtractor::for_file($myFile);
// for_file() may find no extractor for this file, so guard against null
$content = $extractor ? $extractor->getContent($myFile) : null;
```

Extraction with `FileTextExtractable` extension applied:

```php
$myFileObj = File::get()->First();
$content = $myFileObj->getFileContent();
```

This content can also be embedded directly within a template.

```
$MyFile.FileContent
```