
# Text Extraction Module

## Overview

Provides an API for extracting text from files. It can hook into different extractor
engines, chosen by what is available on the system and by the file's format.
The output is always a string containing the file content.

Via the `FileTextExtractable` extension, this logic can be used to
cache the extracted content on a `DataObject` subclass (usually `File`).

Note: Previously part of the [sphinx module](https://github.com/silverstripe/silverstripe-sphinx).
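
The engine selection described in this overview can be pictured with a small, self-contained sketch. The names `SketchExtractor`, `PlainTextSketchExtractor`, `PdfSketchExtractor` and `sketch_for_file()` are hypothetical and are not this module's API. The sketch only assumes that each engine can report whether it is available and which file extensions it handles, and that the PDF engine shells out to `pdftotext`.

```php
<?php
// Hypothetical sketch of "pick an engine by availability and file format".
// None of these names belong to the module itself.
interface SketchExtractor {
	public function isAvailable();
	public function supportsExtension($extension);
	public function getContent($path);
}

class PlainTextSketchExtractor implements SketchExtractor {
	// Reading a plain text file needs no external tool
	public function isAvailable() {
		return true;
	}

	public function supportsExtension($extension) {
		return strtolower($extension) === 'txt';
	}

	public function getContent($path) {
		return (string) file_get_contents($path);
	}
}

class PdfSketchExtractor implements SketchExtractor {
	protected $binary;

	// Assumption: the caller knows where the `pdftotext` binary lives
	public function __construct($binary) {
		$this->binary = $binary;
	}

	public function isAvailable() {
		return is_executable($this->binary);
	}

	public function supportsExtension($extension) {
		return strtolower($extension) === 'pdf';
	}

	public function getContent($path) {
		// "-" as the output file makes pdftotext write to stdout
		$cmd = escapeshellarg($this->binary) . ' ' . escapeshellarg($path) . ' -';
		return (string) shell_exec($cmd);
	}
}

// Returns the first engine that is available and supports the file's
// extension, or null when no engine matches.
function sketch_for_file($path, array $engines) {
	$extension = pathinfo($path, PATHINFO_EXTENSION);
	foreach ($engines as $engine) {
		if ($engine->isAvailable() && $engine->supportsExtension($extension)) {
			return $engine;
		}
	}
	return null;
}

$engines = array(
	new PdfSketchExtractor('/usr/bin/pdftotext'),
	new PlainTextSketchExtractor(),
);
$path = '/my/path/notes.txt';
$extractor = sketch_for_file($path, $engines);
// Only read the file if an engine matched and the file actually exists
$content = ($extractor !== null && is_file($path)) ? $extractor->getContent($path) : '';
```

In this sketch the caller still has to handle the case where no engine matches, which is why `sketch_for_file()` returns `null` rather than guessing.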

## Requirements

 * SilverStripe 3.0
 * (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)

## Configuration

No configuration is required unless you want to make
the extracted content available through your `DataObject` subclass.
In that case, add the following to `mysite/_config.php`:

	DataObject::add_extension('File', 'FileTextExtractable');

## Usage

Manual extraction:

	// Path to the file on disk
	$myFile = '/my/path/myfile.pdf';
	// Pick an extractor engine that matches this file
	$extractor = FileTextExtractor::for_file($myFile);
	$content = $extractor->getContent($myFile);

DataObject extraction:

	// Requires the FileTextExtractable extension (see Configuration)
	$myFileObj = File::get()->first();
	$content = $myFileObj->extractFileAsText();
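
The idea of caching extracted content on a record can be sketched without SilverStripe. `CachedTextSketch`, its `getText()` method and the `$forceParse` flag are hypothetical names chosen for illustration, not the extension's implementation. The sketch assumes that the extracted text is stored with the record after the first extraction and reused on later calls.

```php
<?php
// Hypothetical sketch of "extract once, then reuse"; not the module's code.
class CachedTextSketch {
	protected $path;
	protected $extract;       // callable that turns a path into text
	protected $cache = null;  // stands in for a database field on the record
	public $extractions = 0;  // counts how often the extractor actually ran

	public function __construct($path, $extract) {
		$this->path = $path;
		$this->extract = $extract;
	}

	// Returns the cached text, extracting it first if there is no cached
	// copy yet or if the caller forces a fresh parse.
	public function getText($forceParse = false) {
		if ($this->cache !== null && !$forceParse) {
			return $this->cache;
		}
		$this->extractions++;
		$this->cache = (string) call_user_func($this->extract, $this->path);
		return $this->cache;
	}
}

// A stand-in extractor that does not touch the file system
$record = new CachedTextSketch('/my/path/myfile.pdf', function ($path) {
	return 'text of ' . basename($path);
});
$first = $record->getText();   // runs the extractor
$second = $record->getText();  // served from the cached copy
```

Keeping the cached copy next to the record means that repeated reads, for example while building a search index, do not run an external tool again for every call.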