Provides an extraction API for file content, which can hook into different extractor engines based on availability and the parsed file format. The output is always a string: the file content.

Via the FileTextExtractable extension, this logic can be used to cache the extracted content on a DataObject subclass (usually File).

Note: Previously part of the sphinx module.

Requirements

SilverStripe 3.0
(optional) XPDF (pdftotext utility)

Configuration

No configuration is required, unless you want to make the content available through your DataObject subclass. In this case, add the following to mysite/_config.php:

DataObject::add_extension('File', 'FileTextExtractable');

Usage

Manual extraction:

$myFile = '/my/path/myfile.pdf';
$extractor = FileTextExtractor::for_file($myFile);
$content = $extractor->getContent($myFile);

DataObject extraction:

$myFileObj = File::get()->First();
$content = $myFileObj->extractFileAsText();