
Text Extraction Module

Overview

Provides an API for extracting text content from files. It selects an extractor engine based on which engines are available and on the format of the file being parsed. The output is always a string: the file content.

Via the FileTextExtractable extension, this logic can be used to cache the extracted content on a DataObject subclass (usually File).

Note: This code was previously part of the sphinx module.

Requirements

  • SilverStripe 3.0
  • (optional) XPDF (pdftotext utility)
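
Because XPDF is optional, it can help to check whether the pdftotext binary is on the PATH before relying on PDF extraction. The following is a minimal sketch, assuming a POSIX shell (it relies on `command -v`) and that PHP's exec() is enabled. The helper name `sketch_has_binary` is hypothetical and not part of this module:

```php
<?php
/**
 * Hypothetical helper (not part of this module): reports whether an
 * executable with the given name can be found on the PATH.
 * Assumes a POSIX shell, since it relies on `command -v`.
 */
function sketch_has_binary($name) {
    $output = array();
    $exitCode = 1;
    // `command -v` exits with 0 only when the command can be resolved.
    exec('command -v ' . escapeshellarg($name) . ' 2>/dev/null', $output, $exitCode);
    return $exitCode === 0;
}

if (!sketch_has_binary('pdftotext')) {
    echo "pdftotext not found; PDF extraction will likely be unavailable\n";
}
```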

Configuration

No configuration is required, unless you want to make the content available through your DataObject subclass. In this case, add the following to mysite/_config.php:

DataObject::add_extension('File', 'FileTextExtractable');

Usage

Manual extraction:

$myFile = '/my/path/myfile.pdf';
// Pick an extractor engine that supports this file
$extractor = FileTextExtractor::for_file($myFile);
// Get the file content as a string
$content = $extractor->getContent($myFile);
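
To illustrate the idea of choosing an engine by availability and file format, here is a standalone sketch of one way such a dispatch could work. It is not the module's actual implementation: `SketchExtractor`, `SketchPlainTextExtractor`, `forFile()` and `isAvailable()` are hypothetical names invented for this example, and returning null when nothing matches is an assumption.

```php
<?php
// Hypothetical sketch: these classes are illustrative only and are not
// the module's actual implementation.

abstract class SketchExtractor {
    /** @return bool Whether this engine can run on the current system. */
    abstract public function isAvailable();

    /** @return string[] Lowercase file extensions this engine handles. */
    abstract public function supportedExtensions();

    /** @return string The text content of the file at $path. */
    abstract public function getContent($path);

    /**
     * Returns the first available engine that supports the file's
     * extension, or null when no engine matches (an assumption of
     * this sketch).
     */
    public static function forFile($path, array $engines) {
        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        foreach ($engines as $engine) {
            if ($engine->isAvailable()
                && in_array($extension, $engine->supportedExtensions(), true)) {
                return $engine;
            }
        }
        return null;
    }
}

class SketchPlainTextExtractor extends SketchExtractor {
    public function isAvailable() { return true; }
    public function supportedExtensions() { return array('txt'); }
    public function getContent($path) { return (string) file_get_contents($path); }
}

// Usage: write a temporary file, pick an engine, and extract its text.
$path = sys_get_temp_dir() . '/sketch-extraction-example.txt';
file_put_contents($path, 'Hello extraction');

$engine = SketchExtractor::forFile($path, array(new SketchPlainTextExtractor()));
// Guard against the "no engine found" case before extracting.
$content = $engine ? $engine->getContent($path) : '';
unlink($path);
```

A similar guard may be worth adding around FileTextExtractor::for_file() in real code, on the assumption that it can come back empty for unsupported file types.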

DataObject extraction:

$myFileObj = File::get()->first();
$content = $myFileObj->extractFileAsText();