
Text Extraction Module

Overview

Provides an API for extracting text from files. It selects an extractor engine based on which engines are available and on the format of the file being parsed. The output is always a string containing the file content.
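The selection idea can be sketched in a few lines. This is a rough illustration, not the module's actual implementation: the interface, the two toy engines and the pickExtractor() function are hypothetical names invented for this example. It assumes each engine can report whether it is available and which file extensions it handles.

```php
<?php
// Illustrative sketch only: every name below is hypothetical, not one of the module's classes.

interface SketchExtractor
{
    // True when the engine's dependencies (binaries, PHP extensions) are present.
    public function isAvailable();

    // True when the engine handles the given lower-case file extension.
    public function supportsExtension($extension);

    // Returns the file content as a plain string.
    public function getContent($path);
}

class PlainTextSketchExtractor implements SketchExtractor
{
    public function isAvailable()
    {
        return true;
    }

    public function supportsExtension($extension)
    {
        return $extension === 'txt';
    }

    public function getContent($path)
    {
        return (string) file_get_contents($path);
    }
}

class HtmlSketchExtractor implements SketchExtractor
{
    public function isAvailable()
    {
        return true;
    }

    public function supportsExtension($extension)
    {
        return in_array($extension, array('html', 'htm'), true);
    }

    public function getContent($path)
    {
        // Drop markup and surrounding whitespace, keep the text.
        return trim(strip_tags((string) file_get_contents($path)));
    }
}

// Returns the first available engine that supports the file's extension, or null.
function pickExtractor(array $engines, $path)
{
    // Lower-casing makes "report.HTML" and "report.html" match the same engine.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    foreach ($engines as $engine) {
        if ($engine->isAvailable() && $engine->supportsExtension($extension)) {
            return $engine;
        }
    }
    return null;
}

// Usage: write a small HTML file with an upper-case extension and extract its text.
$path = sys_get_temp_dir() . '/sketch_' . uniqid() . '.HTML';
file_put_contents($path, '<p>Hello</p>');
$engines = array(new PlainTextSketchExtractor(), new HtmlSketchExtractor());
$engine = pickExtractor($engines, $path);
$content = $engine !== null ? $engine->getContent($path) : '';
unlink($path);
```

Returning null when no engine matches leaves it to the caller to decide how to treat unsupported formats.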

Via the FileTextExtractable extension, this logic can be used to cache the extracted content on a DataObject subclass (usually File).
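Caching extracted content on an object might look roughly like the sketch below. It is not the extension's real code: the CachedTextSketch class, the extractAsText() method and its $forceParse flag are hypothetical, and the private property merely stands in for a field that would be persisted with the DataObject.

```php
<?php
// Hypothetical sketch of caching extracted text on an object; not the module's code.

class CachedTextSketch
{
    private $path;
    private $extract;       // callable that takes a path and returns a string
    private $cache = null;  // stands in for a persisted field on the DataObject

    public function __construct($path, $extract)
    {
        $this->path = $path;
        $this->extract = $extract;
    }

    // Runs the extractor on first use, or again when $forceParse is true.
    public function extractAsText($forceParse = false)
    {
        if ($forceParse || $this->cache === null) {
            $this->cache = call_user_func($this->extract, $this->path);
        }
        return $this->cache;
    }
}

// Usage: the closure counts how often the (expensive) extraction actually runs.
$calls = 0;
$holder = new CachedTextSketch('/my/path/myfile.txt', function ($path) use (&$calls) {
    $calls++;
    return 'content of ' . $path;
});
$first = $holder->extractAsText();
$second = $holder->extractAsText(); // served from the cache, extractor not called again
```

Storing the result next to the object means repeated reads, for example during search indexing, do not re-run the extractor for an unchanged file.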

Note: This module was previously part of the sphinx module.

Requirements

  • SilverStripe 3.0
  • (optional) XPDF (pdftotext utility)

Configuration

No configuration is required, unless you want to make the content available through your DataObject subclass. In this case, add the following to mysite/_config.php:

DataObject::add_extension('File', 'FileTextExtractable');

Usage

Manual extraction:

$myFile = '/my/path/myfile.pdf';
$extractor = FileTextExtractor::for_file($myFile);
$content = $extractor->getContent($myFile);

DataObject extraction:

$myFileObj = File::get()->First();
$content = $myFileObj->extractFileAsText();