# Text Extraction Module

## Overview

Provides an extraction API for file content, which can hook into different extractor
engines based on availability and the parsed file format.
The output is always a string: the file content.
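
As a rough illustration of that selection step, the sketch below picks the first engine that is both available on the system and supports the file's extension. Every name in it (`SketchExtractor`, `SketchPlainTextExtractor`, `SketchUnavailableExtractor`, `sketch_extractor_for()`) is hypothetical and chosen for this example only; the module's own classes may be organised differently.

```php
<?php
// Hypothetical sketch: none of these names come from the module itself.
interface SketchExtractor
{
    /** Whether the engine can run on this system (e.g. a binary is installed). */
    public function isAvailable();

    /** Lowercase file extensions this engine can handle. */
    public function supportedExtensions();

    /** Returns the file content as a string. */
    public function getContent($path);
}

// An engine that only needs PHP itself, so it is always available.
class SketchPlainTextExtractor implements SketchExtractor
{
    public function isAvailable() { return true; }
    public function supportedExtensions() { return array('txt'); }
    public function getContent($path) { return (string) file_get_contents($path); }
}

// Stand-in for an engine whose external tool is missing on this system.
class SketchUnavailableExtractor implements SketchExtractor
{
    public function isAvailable() { return false; }
    public function supportedExtensions() { return array('txt', 'pdf'); }
    public function getContent($path) { return ''; }
}

// Returns the first available engine that supports the extension, or null.
function sketch_extractor_for($path, array $extractors)
{
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    foreach ($extractors as $extractor) {
        if ($extractor->isAvailable()
            && in_array($extension, $extractor->supportedExtensions(), true)) {
            return $extractor;
        }
    }
    return null;
}
```

In this sketch, the order of the `$extractors` array decides which engine wins when several support the same format, and callers have to handle the `null` case for unsupported files.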

Via the `FileTextExtractable` extension, this logic can be used to 
cache the extracted content on a `DataObject` subclass (usually `File`).
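
The caching idea can be sketched without SilverStripe at all: extract once, keep the result on the object, and reuse it until the caller asks for a fresh extraction. The class `SketchCachedFile`, its `contentCache` property and the `$disableCache` parameter are assumptions made for this example, not the extension's confirmed API.

```php
<?php
// Hypothetical sketch: the class and property names are illustrative only.
class SketchCachedFile
{
    public $path;

    /** Cached text, or null if nothing has been extracted yet. */
    public $contentCache = null;

    /** Callable that takes a path and returns the extracted text. */
    private $extract;

    public function __construct($path, $extract)
    {
        $this->path = $path;
        $this->extract = $extract;
    }

    // Returns the cached text when present; otherwise extracts and stores it.
    public function extractAsText($disableCache = false)
    {
        if (!$disableCache && $this->contentCache !== null) {
            return $this->contentCache;
        }
        $text = (string) call_user_func($this->extract, $this->path);
        $this->contentCache = $text;
        return $text;
    }
}
```

A real `DataObject` would persist the cached value in a database field rather than in memory, so the expensive extraction survives across requests.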

Note: Previously part of the [sphinx module](https://github.com/silverstripe/silverstripe-sphinx).

## Requirements

 * SilverStripe 3.0
 * (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)

## Configuration

No configuration is required, unless you want to make
the content available through your `DataObject` subclass.
In this case, add the following to `mysite/_config.php`:

	DataObject::add_extension('File', 'FileTextExtractable');

## Usage

Manual extraction:

	$myFile = '/my/path/myfile.pdf';
	// Pick an extractor engine suited to this file
	$extractor = FileTextExtractor::for_file($myFile);
	// The result is the file content as a string
	$content = $extractor->getContent($myFile);

DataObject extraction:

	$myFileObj = File::get()->First();
	$content = $myFileObj->extractFileAsText();