# Text Extraction Module
## Overview
Provides an extraction API for file content, which can hook into different extractor
engines based on availability and the parsed file format.
The output is always a string: the file content.

Via the `FileTextExtractable` extension, this logic can be used to
cache the extracted content on a `DataObject` subclass (usually `File`).

Note: This module was previously part of the [sphinx module](https://github.com/silverstripe/silverstripe-sphinx).
## Requirements
* SilverStripe 3.0
* (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)
## Configuration
No configuration is required, unless you want to make
the content available through your `DataObject` subclass.
In this case, add the following to `mysite/_config.php`:

    DataObject::add_extension('File', 'FileTextExtractable');

## Usage
Manual extraction:

    $myFile = '/my/path/myfile.pdf';
    $extractor = FileTextExtractor::for_file($myFile);
    $content = $extractor->getContent($myFile);
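
As a rough illustration of how a call such as `FileTextExtractor::for_file()` could choose an engine by file extension and availability, here is a minimal, self-contained sketch. It is not the module's actual implementation: the `SketchExtractor` interface, both example classes and the `sketch_extractor_for()` function are hypothetical names invented for this example, and the availability check assumes a POSIX shell. Engines are tried in the order given, so the caller decides which engine wins when several support the same format.

```php
<?php
// Hypothetical sketch only: none of these names belong to the module.

interface SketchExtractor
{
    /** @return bool Whether the engine can be used on this system. */
    public function isAvailable();

    /** @return bool Whether the engine handles this lowercase extension. */
    public function supportsExtension($extension);

    /** @return string The text content of the file. */
    public function getContent($path);
}

class PlainTextSketchExtractor implements SketchExtractor
{
    public function isAvailable()
    {
        return true; // pure PHP, no external dependency
    }

    public function supportsExtension($extension)
    {
        return in_array($extension, array('txt', 'md'), true);
    }

    public function getContent($path)
    {
        $content = file_get_contents($path);
        return $content === false ? '' : $content;
    }
}

class PdfToTextSketchExtractor implements SketchExtractor
{
    public function isAvailable()
    {
        // Assumes a POSIX shell: `command -v` prints a path if the binary exists.
        return trim((string) shell_exec('command -v pdftotext 2>/dev/null')) !== '';
    }

    public function supportsExtension($extension)
    {
        return $extension === 'pdf';
    }

    public function getContent($path)
    {
        // A "-" output argument makes pdftotext write the text to stdout.
        return (string) shell_exec('pdftotext ' . escapeshellarg($path) . ' -');
    }
}

/**
 * Return the first engine that supports the file's extension and is
 * available on this system, or null if there is none.
 */
function sketch_extractor_for($path, array $extractors)
{
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    foreach ($extractors as $extractor) {
        if ($extractor->supportsExtension($extension) && $extractor->isAvailable()) {
            return $extractor;
        }
    }
    return null;
}
```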

DataObject extraction:

    $myFileObj = File::get()->First();
    $content = $myFileObj->extractFileAsText();