# Text Extraction Module
[![Build Status](https://secure.travis-ci.org/silverstripe-labs/silverstripe-textextraction.png)](http://travis-ci.org/silverstripe-labs/silverstripe-textextraction)
## Overview
Provides an extraction API for file content, which can hook into different extractor
engines based on availability and the parsed file format.
The output is always a string: the file content.

Via the `FileTextExtractable` extension, this logic can be used to
cache the extracted content on a `DataObject` subclass (usually `File`).

Note: Previously part of the [sphinx module](https://github.com/silverstripe/silverstripe-sphinx).
## Requirements
* SilverStripe 3.1
* (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)
* (optional) [Apache Solr with ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler)
### Supported Formats
* HTML (built-in)
* PDF (with XPDF or Solr)
* Microsoft Word, Excel, PowerPoint (Solr)
* OpenOffice (Solr)
* CSV (Solr)
* RTF (Solr)
* EPub (Solr)
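
The list above amounts to a lookup from file format to the engines that can handle it. As a rough illustration only (the function name `engines_for_extension` and the choice of file extensions are assumptions made for this sketch, not part of the module's API), it could be expressed like this:

```php
<?php
// Hypothetical helper, not part of the module: maps a file extension to the
// engines listed under "Supported Formats". The extension list is an assumption.
function engines_for_extension($path) {
    $map = array(
        'html' => array('built-in'),
        'htm'  => array('built-in'),
        'pdf'  => array('xpdf', 'solr'),
        'doc'  => array('solr'),
        'xls'  => array('solr'),
        'ppt'  => array('solr'),
        'odt'  => array('solr'),
        'csv'  => array('solr'),
        'rtf'  => array('solr'),
        'epub' => array('solr'),
    );
    // Compare case-insensitively so "Report.PDF" matches "pdf".
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // Unknown extensions yield an empty list rather than an error.
    return isset($map[$ext]) ? $map[$ext] : array();
}
```

An empty result means none of the engines in the list applies to the file.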
## Installation
The recommended installation is through [composer](http://getcomposer.org).
Add the following to your `composer.json`:
```js
{
    "require": {
        "silverstripe/textextraction": "*"
    }
}
```
The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
which is automatically checked out by composer. Alternatively, install Guzzle
through PEAR and ensure it's in your `include_path`.
## Configuration
### Basic
By default, only extraction from HTML documents is supported.
No configuration is required for that, unless you want to make
the content available through your `DataObject` subclass.
In this case, add the following to `mysite/_config.php`:

    DataObject::add_extension('File', 'FileTextExtractable');

### XPDF
PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
commandline utility. Follow the XPDF installation instructions; the presence of the `pdftotext`
binary is detected automatically. You can optionally set the binary path in `mysite/_config/config.yml`:
```yml
PDFTextExtractor:
  binary_location: /my/path/pdftotext
```
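
To check outside SilverStripe that the binary works, `pdftotext` can be called directly; passing `-` as the output file writes the extracted text to stdout. The sketch below only builds and runs such a command. The helper names and the default binary name are assumptions, and this is not necessarily how `PDFTextExtractor` invokes the tool.

```php
<?php
// Hypothetical helpers for testing a pdftotext install by hand.
function pdftotext_command($pdfPath, $binary = 'pdftotext') {
    // "-" as the output file sends the extracted text to stdout.
    return escapeshellarg($binary) . ' ' . escapeshellarg($pdfPath) . ' -';
}

function pdftotext_extract($pdfPath, $binary = 'pdftotext') {
    $output = array();
    $status = 1;
    exec(pdftotext_command($pdfPath, $binary), $output, $status);
    // A non-zero exit status signals failure, e.g. the binary was not
    // found or the PDF could not be opened.
    return $status === 0 ? implode("\n", $output) : null;
}
```

Calling `pdftotext_extract('/my/path/myfile.pdf', '/my/path/pdftotext')` would return the text, or `null` if the command fails.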
### Apache Solr
Apache Solr is a fulltext search engine that is often used alongside this module.
More importantly for text extraction, it has bindings to [Apache Tika](http://tika.apache.org/)
through the [ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler) interface.
This allows Solr to inspect the contents of various file formats, such as Office documents and PDF files.
The textextraction module only retrieves the output of this service; it does not alter the index.
You can store the raw text output in a database column for fulltext search
through your database driver, or pass it back to Solr as part of a full index update.

In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):
```yml
SolrCellTextExtractor:
  base_url: 'http://localhost:8983/solr/update/extract'
```
Note that if you're using multiple cores, you'll need to add the core name to the URL
(e.g. `http://localhost:8983/solr/PageSolrIndex/update/extract`).
The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
uses multiple cores by default, and comes prepackaged with a Solr server.
It's a stripped-down version of Solr; follow the module README on how to add
Apache Tika text extraction capabilities.
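
The core-specific URL follows the pattern in the note above. As a small sketch (the function `solr_extract_url` is a hypothetical helper for illustration, not something the module provides), the value for `base_url` could be assembled like this:

```php
<?php
// Hypothetical helper: builds the ExtractingRequestHandler URL,
// inserting the core name when one is given.
function solr_extract_url($solrRoot, $core = null) {
    // Drop any trailing slash so the parts join cleanly.
    $url = rtrim($solrRoot, '/');
    if ($core !== null && $core !== '') {
        $url .= '/' . rawurlencode($core);
    }
    return $url . '/update/extract';
}
```

For example, `solr_extract_url('http://localhost:8983/solr', 'PageSolrIndex')` produces the multi-core URL quoted above.
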
You need to ensure that some indexable property on your object
returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" below).
The property should be listed in your `SolrIndex` subclass, e.g. as follows:
```php
class MyDocument extends DataObject {
    // Stores the path of the file to extract text from
    static $db = array('Path' => 'Text');

    // Exposed as the "Content" field that the index below picks up
    function getContent() {
        $extractor = FileTextExtractor::for_file($this->Path);
        return $extractor ? $extractor->getContent($this->Path) : null;
    }
}

class MySolrIndex extends SolrIndex {
    function init() {
        $this->addClass('MyDocument');
        $this->addFulltextField('Content', 'HTMLText');
    }
}
```
Note: This isn't a terribly efficient way to process large numbers of files, since
each HTTP request runs synchronously.
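
One way to soften this cost is to avoid repeating a request for a file that has not changed. The sketch below is an illustration under stated assumptions, not module code: `cached_extract` and its cache layout are hypothetical, and `$extract` stands in for whatever callable performs the real extraction.

```php
<?php
// Hypothetical in-memory cache keyed by path and modification time, so an
// unchanged file is only passed to the extractor once per process.
function cached_extract($path, $extract, array &$cache) {
    // A changed mtime produces a new key, which forces a fresh extraction.
    $mtime = file_exists($path) ? filemtime($path) : 0;
    $key = $path . ':' . $mtime;
    if (!array_key_exists($key, $cache)) {
        $cache[$key] = call_user_func($extract, $path);
    }
    return $cache[$key];
}
```

With the `FileTextExtractable` extension applied, extracted content can instead be cached on the record itself, as described in the overview.
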
## Usage
Manual extraction:

    $myFile = '/my/path/myfile.pdf';
    $extractor = FileTextExtractor::for_file($myFile);
    // Guard against for_file() finding no suitable extractor
    $content = $extractor ? $extractor->getContent($myFile) : null;

Extraction with the `FileTextExtractable` extension applied:

    $myFileObj = File::get()->First();
    $content = $myFileObj->extractFileAsText();