# Text extraction module

[![Build Status](https://secure.travis-ci.org/silverstripe-labs/silverstripe-textextraction.png)](http://travis-ci.org/silverstripe-labs/silverstripe-textextraction)
[![Code Quality](http://img.shields.io/scrutinizer/g/silverstripe-labs/silverstripe-textextraction.svg?style=flat-square)](https://scrutinizer-ci.com/g/silverstripe-labs/silverstripe-textextraction)
[![Version](http://img.shields.io/packagist/v/silverstripe/textextraction.svg?style=flat-square)](https://packagist.org/packages/silverstripe/silverstripe-textextraction)
[![License](http://img.shields.io/packagist/l/silverstripe/textextraction.svg?style=flat-square)](license.md)

Provides a text extraction API for file content that can hook into different extractor
engines, depending on availability and the file format being parsed. The output is always a string containing the file's content.

Via the `FileTextExtractable` extension, this logic can be used to
cache the extracted content on a `DataObject` subclass (usually `File`).

The module supports text extraction on the following file formats:

* HTML (built-in)
* PDF (with XPDF or Solr)
* Microsoft Word, Excel, PowerPoint (Solr)
* OpenOffice (Solr)
* CSV (Solr)
* RTF (Solr)
* EPUB (Solr)
* Many others (Tika)

## Requirements

* SilverStripe ^3.1
* (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)
* (optional) [Apache Solr with ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler)
* (optional) [Apache Tika](http://tika.apache.org/)

## Installation

```sh
composer require silverstripe/textextraction
```

The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
which Composer installs automatically. Alternatively, install Guzzle
through PEAR and ensure it's in your `include_path`.

## Configuration

### Basic

By default, only extraction from HTML documents is supported.
No configuration is required for that, unless you want to make
the content available through your `DataObject` subclass.
In this case, add the following to `mysite/_config/config.yml`:

```yaml
File:
  extensions:
    - FileTextExtractable
```

By default, any extracted content is cached against the database row.
To stay within common size constraints for the SQL queries this operation requires,
the cache truncates content beyond a maximum character length (default: 500000).
You can configure this value through `FileTextCache_Database.max_content_length` in your YAML configuration.
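
As an illustration, a lower limit could be set along these lines. This is a minimal sketch: the value `100000` and the fragment name `mytextextractioncache` are arbitrary assumptions, and the `After` header assumes the module's own configuration fragment is named `textextraction`.

```yaml
---
Name: mytextextractioncache # hypothetical fragment name
After: '#textextraction'
---
FileTextCache_Database:
  max_content_length: 100000 # assumed example value; the default is 500000
```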

Alternatively, extracted content can be cached using `SS_Cache` to prevent excessive database growth.
To swap out the cache backend, use the following YAML configuration:

```yaml
---
Name: mytextextraction
After: '#textextraction'
---
Injector:
  FileTextCache: FileTextCache_SSCache
FileTextCache_SSCache:
  lifetime: 3600 # Number of seconds to cache content for
```

### XPDF

PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
command-line utility. Follow its installation instructions; its presence is detected
automatically on \*nix operating systems. You can optionally set the binary path (required for Windows) in `mysite/_config/config.yml`:

```yml
PDFTextExtractor:
  binary_location: /my/path/pdftotext
```

### Apache Solr

Apache Solr is a full-text search engine, and is often used
alongside this module. More importantly for our purposes, it has bindings to [Apache Tika](http://tika.apache.org/)
through the [ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler) interface.
This allows Solr to inspect the contents of various file formats, such as Office documents and PDF files.
The textextraction module retrieves the output of this service, rather than altering the index.
With the raw text output, you can decide to store it in a database column for full-text search
in your database driver, or even pass it back to Solr as part of a full index update.

In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):

```yml
SolrCellTextExtractor:
  base_url: 'http://localhost:8983/solr/update/extract'
```

Note that if you're using multiple cores, you'll need to add the core name to the URL
(e.g. `http://localhost:8983/solr/PageSolrIndex/update/extract`).
The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
uses multiple cores by default, and comes prepackaged with a Solr server.
It's a stripped-down version of Solr; follow the module README on how to add
Apache Tika text extraction capabilities.
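
For a multi-core setup, the configuration might look like the following sketch. It reuses the `PageSolrIndex` core name from the example URL; substitute the name of your own core.

```yaml
SolrCellTextExtractor:
  # 'PageSolrIndex' is only an example core name; replace it with yours
  base_url: 'http://localhost:8983/solr/PageSolrIndex/update/extract'
```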

You need to ensure that some indexable property on your object
returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
or by writing your own method around `FileTextExtractor->getContent()`.
The property should be listed in your `SolrIndex` subclass, e.g. as follows:

```php
class MyDocument extends DataObject {
	private static $db = array('Path' => 'Text');

	// Exposed as the indexable "Content" property
	public function getContent() {
		// for_file() may not return an extractor for unsupported formats
		$extractor = FileTextExtractor::for_file($this->Path);
		return $extractor ? $extractor->getContent($this->Path) : null;
	}
}

class MySolrIndex extends SolrIndex {
	public function init() {
		$this->addClass('MyDocument');
		$this->addStoredField('Content', 'HTMLText');
	}
}
```

Note: This isn't a terribly efficient way to process large numbers of files, since
each HTTP request is run synchronously.

### Tika

Support for Apache Tika (1.8 and above) is included. Tika can be run in one of two ways: as a server or from the CLI.

See [the Apache Tika home page](http://tika.apache.org/1.8/index.html) for instructions on installing and
configuring it. Download the latest `tika-app` for running as a CLI script, or `tika-server` if you're planning
to have it running constantly in the background. Starting Tika as a CLI script for every extraction request
is fairly slow, so we recommend running it as a server.
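
If you run Tika as a server, the module needs to know where to reach it. The fragment below is a sketch only: it assumes the server extractor reads its endpoint from a `TikaServerTextExtractor.server_endpoint` setting, and that `tika-server` is listening on `localhost` port 9998. Neither is stated in this README, so check both against your installed version of the module and your Tika setup.

```yaml
# Assumed setting name and address; verify against your module version
TikaServerTextExtractor:
  server_endpoint: 'http://localhost:9998'
```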
|
2015-02-18 03:31:38 +01:00
|
|
|
|
2015-11-06 23:50:19 +01:00
|
|
|
## Bugtracker
|
2012-08-22 17:52:08 +02:00
|
|
|
|
2017-11-22 21:39:34 +01:00
|
|
|
Bugs are tracked in the issues section of this repository. Before submitting an issue please read over
|
|
|
|
existing issues to ensure yours is unique.
|
2012-08-22 17:52:08 +02:00
|
|
|
|
2015-11-06 23:50:19 +01:00
|
|
|
If the issue does look like a new bug:
|
2012-08-22 17:52:08 +02:00
|
|
|
|
2015-11-06 23:50:19 +01:00
|
|
|
- Create a new issue
|
|
|
|
- Describe the steps required to reproduce your issue, and the expected outcome. Unit tests, screenshots
|
|
|
|
and screencasts can help here.
|
2017-11-22 21:39:34 +01:00
|
|
|
- Describe your environment as detailed as possible: SilverStripe version, Browser, PHP version,
|
2015-11-06 23:50:19 +01:00
|
|
|
Operating System, any installed SilverStripe modules.
|
2015-02-25 02:44:03 +01:00
|
|
|
|
2015-11-06 23:50:19 +01:00
|
|
|
Please report security issues to security@silverstripe.org directly. Please don't file security issues in the bugtracker.
|
2015-02-25 02:44:03 +01:00
|
|
|
|
2015-11-06 23:50:19 +01:00
|
|
|
## Development and contribution
|
2017-11-22 21:39:34 +01:00
|
|
|
If you would like to make contributions to the module please ensure you raise a pull request and discuss
|
2015-11-06 23:50:19 +01:00
|
|
|
with the module maintainers.
|