# Text extraction module

[![Build Status](https://secure.travis-ci.org/silverstripe/silverstripe-textextraction.png)](http://travis-ci.org/silverstripe/silverstripe-textextraction)
[![Code Quality](http://img.shields.io/scrutinizer/g/silverstripe/silverstripe-textextraction.svg?style=flat)](https://scrutinizer-ci.com/g/silverstripe/silverstripe-textextraction)
[![Version](http://img.shields.io/packagist/v/silverstripe/textextraction.svg?style=flat)](https://packagist.org/packages/silverstripe/silverstripe-textextraction)
[![License](http://img.shields.io/packagist/l/silverstripe/textextraction.svg?style=flat)](license.md)

Provides a text extraction API for file content that can hook into different extractor
engines, depending on which engines are available and on the format of the file being parsed.
The output is always a string containing the file's content.

Via the `FileTextExtractable` extension, this logic can be used to
cache the extracted content on a `DataObject` subclass (usually `File`).

The module supports text extraction on the following file formats:

* HTML (built-in)
* PDF (with XPDF or Solr)
* Microsoft Word, Excel, PowerPoint (Solr)
* OpenOffice (Solr)
* CSV (Solr)
* RTF (Solr)
* EPUB (Solr)
* Many others (Tika)
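The format list above can be read as a simple lookup from file type to engine. The sketch below illustrates that reading only; it is not the module's own selection logic. The helper name `pick_engine` and the specific file extensions are assumptions chosen for the example.

```sh
#!/bin/sh
# Hypothetical helper: map a file name to the engine listed for its format.
# The extension lists are assumptions; the module may decide differently.
pick_engine() {
  # Lower-case the extension so "NOTES.HTML" and "notes.html" match alike.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    html|htm)
      echo "built-in" ;;
    pdf)
      # Prefer the local XPDF utility when it is installed, else fall back to Solr.
      if command -v pdftotext >/dev/null 2>&1; then
        echo "xpdf"
      else
        echo "solr"
      fi ;;
    doc|docx|xls|xlsx|ppt|pptx|odt|ods|odp|csv|rtf|epub)
      echo "solr" ;;
    *)
      echo "tika" ;;
  esac
}

pick_engine "report.pdf"
pick_engine "notes.HTML"
```

The PDF branch shows the "based on availability" idea: the answer depends on whether `pdftotext` is on the `PATH` of the machine running the script.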

## Requirements

* SilverStripe ^3.1
* (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)
* (optional) [Apache Solr with ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler)
* (optional) [Apache Tika](http://tika.apache.org/)
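Because the extractor engines are optional, it can help to check which command-line tools are present before configuring the module. The following is a minimal sketch, not part of the module: `have` and `report` are hypothetical helper names, and checking for `java` assumes Tika or Solr would run on the same machine.

```sh
#!/bin/sh
# Sketch: report which optional command-line tools are on the PATH.
# `have` and `report` are hypothetical helpers, not part of the module.
have() {
  command -v "$1" >/dev/null 2>&1
}

report() {
  if have "$1"; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

report pdftotext   # XPDF utility used for PDF extraction
report java        # assumption: Tika or Solr runs locally and needs a JVM
```

A remote Solr or Tika server would not show up in this check, since only local executables are inspected.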

## Installation

```sh
composer require silverstripe/textextraction
```

The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
which Composer installs automatically. Alternatively, install Guzzle
through PEAR and ensure it's in your `include_path`.

## Documentation

* [Configuration](docs/en/configuration.md)
* [Developer documentation](/docs/en/developer-docs.md)

## Bugtracker

Bugs are tracked in the issues section of this repository. Before submitting an issue, please read over
existing issues to ensure yours is unique.

If the issue does look like a new bug:

- Create a new issue
- Describe the steps required to reproduce your issue, and the expected outcome. Unit tests, screenshots
  and screencasts can help here.
- Describe your environment in as much detail as possible: SilverStripe version, browser, PHP version,
  operating system, any installed SilverStripe modules.

Please report security issues to security@silverstripe.org directly. Please don't file security issues in the bugtracker.

## Development and contribution

If you would like to make contributions to the module, please ensure you raise a pull request and discuss it
with the module maintainers.