# Text extraction module

[![CI](https://github.com/silverstripe/silverstripe-textextraction/actions/workflows/ci.yml/badge.svg)](https://github.com/silverstripe/silverstripe-textextraction/actions/workflows/ci.yml)
[![Silverstripe supported module](https://img.shields.io/badge/silverstripe-supported-0071C4.svg)](https://www.silverstripe.org/software/addons/silverstripe-commercially-supported-module-list/)

Provides a text extraction API for file content that can hook into different extractor
engines based on availability and the parsed file format. The output is always a string of the file content.

Via the `FileTextExtractable` extension, this logic can be used to
cache the extracted content on a `DataObject` subclass (usually `File`).

The module supports text extraction on the following file formats:

* HTML (built-in)
* PDF (with XPDF or Solr)
* Microsoft Word, Excel, PowerPoint (Solr)
* OpenOffice (Solr)
* CSV (Solr)
* RTF (Solr)
* EPub (Solr)
* Many others (Tika)

## Requirements

* Silverstripe ^4.0
* (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)
* (optional) [Apache Solr with ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler)
* (optional) [Apache Tika](http://tika.apache.org/)

## Installation

```
composer require silverstripe/textextraction
```
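
The optional extractors listed under Requirements are external tools, so it can help to confirm they are reachable from the shell before relying on them. The sketch below is a minimal check, assuming the tool is expected on the `PATH`; the helper name `check_tool` and the file `document.pdf` are illustrative and not part of the module.

```sh
#!/bin/sh
# Report whether an optional command-line extractor is on the PATH.
# check_tool is a hypothetical helper, not something the module provides.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

check_tool pdftotext

# If pdftotext is available, passing "-" as the output file writes the
# extracted text to stdout. document.pdf is a placeholder path.
if command -v pdftotext >/dev/null 2>&1 && [ -f document.pdf ]; then
    pdftotext document.pdf - | head -n 5
fi
```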

The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
which is automatically checked out by composer. Alternatively, install Guzzle
through PEAR and ensure it's in your `include_path`.

## Documentation

* [Configuration](docs/en/configuration.md)
* [Developer documentation](/docs/en/developer-docs.md)

## Bugtracker

Bugs are tracked in the issues section of this repository. Before submitting an issue, please read over
existing issues to ensure yours is unique.

If the issue does look like a new bug:

- Create a new issue
- Describe the steps required to reproduce your issue, and the expected outcome. Unit tests, screenshots
  and screencasts can help here.
- Describe your environment in as much detail as possible: Silverstripe version, browser, PHP version,
  operating system, any installed Silverstripe modules.

Please report security issues to security@silverstripe.org directly. Please don't file security issues in the bugtracker.

## Development and contribution

If you would like to contribute to the module, please raise a pull request and discuss it
with the module maintainers.