# Text extraction module

[![CI](https://github.com/silverstripe/silverstripe-textextraction/actions/workflows/ci.yml/badge.svg)](https://github.com/silverstripe/silverstripe-textextraction/actions/workflows/ci.yml)
[![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/silverstripe/silverstripe-textextraction/badges/quality-score.png?b=master)](https://scrutinizer-ci.com/g/silverstripe/silverstripe-textextraction/?branch=master)
[![codecov](https://codecov.io/gh/silverstripe/silverstripe-textextraction/branch/master/graph/badge.svg)](https://codecov.io/gh/silverstripe/silverstripe-textextraction)
[![SilverStripe supported module](https://img.shields.io/badge/silverstripe-supported-0071C4.svg)](https://www.silverstripe.org/software/addons/silverstripe-commercially-supported-module-list/)

Provides a text extraction API for file content that can hook into different extractor
engines based on availability and the parsed file format. The output returned is always a string of the file content.

Via the `FileTextExtractable` extension, this logic can be used to
cache the extracted content on a `DataObject` subclass (usually `File`).

The module supports text extraction on the following file formats:

* HTML (built-in)
* PDF (with XPDF or Solr)
* Microsoft Word, Excel, PowerPoint (Solr)
* OpenOffice (Solr)
* CSV (Solr)
* RTF (Solr)
* EPUB (Solr)
* Many others (Tika)
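The list above pairs each format with the engine or engines that can handle it. As a rough illustration only (the module's own selection logic is implemented in PHP and may well differ), that pairing can be sketched as a shell lookup. The function name `engines_for` and the file extensions chosen for each format are assumptions made for this example, not part of the module:

```sh
#!/bin/sh
# Hypothetical helper: print the engine(s) the list above names for a file,
# keyed on its extension. The extensions are illustrative assumptions.
engines_for() {
    # Take everything after the last dot and lowercase it.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        html|htm)                   echo "built-in" ;;
        pdf)                        echo "XPDF or Solr" ;;
        doc|docx|xls|xlsx|ppt|pptx) echo "Solr" ;;
        odt|ods|odp)                echo "Solr" ;;
        csv|rtf|epub)               echo "Solr" ;;
        *)                          echo "Tika" ;;
    esac
}

engines_for "report.PDF"
```

Which engine is actually used also depends on what is installed and configured, which this lookup does not model.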

## Requirements

* SilverStripe ^4.0
* (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)
* (optional) [Apache Solr with ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler)
* (optional) [Apache Tika](http://tika.apache.org/)
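Because all three extractors are optional, it can help to check what is present on a server before configuring the module. A minimal sketch, assuming the XPDF `pdftotext` binary is expected on the `PATH`. The helper name `have_tool` is hypothetical, and Solr and Tika are not checked here because how they are reached depends on your setup:

```sh
#!/bin/sh
# Hypothetical helper: report whether a command is available on the PATH.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

have_tool pdftotext
```

If `pdftotext` is found, running `pdftotext some.pdf -` prints the extracted text to standard output, which is a quick way to confirm it works on your own files.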

## Installation

```
composer require silverstripe/textextraction
```

The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
which Composer installs automatically. Alternatively, install Guzzle
through PEAR and ensure it's in your `include_path`.

## Documentation

* [Configuration](docs/en/configuration.md)
* [Developer documentation](docs/en/developer-docs.md)

## Bugtracker

Bugs are tracked in the issues section of this repository. Before submitting an issue, please read over
existing issues to ensure yours is unique.

If the issue does look like a new bug:

- Create a new issue
- Describe the steps required to reproduce your issue, and the expected outcome. Unit tests, screenshots
and screencasts can help here.
- Describe your environment in as much detail as possible: SilverStripe version, browser, PHP version,
operating system, and any installed SilverStripe modules.

Please report security issues to security@silverstripe.org directly. Please don't file security issues in the bugtracker.

## Development and contribution

If you would like to contribute to the module, please raise a pull request and discuss it
with the module maintainers.