mirror of
https://github.com/silverstripe/silverstripe-textextraction
synced 2024-10-22 11:06:00 +02:00
Add supported module standard docs
parent
1e8581d7f8
commit
7b3fb280c6
6
.gitattributes
vendored
Normal file
@@ -0,0 +1,6 @@
/tests export-ignore
/docs export-ignore
/.gitattributes export-ignore
/.travis.yml export-ignore
/.travis export-ignore
/.scrutinizer.yml export-ignore
41
.travis.yml
@@ -1,21 +1,35 @@
|
|||||||
# See https://github.com/silverstripe-labs/silverstripe-travis-support for setup details
|
# See https://github.com/silverstripe-labs/silverstripe-travis-support for setup details
|
||||||
|
language: php
|
||||||
|
|
||||||
language: php
|
|
||||||
php:
|
php:
|
||||||
- 5.4
|
- 5.3
|
||||||
|
- 5.4
|
||||||
|
- 5.5
|
||||||
|
- 5.6
|
||||||
|
- 7.0
|
||||||
|
|
||||||
sudo: false
|
sudo: false
|
||||||
|
|
||||||
addons:
|
|
||||||
apt:
|
|
||||||
packages:
|
|
||||||
- poppler-utils
|
|
||||||
|
|
||||||
env:
|
env:
|
||||||
- DB=MYSQL CORE_RELEASE=3.1
|
- DB=MYSQL CORE_RELEASE=3.2
|
||||||
- DB=MYSQL CORE_RELEASE=3
|
|
||||||
|
matrix:
|
||||||
|
include:
|
||||||
|
- php: 5.6
|
||||||
|
env: DB=PGSQL CORE_RELEASE=3
|
||||||
|
- php: 5.6
|
||||||
|
env: DB=PGSQL CORE_RELEASE=3.1
|
||||||
|
- php: 5.6
|
||||||
|
env: DB=PGSQL CORE_RELEASE=3.2
|
||||||
|
- php: 5.6
|
||||||
|
env: DB=MYSQL CORE_RELEASE=3.3
|
||||||
|
- php: 5.6
|
||||||
|
env: DB=MYSQL CORE_RELEASE=3.2
|
||||||
|
- php: 5.6
|
||||||
|
env: DB=MYSQL CORE_RELEASE=3.1
|
||||||
|
|
||||||
before_script:
|
before_script:
|
||||||
|
- composer self-update || true
|
||||||
- mkdir -p $HOME/bin
|
- mkdir -p $HOME/bin
|
||||||
- export PATH=$PATH:$HOME/bin
|
- export PATH=$PATH:$HOME/bin
|
||||||
- export SS_TIKA_ENDPOINT="http://localhost:9998/"
|
- export SS_TIKA_ENDPOINT="http://localhost:9998/"
|
||||||
@@ -23,7 +37,16 @@ before_script:
|
|||||||
- git clone git://github.com/silverstripe-labs/silverstripe-travis-support.git ~/travis-support
|
- git clone git://github.com/silverstripe-labs/silverstripe-travis-support.git ~/travis-support
|
||||||
- php ~/travis-support/travis_setup.php --source `pwd` --target ~/builds/ss
|
- php ~/travis-support/travis_setup.php --source `pwd` --target ~/builds/ss
|
||||||
- cd ~/builds/ss
|
- cd ~/builds/ss
|
||||||
|
- composer install
|
||||||
|
|
||||||
script:
|
script:
|
||||||
- ($HOME/bin/tika-rest-server &) &> /dev/null
|
- ($HOME/bin/tika-rest-server &) &> /dev/null
|
||||||
- vendor/bin/phpunit --verbose textextraction/tests/
|
- vendor/bin/phpunit --verbose textextraction/tests/
|
||||||
|
|
||||||
|
branches:
|
||||||
|
only:
|
||||||
|
- master
|
||||||
|
|
||||||
|
matrix:
|
||||||
|
allow_failures:
|
||||||
|
- php: 7.0
|
||||||
|
12
CHANGELOG.md
Normal file
@@ -0,0 +1,12 @@
# Changelog

All notable changes to this project will be documented in this file.

This project adheres to [Semantic Versioning](http://semver.org/).


## [2.0.1]
Using Symfony mime type detection

## [2.0.0]
Clarified Tika docs
15
CONTRIBUTING.md
Normal file
@@ -0,0 +1,15 @@
# Contributing

- Maintenance on this module is a shared effort of those who use it
- To contribute improvements to the code, ensure you raise a pull request and discuss with the module maintainers
- Please follow the SilverStripe [code contribution guidelines](https://docs.silverstripe.org/en/contributing/code/) and [Module Standard](https://docs.silverstripe.org/en/developer_guides/extending/modules/#module-standard)
- Supply documentation that follows the [GitHub Flavored Markdown](https://help.github.com/articles/markdown-basics/) conventions
- When having discussions about this module in issues or pull requests, please adhere to the [SilverStripe Community Code of Conduct](https://docs.silverstripe.org/en/contributing/code_of_conduct/)


## Contributor license agreement
By supplying code to this module in patches, tickets and pull requests, you agree to assign copyright
of that code to SilverStripe Ltd., on the condition that these code changes are released under the
same BSD license as the original module. We ask for this so that the ownership in the license is clear
and unambiguous. By releasing this code under a permissive license such as BSD, this copyright assignment
won't prevent you from using the code in any way you see fit.
@@ -1,4 +1,4 @@
|
|||||||
* Copyright (c) 2010-2012, SilverStripe Ltd.
|
* Copyright (c) 2015, SilverStripe Ltd.
|
||||||
* All rights reserved.
|
* All rights reserved.
|
||||||
*
|
*
|
||||||
* Redistribution and use in source and binary forms, with or without
|
* Redistribution and use in source and binary forms, with or without
|
224
README.md
@@ -1,26 +1,18 @@
|
|||||||
# Text Extraction Module
|
# Text extraction module
|
||||||
|
|
||||||
[![Build Status](https://secure.travis-ci.org/silverstripe-labs/silverstripe-textextraction.png)](http://travis-ci.org/silverstripe-labs/silverstripe-textextraction)
|
[![Build Status](https://secure.travis-ci.org/silverstripe-labs/silverstripe-textextraction.png)](http://travis-ci.org/silverstripe-labs/silverstripe-textextraction)
|
||||||
|
[![Code Quality](http://img.shields.io/scrutinizer/g/silverstripe-labs/silverstripe-textextraction.svg?style=flat-square)](https://scrutinizer-ci.com/g/silverstripe-labs/silverstripe-textextraction)
|
||||||
|
[![Version](http://img.shields.io/packagist/v/silverstripe/textextraction.svg?style=flat-square)](https://packagist.org/packages/silverstripe/textextraction)
|
||||||
|
[![License](http://img.shields.io/packagist/l/silverstripe/textextraction.svg?style=flat-square)](license.md)
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
Provides an extraction API for file content, which can hook into different extractor
|
Provides a text extraction API for file content, which can hook into different extractor
|
||||||
engines based on availability and the parsed file format.
|
engines based on availability and the parsed file format. The output returned is always a string of the file content.
|
||||||
The output is always a string: the file content.
|
|
||||||
|
|
||||||
Via the `FileTextExtractable` extension, this logic can be used to
|
Via the `FileTextExtractable` extension, this logic can be used to
|
||||||
cache the extracted content on a `DataObject` subclass (usually `File`).
|
cache the extracted content on a `DataObject` subclass (usually `File`).
|
||||||
|
|
||||||
Note: Previously part of the [sphinx module](https://github.com/silverstripe/silverstripe-sphinx).
|
The module supports text extraction on the following file formats:
|
||||||
|
|
||||||
## Requirements
|
|
||||||
|
|
||||||
* SilverStripe 3.1
|
|
||||||
* (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)
|
|
||||||
* (optional) [Apache Solr with ExtracingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler)
|
|
||||||
* (optional) [Apache Tika](http://tika.apache.org/)
|
|
||||||
|
|
||||||
### Supported Formats
|
|
||||||
|
|
||||||
* HTML (built-in)
|
* HTML (built-in)
|
||||||
* PDF (with XPDF or Solr)
|
* PDF (with XPDF or Solr)
|
||||||
@@ -31,188 +23,44 @@ Note: Previously part of the [sphinx module](https://github.com/silverstripe/sil
|
|||||||
* EPub (Solr)
|
* EPub (Solr)
|
||||||
* Many others (Tika)
|
* Many others (Tika)
|
||||||
|
|
||||||
|
## Requirements
|
||||||
|
|
||||||
|
* SilverStripe ^3.1
|
||||||
|
* (optional) [XPDF](http://www.foolabs.com/xpdf/) (`pdftotext` utility)
|
||||||
|
* (optional) [Apache Solr with ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler)
|
||||||
|
* (optional) [Apache Tika](http://tika.apache.org/)
|
||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
The recommended installation is through [composer](http://getcomposer.org).
|
|
||||||
Add the following to your `composer.json`:
|
|
||||||
|
|
||||||
```js
|
```sh
|
||||||
{
|
composer require silverstripe/textextraction
|
||||||
"require": {
|
|
||||||
"silverstripe/textextraction": "2.0.x-dev"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
|
The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
|
||||||
which is automatically checked out by composer. Alternatively, install Guzzle
|
which is automatically checked out by composer. Alternatively, install Guzzle
|
||||||
through PEAR and ensure it's in your `include_path`.
|
through PEAR and ensure it's in your `include_path`.
|
||||||
|
|
||||||
## Configuration
|
## Documentation
|
||||||
|
* [Configuration](docs/en/configuration.md)
|
||||||
### Basic
|
* [Developer documentation](docs/en/developer-docs.md)
|
||||||
|
|
||||||
By default, only extraction from HTML documents is supported.
|
## Bugtracker
|
||||||
No configuration is required for that, unless you want to make
|
Bugs are tracked in the issues section of this repository. Before submitting an issue please read over
|
||||||
the content available through your `DataObject` subclass.
|
existing issues to ensure yours is unique.
|
||||||
In this case, add the following to `mysite/_config/config.yml`:
|
|
||||||
|
If the issue does look like a new bug:
|
||||||
```yaml
|
|
||||||
File:
|
- Create a new issue
|
||||||
extensions:
|
- Describe the steps required to reproduce your issue, and the expected outcome. Unit tests, screenshots
|
||||||
- FileTextExtractable
|
and screencasts can help here.
|
||||||
```
|
- Describe your environment in as much detail as possible: SilverStripe version, Browser, PHP version,
|
||||||
|
Operating System, any installed SilverStripe modules.
|
||||||
By default any extracted content will be cached against the database row.
|
|
||||||
In order to stay within common size constraints for SQL queries required in this operation,
|
Please report security issues to security@silverstripe.org directly. Please don't file security issues in the bugtracker.
|
||||||
the cache sets a maximum character length after which content gets truncated (default: 500000).
|
|
||||||
You can configure this value through `FileTextCache_Database.max_content_length` in your yaml configuration.
|
## Development and contribution
|
||||||
|
If you would like to make contributions to the module please ensure you raise a pull request and discuss
|
||||||
|
with the module maintainers.
|
||||||
|
|
||||||
|
|
||||||
Alternatively, extracted content can be cached using SS_Cache to prevent excessive database growth.
|
|
||||||
In order to swap out the cache backend you can use the following yaml configuration.
|
|
||||||
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
---
|
|
||||||
Name: mytextextraction
|
|
||||||
After: '#textextraction'
|
|
||||||
---
|
|
||||||
Injector:
|
|
||||||
FileTextCache: FileTextCache_SSCache
|
|
||||||
FileTextCache_SSCache:
|
|
||||||
lifetime: 3600 # Number of seconds to cache content for
|
|
||||||
|
|
||||||
```
|
|
||||||
|
|
||||||
### XPDF
|
|
||||||
|
|
||||||
PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
|
|
||||||
commandline utility. Follow their installation instructions, its presence will be automatically
|
|
||||||
detected. You can optionally set the binary path in `mysite/_config/config.yml`:
|
|
||||||
|
|
||||||
```yml
|
|
||||||
PDFTextExtractor:
|
|
||||||
binary_location: /my/path/pdftotext
|
|
||||||
```
|
|
||||||
|
|
||||||
### Apache Solr
|
|
||||||
|
|
||||||
Apache Solr is a fulltext search engine, an aspect which is often used
|
|
||||||
alongside this module. But more importantly for us, it has bindings to [Apache Tika](http://tika.apache.org/)
|
|
||||||
through the [ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler) interface.
|
|
||||||
This allows Solr to inspect the contents of various file formats, such as Office documents and PDF files.
|
|
||||||
The textextraction module retrieves the output of this service, rather than altering the index.
|
|
||||||
With the raw text output, you can decide to store it in a database column for fulltext search
|
|
||||||
in your database driver, or even pass it back to Solr as part of a full index update.
|
|
||||||
|
|
||||||
In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):
|
|
||||||
|
|
||||||
```yml
|
|
||||||
SolrCellTextExtractor:
|
|
||||||
base_url: 'http://localhost:8983/solr/update/extract'
|
|
||||||
```
|
|
||||||
|
|
||||||
Note that in case you're using multiple cores, you'll need to add the core name to the URL
|
|
||||||
(e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
|
|
||||||
The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
|
|
||||||
uses multiple cores by default, and comes prepackaged with a Solr server.
|
|
||||||
Its a stripped-down version of Solr, follow the module README on how to add
|
|
||||||
Apache Tika text extraction capabilities.
|
|
||||||
|
|
||||||
You need to ensure that some indexable property on your object
|
|
||||||
returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
|
|
||||||
or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" below).
|
|
||||||
The property should be listed in your `SolrIndex` subclass, e.g. as follows:
|
|
||||||
|
|
||||||
```php
|
|
||||||
class MyDocument extends DataObject {
|
|
||||||
static $db = array('Path' => 'Text');
|
|
||||||
function getContent() {
|
|
||||||
$extractor = FileTextExtractor::for_file($this->Path);
|
|
||||||
return $extractor ? $extractor->getContent($this->Path) : null;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
class MySolrIndex extends SolrIndex {
|
|
||||||
function init() {
|
|
||||||
$this->addClass('MyDocument');
|
|
||||||
$this->addStoredField('Content', 'HTMLText');
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
Note: This isn't a terribly efficient way to process large amounts of files, since
|
|
||||||
each HTTP request is run synchronously.
|
|
||||||
|
|
||||||
### Tika
|
|
||||||
|
|
||||||
Support for Apache Tika (1.8 and above) is included. This can be run in one of two ways: Server or CLI.
|
|
||||||
|
|
||||||
See [the Apache Tika home page](http://tika.apache.org/1.8/index.html) for instructions on installing and
|
|
||||||
configuring this. Download the latest `tika-app` for running as a CLI script, or `tika-server` if you're planning
|
|
||||||
to have it running constantly in the background. Starting tika as a CLI script for every extraction request
|
|
||||||
is fairly slow, so we recommend running it as a server.
|
|
||||||
|
|
||||||
This extension will best work with the [fileinfo PHP extension](http://php.net/manual/en/book.fileinfo.php)
|
|
||||||
installed to perform mime detection. Tika validates support via mime type rather than file extensions.
|
|
||||||
|
|
||||||
### Tika - CLI
|
|
||||||
|
|
||||||
Ensure that your machine has a 'tika' command available which will run the CLI script.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
#!/bin/bash
|
|
||||||
exec java -jar tika-app-1.8.jar "$@"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Tika Rest Server
|
|
||||||
|
|
||||||
Tika can also be run as a server. You can configure your server endpoint by setting the url via config.
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
TikaServerTextExtractor:
|
|
||||||
server_endpoint: 'http://localhost:9998'
|
|
||||||
```
|
|
||||||
|
|
||||||
Alternatively this may be specified via the `SS_TIKA_ENDPOINT` directive in your `_ss_environment.php` file, or an environment variable of the same name.
|
|
||||||
|
|
||||||
|
|
||||||
Then startup your server as below
|
|
||||||
|
|
||||||
```bash
|
|
||||||
java -jar tika-server-1.8.jar --host=localhost --port=9998
|
|
||||||
```
|
|
||||||
|
|
||||||
While you can run `tika-app-1.8.jar` in server mode as well (with the `--server` flag),
|
|
||||||
it behaves differently and is not recommended.
|
|
||||||
|
|
||||||
The module will log extraction errors with `SS_Log::NOTICE` priority by default,
|
|
||||||
for example a "422 Unprocessable Entity" HTTP response for an encrypted PDF.
|
|
||||||
In case you want more information on why processing failed, you can increase
|
|
||||||
the logging verbosity in the tika server instance by passing through
|
|
||||||
a `--includeStack` flag. Logs can passed on to files or external logging services,
|
|
||||||
see [error handling](http://doc.silverstripe.org/en/developer_guides/debugging/error_handling)
|
|
||||||
documentation for SilverStripe core.
|
|
||||||
|
|
||||||
## Usage
|
|
||||||
|
|
||||||
Manual extraction:
|
|
||||||
|
|
||||||
```php
|
|
||||||
$myFile = '/my/path/myfile.pdf';
|
|
||||||
$extractor = FileTextExtractor::for_file($myFile);
|
|
||||||
$content = $extractor->getContent($myFile);
|
|
||||||
```
|
|
||||||
|
|
||||||
Extraction with `FileTextExtractable` extension applied:
|
|
||||||
|
|
||||||
```php
|
|
||||||
$myFileObj = File::get()->First();
|
|
||||||
$content = $myFileObj->getFileContent();
|
|
||||||
```
|
|
||||||
|
|
||||||
This content can also be embedded directly within a template.
|
|
||||||
|
|
||||||
```
|
|
||||||
$MyFile.FileContent
|
|
||||||
```
|
|
||||||
|
@@ -18,13 +18,13 @@
|
|||||||
"require": {
|
"require": {
|
||||||
"php": ">=5.3.2",
|
"php": ">=5.3.2",
|
||||||
"composer/installers": "*",
|
"composer/installers": "*",
|
||||||
"silverstripe/framework": "~3.1",
|
"silverstripe/framework": "^3.1",
|
||||||
"guzzle/guzzle": "~3.9",
|
"guzzle/guzzle": "^3.9",
|
||||||
"symfony/event-dispatcher": "~2.6.0@stable",
|
"symfony/event-dispatcher": "^2.6.0@stable",
|
||||||
"symfony/http-foundation": "~2.6.0"
|
"symfony/http-foundation": "^2.6.0"
|
||||||
},
|
},
|
||||||
"require-dev": {
|
"require-dev": {
|
||||||
"phpunit/phpunit": "~3.7"
|
"phpunit/phpunit": "^3.7"
|
||||||
},
|
},
|
||||||
"suggest": {
|
"suggest": {
|
||||||
"ext-fileinfo": "Improved support for file mime detection"
|
"ext-fileinfo": "Improved support for file mime detection"
|
||||||
|
145
docs/en/configuration.md
Normal file
@@ -0,0 +1,145 @@
# Configuration

## Basic

By default, only extraction from HTML documents is supported.
No configuration is required for that, unless you want to make
the content available through your `DataObject` subclass.
In this case, add the following to `mysite/_config/config.yml`:

```yaml
File:
  extensions:
    - FileTextExtractable
```

By default any extracted content will be cached against the database row.
In order to stay within common size constraints for SQL queries required in this operation,
the cache sets a maximum character length after which content gets truncated (default: 500000).
You can configure this value through `FileTextCache_Database.max_content_length` in your YAML configuration.
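
As a minimal sketch of the truncation setting described above, the limit could be overridden in `mysite/_config/config.yml` as follows. The value `100000` is an arbitrary illustration, not a recommendation from the module.

```yaml
# Hypothetical override: truncate cached content after 100,000 characters
# instead of the default 500,000. The number is only an example.
FileTextCache_Database:
  max_content_length: 100000
```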


Alternatively, extracted content can be cached using SS_Cache to prevent excessive database growth.
In order to swap out the cache backend you can use the following YAML configuration.


```yaml
---
Name: mytextextraction
After: '#textextraction'
---
Injector:
  FileTextCache: FileTextCache_SSCache
FileTextCache_SSCache:
  lifetime: 3600 # Number of seconds to cache content for

```

## XPDF

PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
command-line utility. Follow its installation instructions; its presence will be automatically
detected. You can optionally set the binary path in `mysite/_config/config.yml`:

```yml
PDFTextExtractor:
  binary_location: /my/path/pdftotext
```

## Apache Solr

Apache Solr is a fulltext search engine, an aspect which is often used
alongside this module. But more importantly for us, it has bindings to [Apache Tika](http://tika.apache.org/)
through the [ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler) interface.
This allows Solr to inspect the contents of various file formats, such as Office documents and PDF files.
The textextraction module retrieves the output of this service, rather than altering the index.
With the raw text output, you can decide to store it in a database column for fulltext search
in your database driver, or even pass it back to Solr as part of a full index update.

In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):

```yml
SolrCellTextExtractor:
  base_url: 'http://localhost:8983/solr/update/extract'
```

Note that in case you're using multiple cores, you'll need to add the core name to the URL
(e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
uses multiple cores by default, and comes prepackaged with a Solr server.
It's a stripped-down version of Solr; follow the module README on how to add
Apache Tika text extraction capabilities.
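
For a multi-core setup, the configuration might look like the following sketch. The core name `PageSolrIndex` is taken from the example URL above; substitute the name of your own core.

```yaml
# Assumes a Solr core named PageSolrIndex, as in the example URL above.
SolrCellTextExtractor:
  base_url: 'http://localhost:8983/solr/PageSolrIndex/update/extract'
```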

You need to ensure that some indexable property on your object
returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" in the [developer documentation](developer-docs.md)).
The property should be listed in your `SolrIndex` subclass, e.g. as follows:

```php
class MyDocument extends DataObject {
    private static $db = array('Path' => 'Text');
    // Returns the extracted text, or null if no extractor supports the file
    public function getContent() {
        $extractor = FileTextExtractor::for_file($this->Path);
        return $extractor ? $extractor->getContent($this->Path) : null;
    }
}
class MySolrIndex extends SolrIndex {
    public function init() {
        $this->addClass('MyDocument');
        // Index the value returned by MyDocument::getContent()
        $this->addStoredField('Content', 'HTMLText');
    }
}
```

Note: This isn't a terribly efficient way to process large amounts of files, since
each HTTP request is run synchronously.

## Tika

Support for Apache Tika (1.8 and above) is included. This can be run in one of two ways: Server or CLI.

See [the Apache Tika home page](http://tika.apache.org/1.8/index.html) for instructions on installing and
configuring this. Download the latest `tika-app` for running as a CLI script, or `tika-server` if you're planning
to have it running constantly in the background. Starting Tika as a CLI script for every extraction request
is fairly slow, so we recommend running it as a server.

This extension works best with the [fileinfo PHP extension](http://php.net/manual/en/book.fileinfo.php)
installed to perform MIME detection. Tika validates support via MIME type rather than file extensions.

## Tika - CLI

Ensure that your machine has a `tika` command available which will run the CLI script.

```bash
#!/bin/bash
exec java -jar tika-app-1.8.jar "$@"
```

## Tika REST Server

Tika can also be run as a server. You can configure your server endpoint by setting the URL via config.

```yaml
TikaServerTextExtractor:
  server_endpoint: 'http://localhost:9998'
```

Alternatively this may be specified via the `SS_TIKA_ENDPOINT` directive in your `_ss_environment.php` file, or an environment variable of the same name.


Then start up your server as shown below:

```bash
java -jar tika-server-1.8.jar --host=localhost --port=9998
```

While you can run `tika-app-1.8.jar` in server mode as well (with the `--server` flag),
it behaves differently and is not recommended.

The module will log extraction errors with `SS_Log::NOTICE` priority by default,
for example a "422 Unprocessable Entity" HTTP response for an encrypted PDF.
In case you want more information on why processing failed, you can increase
the logging verbosity in the Tika server instance by passing through
a `--includeStack` flag. Logs can be passed on to files or external logging services;
see the [error handling](http://doc.silverstripe.org/en/developer_guides/debugging/error_handling)
documentation for SilverStripe core.
23
docs/en/developer-docs.md
Normal file
@@ -0,0 +1,23 @@
# Developer documentation
## Usage

Manual extraction:

```php
$myFile = '/my/path/myfile.pdf';
$extractor = FileTextExtractor::for_file($myFile);
$content = $extractor->getContent($myFile);
```

Extraction with `FileTextExtractable` extension applied:

```php
$myFileObj = File::get()->First();
$content = $myFileObj->getFileContent();
```

This content can also be embedded directly within a template.

```
$MyFile.FileContent
```