mirror of
https://github.com/silverstripe/silverstripe-textextraction
synced 2024-10-22 09:06:00 +00:00
DOCS Add Windows note back into Configuration guide, bump license year
parent f8c3015161
commit 3d289b4e05
106
README.md
@@ -40,110 +40,10 @@ The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
 which is automatically checked out by composer. Alternatively, install Guzzle
 through PEAR and ensure its in your `include_path`.
 
-## Configuration
+## Documentation
 
-### Basic
-
-By default, only extraction from HTML documents is supported.
-No configuration is required for that, unless you want to make
-the content available through your `DataObject` subclass.
-In this case, add the following to `mysite/_config/config.yml`:
-
-```yaml
-File:
-  extensions:
-    - FileTextExtractable
-```
-
-By default any extracted content will be cached against the database row.
-In order to stay within common size constraints for SQL queries required in this operation,
-the cache sets a maximum character length after which content gets truncated (default: 500000).
-You can configure this value through `FileTextCache_Database.max_content_length` in your yaml configuration.
-
-
-Alternatively, extracted content can be cached using SS_Cache to prevent excessive database growth.
-In order to swap out the cache backend you can use the following yaml configuration.
-
-
-```yaml
----
-Name: mytextextraction
-After: '#textextraction'
----
-Injector:
-  FileTextCache: FileTextCache_SSCache
-FileTextCache_SSCache:
-  lifetime: 3600 # Number of seconds to cache content for
-
-```
-
-### XPDF
-
-PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
-commandline utility. Follow their installation instructions, its presence will be automatically
-detected for \*nix operating systems. You can optionally set the binary path (required for Windows) in `mysite/_config/config.yml`
-
-```yml
-PDFTextExtractor:
-  binary_location: /my/path/pdftotext
-```
-
-### Apache Solr
-
-Apache Solr is a fulltext search engine, an aspect which is often used
-alongside this module. But more importantly for us, it has bindings to [Apache Tika](http://tika.apache.org/)
-through the [ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler) interface.
-This allows Solr to inspect the contents of various file formats, such as Office documents and PDF files.
-The textextraction module retrieves the output of this service, rather than altering the index.
-With the raw text output, you can decide to store it in a database column for fulltext search
-in your database driver, or even pass it back to Solr as part of a full index update.
-
-In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):
-
-```yml
-SolrCellTextExtractor:
-  base_url: 'http://localhost:8983/solr/update/extract'
-```
-
-Note that in case you're using multiple cores, you'll need to add the core name to the URL
-(e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
-The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
-uses multiple cores by default, and comes prepackaged with a Solr server.
-Its a stripped-down version of Solr, follow the module README on how to add
-Apache Tika text extraction capabilities.
-
-You need to ensure that some indexable property on your object
-returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
-or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" below).
-The property should be listed in your `SolrIndex` subclass, e.g. as follows:
-
-```php
-class MyDocument extends DataObject {
-    static $db = array('Path' => 'Text');
-    function getContent() {
-        $extractor = FileTextExtractor::for_file($this->Path);
-        return $extractor ? $extractor->getContent($this->Path) : null;
-    }
-}
-class MySolrIndex extends SolrIndex {
-    function init() {
-        $this->addClass('MyDocument');
-        $this->addStoredField('Content', 'HTMLText');
-    }
-}
-```
-
-Note: This isn't a terribly efficient way to process large amounts of files, since
-each HTTP request is run synchronously.
-
-### Tika
-
-Support for Apache Tika (1.8 and above) is included. This can be run in one of two ways: Server or CLI.
-
-See [the Apache Tika home page](http://tika.apache.org/1.8/index.html) for instructions on installing and
-configuring this. Download the latest `tika-app` for running as a CLI script, or `tika-server` if you're planning
-to have it running constantly in the background. Starting tika as a CLI script for every extraction request
-is fairly slow, so we recommend running it as a server.
+* [Configuration](docs/en/configuration.md)
+* [Developer documentation](/docs/en/developer-docs.md)
 
 ## Bugtracker
 
@@ -39,7 +39,7 @@ FileTextCache_SSCache:
 
 PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
 commandline utility. Follow their installation instructions, its presence will be automatically
-detected. You can optionally set the binary path in `mysite/_config/config.yml`:
+detected for \*nix operating systems. You can optionally set the binary path (required for Windows) in `mysite/_config/config.yml`:
 
 ```yml
 PDFTextExtractor:
@@ -1,4 +1,4 @@
-Copyright (c) 2016, SilverStripe Limited
+Copyright (c) 2017, SilverStripe Limited
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: