diff --git a/README.md b/README.md
index 0c6ef46..f0c0fc9 100644
--- a/README.md
+++ b/README.md
@@ -40,110 +40,10 @@ The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
 which is automatically checked out by composer. Alternatively, install Guzzle
 through PEAR and ensure its in your `include_path`.
 
-## Configuration
+## Documentation
 
-### Basic
-
-By default, only extraction from HTML documents is supported.
-No configuration is required for that, unless you want to make
-the content available through your `DataObject` subclass.
-In this case, add the following to `mysite/_config/config.yml`:
-
-```yaml
-File:
-  extensions:
-    - FileTextExtractable
-```
-
-By default any extracted content will be cached against the database row.
-In order to stay within common size constraints for SQL queries required in this operation,
-the cache sets a maximum character length after which content gets truncated (default: 500000).
-You can configure this value through `FileTextCache_Database.max_content_length` in your yaml configuration.
-
-
-Alternatively, extracted content can be cached using SS_Cache to prevent excessive database growth.
-In order to swap out the cache backend you can use the following yaml configuration.
-
-
-```yaml
----
-Name: mytextextraction
-After: '#textextraction'
----
-Injector:
-  FileTextCache: FileTextCache_SSCache
-FileTextCache_SSCache:
-  lifetime: 3600 # Number of seconds to cache content for
-
-```
-
-### XPDF
-
-PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
-commandline utility. Follow their installation instructions, its presence will be automatically
-detected for \*nix operating systems. You can optionally set the binary path (required for Windows) in `mysite/_config/config.yml`
-
-```yml
-PDFTextExtractor:
-  binary_location: /my/path/pdftotext
-```
-
-### Apache Solr
-
-Apache Solr is a fulltext search engine, an aspect which is often used
-alongside this module. But more importantly for us, it has bindings to [Apache Tika](http://tika.apache.org/)
-through the [ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler) interface.
-This allows Solr to inspect the contents of various file formats, such as Office documents and PDF files.
-The textextraction module retrieves the output of this service, rather than altering the index.
-With the raw text output, you can decide to store it in a database column for fulltext search
-in your database driver, or even pass it back to Solr as part of a full index update.
-
-In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):
-
-```yml
-SolrCellTextExtractor:
-  base_url: 'http://localhost:8983/solr/update/extract'
-```
-
-Note that in case you're using multiple cores, you'll need to add the core name to the URL
-(e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
-The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
-uses multiple cores by default, and comes prepackaged with a Solr server.
-Its a stripped-down version of Solr, follow the module README on how to add
-Apache Tika text extraction capabilities.
-
-You need to ensure that some indexable property on your object
-returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
-or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" below).
-The property should be listed in your `SolrIndex` subclass, e.g. as follows:
-
-```php
-class MyDocument extends DataObject {
-    static $db = array('Path' => 'Text');
-    function getContent() {
-        $extractor = FileTextExtractor::for_file($this->Path);
-        return $extractor ? $extractor->getContent($this->Path) : null;
-    }
-}
-class MySolrIndex extends SolrIndex {
-    function init() {
-        $this->addClass('MyDocument');
-        $this->addStoredField('Content', 'HTMLText');
-    }
-}
-```
-
-Note: This isn't a terribly efficient way to process large amounts of files, since
-each HTTP request is run synchronously.
-
-### Tika
-
-Support for Apache Tika (1.8 and above) is included. This can be run in one of two ways: Server or CLI.
-
-See [the Apache Tika home page](http://tika.apache.org/1.8/index.html) for instructions on installing and
-configuring this. Download the latest `tika-app` for running as a CLI script, or `tika-server` if you're planning
-to have it running constantly in the background. Starting tika as a CLI script for every extraction request
-is fairly slow, so we recommend running it as a server.
+ * [Configuration](docs/en/configuration.md)
+ * [Developer documentation](/docs/en/developer-docs.md)
 
 ## Bugtracker
diff --git a/docs/en/configuration.md b/docs/en/configuration.md
index caa61a1..052448a 100644
--- a/docs/en/configuration.md
+++ b/docs/en/configuration.md
@@ -10,7 +10,7 @@ In this case, add the following to `mysite/_config/config.yml`:
 ```yaml
 File:
   extensions:
-    - FileTextExtractable
+    - FileTextExtractable
 ```
 
 By default any extracted content will be cached against the database row.
@@ -39,11 +39,11 @@ FileTextCache_SSCache:
 
 PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
 commandline utility. Follow their installation instructions, its presence will be automatically
-detected. You can optionally set the binary path in `mysite/_config/config.yml`:
+detected for \*nix operating systems. You can optionally set the binary path (required for Windows) in `mysite/_config/config.yml`:
 
 ```yml
 PDFTextExtractor:
-  binary_location: /my/path/pdftotext
+  binary_location: /my/path/pdftotext
 ```
 
 ## Apache Solr
@@ -60,10 +60,10 @@ In order to use Solr, you need to configure a URL for it (in `mysite/_config/con
 
 ```yml
 SolrCellTextExtractor:
-  base_url: 'http://localhost:8983/solr/update/extract'
+  base_url: 'http://localhost:8983/solr/update/extract'
 ```
 
-Note that in case you're using multiple cores, you'll need to add the core name to the URL
+Note that in case you're using multiple cores, you'll need to add the core name to the URL
 (e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
 The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
 uses multiple cores by default, and comes prepackaged with a Solr server.
@@ -80,7 +80,7 @@ class MyDocument extends DataObject {
     static $db = array('Path' => 'Text');
     function getContent() {
         $extractor = FileTextExtractor::for_file($this->Path);
-        return $extractor ? $extractor->getContent($this->Path) : null;
+        return $extractor ? $extractor->getContent($this->Path) : null;
     }
 }
 class MySolrIndex extends SolrIndex {
@@ -91,7 +91,7 @@ class MySolrIndex extends SolrIndex {
     }
 }
 ```
-Note: This isn't a terribly efficient way to process large amounts of files, since
+Note: This isn't a terribly efficient way to process large amounts of files, since
 each HTTP request is run synchronously.
 
 ## Tika
@@ -142,4 +142,4 @@ In case you want more information on why processing failed, you can increase
 the logging verbosity in the tika server instance by passing through a `--includeStack` flag.
 Logs can passed on to files or external logging services, see
 [error handling](http://doc.silverstripe.org/en/developer_guides/debugging/error_handling)
-documentation for SilverStripe core.
\ No newline at end of file
+documentation for SilverStripe core.
diff --git a/license.md b/license.md
index 9445c8e..8794670 100644
--- a/license.md
+++ b/license.md
@@ -1,4 +1,4 @@
-Copyright (c) 2016, SilverStripe Limited
+Copyright (c) 2017, SilverStripe Limited
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
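
One note on the cache truncation limit described in `docs/en/configuration.md`: the threshold is exposed as the `FileTextCache_Database.max_content_length` config property, so it can be adjusted from the same `mysite/_config/config.yml` file as the other settings. A minimal sketch, assuming the default database-backed cache is in use (the `1000000` value is purely illustrative, not a recommendation):

```yaml
FileTextCache_Database:
  max_content_length: 1000000 # characters of extracted text to cache before truncation
```

As the configuration docs point out, the 500000 default exists to stay within common SQL query size constraints, so substantially larger values may also require raising database-side limits such as MySQL's `max_allowed_packet`.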