DOCS Add Windows note back into Configuration guide, bump license year

This commit is contained in:
Robbie Averill 2017-11-23 09:48:03 +13:00
parent f8c3015161
commit 3d289b4e05
3 changed files with 12 additions and 112 deletions

README.md

@@ -40,110 +40,10 @@ The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
which is automatically checked out by composer. Alternatively, install Guzzle
through PEAR and ensure it's in your `include_path`.
## Configuration
## Documentation
### Basic
By default, only extraction from HTML documents is supported.
No configuration is required for that, unless you want to make
the content available through your `DataObject` subclass.
In this case, add the following to `mysite/_config/config.yml`:
```yaml
File:
  extensions:
    - FileTextExtractable
```
By default, any extracted content is cached against the database row.
To stay within common size constraints for the SQL queries this requires,
the cache truncates content beyond a maximum character length (default: 500000).
You can configure this limit through `FileTextCache_Database.max_content_length` in your YAML configuration.
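For example, a config fragment along these lines could raise or lower the limit (the value shown here is just the documented default):

```yaml
FileTextCache_Database:
  max_content_length: 500000
```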
Alternatively, extracted content can be cached using SS_Cache to prevent excessive database growth.
To swap out the cache backend, use the following YAML configuration:
```yaml
---
Name: mytextextraction
After: '#textextraction'
---
Injector:
  FileTextCache: FileTextCache_SSCache
FileTextCache_SSCache:
  lifetime: 3600 # Number of seconds to cache content for
```
### XPDF
PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
commandline utility. Follow their installation instructions; its presence will be automatically
detected for \*nix operating systems. You can optionally set the binary path (required for Windows) in `mysite/_config/config.yml`:
```yml
PDFTextExtractor:
  binary_location: /my/path/pdftotext
```
### Apache Solr
Apache Solr is a fulltext search engine that is often used
alongside this module. More importantly for us, it has bindings to [Apache Tika](http://tika.apache.org/)
through the [ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler) interface.
This allows Solr to inspect the contents of various file formats, such as Office documents and PDF files.
The textextraction module retrieves the output of this service, rather than altering the index.
With the raw text output, you can decide to store it in a database column for fulltext search
in your database driver, or even pass it back to Solr as part of a full index update.
In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):
```yml
SolrCellTextExtractor:
  base_url: 'http://localhost:8983/solr/update/extract'
```
Note that if you're using multiple cores, you'll need to add the core name to the URL
(e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
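In YAML, that multi-core endpoint would look like the following (the `PageSolrIndex` core name is taken from the example above; substitute your own core):

```yml
SolrCellTextExtractor:
  base_url: 'http://localhost:8983/solr/PageSolrIndex/update/extract'
```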
The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
uses multiple cores by default, and comes prepackaged with a Solr server.
It's a stripped-down version of Solr; follow the module README on how to add
Apache Tika text extraction capabilities.
You need to ensure that some indexable property on your object
returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" below).
The property should be listed in your `SolrIndex` subclass, e.g. as follows:
```php
class MyDocument extends DataObject {
    static $db = array('Path' => 'Text');

    function getContent() {
        $extractor = FileTextExtractor::for_file($this->Path);
        return $extractor ? $extractor->getContent($this->Path) : null;
    }
}

class MySolrIndex extends SolrIndex {
    function init() {
        $this->addClass('MyDocument');
        $this->addStoredField('Content', 'HTMLText');
    }
}
```
Note: This isn't a terribly efficient way to process large numbers of files, since
each HTTP request is run synchronously.
### Tika
Support for Apache Tika (1.8 and above) is included. This can be run in one of two ways: Server or CLI.
See [the Apache Tika home page](http://tika.apache.org/1.8/index.html) for instructions on installing and
configuring this. Download the latest `tika-app` for running as a CLI script, or `tika-server` if you're planning
to have it running constantly in the background. Starting Tika as a CLI script for every extraction request
is fairly slow, so we recommend running it as a server.
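As a rough sketch of the two modes (jar file names depend on the Tika version you downloaded from tika.apache.org; adjust them to match your install):

```shell
# One-off CLI extraction: starts a JVM per request, hence the slowness noted above
java -jar tika-app-1.8.jar --text document.pdf

# Long-running server mode (listens on port 9998 by default)
java -jar tika-server-1.8.jar
```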
* [Configuration](docs/en/configuration.md)
* [Developer documentation](/docs/en/developer-docs.md)
## Bugtracker


@@ -10,7 +10,7 @@ In this case, add the following to `mysite/_config/config.yml`:
```yaml
File:
  extensions:
    - FileTextExtractable
```
By default any extracted content will be cached against the database row.
@@ -39,11 +39,11 @@ FileTextCache_SSCache:
PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
commandline utility. Follow their installation instructions; its presence will be automatically
-detected. You can optionally set the binary path in `mysite/_config/config.yml`:
+detected for \*nix operating systems. You can optionally set the binary path (required for Windows) in `mysite/_config/config.yml`:
```yml
PDFTextExtractor:
  binary_location: /my/path/pdftotext
```
## Apache Solr
@@ -60,10 +60,10 @@ In order to use Solr, you need to configure a URL for it (in `mysite/_config/con
```yml
SolrCellTextExtractor:
  base_url: 'http://localhost:8983/solr/update/extract'
```
Note that if you're using multiple cores, you'll need to add the core name to the URL
(e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
uses multiple cores by default, and comes prepackaged with a Solr server.
@@ -80,7 +80,7 @@ class MyDocument extends DataObject {
    static $db = array('Path' => 'Text');
    function getContent() {
        $extractor = FileTextExtractor::for_file($this->Path);
        return $extractor ? $extractor->getContent($this->Path) : null;
    }
}
class MySolrIndex extends SolrIndex {
@@ -91,7 +91,7 @@ class MySolrIndex extends SolrIndex {
}
```
Note: This isn't a terribly efficient way to process large numbers of files, since
each HTTP request is run synchronously.
## Tika
@@ -142,4 +142,4 @@ In case you want more information on why processing failed, you can increase
the logging verbosity in the tika server instance by passing through
a `--includeStack` flag. Logs can be passed on to files or external logging services;
see the [error handling](http://doc.silverstripe.org/en/developer_guides/debugging/error_handling)
documentation for SilverStripe core.


@ -1,4 +1,4 @@
-Copyright (c) 2016, SilverStripe Limited
+Copyright (c) 2017, SilverStripe Limited
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: