DOCS Add Windows note back into Configuration guide, bump license year

2024-10-22 09:06:00 +00:00 · 2017-11-23 09:48:03 +13:00 · 2017-11-23 09:48:03 +13:00 · 3d289b4e05
commit 3d289b4e05
parent f8c3015161
3 changed files with 12 additions and 112 deletions
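For context, the restored note concerns `PDFTextExtractor.binary_location`, which the guide describes as optional on \*nix (where the binary is auto-detected) but required on Windows. A minimal sketch of what such an entry in `mysite/_config/config.yml` might look like is given below. The install path is a hypothetical example, not something taken from this commit.

```yaml
PDFTextExtractor:
  # Hypothetical Windows install location; adjust to wherever pdftotext.exe actually lives.
  # Single quotes keep the backslashes literal in YAML.
  binary_location: 'C:\xpdf\bin64\pdftotext.exe'
```

The value is indented with spaces rather than tabs, in line with the whitespace fixes in this commit.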
--- a/README.md
+++ b/README.md
@@ -40,110 +40,10 @@ The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
 which is automatically checked out by composer. Alternatively, install Guzzle
 through PEAR and ensure its in your `include_path`.
-## Configuration
-### Basic
-
-By default, only extraction from HTML documents is supported.
-No configuration is required for that, unless you want to make
-the content available through your `DataObject` subclass.
-In this case, add the following to `mysite/_config/config.yml`:
-```yaml
-File:
- extensions:
-   - FileTextExtractable
-```
-By default any extracted content will be cached against the database row.
-In order to stay within common size constraints for SQL queries required in this operation,
-the cache sets a maximum character length after which content gets truncated (default: 500000).
-You can configure this value through `FileTextCache_Database.max_content_length` in your yaml configuration.
-Alternatively, extracted content can be cached using SS_Cache to prevent excessive database growth.
-In order to swap out the cache backend you can use the following yaml configuration.
-```yaml
----
-Name: mytextextraction
-After: '#textextraction'
----
-Injector:
- FileTextCache: FileTextCache_SSCache
-FileTextCache_SSCache:
- lifetime: 3600 # Number of seconds to cache content for
-```
-### XPDF
-PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
-commandline utility. Follow their installation instructions, its presence will be automatically
-detected for \*nix operating systems. You can optionally set the binary path (required for Windows) in `mysite/_config/config.yml`
-```yml
-PDFTextExtractor:
- binary_location: /my/path/pdftotext
-```
-### Apache Solr
-Apache Solr is a fulltext search engine, an aspect which is often used
-alongside this module. But more importantly for us, it has bindings to [Apache Tika](http://tika.apache.org/)
-through the [ExtractingRequestHandler](http://wiki.apache.org/solr/ExtractingRequestHandler) interface.
-This allows Solr to inspect the contents of various file formats, such as Office documents and PDF files.
-The textextraction module retrieves the output of this service, rather than altering the index.
-With the raw text output, you can decide to store it in a database column for fulltext search
-in your database driver, or even pass it back to Solr as part of a full index update.
-In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):
-```yml
-SolrCellTextExtractor:
- base_url: 'http://localhost:8983/solr/update/extract'
-```
-Note that in case you're using multiple cores, you'll need to add the core name to the URL
-(e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
-The ["fulltext" module](https://github.com/silverstripe-labs/silverstripe-fulltextsearch)
-uses multiple cores by default, and comes prepackaged with a Solr server.
-Its a stripped-down version of Solr, follow the module README on how to add
-Apache Tika text extraction capabilities.
-You need to ensure that some indexable property on your object
-returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
-or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" below).
-The property should be listed in your `SolrIndex` subclass, e.g. as follows:
-```php
-class MyDocument extends DataObject {
-	static $db = array('Path' => 'Text');
-	function getContent() {
-		$extractor = FileTextExtractor::for_file($this->Path);
-		return $extractor ? $extractor->getContent($this->Path) : null;
-	}
-}
-class MySolrIndex extends SolrIndex {
-	function init() {
-		$this->addClass('MyDocument');
-		$this->addStoredField('Content', 'HTMLText');
-	}
-}
-```
-Note: This isn't a terribly efficient way to process large amounts of files, since
-each HTTP request is run synchronously.
-### Tika
-Support for Apache Tika (1.8 and above) is included. This can be run in one of two ways: Server or CLI.
-See [the Apache Tika home page](http://tika.apache.org/1.8/index.html) for instructions on installing and
-configuring this. Download the latest `tika-app` for running as a CLI script, or `tika-server` if you're planning
-to have it running constantly in the background. Starting tika as a CLI script for every extraction request
-is fairly slow, so we recommend running it as a server.
+## Documentation
+ * [Configuration](docs/en/configuration.md)
+ * [Developer documentation](/docs/en/developer-docs.md)
 ## Bugtracker
--- a/docs/en/configuration.md
+++ b/docs/en/configuration.md
@@ -10,7 +10,7 @@ In this case, add the following to `mysite/_config/config.yml`:
 ```yaml
 File:
  extensions:
-	- FileTextExtractable
+    - FileTextExtractable
 ```
 By default any extracted content will be cached against the database row.
@@ -39,11 +39,11 @@ FileTextCache_SSCache:
 PDFs require special handling, for example through the [XPDF](http://www.foolabs.com/xpdf/)
 commandline utility. Follow their installation instructions, its presence will be automatically
-detected. You can optionally set the binary path in `mysite/_config/config.yml`:
+detected for \*nix operating systems. You can optionally set the binary path (required for Windows) in `mysite/_config/config.yml`:
 ```yml
 PDFTextExtractor:
-	binary_location: /my/path/pdftotext
+  binary_location: /my/path/pdftotext
 ```
 ## Apache Solr
@@ -60,7 +60,7 @@ In order to use Solr, you need to configure a URL for it (in `mysite/_config/con
 ```yml
 SolrCellTextExtractor:
-	base_url: 'http://localhost:8983/solr/update/extract'
+  base_url: 'http://localhost:8983/solr/update/extract'
 ```
 Note that in case you're using multiple cores, you'll need to add the core name to the URL
--- a/license.md
+++ b/license.md
@@ -1,4 +1,4 @@
-Copyright (c) 2016, SilverStripe Limited
+Copyright (c) 2017, SilverStripe Limited
 All rights reserved.
 Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: