More docs on how to use extraction with Solr

2024-10-22 09:06:00 +00:00 · 2013-05-07 20:14:01 +02:00 · 24a055a741
commit 24a055a741
parent b32bc08dc4
1 changed file with 28 additions and 3 deletions
--- a/README.md
+++ b/README.md
@@ -32,12 +32,13 @@ Note: Previously part of the [sphinx module](https://github.com/silverstripe/sil
 The recommended installation is through [composer](http://getcomposer.org).
 Add the following to your `composer.json`:
-	:::js
+	```js
 	{
 		"require": {
 			"silverstripe/textextraction": "*"
 		}
 	}
 	```
 The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
 which is automatically checked out by composer. Alternatively, install Guzzle
@@ -60,9 +61,10 @@ PDFs require special handling, for example through the [XPDF](http://www.foolabs
 command-line utility. Follow its installation instructions; its presence will be detected
 automatically. You can optionally set the binary path in `mysite/_config/config.yml`:
-	:::yml
+	```yml
 	PDFTextExtractor:
 		binary_location: /my/path/pdftotext
 	```
 ### Apache Solr
@@ -76,8 +78,10 @@ in your database driver, or even pass it back to Solr as part of a full index up
 In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):
 	```yml
 	SolrCellTextExtractor:
 		base_url: 'http://localhost:8983/solr/update/extract'
 	```
 Note that if you're using multiple cores, you'll need to add the core name to the URL
 (e.g. `http://localhost:8983/solr/PageSolrIndex/update/extract`).
@@ -86,6 +90,27 @@ uses multiple cores by default, and comes prepackaged with a Solr server.
 It's a stripped-down version of Solr; follow the module README to add
 Apache Tika text extraction capabilities.
 You need to ensure that some indexable property on your object
 returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
 or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" below).
 The property should be listed in your `SolrIndex` subclass, e.g. as follows:
 	```php
 	class MyDocument extends DataObject {
 		static $db = array('Path' => 'Text');
 		function getContent() {
 			$extractor = FileTextExtractor::for_file($this->Path);
 			return $extractor ? $extractor->getContent($this->Path) : null;		
 		}
 	}
 	class NZQASolrIndex extends SolrIndex {
 		function init() {
 			$this->addClass('MyDocument');
 			$this->addFulltextField('Content', 'HTMLText');
 		}
 	}
 	```
 Note: This isn't an efficient way to process large numbers of files, since
 each HTTP request is run synchronously.
@@ -97,7 +122,7 @@ Manual extraction:
 	$extractor = FileTextExtractor::for_file($myFile);
 	$content = $extractor->getContent($myFile);
-DataObject extraction:
+Extraction with `FileTextExtractable` extension applied:
 	$myFileObj = File::get()->First();
 	$content = $myFileObj->extractFileAsText();
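
A note on the multi-core case from the Apache Solr hunk: the `base_url` value differs only by the core name segment, so it could be derived from the server URL rather than typed by hand. The following is a minimal sketch in plain PHP, not part of the module. The helper name `solr_extract_url()` is hypothetical, and it assumes the `/update/extract` handler path and the `PageSolrIndex` core name shown in the README.

```php
<?php
/**
 * Hypothetical helper (not part of the textextraction module):
 * builds the Solr extract URL, optionally scoped to a core.
 *
 * @param string      $serverUrl Solr root, e.g. 'http://localhost:8983/solr'
 * @param string|null $core      Core name, or null for a single-core setup
 * @return string
 */
function solr_extract_url($serverUrl, $core = null) {
	// Normalise the root so a trailing slash does not produce '//'.
	$url = rtrim($serverUrl, '/');

	// Insert the core name only when one is given.
	if ($core !== null && $core !== '') {
		$url .= '/' . rawurlencode($core);
	}

	// Handler path assumed from the README's config example.
	return $url . '/update/extract';
}

echo solr_extract_url('http://localhost:8983/solr'), "\n";
echo solr_extract_url('http://localhost:8983/solr/', 'PageSolrIndex'), "\n";
```

The resulting string is what would go into `SolrCellTextExtractor.base_url` in `mysite/_config/config.yml`.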