More docs on how to use extraction with Solr

2024-10-22 09:06:00 +00:00 · 2013-05-07 20:14:01 +02:00 · 24a055a741
commit 24a055a741
parent b32bc08dc4
1 changed file with 28 additions and 3 deletions
--- a/README.md
+++ b/README.md
@@ -32,12 +32,13 @@ Note: Previously part of the [sphinx module](https://github.com/silverstripe/sil
 The recommended installation is through [composer](http://getcomposer.org).
 Add the following to your `composer.json`:

-	:::js
+	```js
 	{
 		"require": {
 			"silverstripe/textextraction": "*"
 		}
 	}
+	```

 The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
 which is automatically checked out by composer. Alternatively, install Guzzle
@@ -60,9 +61,10 @@ PDFs require special handling, for example through the [XPDF](http://www.foolabs
 commandline utility. Follow their installation instructions, its presence will be automatically
 detected. You can optionally set the binary path in `mysite/_config/config.yml`:

-	:::yml
+	```yml
 	PDFTextExtractor:
 		binary_location: /my/path/pdftotext
+	```

 ### Apache Solr

@@ -76,8 +78,10 @@ in your database driver, or even pass it back to Solr as part of a full index up

 In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):

+	```yml
 	SolrCellTextExtractor:
 		base_url: 'http://localhost:8983/solr/update/extract'
+	```

 Note that in case you're using multiple cores, you'll need to add the core name to the URL 
 (e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
@@ -86,6 +90,27 @@ uses multiple cores by default, and comes prepackaged with a Solr server.
 Its a stripped-down version of Solr, follow the module README on how to add
 Apache Tika text extraction capabilities.

+You need to ensure that some indexable property on your object
+returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
+or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" below).
+The property should be listed in your `SolrIndex` subclass, e.g. as follows:
+
+	```php
+	class MyDocument extends DataObject {
+		static $db = array('Path' => 'Text');
+		function getContent() {
+			$extractor = FileTextExtractor::for_file($this->Path);
+			return $extractor ? $extractor->getContent($this->Path) : null;		
+		}
+	}
+	class NZQASolrIndex extends SolrIndex {
+		function init() {
+			$this->addClass('MyDocument');
+			$this->addFulltextField('Content', 'HTMLText');
+		}
+	}
+	```
+
 Note: This isn't a terribly efficient way to process large amounts of files, since 
 each HTTP request is run synchronously.

@@ -97,7 +122,7 @@ Manual extraction:
 	$extractor = FileTextExtractor::for_file($myFile);
 	$content = $extractor->getContent($myFile);

-DataObject extraction:
+Extraction with `FileTextExtractable` extension applied:

 	$myFileObj = File::get()->First();
 	$content = $myFileObj->extractFileAsText();
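
For reference, the multi-core note in the Apache Solr hunk could translate into a configuration like the sketch below. It reuses the `SolrCellTextExtractor` `base_url` setting and the example core name `PageSolrIndex` from the README text; the host and port are the README's own example values and would need adjusting for a real installation. Indentation uses spaces, since YAML does not allow tabs for indentation.

```yml
# mysite/_config/config.yml
# Sketch only: assumes a Solr core named PageSolrIndex, as in the README example.
SolrCellTextExtractor:
  base_url: 'http://localhost:8983/solr/PageSolrIndex/update/extract'
```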