mirror of
https://github.com/silverstripe/silverstripe-textextraction
synced 2024-10-22 09:06:00 +00:00
More docs on how to use extraction with Solr
parent
b32bc08dc4
commit
24a055a741
31
README.md
@@ -32,12 +32,13 @@ Note: Previously part of the [sphinx module](https://github.com/silverstripe/sil
 The recommended installation is through [composer](http://getcomposer.org).
 Add the following to your `composer.json`:
 
-:::js
+```js
 {
   "require": {
     "silverstripe/textextraction": "*"
   }
 }
+```
 
 The module depends on the [Guzzle HTTP Library](http://guzzlephp.org),
 which is automatically checked out by composer. Alternatively, install Guzzle
@@ -60,9 +61,10 @@ PDFs require special handling, for example through the [XPDF](http://www.foolabs
 commandline utility. Follow their installation instructions, its presence will be automatically
 detected. You can optionally set the binary path in `mysite/_config/config.yml`:
 
-:::yml
+```yml
 PDFTextExtractor:
   binary_location: /my/path/pdftotext
+```
 
 ### Apache Solr
 
@@ -76,8 +78,10 @@ in your database driver, or even pass it back to Solr as part of a full index up
 
 In order to use Solr, you need to configure a URL for it (in `mysite/_config/config.yml`):
 
+```yml
 SolrCellTextExtractor:
   base_url: 'http://localhost:8983/solr/update/extract'
+```
 
 Note that in case you're using multiple cores, you'll need to add the core name to the URL
 (e.g. 'http://localhost:8983/solr/PageSolrIndex/update/extract').
@@ -86,6 +90,27 @@ uses multiple cores by default, and comes prepackaged with a Solr server.
 Its a stripped-down version of Solr, follow the module README on how to add
 Apache Tika text extraction capabilities.
 
+You need to ensure that some indexable property on your object
+returns the contents, either by directly accessing `FileTextExtractable->extractFileAsText()`,
+or by writing your own method around `FileTextExtractor->getContent()` (see "Usage" below).
+The property should be listed in your `SolrIndex` subclass, e.g. as follows:
+
+```php
+class MyDocument extends DataObject {
+    static $db = array('Path' => 'Text');
+    function getContent() {
+        $extractor = FileTextExtractor::for_file($this->Path);
+        return $extractor ? $extractor->getContent($this->Path) : null;
+    }
+}
+class NZQASolrIndex extends SolrIndex {
+    function init() {
+        $this->addClass('MyDocument');
+        $this->addFulltextField('Content', 'HTMLText');
+    }
+}
+```
+
 Note: This isn't a terribly efficient way to process large amounts of files, since
 each HTTP request is run synchronously.
 
@@ -97,7 +122,7 @@ Manual extraction:
 $extractor = FileTextExtractor::for_file($myFile);
 $content = $extractor->getContent($myFile);
 
-DataObject extraction:
+Extraction with `FileTextExtractable` extension applied:
 
 $myFileObj = File::get()->First();
 $content = $myFileObj->extractFileAsText();