37 Commits

Author SHA1 Message Date
Elliot Sawyer
6fe7172663 MINOR: module support for Silverstripe 3.7 2019-04-12 15:11:35 +12:00
Robbie Averill
944cabb805 Merge branch '2.1' into 2 2019-04-12 15:02:09 +12:00
Steve Boyd
2cb7201674 Change the tika requirement from 1.7.0 to 1.7 2018-07-11 16:59:36 +12:00
Daniel Hensby
eb25505a8e Merge pull request #2 from cam-findlay/patch-1 2017-11-23 13:18:44 +00:00
Jake Dale Ovenden
eb7a45865b Allow username and password in requests to Tika server (#35) 2017-11-23 10:24:32 +13:00
Juan van den Anker
0761311170 Don't try to save the object to the cache if it has been disabled 2017-02-22 15:17:32 +13:00
Alexandre Guidet
196007314a fixed the version comparison using version_compare() instead of plain float 2016-10-19 15:46:30 +13:00
Daniel Hensby
e9e33605b4 FIX PDFTextExtractor no longer smushes words together that break across lines 2016-10-03 23:59:18 +01:00
Jake Bentvelzen
75ffe7b56a fix(PDFTextExtractor): Added support for Windows, but only if 'binary_location' is defined. Updated documentation to inform the user of this. 2016-05-13 15:07:33 +10:00
Damian Mooyman
f72ba3a978 API Whitelist bin paths for pdftotext 2016-02-25 16:40:25 +13:00
helpfulrobot
8e14595f1a Converted to PSR-2 2015-11-18 17:07:31 +13:00
Loz Calver
9ea4b79543 FIX: SolrCellTextExtractor always reporting itself as unavailable (fixes #14) 2015-06-08 12:42:31 +01:00
Christopher Pitt
fbc31692e7 Using Symfony mime type detection 2015-05-13 21:36:05 +12:00
Ingo Schommer
da6c554acb Check file existence in for_file()
finfo() will silently fail the whole request (at least on my PHP 5.4 install)
if invoked on a file that doesn't exist, so fail early here.
2015-05-12 16:45:03 +12:00
Damian Mooyman
c9d74f83db API Only invalidate cache when file is changed 2015-05-12 16:01:38 +12:00
Damian Mooyman
6cf09f26c8 Merge pull request #9 from chillu/pulls/tika-logging
Improved Tika error logging
2015-05-12 15:27:08 +12:00
Damian Mooyman
6c7ffa2c6f Merge pull request #10 from chillu/pulls/truncate-db-cache
Truncate FileContentCache by default to avoid SQL query errors
2015-05-12 15:25:59 +12:00
Damian Mooyman
1f4083dda4 BUG Fix incorrect cache key generation 2015-05-12 15:23:14 +12:00
Ingo Schommer
8aca06aef2 Truncate FileContentCache by default to avoid SQL query errors
MySQL has a packet limit of 1MB as a default
(http://dev.mysql.com/doc/refman/5.0/en/packet-too-large.html).
This interferes with the UPDATE queries required
to add file content caches. Since the query can't be terminated
correctly, the whole content will be discarded with a query error.

This change allows content to be truncated prior to the UPDATE operation,
and defaults to 500 characters. This leaves some room for multibyte
characters as well as other parts of the SQL query.
2015-05-07 19:14:02 +12:00
Ingo Schommer
72ce8fc0bc Improved Tika error logging 2015-05-07 12:06:59 +12:00
Damian Mooyman
98fd4228f9 Provide alternative backends for caching of extracted content
Implement Flushable for clearing the cache
2015-05-05 17:22:45 +12:00
Christopher Pitt
b7488577ad Downgraded Guzzle version 2015-03-05 13:57:31 +13:00
Damian Mooyman
1ad9e46727 API Support tika server 2015-02-25 17:55:41 +13:00
Damian Mooyman
2977f85cb5 API Implement Tika support
API Implement support for detection via mime-type as well as file extension
API Implement FileContent property for safe usage in templates
API instead of returning the list of extensions / mime types supported, support is determined on a per-file basis
Marking dev-master as version 2.0 as this contains breaking changes
2015-02-20 15:12:20 +13:00
cam-findlay
a34c443be5 FIX additional exception handling for Tika errors returned via Guzzle.
Tika server errors via Guzzle can cause the Solr search query to return a 500 error, which breaks search results pages for users. The issue related to uncaught exceptions from Guzzle causing a silent failure if a text file is unreadable or missing (the null return never occurs, which breaks the search).
2013-06-07 10:42:38 +12:00
Ingo Schommer
30223e4f7c 3.1 compat 2013-05-07 21:54:51 +02:00
Ingo Schommer
b32bc08dc4 More resilience in SolrCellTextExtractor
Shouldn't outright fail the request if a file can't be found
2013-05-07 19:27:06 +02:00
Ingo Schommer
b86483abc4 3.1 compat 2013-05-07 18:47:56 +02:00
Ingo Schommer
f2c8df2348 BUG Exclude meta info from SolrCell content retrieval
Was matching </str> greedily, which included too much content
2013-03-11 00:56:44 +01:00
Ingo Schommer
9af389f51b NEW SolrCellTextExtractor 2013-02-01 15:35:16 +01:00
Ingo Schommer
14816075b8 FIX Case insensitive extension matching 2013-02-01 15:34:54 +01:00
Ingo Schommer
788a49bf9f BUG Improved HTMLTextExtractor, remove non-content tags 2012-09-06 13:41:21 +02:00
Ingo Schommer
733644d6bb Better shell execution feedback from PDF extractor 2012-08-27 11:31:53 +02:00
Ingo Schommer
f3fcf60c0f FileTextExtractor->isAvailable() 2012-08-22 18:25:55 +02:00
Ingo Schommer
977c4e49c9 API Using paths instead of File objects in extractors
Makes coupling to File objects optional, by choosing
to use the FileTextExtractable extension.
2012-08-22 18:25:12 +02:00
Ingo Schommer
7de717b0bd 3.0 compat 2012-08-22 18:24:38 +02:00
Ingo Schommer
ec0921c6d1 Initial commit 2012-08-22 17:52:08 +02:00