Jake Bentvelzen
75ffe7b56a
fix(PDFTextExtractor): Added support for Windows, but only if 'binary_location' is defined. Updated documentation to inform the user of this.
2016-05-13 15:07:33 +10:00
Damian Mooyman
f72ba3a978
API Whitelist bin paths for pdftotext
2016-02-25 16:40:25 +13:00
helpfulrobot
8e14595f1a
Converted to PSR-2
2015-11-18 17:07:31 +13:00
Loz Calver
9ea4b79543
FIX: SolrCellTextExtractor always reporting itself as unavailable (fixes #14)
2015-06-08 12:42:31 +01:00
Christopher Pitt
fbc31692e7
Using Symfony mime type detection
2015-05-13 21:36:05 +12:00
Ingo Schommer
da6c554acb
Check file existence in for_file()
finfo() will silently fail the whole request (at least on my PHP 5.4 install)
if invoked on a file that doesn't exist, so fail early here.
2015-05-12 16:45:03 +12:00
Damian Mooyman
c9d74f83db
API Only invalidate cache when file is changed
2015-05-12 16:01:38 +12:00
Damian Mooyman
6cf09f26c8
Merge pull request #9 from chillu/pulls/tika-logging
Improved Tika error logging
2015-05-12 15:27:08 +12:00
Damian Mooyman
6c7ffa2c6f
Merge pull request #10 from chillu/pulls/truncate-db-cache
Truncate FileContentCache by default to avoid SQL query errors
2015-05-12 15:25:59 +12:00
Damian Mooyman
1f4083dda4
BUG Fix incorrect cache key generation
2015-05-12 15:23:14 +12:00
Ingo Schommer
8aca06aef2
Truncate FileContentCache by default to avoid SQL query errors
MySQL has a default packet limit of 1MB
(http://dev.mysql.com/doc/refman/5.0/en/packet-too-large.html).
This interferes with the UPDATE queries required
to add file content caches. Since the query can't be terminated
correctly, the whole content will be discarded with a query error.
This change allows content to be truncated prior to the UPDATE operation,
and defaults to 500 characters. This leaves some room for multibyte
characters as well as other parts of the SQL query.
2015-05-07 19:14:02 +12:00
Ingo Schommer
72ce8fc0bc
Improved Tika error logging
2015-05-07 12:06:59 +12:00
Damian Mooyman
98fd4228f9
Provide alternative backends for caching of extracted content
Implement Flushable for clearing the cache
2015-05-05 17:22:45 +12:00
Christopher Pitt
b7488577ad
Downgraded Guzzle version
2015-03-05 13:57:31 +13:00
Damian Mooyman
1ad9e46727
API Support tika server
2015-02-25 17:55:41 +13:00
Damian Mooyman
2977f85cb5
API Implement Tika support
API Implement support for detection via mime-type as well as file extension
API Implement FileContent property for safe usage in templates
API Instead of returning the list of extensions / mime types supported, support is determined on a per-file basis
Marking dev-master as version 2.0 as this contains breaking changes
2015-02-20 15:12:20 +13:00
Ingo Schommer
30223e4f7c
3.1 compat
2013-05-07 21:54:51 +02:00
Ingo Schommer
b32bc08dc4
More resilience in SolrCellTextExtractor
Shouldn't outright fail the request if a file can't be found
2013-05-07 19:27:06 +02:00
Ingo Schommer
b86483abc4
3.1 compat
2013-05-07 18:47:56 +02:00
Ingo Schommer
f2c8df2348
BUG Exclude meta info from SolrCell content retrieval
Was matching </str> greedily, which included too much content
2013-03-11 00:56:44 +01:00
Ingo Schommer
9af389f51b
NEW SolrCellTextExtractor
2013-02-01 15:35:16 +01:00
Ingo Schommer
14816075b8
FIX Case insensitive extension matching
2013-02-01 15:34:54 +01:00
Ingo Schommer
788a49bf9f
BUG Improved HTMLTextExtractor, remove non-content tags
2012-09-06 13:41:21 +02:00
Ingo Schommer
733644d6bb
Better shell execution feedback from PDF extractor
2012-08-27 11:31:53 +02:00
Ingo Schommer
f3fcf60c0f
FileTextExtractor->isAvailable()
2012-08-22 18:25:55 +02:00
Ingo Schommer
977c4e49c9
API Using paths instead of File objects in extractors
Makes coupling to File objects optional, by choosing
to use the FileTextExtractable extension.
2012-08-22 18:25:12 +02:00
Ingo Schommer
7de717b0bd
3.0 compat
2012-08-22 18:24:38 +02:00
Ingo Schommer
ec0921c6d1
Initial commit
2012-08-22 17:52:08 +02:00