silverstripe-framework/model/URLSegmentFilter.php

<?php
/**
 * @package framework
 * @subpackage model
 */

/**
 * Filter certain characters from "URL segments" (also called "slugs"), for nicer (more SEO-friendly) URLs.
 * Uses {@link SS_Transliterator} to convert non-ASCII characters to meaningful ASCII representations.
 * Use {@link $default_allow_multibyte} to allow a broader range of characters without transliteration.
 * 
 * Caution: Should not be used on full URIs with domains or query parameters.
 * To retain forward slashes in a path, filter each segment separately;
 * otherwise the slashes are stripped (or percent-encoded when multibyte characters are allowed).
 * 
 * See {@link FileNameFilter} for a similar implementation for filesystem-based URLs.
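 * 
 * Usage sketch (illustrative only; assumes the default replacements and that transliteration
 * leaves plain ASCII input unchanged):
 * <code>
 * $filter = new URLSegmentFilter();
 * // Lowercases the input, turns "&" into "-and-", whitespace into dashes, and collapses repeated dashes
 * $segment = $filter->filter('John & Spencer'); // 'john-and-spencer'
 * </code>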
 */
class URLSegmentFilter extends Object {
	
	/**
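	 * Whether to create an {@link SS_Transliterator} on demand when none has been set on the instance.
	 * 
	 * @var Boolean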
	 */
	static $default_use_transliterator = true;
	
	/**
	 * @var Array See {@link setReplacements()}.
	 */
	static $default_replacements = array(
		'/&amp;/u' => '-and-',
		'/&/u' => '-and-',
		'/\s/u' => '-', // replace whitespace with dashes
		'/_/u' => '-', // underscores to dashes
		'/[^A-Za-z0-9+.-]+/u' => '', // remove all other chars, only allow alphanumeric plus "+", dot and dash
		'/[\-]{2,}/u' => '-', // collapse duplicate dashes
		'/^[\.\-_]+/u' => '', // Remove all leading dots, dashes or underscores
	);
	
	/**
	 * If enabled, doesn't try to replace or transliterate non-ASCII characters.
	 * Useful for character sets that have little overlap with ASCII (e.g. East Asian scripts),
	 * and can improve search engine optimization for URLs.
	 * @see http://www.ietf.org/rfc/rfc3987
	 * 
	 * @var boolean
	 */
	static $default_allow_multibyte = false;
	
	/**
	 * @var Array See {@link setReplacements()}
	 */
	public $replacements = array();
	
	/**
	 * Note: Depending on the applied replacement rules, this method might result in an empty string. 
	 * 
	 * @param String $name URL path segment (without domain or query parameters), in UTF-8 encoding
	 * @return String A filtered segment compatible with RFC 3986
	 */
	public function filter($name) {
		if(!$this->getAllowMultibyte()) {
			// Only transliterate when no multibyte support is requested
			$transliterator = $this->getTransliterator();
			if($transliterator) $name = $transliterator->toASCII($name);
		}
		
		$name = mb_strtolower($name);
		$replacements = $this->getReplacements();
		
		// Unset automated removal of non-ASCII characters, and don't try to transliterate
		if($this->getAllowMultibyte() && isset($replacements['/[^A-Za-z0-9+.-]+/u'])) {
			unset($replacements['/[^A-Za-z0-9+.-]+/u']);
		}
		
		foreach($replacements as $regex => $replace) {
			$name = preg_replace($regex, $replace, $name);
		}

		// Multibyte URLs require percent encoding to comply with RFC 3986.
		// Without this setting, the "remove non-ASCII chars" regex takes care of that.
		if($this->getAllowMultibyte()) $name = rawurlencode($name);
		
		return $name;
	}
	
	/**
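	 * Note: A non-empty map set here replaces {@link $default_replacements} entirely rather than extending it.
	 * A hedged sketch of keeping the defaults while adding a rule (the "-plus-" rule is a made-up example):
	 * <code>
	 * $filter = new URLSegmentFilter();
	 * // array_merge() keeps the custom rule first, so it runs before the defaults
	 * $filter->setReplacements(array_merge(
	 * 	array('/\+/u' => '-plus-'),
	 * 	URLSegmentFilter::$default_replacements
	 * ));
	 * $segment = $filter->filter('C++ Basics'); // 'c-plus-plus-basics'
	 * </code>
	 * 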
	 * @param Array $r Map of regular expression => replacement, applied in order through preg_replace().
	 */
	public function setReplacements($r) {
		$this->replacements = $r;
	}
	
	/**
	 * @return Array The instance replacements if non-empty, otherwise {@link $default_replacements}.
	 */
	public function getReplacements() {
		return ($this->replacements) ? $this->replacements : self::$default_replacements;
	}
		
	/**
	 * @var SS_Transliterator
	 */
	protected $transliterator;
	
	/**
	 * @return SS_Transliterator|NULL
	 */
	public function getTransliterator() {
		if($this->transliterator === null && self::$default_use_transliterator) {
			$this->transliterator = SS_Transliterator::create();
		} 
		return $this->transliterator;
	}
	
	/**
	 * @param SS_Transliterator|FALSE $t Pass FALSE to disable transliteration for this instance.
	 */
	public function setTransliterator($t) {
		$this->transliterator = $t;
	}
	
	/**
	 * @var boolean
	 */
	protected $allowMultibyte;
	
	/**
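	 * Example sketch (assumes UTF-8 input and a UTF-8 mbstring internal encoding):
	 * <code>
	 * $filter = new URLSegmentFilter();
	 * $filter->setAllowMultibyte(true);
	 * // Non-ASCII characters are kept, then percent-encoded by rawurlencode()
	 * $segment = $filter->filter('Brötchen'); // 'br%C3%B6tchen'
	 * </code>
	 * 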
	 * @param boolean $bool Overrides {@link $default_allow_multibyte} for this instance.
	 */
	public function setAllowMultibyte($bool) {
		$this->allowMultibyte = $bool;
	}
	
	/**
	 * @return boolean
	 */
	public function getAllowMultibyte() {
		return ($this->allowMultibyte !== null) ? $this->allowMultibyte : self::$default_allow_multibyte;
	}
}