OK, here is the new and improved version. I have tested it fairly thoroughly, and it seems to work very well.
I wanted this to:
1) Work well for most people without any extra configuration.
2) Allow essentially everything that is legal, as long as it doesn’t cause a problem for browsers (or users).
3) Allow customization via system settings and the lexicon.
The code below can replace the cleanAlias function in /core/model/modx/modresource.class.php.
This is Revo code, but it could be modified to replace the stripAlias function in manager/processors/save_content.processor.php for MODx version 0.9.6.
function cleanAlias($alias) {
    global $modx;

    // determine the charset, falling back to UTF-8
    $charset = $this->xpdo->getOption('modx_charset',null,'UTF-8');
    $charset = !empty($charset) ? strtoupper($charset) : 'UTF-8';

    $alias = html_entity_decode($alias, ENT_QUOTES, $charset); // convert html entity codes into normal text
    $alias = strip_tags($alias); // remove html

    // find the value to replace '&'
    $modx->lexicon->load('default');
    $and = $modx->lexicon('and');
    $alias = str_replace('&',$and,$alias);

    // replace non-breaking spaces with normal spaces
    // (decode with the same charset as the alias so the bytes actually match)
    $alias = str_replace(html_entity_decode('&nbsp;', ENT_QUOTES, $charset),' ',$alias);

    // let user preserve uppercase (useful for CamelCaseURLs)
    if ($this->xpdo->getOption('alias_allow_uppercase',null,0) != 1) {
        $alias = mb_convert_case($alias, MB_CASE_LOWER, $charset); // convert to lowercase
    }

    $unsafechars = '/[\0\x0B\t\n\r\f\a&=+%#<>"~`@\?\[\]\{\}\|\^\'\\\\]/'; // pattern to match reserved and unsafe chars
    $alias = preg_replace($unsafechars, '', $alias); // clean the alias
    $alias = trim($alias);

    $separator = $this->xpdo->getOption('alias_word_separator',null,'-'); // let user use a special separator or no separator
    $separator = preg_replace($unsafechars, '', $separator); // clean the separator
    $alias = preg_replace('/\s+/', $separator, $alias); // replace whitespace with the separator

    if ($this->xpdo->getOption('alias_allow_punctuation',null,0) != 1) {
        $alias = preg_replace('/[;:!\,\.\/\(\)\*]/', '', $alias); // remove common punctuation including . and /
    }

    // collapse common separators (yes, a space is allowed as a separator if someone really wants to use it)
    $alias = preg_replace('|-+|', '-', $alias);
    $alias = preg_replace('|_+|', '_', $alias);
    $alias = preg_replace('| +|', ' ', $alias);

    // collapse characters that could cause directory problems (while still allowing pseudo folders)
    $alias = preg_replace('|\/+|', '/', $alias);
    $alias = preg_replace('|\.+|', '.', $alias); // don't allow ..
    $alias = preg_replace('/\.\//', '.', $alias); // don't allow ./

    // clean up the beginning and end of the alias, also removing path chars from beginning and end
    $alias = trim($alias, ' -/.');
    return $alias;
}
This should work for nearly all users without any extra configuration, and it is highly customizable via system settings and the lexicon. Please reply to this post if you find any bugs.
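If you want to try the cleaning steps outside of MODx, here is a rough standalone sketch. It is not the Revo method itself: cleanAliasSketch is a made-up name, the three system settings (alias_allow_uppercase, alias_word_separator, alias_allow_punctuation) are passed in as a plain array, and the lexicon lookup for '&' is replaced by an $and parameter that I assume defaults to 'and'. It also assumes the mbstring extension is available.

```php
<?php
// Hypothetical standalone sketch of the cleanAlias steps, for experimenting outside MODx.
// $options stands in for the system settings; $and stands in for the lexicon entry.
function cleanAliasSketch($alias, array $options = array(), $and = 'and', $charset = 'UTF-8') {
    $options = array_merge(array(
        'alias_allow_uppercase'   => 0,
        'alias_word_separator'    => '-',
        'alias_allow_punctuation' => 0,
    ), $options);

    $alias = html_entity_decode($alias, ENT_QUOTES, $charset); // entities to text
    $alias = strip_tags($alias);                               // remove html
    $alias = str_replace('&', $and, $alias);
    $alias = str_replace(html_entity_decode('&nbsp;', ENT_QUOTES, $charset), ' ', $alias);

    if ($options['alias_allow_uppercase'] != 1) {
        $alias = mb_convert_case($alias, MB_CASE_LOWER, $charset);
    }

    // same reserved/unsafe character pattern as in cleanAlias
    $unsafechars = '/[\0\x0B\t\n\r\f\a&=+%#<>"~`@\?\[\]\{\}\|\^\'\\\\]/';
    $alias = trim(preg_replace($unsafechars, '', $alias));

    $separator = preg_replace($unsafechars, '', $options['alias_word_separator']);
    $alias = preg_replace('/\s+/', $separator, $alias);

    if ($options['alias_allow_punctuation'] != 1) {
        $alias = preg_replace('/[;:!\,\.\/\(\)\*]/', '', $alias);
    }

    // collapse repeated separators and path characters
    $alias = preg_replace('|-+|', '-', $alias);
    $alias = preg_replace('|_+|', '_', $alias);
    $alias = preg_replace('| +|', ' ', $alias);
    $alias = preg_replace('|\/+|', '/', $alias);
    $alias = preg_replace('|\.+|', '.', $alias);
    $alias = preg_replace('/\.\//', '.', $alias);

    return trim($alias, ' -/.');
}

// Default settings
echo cleanAliasSketch("Tom &amp; Jerry's Big Adventure!"), "\n"; // tom-and-jerrys-big-adventure

// Keep uppercase and use an underscore as the word separator
echo cleanAliasSketch('CamelCase URLs', array(
    'alias_allow_uppercase' => 1,
    'alias_word_separator'  => '_',
)), "\n"; // CamelCase_URLs

// Allow punctuation: repeated slashes and dots are still collapsed
echo cleanAliasSketch('docs//setup..html', array('alias_allow_punctuation' => 1)), "\n"; // docs/setup.html
```

The array keys are the same names the function reads with getOption, so the three calls show roughly what each system setting changes.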
There are a few things that would be nice to add to this, such as replacing various Unicode dashes with a regular dash.
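For the dash idea, something along these lines might do it. This is only a sketch: normalizeDashes is a hypothetical helper, not part of MODx, and it assumes the alias is valid UTF-8 (the /u modifier needs that). The set of dashes it covers is my own pick and could be widened.

```php
<?php
// Hypothetical helper: map common Unicode dashes to a regular '-'.
function normalizeDashes($alias) {
    // U+2010..U+2015: hyphen, non-breaking hyphen, figure dash, en dash, em dash, horizontal bar
    // U+2212: minus sign
    $result = preg_replace('/[\x{2010}-\x{2015}\x{2212}]/u', '-', $alias);
    // preg_replace returns null if the input is not valid UTF-8; keep the input in that case
    return $result === null ? $alias : $result;
}

// En dash (U+2013) written as raw UTF-8 bytes
echo normalizeDashes('2009' . "\xE2\x80\x93" . '2010'), "\n"; // 2009-2010
```

If it were added to cleanAlias, it would probably belong before the step that collapses repeated '-' characters, so that mixed runs of dashes also end up as a single dash.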
This is really quite awesome. It even works for RTL languages!