The short answer... Replace this:
$lookBehind = '/(?<!&|&[^;]|&[^;][^;]|&[^;][^;][^;]|&[^;][^;][^;][^;]|&[^;][^;][^;][^;][^;])';
with this:
$lookBehind = '/(?<!&|&[\w#]|&[\w#]\w|&[\w#]\w\w|&[\w#]\w\w\w|&[\w#]\w\w\w\w|&[\w#]\w\w\w\w\w)';
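For context, here is a minimal sketch of how the fragment might be completed into a full pattern. The original code that consumes $lookBehind is not shown here, so the search term, the sample text and the `<b>` wrapping are all assumptions made purely for illustration.

```php
<?php
// Fragment from the short answer: opening delimiter plus the negative lookbehind.
$lookBehind = '/(?<!&|&[\w#]|&[\w#]\w|&[\w#]\w\w|&[\w#]\w\w\w|&[\w#]\w\w\w\w|&[\w#]\w\w\w\w\w)';

// Hypothetical completion: a search term plus the closing delimiter.
$term    = 'amp';
$pattern = $lookBehind . preg_quote($term, '/') . '/';

// Hypothetical sample text: one "amp" inside an entity, one in plain text.
$text   = 'Tom &amp; Jerry turned the amp up';
$result = preg_replace($pattern, '<b>$0</b>', $text);

// Only the plain-text "amp" is wrapped; the one inside "&amp;" is left alone.
echo $result, "\n";
```

The "amp" inside "&amp;" is skipped because it is immediately preceded by "&", which is the first alternative of the lookbehind.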
The long answer...
The root cause of the problem is a bug in the PCRE library (v6.6 and earlier). The changelog for PCRE version 6.7 explains this quite nicely in item 5:
...
5. A negated single-character class was not being recognized as fixed-length in lookbehind assertions such as (?<=[^f]), leading to an incorrect compile error "lookbehind assertion is not fixed length".
...
The regex in question uses a negated single-character class, so on the affected PCRE versions it will always trigger this compile error. So the solution is to 1.) use PCRE v6.7 or higher, or 2.) change the regex so that it does not use a negated single-character class. Solution 2 is the most universal fix, since PCRE 6.6 (and earlier) is still in general use.
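If you are not sure which PCRE version your PHP is linked against, something like the following sketch can tell you. It relies on the PCRE_VERSION constant (available since PHP 5.2.4); the variable names are just illustrative.

```php
<?php
// PCRE_VERSION looks like "8.44 2020-02-12"; keep only the version number.
$parts       = explode(' ', PCRE_VERSION);
$pcreVersion = $parts[0];

// Hypothetical flag name: true when the lookbehind bug fixed in 6.7 is absent.
$hasNegatedClassFix = version_compare($pcreVersion, '6.7', '>=');

echo 'PCRE ', $pcreVersion, $hasNegatedClassFix
    ? " accepts [^;] inside a lookbehind\n"
    : " needs a lookbehind without negated classes\n";
```

Even if your own server reports 6.7 or later, code that gets deployed elsewhere is probably safer with solution 2.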
Solution:
The regex in question is a negative lookbehind that attempts to ensure that the current position in the string is not part of an HTML entity. Well, in addition to the PCRE compile error, there are other problems with this regex! For starters, the subexpression that is causing all the problems, '[^;]', matches exactly one character that is anything other than a semicolon. This does, in fact, correctly match characters that are part of an HTML entity (e.g. '&copy;', '&#123;', '&#x1FF;', etc.), but it also erroneously matches many, many characters that are NOT valid within an HTML entity! For example, this regex matches all of the following strings and considers each of them to be part of a valid HTML entity: '&#$%^*', '&=][{}', '& A B', etc. To put it mildly, this regex is doing a really poor job of (negatively) matching an HTML entity!
There are other problems that I have with the code posted above, but let me address just the immediate problem at hand. Let's fix the PCRE compile error and improve the false matching. So what is a valid HTML entity? Well, there are three basic types, but they are all sandwiched between an ampersand & and a semicolon ;. The insides can be a word, a decimal number preceded by a #, or a hexadecimal number preceded by a #x or a #X (e.g. '&copy;', '&#123;', '&#x1FF;', etc.). Here is a replacement for the $lookBehind regex fragment which avoids the PCRE pre-v6.7 compile bug and improves the accuracy of the HTML entity matching (same one as given above in the short answer):
$lookBehind = '/(?<!&|&[\w#]|&[\w#]\w|&[\w#]\w\w|&[\w#]\w\w\w|&[\w#]\w\w\w\w|&[\w#]\w\w\w\w\w)';
For each position after the opening & we match only characters that can plausibly appear within an HTML entity (instead of matching anything that is not a semicolon, as the old regex erroneously did). Note that this regex is not perfect but is much better than the original. It still matches invalid sequences such as '&_word', 'XY' and '&12345'. A more accurate version could be crafted (see below), but this one is a major improvement and should solve the PCRE compile error problem.
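To make the difference concrete, here is a rough side-by-side sketch of the old and new fragments. The helper name matchesOutsideEntity() and the sample strings are made up for this illustration, and note that the old fragment will only compile on PCRE 6.7 or later.

```php
<?php
// Old fragment (compiles only on PCRE 6.7+) and the replacement fragment.
$old = '/(?<!&|&[^;]|&[^;][^;]|&[^;][^;][^;]|&[^;][^;][^;][^;]|&[^;][^;][^;][^;][^;])';
$new = '/(?<!&|&[\w#]|&[\w#]\w|&[\w#]\w\w|&[\w#]\w\w\w|&[\w#]\w\w\w\w|&[\w#]\w\w\w\w\w)';

// Hypothetical helper: true if $needle occurs somewhere in $text at a
// position that the lookbehind does NOT consider to be inside an entity.
function matchesOutsideEntity($lookBehind, $text, $needle)
{
    $pattern = $lookBehind . preg_quote($needle, '/') . '/';
    return preg_match($pattern, $text) === 1;
}

// Hypothetical samples; in each one we look for the letter "B".
foreach (array('R & B', '&=][B', '&Beta;', '&_B') as $sample) {
    printf(
        "%-8s old: %-5s new: %s\n",
        $sample,
        var_export(matchesOutsideEntity($old, $sample, 'B'), true),
        var_export(matchesOutsideEntity($new, $sample, 'B'), true)
    );
}
```

The first two samples are not entities at all, yet the old fragment refuses to match the "B" in them while the new one finds it. Both fragments correctly skip the "B" in '&Beta;', and the last sample shows the remaining imperfection: '&_B' is not a valid entity, but the new fragment still treats it as one.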
If you are picky and want a more exact regex, here's a more correct (but slow!) solution:
$lookBehind = '/(?<!&|&#|&#[0-9]|&#[0-9][0-9]|&#[0-9][0-9][0-9]|&#[0-9][0-9' .
'][0-9][0-9]|&#[0-9][0-9][0-9][0-9][0-9]|&#[0-9][0-9][0-9][0-' .
'9][0-9][0-9]|&#[xX]|&#[xX][0-9A-Fa-f]|&#[xX][0-9A-Fa-f][0-9A' .
'-Fa-f]|&#[xX][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]|&#[xX][0-9A-F' .
'a-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]|&[A-Za-z0-9]|&[A-Za-z0' .
'-9][A-Za-z0-9]|&[A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]|&[A-Za-z0-' .
'9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]|&[A-Za-z0-9][A-Za-z0-9][' .
'A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]|&[A-Za-z0-9][A-Za-z0-9][A-Z' .
'a-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9])';
Here is the same thing in a format that can be understood by us mere mortals...
(?<!&
|&#
|&#[0-9]
|&#[0-9][0-9]
|&#[0-9][0-9][0-9]
|&#[0-9][0-9][0-9][0-9]
|&#[0-9][0-9][0-9][0-9][0-9]
|&#[0-9][0-9][0-9][0-9][0-9][0-9]
|&#[xX]
|&#[xX][0-9A-Fa-f]
|&#[xX][0-9A-Fa-f][0-9A-Fa-f]
|&#[xX][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
|&#[xX][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
|&[A-Za-z0-9]
|&[A-Za-z0-9][A-Za-z0-9]
|&[A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]
|&[A-Za-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]
|&[A-Za-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]
|&[A-Za-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9][A-Za-z0-9])
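Typing that many character classes by hand is error-prone, so as an alternative you could generate the fragment instead. This is only a sketch: buildEntityLookBehind() is a hypothetical helper name, and the repeat counts (6 decimal digits, 4 hex digits, 6 name characters) are simply read off the listing above. It sticks to old-style array() syntax so that it should also run on the older PHP versions that ship the affected PCRE.

```php
<?php
// Hypothetical helper: builds the "exact" lookbehind fragment by repeating
// character classes, in the same order as the hand-written listing.
function buildEntityLookBehind()
{
    $alternatives = array('&', '&#');

    // Decimal references: "&#" followed by 1 to 6 digits.
    for ($n = 1; $n <= 6; $n++) {
        $alternatives[] = '&#' . str_repeat('[0-9]', $n);
    }

    // Hex references: "&#x" or "&#X" followed by 0 to 4 hex digits.
    $alternatives[] = '&#[xX]';
    for ($n = 1; $n <= 4; $n++) {
        $alternatives[] = '&#[xX]' . str_repeat('[0-9A-Fa-f]', $n);
    }

    // Named entities: "&" followed by 1 to 6 alphanumerics.
    for ($n = 1; $n <= 6; $n++) {
        $alternatives[] = '&' . str_repeat('[A-Za-z0-9]', $n);
    }

    // Every alternative is fixed-length, as PCRE requires in a lookbehind.
    return '/(?<!' . implode('|', $alternatives) . ')';
}

$lookBehind = buildEntityLookBehind();
```

Changing a limit (say, allowing longer entity names) then means changing one loop bound rather than editing a wall of brackets.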
Hope this helps!