Basic Search snippet and highlight plugin modification

172 Posts

TimGS Reply #11, 15 years, 3 months ago

Just an update to say that I have added pagination and the result, 1.2 beta, is at http://www.web.nutclough.co.uk/modx_search.php.

To enable pagination just add &searchPaginate=`1` in the call. If you are happy with the default templating, well and good; if not, the comments at the top of the code will have to suffice until I write some proper documentation.

I’m on holiday next week so the chances of me deciding it’s had enough testing to drop the ’beta’ are pretty low until the end of the month. You can see it in action at www.vintage.nutclough.co.uk with the items-per-page set to a rather low 4 - purely to test that the pagination works. The default is 10 but that site currently only has 10 searchable documents! Note that the home page on that site is set to be unsearchable.
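As a rough illustration of the arithmetic involved (10 documents at 4 per page gives 3 pages), here is a minimal pagination sketch. The function name paginate_results() and its return keys are made up for this example and are not taken from the snippet's code.

```php
<?php
// Hypothetical helper (not the snippet's real code): return one page of
// results together with the clamped page number and the total page count.
function paginate_results(array $results, int $page, int $perPage = 10): array
{
    $perPage    = max(1, $perPage);
    $totalPages = max(1, (int) ceil(count($results) / $perPage));
    $page       = min(max(1, $page), $totalPages); // clamp out-of-range pages
    $offset     = ($page - 1) * $perPage;

    return [
        'items'      => array_slice($results, $offset, $perPage),
        'page'       => $page,
        'totalPages' => $totalPages,
    ];
}

// Ten searchable documents, four per page, as on the test site described above.
$last = paginate_results(range(1, 10), 3, 4);
printf("page %d of %d holds %d items\n", $last['page'], $last['totalPages'], count($last['items']));
```

Clamping the page number means a hand-edited URL asking for page 99 simply shows the last page rather than an empty list.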

I don’t have any plans for any major extra features unless I need them for work, as for me, it now fulfils the function it was designed to do. On the other hand if anyone has any suggestions and can convince me they are worthwhile and not taking things away from the primary aims (simplicity, reliability - not doing a lot, but doing it well) then I’ll code them. If you can’t convince me, well hey, it’s GPL’d isn’t it!

Enjoy,
Tim.

Edit: Actually I do have one minor addition to follow - the pagination option will also soon have an option for the user to change the number of items on a page.

118 Posts

DorianJ Reply #12, 15 years, 2 months ago

I can’t figure out how to use this search snippet. Do you use separate templates for the search page and the results page?

Every time I use this with GET, it returns...

« MODx Parse Error »
MODx encountered the following error while attempting to parse the requested resource:
« PHP Parse Error »

PHP error debug
Error: Invalid argument supplied for foreach()
Error type/ Nr.: Warning - 2
File: /public_html/manager/includes/document.parser.class.inc.php(769) : eval()’d code
Line: 241

Every time I use this with POST, it returns...

No search terms given.

-Dorian

172 Posts

TimGS Reply #13, 15 years, 2 months ago

Apols DorianJ, you appear to have found a quite embarrassing bug. Fix provided at http://www.web.nutclough.co.uk/files/TimSearch-1.2.txt - sorry for the delay - have been on holiday.

Like any web based form, there are two quite distinct parts. You need:

one or more pages/documents with a HTML form,
and one page/document to process (in this case via a snippet) the data (i.e. the search terms) and output the results.

The HTML form (click on the ’Search’ tab on http://www.web.nutclough.co.uk/modx_search.php) needs to be wherever you want the search form to be, the snippet needs to be on the page where the search results are displayed.

Sure I could make the snippet echo the form by adding a parameter to indicate whether the form or the results (or both) are required, but there is no point here i.e. there is no need to process the form with a snippet - it’s always going to be the same.

Make sure the method in the form (see the line in the HTML form that reads <form action="[~27~]" method="get">) matches the &searchRequestType, unless the latter is set to ’REQUEST’, in which case both get and post requests will be handled. This may be why the behaviour was different for your get and post methods.
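To make the method/request-type mismatch concrete, here is a minimal sketch of how a results page might pick up the search terms. It is not the snippet's actual code: read_search_terms() is a hypothetical helper, the field name searchTerms is taken from the example URL further down, and treating any value other than GET or POST as ’REQUEST’ is an assumption.

```php
<?php
// Hypothetical helper: read the search terms from the array that matches
// the configured request type. Returns null when nothing usable was sent.
function read_search_terms(string $requestType, array $get, array $post, string $field = 'searchTerms'): ?string
{
    switch (strtoupper($requestType)) {
        case 'GET':
            $source = $get;
            break;
        case 'POST':
            $source = $post;
            break;
        default: // assumed to mean 'REQUEST': accept either method
            $source = array_merge($get, $post);
    }

    $terms = isset($source[$field]) ? trim((string) $source[$field]) : '';

    return $terms === '' ? null : $terms;
}

// A form submitted with method="get" while the results page expects POST:
var_dump(read_search_terms('POST', ['searchTerms' => 'cap'], []));    // NULL
var_dump(read_search_terms('REQUEST', ['searchTerms' => 'cap'], [])); // string(3) "cap"
```

In a real page the second and third arguments would be $_GET and $_POST. The first call is the kind of mismatch that would lead to a "No search terms given." message.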

Other news:

The pagination has been extended so the user can change the number of items per page e.g. http://vintage.nutclough.co.uk/search-results?searchTerms=cap&search=Go%21

Short of a very convincing request, or a requirement at work (money automatically being convincing), that leaves just testing and any resultant bug fixes to do.

It did come up in a conversation at work that this snippet contains all the PHP required for an AJAX based search system, subject to testing of the templating. I do not have the Javascript end of things however, so it remains nothing more than a conversation.

-- Tim.

172 Posts

TimGS Reply #14, 15 years, 2 months ago

For info, I have updated ver 1.2 beta so it is suitable for unfriendly URLs.

Further documentation on my website http://www.web.nutclough.co.uk/modx_search.php.

-- Tim.

1,717 Posts

coroico Reply #15, 15 years, 2 months ago

Regarding the updates proposed for the search highlight plugin:

- removing the urldecode function prevents the highlighting of search terms like "l’éducation école"

- the use of :

      $word_ents[] = htmlentities($searchArray[$i], ENT_NOQUOTES, 'UTF-8');
      $word_ents[] = htmlentities($searchArray[$i], ENT_COMPAT, 'UTF-8');
      $word_ents[] = htmlentities($searchArray[$i], ENT_QUOTES, 'UTF-8');

is correct for UTF-8 but may be wrong for other character encodings (ISO-8859-1, cp1252, ...)

172 Posts

TimGS Reply #16, 15 years, 2 months ago

Quote from: coroico at Mar 01, 2009, 05:05 PM

- removing the urldecode function prevents the highlighting of search terms like "l’éducation école"

Edit (again!): My current (recently updated) version at http://www.web.nutclough.co.uk/files/highlightplugin.txt does appear to work with AjaxSearch and, as far as I can see, is more likely to succeed than the unmodified plugin version 1.3. If you find it doesn’t work, please post with more information. The above issue was, AFAIK, dependent on the &advsearch parameter - please provide as much info as possible if any more problems come up. I will respond, but the more info that is given, the better and quicker the response can be!

The problem was not (directly) with removing urldecode() however.

Running urldecode twice (that’s once automatically by PHP, and once manually by yourself) just doesn’t work! Quote from php.net, as referred to previously:

The superglobals $_GET and $_REQUEST are already decoded. Using urldecode() on an element in $_GET or $_REQUEST could have unexpected and dangerous results.

Consider what happens to (for example) a search term like "b+w" if urldecode is run twice. Or just try it with AjaxSearch (any &advsearch mode) and the unmodified plugin. It will always be ’double decoded’ to ’b w’ not ’b+w’.

At least in this case, it will just not work - but being in the habit of running urldecode() twice could be a very serious security problem if writing code that accesses a database. Have a look elsewhere on the net - I’m only regurgitating information written elsewhere by people more knowledgeable than myself.
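The "b+w" case can be reproduced in a few lines. In this sketch parse_str() stands in for the decoding PHP performs automatically when it fills $_GET; the query string itself is just an example.

```php
<?php
// What a browser sends when the user searches for "b+w".
$queryString = 'searchTerms=b%2Bw';

// parse_str() decodes once, in the same way PHP does when populating $_GET.
parse_str($queryString, $params);
$once = $params['searchTerms'];

// The redundant second decode turns the literal "+" into a space.
$twice = urldecode($once);

var_dump($once);  // string(3) "b+w"
var_dump($twice); // string(3) "b w"
```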

This issue doesn’t apply at all with my search snippet as the various modes of AjaxSearch (exact, all, one, none) are not present. There is instead one algorithm that prioritises results based on the number of distinct search terms found in a document - as there is no ’exact’ mode, spaces are never passed.
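A minimal sketch of that kind of prioritisation, under stated assumptions: rank_by_distinct_terms() is a hypothetical function, documents are plain strings keyed by id, and matching is a simple case-insensitive substring test (stripos() only folds ASCII case). The real snippet works against the database and will differ.

```php
<?php
// Hypothetical sketch: score each document by the number of DISTINCT search
// terms it contains, then return id => score with the best matches first.
function rank_by_distinct_terms(array $documents, string $query): array
{
    // Split on whitespace, so a space is never part of a term.
    $terms = array_unique(preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY));

    $scores = [];
    foreach ($documents as $id => $text) {
        $score = 0;
        foreach ($terms as $term) {
            if (stripos($text, $term) !== false) {
                $score++; // a term counts once, however often it occurs
            }
        }
        if ($score > 0) {
            $scores[$id] = $score;
        }
    }

    arsort($scores); // highest score first, document ids preserved as keys

    return $scores;
}

$ranked = rank_by_distinct_terms(
    [1 => 'flat cap', 2 => 'vintage cap and vintage scarf', 3 => 'gloves'],
    'vintage cap'
);
print_r($ranked);
```

Document 2 contains both terms and ranks above document 1, which contains one; document 3 contains neither and is dropped.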

Incidentally, if you insert "l’éducation école" into a document using TinyMCE (even if you or I prefer HTML, this is what a client would do) AjaxSearch will not find it (unless we produce non-validating sites due to raw characters).

Quote from: coroico at Mar 01, 2009, 05:05 PM

- the use of :
      $word_ents[] = htmlentities($searchArray[$i], ENT_NOQUOTES, 'UTF-8');
      $word_ents[] = htmlentities($searchArray[$i], ENT_COMPAT, 'UTF-8');
      $word_ents[] = htmlentities($searchArray[$i], ENT_QUOTES, 'UTF-8');
is correct for UTF-8 but may be wrong for other character encodings (ISO-8859-1, cp1252, ...)

We always use UTF8 at work, and as that was my primary motive it sufficed.

Bear in mind that with the above code as is, you are extremely unlikely to get false highlights if the character set is not UTF8 - just consider what has to happen for this to occur! Extremely few people will ever see a negative effect and the majority will see a positive one, because TinyMCE does store entities.

However you have made a good point. A better and more general solution would be to detect the database encoding which I believe is held somewhere in $modx->config.
Edit: It’s held in the global variable $database_connection_charset;
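One possible shape for that detection, as a sketch only: the database charset name is translated into a name that htmlentities() accepts. The function entity_charset() and its mapping table are assumptions made for illustration (MySQL-style names on the left, falling back to UTF-8 for anything unknown); they are not taken from the plugin.

```php
<?php
// Hypothetical mapping from a database connection charset name to a charset
// name understood by htmlentities(). The table is an assumption and is not
// exhaustive; unknown names fall back to UTF-8.
function entity_charset(string $dbCharset): string
{
    $map = [
        'utf8'    => 'UTF-8',
        'utf8mb4' => 'UTF-8',
        'latin1'  => 'ISO-8859-1',
        'cp1251'  => 'cp1251',
    ];

    return $map[strtolower($dbCharset)] ?? 'UTF-8';
}

// "école" stored as ISO-8859-1 bytes (0xE9 is e-acute in that charset).
echo htmlentities("\xE9cole", ENT_QUOTES, entity_charset('latin1')), "\n";
```

In the plugin the argument would come from $database_connection_charset rather than a literal.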

Admittedly this is not a problem when used with AjaxSearch as the entities wouldn’t have been found by AjaxSearch anyway, so there isn’t any point in trying to highlight them in that case.

As my more basic search snippet does find these, I needed to be able to highlight them, hence the above code. Note also that as of v1.2 beta, my own basic search snippet assumes UTF8. I will provide scope for alternative character sets soon.
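For illustration, here is roughly what those three htmlentities() calls produce for one term. term_variants() is a hypothetical wrapper, the file is assumed to be saved as UTF-8, and a straight apostrophe is used so that the effect of the quote flags is visible.

```php
<?php
// Hypothetical wrapper: collect every form a search term may take in stored
// content, i.e. the raw term plus its entity-encoded versions.
function term_variants(string $term, string $charset = 'UTF-8'): array
{
    $variants = [$term];
    foreach ([ENT_NOQUOTES, ENT_COMPAT, ENT_QUOTES] as $flags) {
        $variants[] = htmlentities($term, $flags, $charset);
    }

    // The flags only differ in how they treat quotes, so duplicates are common.
    return array_values(array_unique($variants));
}

$variants = term_variants("l'éducation");
print_r($variants);
```

Here ENT_NOQUOTES and ENT_COMPAT give the same string, so three distinct variants remain: the raw term, the term with only the accent encoded, and the term with the apostrophe encoded as well.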

-- Tim.

172 Posts

TimGS Reply #17, 15 years, 2 months ago

Updated version on http://www.web.nutclough.co.uk/modx_search.php to cater for non-UTF8 character sets. Thanks to Coroico for highlighting this issue.

I still regard this as a beta version, but am confident enough to use it on two sites currently under development for clients.

To be fair to others using it with clients, it’s worth pointing out that this means it is being tested well, but only with the techniques and designs used at my work, e.g. UTF-8 only! If you stray from this or from any of the default settings, I suggest checking things well. Any bugs reported to me will be fixed ASAP.

-- Tim.

1,717 Posts

coroico Reply #18, 15 years, 1 month ago

Tim, please find enclosed a proposal for release 1.4 of the searchHighlight plugin.

Many of your improvements have been integrated. The main differences from your 1.3 release are:

- the page charset (linked with the database charset) is taken into account when calling htmlentities

- a fix for an issue with $database_connection_charset: it was not declared as a global variable, so it was always empty!

- a fix for the class names of the newly generated terms. When a document contains, for instance, both "alphabétisation" and "alphab&eacute;tisation", the same class ajaxSearch_highlight1 must be used for both forms. To do that, you have to remember the index of the original class.

- Regarding the look-behind assertion: yours doesn’t work properly. If you search for "ute", it doesn’t avoid matching &eacute; with:

	$pattern = '/(?<!&)(?<!&.)(?<!&.^;)(?<!&.^;^;)(?<!&.^;^;^;)(?<!&.^;^;^;^;)(?<!&.^;^;^;^;^;)(?<!&.^;^;^;^;^;^;)(?<!&.^;^;^;^;^;^;^;)'.preg_quote($word, '/') . '(?=[^>]*<)/' . $pcreModifier;

My proposal is to use:

    $pcreModifier = ($pgCharset == 'UTF-8') ? 'iu' : 'i';
    $lookBehind = '/(?<!&|&[^;]|&[^;][^;]|&[^;][^;][^;])'; // avoid a match with a html entity
    $lookAhead = '(?=[^>]*<)/'; // avoid a match with a html tag

with:

$pattern = $lookBehind . preg_quote($word, '/') . $lookAhead . $pcreModifier;

This simpler solution assumes that the search term has at least 3 characters.
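A small test of the proposed pattern, assembled exactly as above. The sample HTML, the UTF-8 value of $pgCharset and the replacement markup are assumptions made for this demonstration.

```php
<?php
$pgCharset = 'UTF-8'; // assumed page charset for this demo

$pcreModifier = ($pgCharset == 'UTF-8') ? 'iu' : 'i';
$lookBehind = '/(?<!&|&[^;]|&[^;][^;]|&[^;][^;][^;])'; // avoid a match with a html entity
$lookAhead = '(?=[^>]*<)/'; // avoid a match with a html tag

// "ute" occurs twice: inside the entity &eacute; and inside the word "chute".
$html = '<p>alphab&eacute;tisation, chute</p>';
$word = 'ute';

$pattern = $lookBehind . preg_quote($word, '/') . $lookAhead . $pcreModifier;
$highlighted = preg_replace($pattern, '<span class="ajaxSearch_highlight1">$0</span>', $html);

echo $highlighted, "\n";
// <p>alphab&eacute;tisation, ch<span class="ajaxSearch_highlight1">ute</span></p>
```

Only the "ute" in "chute" is wrapped; the one inside &eacute; is preceded by "&eac", which the look-behind rejects.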

The advSearchHighlight plugin (which is a variant of searchHighlight plugin) works on the demo site of ajaxSearch.
I have created a document named "html entities" with lots of words encoded as html entities.
This document belongs to the French document hierarchy (French is a beautiful language with lots of accented characters).

So for a test, you first need to select the "french documents" (bottom left of each page). Then search for, say, "alphabétisation" or "éducation école" and click on the "html entities" link to display the document with the highlighted search terms.
Look at the source code of the page. Then try searching for "ute" or "rave" to check that html entities are not matched by the advSearchHighlight plugin.

Here are some direct links to the results:
alphabétisation: here alphabétisation and alphab&eacute;tisation are correctly highlighted.
éducation école
ute: in this document &eacute; is not highlighted.
gra: in this document &agrave; is not highlighted.

Thanks for your feedback on this new release of the searchHighlight plugin.

172 Posts

TimGS Reply #19, 15 years, 1 month ago

Quote from: coroico at Mar 15, 2009, 01:13 AM

- the page charset (linked with the database charset) is taken into account when calling htmlentities

Sounds good.

Quote from: coroico at Mar 15, 2009, 01:13 AM

- a fix for an issue with $database_connection_charset: it was not declared as a global variable, so it was always empty!

I don’t think I fiddled with that bit.
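For anyone unfamiliar with the scoping rule behind that fix: inside a PHP function, a variable defined at global scope is invisible unless it is declared with the global keyword. The two function names below are made up for the demonstration and the value 'utf8' is just an example.

```php
<?php
// Stand-in for the value a config file would set at global scope.
$GLOBALS['database_connection_charset'] = 'utf8';

function charset_without_global(): string
{
    // No "global" declaration: inside the function this name refers to a
    // fresh, unset local variable, so the result is always empty.
    return isset($database_connection_charset) ? $database_connection_charset : '';
}

function charset_with_global(): string
{
    global $database_connection_charset; // bind the name to the global variable

    return isset($database_connection_charset) ? $database_connection_charset : '';
}

var_dump(charset_without_global()); // string(0) ""
var_dump(charset_with_global());    // string(4) "utf8"
```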

Quote from: coroico at Mar 15, 2009, 01:13 AM

- a fix for the class names of the newly generated terms. When a document contains, for instance, both "alphabétisation" and "alphab&eacute;tisation", the same class ajaxSearch_highlight1 must be used for both forms. To do that, you have to remember the index of the original class.

Yes - I didn’t see that one. Not sure how often people use the indexed classes, but if the functionality is there it needs to work.
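A sketch of the idea, not the plugin's code: each search term keeps its own index, and every encoded form of that term is mapped to the class carrying that index. highlight_classes() is a hypothetical function, the 1-based numbering is inferred from the ajaxSearch_highlight1 example, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sketch: map every form of search term N (raw and
// entity-encoded) to the same indexed class name.
function highlight_classes(array $searchTerms, string $charset = 'UTF-8'): array
{
    $classes = [];
    foreach (array_values($searchTerms) as $i => $term) {
        $class = 'ajaxSearch_highlight' . ($i + 1); // index of the original term
        $classes[$term] = $class;
        $classes[htmlentities($term, ENT_QUOTES, $charset)] = $class;
    }

    return $classes;
}

$classes = highlight_classes(['alphabétisation', 'école']);
print_r($classes);
```

Both "alphabétisation" and "alphab&eacute;tisation" end up with ajaxSearch_highlight1, while both forms of "école" get ajaxSearch_highlight2.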

Quote from: coroico at Mar 15, 2009, 01:13 AM

- Regarding the look-behind assertion: yours doesn’t work properly. If you search for "ute", it doesn’t avoid matching &eacute; with:
	$pattern = '/(?<!&)(?<!&.)(?<!&.^;)(?<!&.^;^;)(?<!&.^;^;^;)(?<!&.^;^;^;^;)(?<!&.^;^;^;^;^;)(?<!&.^;^;^;^;^;^;)(?<!&.^;^;^;^;^;^;^;)'.preg_quote($word, '/') . '(?=[^>]*<)/' . $pcreModifier;

Puzzled here - I thought (?<!&.^;) should catch this.
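A likely explanation, assuming the pattern really reads as posted and nothing was lost when it was pasted into the forum: in PCRE, ^ only means "not" inside square brackets. Outside them it is a start-of-subject anchor, so a look-behind such as (?<!&.^;^;) can never match in the middle of a string and therefore never blocks anything. The comparison below uses simplified one-assertion patterns, not the full plugin pattern.

```php
<?php
$subject = '&eacute;';

// As posted: "^" is an anchor here, so the look-behind never matches in
// mid-string and the negative assertion always lets "ute" through.
$asPosted = '/(?<!&.^;^;)ute/';

// With negated character classes, the look-behind sees "&eac" directly
// before "ute" and rejects the match.
$withClass = '/(?<!&[^;][^;][^;])ute/';

var_dump(preg_match($asPosted, $subject));  // int(1): "ute" inside the entity is matched
var_dump(preg_match($withClass, $subject)); // int(0): the entity is left alone
```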

Quote from: coroico at Mar 15, 2009, 01:13 AM

My proposal is to use:

    $pcreModifier = ($pgCharset == 'UTF-8') ? 'iu' : 'i';
    $lookBehind = '/(?<!&|&[^;]|&[^;][^;]|&[^;][^;][^;])'; // avoid a match with a html entity
    $lookAhead = '(?=[^>]*<)/'; // avoid a match with a html tag

with:

$pattern = $lookBehind . preg_quote($word, '/') . $lookAhead . $pcreModifier;

This simpler solution assumes that the search term has at least 3 characters.

Your look-behind looks neater but the longest entity is &thetasym; so (to be pedantic) I think you need another couple of &[^;].... in there.
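The point can be checked by generating the look-behind for a chosen length instead of typing the alternatives out. entity_look_behind() is a hypothetical helper; with an argument of 3 it reproduces the proposed look-behind, and 5 adds the "couple" of extra alternatives needed for a 3-character term inside &thetasym;.

```php
<?php
// Hypothetical helper: build a negative look-behind that rejects a match
// preceded by "&" plus 0 to $maxChars characters that are not ";".
function entity_look_behind(int $maxChars): string
{
    $alternatives = [];
    for ($n = 0; $n <= $maxChars; $n++) {
        $alternatives[] = '&' . str_repeat('[^;]', $n);
    }

    return '(?<!' . implode('|', $alternatives) . ')';
}

// "sym" occurs inside the entity (preceded by "&theta") and as a plain word.
$html = '<p>&thetasym; sym</p>';

$counts = [];
foreach ([3, 5] as $maxChars) {
    $pattern = '/' . entity_look_behind($maxChars) . 'sym(?=[^>]*<)/i';
    $counts[$maxChars] = preg_match_all($pattern, $html);
    printf("up to %d characters after &: %d match(es)\n", $maxChars, $counts[$maxChars]);
}
```

With 3 characters the "sym" inside the entity is still matched (2 matches); with 5 only the plain word is (1 match).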

-- Tim.

172 Posts

TimGS Reply #20, 14 years, 11 months ago

Update (1.3 beta) at http://www.timspencerweb.co.uk/modx_search . Comments / bug reports welcomed!

Also included on this page is info on searching PDFs (method requires server admin rights to install pdftotext).

-- Tim.