Custom field aggregations in Sphinx using SphinxQL

Posted by Dan Sosedoff on September 06, 2010

Sphinx is a really powerful tool for a full-text database search. It is the perfect option as a search engine on your website’s data.
In default mode it works as a regular tcp server and has multiple native language bindings for php, ruby, c, etc. But its another outstanding feature is MySQL Protocol Connectoin and SphinxQL, which is similar to native mysql query language.

So, ok. Lets say we have N documents with M attributes. Attributes could be different: string, integer, double, boolean. Out objective is to perform attribute aggregation based on specified search term (user-defined, etc). That will give us full information on data selected only by search term. Its only use-case when you really need to get these aggregate fields. Next part is tricky and not really efficient.

First of all, you have to setup Sphinx search daemon instance using different configuration file (it could not run both). Another problem – you have to setup another data sources and index files, Sphinx puts a lock on all used-right-now files.

Lets assume we have a database of books. We need to build a form with sliders which could be used as user-friendly search filter. All we need is to get a list of min and max attributes values. But there is a problem: sometimes, while working with sphinx you might find yourself trying to use it like you usually do with regular RDMS. Unfortunately, sphinx has a different design. Basically, sphinx has one primary field which presents in each search request – DocumentID. Its an unique id that represents your data ID, which makes it harder to product aggregate data. And there is no way to get rid of that field.
The whole idea of our aggregation – using boolean match mode with no weighting performed at all. In that case all results will have weight field = 1. That will give us ability to group all the results by weight field, rejecting the DocumentID field.

Here is the sample query:

SELECT
  MIN(reviews) AS min_reviews, MAX(reviews) AS max_reviews,
  MIN(pages) AS min_pages, MAX(pages) AS max_pages,
  MIN(pub_year) AS min_date, MAX(pub_year) AS max_date,
  @weight AS w
FROM 
  INDEX_NAME
WHERE
  MATCH('SEARCH_TERM') AND pages > 30
GROUP BY w OPTION ranker = none

The result of this query will be one row with field alias names. Thats’s it.

All statements are fully customizable. Just check full SphinxQL reference for details.