Custom field aggregations in Sphinx using SphinxQL

Posted by Dan Sosedoff on September 06, 2010

Sphinx is a really powerful tool for a full-text database search. It is the perfect option as a search engine on your website’s data.
In default mode it works as a regular tcp server and has multiple native language bindings for php, ruby, c, etc. But its another outstanding feature is MySQL Protocol Connectoin and SphinxQL, which is similar to native mysql query language.

So, ok. Lets say we have N documents with M attributes. Attributes could be different: string, integer, double, boolean. Out objective is to perform attribute aggregation based on specified search term (user-defined, etc). That will give us full information on data selected only by search term. Its only use-case when you really need to get these aggregate fields. Next part is tricky and not really efficient.

First of all, you have to setup Sphinx search daemon instance using different configuration file (it could not run both). Another problem – you have to setup another data sources and index files, Sphinx puts a lock on all used-right-now files.

Lets assume we have a database of books. We need to build a form with sliders which could be used as user-friendly search filter. All we need is to get a list of min and max attributes values. But there is a problem: sometimes, while working with sphinx you might find yourself trying to use it like you usually do with regular RDMS. Unfortunately, sphinx has a different design. Basically, sphinx has one primary field which presents in each search request – DocumentID. Its an unique id that represents your data ID, which makes it harder to product aggregate data. And there is no way to get rid of that field.
The whole idea of our aggregation – using boolean match mode with no weighting performed at all. In that case all results will have weight field = 1. That will give us ability to group all the results by weight field, rejecting the DocumentID field.

Here is the sample query:

SELECT
  MIN(reviews) AS min_reviews, MAX(reviews) AS max_reviews,
  MIN(pages) AS min_pages, MAX(pages) AS max_pages,
  MIN(pub_year) AS min_date, MAX(pub_year) AS max_date,
  @weight AS w
FROM 
  INDEX_NAME
WHERE
  MATCH('SEARCH_TERM') AND pages > 30
GROUP BY w OPTION ranker = none

The result of this query will be one row with field alias names. Thats’s it.

All statements are fully customizable. Just check full SphinxQL reference for details.

Disable auto-incremental field in Rails Migrations

Posted by Dan Sosedoff on August 12, 2010

Since i`ve been using both DataMapper (Merb/Sinatra) and ActiveRecord (Rails) a lot i noticed that AR acts weight when i manually set PK key, particularly ID field, which you dont have to define by default. In DM you have to define it as ‘Serial’.

So, the task is to create/update records in your database which is supposes to represent data from primary database. In such cases all ID`s should be unique and equal to each other.

To disable autoincremental ID field in your AR models just use this option:

create_table :foo, :id => false do |t|
  t.integer   :id, :options => 'PRIMARY KEY'
  # .... rest of columns
end

Debugging PHP applications in terminal

Posted by Dan Sosedoff on July 03, 2010

Most of Ruby web frameworks have terminal logging in development environments. It makes application debugging process much easier than using file logging. Especially for AJAX requests. Of course there is simple solution – use Firebug and javascript console. Unfortunately it is not that convenient.

So, one day i came up with idea to make the same ruby-like style of application logging, even with colorized output.
I decided to use EventMachine library as a server platform. And API is based on JSON packets, via json/pure. Simple, isnt it?

And i called it DebugServer. Pretty understandable :)

Installation

To install just type

$ sudo gem install debugserver

And for information type:

$ debugserver -i
Usage: debugserver [options]
    -h, --host HOSTNAME              Server hostname
    -p, --port PORT                  Server port
    -i, --info                       Get usage information

By default it will start on localhost:9000.

Usage in PHP

First, download library from GitHub repository: http://github.com/sosedoff/debugclient-php

And then:

// place it into app initialization file
$debug = DebugClient::instance();
$debug->connect();
 
// .... and write it to terminal output
$obj = array(
  'id' => rand(0, 0xFFFF),
  'name' => 'Sample name',
  'time' => strftime('%m-%d-%Y', time())
);
 
// these functions are globally defined
debug_clear(); // clear terminal
debug('This is a plain text message.');
debug_info('This is an informational message.');
debug_warning('This is a warning message.');
debug_error('This is an error message.');
debug_dump($obj);
 
// optional
$debug->close();

Results:
Output

Sources

DebugServer: http://github.com/sosedoff/debugserver
PHP Client: http://github.com/sosedoff/debugclient-php

Using Amazon product images on your website

Posted by Dan Sosedoff on June 15, 2010

Amazon has an awesome image service. You can use their product images on your site, adjusting them for you needs. All you have to know – one image url of your product. Having that string will provide you an access to its dynamic image scaling service which i had to use recently.

So, lets say you have books on your website, but you dont have any good images for them. There is 2 ways to solve your problem: 1) download it from whatever place and resize 2) use amazon!

Here goes small overview.

Unfortunately, i didnt have any time to play with image service for different countries, but i assume that wont change that much. Lets take a look on a regular image:

http://ecx.images-amazon.com/images/I/41ygBmdaIfL._SL500_SS100_.jpg

It has different parts:
1) URL base: http://ecx.images-amazon.com/images/I/
2) Image code: 41ygBmdaIfL
3) Size format (surrounded by underscores): _SL500_SS100_
4) Format: jpg/gif/png

Some words about image format. It can vary from square thumbnails to images with specific max width and height. For example: _SX100_ will produce image that 100 pixels wide, height will be calculated proportionally. SH100 will give opposite result, scaled by 100 pixels maximum height, SS100 – 100×100 pixels thumbnail. And so on, you can find other similar crop codes while exploring amazon store on different pages, all you need is to take a look on image sources.

Now, we need to use this with Ruby:

require 'net/http'
 
module Amazon
  # parse amazon image url and get image code and extension
  def self.parse_image(url)
    result = url.scan(/^http:\/\/ecx.images-amazon.com\/images\/I\/([a-z0-9\-\%]{1,})(.*)_.(jpg|jpeg|gif)/i)
    unless result.nil?
      unless result[0].nil?
        match = result.first
        return {:code => match.first.to_s, :extension => match.last.to_s}
      end
    end
  end
 
  # make a new amazon image url based on code and size
  def self.make_image(image, size)
    "http://ecx.images-amazon.com/images/I/#{image[:code]}._#{size.upcase}.#{image[:extension]}"
  end
 
  # check if actual image exists
  def self.check_image(url)
    begin
      uri = URI.parse(url)
      req = Net::HTTP::Get.new(uri.path)
      res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
      return res.code == '200' && res.content_length.to_i > 0
    rescue Exception
      false
    end
  end
end

And usage:

url = 'http://ecx.images-amazon.com/images/I/51O65dIoZCL._SX117_.jpg'
info = Amazon.parse_image(url)
unless info.nil?
  new_url = Amazon.make_image(info, 'sx100')
  if Amazon.check_image(new_url)
    puts "Cool! Resized image: #{new_url}"
  else
    puts "Sorry, this image does not exist!"
  end
else
  puts "Cant identify image!"
end

Some notes about the process. The only reason why method “check_image” uses GET method instead of HEAD is because if image cannot be generated or not found in amazon`s cache the response is still valid sometimes. I`ve checked it on 50k images and sometimes HEAD request indicates that response is valid while it not supposed to. Otherwise i would use HEAD.

Handy HTTP requests with Curb and Ruby

Posted by Dan Sosedoff on June 13, 2010

While working on one of the projects, i tried to find multi-purpose HTTP request class that can use different network interfaces/ip addresses with retry option (if connection slow or server not responding for some reason).

Here is a small class wrapper build on top of Ruby Curb implemented as a module:

module ApiRequest
  USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.70 Safari/533.4',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.2) Gecko/20100323 Namoroka/3.6.2',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100401 Ubuntu/9.10 (karmic) Firefox/3.5.9'
  ]
 
  CONNECTION_TIMEOUT = 10
 
  @@interfaces = []
 
  # get random user-agent string for usage
  def random_agent
    USER_AGENTS[rand(USER_AGENTS.size-1)]
  end
 
  # get random IP/network interface specified in @@interfaces
  def random_interface
    size = @@interfaces.size
    size > 0 ? @@interfaces[rand(size-1)] : nil
  end
 
  # perform request, assign_to - specify network interface/ip
  def perform(url, assign_to=nil)
    puts url
    interface = assign_to.nil? ? self.random_interface : assign_to
    req = Curl::Easy.new(url)
    req.timeout = CONNECTION_TIMEOUT
    req.interface = interface unless interface.nil?
    req.headers['User-Agent'] = self.random_agent
    begin
      req.perform
      if req.response_code == 200
        return req.downloaded_bytes > 0 ? req.body_str : nil
      else
        nil
      end
    rescue Exception
      return nil
    end
  end
 
  # perform request by number of attempts
  def fetch(url, attempts=3)
    result = nil
    1.upto(attempts) do |a|
      result = self.perform(url)
      break unless result.nil?
    end
    return result
  end
end

And sample usage:

class TestRequest
  include ApiRequest
 
  def foo
     body = self.fetch('http://google.com')
  end
end

If module variable “@@interfaces” is array of ip addresses or network interfaces then one of them (randomly selected) will be used to perform request. Also, function “fetch” has parameter “attempts” which set to 3 by default. It means that operation will be invoked n times until result is downloaded from url. Otherwise – it returns nil.
Function perform has a parameter “assign_to” (which it not used in “fetch” function) that allows to bind request to specified interface. It is useful if you have situation when you might use different workers that bound to exact interface or just one that uses random ip`s. Also, class ApiRequest has a list of user agents which it uses randomly for each performed request.

Pastie: http://pastie.org/private/j19j3hbebte9bjqaydslmg

Making HTTP requests from different network interfaces with Ruby and Curb

Posted by Dan Sosedoff on June 09, 2010

At some point you will find that you have reached requests per IP limit while using some API or crawling resources. And if you`re doing it via standard Net::HTTP you`ll face the problem that you cannot assign request class to specified network interface (or IP). Bummer? No. Even if you cant do it with core class you might take a look on Curb – libcurl ruby binding. It has everything that you need to make regular get/post/etc requests. And of course – easy.

A simple example (real ip`s are changed):

require 'rubygems'
require 'curb'
 
ip_addresses = [
  '1.1.1.1',
  '2.2.2.2',
  '3.3.3.3',
  '4.4.4.4',
  '5.5.5.5'
]
 
ip_addresses.each do |ip|
  req = Curl::Easy.new('http://www.ip-adress.com/')
  req.interface = ip
  req.perform
  result_ip = req.body_str.scan(/<h2>My IP address is: ([\d\.]{1,})<\/h2>/).first
  puts("for #{ip} got response: #{result_ip}")
end

Output (ip`s are changed):

for 1.1.1.1 got response: 1.1.1.1
for 2.2.2.2 got response: 2.2.2.2
for 3.3.3.3 got response: 3.3.3.3
for 4.4.4.4 got response: 4.4.4.4
for 5.5.5.5 got response: 5.5.5.5

At least its working. Havent done any performance tests.
Sample on pastie: http://pastie.org/private/afxlcuk1npwjov3wer5hw

Making colorized console output with Ruby

Posted by Dan Sosedoff on June 01, 2010

If you develop some console application you might want your output be more informative, have different colors for operations or logging purposes. It is possible to do with general ANSI escape codes, which are supported by most common console terminals.

The ASCII escape structure is pretty simple. It begins with “ESC” symbol, which is code 27 in ASCII table. Then “[” symbol. Parameters that goes after “[” symbol are separated by “;” and finally ends with closing sequence: “ESC[0m“.

You can extend basic ruby String class with following code:

class String
    # colorize functions
    def red; colorize(self, "\e[1m\e[31m"); end
    def green; colorize(self, "\e[1m\e[32m"); end
    def dark_green; colorize(self, "\e[32m"); end
    def yellow; colorize(self, "\e[1m\e[33m"); end
    def blue; colorize(self, "\e[1m\e[34m"); end
    def dark_blue; colorize(self, "\e[34m"); end
    def pur; colorize(self, "\e[1m\e[35m"); end
    def colorize(text, color_code) "#{color_code}#{text}\e[0m" ; end
end

And sample usage code:

puts "Starting some job...".blue
puts "Processing thing 1 [#{"OK".green}]"
puts "Processing thing 2 [#{"FAIL".red}]"
puts "Oooops! This is a warning!".yellow
puts "Another color!".pur

The output:

colored console

Nice and useful.

How to test mailers in Sinatra, Rails and Merb

Posted by Dan Sosedoff on May 03, 2010

Once you need to test mailers within your Sinara, Rails or Merb application you wish to see a real output. No need to setup delivery via SMTP. Just create an executable ruby script somewhere, for example: /usr/bin/fake-sendmail.sh with following content:

$ touch /usr/bin/fake-sendmail.sh 
$ chmod +x /usr/bin/fake-sendmail.sh  # make it executable (will require root priv.)
#!/usr/bin/ruby
path = "/tmp/fake-mailer"
Dir.mkdir(path) if !File.exists?(path)
File.open("#{path}/#{Time.now.to_i}.txt", "w") do |f|
    f.puts ARGV.inspect
    $stdin.each_line { |line| f.puts line }
end

And then configure your mailing system:

1. Sinatra

There is a plugin for Sinatra (ported from Merb, http://github.com/foca/sinatra-mailer),

Sinatra::Mailer.config = {:sendmail_path => "/usr/bin/fake-sendmail.sh"}
Sinatra::Mailer.delivery_method = :sendmail

2. Merb

Edit your development configuration file (config/environments/development.rb)

Merb::BootLoader.after_app_loads do
    Merb::Mailer.delivery_method = :sendmail
    Merb::Mailer.config = { :sendmail_path => '/usr/bin/fake-sendmail.sh' }
end

3. Rails.

Edit your development environment file with following options:

config.action_mailer.delivery_method = :sendmail 
config.action_mailer.sendmail_settings = {:location => "/usr/bin/fake-sendmail.sh"}

How to identify your clients` browser type

Posted by Dan Sosedoff on October 02, 2009

Some time ago i wrote a simple regex patterns to determine whether my client crawler bot, mobile client or just regular one. Easy to expand and to use.

function is_mobile($agent) {
    $pattern = '/(blackberry|motorokr|motorola|sony|windows ce|240x320|176x220|palm|mobile|iphone|ipod|symbian|nokia|samsung|midp)/i';
    return (bool)preg_match($pattern, $agent);
}
 
function is_crawler($agent) {
    pattern = '/(google|yahoo|baidu|bot|webalta|ia_archiver)/';
    return (bool)preg_match($pattern, $agent);
}

Rails-like PHP url router

Posted by Dan Sosedoff on September 20, 2009

My previous version of php url router was not good enough, so i`ve done some core modification to the class. Its more flexible now. The whole idea is very similar to Rails` and Merb`s url router (merb is a rails-fork). In fact, previous version doesn`t even support GET variables in the request uri. New version just merges variables from request string into the same parameters array where other variables are stored. Also there is a command “default_routes” that maps default routes like this “/:controller/:action/:id”. So, basically you have two options – use defaults & custom routes or use only custom, which means that all other requests will be ignored. Function default_routes must be declared last.

Here is the sources:

define('ROUTER_DEFAULT_CONTROLLER', 'home');
define('ROUTER_DEFAULT_ACTION', 'index');
 
class Router {
  public $request_uri;
  public $routes;
  public $controller, $controller_name;
  public $action, $id;
  public $params;
  public $route_found = false;
 
  public function __construct() {
    $request = $_SERVER['REQUEST_URI'];
    $pos = strpos($request, '?');
    if ($pos) $request = substr($request, 0, $pos);
 
    $this->request_uri = $request;
    $this->routes = array();
  }
 
  public function map($rule, $target=array(), $conditions=array()) {
    $this->routes[$rule] = new Route($rule, $this->request_uri, $target, $conditions);
  }
 
  public function default_routes() {
    $this->map('/:controller');
    $this->map('/:controller/:action');
    $this->map('/:controller/:action/:id');
  }
 
  private function set_route($route) {
    $this->route_found = true;
    $params = $route->params;
    $this->controller = $params['controller']; unset($params['controller']);
    $this->action = $params['action']; unset($params['action']);
    $this->id = $params['id']; 
    $this->params = array_merge($params, $_GET);
 
    if (empty($this->controller)) $this->controller = ROUTER_DEFAULT_CONTROLLER;
    if (empty($this->action)) $this->action = ROUTER_DEFAULT_ACTION;
    if (empty($this->id)) $this->id = null;
 
    $w = explode('_', $this->controller);
    foreach($w as $k => $v) $w[$k] = ucfirst($v);
    $this->controller_name = implode('', $w);
  }
 
  public function execute() {
    foreach($this->routes as $route) {
      if ($route->is_matched) {
        $this->set_route($route);
        break;
      }
    }
  }
}
 
class Route {
  public $is_matched = false;
  public $params;
  public $url;
  private $conditions;
 
  function __construct($url, $request_uri, $target, $conditions) {
    $this->url = $url;
    $this->params = array();
    $this->conditions = $conditions;
    $p_names = array(); $p_values = array();
 
    preg_match_all('@:([\w]+)@', $url, $p_names, PREG_PATTERN_ORDER);
    $p_names = $p_names[0];
 
    $url_regex = preg_replace_callback('@:[\w]+@', array($this, 'regex_url'), $url);
    $url_regex .= '/?';
 
    if (preg_match('@^' . $url_regex . '$@', $request_uri, $p_values)) {
      array_shift($p_values);
      foreach($p_names as $index => $value) $this->params[substr($value,1)] = urldecode($p_values[$index]);
      foreach($target as $key => $value) $this->params[$key] = $value;
      $this->is_matched = true;
    }
 
    unset($p_names); unset($p_values);
  }
 
  function regex_url($matches) {
    $key = str_replace(':', '', $matches[0]);
    if (array_key_exists($key, $this->conditions)) {
      return '('.$this->conditions[$key].')';
    } 
    else {
      return '([a-zA-Z0-9_\+\-%]+)';
    }
  }
}

Setup example:

$r = new Router(); // create router instance 
 
$r->map('/', array('controller' => 'home')); // main page will call controller "Home" with method "index()"
$r->map('/login', array('controller' => 'auth', 'action' => 'login'));
$r->map('/logout', array('controller' => 'auth', 'action' => 'logout'));
$r->map('/signup', array('controller' => 'auth', 'action' => 'signup'));
$r->map('/profile/:action', array('controller' => 'profile')); // will call controller "Profile" with dynamic method ":action()"
$r->map('/users/:id', array('controller' => 'users'), array('id' => '[\d]{1,8}')); // define filters for the url parameters
 
$r->default_routes();
$r->execute();

Usage example:

$router = new Router();
// ... some configs ...
$controller = $router->controller; // will return name as it appears in url, ex: 'user_images'
$controller = $router->controller_name; // will return processed name of controller
// for example, if class name in url is 'user_images', then 'controller_name' var will be UserImages
$router->action;
$router->id; // if parameter :id presents
$router->params; // array(...)
$router->route_matched; // true - if route found, false - if not

Small and useful class.