Generate sitemaps with Ruby and XmlSitemap gem

Posted by Dan Sosedoff on June 18, 2010

I made a simple gem for generating website sitemaps. It can be used in any Ruby application (Rails, Merb, Sinatra, and so on). It does not do any caching of its own, so you are free to use your framework's built-in cache methods.

Installation:

$ sudo gem install xml-sitemap

Example

pages = Page.all(:order => [:updated_at.desc]) # DM model
map = XmlSitemap::Map.new('somedomain.com') do |m|
  m.add(:url => '/', :period => :daily, :priority => 1.0)
  m.add(:url => '/contractors', :period => :daily, :priority => 1.0)
  pages.each do |p|
    m.add(
      :url => url_for_page(p),
      :updated => p.updated_at,
      :priority => 0.5,
      :period => :never
    )
  end
end
# render the sitemap
puts map.render

Sinatra Example

# ... your code
 
get '/sitemap.xml' do
  map = XmlSitemap::Map.new('domain.com') do |m|
    m.add(:url => '/')
    m.add(:url => '/posts', :period => :weekly)
  end
 
  headers['Content-Type'] = 'text/xml'
  map.render
end
 
# ... more code

Options

:url – page path, relative to the domain (e.g. /test), String.
:period – changefreq attribute, Symbol: :none, :never, :always, :hourly, :daily, :weekly, :monthly, :yearly
:priority – priority attribute, Float (0.0..1.0)
:updated – (optional) lastmod attribute, Time class
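
To see how these options relate to the sitemap format, here is a minimal sketch that renders a single url entry by hand. It is not the gem's implementation: the helper name “sitemap_entry” and the http:// prefix are assumptions made for illustration, while the element names (loc, lastmod, changefreq, priority) come from the sitemaps.org protocol.

```ruby
require 'cgi'
require 'time'

# Hypothetical helper (not part of the xml-sitemap gem): renders one
# <url> entry of a sitemap from the same options that m.add accepts.
def sitemap_entry(domain, opts)
  loc = CGI.escapeHTML("http://#{domain}#{opts[:url]}")
  lines = ["<url>", "  <loc>#{loc}</loc>"]
  lines << "  <lastmod>#{opts[:updated].utc.iso8601}</lastmod>" if opts[:updated]
  lines << "  <changefreq>#{opts[:period]}</changefreq>" if opts[:period]
  lines << "  <priority>#{opts[:priority]}</priority>" if opts[:priority]
  lines << "</url>"
  lines.join("\n")
end

puts sitemap_entry('somedomain.com',
  :url => '/test', :period => :daily, :priority => 0.5,
  :updated => Time.utc(2010, 6, 18))
```

Optional elements are simply left out when the matching option is missing, which mirrors the fact that only the page location is required by the protocol.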

Source Code

http://github.com/sosedoff/xml-sitemap

Snippet: Cleanup your Git repository

Posted by Dan Sosedoff on June 15, 2010

A snippet (found on the net) for removing files from the repository that are no longer present in your working tree.

$ git rm $(git ls-files -d)

For convenience, add it to your bash alias file: ~/.bashrc or ~/.bash_aliases (under Ubuntu):

alias gitclean='git rm $(git ls-files -d)'

Using Amazon product images on your website

Posted by Dan Sosedoff on June 15, 2010

Amazon has an awesome image service. You can use their product images on your site and adjust them to your needs. All you need to know is one image URL for your product. That string gives you access to Amazon's dynamic image scaling service, which I had to use recently.

So, let's say you have books on your website, but you don't have any good images for them. There are two ways to solve the problem: 1) download the images from somewhere and resize them yourself, or 2) use Amazon!

Here is a small overview.

Unfortunately, I didn't have time to play with the image service for different countries, but I assume it won't differ much. Let's take a look at a regular image:

http://ecx.images-amazon.com/images/I/41ygBmdaIfL._SL500_SS100_.jpg

It has different parts:
1) URL base: http://ecx.images-amazon.com/images/I/
2) Image code: 41ygBmdaIfL
3) Size format (surrounded by underscores): _SL500_SS100_
4) Format: jpg/gif/png

A few words about the size format. It can vary from square thumbnails to images with a specific maximum width or height. For example, _SX100_ produces an image that is 100 pixels wide, with the height calculated proportionally. SH100 gives the opposite result, scaling to a maximum height of 100 pixels, and SS100 gives a 100×100 pixel thumbnail. You can find other similar codes while exploring the Amazon store: just look at the image sources on different pages.
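
As a rough illustration of the four parts listed above, the following sketch splits an image URL into its base, image code, size codes, and format. The helper name “split_image_url” is hypothetical, and the pattern only assumes the layout shown in the example URL.

```ruby
# Hypothetical helper: splits an Amazon image URL into the four parts
# described above. Returns nil when the URL does not follow that layout.
def split_image_url(url)
  m = url.match(%r{\A(.*/)([^/.]+)\._(.+)_\.(\w+)\z})
  return nil if m.nil?
  { :base => m[1], :code => m[2], :sizes => m[3].split('_'), :format => m[4] }
end

parts = split_image_url('http://ecx.images-amazon.com/images/I/41ygBmdaIfL._SL500_SS100_.jpg')
puts parts[:base]           # http://ecx.images-amazon.com/images/I/
puts parts[:code]           # 41ygBmdaIfL
puts parts[:sizes].inspect  # ["SL500", "SS100"]
puts parts[:format]         # jpg
```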

Now, we need to use this with Ruby:

require 'net/http'
 
module Amazon
  # parse amazon image url and get image code and extension
  def self.parse_image(url)
    match = url.match(/^http:\/\/ecx\.images-amazon\.com\/images\/I\/([a-z0-9\-%]+)(.*)_\.(jpg|jpeg|gif)/i)
    # returns nil when the url does not look like an amazon image
    return nil if match.nil?
    {:code => match[1], :extension => match[3]}
  end
 
  # make a new amazon image url based on code and size
  def self.make_image(image, size)
    "http://ecx.images-amazon.com/images/I/#{image[:code]}._#{size.upcase}.#{image[:extension]}"
  end
 
  # check if actual image exists
  def self.check_image(url)
    begin
      uri = URI.parse(url)
      req = Net::HTTP::Get.new(uri.path)
      res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
      return res.code == '200' && res.content_length.to_i > 0
    rescue StandardError
      false
    end
  end
end

And usage:

url = 'http://ecx.images-amazon.com/images/I/51O65dIoZCL._SX117_.jpg'
info = Amazon.parse_image(url)
unless info.nil?
  new_url = Amazon.make_image(info, 'sx100')
  if Amazon.check_image(new_url)
    puts "Cool! Resized image: #{new_url}"
  else
    puts "Sorry, this image does not exist!"
  end
else
  puts "Cant identify image!"
end

Some notes about the process. The only reason the “check_image” method uses GET instead of HEAD is that when an image cannot be generated or is not found in Amazon's cache, the response to a HEAD request sometimes still looks valid. I checked this on 50k images, and sometimes a HEAD request indicated a valid response when it was not supposed to. Otherwise I would use HEAD.

Handy HTTP requests with Curb and Ruby

Posted by Dan Sosedoff on June 13, 2010

While working on one of my projects, I needed a multi-purpose HTTP request class that can use different network interfaces/IP addresses and has a retry option (for when the connection is slow or the server is not responding for some reason).

Here is a small wrapper built on top of Ruby Curb, implemented as a module:

require 'curb'

module ApiRequest
  USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.70 Safari/533.4',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.2) Gecko/20100323 Namoroka/3.6.2',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100401 Ubuntu/9.10 (karmic) Firefox/3.5.9'
  ]
 
  CONNECTION_TIMEOUT = 10
 
  @@interfaces = []
 
  # get random user-agent string for usage
  def random_agent
    USER_AGENTS[rand(USER_AGENTS.size)]
  end
 
  # get random IP/network interface specified in @@interfaces
  def random_interface
    size = @@interfaces.size
    size > 0 ? @@interfaces[rand(size)] : nil
  end
 
  # perform request, assign_to - specify network interface/ip
  def perform(url, assign_to=nil)
    puts url
    interface = assign_to.nil? ? self.random_interface : assign_to
    req = Curl::Easy.new(url)
    req.timeout = CONNECTION_TIMEOUT
    req.interface = interface unless interface.nil?
    req.headers['User-Agent'] = self.random_agent
    begin
      req.perform
      if req.response_code == 200
        return req.downloaded_bytes > 0 ? req.body_str : nil
      else
        nil
      end
    rescue StandardError
      return nil
    end
  end
 
  # perform request by number of attempts
  def fetch(url, attempts=3)
    result = nil
    1.upto(attempts) do |a|
      result = self.perform(url)
      break unless result.nil?
    end
    return result
  end
end

And sample usage:

class TestRequest
  include ApiRequest
 
  def foo
     body = self.fetch('http://google.com')
  end
end

If the module variable “@@interfaces” is an array of IP addresses or network interface names, one of them (randomly selected) is used to perform the request. The “fetch” function has an “attempts” parameter, which is set to 3 by default. It means the request is tried up to that many times until a result is downloaded from the URL. If every attempt fails, it returns nil.
The “perform” function has an “assign_to” parameter (which is not used by “fetch”) that binds the request to a specified interface. This is useful when you run several workers that are each bound to one exact interface, or just one worker that uses random IPs. Also, the ApiRequest module has a list of user agents, one of which is picked at random for each request.
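
The retry behaviour of “fetch” does not depend on Curb, so it can be tried out in isolation. Below is a minimal sketch with a hypothetical helper named “with_attempts” (not part of the module above) and a simulated request that fails twice before returning a body.

```ruby
# Hypothetical helper: calls the block up to `attempts` times and
# returns the first non-nil result, or nil if every attempt fails.
def with_attempts(attempts = 3)
  attempts.times do
    result = yield
    return result unless result.nil?
  end
  nil
end

# Simulated request: returns nil on the first two calls.
calls = 0
body = with_attempts(3) do
  calls += 1
  calls < 3 ? nil : "response body"
end
puts "#{calls} calls, result: #{body.inspect}"
```

As in “fetch”, the loop stops as soon as a non-nil result arrives, so a healthy server costs only one request.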

Pastie: http://pastie.org/private/j19j3hbebte9bjqaydslmg

Making HTTP requests from different network interfaces with Ruby and Curb

Posted by Dan Sosedoff on June 09, 2010

At some point you will find that you have reached a requests-per-IP limit while using some API or crawling resources. If you're doing it via the standard Net::HTTP, you'll face the problem that you cannot bind a request to a specific network interface (or IP). Bummer? No. Even if you can't do it with the core class, you can take a look at Curb, the libcurl binding for Ruby. It has everything you need to make regular GET/POST/etc. requests. And of course, it is easy to use.

A simple example (real IPs are changed):

require 'rubygems'
require 'curb'
 
ip_addresses = [
  '1.1.1.1',
  '2.2.2.2',
  '3.3.3.3',
  '4.4.4.4',
  '5.5.5.5'
]
 
ip_addresses.each do |ip|
  req = Curl::Easy.new('http://www.ip-adress.com/')
  req.interface = ip
  req.perform
  result_ip = req.body_str.scan(/<h2>My IP address is: ([\d\.]{1,})<\/h2>/).flatten.first
  puts("for #{ip} got response: #{result_ip}")
end

Output (IPs are changed):

for 1.1.1.1 got response: 1.1.1.1
for 2.2.2.2 got response: 2.2.2.2
for 3.3.3.3 got response: 3.3.3.3
for 4.4.4.4 got response: 4.4.4.4
for 5.5.5.5 got response: 5.5.5.5

At least it's working. I haven't done any performance tests.
Sample on pastie: http://pastie.org/private/afxlcuk1npwjov3wer5hw
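
If the goal is to spread requests evenly so that no single address hits the per-IP limit first, the addresses can be handed out in round-robin order instead of one loop per address. This is only a sketch under that assumption: the “InterfacePool” class is hypothetical, and it deliberately makes no network calls. Its result would be assigned to req.interface as in the example above.

```ruby
# Hypothetical helper class: hands out addresses in round-robin order.
class InterfacePool
  def initialize(addresses)
    @addresses = addresses
    @index = 0
  end

  # Returns the next address, wrapping around at the end of the list.
  # Returns nil when the pool is empty.
  def next_address
    return nil if @addresses.empty?
    address = @addresses[@index % @addresses.size]
    @index += 1
    address
  end
end

pool = InterfacePool.new(['1.1.1.1', '2.2.2.2', '3.3.3.3'])
4.times { puts pool.next_address }
```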

Setting processor affinity for a certain task or process in Linux

Posted by Dan Sosedoff on June 06, 2010

When you are using SMP, you might want to override the kernel’s process scheduling and bind a certain process to a specific CPU or set of CPUs.

What is this?

CPU affinity is a scheduler property that “bonds” a process to a given set of CPUs on an SMP system. The Linux scheduler will honor the given CPU affinity, and the process will not run on any other CPUs. Note that the Linux scheduler also supports natural CPU affinity:

The scheduler attempts to keep processes on the same CPU as long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful only in certain applications. For example, applications such as Oracle (ERP apps) are licensed by the number of CPUs per instance. You can bind Oracle to specific CPUs to avoid license problems. This is really useful on large servers with 4 or 8 CPUs.

Setting processor affinity for a certain task or process using taskset command

taskset is used to set or retrieve the CPU affinity of a running process given its PID, or to launch a new COMMAND with a given CPU affinity. However, taskset may not be installed by default. If it is missing, install the schedutils (Linux scheduler utilities) package.

$ apt-get install schedutils

On recent versions of Debian / Ubuntu Linux, taskset is installed by default as part of the util-linux package.

The CPU affinity is represented as a bitmask, with the lowest order bit corresponding to the first logical CPU and the highest order bit corresponding to the last logical CPU. For example:

  • 0x00000001 is processor #0 (1st processor)
  • 0x00000003 is processors #0 and #1
  • 0x00000004 is processor #2 (3rd processor)
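
The masks above can be derived mechanically: each CPU number sets one bit, counting from the lowest-order bit. A small sketch, where the helper name “affinity_mask” is hypothetical:

```ruby
# Hypothetical helper: builds a taskset-style bitmask from a list of
# zero-based CPU numbers by setting one bit per CPU.
def affinity_mask(cpus)
  mask = cpus.inject(0) { |acc, cpu| acc | (1 << cpu) }
  format('0x%08x', mask)
end

puts affinity_mask([0])     # 0x00000001
puts affinity_mask([0, 1])  # 0x00000003
puts affinity_mask([2])     # 0x00000004
```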

To set the processor affinity of process 13545 to processor #0 (1st processor), type the following command:

$ taskset -p 0x00000001 13545

If you find a bitmask hard to use, you can specify a numerical list of processors instead of a bitmask using the -c flag:

$ taskset -c -p 1 13545
$ taskset -c -p 3,4 13545

where -p : operate on an existing PID instead of launching a new task (the default is to launch a new task)

via http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html

Making colorized console output with Ruby

Posted by Dan Sosedoff on June 01, 2010

If you develop a console application, you might want its output to be more informative, with different colors for operations or logging purposes. This is possible with ANSI escape codes, which are supported by most common terminals.

An ANSI escape sequence has a simple structure. It begins with the “ESC” character, which is code 27 in the ASCII table, followed by the “[” symbol. The parameters that come after “[” are separated by “;”, and a color sequence ends with the letter “m”. Colored text is then closed with the reset sequence “ESC[0m”.
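
As a small illustration of that structure, the sketch below builds a sequence from its numeric parameters. The helper name “ansi” is hypothetical. The codes used are the standard ones: 1 for bold, 31 for a red foreground, and 0 to reset all attributes.

```ruby
# Hypothetical helper: joins numeric parameters into one escape sequence
# of the form ESC [ p1 ; p2 ; ... m
def ansi(*params)
  "\e[#{params.join(';')}m"
end

# 1 = bold, 31 = red foreground, 0 = reset all attributes
puts "#{ansi(1, 31)}FAIL#{ansi(0)}"
```

Writing the parameters in one sequence, as in ansi(1, 31), has the same effect as sending the two separate sequences "\e[1m" and "\e[31m" one after the other.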

You can extend the basic Ruby String class with the following code:

class String
    # colorize functions
    def red; colorize(self, "\e[1m\e[31m"); end
    def green; colorize(self, "\e[1m\e[32m"); end
    def dark_green; colorize(self, "\e[32m"); end
    def yellow; colorize(self, "\e[1m\e[33m"); end
    def blue; colorize(self, "\e[1m\e[34m"); end
    def dark_blue; colorize(self, "\e[34m"); end
    def pur; colorize(self, "\e[1m\e[35m"); end
    def colorize(text, color_code) "#{color_code}#{text}\e[0m" ; end
end

And sample usage code:

puts "Starting some job...".blue
puts "Processing thing 1 [#{"OK".green}]"
puts "Processing thing 2 [#{"FAIL".red}]"
puts "Oooops! This is a warning!".yellow
puts "Another color!".pur

The output:

[Image: colored console output]

Nice and useful.

Moved to Rackspace Cloud!

Posted by Dan Sosedoff on June 01, 2010

Moved this blog to Rackspace Cloud after almost 2 years of service with UltraHosting. I had a dedicated server (512 RAM / 70 HDD / Celeron 1.8 / 100mbit) for $66/mo in Toronto, Canada. Good service.

Now switched to PHP 5.3.2 + php-fpm + nginx 0.7.65