Nokogiri::HTML can't parse Google search results → there is a fix

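Nokogiri is supposed to be able to parse even the badly broken HTML you find in the wild.
Nokogiri.parse uses a heuristic to guess whether the input is HTML or XML and picks the matching parser. If you already know the input is HTML, though, it is better to call Nokogiri::HTML.parse explicitly. Quoted from nokogiri.rb: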

module Nokogiri
  class << self
    ###
    # Parse an HTML or XML document.  +string+ contains the document.
    def parse string, url = nil, encoding = nil, options = nil
      doc =
        if string =~ /^\s*<[^Hh>]*html/i # Probably html
          Nokogiri::HTML.parse(string, url, encoding, options || 2145)
        else
          Nokogiri::XML.parse(string, url, encoding, options || 2159)
        end
      yield doc if block_given?
      doc
    end
  end
end
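
To see what that regex actually classifies as HTML, here is a minimal sketch that copies the pattern out of the quoted source and runs it on a few strings. The constant HTML_SNIFF and the helper probably_html? are names made up for this illustration; they are not part of Nokogiri.

```ruby
# Pattern copied from the Nokogiri.parse source quoted above.
HTML_SNIFF = /^\s*<[^Hh>]*html/i

# Hypothetical helper (not a Nokogiri API): true when the quoted
# heuristic would route the string to the HTML parser.
def probably_html?(string)
  !!(string =~ HTML_SNIFF)
end

samples = [
  '<html><body></body></html>',
  '<!DOCTYPE html><html></html>',
  '<?xml version="1.0"?><root/>',
  '<a href="3.html">next</a>',
]

# Print the verdict next to each sample.
samples.each { |s| puts "#{probably_html?(s)}\t#{s}" }
```

A bare fragment like the last sample does not match the pattern, so under this version of the heuristic it would be handed to the XML parser. That is one more reason to call Nokogiri::HTML.parse explicitly.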

The default options value of 2145 is hard to read, though. It turns out to be a combination of libxml2 parser options like the ones below. Quoted from libxml/HTMLparser.h:

typedef enum {
    HTML_PARSE_RECOVER  = 1<<0, /* Relaxed parsing */
    HTML_PARSE_NOERROR	= 1<<5,	/* suppress error reports */
    HTML_PARSE_NOWARNING= 1<<6,	/* suppress warning reports */
    HTML_PARSE_PEDANTIC	= 1<<7,	/* pedantic error reporting */
    HTML_PARSE_NOBLANKS	= 1<<8,	/* remove blank nodes */
    HTML_PARSE_NONET	= 1<<11,/* Forbid network access */
    HTML_PARSE_COMPACT  = 1<<16 /* compact small text nodes */
} htmlParserOption;

Trying the bits out by hand shows that 2145 is the combination of HTML_PARSE_RECOVER, HTML_PARSE_NOERROR, HTML_PARSE_NOWARNING, and HTML_PARSE_NONET.

r = 0
[0,5,6,11].each do |x|
  r += 1<<x
end
r                               # => 2145

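Since HTML_PARSE_RECOVER is set, the parser is tolerant of broken HTML. User friendly. So there is no need to touch options.
Still, I wish nokogiri.rb spelled the flags out by name instead of using a magic number. It is really hard to read.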
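
Going the other way, from the number back to flag names, is easy to sketch. The table below is copied from the header excerpt quoted earlier; the constant HTML_PARSE_FLAGS and the helper decode_html_options are hypothetical names for this example, not something Nokogiri or libxml2 provides.

```ruby
# Flag values copied from the libxml/HTMLparser.h excerpt above.
HTML_PARSE_FLAGS = {
  HTML_PARSE_RECOVER:   1 << 0,
  HTML_PARSE_NOERROR:   1 << 5,
  HTML_PARSE_NOWARNING: 1 << 6,
  HTML_PARSE_PEDANTIC:  1 << 7,
  HTML_PARSE_NOBLANKS:  1 << 8,
  HTML_PARSE_NONET:     1 << 11,
  HTML_PARSE_COMPACT:   1 << 16,
}

# Hypothetical helper: names of the flags whose bit is set in +options+.
def decode_html_options(options)
  HTML_PARSE_FLAGS.select { |_name, bit| (options & bit) != 0 }.keys
end

p decode_html_options(2145)
# => [:HTML_PARSE_RECOVER, :HTML_PARSE_NOERROR, :HTML_PARSE_NOWARNING, :HTML_PARSE_NONET]
```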

# -*- coding: utf-8 -*-
require 'nokogiri'

def test(html)
  xpath = %{//a[contains(.,"次へ")]}
  nokogiri = Nokogiri::HTML.parse(html, nil)
  tree = nokogiri.xpath(xpath)
  tree.first["href"]
end

test '<html><body><a href="1.html"><span id="0"><b>次へ</b></span></a></body></html>' # => "1.html"
# It can parse even pretty chaotic HTML!
test '<html><body><a href="2.html"><span id="0"><b>次へ</b></span></a>' # => "2.html"
test '<a href="3.html"><span id="0"><b>次へ</b></span></a></body>' # => "3.html"
test '<a href="4.html"><span id="0"><b>次へ</span></a></b></body>' # => "4.html"
test '<a href="5.html"><span id="0"><b>次へ></span></a></b></body>' # => "5.html"
test '<a href="6.html"><span id="0"><b>次へ</a></span></b>' # => "6.html"
test '<a href="7.html"><b>次へ</a></span><span id="0"></b>' # => "7.html"

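However, Nokogiri can't handle Google search results properly. What a pain. I suppose the fault lies with whoever emits such rotten HTML. Neither the HTML element nor the BODY element is closed... and so much is crammed into each line that I don't feel like analyzing it.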

http://www.google.com/search?q=ruby&hl=ja&num=100

# -*- coding: utf-8 -*-
require 'nokogiri'
require 'open-uri'
require 'kconv'

url = "http://www.google.com/search?q=ruby&hl=ja&num=100"
xpath = %{//a[contains(.,"次へ")]}
nokogiri = Nokogiri::HTML.parse(open(url).read.toutf8, nil, 'UTF-8')
tree = nokogiri.xpath(xpath)
puts tree.to_html
# >>

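Running this script should reveal where the 「次へ」 (Next) link points, but nothing is printed...
Firefox's DOM parser can handle even rotten markup like this (AutoPagerize does in fact support the page), yet libxml2 can't. orz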

# -*- coding: utf-8 -*-
require 'nokogiri'
require 'open-uri'

url = "http://www.google.com/search?q=ruby&num=100"
xpath = %{//a[contains(.,"Next")]}
nokogiri = Nokogiri::HTML.parse(open(url).read, nil, 'UTF-8')
tree = nokogiri.xpath(xpath)
puts tree.to_html
# >> <a href="/search?num=100&amp;hl=en&amp;ie=UTF-8&amp;q=ruby&amp;start=100&amp;sa=N"><img src="nav_next.gif" width="100" height="26" alt="" border="0"><br>Next</a>

The English version of Google parses just fine!? So it must be a problem with Japanese...

# -*- coding: utf-8 -*-
require 'nokogiri'
require 'open-uri'
require 'kconv'

url = "http://www.google.co.jp/search?q=ruby&num=10"
xpath = %{//body}
nokogiri = Nokogiri::HTML.parse(open(url).read, nil, 'UTF-8')
tree = nokogiri.xpath(xpath)
puts tree.to_html.toutf8
# >> <body id="gsr" topmargin="3" marginheight="3">
# >> <div id="header">
# >> (omitted)
# >> <td><a href="/search?hl=ja&amp;ie=UTF-8&amp;q=ruby&amp;start=50&amp;sa=N"><img src="nav_page.gif" width="16" height="26" alt="" border="0"><br>6</a></td>
# >> <td><a href="/search?hl=ja&amp;ie=UTF-8&amp;q=ruby&amp;start=60&amp;sa=N"><img src="nav_page.gif" width="16" height="26" alt="" border="0"><br>7</a></td>
# >> <td><a href="/search?hl=ja&amp;ie=UTF-8&amp;q=ruby&amp;start=70&amp;sa=N"><img src="nav_page.gif" width="16" height="26" alt="" border="0"><br>8</a></td>
# >> <td><a href="/search?hl=ja&amp;ie=UTF-"></a></td>
# >> </tr></table>
# >> </body>

Even when I pull out just the body, the parse breaks off partway through. What on earth is a tag like

<a href="/search?hl=ja&amp;ie=UTF-">

supposed to be!? libxml2 is 2.7.2, the latest version.

Addendum

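In the end, the conclusion is that Ruby 1.9's encoding handling was the culprit.
Nokogiri does not support Ruby 1.9's encoding system yet, so the HTML has to be passed in after calling force_encoding("ASCII-8BIT") on it. Good grief.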

# -*- coding: utf-8 -*-
require 'nokogiri'
require 'open-uri'
require 'kconv'

url = "http://www.google.com/search?q=ruby&hl=ja&num=100"
xpath = %{//a[contains(.,"次へ")]}
nokogiri = Nokogiri::HTML.parse(open(url).read.toutf8.force_encoding("ASCII-8BIT"))
tree = nokogiri.xpath(xpath)
puts tree.to_html
# >> <a href="/search?num=100&amp;hl=ja&amp;ie=UTF-8&amp;q=ruby&amp;start=100&amp;sa=N"><img src="nav_next.gif" width="100" height="26" alt="" border="0"><br><b>次へ</b></a>
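
For what it's worth, force_encoding only changes the encoding label attached to a String; the bytes themselves stay the same. Here is a minimal sketch in plain Ruby (1.9 or later) with no Nokogiri involved. The variable names are chosen just for this example.

```ruby
# -*- coding: utf-8 -*-
utf8 = "次へ"
# dup first so the original string keeps its UTF-8 label.
binary = utf8.dup.force_encoding("ASCII-8BIT")

p utf8.encoding                          # label of the original string
p binary.encoding                        # label after force_encoding
p utf8.bytes.to_a == binary.bytes.to_a   # => true (same bytes)
p utf8.length                            # => 2 (characters)
p binary.length                          # => 6 (bytes)
```

So the workaround above does not convert anything. It just hands Nokogiri the same UTF-8 bytes, labeled as binary.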