
Scraping web resources with Nokogiri


Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.
XML is like violence - if it doesn’t solve your problems, you are not using enough of it.
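Combining Nokogiri's parsing with open-uri's network access is enough to scrape resources from the web. The code below scrapes 清杯浅酌, a WordPress blog, and stores its posts and comments. My Ruby still reads a lot like Java; the only way to develop a better sense of style is to write and read more of it, so experts may want to skip ahead.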

require 'rubygems'
require 'nokogiri'
require 'open-uri'

desc "Fetch articles from http://xuzhuoer.com/"
task :fetch => :environment do
  # URI.open is open-uri's entry point; a bare open() no longer accepts URLs as of Ruby 3.0
  archive = Nokogiri::HTML(URI.open("http://xuzhuoer.com/archives/"))
  archive.css('.post li a').each do |link|
    doc = Nokogiri::HTML(URI.open(link['href']))
    # get the article's content, title and tag list
    content = doc.css('.post > .content').inner_html
    title = doc.css('h1').text
    # strip (not strip!) so that an article without tags yields "" rather than nil
    tags = doc.css('.post_info a').map { |tag| tag.text }.join(" ").strip
    # create the post and save it
    post = Post.create!(:body => content, :tag_list => tags, :title => title)
    # get the article's comments
    doc.css('#comments > .comment').each do |comment|
      author = comment.css('.author').text
      # the author link is optional; author_url stays nil when it is missing
      author_link = comment.at_css('a.author')
      author_url = author_link['href'] if author_link
      body = comment.css('.content p').text
      # fetch the author's md5(email) to get the gravatar; the slice assumes a
      # 31-character URL prefix such as "http://www.gravatar.com/avatar/"
      md5 = comment.at_css('img')['src'][31...63]
      # create and save the comment
      Comment.create!(:author => author, :author_url => author_url,
          :body => body, :avatar_md5 => md5,
          :commentable_type => "Post", :commentable_id => post.id)
      sleep(5)
    end
    # pause 0 to 4 seconds before fetching the next article
    sleep(rand(5))
  end
end
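
The fixed slice [31...63] used for the avatar hash only works while every avatar URL starts with a prefix of exactly 31 characters. As a rough sketch of a less position-sensitive approach, the hash can be matched with a regular expression instead. The helper names gravatar_hash and gravatar_hash_for, and the sample email address, are made up for this illustration and are not part of the original task; the second helper relies on Gravatar's long-standing convention of hashing the trimmed, lower-cased email with MD5.

```ruby
require 'digest/md5'

# Hypothetical helper: pull the 32-character hex hash out of a Gravatar
# avatar URL, regardless of the scheme or host that precedes "/avatar/".
# Returns nil when the URL does not look like a Gravatar avatar URL.
def gravatar_hash(src)
  match = src.to_s.match(%r{/avatar/([0-9a-f]{32})}i)
  match && match[1].downcase
end

# Hypothetical helper: compute the hash the way Gravatar has traditionally
# defined it, as the MD5 of the trimmed, lower-cased email address.
def gravatar_hash_for(email)
  Digest::MD5.hexdigest(email.strip.downcase)
end

# Build a sample avatar URL and extract the hash from it again.
src = "http://www.gravatar.com/avatar/#{gravatar_hash_for(' Someone@Example.com ')}?s=40"
puts gravatar_hash(src)
```

Inside the task, the md5 assignment could then call gravatar_hash on the img element's src value, and a nil result would signal a comment without a Gravatar image.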
 

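If the resources you want to scrape are only visible after logging in, this approach cannot reach them. Adding Mechanize may change that: Mechanize can simulate form submission and automatically sends the resulting cookies with subsequent requests.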
