Scraping web resources with Nokogiri

jackdong

Blog category: rails2

CSS rubygems Ruby XML HTML

Quote:

Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.
XML is like violence - if it doesn’t solve your problems, you are not using enough of it.

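Combining Nokogiri's parsing with open-uri's network access is enough to scrape resources from the web. The code below scrapes the WordPress blog 清杯浅酌 (xuzhuoer.com). My Ruby still reads a lot like Java, and the only cure is to write and read more of it, so experts, please bear with me.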

require 'rubygems'
require 'nokogiri'
require 'open-uri'

desc "Fetch articles from http://xuzhuoer.com/"
task :fetch => :environment do
  # URI.open needs Ruby 2.5+; on older Rubies use Kernel#open from open-uri instead.
  archive = Nokogiri::HTML(URI.open("http://xuzhuoer.com/archives/"))
  archive.css('.post li a').each do |link|
    doc = Nokogiri::HTML(URI.open(link['href']))

    # get the article's content, title and tag list
    content = doc.css('.post > .content').inner_html
    title   = doc.css('h1').text
    tags    = doc.css('.post_info a').map { |tag| tag.text }.join(" ")

    # create the post and save it
    post = Post.create!(:body => content, :tag_list => tags, :title => title)

    # get the article's comments
    doc.css('#comments > .comment').each do |comment|
      author      = comment.css('.author').text
      # the author link is optional, so author_url may be nil
      author_link = comment.at_css('a.author')
      author_url  = author_link && author_link['href']
      body        = comment.css('.content p').text

      # the avatar URL embeds md5(email); characters 31...63 hold the hash
      avatar = comment.at_css('img')
      md5    = avatar && avatar['src'].to_s[31...63]

      # create and save the comment
      Comment.create!(:author => author, :author_url => author_url,
                      :body => body, :avatar_md5 => md5,
                      :commentable_type => "Post", :commentable_id => post.id)
    end

    # pause between page requests to go easy on the server
    sleep(rand(5))
  end
end
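The rake task above takes the avatar hash from a fixed character range, 31...63, which presumably relies on the image URL starting with exactly "http://www.gravatar.com/avatar/" (31 characters). A slightly more tolerant sketch is shown below; it assumes only that the URL contains "/avatar/" followed by 32 hex digits. The helper names gravatar_hash and md5_from_gravatar_url are made up for this example.

```ruby
require 'digest/md5'

# Gravatar's classic avatar key: the MD5 hex digest of the trimmed,
# lowercased email address.
def gravatar_hash(email)
  Digest::MD5.hexdigest(email.strip.downcase)
end

# Pull the 32-character hex hash out of an avatar URL.
# Returns nil when the URL does not look like a Gravatar URL.
def md5_from_gravatar_url(src)
  src.to_s[%r{/avatar/(\h{32})}, 1]
end

hash = gravatar_hash("  Someone@Example.com ")
url  = "http://www.gravatar.com/avatar/#{hash}?s=48"

puts md5_from_gravatar_url(url) == hash   # regex-based extraction
puts url[31...63] == hash                 # fixed-offset extraction, same result here
```

The regex version keeps working if the host or scheme changes (for example an https URL), and it yields nil instead of an arbitrary substring when the comment uses a non-Gravatar image.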

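If the resources you want to scrape are only visible after logging in, this approach is powerless. Adding Mechanize may change that, though: Mechanize can simulate form submission and automatically sets the cookies on later form operations.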
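A minimal sketch of that login flow with Mechanize might look like the following. The URLs, the form field names and the helper names build_agent and fetch_after_login are all hypothetical; the real login form has to be inspected to find the right fields. The agent is passed in as a parameter, and the require sits inside build_agent, purely so that the flow can be exercised with a stand-in object.

```ruby
# Build a real Mechanize agent. The gem is loaded lazily here so the
# login flow below can also be tried with a stand-in agent.
def build_agent
  require 'mechanize'
  Mechanize.new
end

# Fill in and submit the first form on login_url, then fetch
# protected_url with the same agent.
# credentials maps form field names to values (names are site-specific).
def fetch_after_login(agent, login_url, protected_url, credentials)
  login_page = agent.get(login_url)
  form = login_page.forms.first
  raise "no form found on #{login_url}" if form.nil?

  credentials.each { |field, value| form[field] = value }
  form.submit

  # The agent keeps the cookies set by the login response, so this
  # request is sent with the logged-in session.
  agent.get(protected_url)
end

# Usage (hypothetical URLs and field names):
#   page = fetch_after_login(build_agent,
#                            "http://example.com/login",
#                            "http://example.com/private",
#                            "username" => "me", "password" => "secret")
#   puts page.title
```

The page returned by the last call can be searched with the same CSS selectors as before, since Mechanize parses HTML pages with Nokogiri.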

2010-08-25 10:18