<div><ul><li><div> scraping in Ruby with regex \w

ndh0cuux posted on 2023-01-25 in Ruby

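I want to scrape the website https://www.bananatic.com/es/forum/games/
and extract the "name", "views" and "replies" tags. My big problem is getting only the non-empty content of the "name" tag. Can you help me? I need to save only the elements that actually contain text.
This is my code. I have three variables: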

per holds the content of *replies*.
pir holds the content of *views*.
res holds the content of *name*.

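Each array should contain only the elements that actually have content. But for *name*, entries that look like [""] get saved as well, and I don't want those in the array.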

    require 'nokogiri'
    require 'open-uri'
    require 'pp'
    require 'csv'

    unless File.readable?('data.html')
      url = 'https://www.bananatic.com/de/forum/games/'
      data = URI.open(url).read
      File.open('data.html', 'wb') { |f| f << data }
    end
    data = File.read('data.html')
    document = Nokogiri::HTML(data)

    per = document.xpath('//div[@class="replies"]/text()[string-length(normalize-space(.)) > 0]')
                  .map { |node| node.to_s[/\d+/] }

    p per

    pir = document.xpath('//div[@class="views"]/text()[string-length(normalize-space(.)) > 0]')
                  .map { |node| node.to_s[/\w+/] }

    p pir

    links2 = document.css('.topics ul li div')
    res = links2.map do |lk|
      name = lk.css('.name  p a').inner_text
      [name]
    end
    p res

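To solve this I tried a regular expression, but my attempt failed. I simply replaced .inner_text with .to_s[/\w+/], and that did not give me what I wanted.

Now I get an array with empty values plus some "a" letters, and I don't know where they come from.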

Source: https://stackoverflow.com/questions/75189055/divullidiv-scraping-ruby-with-regex-w

1 answer

3b6akqbq1#

This might help with XPath and CSS.
For your CSS, check this: https://kittygiraudel.github.io/selectors-explained/
The following will give you the information you asked for:

document.xpath('//div[@class="topics"]/ul/li//div[@class="name"]/a[@class="js-link avatar"]/text()').map { |node| node.to_s.strip }

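If you want to know where the "a" entries in your array come from, take a step back and just print lk.css('.name p a').to_s. The real problem, though, is that your selector is simply off.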
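As a rough illustration of where the "a" and the empty entries can come from: calling to_s on a Nokogiri node set gives back the serialized markup of whatever matched (an empty string if nothing matched), and /\w+/ then picks the first run of word characters, which is the tag name. The two strings below are hand-written stand-ins for that output, not real data from the page.

```ruby
# Stand-ins for what `lk.css('.name p a').to_s` may return (assumed, not captured output):
matched   = '<a class="js-link avatar" href="#">MrCasual2502</a>' # selector hit an <a>
unmatched = ''                                                    # selector hit nothing

# String#[] with a regex returns the first match, or nil when there is none.
first_word_matched   = matched[/\w+/]   # "<" is not a word character, so the match starts at the tag name
first_word_unmatched = unmatched[/\w+/] # nothing to match in an empty string

p first_word_matched   # => "a"
p first_word_unmatched # => nil
```

So the regex is applied to markup rather than to the link text, which is why it never reaches the user name.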
That said, looking at the structure of the page, you would be better off with something like this:

require 'nokogiri'
require 'open-uri'

url = "https://www.bananatic.com/de/forum/games/"

doc = Nokogiri::HTML(URI.open(url))
# Set a root node set to start from
topics = doc.xpath('//div[@class="topics"]/ul/li')

# loop the set 
details = topics.filter_map do |topic| 
  next unless topic.at_xpath('.//div[@class="name"]') # skip ones without the needed info
  # Map details into a Hash
  {name: topic.at_xpath('.//div[@class="name"]/a[@class="js-link avatar"]/text()').to_s.strip,
   post_year: topic.at_xpath('.//div[@class="name"]/text()[string-length(normalize-space(.)) > 0]').to_s[/\d{4}/],
   replies: topic.at_xpath('.//div[@class="replies"]/text()').to_s.strip, 
   views: topic.at_xpath('.//div[@class="views"]/text()').to_s.strip 
  }
end

The result of details is:

[{:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"236"},
 {:name=>"MrCasual2502", :post_year=>"2016", :replies=>"0", :views=>"164"},
 {:name=>"EdgarAllen", :post_year=>"2022", :replies=>"0", :views=>"1"},
 {:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"0"},
 {:name=>"RAMONVC", :post_year=>"2022", :replies=>"0", :views=>"1"},
 {:name=>"tokyobreez", :post_year=>"2021", :replies=>"2", :views=>"18"},
 {:name=>"matrix12334", :post_year=>"2022", :replies=>"0", :views=>"2"},
 {:name=>"juggalohomie420", :post_year=>"2017", :replies=>"3", :views=>"89"},
 {:name=>"Imas86", :post_year=>"2022", :replies=>"2", :views=>"2"},
 {:name=>"SmilesImposterr", :post_year=>"2021", :replies=>"1", :views=>"17"},
 {:name=>"bebb", :post_year=>"2019", :replies=>"7", :views=>"22"},
 {:name=>"IMBANANAZ", :post_year=>"2016", :replies=>"1", :views=>"4"},
 {:name=>"IWantSteamKeys", :post_year=>"2021", :replies=>"1", :views=>"4"},
 {:name=>"gamormoment", :post_year=>"2021", :replies=>"1", :views=>"47"},
 {:name=>"Lovestruck", :post_year=>"2021", :replies=>"3", :views=>"46"},
 {:name=>"KillerBotAldwin1", :post_year=>"2021", :replies=>"1", :views=>"95"},
 {:name=>"purplevestynstr", :post_year=>"2020", :replies=>"1", :views=>"13"},
 {:name=>"Janabanana", :post_year=>"2021", :replies=>"3", :views=>"3"},
 {:name=>"apache724", :post_year=>"2017", :replies=>"3", :views=>"33"},
 {:name=>"MrsSue66", :post_year=>"2021", :replies=>"1", :views=>"38"}]
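If you still want three separate arrays as in the question (per, pir, res), one way is to derive them from details afterwards. A small sketch, using three rows copied from the output above in place of a live request; converting the counts with to_i and the extra empty-name check are my own additions.

```ruby
# Three rows copied from the result above, standing in for the scraped data.
details = [
  { name: "MrCasual2502", post_year: "2016", replies: "0", views: "236" },
  { name: "EdgarAllen",   post_year: "2022", replies: "0", views: "1" },
  { name: "tokyobreez",   post_year: "2021", replies: "2", views: "18" }
]

# Names, dropping any that came back empty.
res = details.map { |d| d[:name] }.reject(&:empty?)
# Reply and view counts as integers instead of strings.
per = details.map { |d| d[:replies].to_i }
pir = details.map { |d| d[:views].to_i }

p res # => ["MrCasual2502", "EdgarAllen", "tokyobreez"]
p per # => [0, 0, 2]
p pir # => [236, 1, 18]
```

Keeping the hashes around and deriving the arrays on demand also keeps each name tied to its own counts.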

2023-01-25
