Ruby: need a way to grab information out of a file and skip it if the information is in another file

Posted by hk8txs48 on 2023-06-05 in Ruby

I have a file named skip.txt that contains the following:

stackoverflow.com 
github.com 
www.sa-k.net 
yoursearch.me 
search1.speedbit.com 
duckfm.net
search.clearch.org 
webcache.googleusercontent.com

I also have a file named information.txt that contains the following:

http://search.clearch.org/?a=web&q=Viewcat_h.php%3Fidcategory%3D%20%3Cstrong%3ESite%3C%2Fstrong%3E%20.pl%20
https://moodle.org/mod/forum/discuss.php?d=246409
http://webcache.googleusercontent.com/search?q=cache:oqPwN7FtDWgJ
http://www.aquariumist.com.ua/spr.php?id=7
http://search.clearch.org/?a%3Dweb%26q%3DViewcat_h.php%253Fidcategory%253D%2520%253Cstrong%253ESite%253C%252Fstrong%253E%2520.pl%2520%2Binurl:viewCat_h.php?idCategory%3D&hl=en&gbv=1&ct=clnk
http://www.astbury.leeds.ac.uk/research/spr.php
http://www.media4play.li/s/spr+php+id.html
http://v.virscan.org/SPR/PHP.ID.html
http://search.clearch.org/?a=images&q=Viewcat_h.php%3Fidcategory%3D+
http://search.clearch.org/?a=web&q=Inurl%20Viewcat_h.php%3Fidcategory%3D%20Site%20Clinsp=%3Fpvaid%3D97f2b2aa136c4af0936453a19d9ab1b2%26fcoid%3D302363
http://webcache.googleusercontent.com/search?q=cache:5qNE1JBqUeIJ
http://search.clearch.org/?a%3Dweb%26q%3DInurl%2520Viewcat_h.php%253Fidcategory%253D%2520Site%2520Cl%26insp%3D%253Fpvaid%253D97f2b2aa136c4af0936453a19d9ab1b2%2526fcoid%253D302363%2Binurl:viewCat_h.php?idCategory%3D&hl=en&gbv=1&ct=clnk

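I want a way to go through this information and move on to the next URL. Is there a way to read from skip.txt and, if a URL in information.txt contains anything listed in skip.txt, move on to the next URL in the file?
Expected output: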

http://www.astbury.leeds.ac.uk/research/spr.php
http://www.media4play.li/s/spr+php+id.html
http://v.virscan.org/SPR/PHP.ID.html
https://moodle.org/mod/forum/discuss.php?d=246409
http://www.aquariumist.com.ua/spr.php?id=7

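I did some research and found the grep function, but that requires a complex regular expression, which I'm not very good at. So if you could help me find a way to skip the entries in skip.txt, or help me with the regex, that would be great! Thanks in advance.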

ruby

Source: https://stackoverflow.com/questions/36513202/need-a-way-to-grab-information-out-of-a-file-and-skip-if-information-is-in-anoth

1 Answer

x6h2sr281#

Suppose the skip file has been read into the variable skip and the information file into the variable info. Then:

skip_arr = skip.split("\n").map(&:strip)
  #=> ["stackoverflow.com", "github.com", "www.sa-k.net", "yoursearch.me",
  #    "search1.speedbit.com", "duckfm.net", "search.clearch.org",
  #    "webcache.googleusercontent.com"]

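.map(&:strip) (which you can think of as .map { |s| s.strip }) uses String#strip to remove any whitespace surrounding the elements of the array produced by skip.split("\n"). This may not be necessary, but it is a harmless precaution.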
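To make the effect of the strip step concrete, here is a minimal sketch. The string raw is a made-up stand-in for the contents of skip.txt (it is not read from disk), with trailing spaces on some lines, as they appear to be present in the listing in the question.

```ruby
# Hypothetical stand-in for the contents of skip.txt; note the trailing spaces.
raw = "stackoverflow.com \ngithub.com \nduckfm.net\n"

without_strip = raw.split("\n")              # trailing spaces stay attached
with_strip    = raw.split("\n").map(&:strip) # surrounding whitespace removed

# map(&:strip) is shorthand for the explicit block form.
explicit = raw.split("\n").map { |s| s.strip }

p without_strip
p with_strip
```

Without the strip step, an entry such as "stackoverflow.com " would carry its trailing space into the regex built later, so it would only match a URL that has a space right after the host name.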

info_arr = info.split("\n")
  #=> ["http://search.clearch.org/?a=web&q=Viewcat_h...,
  #    "https://moodle.org/mod/forum/discuss.php?d=246409",
  #    "http://webcache.googleusercontent.com/search?q=cache:oqPwN7FtDWgJ",
  #    "http://www.aquariumist.com.ua/spr.php?id=7",
  #    "http://search.clearch.org/?a%3Dweb%26q%3DViewcat_h.php...,
  #    "http://www.astbury.leeds.ac.uk/research/spr.php",
  #    "http://www.media4play.li/s/spr+php+id.html",
  #    "http://v.virscan.org/SPR/PHP.ID.html",
  #    "http://search.clearch.org/?a=images&q=Viewcat_h.php%3Fidcategory%3D+",
  #    "http://search.clearch.org/?a=web&q=Inurl%20Viewcat_h.php...,
  #    "http://webcache.googleusercontent.com/search?q=cache:5qNE1JBqUeIJ",
  #    "http://search.clearch.org/?a%3Dweb%26q%3DInurl%2520Viewcat_h.php...]
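The answer starts from two strings that are already in memory. One way to get them is File.read; the sketch below assumes the file names from the question and, purely so that it runs anywhere, writes two tiny sample files into a temporary directory first. The sample contents and the temporary directory are assumptions for illustration only.

```ruby
require "tmpdir"

skip = nil
info = nil

# Assumption: a throwaway directory with small sample files, so the sketch is
# self-contained. In real use: skip = File.read("skip.txt"), and likewise for
# information.txt.
Dir.mktmpdir do |dir|
  skip_path = File.join(dir, "skip.txt")
  info_path = File.join(dir, "information.txt")
  File.write(skip_path, "github.com \nsearch.clearch.org \n")
  File.write(info_path, "https://moodle.org/mod/forum/discuss.php?d=246409\n")

  skip = File.read(skip_path) # whole file as a single string
  info = File.read(info_path)
end

skip_arr = skip.split("\n").map(&:strip)
info_arr = info.split("\n")

p skip_arr
p info_arr
```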

Next, we define a regular expression.

r = / 
    (?<=\/\/)  # match two forward slashes in a positive lookbehind
    #{ Regexp.union(skip_arr) } # match any element of skip_arr
    (?=\/)     # match a forward slash in a positive lookahead
    /x         # free-spacing regex definition mode
#=> / 
    (?<=\/\/)  # match two forward slashes in a positive lookbehind
    (?-mix:stackoverflow\.com|github\.com|www\.sa\-k\.net|yoursearch\.me|
      search1\.speedbit\.com|duckfm\.net|search\.clearch\.org|
      webcache\.googleusercontent\.com) # match any element of skip_arr
    (?=\/)     # match a forward slash in a positive lookahead
    /x
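The pattern has three parts: a lookbehind for "//", the alternation built by Regexp.union, and a lookahead for "/". The sketch below rebuilds it on one line with a shortened, hypothetical two-entry skip list and probes it with a few made-up URLs, to show which strings it does and does not match.

```ruby
# Hypothetical two-entry skip list, just to keep the example short.
skip_arr = ["github.com", "www.sa-k.net"]

union = Regexp.union(skip_arr) # escapes each string and joins them with |
r = /(?<=\/\/)#{union}(?=\/)/  # same structure as the pattern above

p union.source

# Made-up URLs to probe the pattern; =~ returns the match index or nil.
hit        = "http://github.com/rails/rails"     =~ r # host directly after //
no_slash   = "http://github.com"                 =~ r # no / after the host
in_query   = "http://example.org/?q=github.com/" =~ r # not preceded by //
sub_domain = "http://gist.github.com/x"          =~ r # // is followed by gist.

p [hit, no_slash, in_query, sub_domain]
```

Two consequences are worth keeping in mind: a URL with nothing after the host (no trailing slash) is not rejected, and an entry such as github.com does not cover subdomains unless they are listed separately.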

Finally, use Array#reject to remove the elements of info_arr that match this regex:

info_arr.reject { |s| s =~ r }
  #=> ["https://moodle.org/mod/forum/discuss.php?d=246409",
  #    "http://www.aquariumist.com.ua/spr.php?id=7", 
  #    "http://www.astbury.leeds.ac.uk/research/spr.php",
  #    "http://www.media4play.li/s/spr+php+id.html",
  #    "http://v.virscan.org/SPR/PHP.ID.html"]
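Since the question mentions being uncomfortable with regular expressions, here is an alternative sketch that avoids them. It is not the method used in this answer: it swaps in the standard library's URI parser and compares each URL's host against the skip list. The helper name reject_skipped and the shortened sample arrays are assumptions for illustration.

```ruby
require "uri"
require "set"

# Hypothetical helper (not part of the answer): keep only URLs whose host
# is not in skip_hosts.
def reject_skipped(urls, skip_hosts)
  skip_set = Set.new(skip_hosts)
  urls.reject do |url|
    host = begin
      URI.parse(url).host
    rescue URI::InvalidURIError
      nil # lines that do not parse are kept rather than silently dropped
    end
    skip_set.include?(host)
  end
end

# Shortened samples taken from the question's data.
skip_arr = ["search.clearch.org", "webcache.googleusercontent.com"]
info_arr = [
  "https://moodle.org/mod/forum/discuss.php?d=246409",
  "http://webcache.googleusercontent.com/search?q=cache:oqPwN7FtDWgJ",
  "http://www.aquariumist.com.ua/spr.php?id=7",
  "http://search.clearch.org/?a=images&q=Viewcat_h.php%3Fidcategory%3D+",
]

kept = reject_skipped(info_arr, skip_arr)
puts kept
```

Unlike the regex, this version also rejects a listed host when the URL has no slash after it, because it compares the parsed host rather than the surrounding characters.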

2023-06-05
