Ruby: need a way to grab information from a file, skipping it if it appears in another file

Posted by hk8txs48 on 2023-06-05 in Ruby

I have a file named skip.txt that contains the following:

stackoverflow.com 
github.com 
www.sa-k.net 
yoursearch.me 
search1.speedbit.com 
duckfm.net
search.clearch.org 
webcache.googleusercontent.com

I also have a file named information.txt that contains the following:

http://search.clearch.org/?a=web&q=Viewcat_h.php%3Fidcategory%3D%20%3Cstrong%3ESite%3C%2Fstrong%3E%20.pl%20
https://moodle.org/mod/forum/discuss.php?d=246409
http://webcache.googleusercontent.com/search?q=cache:oqPwN7FtDWgJ
http://www.aquariumist.com.ua/spr.php?id=7
http://search.clearch.org/?a%3Dweb%26q%3DViewcat_h.php%253Fidcategory%253D%2520%253Cstrong%253ESite%253C%252Fstrong%253E%2520.pl%2520%2Binurl:viewCat_h.php?idCategory%3D&hl=en&gbv=1&ct=clnk
http://www.astbury.leeds.ac.uk/research/spr.php
http://www.media4play.li/s/spr+php+id.html
http://v.virscan.org/SPR/PHP.ID.html
http://search.clearch.org/?a=images&q=Viewcat_h.php%3Fidcategory%3D+
http://search.clearch.org/?a=web&q=Inurl%20Viewcat_h.php%3Fidcategory%3D%20Site%20Clinsp=%3Fpvaid%3D97f2b2aa136c4af0936453a19d9ab1b2%26fcoid%3D302363
http://webcache.googleusercontent.com/search?q=cache:5qNE1JBqUeIJ
http://search.clearch.org/?a%3Dweb%26q%3DInurl%2520Viewcat_h.php%253Fidcategory%253D%2520Site%2520Cl%26insp%3D%253Fpvaid%253D97f2b2aa136c4af0936453a19d9ab1b2%2526fcoid%253D302363%2Binurl:viewCat_h.php?idCategory%3D&hl=en&gbv=1&ct=clnk

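I want a way to process this information and move on to the next URL. Is there a way to read skip.txt and, if a URL in information.txt contains any entry from skip.txt, skip it and move on to the next URL in the file?
Expected output: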

http://www.astbury.leeds.ac.uk/research/spr.php
http://www.media4play.li/s/spr+php+id.html
http://v.virscan.org/SPR/PHP.ID.html
https://moodle.org/mod/forum/discuss.php?d=246409
http://www.aquariumist.com.ua/spr.php?id=7

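I did some research and found the grep method, but that needs a complicated regular expression, and I'm not very good with those. So it would be great if you could help me find a way to skip the entries in skip.txt, or help me with the regex. Thanks in advance.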

x6h2sr28 #1

Suppose the skip file has been read into the variable skip and the information file into the variable info. Then:

skip_arr = skip.split("\n").map(&:strip)
  #=> ["stackoverflow.com", "github.com", "www.sa-k.net", "yoursearch.me",
  #    "search1.speedbit.com", "duckfm.net", "search.clearch.org",
  #    "webcache.googleusercontent.com"]

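.map(&:strip) (which you can think of as .map { |s| s.strip }) uses String#strip to remove any whitespace surrounding the elements of the array produced by skip.split("\n"). It may not be necessary here, but it is a harmless precaution.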
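To make the equivalence concrete, here is a minimal sketch using a shortened, made-up stand-in for the contents of skip.txt. The variable names raw, with_symbol, with_block and cleaned are illustrative only, and the extra reject(&:empty?) step is an assumption of mine (it guards against blank lines in the file), not part of the answer above.

```ruby
# Shortened stand-in for the contents of skip.txt, including trailing
# spaces and one blank line (the blank line is an assumption for the demo).
raw = "stackoverflow.com \n\ngithub.com \nduckfm.net\n"

lines = raw.split("\n")

# The symbol-to-proc shorthand and the explicit block do the same thing.
with_symbol = lines.map(&:strip)
with_block  = lines.map { |s| s.strip }

# Optional extra step: drop entries that are empty after stripping.
cleaned = with_symbol.reject(&:empty?)

p with_symbol == with_block
p cleaned
```

Dropping empty entries matters mostly for the later regex step, where an empty string in the list would become an empty alternative in the pattern.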

info_arr = info.split("\n")
  #=> ["http://search.clearch.org/?a=web&q=Viewcat_h...,
  #    "https://moodle.org/mod/forum/discuss.php?d=246409",
  #    "http://webcache.googleusercontent.com/search?q=cache:oqPwN7FtDWgJ",
  #    "http://www.aquariumist.com.ua/spr.php?id=7",
  #    "http://search.clearch.org/?a%3Dweb%26q%3DViewcat_h.php...,
  #    "http://www.astbury.leeds.ac.uk/research/spr.php",
  #    "http://www.media4play.li/s/spr+php+id.html",
  #    "http://v.virscan.org/SPR/PHP.ID.html",
  #    "http://search.clearch.org/?a=images&q=Viewcat_h.php%3Fidcategory%3D+",
  #    "http://search.clearch.org/?a=web&q=Inurl%20Viewcat_h.php...,
  #    "http://webcache.googleusercontent.com/search?q=cache:5qNE1JBqUeIJ",
  #    "http://search.clearch.org/?a%3Dweb%26q%3DInurl%2520Viewcat_h.php...]

Next we define a regular expression.

r = / 
    (?<=\/\/)  # match two forward slashes in a positive lookbehind
    #{ Regexp.union(skip_arr) } # match any element of skip_arr
    (?=\/)     # match a forward slash in a positive lookahead
    /x         # free-spacing regex definition mode
#=> / 
    (?<=\/\/)  # match two forward slashes in a positive lookbehind
    (?-mix:stackoverflow\.com|github\.com|www\.sa\-k\.net|yoursearch\.me|
      search1\.speedbit\.com|duckfm\.net|search\.clearch\.org|
      webcache\.googleusercontent\.com) # match any element of skip_arr
    (?=\/)     # match a forward slash in a positive lookahead
    /x
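The behaviour of this pattern can be checked on a few strings. The sketch below builds the same regex without free-spacing mode from a two-element list; the list hosts and the test URLs are made up for illustration. It also shows one consequence of the lookahead that may or may not matter for your data: a URL with nothing after the host (no trailing slash) is not matched.

```ruby
# Illustrative host list (a subset of skip.txt).
hosts = ["github.com", "www.sa-k.net"]

# Regexp.union escapes each string, so "." matches only a literal dot.
r = /(?<=\/\/)#{Regexp.union(hosts)}(?=\/)/

# Host directly after "//" and followed by "/": matches at index 7.
p "http://github.com/rails" =~ r

# The "." is escaped, so an arbitrary character in its place does not match.
p "http://githubXcom/rails" =~ r

# The host text appears in the query string, not after "//": no match.
p "http://example.org/?q=github.com/" =~ r

# No "/" after the host, so the lookahead fails: no match.
p "http://github.com" =~ r
```

If bare-host URLs should also be skipped, the lookahead would need to accept the end of the string as well; whether that is wanted depends on the input.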

Finally, use Array#reject to remove the elements of info_arr that match this regex:

info_arr.reject { |s| s =~ r }
  #=> ["https://moodle.org/mod/forum/discuss.php?d=246409",
  #    "http://www.aquariumist.com.ua/spr.php?id=7", 
  #    "http://www.astbury.leeds.ac.uk/research/spr.php",
  #    "http://www.media4play.li/s/spr+php+id.html",
  #    "http://v.virscan.org/SPR/PHP.ID.html"]
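Putting the steps together, including the file reading that the answer takes as given, might look like the sketch below. The method name unskipped_urls is hypothetical, File.readlines with chomp: true is swapped in for the read-then-split approach, and the demo writes two small files into a temporary directory so that it runs on its own. In real use you would pass the paths to your own skip.txt and information.txt.

```ruby
require "tmpdir"

# Hypothetical helper: return the lines of info_path whose host
# (the text between "//" and the next "/") is not listed in skip_path.
def unskipped_urls(skip_path, info_path)
  # One host per line; strip stray whitespace and ignore blank lines.
  skip_arr = File.readlines(skip_path, chomp: true).map(&:strip).reject(&:empty?)

  # Same pattern as in the answer, written without free-spacing mode.
  r = /(?<=\/\/)#{Regexp.union(skip_arr)}(?=\/)/

  File.readlines(info_path, chomp: true).reject { |url| url =~ r }
end

# Demo with throwaway files (contents are a shortened version of the question's data).
result = Dir.mktmpdir do |dir|
  skip_path = File.join(dir, "skip.txt")
  info_path = File.join(dir, "information.txt")

  File.write(skip_path, "search.clearch.org \nwebcache.googleusercontent.com\n")
  File.write(info_path, <<~URLS)
    http://search.clearch.org/?a=images&q=Viewcat_h.php%3Fidcategory%3D+
    https://moodle.org/mod/forum/discuss.php?d=246409
    http://webcache.googleusercontent.com/search?q=cache:5qNE1JBqUeIJ
    http://v.virscan.org/SPR/PHP.ID.html
  URLS

  unskipped_urls(skip_path, info_path)
end

puts result
```

The remaining URLs come back in the order they appear in the input file, since reject preserves order.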
