ruby-on-rails Rails URI.open import is deleting punctuation from .TXT import

35g0bw71 posted on 2023-08-08 in Ruby


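I have a Rails method that tries to pull chapter text out of .txt files saved on AWS, as part of a very large legacy codebase.
After a lot of experimentation, I found that the following code works as a method in my chapter.rb model: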

def summon_text 
    require 'open-uri'
    if !self.text || self.text == ""

        puts "https://the-petulant-poetess.s3.us-east-2.amazonaws.com/stories/#{self.story.author.id}/#{self.id}.txt"
        readtext = URI.open("https://the-petulant-poetess.s3.us-east-2.amazonaws.com/stories/#{self.story.author.id}/#{self.id}.txt".encode('UTF-8','ISO-8859-1', :invalid => :replace, :undef => :replace, :replace => "?"), "r:UTF-8") { |f| f.read }
        if readtext
            self.update(text: readtext)
            self.save!
        else
            puts "ISSUE WITH https://the-petulant-poetess.s3.us-east-2.amazonaws.com/stories/#{self.story.author.id}/#{self.id}.txt"
        end
    end
end

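I then run it from the console with Chapter.find_each(&:summon_text). It does import the text, but a lot of the punctuation (nearly most of it, in some chapters) is missing, incorrect, or otherwise wrong.
Here is what I know:

The TXT files, in their current incarnation, are formatted as HTML.
I am saving them in a binary data type, because I got errors when using a plain text field.
While troubleshooting, I got some ASCII conversion error messages, so that might be the cause?

Can anyone see where I am going wrong?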
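Edit: sample problem
A comment asked for an example of what is happening. Here is what some sample paragraphs should look like, according to the .txt document (you can see the full source here):
Perhaps if he hadn't broken up with Pansy again, fought with the Finance Minister, and been threatened with a transfer to the Department of Mysteries, he would have said, "No," to the two witches. He didn't even bother to check them out.
Padma Singh Patil said, "It would be a change of pace."
"You were a mage at school," said Hermione Weasley-Granger, adding some flirtatiousness to her tone.
...and here is how it renders after the download (except that SO filters out a bunch of erroneous spaces) (you can see the whole thing here):
Perhaps if he hadn't broken up with Pansy again, fought with the Finance Minister, and been threatened with a transfer to the Department of Mysteries, he would have said, No, to the two witches. He didn't even bother to check them out.
Padma Singh Patil said, "It would be a change of pace."
"You were a mage at school," said Hermione Weasley-Granger, adding a touch more flirtatiousness.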

ruby-on-rails

Source: https://stackoverflow.com/questions/76734853/rails-uri-open-import-is-deleting-punctuation-from-txt-import

2 Answers


vsnjm48y1#

Save the content to a file, then read it back from that file.

require 'active_record'
require 'tempfile'
require 'net/http'

ActiveRecord::Base.logger = Logger.new(STDOUT)
ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: ':memory:')

class Author < ActiveRecord::Base
  connection.create_table table_name, force: true do |t|
    t.string :name
  end
end

class Story < ActiveRecord::Base
  connection.create_table table_name, force: true do |t|
    t.string :name
    t.references :author
  end

  belongs_to :author
end

class Chapter < ActiveRecord::Base
  connection.create_table table_name, force: true do |t|
    t.text :text
    t.references :story
  end

  belongs_to :story

  def summon_text
    url = URI("https://the-petulant-poetess.s3.us-east-2.amazonaws.com/stories/#{story.author.id}/#{id}.txt")
    content = Net::HTTP.get(url)

    ::Tempfile.open('.txt') do |file|
      file.write(content)
      # Flush buffered writes so File.read sees the full content on disk
      file.flush
      text = File.read(file.path)
      update!(text: text)
    end
  end
end

author = Author.create!(name: 'JK Rowling')
story = Story.create!(name: 'Harry Potter', author: author)
chapter = Chapter.create!(text: 'The Magic Stone ...', story: story)
chapter.summon_text
pp chapter.text # Disclaimer:  As always, the characters remain the property of JK Rowling....

You can save this file as mybookshelf.rb and then run it with

ruby mybookshelf.rb


mefy6pfw2#

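The encode method in your code does not actually affect the contents of the text file. It is applied to the URL string, which means it only converts the string that represents the URL, not the file the URL points to.
.encode('UTF-8','ISO-8859-1', :invalid => :replace, :undef => :replace, :replace => "?") essentially says "treat this URL string as ISO-8859-1 encoded and convert it to UTF-8, replacing any non-convertible characters with ?".
But if the URL contains no non-ASCII characters (and most URLs do not), this operation has no effect.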
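To see why, here is a minimal sketch. The URL below is a hypothetical placeholder, not the real bucket path. It applies the same encode call to an ASCII-only string and checks that nothing changes:

```ruby
# Hypothetical ASCII-only URL, standing in for the real S3 URL.
url = "https://example.com/stories/1/2.txt"

# Same call as in the question: interpret the string as ISO-8859-1
# and convert it to UTF-8.
converted = url.encode('UTF-8', 'ISO-8859-1',
                       invalid: :replace, undef: :replace, replace: "?")

# ASCII bytes mean the same thing in both encodings, so the string is unchanged.
puts converted == url
puts converted.encoding
```

The file behind the URL is never touched by this call, so it cannot be what damages the punctuation.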
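However, the way you read the file from the URL (URI.open(..., "r:UTF-8")) does affect the file contents. This is where the encoding of the file contents is specified. If the encoding is not specified correctly here, the resulting string can end up with missing or incorrect characters.
I suggest trying different character encodings when opening the file. The files may use an encoding other than UTF-8, so opening them as UTF-8 can cause some characters to be misinterpreted or replaced. To open the file as ISO-8859-1 and then convert it to UTF-8, you can do the following: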

readtext = URI.open("https://the-petulant-poetess.s3.us-east-2.amazonaws.com/stories/#{self.story.author.id}/#{self.id}.txt", "r:ISO-8859-1") { |f| f.read.encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => "?") }

This opens the file as ISO-8859-1 and then converts it to UTF-8, replacing all non-convertible characters with ?.
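As an illustration of how the choice of source encoding changes the punctuation, here is a small sketch on made-up bytes. It assumes, purely for the sake of the example, that the file contains Windows-1252 "smart quotes" (bytes 0x93 and 0x94); the actual encoding of your files may be different.

```ruby
# Made-up sample: raw bytes as they might sit in a legacy file.
# 0x93 and 0x94 are curly double quotes in Windows-1252 (an assumption).
raw = "\x93No,\x94 he said.".b

# Tagged as UTF-8, these bytes are simply invalid.
as_utf8 = raw.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?

# Interpreted as Windows-1252, the quotes survive the conversion.
from_cp1252 = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
puts from_cp1252

# Interpreted as ISO-8859-1, the same bytes become invisible C1 control
# characters (U+0093, U+0094), so the quotes appear to be missing.
from_latin1 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts from_latin1.inspect
```

If ISO-8859-1 gives you text whose quotes and dashes seem to vanish, trying "r:Windows-1252" in the URI.open call may be worth a test.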
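You can also try the Manfred/Ensure-encoding gem, as described in this article, to determine the preferred encoding of strings that come from untrusted sources.
Regarding your comment: "The suggested line you mention above gave me a 'open_http': 403 Forbidden (OpenURI::HTTPError) error."
Given that you are hosting the files on AWS S3, are you sure the correct permissions are set for the files you are trying to access? Check the bucket policy or the IAM policy related to your S3 resource.
If permissions are not the issue, an alternative when working with AWS S3 is the AWS SDK for Ruby (the aws-sdk-s3 gem). It lets you manage your S3 buckets and objects directly, and may sidestep any problems you are having with open-uri.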

def summon_text
  require 'aws-sdk-s3'
  s3 = Aws::S3::Resource.new(region: 'us-east-2')

  # assuming story.author.id and id are already defined
  object = s3.bucket('the-petulant-poetess').object("stories/#{self.story.author.id}/#{self.id}.txt")

  if object.exists?
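    # The SDK returns the body as raw bytes (ASCII-8BIT). Tag them with the
    # source encoding first (ISO-8859-1 is an assumption here); otherwise
    # every non-ASCII byte would be replaced with "?" during the conversion.
    readtext = object.get.body.read.force_encoding('ISO-8859-1').encode('UTF-8', :invalid => :replace, :undef => :replace, :replace => "?")
    if readtext
      self.update(text: readtext)
      self.save!
    else
      puts "ISSUE WITH https://the-petulant-poetess.s3.us-east-2.amazonaws.com/stories/#{self.story.author.id}/#{self.id}.txt"
    end
  end
end

This reads directly from S3. The SDK hands back the raw bytes of the file, so the code tags them with their source encoding and then converts the result to UTF-8.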
You need to set up AWS credentials (i.e. an access key ID and a secret access key) in your environment for the AWS SDK to work.
