Ruby: Is there a way to remove the BOM from a UTF-8 encoded file?

Posted by r6vfmomb on 2023-04-05 in Ruby


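Is there a way to remove the BOM from a UTF-8 encoded file?
I know that all of my JSON files are encoded in UTF-8, but the data entry person who edited the JSON files saved them as UTF-8 with a BOM.
When I run my Ruby script to parse the JSON, it fails with an error. I don't want to manually open 58+ JSON files and convert each one to UTF-8 without a BOM.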

ruby

Source: https://stackoverflow.com/questions/5011504/is-there-a-way-to-remove-the-bom-from-a-utf-8-encoded-file

6 Answers


h7appiyu1#

In Ruby >= 1.9.2 you can use the mode r:bom|utf-8.
This should work (I haven't tested it in combination with JSON):

require 'json'

json = nil # define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8") { |file|
  json = JSON.parse(file.read)
}

It doesn't matter whether the BOM is present in the file or not.
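As a quick check of that claim together with JSON, here is a minimal sketch. The helper name parse_json_file and the temporary file names are made up for illustration; it assumes the files really are UTF-8.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper: parse a JSON file, skipping a leading UTF-8 BOM if one is present.
def parse_json_file(path)
  File.open(path, 'r:bom|utf-8') { |file| JSON.parse(file.read) }
end

# Write one file with a BOM and one without, then parse both the same way.
Dir.mktmpdir do |dir|
  with_bom    = File.join(dir, 'with_bom.json')
  without_bom = File.join(dir, 'without_bom.json')
  File.binwrite(with_bom, "\xEF\xBB\xBF" + '{"name":"demo"}')
  File.binwrite(without_bom, '{"name":"demo"}')

  p parse_json_file(with_bom)
  p parse_json_file(without_bom)
end
```

Both calls should print the same hash, since the BOM is consumed when the file is opened.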
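Andrew remarked that File#rewind can't be used together with the BOM mode, because rewind goes back to position 0, in front of the BOM.
If you need a rewind function, you have to remember the position right after opening and replace rewind with pos=: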

#Prepare test file
File.open('file.txt', "w:utf-8"){|f|
  f << "\xEF\xBB\xBF" #add BOM
  f << 'some content'
}

#Read file and skip BOM if available
File.open('file.txt', "r:bom|utf-8"){|f|
  pos =f.pos
  p content = f.read  #read and write file content
  f.pos = pos   #f.rewind  goes to pos 0
  p content = f.read  #(re)read and write file content
}


uyto3xhc2#

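So the solution was to do a search and replace on the BOM via gsub! I forced the encoding of the string to UTF-8 and also forced the encoding of the pattern to UTF-8.
I was able to derive the solution by looking at http://self.d-struct.org/195/howto-remove-byte-order-mark-with-ruby-and-iconv and http://blog.grayproductions.net/articles/ruby_19s_string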

require 'json'

def read_json_file(file_name, index)
  # File.read closes the file handle for us (index is not used here)
  content = File.read("#{file_name}\\game.json").force_encoding("UTF-8")

  # Remove the BOM bytes (EF BB BF); string and pattern are both UTF-8
  content.gsub!("\xEF\xBB\xBF".force_encoding("UTF-8"), '')

  json = JSON.parse(content)

  print json
end
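Note that gsub! removes every occurrence of the sequence, not only the one at the start. If only the leading BOM should go, an anchored sub is one option. The following is a small sketch, not the answer author's code; strip_leading_bom is a hypothetical name and the input is assumed to be a UTF-8 string.

```ruby
require 'json'

# Hypothetical helper: remove only a leading BOM (U+FEFF) from a UTF-8 string.
# \A anchors the match to the very start, so later U+FEFF characters are kept.
def strip_leading_bom(text)
  text.sub(/\A\uFEFF/, '')
end

p JSON.parse(strip_leading_bom("\uFEFF" + '{"a":1}'))
```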


inn6fuwd3#

You can also specify the encoding with the File.read and CSV.read methods, without specifying the read mode.

File.read(path, :encoding => 'bom|utf-8')
CSV.read(path, :encoding => 'bom|utf-8')
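Building on File.read with that encoding option, the files themselves could be rewritten without the BOM in one pass, which is closer to what the question asks for. This is a rough sketch under assumptions: rewrite_without_bom and the '*.json' pattern are illustrative names, every matching file is assumed to be UTF-8, and each file is overwritten, so try it on a copy first.

```ruby
require 'tmpdir'

# Hypothetical helper: rewrite every file matching the glob pattern as UTF-8 without a BOM.
def rewrite_without_bom(pattern)
  Dir.glob(pattern).each do |path|
    content = File.read(path, encoding: 'bom|utf-8') # a leading BOM, if any, is skipped here
    File.write(path, content)                        # written back without the BOM
  end
end

# Demo on a throwaway directory so no real data is touched.
Dir.mktmpdir do |dir|
  sample = File.join(dir, 'game.json')
  File.binwrite(sample, "\xEF\xBB\xBF" + '{"level":1}')
  rewrite_without_bom(File.join(dir, '*.json'))
  p File.binread(sample).bytes.first(3) # first bytes of the file after rewriting
end
```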


ru9i0ody4#

The "bom|UTF-8" encoding works well if you only read the file once, but it fails if you ever call File#rewind, as I was doing in my code. To address this, I did the following:

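def ignore_bom
  return unless @file.pos == 0
  first = @file.getc
  # Push the character back unless it is the BOM (getc returns nil on an empty file)
  @file.ungetc(first) if first && first != "\xEF\xBB\xBF".force_encoding("UTF-8")
end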

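This seems to work well. I'm not sure whether there are other similar characters to look out for, but they could easily be built into this method, which can be called any time you rewind or open the file.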


8iwquhpp5#

Server-side cleanup of the UTF-8 BOM bytes; this worked for me:

csv_text.gsub!("\xEF\xBB\xBF".force_encoding(Encoding::BINARY), '')


63lcw9qa6#

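I just implemented this for the smarter_csv gem and want to share it in case someone else runs into this issue.
The problem is to remove the byte sequence independently of the string's encoding. The solution is to use the bytes and byteslice methods of the String class.
See: https://ruby-doc.org/core-3.1.1/String.html#method-i-bytes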

UTF_8_BOM = %w[ef bb bf].freeze

def remove_bom(str)
  str_as_hex = str.bytes.map { |x| x.to_s(16) }
  return str.byteslice(3..-1) if str_as_hex[0..2] == UTF_8_BOM

  str
end
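A possible variant of the same idea looks only at the first three bytes instead of converting the whole string to hex. This is a sketch, not the smarter_csv implementation; the names UTF_8_BOM_BYTES and remove_bom_bytes are made up here.

```ruby
UTF_8_BOM_BYTES = [0xEF, 0xBB, 0xBF].freeze

# Hypothetical variant of remove_bom: compares only the first three bytes,
# so it behaves the same whatever encoding the string is tagged with.
def remove_bom_bytes(str)
  return str.byteslice(3..-1) if str.byteslice(0, 3).bytes == UTF_8_BOM_BYTES

  str
end

# The same bytes tagged as binary and as UTF-8 are handled alike.
p remove_bom_bytes("\xEF\xBB\xBFabc".b)
p remove_bom_bytes("\xEF\xBB\xBFabc")
```

byteslice keeps the receiver's encoding, so the caller gets back a string tagged the same way as the input.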
