Ruby: strange encoding appears while doing some string manipulation

oxosxuxt  posted on 2023-05-28 in Ruby

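My code simply builds an inline diff of two strings (on a per-word basis) using HTML tags, so that CSS can hide or show the removed and added content.
In my tests I use () for additions and {} for deletions.
Here is my text:
Input: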

"e&nbsp;<b><u>Zerg</u></b>&nbsp;a"
"e Zerg a"

Output:

"e(?)(\240){&nbsp;<b>}{<u>}Zerg(?)(\240){</u>}{</b>}{&nbsp;}a"

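Now, I'm not doing anything that changes the encoding at all, so I'm really confused about how a question mark and a \240 got in there. o.o
What encoding is this?
I'm using Ruby 1.8.7.
I found the root of the problem. It happens when I convert the new string into an array for Diff::LCS to use:
Code: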

  def self.convert_html_string_to_html_array(str)
=begin
  Things like &nbsp (and other char codes), and tags need to be considered the same element
  also handles the decision to diff per char or per word

  also need to take into consideration JavaScript and CSS that might be in the middle of a selection
=end
    result = Array.new
    compare_words = str.has_at_least_one_word?
    i = 0
    while i < str.length do
      cur_char = str[i..i]
      case cur_char
      when "&"
        # For this we have two situations, a stray char code, and a char code preceding a tag
        next_index = str.index(";", i)
        case str[next_index + 1..next_index + 1] # Check to see if tag
        when "<"
          next_index = str.index(">", i)
        end
        result << str[i..next_index]
        i = next_index
      when "<"
        next_index = str.index(">", i)
        result << str[i..next_index]
        i = next_index
      when " "
        result << cur_char
      else
        if compare_words
          # In here we need to check the above rules again, cause tags can be touching regular text
          next_index = i + 1
          next_index = str.index(" ", next_index)
          next_index = str.length if next_index.nil?
          next_index -= 1

          if i < str.length and str[i..next_index].include?("<") # Beginning of a tag
            next_index = str.index(">", i)
          end

          result << str[i..next_index] # We don't want to include the space
          i = next_index
        else
          result << cur_char
        end
      end
      i += 1
    end

    return result # Removes the trailing empty string
  end
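
For comparison, the index-based loop above can be approximated with a single `String#scan`, which works on characters rather than bytes in Ruby 1.9 and later. This is only a rough sketch under stated assumptions: `tokenize_html` and `TOKEN_PATTERN` are hypothetical names, each entity and tag becomes its own token (the original merges an entity with the tag that follows it), and embedded JavaScript or CSS is not handled.

```ruby
# Hypothetical helper, not the original method. It splits a string into
# entities (&...;), tags (<...>), single whitespace characters, and runs
# of any other text. A stray "&" or "<" falls through to the last branch.
TOKEN_PATTERN = /&[a-zA-Z0-9#]+;|<[^>]*>|[[:space:]]|[^&<[:space:]]+|[&<]/

def tokenize_html(str)
  # With no capture groups, scan returns an array of the matched substrings.
  str.scan(TOKEN_PATTERN)
end

p tokenize_html("e&nbsp;<b><u>Zerg</u></b>&nbsp;a")
```

For the first input in the question this yields nine tokens: the two single letters, "Zerg", both `&nbsp;` entities, and the four tags. Because `[[:space:]]` is Unicode-aware in Ruby 1.9+, a literal non-breaking space should also come out as one token instead of two bytes.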

To clarify:

'e Zerg a'

becomes this:

[
    [0] "e",
    [1] "\302",
    [2] "\240",
    [3] "Z",
    [4] "e",
    [5] "r",
    [6] "g",
    [7] "\302",
    [8] "\240",
    [9] "a"
]
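
The "\302" and "\240" elements are consistent with a UTF-8 non-breaking space (U+00A0) being split into its two bytes, 0xC2 and 0xA0, which Ruby 1.8.7 prints in octal. The following is a minimal sketch for Ruby 1.9 or later that imitates the byte-wise split and contrasts it with a character-wise split. The assumption that the input contains U+00A0 is inferred from the output above, and the variable names are illustrative only.

```ruby
# Assumption: the "spaces" in the second input are U+00A0 encoded as UTF-8.
str = "e\u00A0Zerg\u00A0a"

# Byte-wise view, imitating what str[i..i] yields on Ruby 1.8.7.
# Non-ASCII bytes are rendered as octal escapes, the way 1.8.7 inspects them.
byte_tokens = str.bytes.map { |b| b < 128 ? b.chr : format("\\%o", b) }

# Character-wise view, which is what str[i..i] yields on Ruby 1.9+.
char_tokens = str.chars.to_a

p byte_tokens.length
p char_tokens.length
```

The byte-wise array has ten elements and the character-wise array has eight, because each non-breaking space occupies two bytes but is a single character.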

h4cxqtbf  #1

Upgrade to 1.9.2 or later (I recommend using RVM). 1.8.7 does some odd things with strings...
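
If upgrading is not immediately possible, one commonly used workaround is to split the string with a UTF-8 aware regular expression instead of indexing it byte by byte. This is a hedged sketch, assuming the data really is UTF-8; the input literal is an illustration built from the bytes shown in the question, not the asker's actual data.

```ruby
# Assumption: the input is UTF-8 and contains non-breaking spaces,
# written here as the two bytes \302\240 seen in the question.
str = "e\302\240Zerg\302\240a"

# The /u flag tells the regex engine to treat the string as UTF-8, so each
# match of "." is one whole character; /m lets "." match newlines too.
chars = str.scan(/./mu)

p chars.length
```

Each non-breaking space is then a single array element, so a loop over `chars` no longer sees a lone "\302" or "\240".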
