How to replace strings in Python while preserving their associated information

tvz2xvvm posted 12 months ago in Python

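Given an ordered list of words with associated information (or a list of tuples), I want to replace some strings with other strings while keeping track of the associated information.
As a simple example, suppose the input data is two lists: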

words = ["hello", "I", "am", "I", "am", "Jone", "101"]
info = ["1", "3", "23", "4", "6", "5", "12"]

The input could also simply be a list of tuples:

list_tuples = list(zip(words, info))

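Each item of "words" has an associated item from "info" (at the same index). For example, "hello" corresponds to "1", and the second "I" corresponds to "4".
I want to apply some normalization rules that transform them into: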

words = ["hello", "I'm", "I'm", "Jone", "one hundred and one"]
info = ["1", ["3", "23"], ["4", "6"], "5", "12"]

Or, as another possible solution:

words = ["hello", "I'm", "I'm", "Jone", "one", "hundred", "and", "one"]
info = ["1", ["3", "23"], ["4", "6"], "5", "12", "12", "12", "12"]

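Note that this is a simple case; the idea is to apply multiple normalization rules (numbers to words, substitutions, other contractions, etc.). I know how to turn one string into another with regular expressions, but in that case I lose the associated information: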

import re

def normalize_texts_losing_info(text):
  # Normalization rules
  text = re.sub(r"I am", "I'm", text)
  text = re.sub(r"101", "one hundred and one", text)
  # other normalization rules, e.g.
  # text = re.sub(r"we'll", "we will", text)
  # text = re.sub(r"you are", "you're", text)
  # ....
  return text.split()

words = ["hello", "I", "am", "I", "am", "Jone", "101"]
print(words)
print(" ".join(words))
output = normalize_texts_losing_info(" ".join(words))
print(output)

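The question is: how can I apply transformations to an ordered list of strings/words while keeping the information associated with those words?
P.S. Thanks for all the helpful comments.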

Answer 1 (dldeef67)

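IIUC, if order doesn't matter, one simple way is to iterate over each word in the string, pop its value (if any) and store it in a temporary dictionary, which is later merged with a copy of the original dictionary: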

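from collections import defaultdict
from string import punctuation as pct

dict_string = {"hello": 1, "I": 2, "am": 4, "Jone": 6}  # original OP

def normalize_texts(d, o, t):
    """Merge the values of the words in `o` under the new key `t`."""
    _d = d.copy()  # leave the caller's dict untouched
    tmp = defaultdict(list)
    # Replace punctuation with spaces so that split() yields bare words
    o = o.translate(str.maketrans(pct, " " * len(pct)))

    for w in o.split():
        if w in _d:  # only pop words that actually have a value
            tmp[t].append(_d.pop(w))

    return dict(_d | tmp)  # the dict union operator requires Python 3.9+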

Test/Output:

lst = [("I am", "I'm"), ("hello Jone", "Hi Jack")]

for pair in lst:
    print(normalize_texts(dict_string, *pair))
    
{'hello': 1, 'Jone': 6, "I'm": [2, 4]}
{'I': 2, 'am': 4, 'Hi Jack': [1, 6]}
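
If order does matter, one option is to rebuild the dictionary in a single pass rather than popping and merging. The following is only a minimal sketch, assuming Python 3.7+ (where dicts keep insertion order); merge_keys is a hypothetical helper name, not something from the answer above. The merged key ends up where the first of the old keys used to be.

```python
def merge_keys(d, old_keys, new_key):
    """Return a new dict in which `old_keys` are replaced by `new_key`.

    The values of the old keys are pooled into a list, and `new_key`
    takes the position of the first old key encountered.
    """
    old = set(old_keys)
    out = {}
    for key, value in d.items():
        if key in old:
            # setdefault inserts new_key the first time an old key is seen
            out.setdefault(new_key, []).append(value)
        else:
            out[key] = value
    return out


dict_string = {"hello": 1, "I": 2, "am": 4, "Jone": 6}
result = merge_keys(dict_string, ["I", "am"], "I'm")
print(result)  # {'hello': 1, "I'm": [2, 4], 'Jone': 6}
```

The sketch assumes that new_key does not collide with a key that is already in the dictionary; a dictionary also cannot represent repeated words, which is why the list-of-tuples form from the question is often easier to work with.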

Answer 2 (0vvn1miw)

Using the partial-matching feature of the regex library, you can keep track of which patterns might still apply.

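from typing import Any

import regex  # third-party package (pip install regex), not the stdlib re module

def apply_transformations(words_infos: list[tuple[str, Any]], transformations: dict[regex.Pattern, tuple[str, ...]]) \
        -> list[tuple[str, Any]]: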
    # Create the list which we modify to create the result
    result = words_infos.copy()
    # Keep track of the difference between the lengths of input and output
    offset = 0

    # Apply a transformation of tokens [start, end) replacing them with the tokens in new
    # During that we collect together the infos of all input tokens into one tuple,
    # then we copy this info to all new tokens we create
    # Afterwards we update `offset`
    def apply(start: int, end: int, new: tuple[str, ...]):
        nonlocal offset
        print(start, end, new, offset)
        new_info = tuple(words_infos[i][1] for i in range(start, end))
        if len(new_info) == 1:
            new_info, = new_info
        result[start + offset: end + offset] = [(w, new_info) for w in new]
        offset -= (end - start) - len(new)

    # Keep track of partial matches that might still be applied
    partials = []
    for i, (word, info) in enumerate(words_infos):
        new_partials = []
        # Try all patterns starting at this token
        for pattern, res in transformations.items():
            if m := pattern.fullmatch(word, partial=True):
                if m.partial:  # We have a partial match, add it to the backlog
                    new_partials.append((pattern, i, word))
                else:
                    # Apply this transformation immediately, replacing only this token.
                    apply(i, i + 1, res)
                    partials = []  # After applying something, we completely abandon everything else going on
                    new_partials = []
                    break
        # Look through the backlog, add the current token to each entry and check if the pattern is now fully applied.
        for pattern, first, prefix in partials:
            if m := pattern.fullmatch(prefix + " " + word, partial=True):
                if m.partial:
                    new_partials.append((pattern, first, prefix + " " + word))
                else:
                    apply(first, i + 1, transformations[pattern])
                    new_partials = []
                    break
        partials = new_partials
    return result

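This function uses the more sensible list[tuple] data representation instead of keeping two lists in sync. The transformations are defined as a mapping from regex.Pattern instances (i.e. not the regular expressions themselves) to sequences of strings. If you need more complex transformations, e.g. something closer to the input of re.sub, you can handle that by replacing the values in the transformations dict and modifying the local apply function.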

def main():
    words = ["hello", "I", "am", "I", "am", "Jone", "101"]
    info = ["1", "3", "23", "4", "6", "5", "12"]

    word_infos = list(zip(words, info, strict=True))

    transformations = {
        regex.compile(r"I am"): ("I\'m",),
        regex.compile(r"101"): ("one hundred and one",)
    }
    result = apply_transformations(word_infos, transformations)
    print(result)

if __name__ == "__main__":
    main()
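
For comparison, a similar result can be sketched with only the standard-library re module by recording which character span each token occupies in the joined string and mapping every match back to the tokens it covers. This is a swapped-in technique (offset alignment), not the partial-matching approach above; normalize_with_info and the shape of rules are assumptions made for illustration.

```python
import re


def normalize_with_info(pairs, rules):
    """Apply (pattern, replacement) rules to a list of (word, info) pairs.

    A match is rewritten only if it starts and ends on token boundaries.
    The infos of all covered tokens are pooled into a list (a single info
    is kept as-is).
    """
    for pattern, replacement in rules:
        # Character span of every token inside the space-joined text
        spans, pos = [], 0
        for word, _ in pairs:
            spans.append((pos, pos + len(word)))
            pos += len(word) + 1  # +1 for the joining space
        text = " ".join(word for word, _ in pairs)

        out, i = [], 0
        for m in re.finditer(pattern, text):
            # Indices of the tokens that overlap this match
            hit = [k for k, (s, e) in enumerate(spans)
                   if s < m.end() and e > m.start()]
            if not hit:
                continue
            # Skip matches that begin or end in the middle of a token
            if spans[hit[0]][0] != m.start() or spans[hit[-1]][1] != m.end():
                continue
            out.extend(pairs[i:hit[0]])  # untouched tokens before the match
            infos = [pairs[k][1] for k in hit]
            new_info = infos[0] if len(infos) == 1 else infos
            out.append((m.expand(replacement), new_info))
            i = hit[-1] + 1
        out.extend(pairs[i:])  # untouched tokens after the last match
        pairs = out
    return pairs


words = ["hello", "I", "am", "I", "am", "Jone", "101"]
info = ["1", "3", "23", "4", "6", "5", "12"]
rules = [(r"I am", "I'm"), (r"101", "one hundred and one")]
result = normalize_with_info(list(zip(words, info)), rules)
print(result)
```

Rules are applied one after another, so a later rule sees the tokens produced by earlier ones; if a later rule merges tokens that were already merged, the pooled info lists will be nested.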

Answer 3 (kulphzqa)

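A: This isn't a definitive answer... just a suggestion.
Instead of trying to "normalize" the string created with join(), why not work with a list of the dictionary keys? This approach requires you to iterate over the keys, and you can define rules like this: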

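# example dict taken from the original version of the question
dict_string = {"hello": 1, "I": 2, "am": 4, "Jone": 6}
keys = list(dict_string.keys())
output_string = {}  # normalized key -> index (or list of indices) into keys

i = 0

while i < len(keys):
    # rule 1
    if keys[i] == 'I' and keys[i+1] == 'am':
        L = [i, i+1]
        output_string["I'm"] = L
        i += 1 # since we've already looked at i+1

    # rule 2 (placeholder left open in the original answer)
    # elif number_to_words_condition:
    #     ...
    else:
        output_string[keys[i]] = i

    i += 1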

Of course, you may have to use try/except to avoid an IndexError in case you run into an "I" at the very end of the dictionary.
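
As an alternative to try/except, the lookahead can be done with a slice, since slicing past the end of a list returns a shorter list instead of raising IndexError. The following is a rough sketch of rule 1 on the (word, info) tuple form from the question; merge_bigrams is a hypothetical helper name.

```python
def merge_bigrams(pairs, first, second, merged):
    """Replace each adjacent (first, second) token pair with `merged`.

    `pairs` is a list of (word, info) tuples; the infos of the two merged
    tokens are pooled into a list.
    """
    out, i = [], 0
    while i < len(pairs):
        # Slicing is safe at the end of the list: it just yields fewer items
        window = [word for word, _ in pairs[i:i + 2]]
        if window == [first, second]:
            out.append((merged, [pairs[i][1], pairs[i + 1][1]]))
            i += 2  # both tokens are consumed
        else:
            out.append(pairs[i])
            i += 1
    return out


words = ["hello", "I", "am", "I", "am", "Jone", "101"]
info = ["1", "3", "23", "4", "6", "5", "12"]
merged_pairs = merge_bigrams(list(zip(words, info)), "I", "am", "I'm")
print(merged_pairs)
```

A trailing "I" with nothing after it is simply copied through unchanged.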


Answer 4 (brccelvz)

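There is a handy num2words module for converting numbers to words. Using it together with a list of tuples keeps the order intact; the list is then converted back to a dictionary.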

from collections import OrderedDict
from pprint import pprint

from num2words import num2words  # third-party package

dict_string = {"hello": 1, "I":2, "am":4, "Jone":6, "101":99}

# convert to list of tuples to be able to insert at position
t_lst = list(dict_string.items())
temp = []
values = []
index = 0

list_of_joined_keys = ["I", "am"]

for t in t_lst:
    k, v = t
    if k.isnumeric():
        new_key = num2words(int(k))
        temp.append((new_key,v))
    elif k in list_of_joined_keys:
        values.append(v)
        index += 1
    else:
        temp.append(t)

joined_key = "I'm"  # "'".join(list_of_joined_keys) would yield "I'am"
temp.insert(index - 1, (joined_key, values))

pprint(OrderedDict(temp))

OrderedDict([('hello', 1),
             ("I'm", [2, 4]),
             ('Jone', 6),
             ('one hundred and one', 99)])
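
The alternative output from the question, where every word of a multi-word replacement carries the same info, can be derived from a list of (word, info) tuples by splitting each entry on whitespace. A minimal sketch, with expand_tokens as a hypothetical helper name:

```python
def expand_tokens(pairs):
    """Split multi-word entries into single words, copying the info to each."""
    out = []
    for word, info in pairs:
        for part in word.split():
            out.append((part, info))
    return out


pairs = [("hello", "1"), ("I'm", ["3", "23"]), ("one hundred and one", "12")]
expanded = expand_tokens(pairs)
print(expanded)
```

Because this works on a list rather than a dictionary, repeated words such as the two occurrences of "one" are both kept.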
