csv: My generated XML file contains unwanted spaces between words

Posted by cetgtptt 12 months ago in Other

I have updated my question. I have an XML file that contains several titles. However, these titles are in French, not English, like this:

<entity name="MissionTemplate">
  <string name="code" required="true" title="Bond Restitution Date"/>
  <string name="description" namecolumn="true" title="Description"/>

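I want to translate each title into English. To do so, I have a translation CSV file that I must use to translate the XML file. This CSV file is not part of the final application; it exists only locally on my computer and is used solely to look up the correct translations. Here is a sample: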

Table;Champ;Anglais;Français

"HeaderTable8011405";"Code";"Code";"Code"

"HeaderTable8011405";"Activity_Type";"Auction,Mission";"Vente,Mission"

"HeaderTable8011405";"Activity_Type";"Activity Type";"Type activité"
...

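I also have another translation CSV file that maps English words to their French equivalents. It is part of the final application: when the user changes the language, this file is used to translate the application. Here is a sample of that file: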

"key","message","comment","context"
"Code","Code",,
"Auction,Mission","Vente,Mission",, 
"Activity Type","Type activité",,
...

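I have to take each French title in the XML file and search for it in the local CSV file (the first translation CSV file). If the French title is there, I copy the English translation and use it to replace the French title in the XML file. Finally, I have to add a new row to the second translation CSV file (the one used to switch between English and French), with the English word in the first column and the French word in the second. To summarize, there are three files: an XML file (containing several titles), a local CSV file that holds the correct translations (not in my application, only used to get the correct English translations), and another CSV file that maps English words to French words (used in the application to switch between the two languages, English and French).
Doing this task by hand is not practical because it is so tedious and time-consuming.
So I tried to write a Python program to do it for me. Here is the code: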

from bs4 import BeautifulSoup
import pandas as pd
import csv

xml_file = input("Give here the name of the XML file which you want translate : ")
translations_file = input("Give here the name of the translation's file : ")
final_translations_file = input("Give here the name of the final translation's file : ")

# Reading the data inside the xml
# file to a variable under the name
# data
with open('MissionTemplate_model.xml', 'r') as f:
    data = f.read()

# Passing the stored data inside
# the beautifulsoup parser, storing
# the returned object
data = BeautifulSoup(data, "xml")

# Title list
title_list = []

# Get each title in the XML file
for element in data.find_all():
    if 'title' in element.attrs:
        title_content = element['title']
        title_list.append(title_content)

# Check if the title is translated
missing_translations = []
translation_list= pd.read_csv(translations_file,delimiter=';')
translation_fr_list = translation_list.Français
translation_en_list = translation_list.Anglais
header = ['key', 'message', 'comment', 'context']
additional_data = []
for each_title in title_list:
    found = False
    lineNb = 0
    for each_translation in translation_fr_list:
        # It's all right
        if each_title == each_translation: 
            found = True
            translation_en = translation_en_list[lineNb]
            translation_fr = each_translation
            for element in data.find_all():
                if 'title' in element.attrs and element['title'] == translation_fr:
                    # Set the title in the XML file
                    element['title'] = translation_en
                    # Add couple of translation in the additional data's list
                    additional_data.append([translation_en,translation_fr,None,None])
            continue
        lineNb = lineNb + 1
    # Else
    if found == False:
        missing_translations.append(each_title)
# Load the CSV file using pandas
df = pd.read_csv(final_translations_file)
# Create a DataFrame from additional_data
additional_df = pd.DataFrame(additional_data, columns=header)
# Add the news datas at the end of the existing DataFrame
updated_df = pd.concat([df, additional_df], ignore_index=True)
# Save the updated DataFrame in the CSV file 
updated_df = updated_df.applymap(lambda x: ' '.join(x.strip().split()) if isinstance(x, str) else x)
updated_df.to_csv(final_translations_file, index=False, quoting=csv.QUOTE_NONE, escapechar=' ')




# Load the CSV file using pandas
df = pd.read_csv(final_translations_file, sep=',', skipinitialspace=True)

# Function to check if a string begins and ends by double quotes
def has_quotes(s):
    return s.startswith('"') and s.endswith('"')

# Explore the lines of the DataFrame
for index, row in df.iterrows():
    # Check if the two first columns doesn't already have double quotes
    if not has_quotes(row[0]) and not has_quotes(row[1]):
        # Modify this line
        df.iloc[index, :2] = '"' + df.iloc[index, :2] + '"'

# Save the modified DataFrame as new csv file 
df.to_csv(final_translations_file, index=False, quoting=csv.QUOTE_NONE, escapechar=' ')

# Open again the file in read/write mode
with open(final_translations_file, "r+") as f:
    # Read the current content of the file

    content = f.read()

    # Find the first line's position (up to the first occurence of \n)
    first_line_end = content.find("\n") + 1

    # Extract the first line and check if it contains double quotes
    first_line = content[:first_line_end]
    if not has_quotes(first_line):
        modified_first_line = '"' + '","'.join(first_line.strip().split(",")) + '"\n'

        # Replace the first line by the modified 
        f.seek(0)
        f.write(modified_first_line)
        f.write(content[first_line_end:])




# Replace the content of the XML original XML file by the modified content
modified_xml_file = data.prettify()
with open(xml_file, 'w') as f:
    f.write(modified_xml_file)

# Write the missing translations in a file
with open('missing_translations.txt','w') as f:
    missing_translations.insert(0,xml_file+":\n") # Add the name of the file at the begining of the missing translations's list
    missing_translations_string = '\n'.join(missing_translations) # Tranform the list to string
    f.write(missing_translations_string)

That works, but in the second CSV file (the one used inside the application) there are unwanted spaces between words, for example:

"key","message","comment","context"
"Code","Code",,
"Auction,    Mission","Vente,    Mission",, 
"Activity    Type","Type    activité",,

I tried many solutions, such as removing the extra spaces with the split() function after creating the file, but it doesn't work...
Can you help me?
Thank you!

fhity93d #1

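I can't read through all of the code to see where extra spaces could or could not be inserted.
Instead, I suggest you reconsider the structure of your program. Besides dropping Pandas and simplifying the logic and loops, I suggest splitting the program into distinct parts:
1. Create a French → English lookup dictionary
2. Iterate over the XML and end up with translated and missing lists
3. Write out the translated and missing files
4. Modify the XML, then write it out
Probably the biggest change will be using a dictionary to store the translations you already have.
Starting with this translations.csv file: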

| Table              | Champ | Anglais       | Français      |
|--------------------|-------|---------------|---------------|
| HeaderTable8011405 | Code  | Code          | Code          |
| HeaderTable8011405 | ???   | Designation   | Désignation   |
| HeaderTable8011405 | ???   | River         | Rivière       |
| HeaderTable8011405 | ???   | Activity Type | Type Activité |

Create a dictionary in which each key is the French word and its value is the English word:

import csv

translator: dict[str, str] = {}
with open("translations.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=",")
    next(reader)  # discard header
    for row in reader:
        translator[row[3]] = row[2]

print(translator)
{
    'Code':          'Code', 
    'Désignation':   'Designation', 
    'Rivière':       'River', 
    'Type Activité': 'Activity Type',
}
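
The local file quoted in the question is semicolon-delimited and has a header row naming the columns Anglais and Français. A minimal sketch of the same lookup built with csv.DictReader, so that columns are addressed by name rather than by position, might look like the following. The helper name build_translator and the in-memory sample are illustrative assumptions, not part of the original code.

```python
import csv
import io

def build_translator(lines, delimiter=";"):
    """Map each French string to its English translation (hypothetical helper)."""
    translator = {}
    for row in csv.DictReader(lines, delimiter=delimiter):
        french = (row.get("Français") or "").strip()
        english = (row.get("Anglais") or "").strip()
        if french and english:
            # Keep the first translation seen for a given French string.
            translator.setdefault(french, english)
    return translator

# In-memory stand-in for the semicolon-delimited file quoted in the question.
sample = io.StringIO(
    'Table;Champ;Anglais;Français\n'
    '"HeaderTable8011405";"Code";"Code";"Code"\n'
    '"HeaderTable8011405";"Activity_Type";"Auction,Mission";"Vente,Mission"\n'
    '"HeaderTable8011405";"Activity_Type";"Activity Type";"Type activité"\n'
)
translator = build_translator(sample)
print(translator)
```

With a real file, the same function can be given the handle from open(path, newline="", encoding="utf-8") in place of the StringIO object.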

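Now you can loop over the XML's elements once and check each title against the translator dict. I only use BeautifulSoup for broken or messy HTML; for valid XML, I like the standard library's ElementTree module. I also added my own logic for when an element has no title (skip and move on to the next) or when no translation exists (append to missing, then move on to the next).
Starting with this input.xml file: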

<root>
    <string name="foobar" namecolumn="baz" /> <!-- no title -->
    <string name="description" namecolumn="true" title="Désignation" />
    <string name="description" namecolumn="true" title="Rivière" />
    <string name="description" namecolumn="true" title="Ordinateur" />
    <string name="description" namecolumn="true" title="Type Activité" />
    <string name="description" namecolumn="true" title="Vélo" />
</root>

from xml.etree import ElementTree as ET

xml_input = "input.xml"
tree = ET.parse(xml_input)
root = tree.getroot()

# A list of French titles with no English translation
missing: list[str] = []

# A list of French titles and their English translations
translated: list[tuple[str, str]] = []

for elem in root.findall("string"):
    french = elem.get("title")
    if french is None:
        continue

    english = translator.get(french)
    if english is None:
        missing.append(french)
        continue

    translated.append((french, english))

The missing and translated lists:

print(missing)
[
    'Ordinateur', 
    'Vélo'
]
print(translated)
[
    ('Désignation',   'Designation'),
    ('Rivière',       'River'),
    ('Type Activité', 'Activity Type')
]
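
The loop above uses root.findall("string"), which only matches string elements that are direct children of the root. If titles can sit on other tags or deeper in the tree (the asker's BeautifulSoup loop checked every element), a sketch using root.iter() could look like this. The function name split_titles and the inline XML are assumptions made for illustration.

```python
from xml.etree import ElementTree as ET

def split_titles(root, translator):
    """Return (translated, missing) for every element that has a title attribute."""
    translated, missing = [], []
    # iter() walks the whole tree, including the root and nested elements.
    for elem in root.iter():
        french = elem.get("title")
        if french is None:
            continue
        english = translator.get(french)
        if english is None:
            missing.append(french)
        else:
            translated.append((french, english))
    return translated, missing

# Hypothetical document with one nested and one top-level title.
root = ET.fromstring(
    '<entity name="MissionTemplate">'
    '<group><string name="a" title="Rivière"/></group>'
    '<string name="b" title="Vélo"/>'
    '</entity>'
)
translated, missing = split_titles(root, {"Rivière": "River"})
print(translated, missing)
```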

Writing them out to their own files looks straightforward:

| key           | message       | comment | context |
|---------------|---------------|---------|---------|
| Designation   | Désignation   |         |         |
| River         | Rivière       |         |         |
| Activity Type | Type Activité |         |         |

with open("output-missing.txt", "w", encoding="utf-8") as f:
    f.write(f"{xml_input}:\n")
    for x in missing:
        f.write(x + "\n")

input.xml:
Ordinateur
Vélo
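
The table of translated pairs is shown above without the code that writes it. One possible sketch, assuming the target file uses the key, message, comment, context columns from the question, appends each pair with the standard csv module. QUOTE_ALL is an assumption: it also quotes the empty comment and context fields, which differs slightly from the asker's sample, where those fields are left bare. The name append_pairs is hypothetical.

```python
import csv
import io

def append_pairs(f, pairs):
    """Write (french, english) pairs as key,message,comment,context rows."""
    # QUOTE_ALL lets the csv module handle quoting, so no manual '"' + s + '"'
    # concatenation and no escapechar are needed.
    writer = csv.writer(f, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for french, english in pairs:
        writer.writerow([english, french, "", ""])

translated = [("Désignation", "Designation"), ("Type Activité", "Activity Type")]
out = io.StringIO()  # stand-in for the application's CSV file
append_pairs(out, translated)
print(out.getvalue())
```

For a real file, the handle would come from open(path, "a", newline="", encoding="utf-8") so that rows are appended after the existing content.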

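To change the XML and save it, the ElementTree class I started with makes this simple. I'll copy-paste the earlier XML loop, remove the missing and translated lists, and add the key elem.set("title", english) call, which replaces the title attribute with the English translation: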

for elem in root.findall("string"):
    french = elem.get("title")
    if french is None:
        continue

    english = translator.get(french)
    if english is None:
        continue

    elem.set("title", english)

ET.indent(tree, space="  ")
tree.write("output.xml", encoding="utf-8")

This produces the final output.xml file:

<root>
  <string name="foobar" namecolumn="baz" />
  <string name="description" namecolumn="true" title="Designation" />
  <string name="description" namecolumn="true" title="River" />
  <string name="description" namecolumn="true" title="Ordinateur" />
  <string name="description" namecolumn="true" title="Activity Type" />
  <string name="description" namecolumn="true" title="Vélo" />
</root>

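If you want to remove elements that have no title or cannot be translated, you can use root.remove(elem) to remove the element from its parent (the root element):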

for elem in root.findall("string"):
    french = elem.get("title")
    if french is None:
        root.remove(elem)
        continue

    english = translator.get(french)
    if english is None:
        root.remove(elem)
        continue

    elem.set("title", english)

ET.indent(tree, space="  ")
tree.write("output.xml", encoding="utf-8")

<root>
  <string name="description" namecolumn="true" title="Designation" />
  <string name="description" namecolumn="true" title="River" />
  <string name="description" namecolumn="true" title="Activity Type" />
</root>
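
On the original symptom: one plausible source of the extra spaces, which the answer above does not investigate, is the combination quoting=csv.QUOTE_NONE with escapechar=' ' in the asker's to_csv calls. With quoting disabled, the csv writer prefixes special characters with the escape character, and the escape character itself counts as special, so a space used as the escape character is doubled on every write. The asker's script writes the file twice, which would be consistent with the runs of four spaces in the sample. The sketch below reproduces the effect with the standard csv module alone; it has not been run against the asker's files, and write_row is a hypothetical helper.

```python
import csv
import io

def write_row(row, **fmt):
    """Serialize one row with the given csv formatting options."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **fmt).writerow(row)
    return buf.getvalue()

row = ["Activity Type", "Type activité"]

# Space as escapechar: every space in the data is escaped with another space.
doubled = write_row(row, quoting=csv.QUOTE_NONE, escapechar=" ")

# Letting the writer quote the fields leaves the spaces untouched.
quoted = write_row(row, quoting=csv.QUOTE_ALL)

print(repr(doubled))
print(repr(quoted))
```

If this is indeed the cause, dropping escapechar=' ' and letting the writer add the quotes would avoid both the manual quote concatenation and the growing whitespace.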
