TMX (Translation Memory eXchange) files in Python

Posted by xwbd5t1u on 2023-05-19 in Python

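Is there a Python module for processing TMX (Translation Memory eXchange) files? If not, what would be another way to do it?
As it stands, I have a giant 2 GB file of French-English subtitles. Is it even possible to process a file that size, or do I have to split it up?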

python-3.x

Source: https://stackoverflow.com/questions/20356149/tmxtranslation-memory-exchange-files-in-python

3 Answers

vojdkbi01#

As @hurrial said, you can use translate-toolkit.

Installation

The toolkit is published on PyPI. To install it with pip, run:

pip install translate-toolkit

Usage

Suppose you have the following simple sample.tmx file:

<tmx version="1.4">
  <header
    creationtool="XYZTool" creationtoolversion="1.01-023"
    datatype="PlainText" segtype="sentence"
    adminlang="en-us" srclang="en"
    o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>Hello world!</seg>
      </tuv>
      <tuv xml:lang="ar">
        <seg>اهلا بالعالم!</seg>
      </tuv>
    </tu>
  </body>
</tmx>

You can parse this simple file like this:

>>> from translate.storage.tmx import tmxfile
>>>
>>> with open("sample.tmx", 'rb') as fin:
...     tmx_file = tmxfile(fin, 'en', 'ar')
>>>
>>> for node in tmx_file.unit_iter():
...     print(node.source, node.target)
Hello world! اهلا بالعالم!

For more information, see the official translate-toolkit documentation.
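
If installing translate-toolkit is not an option, the same sample can also be read with the standard library alone. The following is only a minimal sketch, not part of translate-toolkit: `read_pairs`, `XML_LANG` and `SAMPLE_TMX` are names made up for this illustration, and it assumes every `<tuv>` holds a single `<seg>`. One detail worth knowing is that ElementTree exposes the `xml:lang` attribute under the reserved XML namespace rather than under the literal key `xml:lang`.

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical inline copy of the sample.tmx shown above.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="XYZTool" creationtoolversion="1.01-023"
    datatype="PlainText" segtype="sentence"
    adminlang="en-us" srclang="en" o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world!</seg></tuv>
      <tuv xml:lang="ar"><seg>اهلا بالعالم!</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) text pairs for the two requested languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Map each language code in this unit to the text of its <seg>.
        segs = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        # Skip units that lack either of the requested languages.
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


pairs = read_pairs(SAMPLE_TMX, "en", "ar")
print(pairs)
```

This loads the whole document into memory, so it is only suitable for small files like the sample.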

kgsdhlau2#

You can take a look at the following links:

Pretranslate: http://translate-toolkit.readthedocs.org/en/latest/commands/pretranslate.html
Translate Toolkit (Wikipedia): http://en.wikipedia.org/wiki/Translate_Toolkit
Translate Toolkit (PyPI): https://pypi.python.org/pypi/translate-toolkit
Translate API: https://github.com/translate/translate

Cheers

hjzp0vay3#

Here is a script that converts a TMX file into a pandas DataFrame:

import pandas as pd
from tqdm import tqdm
from bs4 import BeautifulSoup

def tmx2df(tmxfile):
    # Pick your poison for parsing XML.
    # Read bytes so that BeautifulSoup can detect the file's encoding itself.
    with open(tmxfile, 'rb') as fin:
        content = fin.read()
    bsoup = BeautifulSoup(content, 'lxml')

    # Actual TMX extraction.
    lol = []  # Keep a list of the rows to populate.
    for tu in tqdm(bsoup.find_all('tu')):
        # Parse metadata from tu
        metadata = tu.attrs
        # Parse prop
        properties = {prop.attrs['type']: prop.text for prop in tu.find_all('prop')}
        # Parse seg
        segments = {}
        # The order of the languages might not be consistent,
        # so keep them in some dict and unstructured first.
        for tuv in tu.find_all('tuv'):
            segment = ' '.join([seg.text for seg in tuv.find_all('seg')])
            segments[tuv.attrs['xml:lang']] = segment
        lol.append({'metadata': metadata, 'properties': properties, 'segments': segments})

    # Put the list of rows into a dataframe.
    df = pd.DataFrame(lol)
    # Expand the segments dict into one column per language.
    # See https://stackoverflow.com/a/38231651
    return pd.concat([df.drop(['segments'], axis=1), df['segments'].apply(pd.Series)], axis=1)
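
Note that the script above reads the entire file into memory before parsing, which is unlikely to work well for the 2 GB file from the question. A streaming alternative with the standard library's `xml.etree.ElementTree.iterparse` might look like the sketch below. It is an illustration under assumptions, not a tested solution for any particular corpus: `iter_tmx_pairs` is a hypothetical helper name, the demo file content is made up, and the code assumes every `<tu>` is a direct child of `<body>` as in the TMX samples shown earlier.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# ElementTree stores xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_tmx_pairs(path, src_lang, tgt_lang):
    """Yield (source, target) pairs one <tu> at a time (hypothetical helper)."""
    body = None
    for event, elem in ET.iterparse(path, events=("start", "end")):
        if event == "start":
            # Remember <body> so that finished units can be dropped from it.
            if elem.tag == "body":
                body = elem
            continue
        if elem.tag != "tu":
            continue
        # An "end" event means the whole <tu> subtree has been read.
        segs = {}
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        # Discard processed units so the in-memory tree does not keep growing.
        if body is not None:
            body.clear()
        else:
            elem.clear()


# Made-up two-unit demo file, written to a temporary path.
demo = (
    '<tmx version="1.4"><header srclang="en"/><body>'
    '<tu><tuv xml:lang="en"><seg>Hello</seg></tuv>'
    '<tuv xml:lang="fr"><seg>Bonjour</seg></tuv></tu>'
    '<tu><tuv xml:lang="fr"><seg>Merci</seg></tuv>'
    '<tuv xml:lang="en"><seg>Thank you</seg></tuv></tu>'
    '</body></tmx>'
)
with tempfile.NamedTemporaryFile("w", suffix=".tmx", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(demo)

pairs = list(iter_tmx_pairs(tmp.name, "en", "fr"))
os.remove(tmp.name)
print(pairs)
```

Because the function is a generator, pairs can be written out or batched into DataFrame chunks as they arrive instead of being collected into one list, which is what keeps memory use bounded on a large file.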
