django: How to get a word count on a Word document in Python?

Posted by yv5phkfx on 2023-06-25 in Go


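I'm trying to get a word count for .doc, .docx, .odt and .pdf files. This is pretty simple for .txt files, but how can I do a word count on the types mentioned?
I'm using Python Django on Ubuntu, and I'm trying to do a word count on the document when a user uploads a file through the system.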

django

Source: https://stackoverflow.com/questions/7529287/how-to-get-a-word-count-on-word-document-in-python

3 Answers


1tu0hz3e1#

First, you need to read the .doc, .docx, .odt and .pdf files.
Second, count the words (<2.7 version).
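
For the counting step, here is a minimal sketch. It assumes a "word" is any whitespace-delimited run of characters, which will not match every word processor's definition; the helper name count_words is made up for illustration.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens in already-extracted text."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


if __name__ == "__main__":
    print(count_words("How many  words\nare in here?"))
```

Once each file type has been reduced to a plain string, the same helper can be reused for all of them.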


2nbm6dog2#

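These answers miss a trick for MS Word and .odt files.
Whenever a .docx file is saved, MS Word records the file's word count. A .docx file is just a zip file. Accessing the "Words" (= word count) property inside it is simple, and can be done with modules from the standard library: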

import zipfile
import xml.etree.ElementTree as ET

# Placeholder: fill in the paths of the .docx files to inspect.
docx_file_paths = []

total_word_count = 0
for docx_file_path in docx_file_paths:
    # A .docx file is a zip archive; the with block closes it after reading.
    with zipfile.ZipFile(docx_file_path) as zin:
        for item in zin.infolist():
            if item.filename == 'docProps/app.xml':
                buffer = zin.read(item.filename)
                root = ET.fromstring(buffer.decode('utf-8'))
                for child in root:
                    # Tags are namespace-qualified, so match on the suffix.
                    if child.tag.endswith('Words'):
                        print(f'{docx_file_path} word count {child.text}')
                        total_word_count += int(child.text)

print(f'total word count all files {total_word_count}')

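Pros and cons: the main advantage is that, for most files, this will be far quicker than anything else.
The main disadvantage is that you're stuck with the idiosyncrasies of MS Word's counting method: I'm not particularly interested in the details, but I know these have changed across versions (e.g. words in text boxes may or may not be included). But the same complications arise if you choose to pull apart and parse the whole text content of a .docx file. The various available modules, e.g. python-docx, seem to do a pretty good job, but in my experience none of them is perfect.
If you actually extract and parse the main document part of a .docx file (word/document.xml) yourself, you start to realise that some daunting complexities are involved.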
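
To give a feel for what parsing the document body yourself involves, here is a rough standard-library sketch. It assumes the body lives in word/document.xml and that a word is a whitespace-delimited token; the function name docx_body_word_count is hypothetical. It deliberately ignores headers, footers and footnotes, does not treat tabs or line breaks inside a paragraph as separators, and may count text-box content more than once, which are exactly the kinds of complications mentioned above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_body_word_count(path_or_file) -> int:
    """Count whitespace-separated words in the paragraphs of a .docx body."""
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    total = 0
    # Each w:p element is a paragraph; its text is split across w:t runs,
    # so the runs are joined before splitting into words.
    for paragraph in root.iter(W_NS + "p"):
        text = "".join(t.text or "" for t in paragraph.iter(W_NS + "t"))
        total += len(text.split())
    return total
```

The result will not necessarily agree with the number MS Word stores in docProps/app.xml.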

.odt files

Similarly, these are zip files, and a similar property can be found in meta.xml. I just created and unzipped one such file, and the meta.xml in it looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:grddl="http://www.w3.org/2003/g/data-view#" office:version="1.3">
    <office:meta>
        <meta:creation-date>2023-06-11T18:25:09.898000000</meta:creation-date>
        <dc:date>2023-06-11T18:25:21.656000000</dc:date>
        <meta:editing-duration>PT11S</meta:editing-duration>
        <meta:editing-cycles>1</meta:editing-cycles>
        <meta:document-statistic meta:table-count="0" meta:image-count="0" meta:object-count="0" meta:page-count="1" meta:paragraph-count="1" meta:word-count="2" meta:character-count="12" meta:non-whitespace-character-count="11"/>
        <meta:generator>LibreOffice/7.4.6.2$Windows_X86_64 LibreOffice_project/5b1f5509c2decdade7fda905e3e1429a67acd63d</meta:generator>
    </office:meta>
</office:document-meta>

So you need the meta:word-count attribute of the meta:document-statistic element, which sits inside office:meta under the root.
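
A minimal sketch of that lookup with the standard library, assuming the file follows the meta.xml layout shown above. The function name odt_stored_word_count is made up for illustration, and it returns None when the statistic is missing rather than guessing.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs copied from the meta.xml sample above.
ODF_NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
}


def odt_stored_word_count(path_or_file):
    """Return the word count saved in an .odt file's meta.xml, or None."""
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read("meta.xml"))
    stats = root.find("office:meta/meta:document-statistic", ODF_NS)
    if stats is None:
        return None
    # Attribute keys carry the namespace in Clark notation: {uri}local-name.
    value = stats.get("{%s}word-count" % ODF_NS["meta"])
    return int(value) if value is not None else None
```

Because zipfile.ZipFile accepts a file-like object as well as a path, this should also work on an uploaded file object, though that has not been checked against Django here.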
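I don't know about PDFs: they would quite probably need brute-force counting. PyPDF2 looks like the way to go: the simplest method would be to convert to txt and count that way. I have no idea what might get missed.
For example, a scanned PDF might be hundreds of pages long and yet supposedly contain "0 words". Or there may in fact be scanned text interspersed with real text content...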
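
A brute-force count along those lines might look like the sketch below. It uses pypdf (the successor to PyPDF2, and the library named in another answer) rather than PyPDF2 itself, which is a substitution on my part; the helper names are hypothetical. Pages that are only scanned images will typically contribute no words, as described above.

```python
def count_words_in_pages(page_texts) -> int:
    """Sum whitespace-separated words over per-page text strings."""
    total = 0
    for text in page_texts:
        # Image-only (scanned) pages typically yield an empty string.
        total += len((text or "").split())
    return total


def pdf_word_count(path_or_file) -> int:
    """Extract text page by page with pypdf and count the words."""
    # Imported here so the counting helper works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path_or_file)
    return count_words_in_pages(page.extract_text() for page in reader.pages)
```

A count of zero from pdf_word_count on a long document is a hint that the file is a scan and would need OCR, which this sketch does not attempt.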


tnkciper3#

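Since you can do this for .txt files, I assume you know how to count words and just need to know how to read the various file types. Take a look at these libraries:
PDF: pypdf
doc/docx: this question, python-docx
odt: examples here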
