Read contents of tarfile into Python - "seeking backwards is not allowed"

Posted by lymnna71 on 2023-02-06 in Python

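I am new to Python, and I am having trouble reading the contents of a tarfile into Python.
The data are the contents of a journal article (hosted at PubMed Central). See the info below, along with a link to the tarfile I want to read into Python.
http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901 ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz
I have a list of similar .tar.gz files that I will eventually want to read in as well. I think (know) that every tarfile has a .nxml file associated with it. It is the content of the .nxml files that I am actually interested in. Any suggestions on the best way to do this are welcome.
Here is what I have if I save the tarfile to my PC. Everything runs as expected.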

import tarfile

tarfile_name = "F:/PMC_OA_TextMining/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
tfile = tarfile.open(tarfile_name)

tfile_members = tfile.getmembers()

# Collect the name of every member in the archive.
tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)

# Keep only the .nxml members.
tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])

# Read the first .nxml member into memory.
tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

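I learned today that in order to access the tarfile directly from PubMed Central's FTP site, I have to set up a network request using urllib. Below is the revised code (and a link to the Stack Overflow answer I received):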
Read contents of .tar.gz file from website into a python 3.x object

import tarfile
import urllib.request

tarfile_name = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_name)
tfile = tarfile.open(fileobj=ftpstream, mode="r|gz")

However, when I run the rest of the code (below), I get an error message ("seeking backwards is not allowed").

tfile_members = tfile.getmembers()

tfile_members1 = []
for i in range(len(tfile_members)):
    tfile_members_name = tfile_members[i].name
    tfile_members1.append(tfile_members_name)

tfile_members2 = []
for i in range(len(tfile_members1)):
    if tfile_members1[i].endswith('.nxml'):
        tfile_members2.append(tfile_members1[i])

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

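The code fails on the last line, where I try to read the .nxml content associated with my tarfile. Below is the actual error message I receive. What does it mean? What is the best way to read or access the contents of these .nxml files, which are all embedded in tarfiles?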

Traceback (most recent call last):
  File "F:\PMC_OA_TextMining\test2.py", line 135, in <module>
    tfile_extract1_text = tfile_extract1.read()
  File "C:\Python30\lib\tarfile.py", line 804, in read
    buf += self.fileobj.read()
  File "C:\Python30\lib\tarfile.py", line 715, in read
    return self.readnormal(size)
  File "C:\Python30\lib\tarfile.py", line 722, in readnormal
    self.fileobj.seek(self.offset + self.position)
  File "C:\Python30\lib\tarfile.py", line 531, in seek
    raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed

Thanks in advance for your help. Chris

python-3.x

Source: https://stackoverflow.com/questions/18623842/read-contents-tarfile-into-python-seeking-backwards-is-not-allowed

4 Answers

7gcisfzg1#

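**The problem:** Tar files are stored interleaved. They come in the order header, data, header, data, header, data, and so on. When you enumerated the files with getmembers(), you had already read through the entire file to get the headers. Then, when you asked the tarfile object to read the data, it tried to seek backward from the last header to the first data. But you cannot seek backward in a network stream without closing and reopening the urllib request.
**How to work around it:** You need to download the file, save a temporary copy to disk or to an in-memory buffer such as BytesIO, enumerate the files in this temporary copy, and then extract the files you want.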

#!/usr/bin/env python3
from io import BytesIO
import urllib.request
import tarfile

tarfile_url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_url)

# BytesIO creates an in-memory temporary file.
# See the Python manual: http://docs.python.org/3/library/io.html
tmpfile = BytesIO()
while True:
    # Download a piece of the file from the connection
    s = ftpstream.read(16384)

    # Once the entire file has been downloaded, read() returns b''
    # (the empty bytes) which is a falsey value
    if not s:
        break

    # Otherwise, write the piece of the file to the temporary file.
    tmpfile.write(s)
ftpstream.close()

# Now that the FTP stream has been downloaded to the temporary file,
# we can ditch the FTP stream and have the tarfile module work with
# the temporary file.  Begin by seeking back to the beginning of the
# temporary file.
tmpfile.seek(0)

# Now tell the tarfile module that you're using a file object
# that supports seeking backward.
# r|gz forbids seeking backward; r:gz allows seeking backward
tfile = tarfile.open(fileobj=tmpfile, mode="r:gz")

# You want to limit it to the .nxml files
tfile_members2 = [filename
                  for filename in tfile.getnames()
                  if filename.endswith('.nxml')]

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

# And when you're done extracting members:
tfile.close()
tmpfile.close()
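
The workaround above also mentions a temporary copy on disk as an alternative to the in-memory buffer. The following is a minimal sketch of that variant, assuming the archive is gzip-compressed. `read_first_nxml` is a hypothetical helper name and is not part of the original answer.

```python
import shutil
import tarfile
import tempfile


def read_first_nxml(stream):
    """Spool a forward-only byte stream to disk, then read the first .nxml member.

    `stream` is any binary file-like object with a read() method, for example
    the object returned by urllib.request.urlopen(...).
    Returns the member's bytes, or None if the archive has no .nxml file.
    """
    with tempfile.TemporaryFile() as tmp:
        # Copy the whole download into a seekable temporary file.
        shutil.copyfileobj(stream, tmp)
        tmp.seek(0)
        # "r:gz" is the seekable mode, so getmembers() followed by
        # extractfile() is safe here.
        with tarfile.open(fileobj=tmp, mode="r:gz") as tar:
            for member in tar.getmembers():
                if member.isfile() and member.name.endswith(".nxml"):
                    return tar.extractfile(member).read()
    return None
```

It could be called as `read_first_nxml(urllib.request.urlopen(tarfile_url))`. A disk-backed temporary file keeps memory use low for large archives, at the cost of writing the whole download to disk once.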

abithluo2#

tl;dr: drop getmembers to keep the stream

with tarfile.open(fileobj=response_body, mode="r|gz") as my_tar:
    for member in my_tar: # move to the next file each loop
        current_file_contents = my_tar.extractfile(member)

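This is a very old question, but the answer is drastically different from what most people who run into this problem will want.
If you have a streaming source, you almost certainly want the pipe character in your mode (like r|).
Imagine the entire file as a timeline, where X marks your current position and * marks the first file you want to read: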

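You open the tar and you are at the start: X---*------
You call getmembers, which has to read the entire tar to tell you where all the files are, so your position is now at the end of the file: ---*------X
You try to go back to your file with extractfile (---*X-----), but going backwards is not allowed, because the stream is one-way (once you move past a chunk of the file, that chunk is discarded).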
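
The failure described in this timeline can be reproduced without a network connection. The sketch below builds a small gzip-compressed archive in memory, opens it in stream mode, and calls getmembers before extractfile. The helper names `build_sample_archive` and `read_after_getmembers` and the sample file names are made up for illustration.

```python
import io
import tarfile


def build_sample_archive():
    """Return the bytes of a tiny .tar.gz with two members (illustrative data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in [("a.txt", b"first"), ("b.nxml", b"<article/>")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_after_getmembers(raw):
    """Open `raw` in stream mode, enumerate everything, then try to read member 0."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r|gz") as tar:
        # getmembers() walks to the end of the stream to collect every header.
        members = tar.getmembers()
        try:
            # Reading the first member now needs a backward seek.
            return tar.extractfile(members[0]).read()
        except tarfile.StreamError as exc:
            return f"StreamError: {exc}"


print(read_after_getmembers(build_sample_archive()))
```

Even though the underlying BytesIO object can seek, the "r|gz" mode treats it as forward-only, so the read is expected to fail with the same StreamError as in the question.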

Instead, you can skip the call to getmembers and visit one file at a time:

Open the tar: X---*-----
Come across a file: ---X-----
Read it with extractfile: ---*X----
Repeat until you reach the end: ---*------X

The difference in code is tiny. Instead of this:

with tarfile.open(fileobj=response_body, mode="r|gz") as my_tar:
    for member in my_tar.getmembers():  # getmembers moves through the entire tar file
        current_file_contents = my_tar.extractfile(member)

just drop the call to getmembers:

with tarfile.open(fileobj=response_body, mode="r|gz") as my_tar:
    for member in my_tar:  # move to the next file each loop
        current_file_contents = my_tar.extractfile(member)

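All the current answers require downloading the entire file, whereas you can process it in chunks and, depending on the size of the objects you are handling, save a lot of resources.
Unless you cannot do anything without knowing every file name in the archive, there is no reason to throw away the streaming response and wait for it to be written to disk.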
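
Applied to the original question, the streaming approach could look roughly like the sketch below. It assumes a gzip-compressed archive and a binary file-like `stream` (for example the object returned by urllib.request.urlopen). `iter_nxml` is a hypothetical name chosen for this example.

```python
import tarfile


def iter_nxml(stream):
    """Yield (name, content) for each .nxml member in one forward pass.

    `stream` only needs a read() method; it is never asked to seek backward.
    """
    with tarfile.open(fileobj=stream, mode="r|gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".nxml"):
                # Read the data now: once the loop advances to the next
                # member, this member's bytes can no longer be reached.
                yield member.name, tar.extractfile(member).read()
```

A caller would loop over it, for instance `for name, content in iter_nxml(ftpstream): ...`, and each member is read while the stream is still positioned on it.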

6ss1mwsb3#

I ran into the same error when trying to requests.get a file, so I extracted everything into a tmp directory instead of using BytesIO or extractfile(member):

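import os
import tarfile
from lzma import LZMAFile

# `stream` is assumed to be the file-like body of a streaming requests.get()
# call (for example response.raw) holding an .xz-compressed tar archive.
inputs = [tarfile.open(fileobj=LZMAFile(stream), mode='r|')]
t = "/tmp"
for tarfileobj in inputs:
    # Extract every member in a single forward pass over the stream.
    tarfileobj.extractall(path=t, members=None)
    tarfileobj.close()
for fn in os.listdir(t):
    with open(os.path.join(t, fn)) as payload:
        print(payload.read())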
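
One possible refinement of this approach is to extract into a fresh temporary directory, so that unrelated files already sitting in /tmp are not listed. This is only a sketch: `extract_and_read_nxml` is a hypothetical helper, the "r|*" mode is used here to let tarfile detect the compression, and the extraction filter is only passed when the running Python version provides it.

```python
import os
import tarfile
import tempfile


def extract_and_read_nxml(stream):
    """Extract a tar stream into a private temp dir and return its .nxml files.

    Returns a dict mapping each .nxml path (relative to the archive root)
    to the file's bytes. The temporary directory is removed afterwards.
    """
    results = {}
    with tempfile.TemporaryDirectory() as tmpdir:
        with tarfile.open(fileobj=stream, mode="r|*") as tar:
            # Use the safer "data" extraction filter where it is available.
            extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(path=tmpdir, **extra)
        # Walk the extracted tree, since members may sit in subdirectories.
        for root, _dirs, files in os.walk(tmpdir):
            for fn in files:
                if fn.endswith(".nxml"):
                    path = os.path.join(root, fn)
                    with open(path, "rb") as payload:
                        results[os.path.relpath(path, tmpdir)] = payload.read()
    return results
```

Compared with streaming member by member, this writes every file in the archive to disk, so it mainly makes sense when several members are needed after extraction.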

c8ib6hqw4#

A very simple solution is to change how tarfile reads the file. Instead of:

tfile = tarfile.open(tarfile_name)

change it to:

with tarfile.open(fileobj=f, mode='r:*') as tar:

The important part is to put ":" in the mode.
You can check this answer to read more about it.
