I'm trying to process a 100 GB file in Python: the Stack Exchange archive (https://archive.org/download/stackexchange, the 7z file stackoverflow.com-Posts.7z, which contains Posts.xml) holding posts from 2008 to the present (2023). The problem is that the file is so large that my machine struggles with it, and scanning the whole thing could take hours or days. I need to convert the XML to CSV, but I only care about the 2022 posts, which are probably less than 5 GB of the 100 GB total. Since the newest data is at the bottom of the file, I tried to "cut off" the first 70% of it, i.e. start reading after 70% of the content, but I couldn't get it to work. How can I do this? I'm fairly confident the first ~70 GB aren't needed, since that was the file's size back in 2019. ChatGPT wasn't able to help me.
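The byte-level skip I attempted looks roughly like this (only a sketch: the 70% fraction is my rough guess and the path is a placeholder):

import os

IN = 'Posts.xml'  # placeholder path to the decompressed dump

size = os.path.getsize(IN)      # total file size in bytes
with open(IN, 'rb') as f:       # binary mode, so seek() accepts an arbitrary byte offset
    f.seek(int(size * 0.70))    # jump past roughly the first 70% of the file
    f.readline()                # throw away the partial line we landed in
    for raw in f:               # every following line is a complete line of the XML
        line = raw.decode('utf-8', errors='replace')
        if '<row' not in line:  # ignore anything that is not a <row .../> element
            continue
        # ... parse and convert `line` here ...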
And this is how I was parsing the file to begin with:
'''
This script takes a raw XML file as input and outputs
an almost clean CSV file.
'''
import re, csv
from bs4 import BeautifulSoup

# file paths
IN = 'C:\\Users\\.....\\Desktop\\StackOverflow archive\\Posts.xml'              # raw XML file
OUT = 'C:\\Users\\.....\\Desktop\\StackOverflow archive\\almostCleanPosts.csv'  # almost clean CSV file

# regex patterns (raw strings, so the backslashes reach the regex engine intact)
TAG_PATTERN = re.compile(r'<|>')
NEW_LINE = re.compile(r'\n')
BODY_1 = re.compile(r'\n?<pre(.*?)><code>(.*?)</code></pre>\n?|<div class=".*<div class=".*>', flags=re.S)
BODY_2 = re.compile(r'\s?<code>\w*</code>.\s?|\s?<a href=.*">|</a>')
BODY_3 = re.compile(r'\s?<\w*>|</\w*>(.|\s)?')
BODY_4 = re.compile(r' |http(s?)://\S*|<img src.*">|<br/>*|br>|<br />*|<ol.*">|/p>|/strong>|/li>|<key>|</key>|/code>|/a>|#?&\w\w\w?;?')
# strip markup and noise from a post body using the patterns above
def cleanup(dirty):
    clean = re.sub(NEW_LINE, ' ', dirty)
    clean = re.sub(BODY_1, ' ', clean)
    clean = re.sub(BODY_2, ' ', clean)
    clean = re.sub(BODY_3, ' ', clean)
    clean = re.sub(BODY_4, ' ', clean)
    return clean
open(OUT, 'w').close()  # erase the output file before writing

with open(IN, 'r', encoding='utf-8') as inFile, \
     open(OUT, 'a', encoding='utf-8', newline='') as outFile:  # newline='' so csv adds no blank lines on Windows
    outFileWriter = csv.writer(outFile, delimiter=';')
    outFileWriter.writerow(['id', 'post', 'title', 'tags'])  # write header
    for post in inFile:
        soup = BeautifulSoup(post, "lxml")  # parse this line as markup
        line = soup.find("row")
        newRow = []  # new list to append as a row to the output file
        try:
            #date = line.attrs['creationdate']  # uncomment (with the line below) to keep only 2022
            #if date.startswith("2022"):
            typeId = line.attrs['posttypeid']
            newRow.append(line.attrs['id'])  # first field
            if typeId == '1':  # a question
                newRow.append(cleanup(line.attrs['body']))  # second field
                newRow.append(line.attrs['title'])  # third field
                newRow.append(re.sub(TAG_PATTERN, ' ', line.attrs['tags']))  # fourth field
            elif typeId == '2':  # an answer: the title and tags attributes are missing
                newRow.append(cleanup(line.attrs['body']))  # second field
                newRow.append('')  # third field, left empty
                newRow.append('')  # fourth field, left empty
            outFileWriter.writerow(newRow)  # append the cleaned row to the file
        except (AttributeError, KeyError):  # line is not a <row>, or an attribute is missing
            pass
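If I uncomment the date check, I assume the loop could also stop early: since the newest data is at the bottom, the rows should be in roughly chronological order, so once a 2023 post shows up nothing later in the file can be from 2022. A sketch of that filter inside the loop (lxml lowercases the attribute names, hence 'creationdate'):

from bs4 import BeautifulSoup

for post in inFile:                     # inFile opened exactly as in the script above
    soup = BeautifulSoup(post, "lxml")
    row = soup.find("row")
    if row is None:                     # not a <row> line (e.g. the XML header)
        continue
    date = row.attrs.get('creationdate', '')
    if date < '2022':                   # still in pre-2022 posts: skip and keep reading
        continue
    if date >= '2023':                  # past 2022: stop scanning the remaining gigabytes
        break
    # ... build newRow and write it out, as above ...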
I want to end up with all the posts from 2022 in CSV format. Please help me.
1 Answer
Here is a solution using PowerShell.