Python, CSV, large file

6qqygrtg  posted on 2023-04-03  in  Python

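I am trying to open a 100 GB file in Python. It is the Stack Overflow archive (https://archive.org/download/stackexchange, archive name stackoverflow.com-Posts.7z, which contains Posts.xml) with posts from 2008 to the present (2023). The problem is that the file is so large that my computer struggles to process it, and a full pass could take hours or days. I have to convert the XML file to CSV, but I am only interested in posts from 2022, which probably make up less than 5 GB of the 100 GB total. I tried to "cut off" the first 70% of the file, i.e. start reading after 70% of the content, because the most recent data is at the bottom of the file, but I could not get that to work. How can I do this? I am certain the first 70 GB are not needed, because that was the size of the file in 2019. ChatGPT was unable to help me.
So this is how I started parsing it: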

'''
    This script takes a raw xml file as input and outputs
    an almost clean csv file
'''

import re, csv
from bs4 import BeautifulSoup

# file paths
IN = 'C:\\Users\\.....\\Desktop\\StackOverflow archive\\Posts.xml' #raw xml file
OUT = 'C:\\Users\\.....\\Desktop\\StackOverflow archive\\almostCleanPosts.csv' #almost clean csv file

# regex patterns
TAG_PATTERN = re.compile('<|>')
NEW_LINE = re.compile('\n')
# raw strings keep backslashes such as \s and \w intact for the regex engine
BODY_1 = re.compile(r'\n?<pre(.*?)><code>(.*?)</code></pre>\n?|<div class=".*<div class=".*>', flags=re.S)
BODY_2 = re.compile(r'\s?<code>\w*</code>.\s?|\s?<a href=.*">|</a>')
BODY_3 = re.compile(r'\s?<\w*>|</\w*>(.|\s)?')
BODY_4 = re.compile(r'&nbsp;|http(s?)://\S*|<img src.*">|<br/>*|br>|<br />*|<ol.*">|/p>|/strong>|/li>|&lt;key>|&lt;/key>|/code>|/a>|#?&\w\w\w?;?')

# method to clean instances using regex
def cleanup(dirty):
    clean = re.sub(NEW_LINE,' ', dirty)    
    clean = re.sub(BODY_1,' ', clean)
    clean = re.sub(BODY_2,' ', clean)
    clean = re.sub(BODY_3,' ', clean)
    clean = re.sub(BODY_4,' ', clean)
    return clean

# 'w' truncates any existing output file; newline='' stops the csv module
# from writing blank lines on Windows, and utf-8 avoids encoding errors on write
with open(IN, 'r', encoding='utf-8') as inFile, open(OUT, 'w', newline='', encoding='utf-8') as outFile:
    outFileWriter = csv.writer(outFile, delimiter=';')
    outFileWriter.writerow(['id', 'post', 'title', 'tags']) # write header

    for post in inFile:
        # parse one line at a time; the "lxml" HTML parser lowercases tag and attribute names
        soup = BeautifulSoup(post, "lxml")
        line = soup.find("row")
        newRow = [] # new list to append as a row to the output file
        try:
            #date = line.attrs['creationdate'] # uncomment these two lines to keep only 2022
            #if date.startswith("2022"):
                typeId = line.attrs['posttypeid']
                newRow.append(line.attrs['id']) # first field

                if typeId == '1': # is a question
                    newRow.append(cleanup(line.attrs['body'])) # second field
                    newRow.append(line.attrs['title']) # third field
                    newRow.append(re.sub(TAG_PATTERN,' ', line.attrs['tags'])) # fourth field

                elif typeId == '2': # is an answer, title and tags attributes are missing
                    newRow.append(cleanup(line.attrs['body'])) # second field
                    # leave title and tag fields empty
                    newRow.append('') # third field
                    newRow.append('') # fourth field

                outFileWriter.writerow(newRow) # append cleaned row to the file

        except (AttributeError, KeyError):
            # no <row> element on this line, or an expected attribute is missing: skip it
            continue

I would like to get all posts from 2022 in CSV format. Please help me.
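The idea of starting at about 70% of the file can be sketched with a byte-offset seek. This is only a minimal sketch: it assumes Posts.xml stores one `<row ... />` per line (which the per-line loop in the script above also relies on) and that `CreationDate` values start with the four-digit year. The helper names `iter_lines_from_fraction` and `created_in_year` are made up for illustration.

```python
import os


def iter_lines_from_fraction(path, fraction):
    """Yield decoded lines of `path`, starting roughly `fraction` of the way in."""
    size = os.path.getsize(path)
    offset = int(size * fraction)
    with open(path, 'rb') as f:  # binary mode, so seek() takes a plain byte offset
        if offset > 0:
            # Step back one byte and read to the next newline: this discards
            # the partial line we landed in, or only the newline itself if
            # the offset happens to fall exactly on a line boundary.
            f.seek(offset - 1)
            f.readline()
        for raw in f:
            # b'\n' never occurs inside a multi-byte UTF-8 sequence,
            # so every complete line decodes cleanly.
            yield raw.decode('utf-8')


def created_in_year(line, year):
    """Cheap substring check, assuming dates look like CreationDate="2022-..."."""
    return f'CreationDate="{year}-' in line
```

Each yielded line can then be handed to the existing per-line parsing, for example `for post in iter_lines_from_fraction(IN, 0.7): ...`, with `created_in_year(post, 2022)` as a fast pre-filter before building the soup. Because the 70% figure is only an estimate, it is worth checking the date of the first line returned and lowering the fraction if that date is already in 2022.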

iezvtpos1#

Here is a solution using PowerShell.

using assembly System.Xml
using assembly System.Xml.Linq

$FILENAME = "c:\temp\posts.xml"

$reader = [System.Xml.XmlReader]::Create($FILENAME)
$lines = 0
write-Host "START"
while($reader.EOF -eq $False)
{
   $lines++
   if($lines -eq 100) { break }   # demo limit: stop after 100 iterations; remove this line to process the whole file
   if ($reader.Name -ne "row")
   {
      $reader.ReadToFollowing("row") | out-null
   }
   if ($reader.EOF -eq $False)
   {
       $element = [System.Xml.Linq.XElement]::ReadFrom($reader)
       Write-Host "ID = " $element.Attribute("Id").Value
       Write-Host "PostTypeId = " $element.Attribute("PostTypeId").Value
       Write-Host "ParentId = " $element.Attribute("ParentId").Value
       Write-Host "CreationDate = " $element.Attribute("CreationDate").Value
       Write-Host "Score = " $element.Attribute("Score").Value
       Write-Host "Body = " $element.Attribute("Body").Value
       Write-Host "OwnerUserId = " $element.Attribute("OwnerUserId").Value
       Write-Host "LastActivityDate = " $element.Attribute("LastActivityDate").Value
       Write-Host "CommentCount = " $element.Attribute("CommentCount").Value
       Write-Host "ContentLicense = " $element.Attribute("ContentLicense").Value
   }
}
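The same streaming approach can be sketched in Python with the standard library's `xml.etree.ElementTree.iterparse`, which reads the file incrementally instead of loading it whole. This is a minimal sketch rather than a tuned solution: the function name `export_year` is hypothetical, the attribute names are taken from the PowerShell example above plus `Title` and `Tags` as assumed names for the question fields, and the body is written without the regex cleanup from the question. It still scans the entire file, so it keeps memory flat but does not skip the older years.

```python
import csv
import xml.etree.ElementTree as ET


def export_year(xml_path, csv_path, year):
    """Stream <row> elements and write those created in `year` to a CSV file."""
    prefix = f'{year}-'
    written = 0
    with open(csv_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out, delimiter=';')
        writer.writerow(['id', 'post', 'title', 'tags'])

        context = ET.iterparse(xml_path, events=('start', 'end'))
        _event, root = next(context)  # first event is the start of the root element
        for event, elem in context:
            if event == 'end' and elem.tag == 'row':
                if elem.get('CreationDate', '').startswith(prefix):
                    writer.writerow([
                        elem.get('Id'),
                        elem.get('Body', ''),
                        elem.get('Title', ''),  # absent on answers
                        elem.get('Tags', ''),   # absent on answers
                    ])
                    written += 1
                # Drop processed rows from the tree so memory use stays flat.
                root.clear()
    return written
```

A call such as `export_year('Posts.xml', 'posts_2022.csv', 2022)` (both paths are placeholders) returns the number of rows written. Clearing the root after each row is what keeps this usable on a file of this size; without it the parser would keep every parsed element in memory.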
