用Pandas读取Excel XML .xls文件

gz5pxeao  于 2023-01-21  发布在  其他
关注(0)|答案(5)|浏览(187)

我知道以前提出过很多问题,但是给出的解决方案都不能用于下面提供的可重复示例。
我试图从www.example.com读取.xls文件http://www.eia.gov/coal/data.cfm#production--特别是历史详细煤炭产量数据(1983-2013)coalpublic2012.xls文件,它可以通过下拉菜单免费获得。Pandas无法读取它。
相比之下,最近一年可用的文件(2013)coalpublic2013.xls文件工作正常:

import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")

但是接下来十年的X1 M3 N1 X文件(2004-2012)没有加载。我用Excel看过这些文件,它们打开了,没有损坏。
我从Pandas那里得到的错误是:

---------------------------------------------------------------------------
XLRDError                                 Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
    161 
    162     if not isinstance(io, ExcelFile):
--> 163         io = ExcelFile(io, engine=engine)
    164 
    165     return io._parse_excel(

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
    204                 self.book = xlrd.open_workbook(file_contents=data)
    205             else:
--> 206                 self.book = xlrd.open_workbook(io)
    207         elif engine == 'xlrd' and isinstance(io, xlrd.Book):
    208             self.book = io

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    433         formatting_info=formatting_info,
    434         on_demand=on_demand,
--> 435         ragged_rows=ragged_rows,
    436         )
    437     return bk

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
   1228             bof_error('Expected BOF record; met end of file')
   1229         if opcode not in bofcodes:
-> 1230             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1231         length = self.get2bytes()
   1232         if length == MY_EOF:

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
   1222         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1223         def bof_error(msg):
-> 1224             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1225         savpos = self._position
   1226         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'

我也尝试过其他各种方法:

df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")

没有用。我的Pandas版本:0.17.0
我也把这个作为bug提交到panda github issues列表中。

cgvd09ve

cgvd09ve1#

您可以通过编程方式转换此Excel XML文件。只有Python和Pandas。

import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = [  ]
        self.cells = [  ]
        self.rows = [  ]
        self.tables = [  ]
    def characters(self, content):
        self.chars.append(content)
    def startElement(self, name, atts):
        if name=="Cell":
            self.chars = [  ]
        elif name=="Row":
            self.cells=[  ]
        elif name=="Table":
            self.rows = [  ]
    def endElement(self, name):
        if name=="Cell":
            self.cells.append(''.join(self.chars))
        elif name=="Row":
            self.rows.append(self.cells)
        elif name=="Table":
            self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])
mnemlml8

mnemlml82#

问题是,虽然2013年的数据是一个真正的Excel文件,但2012年的数据是一个XML文档,Python似乎不支持这种文档。我认为最好的办法是在Excel中打开它,然后将副本保存为适当的Excel文件或CSV。

lyfkaqu1

lyfkaqu13#

您可以通过编程方式转换此Excel XML文件。安装了Windows和Office。
1.在记事本中创建ExcelToCsv. vbs脚本:

if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate

oBook.SaveAs dest_file, csv_format

oBook.Close False
oExcel.Quit

1.将Excel XML文件转换为CSV:
$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1
1.打开带有Pandas的CSV文件
>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)
参考:Faster way to read Excel files to pandas dataframe

6rvt4ljy

6rvt4ljy4#

下面是我对@jrovegno方法的更新(从“Python Cookbook 2nd Edition”复制而来),因为该代码在标题行中添加了空格,不够通用:

import pandas as pd
import xml.sax

def xml_to_dfs(path):
    """Read Excel XML file at path and return list of DataFrames"""

    class ExcelXMLHandler(xml.sax.handler.ContentHandler):
        def __init__(self):
            self.tables = []
            self.chars = []

        def characters(self, content):
            self.chars.append(content)

        def startElement(self, name, attrs):
            if name == "Table":
                self.rows = []
            elif name == "Row":
                self.cells = []
            elif name == "Data":
                self.chars = []

        def endElement(self, name):
            if name == "Table":
                self.tables.append(self.rows)
            elif name == "Row":
                self.rows.append(self.cells)
            elif name == "Data":
                self.cells.append("".join(self.chars))

    exh = ExcelXMLHandler()
    xml.sax.parse(path, exh)
    return [pd.DataFrame(table[1:], columns=table[0]) for table in exh.tables]

基本上,我的XML看起来是这样结构的:

<Worksheet>
    <Table>
        <Row>
            <Cell>
                <Data>  # appears redundant with <Cell>
bttbmeg0

bttbmeg05#

@JBWhitmore我运行了以下代码:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

这将成功读取文件而不会出现任何错误。但是,它会以所提到的精确格式提供所有数据。因此,您可能需要做额外的工作,以便在成功阅读数据后处理数据。

相关问题