Exception when trying to parse a large JSON file with ijson

2lpgd968  posted on 2023-01-14  in Other

I am trying to parse a large JSON file (16 GB) with ijson, but I always get the following error:

Exception has occurred: IncompleteJSONError
lexical error: invalid char in json text.
          venue" : {          "type" : NumberInt(0)      },       "yea
                     (right here) ------^
  File "C:\pyth\dblp_parser.py", line 14, in <module>
    for record in ijson.items(f, 'item', use_float=True):

My code is as follows:

import ijson

paper_id_tab = []

with open("dblpv13.json", "rb") as f:
    for record in ijson.items(f, 'records.item', use_float=True):
        paper_id = record["_id"]  # _id is only used as a test
        paper_id_tab.append(paper_id)

Part of my JSON file looks like this:

{
    "_id" : "53e99784b7602d9701f3f636",
    "title" : "Flatlined",
    "authors" : [
        {
            "_id" : "53f58b15dabfaece00f8046d",
            "name" : "Peter J. Denning",
            "org" : "ACM Education Board",
            "gid" : "5b86c72de1cd8e14a3c2b772",
            "oid" : "544bd99545ce266baef0668a",
            "orgid" : "5f71b2811c455f439fe3c58a"
        }
    ],
    "venue" : {
        "_id" : "555036f57cea80f954169e28",
        "raw" : "Commun. ACM",
        "raw_zh" : null,
        "publisher" : null,
        "type" : NumberInt(0)
    },
    "year" : NumberInt(2002),
    "keywords" : [
        "linear scale",
        "false dichotomy"
    ],
    "n_citation" : NumberInt(7),
    "page_start" : "15",
    "page_end" : "19",
    "lang" : "en",
    "volume" : "45",
    "issue" : "6",
    "issn" : "",
    "isbn" : "",
    "doi" : "10.1145/508448.508463",
    "pdf" : "",
    "url" : [
        "http://doi.acm.org/10.1145/508448.508463"
    ],
    "abstract" : "Our propensity to create linear scales between opposing alternatives creates false dichotomies that hamper our thinking and limit our action."
},

I tried iterating over the records item by item, but I always get the same error. I am completely stuck. Can anyone help?

lb3vh1jj  #1

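I ran into the same problem with my dataset. ijson cannot parse it because NumberInt(...) is MongoDB shell syntax rather than valid JSON. I worked around it by creating a cleaned copy of the dataset and parsing that copy with ijson instead. The approach is simple: read the original dataset as plain text, remove each "NumberInt(" and its matching ")", and write the result to a new JSON file. The code is shown below.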

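import re

# NumberInt(123) is MongoDB shell syntax, not valid JSON; keep only the number.
number_int = re.compile(r'NumberInt\((-?\d+)\)')

# The output file must be opened for writing ('w').
with open('dblpv13.json', 'r', errors='ignore') as src, \
     open('dblpv13_clean.json', 'w') as dst:
    for line in src:
        # A plain replace(")", "") would also strip parentheses inside titles and abstracts.
        dst.write(number_int.sub(r'\1', line))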
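
Before running the cleanup on the full 16 GB file, it may be worth checking it on a small sample. The following is a minimal sketch, not part of the original answer: the clean_line helper and the sample text are made up for illustration (the title with parentheses is invented to show that ordinary parentheses survive), and it assumes every NumberInt(...) wraps a plain integer literal.

```python
import json
import re

# Assumption: NumberInt(...) always wraps a plain (optionally negative) integer.
NUMBER_INT = re.compile(r'NumberInt\((-?\d+)\)')


def clean_line(line: str) -> str:
    """Hypothetical helper: replace NumberInt(n) with n, leave the rest untouched."""
    return NUMBER_INT.sub(r'\1', line)


# Made-up sample in the same shape as the dataset excerpt in the question.
sample = '''{
    "venue" : {
        "raw" : "Commun. ACM",
        "type" : NumberInt(0)
    },
    "year" : NumberInt(2002),
    "title" : "Flatlined (reprint)"
}'''

cleaned = "".join(clean_line(line) for line in sample.splitlines(keepends=True))

# The standard json module rejects the raw sample but accepts the cleaned text.
record = json.loads(cleaned)
print(record["year"], record["venue"]["type"], record["title"])
```

If json.loads accepts the cleaned sample, the same substitution should be safe to apply line by line to the whole file.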

The new dataset can now be parsed with ijson, as shown below.

import ijson

with open('dblpv13_clean.json', 'rb') as f:  # ijson works best with a binary file object
    for i, element in enumerate(ijson.items(f, 'item')):
        ...  # process each record here
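
If writing a second 16 GB file is undesirable, the same cleanup could in principle be applied on the fly. The sketch below is an assumption-laden illustration rather than the answerer's method: NumberIntStripper is a hypothetical wrapper class, it assumes the file is pretty-printed so that no NumberInt(...) is split across lines, and the demo uses the standard json module on a tiny in-memory sample instead of the real dataset.

```python
import io
import json
import re

# Bytes pattern, because the wrapped file is opened in binary mode.
NUMBER_INT = re.compile(rb'NumberInt\((-?\d+)\)')


class NumberIntStripper:
    """Hypothetical file-like wrapper that strips NumberInt(...) while reading."""

    def __init__(self, binary_file):
        self._lines = iter(binary_file)  # iterate the source line by line
        self._pending = b""              # cleaned bytes not yet handed out

    def read(self, size=-1):
        # Accumulate cleaned lines until enough bytes are buffered or input ends.
        while size < 0 or len(self._pending) < size:
            line = next(self._lines, None)
            if line is None:
                break
            self._pending += NUMBER_INT.sub(rb'\1', line)
        if size < 0:
            chunk, self._pending = self._pending, b""
        else:
            chunk, self._pending = self._pending[:size], self._pending[size:]
        return chunk


# Tiny made-up sample standing in for the real file.
raw = io.BytesIO(b'[{"year" : NumberInt(2002)},\n{"n_citation" : NumberInt(7)}]')
records = json.loads(NumberIntStripper(raw).read())
print(records)
```

Since ijson reads its input through read() calls on a file-like object, passing NumberIntStripper(f) in place of f to ijson.items(..., 'item') should work in the same way, though that combination is an assumption and has not been checked against the full dataset.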
