csv 将具有不同分隔符的mixpanel Json数据更改为dataframe

i7uq4tfw  于 2023-07-31  发布在  其他
关注(0)|答案(1)|浏览(90)

我在处理从Mixpanel获取的有关“ProductDetails”事件的数据时遇到了问题。数据结构如下所示:

[{'event': 'ProductDetails',
  'properties': {'time': 1682992328,
   'distinct_id': '7328f204-92b8-cef5eacc8f6f',
   '$app_build_number': '57',
   '$app_version_string': '1.11.1',
    '$screen_width': 1080,
   'eventName': 'ProductDetails',
   'mp_country_code': 'US',
   'mp_processing_time_ms': 1683017566066,
   'timestamp': '1683017528230',
   '{ProductName=FL 100 TABS., ProductId=9a6cfdde-e6bf34b669ec, source=list}': ''}},
 {'event': 'ProductDetails',
  'properties': {'time': 1683723803,
   'distinct_id': 'bcbf4f42--ee1567f4a14e',
   '$app_build_number': '57',
   '$app_release': '57',
   '$screen_height': 2316,
   'eventName': 'ProductDetails',
   'mp_country_code': 'US',
   'mp_processing_time_ms': 1683749007744,
   'timestamp': '1683749003851',
   '{ProductName=ADVANCE 24 TABLETS, ProductId=a6b60bcd-73ed2321b8af, source=list}': ''}},
{'event': 'ProductDetails',
  'properties': {'time': 1683814991,
   'distinct_id': '66bfd1d5--9fb70ccc153a',
   '$app_build_number': '57',
   '$app_release': '57',
      '$screen_dpi': 480,
   '$screen_height': 2158,
   '$screen_width': 1080,
   '$wifi': True,
   'eventName': 'ProductDetails',
   'mp_country_code': 'CA',
   'mp_lib': 'android',
   'mp_processing_time_ms': 1683840191858,
   'timestamp': '1683843722541',
   '{ProductName= Tav 30 CAP., ProductId=a830804e-80e8a03d3dae, source=list}': ''}}]

字符串
我尝试使用提供的代码将此数据转换为CSV文件,但标题不正确,并且丢失了一些数据。我怀疑问题在于每个对象的最后一部分:
'{ProductName= Tav 30 CAP.,ProductId=a830804e-80e8a03d3dae,source=list}':''
下面是尝试将数据转换为CSV的修改后的代码:

import csv
# Flatten the data
flattened_data = []
for item in data:
    properties = item['properties']
    flattened_item = {}
    for key, value in properties.items():
        if isinstance(value, dict):
            flattened_item.update(value)
        else:
            flattened_item[key] = value
    flattened_data.append(flattened_item)

# Extract column headers
column_names = set()
for item in flattened_data:
    column_names.update(item.keys())
column_names.discard('')

# Write data to CSV
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=column_names)
    writer.writeheader()
    writer.writerows(flattened_data)


我需要协助正确转换数据到一个准确的标题和完整的数据CSV文件。

dw1jzc5e

dw1jzc5e1#

虽然从技术上讲是嵌套的,但看起来你只需要处理一个一层深的子对象。对于这样简单的数据(看起来),我推荐一种非常直接和明确的方法。
对于像这样的对象:

[
    {
        "event": "Foo",
        "properties": {
            "time": 1,
            "distinct_id": "a",
            "$screen_width": 1000,
            "{ProductName=X, ProductId=xx, source=list}": ""
        }
    },
    {
        "event": "Foo",
        "properties": {
            "time": 2,
            "distinct_id": "b",
            "$screen_height": 2000,
            "{ProductName=Y, ProductId=yy, source=list}": ""
        }
    },
    {
        "event": "Bar",
        "properties": {
            "time": 3,
            "distinct_id": "c",
            "$screen_dpi": 100,
            "{ProductName= Z, ProductId=zz, source=list}": ""
        }
    }
]

字符串
将properties对象中的键值对“提升”到主对象:

with open('input.json', encoding='utf-8') as f:
    data = json.load(f)

# "Lift" properties up to level of event
for x in data:
    x.update(**x["properties"])
    del x["properties"]


x.update(**x["properties"])表示“获取属性下的所有键值对,并将它们添加到主(事件)字典中”。
一个事件,如:

{
    "event": "Foo",
    "properties": {
        "time": 1,
        "distinct_id": "a",
        "$screen_width": 1000,
        "{ProductName=X, ProductId=xx, source=list}": ""
    }
}


变成:

{
    "event": "Foo",
    "properties": {
        "time": 1,
        "distinct_id": "a",
        "$screen_width": 1000,
        "{ProductName=X, ProductId=xx, source=list}": ""
    }
    "time": 1,
    "distinct_id": "a",
    "$screen_width": 1000,
    "{ProductName=X, ProductId=xx, source=list}": ""
}


然后删除原始属性键值对(del x["properties"]),我们有:

{
    "event": "Foo",
    "time": 1,
    "distinct_id": "a",
    "$screen_width": 1000,
    "{ProductName=X, ProductId=xx, source=list}": ""
}


然后继续处理奇怪的"{ProductName=X, ProductId=xx, ...}": ""键值对。我做了一些检查,以确保字符串基本上符合解析器的预期:

# Parse "{ProductName=X, ProductId=xx, ...}" key into constituent key-value pairs
for i, x in enumerate(data, start=1):
    for key in list(x.keys()):
        if "ProductName" not in key:
            continue

        key_ = key  # copy to preserve original key for deleting at end

        if key_[0] != "{" or key_[-1] != "}":
            print(f'error: event {i}, expected "{key_}" to be bracketed with {{...}}')
            continue

        key_ = key_[1:-1]

        if key_.count("=") - key_.count(",") != 1:
            print(f'error: event {i}, expected "{key_}" to contain one more equal sign than comma')
            continue

        for kv_pair in key_.split(","):
            k, v = kv_pair.split("=")
            x[k.strip()] = v.strip()

        del x[key]


所有这些都会产生dicts列表:

[
    {'event': 'Foo', 'time': 1, 'distinct_id': 'a', '$screen_width': 1000, 'ProductName': 'X', 'ProductId': 'xx', 'source': 'list'},
    {'event': 'Foo', 'time': 2, 'distinct_id': 'b', '$screen_height': 2000, 'ProductName': 'Y', 'ProductId': 'yy', 'source': 'list'},
    {'event': 'Bar', 'time': 3, 'distinct_id': 'c', '$screen_dpi': 100, 'ProductName': 'Z', 'ProductId': 'zz', 'source': 'list'},
]


该列表中的不同字典有不同的键,所以我们需要再循环一次来收集所有键的最终集合:

final_keys = {}
for x in data:
    final_keys.update(**x)

print(list(final_keys.keys()))


打印:

['event', 'time', 'distinct_id', '$screen_width', 'ProductName', 'ProductId', 'source', '$screen_height', '$screen_dpi']


我使用了一个dict来尽可能地保持密钥的顺序。如果你需要一个特定的顺序,最简单的方法是硬编码一个字段名列表,比如:

final_keys = ['event', 'time', 'distinct_id', '$screen_width', '$screen_height', ...]


我们可以使用csv模块的DictWriter及其 restval 关键字来指示它如何处理任何没有所有键的dict(在我的示例中,基于您的示例,没有一个dict拥有所有键),并给予final_keys的整个dict(不需要调用其.keys()方法):

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=final_keys, restval="NULL")
    writer.writeheader()
    writer.writerows(data)


并生成以下CSV:

| event | time | distinct_id | $screen_width | ProductName | ProductId | source | $screen_height | $screen_dpi |
|-------|------|-------------|---------------|-------------|-----------|--------|----------------|-------------|
| Foo   | 1    | a           | 1000          | X           | xx        | list   | NULL           | NULL        |
| Foo   | 2    | b           | NULL          | Y           | yy        | list   | 2000           | NULL        |
| Bar   | 3    | c           | NULL          | Z           | zz        | list   | NULL           | 100         |

相关问题