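I have a CSV file with a column named click_id, and I want to use each click_id to find the corresponding entries in a large Apache log file (about 3 GB). Once a matching log entry is found, I need to extract the user agent and other information from it. I also want to group and count similar log entries and write the results to another CSV file.
What is the most efficient and reliable way to do this in Python? What is the best way to handle a log file this large so that the script runs efficiently without exhausting memory or causing other performance problems?
This is what I have tried so far, but it has been running for 3 days and still has not finished.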
import csv
from collections import defaultdict
from user_agents import parse

clickid_list = []
device_list = []
with open('data.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        # check if click_id column is not blank or null
        if row[29] != "" and row[29] != "null" and row[29] != "click_id":
            clickid_list.append(row[29])
matched_lines_count = defaultdict(int)

def log_file_generator(filename, chunk_size=200 * 1024 * 1024):
    with open(filename, 'r') as file:
        while True:
            chunk = file.readlines(chunk_size)
            if not chunk:
                break
            yield chunk
for chunk in log_file_generator('data.log'):
    for line in chunk:
        for gclid in clickid_list:
            if gclid in line:
                string = "'" + str(line) + "'"
                user_agent = parse(string)
                device = user_agent.device.family
                device_brand = user_agent.device.brand
                device_model = user_agent.device.model
                os = user_agent.os.family
                os_version = user_agent.os.version
                browser = user_agent.browser.family
                browser_version = user_agent.browser.version
                if device in matched_lines_count:
                    matched_lines_count[device]["count"] += 1
                    print(matched_lines_count[device]["count"])
                else:
                    matched_lines_count[device] = {"count": 1, "os": os, "os_version": os_version, "browser": browser, "browser_version": browser_version, "device_brand": device_brand, "device_model": device_model}
# sort by count, highest first
sorted_matched_lines_count = sorted(matched_lines_count.items(), key=lambda x: x[1]['count'], reverse=True)
with open("test_op.csv", "a", newline="") as file:
    writer = csv.writer(file)
    writer.writerows([["Device", "Count", "OS", "OS version", "Browser", "Browser version", "device_brand", "device model"]])
    for line, count in sorted_matched_lines_count:
        # if count['count'] >= 20:
        # print(f"Matched Line: {line} | Count: {count['count']} | OS: {count['os']}")
        # write the data to a CSV file
        writer.writerow([line, count['count'], count['os'], count['os_version'], count['browser'], count['browser_version'], count['device_brand'], count['device_model']])
Sample log:
127.0.0.1 - - [03/Nov/2022:06:50:20 +0000] "GET /access?click_id=12345678925455 HTTP/1.1" 200 39913 "-" "Mozilla/5.0 (Linux; Android 11; SM-A107F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Mobile Safari/537.36"
127.0.0.1 - - [03/Nov/2022:06:50:22 +0000] "GET /access?click_id=123456789 HTTP/1.1" 200 39914 "-" "Mozilla/5.0 (Linux; Android 11; SM-A705FN) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Mobile Safari/537.36"
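Both sample lines look like the Apache "combined" log format, so the click_id and the user agent can be pulled out with regular expressions instead of a substring test. This is only a minimal sketch under that assumption: the names `LOG_PATTERN`, `CLICK_ID_PATTERN` and `parse_log_line` are made up for illustration, and the pattern does not handle user agent strings that contain escaped double quotes.

```python
import re

# Pattern for the Apache "combined" log format seen in the sample lines.
# Group names are illustrative, not part of the original post.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)
# The click_id query parameter inside the request field.
CLICK_ID_PATTERN = re.compile(r'[?&]click_id=([^&\s"]+)')


def parse_log_line(line):
    """Return (click_id, user_agent) for a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    click = CLICK_ID_PATTERN.search(match.group("request"))
    if click is None:
        return None
    return click.group(1), match.group("user_agent")


sample = (
    '127.0.0.1 - - [03/Nov/2022:06:50:22 +0000] '
    '"GET /access?click_id=123456789 HTTP/1.1" 200 39914 "-" '
    '"Mozilla/5.0 (Linux; Android 11; SM-A705FN) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/107.0.0.0 Mobile Safari/537.36"'
)
print(parse_log_line(sample))
```

Extracting the exact click_id matters for these samples: 123456789 is a prefix of 12345678925455, so a plain `in` test for the shorter ID would match both lines.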
Expected result.
I am new to Python, so any code examples or pointers to relevant libraries or tools would be much appreciated.
Thanks, everyone!
1 answer
pqwbnv8z1#
You can use PySpark if you have big data. If you can reduce the data first, you can use Pandas instead. PySpark's API is similar to Pandas.
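As a different technique from the PySpark/Pandas suggestion, a 3 GB log can probably also be handled with the standard library alone. The slow part of the question's script is likely the inner loop, which tests every click_id against every log line. A rough sketch of the alternative: load the IDs into a set, stream the log line by line, extract the click_id with a regular expression, and do one set lookup per line. The function names and both patterns below are assumptions for illustration; column index 29 and the file layout come from the question.

```python
import csv
import re
from collections import Counter

# Hypothetical patterns, assuming the log layout shown in the question.
CLICK_ID_PATTERN = re.compile(r'[?&]click_id=([^&\s"]+)')
USER_AGENT_PATTERN = re.compile(r'"([^"]*)"\s*$')  # last quoted field on the line


def load_click_ids(csv_path, column=29):
    """Read the click_id column (index 29, as in the question) into a set."""
    click_ids = set()
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) > column and row[column] not in ("", "null", "click_id"):
                click_ids.add(row[column])
    return click_ids


def count_user_agents(log_path, click_ids):
    """Stream the log once and count the user agent strings of matching lines."""
    counts = Counter()
    with open(log_path, errors="replace") as handle:
        for line in handle:  # only one line is held in memory at a time
            click = CLICK_ID_PATTERN.search(line)
            if click is None or click.group(1) not in click_ids:
                continue
            agent = USER_AGENT_PATTERN.search(line)
            if agent is not None:
                counts[agent.group(1)] += 1
    return counts


def write_counts(counts, out_path):
    """Write user agent counts to a CSV file, most frequent first."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["User agent", "Count"])
        writer.writerows(counts.most_common())
```

Usage would be along the lines of `write_counts(count_user_agents('data.log', load_click_ids('data.csv')), 'test_op.csv')`. To get device, OS and browser fields, each distinct user agent string in the counter could then be passed to `parse` from the `user_agents` package used in the question and the counts regrouped by `device.family`, so the parser runs once per distinct string rather than once per matching line. Note that the question's script passes the whole log line to `parse`; passing only the user agent field is probably what was intended.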