Detect data in txt files, parse the strings, and output to a CSV file

clj7thdc  posted on 2023-07-31  in  Other

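Here is my code. I'm using it to find a bunch of text files in a folder, then parse the data in each one as strings and write it out to a CSV file. Could you give me some hints? I'm struggling.
The first step of my code detects where the data sits in each txt file. I found that all the data starts with "Read", so I locate the line where the data begins in each file. After that, I'm stuck on how to export the data to a CSV file.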

import os
import argparse
import csv
from typing import List

def validate_directory(path):
    if os.path.isdir(path):
        return path
    else:
        raise NotADirectoryError(path)

def get_data_from_file(file) -> List[str]:
    ignore_list = ["Read Segment", "Read Disk", "Read a line", "Read in"]
    data = []
    with open(file, "r", encoding="latin1") as f:
        try:
            lines = f.readlines()
        except Exception as e:
            print(f"Unable to process {file}: {e}")
            return []
        for line_number, line in enumerate(lines, start=1):
            if not any(variation in line for variation in ignore_list):
                if line.strip().startswith("Read ") and not line.strip().startswith("Read ("): # TODO: fix this with better regex
                    data.append(f'Found "Read" at line {line_number} in {file}')
                    print(f'Found "Read" at {file}:{line_number}')
                    print(lines[line_number-1])
    return data

def list_read_data(directory_path: str) -> List[str]:
    total_data = []
    for root, _, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".txt"):
                data = get_data_from_file(os.path.join(root, file_name))
                total_data.extend(data)

    return total_data

def write_results_to_csv(output_file: str, data: List[str]):
    with open(output_file, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Results"])
        for line in data:
            writer.writerow([line])

def main(directory_path: str, output_file: str):
    data = list_read_data(directory_path)
    write_results_to_csv(output_file, data)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process the 2020Model folder for input data."
    )
    parser.add_argument(
        "--directory", type=validate_directory, help="folder to be processed"
    )
    parser.add_argument("--output", type=str, help="Output file name (e.g., outputfile.csv)", default="outputfile.csv")

    args = parser.parse_args()
    main(os.path.abspath(args.directory), args.output)

Here is my ideal CSV output:
| 1986 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 |  | 1994 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 36962 | 37856 | 41971 | 40838 | 44640.87 | 42826.34 | 44883.03 | 43077.59 | 45006.49 | 46789 | 46789 |
Could you give me some hints on:

  • Where should the string parsing go?
  • How do I write the output to a CSV file?

Here is a sample txt file:

Select Year(2007-2025)
Read TotPkSav
/2007     2008     2009     2010     2011     2012     2013     2014     2015     2016     2017     2018     2019     2020     2021     2022     2023     2024     2025 
   00       27       53       78      108      133      151      161      169      177      186      195      205      216      229      242      257      273      288

eulz3vhy  #1

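If all of your files look like those four lines, I recommend reading each file into a list of lines rather than iterating over the lines one at a time. I also recommend just using glob with recursive=True instead of walking the tree yourself.
Because the files are read inside a for loop, any file that doesn't have the right properties can be skipped with continue, which moves on to the next file in the loop: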

import csv
import glob

all_rows: list[list[str]] = []

for fname in glob.glob("**/*.txt", recursive=True):
    with open(fname, encoding="iso-8859-1") as f:
        print(f"reading {fname}")
        lines = [x.strip() for x in list(f)]

        if len(lines) != 4:
            print(f'skipping {fname} with too few lines"')
            continue

        line2 = lines[1]
        if line2[:4] != "Read" or line2[:6] == "Read (":
            print(f'skipping {fname} with line2 = "{line2}"')
            continue

        line3, line4 = lines[2:4]

        if line3.startswith("/"):  # drop the leading slash; also safe if line3 is empty
            line3 = line3[1:]

        header = [x for x in line3.split(" ") if x]
        data = [x for x in line4.split(" ") if x]
      
        all_rows.append(header)
        all_rows.append(data)

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Result"])
    writer.writerows(all_rows)

I mocked up a few files and distributed them across my tree:

- .
 - a
    input3.txt
 - b
    foo.txt
   input1.txt
   input2.txt
   main.py


When I run this program from the root of the tree, I get:

reading input1.txt
reading input2.txt
skipping input2.txt with line2 = "Read (TotPkSav)"
reading a/input3.txt
reading b/foo.txt
skipping b/foo.txt with too few lines"
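
As an aside, the file-discovery step can be checked on its own. The sketch below rebuilds the mock tree in a temporary directory and lists what a recursive `**/*.txt` pattern picks up. The helper name `find_txt_files` is hypothetical, and the files are created empty because their contents don't matter for discovery. It only lists files and writes no CSV.

```python
import glob
import os
import tempfile
from pathlib import Path


def find_txt_files(root: str) -> list[str]:
    """List .txt files under root, as sorted paths relative to root."""
    # With recursive=True, "**" matches zero or more directories,
    # so files directly in root are found as well as nested ones.
    pattern = os.path.join(root, "**", "*.txt")
    found = glob.glob(pattern, recursive=True)
    return sorted(Path(p).relative_to(root).as_posix() for p in found)


with tempfile.TemporaryDirectory() as root:
    # Recreate the mock tree from the answer (empty files, assumed layout).
    for rel in ["input1.txt", "input2.txt", "a/input3.txt", "b/foo.txt", "main.py"]:
        path = Path(root, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("")
    txt_files = find_txt_files(root)
    print(txt_files)
```

main.py is left out because it does not match the pattern, and both the root-level and the nested .txt files are returned.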


output.csv looks like:

| Result |
|--------|
| 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 |
| 00   | 27   | 53   | 78   | 108  | 133  | 151  | 161  | 169  | 177  | 186  | 195  | 205  | 216  | 229  | 242  | 257  | 273  | 288  |
| 2099 | 2098 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 |
| 00   | 27   | 53   | 78   | 108  | 133  | 151  | 161  | 169  | 177  | 186  | 195  | 205  | 216  | 229  | 242  | 257  | 273  | 288  |
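
On the question of where to put the string parsing: one option is to pull the per-file checks out of the loop into a small function that takes the lines of one file and returns the header and data rows, or None when the file should be skipped. The sketch below is only an illustration of that idea, assuming every file follows the four-line layout of the sample. The names `parse_read_block` and `rows_to_csv_text` are hypothetical, and the sample lines are shortened from the example file. It uses `str.split()` with no argument, which splits on any run of whitespace, so no filtering of empty strings is needed.

```python
import csv
import io
from typing import Optional


def parse_read_block(lines: list[str]) -> Optional[tuple[list[str], list[str]]]:
    """Return (header, values) for a 4-line "Read" file, or None to skip it.

    Assumed layout: line 2 starts with "Read", line 3 holds the years,
    line 4 holds the values.
    """
    lines = [x.strip() for x in lines]
    if len(lines) != 4:
        return None
    if not lines[1].startswith("Read") or lines[1].startswith("Read ("):
        return None
    # lstrip("/") drops the leading slash (and any repeated ones).
    header = lines[2].lstrip("/").split()
    values = lines[3].split()
    return header, values


def rows_to_csv_text(rows: list[list[str]]) -> str:
    """Serialize rows with csv.writer into a string, handy for inspection."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


sample = [
    "Select Year(2007-2025)",
    "Read TotPkSav",
    "/2007     2008     2009 ",
    "   00       27       53",
]
parsed = parse_read_block(sample)
if parsed is not None:
    header, values = parsed
    print(rows_to_csv_text([header, values]))
```

Inside the glob loop, the body would then reduce to calling the function on the file's lines and appending the two rows when the result is not None. Writing to a real file works the same way as with the in-memory buffer: open it with newline="" and pass the rows to writerows.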
