python-3.x: Extract information from a list of files and write it to a log file

Posted by ozxc1zmp on 2023-06-25 in Python

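I am running a modified AutoDock program and need to compile the results.
I have a folder containing hundreds of *.pdbqt files named "compound_1.pdbqt", "compound_2.pdbqt", and so on. Each file has the following structure: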

MODEL 1
REMARK minimizedAffinity -7.11687565
REMARK CNNscore 0.573647082
REMARK CNNaffinity 5.82644749
REMARK  11 active torsions:
#Lots of text here
MODEL 2
REMARK minimizedAffinity -6.61898327
REMARK CNNscore 0.55260396
REMARK CNNaffinity 5.86855984
REMARK  11 active torsions:
#Lots of text here
#Repeat with 10 to 20 model

For every compound in the folder, I want to extract "MODEL", "minimizedAffinity", "CNNscore", and "CNNaffinity" into a delimited text file like this:

Compound Model minimizedAffinity CNNscore CNNaffinity 
1 1 -7.11687565 0.573647082 5.82644749
1 2 -6.61898327 0.55260396 5.86855984

This is the script I am currently stuck on. It only prints the matching lines:

#! /usr/bin/env python

import sys
import glob

files = glob.glob('**/*.pdbqt', 
                   recursive = True)
for file in files:
    word1 = 'MODEL'
    word2 = 'minimizedAffinity'
    word3 = 'CNNscore'
    word4 = 'CNNaffinity'
    print(file)
    with open(file) as fp:
        # read all lines in a list
        lines = fp.readlines()
        for line in lines:
        # check if string present on a current line
            if line.find(word1) != -1:
                print('Line:', line)
            if line.find(word2) != -1:
                print('Line:', line)
            if line.find(word3) != -1:
                print('Line:', line)
            if line.find(word4) != -1:
                print('Line:', line)

txu3uszq1#

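Take a look at Pawpaw, a framework designed for building lexical parsers. It can segment text and collect the results into a searchable tree. Here is one way to solve your compound problem: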

Code:

import sys
import os.path
import fnmatch
import typing

import regex
import pawpaw

# Build pawpaw parser
re = regex.compile(r'(?<=^|\n)(?=MODEL \d+)', regex.DOTALL)
splitter = pawpaw.arborform.Split(re)

pat = r"""
(?P<model>
    MODEL\ 
    (?<tag>\d+)
    (?:\n
        (?<remark>
            REMARK\ 
            (?<tag>[^\s]+)\ 
            (?<value>[^\n]+)
        )
    )+
    (?:\n
        (?!REMARK)
        (?<text>.+)
    )?
)+
"""
re = regex.compile(pat, regex.VERBOSE | regex.DOTALL)
extractor = pawpaw.arborform.Extract(re)
con = pawpaw.arborform.Connectors.Delegate(extractor)
splitter.connections.append(con)

# Prints using fixed-width for visibility: change to delimited if needed
def dump_row(cols: list) -> None:
    print(*(f'{v: <18}' for v in cols))  

# Select desired remark columns
desired_remarks = ['minimizedAffinity', 'CNNscore', 'CNNaffinity']

# Headers
headers = ['Compound', 'Model']
headers.extend(desired_remarks)
dump_row(headers)

# Create rows from compound file
def compound_vals(compound: str, ito: pawpaw.Ito) -> typing.Iterable[list[str]]:
    for model in ito.children:
        vals = [compound]
        vals.append(str(model.find('*[d:tag]')))
        for dr in desired_remarks:
            vals.append(str(model.find(f'*[d:remark]/*[d:tag]&[s:{dr}]/>[d:value]')))
        yield vals

# Read files and dump contents of each
for path in os.scandir(os.path.join(sys.path[0])):
    if path.is_file() and fnmatch.fnmatch(path.name, 'compound_*.pdbqt'):
        compound = path.name.split('_', 1)[-1].split('.', 1)[0]  # compound number
        with open(os.path.join(sys.path[0], path)) as f:
            ito = pawpaw.Ito(f.read(), desc='all')
            ito.children.add(*splitter(ito))
            for vals in compound_vals(compound, ito):
                dump_row(vals)

Output:

Compound           Model              minimizedAffinity  CNNscore           CNNaffinity       
1                  1                  -7.11687565        0.573647082        5.82644749        
1                  2                  -6.61898327        0.55260396         5.86855984        
2                  1                  -7.11687565        0.573647082        5.82644749        
2                  2                  -6.61898327        0.55260396         5.86855984

Note: this code dumps its output with Python's print (rather than saving it to a file) so that the results are easy to view here.
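
If the rows should go into a delimited file rather than to the console, something along the following lines might work. This is only a minimal sketch that swaps Pawpaw for the standard library (re, csv, pathlib); it is not the approach shown above. The names parse_models, write_summary, and FIELDS are hypothetical, and it assumes each value is a single whitespace-free token on a line of the form "REMARK <tag> <value>", as in the sample in the question.

```python
import csv
import re
from pathlib import Path

# REMARK tags to report, in output column order (taken from the question)
FIELDS = ["minimizedAffinity", "CNNscore", "CNNaffinity"]

# Assumed line shapes: "MODEL <n>" and "REMARK <tag> <value>"
MODEL_RE = re.compile(r"^MODEL\s+(\d+)\s*$")
REMARK_RE = re.compile(r"^REMARK\s+(\S+)\s+(\S+)\s*$")


def parse_models(text):
    """Yield one (model_number, {tag: value}) pair per MODEL block."""
    model, remarks = None, {}
    for line in text.splitlines():
        m = MODEL_RE.match(line)
        if m:
            # A new MODEL line closes the previous block, if there was one
            if model is not None:
                yield model, remarks
            model, remarks = m.group(1), {}
            continue
        m = REMARK_RE.match(line)
        if m and model is not None and m.group(1) in FIELDS:
            remarks[m.group(1)] = m.group(2)
    if model is not None:
        yield model, remarks


def write_summary(folder, out_path, delimiter="\t"):
    """Write one row per model for every compound_*.pdbqt file in folder."""
    # Sorting by (length, name) keeps compound_2 ahead of compound_10
    files = sorted(Path(folder).glob("compound_*.pdbqt"),
                   key=lambda p: (len(p.stem), p.stem))
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
        writer.writerow(["Compound", "Model", *FIELDS])
        for path in files:
            compound = path.stem.split("_", 1)[1]  # "compound_12" -> "12"
            for model, remarks in parse_models(path.read_text()):
                # A tag missing from a block becomes an empty cell
                writer.writerow([compound, model,
                                 *(remarks.get(f, "") for f in FIELDS)])
```

Calling write_summary(".", "summary.tsv") would then write a tab-separated file for the current folder; the output file name is only an example, and passing delimiter=" " or delimiter="," selects a different separator.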
