How to extract data from this .mol2 file using Python?

Posted by but5z9lq on 2023-05-22 in Linux


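I am working with a .mol2 file (a molecular structure file). The file contains a number of molecules, and each molecule's header lists several properties. How can I extract the information in the headers and build a table or CSV file containing all of the data (properties as column headers, each molecule name as a row header)? The file format for one molecule is shown in this image. To reiterate, I have one large file containing more than 850,000 of these "files", all in this format. Ideally the script would be a Python script run from the command line, but if you know another way (Linux?), please feel free to share.
Thanks!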

linux

Source: https://stackoverflow.com/questions/59297430/how-to-extract-data-from-this-mol2-file-using-python

1 Answer


3hvapo4f1#

import os
import pandas as pd

def get_data_dict(file_path: str, to_replace:str="##########") -> dict:
    """get data headers from a mol2 file (assumes header lines start with `to_replace`)
    :parameter
        - file_path:
          path to the file of interest
        - to_replace:
          the 'marker' that specifies the header
    :return
        - data_dict
          dict containing the header parts as keys and their values as value
    """
    data_dict = {}
    # read only the leading header block, then stop
    with open(file_path, "r") as mol2_file:
        for line in mol2_file:
            # header ended
            if not line.startswith(to_replace):
                break
            # split on the first ":" only, so values that contain ":" stay intact
            key, _, value = line[len(to_replace):].partition(":")
            data_dict[key.strip()] = [value.strip()]
    # also reached for an empty file or a file made up only of header lines
    return data_dict

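# path to the directory containing all files
parent_path = "/PARENT/PATH"
# build one single-row DataFrame per file and concatenate once at the end,
# which avoids copying the growing table on every iteration
frames = []
for file in os.listdir(parent_path):
    frames.append(pd.DataFrame.from_dict(get_data_dict(os.path.join(parent_path, file))))
# ignore_index gives each file its own row number instead of a repeated 0
all_data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()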

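This collects all of the headers in a pandas DataFrame, which you can then manipulate however you like (for example, save it as CSV or process it further). For those who come here looking for a way to parse mol2 files, you can do that with biopandas.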
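
The code above reads one file per molecule from a directory, while the question describes a single large file holding more than 850,000 molecules. For that case, a minimal sketch using only the standard library might look like the following. It is not the answerer's method. It assumes that every molecule is preceded by a block of header lines starting with `##########` in `key: value` form (the same assumption the answer makes), and that the first molecule lists every property that should become a CSV column. The names `iter_headers` and `headers_to_csv` are hypothetical helpers made up for this illustration.

```python
import csv
from typing import Dict, Iterable, Iterator

HEADER_MARKER = "##########"  # assumed header prefix, same as in the answer


def iter_headers(lines: Iterable[str], marker: str = HEADER_MARKER) -> Iterator[Dict[str, str]]:
    """Yield one {property: value} dict per molecule, reading line by line."""
    current: Dict[str, str] = {}
    in_header = False
    for line in lines:
        if line.startswith(marker):
            if not in_header and current:
                # a new header block starts, so the previous molecule is complete
                yield current
                current = {}
            in_header = True
            # split on the first ":" only, so values containing ":" stay intact
            key, _, value = line[len(marker):].partition(":")
            current[key.strip()] = value.strip()
        elif line.strip():
            # any non-blank, non-header line ends the current header block
            in_header = False
    if current:
        yield current


def headers_to_csv(mol2_path: str, csv_path: str) -> int:
    """Write one CSV row per molecule and return the number of rows written."""
    count = 0
    with open(mol2_path, "r") as src, open(csv_path, "w", newline="") as dst:
        writer = None
        for record in iter_headers(src):
            if writer is None:
                # assumption: the first molecule lists every property of interest;
                # later extra keys are dropped and missing ones are left empty
                writer = csv.DictWriter(
                    dst, fieldnames=list(record), restval="", extrasaction="ignore"
                )
                writer.writeheader()
            writer.writerow(record)
            count += 1
    return count
```

Calling `headers_to_csv("molecules.mol2", "headers.csv")` (both file names are placeholders) streams through the input once and keeps only one molecule's header in memory at a time, which matters more than convenience at this file size. If the properties differ between molecules, a first pass that collects the full set of keys would be needed before writing.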

2023-05-22
