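I'm new to Python and have inherited some Python scripts that tidy up CSV dump files from a legacy software system so they can be imported into a new one. I have a large DataFrame of people, and I need to extract a subset DataFrame of related people and clean it up by removing null values, people who don't exist, and mirrored pair matches.
A function has already been written to do this, but when it runs it removes only a small number of rows and leaves in many pair matches and non-existent people. Can anyone help me find where it goes wrong?
The function is defined as follows: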

def extract_relations(df):
    fields = ['REF', 'PERSON',
              'RELATED_PERSON', 'RELATED_CODE']
    df = df[fields].copy()
    nan_value = float("NaN")
    df.replace('', nan_value, inplace=True)
    all_customers = df['PERSON'].astype(str).tolist()
    df.dropna(subset=['RELATED_PERSON'], inplace=True)
    rel_customers = df['RELATED_PERSON'].astype(str).tolist()

     # For each row in the dataframe
    for idx, row in df.iterrows():
        # Split the string by double-colon
        splitrel = re.split(':{2}', df.at[idx, 'RELATED_PERSON'])
        splitcode = re.split(':{2}', df.at[idx, 'RELATED_CODE'])

        # for each related person number
        for i in splitrel:
            # check if the related person has already been covered
            # on the list previously with mirroring relation
            exists = str(i) in rel_customers[:idx]
            # check if the related member is the same as the member
            same = str(i) == str(df.at[idx, 'PERSON'])
            # check if the related member is a member being migrated
            active = str(i) not in all_customers

            # if any of the above are true, remove the relation number and code 
            if exists or same or active:
                lstindex = splitrel.index(i)
                splitrel.pop(lstindex)
                splitcode.pop(lstindex)
                #del splitrel[lstindex]
                #del splitcode[lstindex]
                
            # Join concatenanted rows back up
            df.at[idx, 'RELATED_PERSON'] = '::'.join(splitrel)
            df.at[idx, 'RELATED_CODE'] = '::'.join(splitcode)

        # If no relations remain, drop the row
        if df.at[idx, 'RELATED_PERSON'] == '':
            df = df.drop(idx)

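I have edited out the code that writes the resulting DataFrame to a .txt file, but on inspecting the output, most of the rows that satisfy `exists or same or active` have not been removed. In addition, some valid entries are being removed. I expected my DataFrame to be about 50% smaller than it actually is!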
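One possible contributor to valid entries disappearing is that `rel_customers[:idx]` slices a plain list by position, while `idx` is a DataFrame index label; after `dropna` the two no longer line up. A minimal sketch with a made-up three-row frame (the data and the names `labels`, `own_value` and `seen_before` are illustrative assumptions, not taken from the real dump):

```python
import pandas as pd

# Hypothetical miniature of the real data: the middle row has no relation.
df = pd.DataFrame({
    "PERSON": [323, 105, 324],
    "RELATED_PERSON": ["324", None, "323"],
})
df = df.dropna(subset=["RELATED_PERSON"])

rel_customers = df["RELATED_PERSON"].tolist()  # list positions 0 and 1
labels = list(df.index)                        # index labels 0 and 2 survive

# Slicing by the label of the last row (2) covers the whole list,
# including that row's own entry, so it is not an "earlier rows only" view.
last_label = labels[-1]
own_value = df.at[last_label, "RELATED_PERSON"]
seen_before = rel_customers[:last_label]
print(labels, seen_before, own_value in seen_before)
```

In this toy case the row for person 324 finds its own related person in the "previous" slice, which is the kind of mismatch that could remove both rows of a mirrored pair.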
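Note: I tried `del splitrel[lstindex]` in place of `splitrel.pop(lstindex)` (commented out in the if statement), but it made no difference to the output.
I have read online that deleting items this way can cause a "shift" in the DataFrame, because the row/index numbers of what you are deleting then change, but the number of rows left undeleted makes me think that is not the issue here.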
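The "shifting" effect described above also applies to an ordinary Python list: removing items from a list while a `for` loop is iterating over it makes the loop skip the element that follows each removal. A small sketch using the `38181::38461::38462` value from the sample data; the helper names `filter_in_place` and `filter_rebuild` and the `known` set are hypothetical:

```python
def filter_in_place(items, known):
    """Mimics popping from a list while looping over it (skips elements)."""
    items = list(items)
    for item in items:
        if item not in known:
            items.pop(items.index(item))
    return items


def filter_rebuild(items, known):
    """Builds a new list instead of mutating the one being iterated."""
    return [item for item in items if item in known]


related = "38181::38461::38462".split("::")
known = {"37580"}  # none of the three related persons is a known person

print(filter_in_place(related, known))  # → ['38461'], one unknown survives
print(filter_rebuild(related, known))   # → []
```

Both `pop` and `del` mutate the list in the same way, which would be consistent with swapping one for the other making no difference.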
Any help would be greatly appreciated!
As requested, the code to create a sample DataFrame is copied below:

df = pd.DataFrame({'REF':'NK213','PERSON':[18,20,91,92,95,105,122,138,323,324,14208,14871,14984,15902,19253,35378,37580,47225,201391],'RELATED_PERSON':['14208','14871','14984','105','15071','','14016','136','324','323','','','','9995','19253::47225','35378','38181::38461::38462','','201391'],'RELATED_CODE':['2','2','2','2','2','','2','2','2','2','','','','2','6::6','6','6::6::6','','6']})

The output after running the function is:

| REF | PERSON | RELATED_PERSON | RELATED_CODE |
| ------ | ------ | ------ | ------ |
| NK213 | 18 | 14208 | 2 |
| NK213 | 20 | 14871 | 2 |
| NK213 | 91 | 14984 | 2 |
| NK213 | 92 | 105 | 2 |
| NK213 | 19253 | 47225 | 6 |
| NK213 | 37580 | 38461 | 6 |

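Expected:

1. The last row should not be returned, because person 38461 does not exist.
2. One of the following "matched pair" rows should have been kept in the output, but both were removed.

| REF | PERSON | RELATED_PERSON | RELATED_CODE |
| ------ | ------ | ------ | ------ |
| NK213 | 323 | 324 | 2 |
| NK213 | 324 | 323 | 2 |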
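Keeping exactly one row of each mirrored pair can be expressed by giving both rows the same order-independent key and dropping duplicates on it. A minimal sketch, assuming one relation per row (no `::`-joined values) and a hypothetical helper column `PAIR_KEY`; the three rows are a cut-down illustration, not the full sample:

```python
import pandas as pd

pairs = pd.DataFrame({
    "PERSON": ["323", "324", "18"],
    "RELATED_PERSON": ["324", "323", "14208"],
    "RELATED_CODE": ["2", "2", "2"],
})

# Sort each (person, related person) pair so that 323/324 and 324/323
# produce the same key. The sort is on strings, which is enough for
# building a key. PAIR_KEY is an illustrative helper column.
pairs["PAIR_KEY"] = [
    tuple(sorted(pair))
    for pair in zip(pairs["PERSON"], pairs["RELATED_PERSON"])
]

# keep="first" retains the first row seen for each key.
deduped = pairs.drop_duplicates(subset="PAIR_KEY", keep="first")
deduped = deduped.drop(columns="PAIR_KEY").reset_index(drop=True)
print(deduped)
```

Whether the first or the second row of a pair should survive is a business decision; `keep="last"` would retain the other one.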

import pandas as pd

df = pd.DataFrame({
    'REF': 'NK213',
    'PERSON': [18, 20, 91, 92, 95, 105, 122, 138, 323, 324, 14208, 14871,
               14984, 15902, 19253, 35378, 37580, 47225, 201391],
    'RELATED_PERSON': ['14208', '14871', '14984', '105', '15071', '', '14016',
                       '136', '324', '323', '', '', '', '9995', '19253::47225',
                       '35378', '38181::38461::38462', '', '201391'],
    'RELATED_CODE': ['2', '2', '2', '2', '2', '', '2', '2', '2', '2', '', '',
                     '', '2', '6::6', '6', '6::6::6', '', '6'],
})

def extract_relations(dataframe):
    fields = ['REF', 'PERSON', 'RELATED_PERSON', 'RELATED_CODE']
    df = dataframe[fields].copy()
    # Up to this point, the code is most similar to what you have posted.

    # We first generate a list of related persons with corresponding code.
    df['RELATED_PERSON'] = df['RELATED_PERSON'].str.split("::")
    df['RELATED_CODE'] = df['RELATED_CODE'].str.split("::")

    # Expand rows of persons with related persons.
    df = df.explode(['RELATED_PERSON', 'RELATED_CODE'], ignore_index=True)

    # Filter the rows in order to consider only the cases
    # where the related person is in the 'PERSON' column.
    df = df[df['RELATED_PERSON'].isin(df['PERSON'].astype('str'))]

    # Filter the rows in order to consider only the cases
    # where the related person is not the same as the person in consideration.
    df = df[df['RELATED_PERSON'] != df['PERSON'].astype('str')]

    return df.reset_index(drop=True)

extract_relations(df)

1 Answer

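Dear Mr. or Ms. dms_paul,
I'm afraid I had some trouble understanding the multiple criteria needed to generate the desired DataFrame. With that in mind, I am assuming you want to generate a table of persons in which each person has at least one related person who appears in the 'PERSON' column of the table itself, provided that the related person is not the person themselves.
If that is the case, I hope this code helps.
Input: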
| index | REF | PERSON | RELATED_PERSON | RELATED_CODE |
| ------ | ------ | ------ | ------ | ------ |
| 0 | NK213 | 18 | 14208 | 2 |
| 1 | NK213 | 20 | 14871 | 2 |
| 2 | NK213 | 91 | 14984 | 2 |
| 3 | NK213 | 92 | 105 | 2 |
| 4 | NK213 | 95 | 15071 | 2 |
| 5 | NK213 | 105 | | |
| 6 | NK213 | 122 | 14016 | 2 |
| 7 | NK213 | 138 | 136 | 2 |
| 8 | NK213 | 323 | 324 | 2 |
| 9 | NK213 | 324 | 323 | 2 |
| 10 | NK213 | 14208 | | |
| 11 | NK213 | 14871 | | |
| 12 | NK213 | 14984 | | |
| 13 | NK213 | 15902 | 9995 | 2 |
| 14 | NK213 | 19253 | 19253::47225 | 6::6 |
| 15 | NK213 | 35378 | 35378 | 6 |
| 16 | NK213 | 37580 | 38181::38461::38462 | 6::6::6 |
| 17 | NK213 | 47225 | | |
| 18 | NK213 | 201391 | 201391 | 6 |
Code: the `extract_relations(dataframe)` listing shown just before this answer.

Output:
| index | REF | PERSON | RELATED_PERSON | RELATED_CODE |
| ------ | ------ | ------ | ------ | ------ |
| 0 | NK213 | 18 | 14208 | 2 |
| 1 | NK213 | 20 | 14871 | 2 |
| 2 | NK213 | 91 | 14984 | 2 |
| 3 | NK213 | 92 | 105 | 2 |
| 4 | NK213 | 323 | 324 | 2 |
| 5 | NK213 | 324 | 323 | 2 |
| 6 | NK213 | 19253 | 47225 | 6 |
Yours faithfully,
Luis Martins.

2023-03-16
