Remaining Pandas DataFrame in Python

o75abkj4  posted on 2023-01-01  in Python

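I have 2 Pandas DataFrames.
For example:
1. The first DataFrame contains the Student ID and Students Name of 100 students.
2. The second DataFrame also contains the Student ID and Students Name of 20 students. These 20 students are also in the first DataFrame. I need to put the remaining 80 students into another DataFrame.
Which function can I use?
Can pd.merge solve this problem?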

owfi6suc  1#

Let me simplify your first DataFrame to 3 students and the second one to 1 student.

import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["a", "b", "c"]})
df2 = pd.DataFrame({"ID": [3], "Name": ["c"]})

# Stack both frames, then drop every row that appears more than once.
new_df = pd.concat([df1, df2]).drop_duplicates(keep=False)

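pd.concat concatenates the DataFrames, and drop_duplicates(keep=False) then removes every row that occurs more than once, so only the students that appear in just one of the DataFrames remain.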
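As for the pd.merge question: one possible alternative is a left merge with indicator=True, keeping only the rows flagged as left_only. The sketch below reuses the three-student example; the names merged and remaining are illustrative and not part of the original answer. It assumes both DataFrames share the same column names, so the merge joins on all of them.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["a", "b", "c"]})
df2 = pd.DataFrame({"ID": [3], "Name": ["c"]})

# Left merge on all shared columns; the `_merge` column records whether
# each row was found only in df1 ("left_only") or in both frames ("both").
merged = df1.merge(df2, how="left", indicator=True)

# Keep the rows that exist only in df1, then drop the helper column.
remaining = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(remaining)
```

Compared with the concat approach, this variant never returns rows that exist only in df2, and it keeps rows that happen to be duplicated inside df1.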

3b6akqbq  2#

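Another approach is to select the student names and IDs by filtering for the rows of the first DataFrame that do not exist in the second one.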
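As a minimal sketch of that filtering idea, assuming the ID column identifies each student consistently in both frames (the tiny students and subset frames and their values are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature versions of the two data frames.
students = pd.DataFrame(
    {"Student ID": [0, 1, 2, 3], "Students Name": ["Ann", "Bob", "Cid", "Dee"]}
)
subset = pd.DataFrame({"Student ID": [1, 3], "Students Name": ["Bob", "Dee"]})

# Boolean mask: True where the student's ID also appears in `subset`.
in_subset = students["Student ID"].isin(subset["Student ID"])

# `~` inverts the mask, keeping only the students missing from `subset`.
rest = students[~in_subset]
print(rest)
```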
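To generate test data, I used a package called names. Since you already have both DataFrames, you don't need to import the names package or define the "dummy" test data as I did. Those parts exist only to test the implementation. In other words, you can skip straight to the "Problem Solution" section.
The code below contains two ways to solve the problem. I added some comments to help you understand the differences between the two solutions.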

# == Necessary Imports =========================================================
# Uncomment the line that starts with `!pip`, to install the `names` package.
# You don't have to install this library, since you already have the student
# names. This package can generate random names, and is only used to
# create some "dummy" data to test our solution.
# NOTE: if you're running this code outside a Jupyter notebook,
#       remove the `!` from the `pip install` command before running it.
#       The `!` sign enables you to run terminal commands from within Jupyter
#       and is not required if you're not running this code inside a Jupyter notebook.

# !pip install names
import pandas as pd
import numpy as np

# This next import is only needed to generate some dummy test data. Since you'll
# be using real data that already exists, you don't have to import the `names`
# package. Therefore, you can comment this line.
import names

# == Creating Dummy Data =======================================================
# You don't have to execute these next few lines of code that are being used
# to create the `df` and `df2` data frames. They're only used to create some
# data to show how you could solve your problem.
#
# In order to test our implementation, we'll generate 2 data frames that have
# 2 columns: `'Students Name'` and `'Student ID'`. The first data frame (`df`)
# will have 100 rows with unique names and IDs. The second data frame (`df2`)
# will contain 20 of these 100 rows from the first data frame.
# To generate the names, we'll be using the `names.get_full_name()` function.
# For the IDs, we'll use the range function, that yields monotonically
# increasing numbers from 0 to 99.

# Number of rows to be generated.
nrows = 100
df = pd.DataFrame(
    {
        'Student ID': range(nrows),
        'Students Name': [names.get_full_name() for _ in range(nrows)],
    }
)

# To create the second data frame, that contains the names and IDs of 20 students
# from the first data frame we've just defined, we'll use the `numpy.random.choice`
# function, that enables us to select a certain amount of names with or without
# repeating values. In this case, we'll set `replace=False` to avoid selecting
# the same name multiple times. Then, we'll merge this new data frame with the
# original data, to obtain these 20 students' IDs as well.
# Finally, we'll sort values by ID just to organize students by ID in an ascending
# order, like the first data frame does. This last step is optional and does not
# impact the end results.
df2 = pd.DataFrame(
    {'Students Name': np.random.choice(df['Students Name'], replace=False, size=20)}
).merge(df, on='Students Name', how='inner').sort_values('Student ID')

# == Problem Solution ==========================================================
# I'm assuming you're looking for the names and IDs of students that exist on
# the first data frame that holds the 100 students, but do not exist on the
# second data frame that contains the information of 20 out of these 100
# students.

# -- OPTION 1 ------------------------------------------------------------------
# This option selects the students that are on the first Data Frame and not on
# the second data frame, based on the ID of each student.
# This solution won't work if your data frames assign different IDs to the
# same student. In these cases, you would need to match on a column that is
# consistent across both data frames, such as the name.
result = df[~df['Student ID'].isin(df2['Student ID'])]

# Checking whether the `result` data frame contains the information about
# 80 students, as expected.
print(
    f'Number of students that do not exist on the second data frame: {result.shape[0]:,}'
)
# Prints:
#
# "Number of students that do not exist on the second data frame: 80"

# -- OPTION 2 ------------------------------------------------------------------
# This option selects the students that are on the first Data Frame and not on
# the second based on the name OR ID of each student.
# The `|` character signifies an OR condition, therefore it keeps a row when at
# least one of the conditions is True, i.e. when its ID or its name is missing
# from the second data frame.
# The `~` sign at the start of each condition represents a negation of that
# condition.
result = df[
    (~df['Student ID'].isin(df2['Student ID']))
    | (~df['Students Name'].isin(df2['Students Name']))
]

# Checking whether the `result` data frame contains the information about
# 80 students, as expected.
print(
    f'Number of students that do not exist on the second data frame: {result.shape[0]:,}'
)
# Prints:
#
# "Number of students that do not exist on the second data frame: 80"
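To see how combining the two conditions behaves when IDs disagree between the frames, here is a small deterministic sketch. The data is hypothetical: "Bob" is deliberately listed under a different ID in the second frame, and the names id_missing, name_missing, either_missing and both_missing are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Student ID": [0, 1, 2], "Students Name": ["Ann", "Bob", "Cid"]})
# Hypothetical second frame: Bob appears with ID 7 instead of 1.
df2 = pd.DataFrame({"Student ID": [7, 2], "Students Name": ["Bob", "Cid"]})

id_missing = ~df["Student ID"].isin(df2["Student ID"])
name_missing = ~df["Students Name"].isin(df2["Students Name"])

# OR: a row is dropped only when BOTH its ID and its name appear in df2.
either_missing = df[id_missing | name_missing]

# AND: a row is dropped as soon as its ID OR its name appears in df2.
both_missing = df[id_missing & name_missing]

print(either_missing["Students Name"].tolist())
print(both_missing["Students Name"].tolist())
```

With `|`, Bob stays in the result because his ID does not match; with `&`, he is excluded because his name matches. Which of the two is appropriate depends on which column you trust to identify a student.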

Here is what the result DataFrame generated by the solution looks like:
