
o75abkj4 posted on 2023-01-01 in Python

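1. The first data frame contains the Student ID and Students Name of 100 students.
2. The second data frame contains the Student ID and Students Name of 20 students. These 20 students are also in the first data frame. How can I put the remaining 80 students into another data frame?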




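import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["a", "b", "c"]})
df2 = pd.DataFrame({"ID": [3], "Name": ["c"]})

# Rows present in both frames occur twice after concat; keep=False drops every copy.
new_df = pd.concat([df1, df2]).drop_duplicates(keep=False)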

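pd.concat concatenates the two DataFrames, and drop_duplicates(keep=False) then removes every row that appears more than once, so only the rows that are in df1 but not in df2 remain.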
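One caveat with this approach: drop_duplicates(keep=False) compares whole rows, so it also removes any row that is repeated inside df1 itself, and it keeps both copies when a student's row differs between the frames in any column. As an alternative, here is a minimal sketch of an anti-join on the ID column using merge with indicator=True. It reuses the toy df1 and df2 from above; the names merged and anti_join are made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["a", "b", "c"]})
df2 = pd.DataFrame({"ID": [3], "Name": ["c"]})

# Left-join df1 against the distinct IDs of df2. `indicator=True` adds a
# `_merge` column telling whether each row was found in both frames or
# only in the left one.
merged = df1.merge(
    df2[["ID"]].drop_duplicates(), on="ID", how="left", indicator=True
)

# Keep the rows that exist only in df1, then drop the helper column.
anti_join = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(anti_join)
```

Because only the ID column is compared, a mismatch in the Name column between the two frames does not affect the result.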



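To generate test data, I used a package called names. Since you already have both DataFrames, you don't need to import the names package or define the "dummy" test data as I do. Those parts exist only to test the implementation. In other words, you can skip straight to the "Problem Solution" part.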

# == Necessary Imports =========================================================
# Uncomment the line that starts with `!pip`, to install the `names` package.
# You don't have to install this library, since you already have the student
# names. This package can generate random names, and is only used to
# create some "dummy" data to test our solution.
# NOTE: if you're running this code outside a Jupyter notebook,
#       remove the `!` from the `pip install` command before running it.
#       The `!` sign enables you to run terminal commands from within Jupyter
#       and is not required if you're not running this code inside a Jupyter notebook.

# !pip install names
import pandas as pd
import numpy as np

# This next import is only needed to generate some dummy test data. Since you'll
# be using real data that already exists, you don't have to import the `names`
# package. Therefore, you can comment this line.
import names

# == Creating Dummy Data =======================================================
# You don't have to execute these next few lines of code that are being used
# to create the `df` and `df2` data frames. They're only used to create some
# data to show how you could solve your problem.
# In order to test our implementation, we'll generate 2 data frames that have
# 2 columns: `'Students Name'` and `'Student ID'`. The first data frame (`df`)
# will have 100 rows with unique names and IDs. The second data frame (`df2`)
# will contain 20 of these 100 rows from the first data frame.
# To generate the names, we'll be using the `names.get_full_name()` function.
# For the IDs, we'll use the range function, that yields monotonically
# increasing numbers from 0 to 99.

# Number of rows to be generated.
nrows = 100
df = pd.DataFrame(
    {
        'Student ID': range(nrows),
        'Students Name': [names.get_full_name() for _ in range(nrows)],
    }
)

# To create the second data frame, that contains the names and IDs of 20 students
# from the first data frame we've just defined, we'll use the `numpy.random.choice`
# function, that enables us to select a certain amount of names with or without
# repeating values. In this case, we'll set `replace=False` to avoid selecting
# the same name multiple times. Then, we'll merge this new data frame with the
# original data, to obtain these 20 students ID as well.
# Finally, we'll sort values by ID just to organize students by ID in an ascending
# order, like the first data frame does. This last step is optional and does not
# impact the end results.
df2 = pd.DataFrame(
    {'Students Name': np.random.choice(df['Students Name'], replace=False, size=20)}
).merge(df, on='Students Name', how='inner').sort_values('Student ID')

# == Problem Solution ==========================================================
# I'm assuming you're looking for the names and IDs of students that exist on
# the first data frame that holds the 100 students, but do not exist on the
# second data frame that contains the information of 20 out of these 100
# students.

# -- OPTION 1 ------------------------------------------------------------------
# This option selects the students that are on the first Data Frame and not on
# the second data frame, based on the ID of each student.
# This solution won't work if you have data frames that assign different IDs
# for the same student. In these cases, you might consider using the
# 'OPTION 2' approach instead.
result = df[~df['Student ID'].isin(df2['Student ID'])]

# Checking whether the `result` data frame contains the information about
# 80 students, as expected.
print(
    f'Number of students that do not exist on the second data frame: {result.shape[0]:,}'
)
# Prints:
# "Number of students that do not exist on the second data frame: 80"

# -- OPTION 2 ------------------------------------------------------------------
# This option selects the students that are on the first data frame and not on
# the second, based on both the name and the ID of each student: a student is
# kept only when neither their ID nor their name appears on the second data frame.
# The `&` character signifies an AND condition, therefore it keeps a row only
# when both conditions are True. Combining them with `|` instead would keep a
# student whose name matches but whose ID differs.
# The `~` sign at the start of each condition represents a negation of that
# condition.
result = df[
    (~df['Student ID'].isin(df2['Student ID']))
    & (~df['Students Name'].isin(df2['Students Name']))
]

# Checking whether the `result` data frame contains the information about
# 80 students, as expected.
print(
    f'Number of students that do not exist on the second data frame: {result.shape[0]:,}'
)
# Prints:
# "Number of students that do not exist on the second data frame: 80"

Here is what the result DataFrame generated by the solution looks like:
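Because the test data above is random, the exact rows differ from run to run. As a reproducible illustration, here is a minimal sketch that applies the ID-based filter to a small hand-written data set. The five students and the names all_students, subset and remaining are made up for this example and are not part of the original data.

```python
import pandas as pd

# Hypothetical roster of five students (made-up names and IDs).
all_students = pd.DataFrame(
    {
        "Student ID": [0, 1, 2, 3, 4],
        "Students Name": [
            "Ana Lima", "Ben Ortiz", "Cara Singh", "Dev Patel", "Eli Novak",
        ],
    }
)

# Hypothetical second frame: two students who are also in the roster.
subset = all_students[all_students["Student ID"].isin([1, 3])]

# Keep the students whose ID does not appear in the second frame.
remaining = all_students[~all_students["Student ID"].isin(subset["Student ID"])]
print(remaining)
```

The printed frame holds the three students with IDs 0, 2 and 4, and keeps their original index labels from the roster.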
