Pandas合并101只

nxagd54h  于 2022-09-21  发布在  其他
关注(0)|答案(8)|浏览(185)
  • 如何对Pandas进行(INNER|(LEFT|RIGHT|FULL)OUTER)JOIN
  • 如何在合并后为缺少的行添加NAN?
  • 合并后如何摆脱NAN?
  • 我可以在索引上合并吗?
  • 如何合并多个DataFrame?
  • 与Pandas交叉加入
  • mergejoinconcatupdate?谁?什么?为什么?!

..。还有更多。我见过这些反复出现的问题,询问Pandas合并功能的各个方面。今天,关于Merge及其各种用例的大部分信息都分散在数十篇措辞拙劣、无法搜索的帖子中。这里的目的是为子孙后代整理一些更重要的观点。

这篇问答是关于Pandas常见习语的一系列有用的用户指南中的下一篇(参见this post on pivotingthis post on concatenation,我将在后面讨论它们)。

请注意,这篇文章并不是要取代the documentation,所以请也读一读!其中一些例子就是从那里取来的。

目录

为了便于访问。

w8rqjzmb

w8rqjzmb1#

这篇文章旨在为读者提供一本关于与Pandas合并的入门读物,如何使用它,以及何时不使用它。

特别是,以下是这篇帖子将经历的:

  • 基础-连接类型(左、右、外、内)

  • 合并不同列名

  • 合并多列

  • 避免在输出中出现重复的合并键列

这篇帖子(以及我在这个帖子上的其他帖子)不会经历的:

  • 与业绩有关的讨论和时间安排(目前)。最值得注意的是,在适当的地方提到了更好的替代方案。
  • 处理后缀、删除多余的列、重命名输出以及其他特定用例。还有其他(阅读:更好的)帖子来处理这个问题,所以想办法吧!
    除非另有说明,否则大多数示例在演示各种功能时默认使用内连接操作。

此外,这里的所有DataFrame都可以复制和复制,这样您就可以玩它们了。此外,请参阅this post以了解如何从剪贴板中读取DataFrame。

最后,连接操作的所有可视化表示都是使用Google Drawing手绘的。灵感来自here

说够了--教我怎么用merge就行了!

设置与基础

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

为简单起见,键列具有相同的名称(目前)。

内部联接

这一点以及即将公布的数字都遵循这一惯例:

  • 蓝色**表示合并结果中存在的行
  • 红色**表示从结果中排除(即删除)的行
  • 绿色**表示结果中替换为NaNs的缺失值

要执行内部联接,请在左边的DataFrame上调用merge,并将右边的DataFrame和联接键(至少)指定为参数。

left.merge(right, on='key')

# Or, if you want to be explicit

# left.merge(right, on='key', how='inner')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

这只返回leftright中共享公用键(在本例中为“B”和“D”)的行。

左外部联接,或左联接由表示

这可以通过指定how='left'来执行。

left.merge(right, on='key', how='left')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

请仔细注意此处放置的NaN。如果指定how='left',则只使用left中的密钥,而right中缺少的数据将替换为NaN。

同样,对于右外部联接,或右联接,它是...

...指定how='right'

left.merge(right, on='key', how='right')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

这里使用了right中的密钥,并用NaN替换了left中缺失的数据。

最后,对于完全外连接,由

指定how='outer'

left.merge(right, on='key', how='outer')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

这使用了两个帧中的关键点,并为这两个帧中缺少的行插入了NAN。

文档很好地总结了这些不同的合并:

其他联接--左排除、右排除、全排除/反联接

如果您需要左排除JOIN右排除JOIN两个步骤。

对于左排除联接,表示为

首先执行左外部联接,然后只过滤来自left的行(排除来自右侧的所有行),

(left.merge(right, on='key', how='left', indicator=True)
     .query('_merge == "left_only"')
     .drop('_merge', 1))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN

哪里,

left.merge(right, on='key', how='left', indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

类似地,对于排除权限的联接,

(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop('_merge', 1))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

最后,如果需要进行只保留来自左侧或右侧的键的合并,而不是同时保留这两个键(IOW,执行反联接),

你可以用类似的方式来做这个-

(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop('_merge', 1))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357

键列名称不同

如果键列的名称不同-例如,left具有keyLeft,而right具有keyRight而不是key-则必须指定left_onright_on作为参数,而不是on

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

输出避免重复键列

合并left中的keyLeftright中的keyRight时,如果您只需要输出中的keyLeftkeyRight中的一个(而不是两者),则可以首先设置索引作为初步步骤。

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

将其与前面命令的输出(即left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')的输出)进行对比,您会注意到缺少keyLeft。您可以根据将哪个帧的索引设置为键来确定要保留哪一列。比方说,当执行某些外部联接操作时,这可能很重要。

只合并DataFrames中的一列

例如,考虑

right3 = right.assign(newcol=np.arange(len(right)))
right3
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

如果您只需要合并“NEWCOL”(不合并任何其他列),您通常可以在合并之前只对列进行子集:

left.merge(right3[['key', 'newcol']], on='key')

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

如果您正在执行左外部联接,则性能更好的解决方案将涉及map


# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))

left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

如前所述,这类似于,但比

left.merge(right3[['key', 'newcol']], on='key', how='left')

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

多列合并

要联接多个列,请为on(或left_onright_on,视情况而定)指定列表。

left.merge(right, on=['key1', 'key2'] ...)

或者,如果名字不同,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])

其他有用的merge*操作和函数

本部分仅涵盖最基本的内容,旨在满足您的胃口。有关更多示例和用例,请参阅documentation on merge, join, and concat以及指向功能规范的链接。

继续阅读

跳到Pandas合并101中的其他主题继续学习:

yzuktlbb

yzuktlbb2#

pd.concat([df0, df1], kwargs)的补充视觉视图。请注意,kwarg axis=0axis=1的含义不像df.mean()df.apply(func)那样直观

tcomlyy6

tcomlyy63#

加入101

这些动画可能会更好地从视觉上解释你。片酬:Garrick Aden-Buie tidyexplain repo

内联接

外联接或全联接

右联

左联接

ou6hu8tu

ou6hu8tu4#

In this answer, I will consider practical examples.

The first one, is of pandas.concat.

The second one, of merging dataframes from the index of one and the column of another one.

1. pandas.concat

Considering the following DataFrames with the same column names:

Preco2018with size (8784, 5)

Preco 2019with size (8760, 5)

That have the same column names.

You can combine them using pandas.concat, by simply

import pandas as pd

frames = [Preco2018, Preco2019]

df_merged = pd.concat(frames)

Which results in a DataFrame with the following size (17544, 5)

If you want to visualize, it ends up working like this

(Source)

2. Merge by Column and Index

In this part, I will consider a specific case: If one wants to merge the index of one dataframe and the column of another dataframe.

Let's say one has the dataframe Geo with 54 columns, being one of the columns the Date Data, which is of type datetime64[ns].

And the dataframe Price that has one column with the price and the index corresponds to the dates

In this specific case, to merge them, one uses pd.merge

merged = pd.merge(Price, Geo, left_index=True, right_on='Data')

Which results in the following dataframe

yshpjwxd

yshpjwxd5#

This post will go through the following topics:

  • Merging with index under different conditions

  • options for index-based joins: merge, join, concat

  • merging on indexes

  • merging on index of one, column of other

  • effectively using named indexes to simplify merging syntax

BACK TO TOP

Index-based joins

TL;DR

There are a few options, some simpler than others depending on the use case.

  • DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)

  • supports inner/left/right/full

  • can only join two at a time

  • supports column-column, index-column, index-index joins

  • DataFrame.join (join on index)

  • supports inner/left (default)/right/full

  • can join multiple DataFrames at a time

  • supports index-index joins

  • pd.concat (joins on index)

  • supports inner/full (default)

  • can join multiple DataFrames at a time

  • supports index-index joins

Index to index joins

Setup & Basics

import pandas as pd
import numpy as np

np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)}, 
                    index=['A', 'B', 'C', 'D'])    
right = pd.DataFrame(data={'value': np.random.randn(4)},  
                     index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right

           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076

Typically, aninner join on indexwould look like this:

left.merge(right, left_index=True, right_index=True)

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Other joins follow similar syntax.

Notable Alternatives

1.**DataFrame.join**defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here.

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

          value_x   value_y
 idxkey                    
 B      -0.402655  0.543843
 D      -0.524349  0.013135

Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:

left.join(right)
 ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

Since the column names are the same. This would not be a problem if they were differently named.

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')

         leftvalue     value
 idxkey                     
 B       -0.402655  0.543843
 D       -0.524349  0.013135

1.**pd.concat**joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so how='inner' is required here..

pd.concat([left, right], axis=1, sort=False, join='inner')

            value     value
 idxkey                    
 B      -0.402655  0.543843
 D      -0.524349  0.013135

For more information on concat, see this post.

Index to Column joins

To perform an inner join using index of left, column of right, you will use DataFrame.merge a combination of left_index=True and right_on=....

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2

  colkey     value
0      B  0.543843
1      D  0.013135
2      E -0.326498
3      F  1.385076

left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.

join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.

Effectively using Named Index [pandas >= 0.23]

If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to specify the index name to on (or left_on and right_on as necessary).

left.merge(right, on='idxkey')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

For the previous example of merging with the index of left, column of right, you can use left_on with the index name of left:

left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

gg0vcinb

gg0vcinb6#

This post will go through the following topics:

  • how to correctly generalize to multiple DataFrames (and why merge has shortcomings here)
  • merging on unique keys
  • merging on non-unqiue keys

BACK TO TOP

Generalizing to multiple DataFrames

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:

df1.merge(df2, ...).merge(df3, ...)

However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.

Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.


# Setup.

np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C] 

# Note: the "key" column values are unique, so the index is unique.

A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')

dfs2 = [A2, B2, C2]

Multiway merge on unique keys

If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that**pd.concat joins DataFrames on the index**.


# Merge on `key` column. You'll need to set the index before concatenating

pd.concat(
    [df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# Merge on `key` index.

pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0

Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).

Multiway merge on keys with duplicates

concat is fast, but has its shortcomings. It cannot handle duplicates.

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).


# Join on `key` column. Set as the index first.

# For inner join. For left join, omit the "how" argument.

A.set_index('key').join([B2, C2], how='inner').reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# Join on `key` index.

A3.set_index('key').join([B2, C2], how='inner')

       valueA    valueB  valueC
key                            
D    1.454274 -0.977278     1.0
D    0.761038 -0.977278     1.0

Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

hc2pp10m

hc2pp10m7#

Pandas at the moment does not support inequality joins within the merge syntax; one option is with the conditional_join function from pyjanitor - I am a contributor to this library:


# pip install pyjanitor

import pandas as pd
import janitor 

left.conditional_join(right, ('value', 'value', '>'))

   left           right
    key     value   key     value
0     A  1.764052     D -0.977278
1     A  1.764052     F -0.151357
2     A  1.764052     E  0.950088
3     B  0.400157     D -0.977278
4     B  0.400157     F -0.151357
5     C  0.978738     D -0.977278
6     C  0.978738     F -0.151357
7     C  0.978738     E  0.950088
8     D  2.240893     D -0.977278
9     D  2.240893     F -0.151357
10    D  2.240893     E  0.950088
11    D  2.240893     B  1.867558

left.conditional_join(right, ('value', 'value', '<'))

  left           right
   key     value   key     value
0    A  1.764052     B  1.867558
1    B  0.400157     E  0.950088
2    B  0.400157     B  1.867558
3    C  0.978738     B  1.867558

The columns are passed as a variable argument of tuples, each tuple comprising of a column from the left dataframe, column from the right dataframe, and the join operator, which can be any of (>, <, >=, <=, !=). In the example above, a MultiIndex column is returned, because of overlaps in the column names.

Performance wise, this is better than a naive cross join:

np.random.seed(0)
dd = pd.DataFrame({'value':np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start':np.random.randint(100000, size=1_000), 
                   'end':np.random.randint(100000, size=1_000)})

dd.head()

   value
0  68268
1  43567
2  42613
3  45891
4  21243

df.head()

   start    end
0  71915  47005
1  64284  44913
2  13377  96626
3  75823  38673
4  29151    575

%%timeit
out = df.merge(dd, how='cross')
out.loc[(out.start < out.value) & (out.end > out.value)]
5.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
280 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

out = df.merge(dd, how='cross')
out = out.loc[(out.start < out.value) & (out.end > out.value)]
A = df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
columns = A.columns.tolist()
A = A.sort_values(columns, ignore_index = True)
out = out.sort_values(columns, ignore_index = True)

A.equals(out)
True
ig9co6j1

ig9co6j18#

I think you should include this in your explanation as it is a relevant merge that I see fairly often, which is termed cross-join I believe. This is a merge that occurs when unique df's share no columns, and it simply merging 2 dfs side-by-side:

The setup:

names1 = [{'A':'Jack', 'B':'Jill'}]

names2 = [{'C':'Tommy', 'D':'Tammy'}]

df1=pd.DataFrame(names1)
df2=pd.DataFrame(names2)
df_merged= pd.merge(df1.assign(X=1), df2.assign(X=1), on='X').drop('X', 1)

This creates a dummy X column, merges on the X, and then drops it to produce

df_merged:

A     B      C      D
0  Jack  Jill  Tommy  Tammy

相关问题