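- How do I perform an (INNER | (LEFT | RIGHT | FULL) OUTER) JOIN with pandas?
- How do I add NaNs for missing rows after a merge?
- How do I get rid of NaNs after merging?
- Can I merge on the index?
- How do I merge multiple DataFrames?
- Cross join with pandas
- `merge`? `join`? `concat`? `update`? Who? What? Why?!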
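... and more. I have seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information about merge and its various use cases is currently scattered across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting and this post on concatenation, which I will touch on later).
Please note that this post is not meant to replace the documentation, so please read that as well! Some of the examples are taken from there.
Table of Contents
For ease of access.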
8 Answers
w8rqjzmb1#
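This post aims to give readers a primer on merging with pandas: how to use it, and when not to use it.
In particular, here is what this post will go through:
- The basics: types of joins (LEFT, RIGHT, OUTER, INNER)
  - merging with different column names
  - merging with multiple columns
  - avoiding a duplicate merge key column in the output
What this post (and my other posts in this thread) will not go through:
Note: unless otherwise specified, most examples default to INNER JOIN operations while demonstrating various features.
Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
Lastly, all visual representations of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.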
Enough talk, just show me how to use `merge`!
Setup & Basics
For simplicity, the key column has the same name in both frames (for now).
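The setup code did not survive on this page, so here is a minimal stand-in to experiment with. The frame names `left` and `right` and the shared keys "B" and "D" come from the text; the remaining keys and all numeric values are my own placeholders.

```python
import pandas as pd

# Placeholder data: only the overlap of the keys ("B" and "D") matters here.
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [5.0, 6.0, 7.0, 8.0]})

# Keys that appear in both frames
shared = sorted(set(left["key"]) & set(right["key"]))
print(shared)
```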
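An INNER JOIN is represented by
[figure: INNER JOIN]
Note: this figure and the ones that follow all use the same convention, in which missing values appear as `NaN`s.
To perform an INNER JOIN, call `merge` on the left DataFrame, passing the right DataFrame and (at the very least) the join key as arguments. This returns only the rows from `left` and `right` that share a common key (in this example, "B" and "D").
A LEFT OUTER JOIN, or LEFT JOIN, is represented by
[figure: LEFT OUTER JOIN]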
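This can be performed by specifying `how='left'`. Carefully note the placement of NaNs here. If you specify `how='left'`, then only keys from `left` are used, and missing data from `right` is replaced by NaN.
Similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN, which is
[figure: RIGHT OUTER JOIN]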
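...specify `how='right'`. Here, keys from `right` are used, and missing data from `left` is replaced by NaN.
Finally, for the FULL OUTER JOIN, given by
[figure: FULL OUTER JOIN]
specify `how='outer'`. This uses the keys from both frames, and NaNs are inserted for the rows missing in either one.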
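The four join types described above can be sketched side by side. The data is placeholder data of my own (only the keys "B" and "D" being shared is taken from the text), so treat the output as illustrative rather than as the original example.

```python
import pandas as pd

# Placeholder frames; both have a "key" column and a "value" column.
left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [5.0, 6.0, 7.0, 8.0]})

inner = left.merge(right, on="key")                   # how="inner" is the default
left_join = left.merge(right, on="key", how="left")   # all keys from left
right_join = left.merge(right, on="key", how="right") # all keys from right
outer = left.merge(right, on="key", how="outer")      # keys from both frames

# Both frames have a "value" column, so pandas appends its default
# suffixes, giving "value_x" (from left) and "value_y" (from right).
print(outer)
```

Rows with no match on the other side get NaN in the columns coming from that side.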
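The documentation summarizes these various merges nicely:
[figure: summary of merge types from the pandas documentation]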
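Other JOINs: LEFT-excluding, RIGHT-excluding, and FULL-excluding/ANTI JOINs
LEFT-excluding and RIGHT-excluding JOINs can each be done in two steps.
For a LEFT-excluding JOIN, represented as
[figure: LEFT-excluding JOIN]
start by performing a LEFT OUTER JOIN, then filter down to the rows coming from `left` only (excluding everything from the right).
Similarly, for a RIGHT-excluding JOIN, perform a RIGHT OUTER JOIN and keep only the rows coming from `right`.
Lastly, if you need a merge that retains only keys from the left or the right, but not both (in other words, an ANTI-JOIN),
[figure: ANTI JOIN]
you can do this in a similar fashion.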
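One way to sketch these excluding joins is with the `indicator=True` argument of `merge`, which adds a `_merge` column recording where each row came from. The frames are placeholders of my own, and this is one possible formulation, not necessarily the exact code that was originally posted here.

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [5.0, 6.0, 7.0, 8.0]})

# LEFT-excluding: left outer join, then keep rows found only in left.
left_excl = (left.merge(right, on="key", how="left", indicator=True)
                 .query('_merge == "left_only"')
                 .drop(columns="_merge"))

# RIGHT-excluding: right outer join, then keep rows found only in right.
right_excl = (left.merge(right, on="key", how="right", indicator=True)
                  .query('_merge == "right_only"')
                  .drop(columns="_merge"))

# ANTI join: full outer join, then drop the rows present in both frames.
anti = (left.merge(right, on="key", how="outer", indicator=True)
            .query('_merge != "both"')
            .drop(columns="_merge"))

print(anti)
```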
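Different names for key columns
If the key columns are named differently (for example, `left` has `keyLeft` and `right` has `keyRight` instead of `key`), then you have to specify `left_on` and `right_on` as arguments instead of `on`.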
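A small sketch of merging on differently named key columns. The names `left2`, `right2`, `keyLeft` and `keyRight` are taken from the text; building them by renaming placeholder frames is my own assumption.

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [5.0, 6.0, 7.0, 8.0]})

# Give the key column a different name in each frame.
left2 = left.rename(columns={"key": "keyLeft"})
right2 = right.rename(columns={"key": "keyRight"})

result = left2.merge(right2, left_on="keyLeft", right_on="keyRight", how="inner")

# Both key columns are kept in the output.
print(result.columns.tolist())
```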
Avoiding a duplicate key column in the output
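When merging on `keyLeft` from `left` and `keyRight` from `right`, if you want only one of `keyLeft` or `keyRight` (but not both) in the output, you can start by setting the index as a preliminary step. Contrast the result with the output of the previous command (that is, `left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')`) and you will notice that `keyLeft` is missing.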
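The preliminary `set_index` step might look like the sketch below. `left2` and `right2` follow the text, while `left3` and the data are hypothetical names and values I chose for illustration.

```python
import pandas as pd

left2 = pd.DataFrame({"keyLeft": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
right2 = pd.DataFrame({"keyRight": ["B", "D", "E", "F"], "value": [5.0, 6.0, 7.0, 8.0]})

# Move keyLeft into the index, then join the index of left to keyRight.
left3 = left2.set_index("keyLeft")
result = left3.merge(right2, left_index=True, right_on="keyRight")

# Only keyRight survives as a column.
print(result.columns.tolist())
```

Setting the index on `right2` instead (and using `right_index=True` with `left_on='keyLeft'`) would keep `keyLeft` and drop `keyRight`.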
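You can decide which column to keep based on which frame's index is set as the key. This may matter when, say, performing an OUTER JOIN.
Merging only a single column from one of the DataFrames
For example, consider a frame with an additional column, "newcol".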
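If you need to merge only "newcol" (without any of the other columns), you can usually just subset the columns before merging.
If you are doing a LEFT OUTER JOIN, a more performant solution would involve `map`. As mentioned, this is similar to, but faster than, the equivalent `merge` with `how='left'`.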
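Both approaches are sketched below. The frame name `right3` and the contents of "newcol" are assumptions of mine. The `map` variant relies on the keys in `right3` being unique, and I have not benchmarked the speed difference mentioned above.

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({"key": ["B", "D", "E", "F"], "value": [5.0, 6.0, 7.0, 8.0]})

# Hypothetical extra column on the right frame.
right3 = right.assign(newcol=list(range(len(right))))

# Subset to the key plus the one wanted column before merging (inner join).
subset_merge = left.merge(right3[["key", "newcol"]], on="key")

# Left join via map: look up each left key in a key -> newcol Series.
lookup = right3.set_index("key")["newcol"]
mapped = left.assign(newcol=left["key"].map(lookup))

# The equivalent left merge, for comparison.
via_merge = left.merge(right3[["key", "newcol"]], on="key", how="left")

print(mapped)
```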
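Merging on multiple columns
To join on more than one column, specify a list for `on` (or for `left_on` and `right_on`, as appropriate). Or, if the names are different, pass one list to `left_on` and one to `right_on`.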
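A minimal sketch of a two-column key, first with identical names and then with different names. The frames `sales` and `prices` and all their columns are hypothetical, made up only for illustration.

```python
import pandas as pd

# Hypothetical frames keyed on the pair (store, item).
sales = pd.DataFrame({"store": ["S1", "S1", "S2"],
                      "item": ["a", "b", "a"],
                      "qty": [3, 5, 2]})
prices = pd.DataFrame({"store": ["S1", "S2", "S2"],
                       "item": ["a", "a", "b"],
                       "price": [1.0, 1.5, 2.0]})

# Same column names in both frames: pass a list to `on`.
same_names = sales.merge(prices, on=["store", "item"])

# Different column names: pass matching lists to left_on and right_on.
prices2 = prices.rename(columns={"store": "shop", "item": "product"})
diff_names = sales.merge(prices2,
                         left_on=["store", "item"],
                         right_on=["shop", "product"])

print(same_names)
```

The lists are matched positionally, so `store` pairs with `shop` and `item` with `product`.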
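Other useful `merge*` operations and functions
- Besides `merge`, `DataFrame.update` and `DataFrame.combine_first` are also used in certain cases to update one DataFrame with another.
- `pd.merge_ordered` is a useful function for ordered JOINs.
- `pd.merge_asof` (read: merge_asOf) is useful for approximate joins.
This section covers only the very basics and is meant to whet your appetite. For more examples and use cases, see the documentation on `merge`, `join`, and `concat`, as well as the links to the function specifications.
Continue Reading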
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins*
Index-based joins
Generalizing to multiple DataFrames
Cross join
* you are here
yzuktlbb2#
A supplemental visual view of `pd.concat([df0, df1], kwargs)`. Note that the meaning of the kwarg `axis=0` or `axis=1` is not as intuitive as it is for `df.mean()` or `df.apply(func)`.
tcomlyy63#
Joins 101
These animations might explain the joins better visually. Credits: Garrick Aden-Buie, tidyexplain repo
Inner Join
Outer Join or Full Join
Right Join
Left Join
ou6hu8tu4#
In this answer, I will consider practical examples.
The first one is of `pandas.concat`.
The second one is of merging dataframes from the index of one and the column of another one.
1. `pandas.concat`
Consider the following DataFrames with the same column names:
`Preco2018` with size (8784, 5)
`Preco2019` with size (8760, 5)
You can combine them using `pandas.concat`, which results in a DataFrame with size (17544, 5).
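To make the sizes concrete, here is a sketch with stand-in frames of the same shapes. The column names and the contents are placeholders of mine; only the frame names and shapes come from the text above.

```python
import numpy as np
import pandas as pd

cols = ["c1", "c2", "c3", "c4", "c5"]  # hypothetical column names

# Stand-ins with the same shapes as the frames described above.
Preco2018 = pd.DataFrame(np.zeros((8784, 5)), columns=cols)
Preco2019 = pd.DataFrame(np.ones((8760, 5)), columns=cols)

# Default axis=0 stacks the rows; the columns are aligned by name.
combined = pd.concat([Preco2018, Preco2019])

print(combined.shape)  # (17544, 5)
```

The original row labels are kept, so the index repeats here. Pass `ignore_index=True` if a fresh 0..n-1 index is wanted.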
If you want to visualize, it ends up working like this
(Source)
2. Merge by Column and Index
In this part, I will consider a specific case: merging the index of one dataframe with a column of another dataframe.
Let's say one has the dataframe `Geo` with 54 columns, one of which is the date column `Data`, of type `datetime64[ns]`.
And the dataframe `Price`, which has one column with the price and an index that corresponds to the dates.
In this specific case, to merge them, one uses `pd.merge`.
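A reduced sketch of that merge. `Geo` has only two columns here instead of 54, and the dates, the `region` column and the `Price` column name are made up. I am also assuming the date column is literally named `Data`.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])

# Reduced stand-in for the 54-column frame; "Data" holds the dates.
Geo = pd.DataFrame({"Data": dates, "region": ["N", "S", "E"]})

# One price column, indexed by date (only two of the three dates).
Price = pd.DataFrame({"Price": [10.0, 11.0]}, index=dates[:2])

# Join the index of Price to the "Data" column of Geo (inner by default).
merged = pd.merge(Price, Geo, left_index=True, right_on="Data")

print(merged)
```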
Which results in the following dataframe
yshpjwxd5#
This post will go through the following topics:
- Merging with index under different conditions
  - options for index-based joins: `merge`, `join`, `concat`
  - merging on indexes
  - merging on index of one, column of other
- effectively using named indexes to simplify merging syntax
Index-based joins
TL;DR
There are a few options, some simpler than others depending on the use case.
- `DataFrame.merge` with `left_index` and `right_index` (or `left_on` and `right_on` using named indexes)
  - supports inner/left/right/full
  - can only join two at a time
  - supports column-column, index-column, index-index joins
- `DataFrame.join` (join on index)
  - supports inner/left (default)/right/full
  - can join multiple DataFrames at a time
  - supports index-index joins
- `pd.concat` (joins on index)
  - supports inner/full (default)
  - can join multiple DataFrames at a time
  - supports index-index joins
Index to index joins
Setup & Basics
Typically, an inner join on index would look like this:
Other joins follow similar syntax.
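As the setup is missing here, the sketch below uses placeholder frames whose keys live in the index. The index name `idxkey` and the values are assumptions of mine.

```python
import pandas as pd

# Placeholder frames with the join keys stored in a named index.
left = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(["A", "B", "C", "D"], name="idxkey"))
right = pd.DataFrame({"value": [5.0, 6.0, 7.0, 8.0]},
                     index=pd.Index(["B", "D", "E", "F"], name="idxkey"))

# Inner join on the index of both frames.
inner = left.merge(right, left_index=True, right_index=True)

# For the other join types, add how="left", how="right" or how="outer".
outer = left.merge(right, left_index=True, right_index=True, how="outer")

print(inner)
```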
Notable Alternatives
1. **`DataFrame.join`** defaults to joins on the index. `DataFrame.join` does a LEFT OUTER JOIN by default, so `how='inner'` is necessary here. Note that I needed to specify the `lsuffix` and `rsuffix` arguments, since `join` would otherwise error out because the column names are the same. This would not be a problem if they were differently named.
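A sketch of the `join` variant, including the error case. The frames and the suffix strings `_x` and `_y` are placeholders I picked; any distinct suffixes would do.

```python
import pandas as pd

left = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(["A", "B", "C", "D"], name="idxkey"))
right = pd.DataFrame({"value": [5.0, 6.0, 7.0, 8.0]},
                     index=pd.Index(["B", "D", "E", "F"], name="idxkey"))

# Both frames have a "value" column, so suffixes are needed to tell them apart.
joined = left.join(right, how="inner", lsuffix="_x", rsuffix="_y")

# Without suffixes, join refuses to guess how to name the overlapping column.
error_message = None
try:
    left.join(right, how="inner")
except ValueError as exc:
    error_message = str(exc)

print(joined)
print(error_message)
```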
2. **`pd.concat`** joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so `join='inner'` is required here. For more information on `concat`, see this post.
Index to Column joins
To perform an inner join using the index of left and a column of right, use `DataFrame.merge` with a combination of `left_index=True` and `right_on=...`.
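An index-to-column join might be sketched like this. The frames, the index name `idxkey`, the column name `colkey` and the frame name `right2` are all assumptions of mine.

```python
import pandas as pd

# left keeps its keys in the index.
left = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(["A", "B", "C", "D"], name="idxkey"))

# right2 keeps its keys in an ordinary column.
right2 = pd.DataFrame({"colkey": ["B", "D", "E", "F"],
                       "value": [5.0, 6.0, 7.0, 8.0]})

# Match the index of left against the "colkey" column of right2.
result = left.merge(right2, left_index=True, right_on="colkey")

print(result)
```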
Other joins follow a similar structure. Note that only `merge` can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.
`join` and `concat` are not capable of mixed merges. You will need to set the index as a pre-step using `DataFrame.set_index`.
Effectively using Named Index [pandas >= 0.23]
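A brief sketch of the named-index shortcut this section describes, assuming pandas >= 0.23. The index name `idxkey`, the column name `colkey` and the data are my own placeholders.

```python
import pandas as pd

left = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(["A", "B", "C", "D"], name="idxkey"))
right2 = pd.DataFrame({"colkey": ["B", "D", "E", "F"],
                       "value": [5.0, 6.0, 7.0, 8.0]})

# "idxkey" is the name of left's index, not a column; merge accepts it anyway.
result = left.merge(right2, left_on="idxkey", right_on="colkey")

print(result)
```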
If your index is named, then from pandas >= 0.23, `DataFrame.merge` allows you to specify the index name to `on` (or `left_on` and `right_on` as necessary).
For the previous example of merging with the index of left and a column of right, you can use `left_on` with the index name of left.
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins*
Generalizing to multiple DataFrames
Cross join
* you are here
gg0vcinb6#
This post will go through the following topics:
- how to generalize to multiple DataFrames (and why `merge` has shortcomings here)
Generalizing to multiple DataFrames
Oftentimes, multiple DataFrames need to be merged together. Naively, this can be done by chaining `merge` calls. However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
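The naive chaining approach can be sketched as follows. The frames `A`, `B` and `C` and their contents are hypothetical.

```python
import pandas as pd

# Three hypothetical frames that share a "key" column.
A = pd.DataFrame({"key": ["A", "B", "C", "D"], "valueA": [1, 2, 3, 4]})
B = pd.DataFrame({"key": ["B", "D", "E", "F"], "valueB": [5, 6, 7, 8]})
C = pd.DataFrame({"key": ["D", "E", "J", "C"], "valueC": [9, 10, 11, 12]})

# Each merge joins two frames; chain one call per additional frame.
chained = A.merge(B, on="key").merge(C, on="key")

print(chained)
```

Only the key present in all three frames survives the two inner joins.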
Here I introduce `pd.concat` for multi-way joins on unique keys, and `DataFrame.join` for multi-way joins on non-unique keys. First, the setup.
Multiway merge on unique keys
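A sketch of the unique-key case that this section describes, with the setup included. The frames `A`, `B`, `C` and the list `dfs` are hypothetical placeholders.

```python
import pandas as pd

A = pd.DataFrame({"key": ["A", "B", "C", "D"], "valueA": [1, 2, 3, 4]})
B = pd.DataFrame({"key": ["B", "D", "E", "F"], "valueB": [5, 6, 7, 8]})
C = pd.DataFrame({"key": ["D", "E", "J", "C"], "valueC": [9, 10, 11, 12]})
dfs = [A, B, C]  # works the same for any number of frames

# concat aligns on the index, so move the key into the index first.
indexed = [df.set_index("key") for df in dfs]

inner = pd.concat(indexed, axis=1, join="inner")  # keys present in every frame
full = pd.concat(indexed, axis=1)                 # default join="outer"

print(inner)
```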
If your keys (here, the key could either be a column or an index) are unique, then you can use `pd.concat`. Note that **`pd.concat` joins DataFrames on the index**.
Omit `join='inner'` for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use `join`, described below).
Multiway merge on keys with duplicates
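A sketch of the duplicate-key case that this section goes on to describe, using `DataFrame.join` with a list of frames. `A3` is a hypothetical frame in which the key "D" appears twice; all the data is made up.

```python
import pandas as pd

# "D" appears twice in A3, so its keys are not unique.
A3 = pd.DataFrame({"key": ["A", "B", "C", "D", "D"], "valueA": [1, 2, 3, 4, 5]})
B = pd.DataFrame({"key": ["B", "D", "E", "F"], "valueB": [5, 6, 7, 8]})
C = pd.DataFrame({"key": ["D", "E", "J", "C"], "valueC": [9, 10, 11, 12]})

# join works on the index; pass the other frames as a list.
# Omit how="inner" to get the default LEFT OUTER JOIN.
result = (A3.set_index("key")
            .join([B.set_index("key"), C.set_index("key")], how="inner")
            .reset_index())

print(result)
```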
`concat` is fast, but has its shortcomings. It cannot handle duplicates.
In this situation, we can use `join` since it can handle non-unique keys (note that `join` joins DataFrames on their index; it calls `merge` under the hood and does a LEFT OUTER JOIN unless otherwise specified).
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames*
Cross join
* you are here
hc2pp10m7#
pandas at the moment does not support inequality joins within the `merge` syntax; one option is the `conditional_join` function from pyjanitor (I am a contributor to this library).
The columns are passed as a variable number of tuples, each comprising a column from the left dataframe, a column from the right dataframe, and the join operator, which can be any of `(>, <, >=, <=, !=)`. In the example above, a MultiIndex column is returned, because of overlaps in the column names.
Performance wise, this is better than a naive cross join:
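For reference, the naive cross-join baseline can be sketched in plain pandas as below. This is not the pyjanitor API. The frames and column names are hypothetical, and `how="cross"` assumes pandas >= 1.2.

```python
import pandas as pd

# Hypothetical frames for an inequality join on start > threshold.
df1 = pd.DataFrame({"id": [1, 2, 3], "start": [1, 5, 10]})
df2 = pd.DataFrame({"label": ["x", "y"], "threshold": [4, 8]})

# Step 1: build every (df1 row, df2 row) pair.
pairs = df1.merge(df2, how="cross")

# Step 2: keep only the pairs that satisfy the inequality.
result = pairs[pairs["start"] > pairs["threshold"]]

print(result)
```

The intermediate `pairs` frame has `len(df1) * len(df2)` rows, which is why this approach scales poorly on large inputs.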
ig9co6j18#
I think you should include this in your explanation, as it is a relevant merge that I see fairly often; I believe it is termed a cross join. This is a merge that occurs when two DataFrames share no columns: it pairs every row of one with every row of the other, which for single-row frames amounts to placing them side by side.
The setup:
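The dummy-column trick can be sketched as follows. The frame contents are placeholders I made up; `df_merged` and the column `X` follow the names used in this answer.

```python
import pandas as pd

# Two hypothetical single-row frames with no columns in common.
df1 = pd.DataFrame([{"A": "Jack", "B": "Jill"}])
df2 = pd.DataFrame([{"C": "Tommy", "D": "Tammy"}])

# Add a constant column X to both, merge on it, then drop it again.
df_merged = pd.merge(df1.assign(X=1), df2.assign(X=1), on="X").drop(columns="X")

print(df_merged)
```

On pandas >= 1.2, `df1.merge(df2, how="cross")` gives the same result without the dummy column.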
This creates a dummy X column, merges on the X, and then drops it to produce
df_merged: