One essential feature offered by Pandas is its high-performance, in-memory join and merge operations. If you have ever worked with databases, you should be familiar with this type of data interaction. The main interface for this is the pd.merge function, and we'll see a few examples of how it can work in practice.

For convenience, we will start by redefining the display() helper from the previous section, along with the standard imports:

In [1]:
import pandas as pd
import numpy as np
# display() is re-defined here exactly as in the previous section

Relational Algebra

The behavior implemented in pd.merge() is a subset of what is known as relational algebra, which is a formal set of rules for manipulating relational data, and forms the conceptual foundation of operations available in most databases. Pandas implements several of these fundamental building blocks in the pd.merge() function and the related join() method of Series and DataFrames.

Categories of Joins

The pd.merge() function implements a number of types of joins: one-to-one, many-to-one, and many-to-many. All three are accessed via an identical call to pd.merge(); the type of join performed depends on the form of the input data.

One-to-one joins

Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation seen in Combining Datasets: Concat & Append. As a concrete example, consider the following two DataFrames, which contain information on several employees in a company:

In [2]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')

Out[2]: df1
df2
To combine this information into a single DataFrame, we can use the pd.merge() function:

In [3]:
df3 = pd.merge(df1, df2)
df3

Out[3]:
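The one-to-one merge above can be sketched as a self-contained snippet, runnable outside the notebook (this rebuilds the example frames rather than relying on notebook state):

```python
import pandas as pd

# Rebuild the two example frames from In [2].
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# pd.merge finds the shared 'employee' column and uses it as the key;
# each employee appears exactly once on both sides, so this is one-to-one.
df3 = pd.merge(df1, df2)
print(df3)
```

Note that the rows of df2 are realigned on the key rather than kept in their original order.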
The pd.merge() function recognizes that each DataFrame has an "employee" column, and automatically joins using this column as a key. The result of the merge is a new DataFrame that combines the information from the two inputs. Notice that the order of entries in each column is not necessarily maintained: here the order of the "employee" column differs between df1 and df2, and pd.merge() correctly accounts for this.

Many-to-one joins

Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate. Consider the following example of a many-to-one join:

In [4]:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')

Out[4]: df3
df4
pd.merge(df3, df4)
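As a runnable sketch of the many-to-one case (rebuilding df3 explicitly so the snippet stands alone):

```python
import pandas as pd

# df3 combines employees with groups and hire dates, as constructed above;
# df4 maps each group to a single supervisor.
df3 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
                    'hire_date': [2008, 2012, 2004, 2014]})
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})

# 'group' is duplicated in df3 but unique in df4: a many-to-one join.
# The single Engineering supervisor is repeated for both Engineering rows.
merged = pd.merge(df3, df4)
print(merged)
```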
The resulting DataFrame has an additional column with the "supervisor" information, where that information is repeated in one or more locations as required by the inputs.

Many-to-many joins

Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right arrays contains duplicates, then the result is a many-to-many merge. This will be perhaps most clear with a concrete example. Consider the following, where we have a DataFrame showing one or more skills associated with a particular group. By performing a many-to-many join, we can recover the skills associated with any individual person:

In [5]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
Out[5]: df1
df5
pd.merge(df1, df5)
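A self-contained sketch of this many-to-many join, showing how the row count expands:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})

# 'group' is duplicated on both sides, so every (employee, skill) pair
# within a group appears: 4 employees x 2 skills each = 8 rows.
merged = pd.merge(df1, df5)
print(merged)
```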
These three types of joins can be used with other Pandas tools to implement a wide array of functionality. But in practice, datasets are rarely as clean as the one we're working with here. In the following section we'll consider some of the options provided by pd.merge() that enable you to tune how the join operations work.

Specification of the Merge Key

We've already seen the default behavior of pd.merge(): it looks for one or more matching column names between the two inputs, and uses this as the key. However, often the column names will not match so nicely, and pd.merge() provides a variety of options for handling this.

The on keyword

Most simply, you can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names:

In [6]:
display('df1', 'df2', "pd.merge(df1, df2, on='employee')")

Out[6]: df1
df2
pd.merge(df1, df2, on='employee')
This option works only if both the left and right DataFrames have the specified column name.

The left_on and right_on keywords

At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as "name" rather than "employee". In this case, we can use the left_on and right_on keywords to specify the two column names:

In [7]:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')

Out[7]: df1
df3
pd.merge(df1, df3, left_on="employee", right_on="name")
The result has a redundant column that we can drop if desired, for example by using the drop() method of DataFrames:

In [8]:
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)

Out[8]:
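Put together as a standalone snippet, the differently-named-key merge plus cleanup looks like this:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

# Join on differently named key columns, then drop the redundant copy
# of the key that the merge leaves behind.
merged = (pd.merge(df1, df3, left_on='employee', right_on='name')
            .drop('name', axis=1))
print(merged)
```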
The left_index and right_index keywords

Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, your data might look like this:

In [9]:
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')

Out[9]: df1a
df2a
You can use the index as the key for merging by specifying the left_index and/or right_index flags in pd.merge():

In [10]:
display('df1a', 'df2a', "pd.merge(df1a, df2a, left_index=True, right_index=True)")

Out[10]: df1a
df2a
pd.merge(df1a, df2a, left_index=True, right_index=True)
For convenience, DataFrames implement the join() method, which performs a merge that defaults to joining on indices:

In [11]:
display('df1a', 'df2a', 'df1a.join(df2a)')

Out[11]: df1a
df2a
df1a.join(df2a)
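A self-contained version of the index join, showing that join() here is equivalent to pd.merge with left_index=True, right_index=True:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')

# join() aligns on the index by default; for these inputs it matches
# pd.merge(df1a, df2a, left_index=True, right_index=True).
joined = df1a.join(df2a)
print(joined)
```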
If you'd like to mix indices and columns, you can combine left_index with right_on, or left_on with right_index, to get the desired behavior:

In [12]:
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")

Out[12]: df1a
df3
pd.merge(df1a, df3, left_index=True, right_on='name')
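Sketched standalone, the mixed index/column merge looks like this (df1a is rebuilt directly with a named index):

```python
import pandas as pd

df1a = pd.DataFrame({'group': ['Accounting', 'Engineering', 'Engineering', 'HR']},
                    index=pd.Index(['Bob', 'Jake', 'Lisa', 'Sue'], name='employee'))
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

# The left key comes from df1a's index; the right key from df3's 'name' column.
merged = pd.merge(df1a, df3, left_index=True, right_on='name')
print(merged)
```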
All of these options also work with multiple indices and/or multiple columns; the interface for this behavior is very intuitive. For more information on this, see the "Merge, Join, and Concatenate" section of the Pandas documentation.

Specifying Set Arithmetic for Joins

In all the preceding examples we have glossed over one important consideration in performing a join: the type of set arithmetic used in the join. This comes up when a value appears in one key column but not the other. Consider this example:

In [13]:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')

Out[13]: df6
df7
pd.merge(df6, df7)
Here we have merged two datasets that have only a single "name" entry in common: Mary. By default, the result contains the intersection of the two sets of inputs; this is what is known as an inner join. We can specify this explicitly using the how keyword, which defaults to "inner":

In [14]:
pd.merge(df6, df7, how='inner')

Out[14]:
Other options for the how keyword are 'outer', 'left', and 'right'. An outer join returns a join over the union of the input columns, and fills in all missing values with NAs:

In [15]:
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")

Out[15]: df6
df7
pd.merge(df6, df7, how='outer')
The left join and right join return joins over the left entries and right entries, respectively. For example:

In [16]:
display('df6', 'df7', "pd.merge(df6, df7, how='left')")

Out[16]: df6
df7
pd.merge(df6, df7, how='left')
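The four join types can be compared side by side in one standalone snippet:

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

inner = pd.merge(df6, df7, how='inner')  # intersection: only 'Mary'
outer = pd.merge(df6, df7, how='outer')  # union: all four names, NaN-filled
left = pd.merge(df6, df7, how='left')    # one row per row of df6
right = pd.merge(df6, df7, how='right')  # one row per row of df7
print(outer)
```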
The output rows now correspond to the entries in the left input; using how='right' works in an analogous manner. All of these options can be applied straightforwardly to any of the preceding join types.

Overlapping Column Names: The suffixes Keyword

Finally, you may end up in a case where your two input DataFrames have conflicting column names. Consider this example:

In [17]:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')

Out[17]: df8
df9
pd.merge(df8, df9, on="name")
Because the output would have two conflicting column names, the merge function automatically appends the suffixes _x and _y to make the output columns unique. If these defaults are inappropriate, it is possible to specify a custom suffix using the suffixes keyword:

In [18]:
display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')

Out[18]: df8
df9
pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])
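The default and custom suffix behavior can be verified in one self-contained snippet:

```python
import pandas as pd

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'], 'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'], 'rank': [3, 1, 4, 2]})

# Default suffixes are _x and _y; custom ones are given via `suffixes`.
default = pd.merge(df8, df9, on='name')
custom = pd.merge(df8, df9, on='name', suffixes=['_L', '_R'])
print(custom)
```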
These suffixes work in any of the possible join patterns, and work also if there are multiple overlapping columns.

Example: US States Data

Merge and join operations come up most often when combining data from different sources. Here we will consider an example of some data about US states and their populations. The data files can be found at http://github.com/jakevdp/data-USstates/:

In [19]:
# Following are shell commands to download the data
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv

Let's take a look at the three datasets, using the Pandas read_csv() function:

In [20]:
pop = pd.read_csv('data/state-population.csv')
areas = pd.read_csv('data/state-areas.csv')
abbrevs = pd.read_csv('data/state-abbrevs.csv')
display('pop.head()', 'areas.head()', 'abbrevs.head()')

Out[20]: pop.head()
areas.head()
abbrevs.head()
Given this information, say we want to compute a relatively straightforward result: rank US states and territories by their 2010 population density. We clearly have the data here to find this result, but we'll have to combine the datasets to find it.

We'll start with a many-to-one merge that will give us the full state name within the population DataFrame. We want to merge based on the state/region column of pop and the abbreviation column of abbrevs, and we'll use how='outer' to make sure no data is thrown away due to mismatched labels:

In [21]:
merged = pd.merge(pop, abbrevs, how='outer',
                  left_on='state/region', right_on='abbreviation')
merged = merged.drop('abbreviation', axis=1)  # drop duplicate info
merged.head()

Out[21]:
Let's double-check whether there were any mismatches here, which we can do by looking for rows with nulls:

In [22]:
merged.isnull().any()

Out[22]:
state/region    False
ages            False
year            False
population       True
state            True
dtype: bool

Some of the population values are null; let's figure out which these are!

In [23]:
merged[merged['population'].isnull()].head()

Out[23]:
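This null-checking pattern can be sketched without the state CSVs (which may not be on disk) by reusing the earlier df6/df7 outer merge, where unmatched rows are left with nulls:

```python
import pandas as pd

# Stand-in for the state data: an outer merge that leaves
# unmatched rows with nulls in the right-hand columns.
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})
merged = pd.merge(df6, df7, how='outer')

print(merged.isnull().any())             # which columns contain nulls?
print(merged[merged['drink'].isnull()])  # rows that failed to match
```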
It appears that all the null population values are from Puerto Rico prior to the year 2000; this is likely due to this data not being available from the original source.

More importantly, we see also that some of the new state entries are null, which means that there was no corresponding entry in the abbrevs key! Let's figure out which regions lack this match:

In [24]:
merged.loc[merged['state'].isnull(), 'state/region'].unique()
Out[24]: array(['PR', 'USA'], dtype=object)

We can quickly infer the issue: our population data includes entries for Puerto Rico (PR) and the United States as a whole (USA), while these entries do not appear in the state abbreviation key. We can fix these quickly by filling in appropriate entries:

In [25]:
merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
merged.isnull().any()

Out[25]:
state/region    False
ages            False
year            False
population       True
state           False
dtype: bool

No more nulls in the state column: we're all set!

Now we can merge the result with the area data using a similar procedure. Examining our results, we will want to join on the state column in both:

In [26]:
final = pd.merge(merged, areas, on='state', how='left')
final.head()

Out[26]:
Again, let's check for nulls to see if there were any mismatches:

In [27]:
final.isnull().any()

Out[27]:
state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool

There are nulls in the area column; we can take a look to see which regions were ignored here:

In [28]:
final['state'][final['area (sq. mi)'].isnull()].unique()

Out[28]: array(['United States'], dtype=object)

We see that our areas DataFrame does not contain the area of the United States as a whole. We could insert the appropriate value (using the sum of all state areas, for instance), but in this case we'll just drop the null values, because the population density of the entire United States is not relevant to our current discussion:

In [29]:
final.dropna(inplace=True)
final.head()

Out[29]:
Now we have all the data we need. To answer the question of interest, let's first select the portion of the data corresponding with the year 2010, and the total population. We'll use the query() method to do this quickly:

In [30]:
data2010 = final.query("year == 2010 & ages == 'total'")
data2010.head()

Out[30]:
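Since the CSVs may not be available, the query() step can be illustrated on a hypothetical miniature stand-in for the merged table (the column names match the real data; the rows are just two states):

```python
import pandas as pd

# Hypothetical miniature stand-in for the merged state table.
# Population/area values are approximate 2010 figures, for illustration.
final = pd.DataFrame({'state': ['Alaska', 'Alaska', 'New Jersey', 'New Jersey'],
                      'ages': ['total', 'under18', 'total', 'under18'],
                      'year': [2010, 2010, 2010, 2010],
                      'population': [713868, 187902, 8802707, 2052093],
                      'area (sq. mi)': [656425.0, 656425.0, 8722.0, 8722.0]})

# query() filters rows using a boolean expression string.
data2010 = final.query("year == 2010 & ages == 'total'")
print(data2010)
```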
Now let's compute the population density and display it in order. We'll start by re-indexing our data on the state, and then compute the result:

In [31]:
data2010.set_index('state', inplace=True)
density = data2010['population'] / data2010['area (sq. mi)']

In [32]:
density.sort_values(ascending=False, inplace=True)
density.head()

Out[32]:
state
District of Columbia    8898.897059
Puerto Rico             1058.665149
New Jersey              1009.253268
Rhode Island             681.339159
Connecticut              645.600649
dtype: float64

The result is a ranking of US states plus Washington, DC, and Puerto Rico in order of their 2010 population density, in residents per square mile. We can see that by far the densest region in this dataset is Washington, DC (i.e., the District of Columbia); among states, the densest is New Jersey.

We can also check the end of the list:

In [33]:
density.tail()

Out[33]:
state
South Dakota    10.583512
North Dakota     9.537565
Montana          6.736171
Wyoming          5.768079
Alaska           1.087509
dtype: float64

We see that the least dense state, by far, is Alaska, averaging slightly over one resident per square mile.

This type of messy data merging is a common task when trying to answer questions using real-world data sources. I hope that this example has given you an idea of the ways you can combine tools we've covered in order to gain insight from your data!
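The density computation itself can be sketched on an illustrative two-state table; the populations and areas below are chosen to approximately reproduce the densities quoted in the text:

```python
import pandas as pd

# Illustrative two-state version of the density computation.
data2010 = pd.DataFrame({'state': ['New Jersey', 'Alaska'],
                         'population': [8802707, 713868],
                         'area (sq. mi)': [8722.0, 656425.0]}).set_index('state')

# Element-wise division of two aligned Series, then sort descending.
density = data2010['population'] / data2010['area (sq. mi)']
density = density.sort_values(ascending=False)
print(density)
```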