I’m still playing around with Covid vaccine data, this time exploring how the number of doses varies week by week. I want to know how many more (or less) vaccines have been done on a given day compared to that same day last week. Show We’ll be using Pandas in this blog post, so let’s first install that library and import it: Install Pandas
Import module
And now let’s create a DataFrame containing a subset of the data that I’m working with: Table 1. Outputdatevaccinations2021-01-11 165844.0 2021-01-18 208641.0 2021-01-25 281725.0 2021-02-01 352935.0 2021-02-08 356291.0 I’ve filtered the data to include the data for 5 Mondays start from the beginning of January. In the real data set we have the data for every day from January until today. I want to add a column that shows the number of vaccinations done on the previous week. So:
and so on. It took me a while to find the function to do this, but it turns out that 2 is what we want. This function shifts the index by the desired number of periods. We’ll pass in 3 because we want to shift every row down by 1:
Note I should note that for the real DataFrame I had data for every day and therefore wanted to compare the data from 7 rows earlier in the DataFrame,which meant I had to use 4. If we wanted to compare the data from 3 days earlier, we could use 5, and so on.Table 2. OutputvaccinationsNaN 165844.0 208641.0 281725.0 352935.0 We can then add a new column to our DataFrame using the following code:
2021-01-11 165844.0 NaN 2021-01-18 208641.0 165844.0 2021-01-25 281725.0 208641.0 2021-02-01 352935.0 281725.0 2021-02-08 356291.0 352935.0 If we then wanted to compute the difference between this week and last week, we could compute this by writing the following code:
Or we could use the 6 function, which achieves the same thing: Table 4. OutputdatevaccinationslastWeekVaccinationslastWeekVaccinationsDiff2021-01-11 165844.0 NaN NaN 2021-01-18 208641.0 165844.0 42797.0 2021-01-25 281725.0 208641.0 73084.0 2021-02-01 352935.0 281725.0 71210.0 2021-02-08 356291.0 352935.0 3356.0 We can also compute the percentage change between the weeks using 7: Table 5. OutputdatevaccinationslastWeekVaccinationslastWeekVaccinationsDifflastWeekVaccinationsChange2021-01-11 165844.0 NaN NaN NaN 2021-01-18 208641.0 165844.0 42797.0 0.2580557632473892 2021-01-25 281725.0 208641.0 73084.0 0.3502858977861494 2021-02-01 352935.0 281725.0 71210.0 0.25276422042772206 2021-02-08 356291.0 352935.0 3356.0 0.009508833071245393 With these functions I’ve been able to do exactly what I wanted and could then create a chart based on this data.
About the authorI'm currently working on real-time user-facing analytics with Apache Pinot at StarTree. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also I co-authored the O'Reilly Graph Algorithms Book with Amy Hodler. How do I compare rows in pandas?Steps to Compare Values Between two Pandas DataFrames. Step 1: Prepare the datasets to be compared. To start, let's say that you have the following two datasets that you want to compare: ... . Step 2: Create the two DataFrames. ... . Step 3: Compare the values between the two Pandas DataFrames.. What does diff () do in pandas?Pandas DataFrame diff() Method
The diff() method returns a DataFrame with the difference between the values for each row and, by default, the previous row. Which row to compare with can be specified with the periods parameter.
What is diff () in Python?diff(arr[, n[, axis]]) function is used when we calculate the n-th order discrete difference along the given axis. The first order difference is given by out[i] = arr[i+1] – arr[i] along the given axis. If we have to calculate higher differences, we are using diff recursively. Syntax: numpy.diff()
How do I compare two column values in pandas?By using the Where() method in NumPy, we are given the condition to compare the columns. If 'column1' is lesser than 'column2' and 'column1' is lesser than the 'column3', We print the values of 'column1'. If the condition fails, we give the value as 'NaN'. These results are stored in the new column in the dataframe.
|