1 00:00:00,005 --> 00:00:02,009 - [Instructor] Data frames make it easy to perform 2 00:00:02,009 --> 00:00:06,004 all kinds of manipulation on the data that they contain. 3 00:00:06,004 --> 00:00:07,005 So for this example, 4 00:00:07,005 --> 00:00:11,004 I'm going to focus on several of the more common operations. 5 00:00:11,004 --> 00:00:15,000 Let's open up pandas_manipulate. 6 00:00:15,000 --> 00:00:17,000 And you can see in my example file, 7 00:00:17,000 --> 00:00:21,006 I already have some code that reads the Inventory.csv file 8 00:00:21,006 --> 00:00:24,000 into a data frame. 9 00:00:24,000 --> 00:00:28,000 So for the first example, let's create a new column of data. 10 00:00:28,000 --> 00:00:30,004 We did this earlier in the course manually, 11 00:00:30,004 --> 00:00:31,006 but now we're going to accomplish 12 00:00:31,006 --> 00:00:33,009 the same thing using pandas. 13 00:00:33,009 --> 00:00:36,008 We're going to add a column named margin 14 00:00:36,008 --> 00:00:39,005 that represents the difference between the consumer price 15 00:00:39,005 --> 00:00:42,001 and the wholesale price for each product. 16 00:00:42,001 --> 00:00:45,002 So all I need to do is name the new column 17 00:00:45,002 --> 00:00:47,006 and use a formula to calculate the difference 18 00:00:47,006 --> 00:00:49,004 between the other two columns. 19 00:00:49,004 --> 00:00:51,006 So watch how easy this is. 20 00:00:51,006 --> 00:00:54,008 On the data frame, I'm simply going to make a new column. 21 00:00:54,008 --> 00:00:58,006 I'm going to call that margin. 22 00:00:58,006 --> 00:01:01,006 And I'm going to set that equal to 23 00:01:01,006 --> 00:01:07,007 the column for the consumer price. 24 00:01:07,007 --> 00:01:10,001 And I'm going to subtract off 25 00:01:10,001 --> 00:01:15,003 the column for the wholesale price. 26 00:01:15,003 --> 00:01:16,007 And that's all there is to it.
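The margin step just described can be sketched roughly as follows. The transcript does not show the contents of Inventory.csv, so the small data frame and the exact column names ("Item Name", "Category", "Wholesale Price", "Consumer Price", "Margin") are assumptions made only so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for Inventory.csv; the real file's column names
# and values are not shown in the transcript, so these are assumptions.
df = pd.DataFrame({
    "Item Name": ["Widget", "Gadget", "Gizmo"],
    "Category": ["Tools", "Toys", "Tools"],
    "Wholesale Price": [2.50, 4.00, 7.25],
    "Consumer Price": [4.00, 6.50, 10.00],
})

# Column arithmetic is element-wise, so this yields one margin per row.
df["Margin"] = df["Consumer Price"] - df["Wholesale Price"]

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # up to the first five rows
```

Assigning to a column name that does not exist yet is what creates the column, which is why the column count in the shape grows by one.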
27 00:01:16,007 --> 00:01:19,001 This will create a new column named margin 28 00:01:19,001 --> 00:01:23,004 and calculate the difference for each row in the data set. 29 00:01:23,004 --> 00:01:26,000 And let's also print out the shape of the data 30 00:01:26,000 --> 00:01:39,000 along with the first five rows. 31 00:01:39,000 --> 00:01:43,008 All right, so let's go ahead and save our code 32 00:01:43,008 --> 00:01:51,009 and let's run this. 33 00:01:51,009 --> 00:01:54,005 All right, and you can see that now we have a new column. 34 00:01:54,005 --> 00:01:58,002 So the shape has changed 35 00:01:58,002 --> 00:02:00,007 from 50 rows and five columns 36 00:02:00,007 --> 00:02:03,002 to 50 rows and six columns. 37 00:02:03,002 --> 00:02:05,007 And we can see that the new margin column 38 00:02:05,007 --> 00:02:08,000 is here in the result. 39 00:02:08,000 --> 00:02:13,000 So we've modified the shape of our data structure. 40 00:02:13,000 --> 00:02:14,008 We can also modify the contents 41 00:02:14,008 --> 00:02:17,006 of a column directly in place. 42 00:02:17,006 --> 00:02:19,006 Now, there are a variety of ways to do this, 43 00:02:19,006 --> 00:02:21,007 and the method I'm going to use 44 00:02:21,007 --> 00:02:24,005 is the apply function. 45 00:02:24,005 --> 00:02:28,007 So the apply function allows you to apply another function 46 00:02:28,007 --> 00:02:31,005 to each cell in a particular column. 47 00:02:31,005 --> 00:02:34,004 So suppose I wanted to modify the cells 48 00:02:34,004 --> 00:02:36,007 in the category column. 49 00:02:36,007 --> 00:02:39,000 Let's create some code to do that. 50 00:02:39,000 --> 00:02:42,002 And I'm going to comment out the previous example. 51 00:02:42,002 --> 00:02:44,007 So on the data frame, 52 00:02:44,007 --> 00:02:49,009 I'm going to specify I want to work on the category column.
53 00:02:49,009 --> 00:02:54,003 And what I'm going to do is assign that to the result 54 00:02:54,003 --> 00:02:59,005 of the apply function. 55 00:02:59,005 --> 00:03:03,000 Now, the argument to the apply function is any function 56 00:03:03,000 --> 00:03:07,003 I want to call for each cell in the column. 57 00:03:07,003 --> 00:03:09,002 So let's just make this simple. 58 00:03:09,002 --> 00:03:11,003 Let's convert each string 59 00:03:11,003 --> 00:03:14,000 in the category column to uppercase. 60 00:03:14,000 --> 00:03:16,006 To do that, I'm going to just supply 61 00:03:16,006 --> 00:03:21,006 an inline lambda function. 62 00:03:21,006 --> 00:03:24,003 And I'm going to call x.upper(). 63 00:03:24,003 --> 00:03:26,008 So my lambda function 64 00:03:26,008 --> 00:03:27,009 is going to take one argument. 65 00:03:27,009 --> 00:03:29,004 It's going to be the string, 66 00:03:29,004 --> 00:03:32,000 which is the value of each cell in the column. 67 00:03:32,000 --> 00:03:33,001 And since it's a string, 68 00:03:33,001 --> 00:03:35,000 I can use the Python upper method 69 00:03:35,000 --> 00:03:37,003 to convert it to uppercase. 70 00:03:37,003 --> 00:03:39,001 And let's also print the data frame 71 00:03:39,001 --> 00:03:43,000 when we're done with that. 72 00:03:43,000 --> 00:03:45,006 All right, so let's go ahead and save. 73 00:03:45,006 --> 00:03:50,000 And now let's run. 74 00:03:50,000 --> 00:03:54,009 And here you can see in the result in my category column, 75 00:03:54,009 --> 00:03:59,009 all of the strings have now been converted to uppercase. 76 00:03:59,009 --> 00:04:02,001 If you want to rename the columns on the data frame, 77 00:04:02,001 --> 00:04:04,002 that's also pretty easy. 78 00:04:04,002 --> 00:04:06,005 We can use the rename function for this.
79 00:04:06,005 --> 00:04:09,000 And again, there are a few ways to call this function, 80 00:04:09,000 --> 00:04:10,007 depending on your use case, 81 00:04:10,007 --> 00:04:13,008 but I'll demonstrate how to rename specific columns. 82 00:04:13,008 --> 00:04:17,003 So let's assume we want to rename wholesale price 83 00:04:17,003 --> 00:04:20,009 and consumer price to just wholesale and consumer. 84 00:04:20,009 --> 00:04:22,008 We can use the rename function 85 00:04:22,008 --> 00:04:25,000 and pass a dictionary of old 86 00:04:25,000 --> 00:04:28,002 and new names using the columns parameter. 87 00:04:28,002 --> 00:04:29,004 So on the data frame, 88 00:04:29,004 --> 00:04:33,007 I'm going to call the rename function, 89 00:04:33,007 --> 00:04:37,002 and I'm going to specify that I'm renaming columns. 90 00:04:37,002 --> 00:04:38,005 And inside this dictionary, 91 00:04:38,005 --> 00:04:42,002 I'm going to give key value pairs 92 00:04:42,002 --> 00:04:47,006 and the key is going to be the column I want to rename 93 00:04:47,006 --> 00:04:51,005 paired with what the new name is. 94 00:04:51,005 --> 00:04:54,002 So let's do that for wholesale price, 95 00:04:54,002 --> 00:04:58,005 and let's also do consumer price. 96 00:04:58,005 --> 00:05:04,009 We will rename that to just consumer. 97 00:05:04,009 --> 00:05:06,006 And then I have to specify that I want this 98 00:05:06,006 --> 00:05:08,004 to happen in place. 99 00:05:08,004 --> 00:05:12,008 So I'm going to set the inplace argument to True. 100 00:05:12,008 --> 00:05:14,006 And then once that's done, 101 00:05:14,006 --> 00:05:15,009 we'll just call the head function 102 00:05:15,009 --> 00:05:20,003 to print out the first five rows. 103 00:05:20,003 --> 00:05:23,009 Let me comment out the previous example. 104 00:05:23,009 --> 00:05:31,009 All right, so let's go ahead and save this and let's run it.
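The uppercase conversion and the rename described above can be sketched together roughly like this. As before, the data frame is a hypothetical stand-in for Inventory.csv, and the column names "Category", "Wholesale Price", and "Consumer Price" are assumptions, since the transcript never shows the actual headers.

```python
import pandas as pd

# Hypothetical stand-in for Inventory.csv; names and values are assumptions.
df = pd.DataFrame({
    "Category": ["Tools", "Toys", "Tools"],
    "Wholesale Price": [2.50, 4.00, 7.25],
    "Consumer Price": [4.00, 6.50, 10.00],
})

# apply() calls the lambda once per cell; each x is a Python str here.
df["Category"] = df["Category"].apply(lambda x: x.upper())

# Map old names to new names; inplace=True modifies df rather than
# returning a renamed copy.
df.rename(
    columns={"Wholesale Price": "Wholesale", "Consumer Price": "Consumer"},
    inplace=True,
)

# head() only returns the rows, so in a script it has to be printed.
print(df.head())
```

Two alternatives are worth knowing about: `df["Category"].str.upper()` does the same conversion without a lambda, and leaving out `inplace=True` makes `rename` return a new data frame that can be assigned back to `df`.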
105 00:05:31,009 --> 00:05:37,008 And oh, whoops, I have to actually print the head, 106 00:05:37,008 --> 00:05:40,006 not just call it. 107 00:05:40,006 --> 00:05:46,006 All right, let's try that again. 108 00:05:46,006 --> 00:05:50,006 All right, and so now we can see that the wholesale price 109 00:05:50,006 --> 00:05:54,007 and consumer price columns have been properly renamed. 110 00:05:54,007 --> 00:05:55,005 All right, one more thing. 111 00:05:55,005 --> 00:05:59,001 Let's finally look at how to delete a column of data. 112 00:05:59,001 --> 00:06:00,007 So to delete a column of data, 113 00:06:00,007 --> 00:06:02,007 we use the drop function 114 00:06:02,007 --> 00:06:06,002 with the name of the column that we no longer want. 115 00:06:06,002 --> 00:06:10,005 So let's comment this out. 116 00:06:10,005 --> 00:06:13,008 So on the data frame, I'm going to use the drop function. 117 00:06:13,008 --> 00:06:17,008 So let's drop the margin column that we've already made. 118 00:06:17,008 --> 00:06:19,000 And in order to do that, 119 00:06:19,000 --> 00:06:20,008 I'll have to re-enable the code 120 00:06:20,008 --> 00:06:26,008 that creates the margin column. 121 00:06:26,008 --> 00:06:29,000 All right, and in the drop function, 122 00:06:29,000 --> 00:06:30,004 I have to specify, once again, 123 00:06:30,004 --> 00:06:33,007 I want this to happen in place. 124 00:06:33,007 --> 00:06:35,003 I don't want to create a new data frame. 125 00:06:35,003 --> 00:06:37,003 So I'm going to set inplace equal to True. 126 00:06:37,003 --> 00:06:40,003 And then I have to specify what's called the axis. 127 00:06:40,003 --> 00:06:42,006 The axis I'm going to set to the value of one.
128 00:06:42,006 --> 00:06:44,009 This basically means to operate on the columns, 129 00:06:44,009 --> 00:06:46,001 not the rows, 130 00:06:46,001 --> 00:06:48,006 because by default the pandas library assumes 131 00:06:48,006 --> 00:06:51,003 this is a row index label 132 00:06:51,003 --> 00:06:52,007 rather than a column index label. 133 00:06:52,007 --> 00:06:56,005 So by specifying the axis parameter with a value of one, 134 00:06:56,005 --> 00:06:58,001 I'm saying, hey, this is a column, 135 00:06:58,001 --> 00:07:02,006 so go look for the column named margin and drop that. 136 00:07:02,006 --> 00:07:03,009 And then after we've done that, 137 00:07:03,009 --> 00:07:10,005 let's go ahead and print out the head one more time. 138 00:07:10,005 --> 00:07:15,001 So let's re-enable the code that shows the margin there, 139 00:07:15,001 --> 00:07:18,006 and then we'll print it again after it's gone. 140 00:07:18,006 --> 00:07:26,003 So let's go ahead and save this and let's try it. 141 00:07:26,003 --> 00:07:27,005 All right, so in the first example, 142 00:07:27,005 --> 00:07:30,004 we can see that we created the margin column properly 143 00:07:30,004 --> 00:07:33,008 and then we can see that it's been dropped. 144 00:07:33,008 --> 00:07:35,001 All right, so this should give you some sense 145 00:07:35,001 --> 00:07:36,001 of how powerful 146 00:07:36,001 --> 00:07:39,004 and flexible data frames are for working with data content. 147 00:07:39,004 --> 00:07:41,004 And I would suggest maybe spending some time here 148 00:07:41,004 --> 00:07:42,006 referring to the docs 149 00:07:42,006 --> 00:07:44,009 and trying out some of your own experiments 150 00:07:44,009 --> 00:07:46,000 before moving on.
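As a starting point for those experiments, the create-then-drop sequence from the last example might look something like the sketch below. The data frame again stands in for Inventory.csv, and the column names ("Wholesale Price", "Consumer Price", "Margin") are assumptions rather than the file's real headers.

```python
import pandas as pd

# Hypothetical stand-in for Inventory.csv; names and values are assumptions.
df = pd.DataFrame({
    "Category": ["Tools", "Toys", "Tools"],
    "Wholesale Price": [2.50, 4.00, 7.25],
    "Consumer Price": [4.00, 6.50, 10.00],
})

# Re-create the margin column so there is something to drop.
df["Margin"] = df["Consumer Price"] - df["Wholesale Price"]
print(df.head())  # Margin is present

# axis=1 tells drop() that "Margin" is a column label; with the default
# axis=0 pandas would look for a row labeled "Margin" instead.
df.drop("Margin", axis=1, inplace=True)
print(df.head())  # Margin is gone
```

An equivalent spelling that avoids the axis number is `df.drop(columns="Margin", inplace=True)`.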