1 00:00:00,005 --> 00:00:01,007 - [Instructor] In this example, we're going 2 00:00:01,007 --> 00:00:04,002 to take a closer look at the data frame structure 3 00:00:04,002 --> 00:00:06,005 and how we can inspect its content. 4 00:00:06,005 --> 00:00:08,003 And again, just as a reminder here, 5 00:00:08,003 --> 00:00:11,000 data frames are a fairly large subject 6 00:00:11,000 --> 00:00:13,004 and I'm not going to cover all the features here. 7 00:00:13,004 --> 00:00:15,004 I'm just going to focus on some of the basics 8 00:00:15,004 --> 00:00:17,002 and I would suggest you check out 9 00:00:17,002 --> 00:00:19,002 some of the other courses here in the library 10 00:00:19,002 --> 00:00:21,000 to learn more about Pandas 11 00:00:21,000 --> 00:00:23,005 and data frames if you're interested. 12 00:00:23,005 --> 00:00:25,006 So let's open up our example file. 13 00:00:25,006 --> 00:00:29,007 It's going to be pandas_explore, 14 00:00:29,007 --> 00:00:31,000 and I already have some code 15 00:00:31,000 --> 00:00:34,002 to load the financial sample data 16 00:00:34,002 --> 00:00:36,004 which will give us a data frame. 17 00:00:36,004 --> 00:00:40,008 So let's start off by examining the data frames shape 18 00:00:40,008 --> 00:00:43,001 which will give us the dimensions of the data 19 00:00:43,001 --> 00:00:45,004 and to do that, 20 00:00:45,004 --> 00:00:49,004 I'll print out the data frames shape. 21 00:00:49,004 --> 00:00:54,009 And let's also use the describe function 22 00:00:54,009 --> 00:00:56,004 which will give us 23 00:00:56,004 --> 00:01:01,001 some detailed summary information about the data. 24 00:01:01,001 --> 00:01:03,007 All right, so let's go ahead and save 25 00:01:03,007 --> 00:01:07,003 and let's collapse this down so we have some more room. 26 00:01:07,003 --> 00:01:09,002 All right, and let's run this, 27 00:01:09,002 --> 00:01:14,003 so run this in the terminal. 28 00:01:14,003 --> 00:01:16,002 Okay, and so we can see here in the output 29 00:01:16,002 --> 00:01:22,003 that the data is 700 rows by 16 columns 30 00:01:22,003 --> 00:01:25,000 and the describe function gives us 31 00:01:25,000 --> 00:01:27,009 some summary about the data. 32 00:01:27,009 --> 00:01:30,009 Now by default, Pandas only includes 33 00:01:30,009 --> 00:01:35,000 the columns that have numerical data in the describe output, 34 00:01:35,000 --> 00:01:37,005 but you can include the other columns if you want to 35 00:01:37,005 --> 00:01:38,007 and you can see right here in the middle, 36 00:01:38,007 --> 00:01:40,004 we have these ellipses. 37 00:01:40,004 --> 00:01:44,005 That's because Pandas can't fit all of the columns 38 00:01:44,005 --> 00:01:48,003 in the amount of width that the terminal has here, 39 00:01:48,003 --> 00:01:50,006 and we'll go back and fix that in a moment. 40 00:01:50,006 --> 00:01:52,004 Let's also output the first 41 00:01:52,004 --> 00:01:54,006 and last five rows of the data 42 00:01:54,006 --> 00:01:57,000 so we can see the actual data 43 00:01:57,000 --> 00:02:00,004 and we'll use the head and the tail functions for that. 44 00:02:00,004 --> 00:02:03,003 So I'll use the head function 45 00:02:03,003 --> 00:02:05,006 and I want to see five rows, 46 00:02:05,006 --> 00:02:08,005 and then I'll use the tail function 47 00:02:08,005 --> 00:02:11,009 to get the last five rows. 48 00:02:11,009 --> 00:02:16,006 All right, let's comment previous functions out 49 00:02:16,006 --> 00:02:22,003 and then let's run this again. 50 00:02:22,003 --> 00:02:27,003 All right, so now we can see the first 51 00:02:27,003 --> 00:02:30,002 and last five rows of the data 52 00:02:30,002 --> 00:02:31,007 and let's see, we can also see 53 00:02:31,007 --> 00:02:33,009 that again, we're getting some ellipses 54 00:02:33,009 --> 00:02:37,000 because it can't fit all of the rows. 55 00:02:37,000 --> 00:02:40,009 It's only showing a subset of some of the rows 56 00:02:40,009 --> 00:02:43,006 and we can also see that we've got some extra decimal points 57 00:02:43,006 --> 00:02:45,000 that we don't really need. 58 00:02:45,000 --> 00:02:48,000 Let's go back and tighten this up a little bit 59 00:02:48,000 --> 00:02:52,001 by setting some Pandas display options. 60 00:02:52,001 --> 00:02:55,000 So first on Pandas, 61 00:02:55,000 --> 00:02:58,005 I'm going to call the set_option function 62 00:02:58,005 --> 00:03:10,001 and I'm going to set the display.max.columns option to none. 63 00:03:10,001 --> 00:03:14,008 And then I'll call set_option again 64 00:03:14,008 --> 00:03:21,005 and I'll set the display.precision to two. 65 00:03:21,005 --> 00:03:24,007 All right, so now let's save that. 66 00:03:24,007 --> 00:03:28,009 Now let's run one more time 67 00:03:28,009 --> 00:03:31,007 and here we go. 68 00:03:31,007 --> 00:03:32,008 Okay, and now we can see 69 00:03:32,008 --> 00:03:35,002 that all of the columns are being displayed. 70 00:03:35,002 --> 00:03:39,002 So you can see here are the first set of columns 71 00:03:39,002 --> 00:03:40,003 and then this little backslash 72 00:03:40,003 --> 00:03:42,007 means that the output is continued on the next line, 73 00:03:42,007 --> 00:03:44,005 and that's down here. 74 00:03:44,005 --> 00:03:49,009 So we have the various columns showing the sales and so on 75 00:03:49,009 --> 00:03:51,000 and then another backslash 76 00:03:51,000 --> 00:03:55,004 which means that the data keeps on going, that's here. 77 00:03:55,004 --> 00:03:57,002 And then we have the same output, 78 00:03:57,002 --> 00:03:58,007 only in this case it's the tail. 79 00:03:58,007 --> 00:04:01,003 So these are the last five rows 80 00:04:01,003 --> 00:04:02,009 and again, now we're seeing 81 00:04:02,009 --> 00:04:05,001 a continuation of all the different columns, 82 00:04:05,001 --> 00:04:07,003 and you'll also notice that the display.precision 83 00:04:07,003 --> 00:04:11,000 has been set to two decimal places. 84 00:04:11,000 --> 00:04:12,002 All right, let's try out some functions 85 00:04:12,002 --> 00:04:17,003 for inspecting the actual data itself. 86 00:04:17,003 --> 00:04:21,004 Let's comment these out. 87 00:04:21,004 --> 00:04:23,004 So let's try inspecting some of the data. 88 00:04:23,004 --> 00:04:26,007 The value_counts function can be used to see how many 89 00:04:26,007 --> 00:04:31,006 of each data value are present in a given column of data. 90 00:04:31,006 --> 00:04:33,001 So let's use that function to see 91 00:04:33,001 --> 00:04:37,002 how many times each product type appears in the dataset. 92 00:04:37,002 --> 00:04:41,001 So I'm going to print out the data frame 93 00:04:41,001 --> 00:04:42,000 and what I'm going to do 94 00:04:42,000 --> 00:04:47,007 is I'm going to ask for the Product column, 95 00:04:47,007 --> 00:04:48,009 and on the Product column 96 00:04:48,009 --> 00:04:53,002 I'm going to call the function value_counts. 97 00:04:53,002 --> 00:04:56,001 And let's also use the unique function 98 00:04:56,001 --> 00:04:58,004 to determine all of the unique values 99 00:04:58,004 --> 00:05:00,003 that appear in the column 100 00:05:00,003 --> 00:05:03,009 and once again, we'll use the Product column for this. 101 00:05:03,009 --> 00:05:10,004 So in the Product column, 102 00:05:10,004 --> 00:05:17,000 we're going to call unique, all right. 103 00:05:17,000 --> 00:05:18,009 Okay, let's go ahead. 104 00:05:18,009 --> 00:05:23,004 Let's also use some basic math functions. 105 00:05:23,004 --> 00:05:26,001 Let's use the min, max, 106 00:05:26,001 --> 00:05:30,000 and sum functions to get more information out of our table. 107 00:05:30,000 --> 00:05:31,007 So I want to see what the min 108 00:05:31,007 --> 00:05:34,007 and max values are for the Profit column, 109 00:05:34,007 --> 00:05:39,002 so let's print out on the data frame. 110 00:05:39,002 --> 00:05:42,008 We'll look for the Profit column 111 00:05:42,008 --> 00:05:45,006 and I'll call max, 112 00:05:45,006 --> 00:05:50,005 and then we'll do the same thing with min. 113 00:05:50,005 --> 00:05:55,004 All right, and let's also get the sum 114 00:05:55,004 --> 00:06:01,008 of that column as well. 115 00:06:01,008 --> 00:06:06,000 Oops, there we go. 116 00:06:06,000 --> 00:06:08,003 Okay, so let's save 117 00:06:08,003 --> 00:06:14,003 and let's run this. 118 00:06:14,003 --> 00:06:17,001 Okay, so we have the value_counts 119 00:06:17,001 --> 00:06:20,000 so that's showing us the Product column 120 00:06:20,000 --> 00:06:22,008 and we can see the individual product counts. 121 00:06:22,008 --> 00:06:24,004 So this is the number of times 122 00:06:24,004 --> 00:06:28,004 that this product value appears in the Product column. 123 00:06:28,004 --> 00:06:30,009 So we can see that the most common one is the Paseo, 124 00:06:30,009 --> 00:06:35,004 followed by the Velo, followed by VTT, and so on, 125 00:06:35,004 --> 00:06:38,007 and then it also gives us the data type 126 00:06:38,007 --> 00:06:41,000 which is an int64. 127 00:06:41,000 --> 00:06:43,008 Then we have the output of the unique function. 128 00:06:43,008 --> 00:06:45,003 That's this list right here, 129 00:06:45,003 --> 00:06:47,005 so these are all the unique values 130 00:06:47,005 --> 00:06:50,006 that appear in that column. 131 00:06:50,006 --> 00:06:57,008 Then we have the max of that column which is 262,000ish 132 00:06:57,008 --> 00:06:59,002 and then we have the min 133 00:06:59,002 --> 00:07:02,008 which is minus 40,000ish, 134 00:07:02,008 --> 00:07:05,000 and then we have the sum of that column 135 00:07:05,000 --> 00:07:08,004 which again, looks like 1., what's that? 136 00:07:08,004 --> 00:07:13,000 Looks like 16 million, almost 17 million. 137 00:07:13,000 --> 00:07:15,009 Okay, but suppose I only want to see the sum 138 00:07:15,009 --> 00:07:20,000 of the profits for a specific product, 139 00:07:20,000 --> 00:07:23,000 Pandas gives me a way to query the data frame 140 00:07:23,000 --> 00:07:26,003 using a function called L-O-C. 141 00:07:26,003 --> 00:07:27,009 And this isn't the only way to do this, 142 00:07:27,009 --> 00:07:29,009 but it's a fairly common way. 143 00:07:29,009 --> 00:07:33,009 So let's update the code to only show the total profits 144 00:07:33,009 --> 00:07:36,008 for the Carretera product. 145 00:07:36,008 --> 00:07:39,007 So I'm going to print out the result 146 00:07:39,007 --> 00:07:41,008 and on the data frame, 147 00:07:41,008 --> 00:07:45,006 I'm going to use the loc operator 148 00:07:45,006 --> 00:07:49,001 and inside that I'm going to use a query statement. 149 00:07:49,001 --> 00:07:52,009 So I'm going to query on the product 150 00:07:52,009 --> 00:08:00,007 where the product is equal to the Carretera string. 151 00:08:00,007 --> 00:08:03,006 And for that set of rows 152 00:08:03,006 --> 00:08:05,002 where the product is equal to Carretera, 153 00:08:05,002 --> 00:08:10,004 I want to get the Profit column 154 00:08:10,004 --> 00:08:11,007 and what I'm going to do is call 155 00:08:11,007 --> 00:08:16,003 the sum function on that result. 156 00:08:16,003 --> 00:08:19,006 So this statement will give me the sum 157 00:08:19,006 --> 00:08:25,002 of this column that meets this criteria. 158 00:08:25,002 --> 00:08:26,001 All right, so let's go ahead 159 00:08:26,001 --> 00:08:31,002 and comment out our previous code, 160 00:08:31,002 --> 00:08:37,005 and let's save and let's run, 161 00:08:37,005 --> 00:08:41,000 and we can see that this is the total sum 162 00:08:41,000 --> 00:08:45,009 for all of the profits where the product is Carretera. 163 00:08:45,009 --> 00:08:48,006 In fact, now we know enough about Pandas 164 00:08:48,006 --> 00:08:53,003 to get a breakout of the profits for each of the products, 165 00:08:53,003 --> 00:08:57,004 so let's write some code to create that output. 166 00:08:57,004 --> 00:08:59,007 So what I'm going to do first is use 167 00:08:59,007 --> 00:09:03,008 the Python locale module to set the format of the output 168 00:09:03,008 --> 00:09:06,006 as currency in the user's locale 169 00:09:06,006 --> 00:09:09,001 and you can see that if I scroll back up to the top, 170 00:09:09,001 --> 00:09:11,007 I've imported the Python locale module 171 00:09:11,007 --> 00:09:13,006 from the standard library. 172 00:09:13,006 --> 00:09:17,009 So what I'm going to do first is on the locale, 173 00:09:17,009 --> 00:09:20,005 I'm going to call set locale 174 00:09:20,005 --> 00:09:27,004 and I'm going to use the locale.LC_ALL option 175 00:09:27,004 --> 00:09:30,001 to set all of the locale options 176 00:09:30,001 --> 00:09:33,007 to my local machine locale. 177 00:09:33,007 --> 00:09:36,005 So this will affect how the currency is output. 178 00:09:36,005 --> 00:09:40,000 Then I'm going to loop over each individual product name. 179 00:09:40,000 --> 00:09:41,001 Well, we know how to do that 180 00:09:41,001 --> 00:09:49,000 because we know how to use the unique function, right? 181 00:09:49,000 --> 00:09:55,008 So for each product name in the Product column 182 00:09:55,008 --> 00:09:59,005 and we're going to get the unique values 183 00:09:59,005 --> 00:10:03,000 from that product column, 184 00:10:03,000 --> 00:10:07,002 what I'm going to do is declare 185 00:10:07,002 --> 00:10:09,004 a variable called total, 186 00:10:09,004 --> 00:10:17,001 and I'm going to get from my LOC operator, 187 00:10:17,001 --> 00:10:19,001 I'm going to ask for the Product column 188 00:10:19,001 --> 00:10:22,005 where product is equal to the product name 189 00:10:22,005 --> 00:10:25,007 that I'm currently looking at. 190 00:10:25,007 --> 00:10:28,009 I want the Profit column 191 00:10:28,009 --> 00:10:33,004 and I'm going to get the sum of the Profit column. 192 00:10:33,004 --> 00:10:35,004 So whereas in this case, 193 00:10:35,004 --> 00:10:38,001 we were only interested in a hard-coded string, 194 00:10:38,001 --> 00:10:39,003 now I'm going to loop 195 00:10:39,003 --> 00:10:44,001 over all of the unique values in that column 196 00:10:44,001 --> 00:10:47,009 and then I'll just print out the result. 197 00:10:47,009 --> 00:10:53,004 So let's print out that the Profits for, 198 00:10:53,004 --> 00:10:58,004 and we'll put the product name in the output. 199 00:10:58,004 --> 00:11:01,002 And then I'll use my locale 200 00:11:01,002 --> 00:11:07,006 and I'll call the currency function on the total, 201 00:11:07,006 --> 00:11:10,001 and I want grouping to be true. 202 00:11:10,001 --> 00:11:13,002 So this will use whatever the locale uses 203 00:11:13,002 --> 00:11:16,002 for grouping hundreds and thousands and so on 204 00:11:16,002 --> 00:11:18,003 and now in the US that's commas, 205 00:11:18,003 --> 00:11:20,004 but it might be different in your locale. 206 00:11:20,004 --> 00:11:24,000 It might be a period character or something else. 207 00:11:24,000 --> 00:11:27,000 All right, so let's go ahead and save that 208 00:11:27,000 --> 00:11:31,005 and let's comment out this previous example, 209 00:11:31,005 --> 00:11:35,004 and let's run it. 210 00:11:35,004 --> 00:11:37,003 All right, and we can see in the output 211 00:11:37,003 --> 00:11:39,003 that we have all the products 212 00:11:39,003 --> 00:11:41,009 and we have the totals for each, 213 00:11:41,009 --> 00:11:44,001 and it's in the local machine's currency 214 00:11:44,001 --> 00:11:45,008 which in this case is dollars 215 00:11:45,008 --> 00:11:48,003 grouped by hundreds and thousands 216 00:11:48,003 --> 00:11:51,000 and millions using commas, right? 217 00:11:51,000 --> 00:11:53,007 So you can see that I've just scratched the surface 218 00:11:53,007 --> 00:11:55,000 of using data frames here, 219 00:11:55,000 --> 00:11:56,008 but I'm sure this gives you an idea 220 00:11:56,008 --> 00:11:59,004 of the kind of power that Pandas gives you 221 00:11:59,004 --> 00:12:03,000 when working with datasets from Excel or CSV files.