1 00:00:00,004 --> 00:00:02,007 - [Instructor] The Pandas Library has built-in support 2 00:00:02,007 --> 00:00:06,008 for reading and writing a variety of data file types. 3 00:00:06,008 --> 00:00:10,008 In this example, we'll see how to read Excel and CSV files. 4 00:00:10,008 --> 00:00:14,003 So here in my pandas_read.py file, 5 00:00:14,003 --> 00:00:19,000 you can see I've already imported pandas as the pd alias, 6 00:00:19,000 --> 00:00:21,002 and this is pretty common among Pandas users, 7 00:00:21,002 --> 00:00:23,004 you'll probably see this code quite a bit. 8 00:00:23,004 --> 00:00:26,003 That makes it easier to refer to the library. 9 00:00:26,003 --> 00:00:28,000 So for our first example, 10 00:00:28,000 --> 00:00:32,000 let's read the content of our inventory.csv file, 11 00:00:32,000 --> 00:00:33,006 which you can see is located right here 12 00:00:33,006 --> 00:00:35,001 in the same directory. 13 00:00:35,001 --> 00:00:37,000 So to do this, you just need to use 14 00:00:37,000 --> 00:00:41,006 the built-in read_csv function that Pandas provides. 15 00:00:41,006 --> 00:00:44,007 That's going to build and return a data frame object. 16 00:00:44,007 --> 00:00:46,007 So I'm going to create a variable called df 17 00:00:46,007 --> 00:00:48,001 for the data frame. 18 00:00:48,001 --> 00:00:52,005 And on pandas, I'm going to call read_csv 19 00:00:52,005 --> 00:00:58,002 and specify the name of Inventory.csv. 20 00:00:58,002 --> 00:01:01,001 And then I'm just going to print out the data frame. 21 00:01:01,001 --> 00:01:02,003 So the result of this function 22 00:01:02,003 --> 00:01:04,007 will be a filled out data frame 23 00:01:04,007 --> 00:01:08,002 that contains the CSV file contents. 24 00:01:08,002 --> 00:01:14,005 So let's go ahead and run what we have in the terminal. 25 00:01:14,005 --> 00:01:18,002 And sure enough, you can see that when we run the code, 26 00:01:18,002 --> 00:01:21,009 the output is a nicely formatted table of data, 27 00:01:21,009 --> 00:01:25,004 which is the contents of the CSV file. 28 00:01:25,004 --> 00:01:27,004 So Pandas has created columns 29 00:01:27,004 --> 00:01:30,006 with the same names as the first row in the file, 30 00:01:30,006 --> 00:01:33,000 and has also created rows 31 00:01:33,000 --> 00:01:36,006 with numeric indexes starting at 0. 32 00:01:36,006 --> 00:01:41,002 And we can see that all 50 rows of data are present. 33 00:01:41,002 --> 00:01:43,007 So I mean, that's pretty simple, right? 34 00:01:43,007 --> 00:01:45,008 But we can also customize the process. 35 00:01:45,008 --> 00:01:47,007 So, for example, I can specify 36 00:01:47,007 --> 00:01:51,006 that I only want to read a certain number of rows, 37 00:01:51,006 --> 00:01:54,004 and I can specify that I want to skip a number of rows 38 00:01:54,004 --> 00:01:56,004 if I'm working with a large file. 39 00:01:56,004 --> 00:01:58,005 So let's go back to the code, 40 00:01:58,005 --> 00:02:01,001 and let's just read a portion of the file 41 00:02:01,001 --> 00:02:06,003 so I'm going to copy this example and paste it here. 42 00:02:06,003 --> 00:02:09,009 I'm going to specify that I'm going to skip rows 43 00:02:09,009 --> 00:02:12,005 by using the skiprows parameter. 44 00:02:12,005 --> 00:02:15,002 I'm going to skip the first 15 rows, 45 00:02:15,002 --> 00:02:18,002 and then I only want to read 15 rows 46 00:02:18,002 --> 00:02:20,004 so I'm going to use the nrows parameter 47 00:02:20,004 --> 00:02:24,002 to specify that I want to read 15 rows of data. 48 00:02:24,002 --> 00:02:27,008 And let's comment out the previous example. 49 00:02:27,008 --> 00:02:30,003 Now, but the problem here is that if I do this, 50 00:02:30,003 --> 00:02:32,005 I'm going to skip the header row, 51 00:02:32,005 --> 00:02:33,009 and that's going to cause an issue. 52 00:02:33,009 --> 00:02:38,001 So watch as I run this. 53 00:02:38,001 --> 00:02:40,007 Right, we can see that the headers are all messed up, right? 54 00:02:40,007 --> 00:02:41,009 It doesn't know what the headers are 55 00:02:41,009 --> 00:02:44,000 because we skipped over them. 56 00:02:44,000 --> 00:02:47,004 So to fix that, I can specify a function 57 00:02:47,004 --> 00:02:51,002 that returns true if a line should be skipped 58 00:02:51,002 --> 00:02:53,000 and false otherwise. 59 00:02:53,000 --> 00:02:55,008 And I'm going to use an inline lambda function 60 00:02:55,008 --> 00:02:57,002 to achieve this. 61 00:02:57,002 --> 00:02:59,003 So let's go ahead back here. 62 00:02:59,003 --> 00:03:03,003 And what I'm going to do is I'm going to write skiprows= 63 00:03:03,003 --> 00:03:06,001 and I'm going to write lambda x. 64 00:03:06,001 --> 00:03:09,006 So if x is greater than or equal to 1 65 00:03:09,006 --> 00:03:13,000 and x is less than 15, right? 66 00:03:13,000 --> 00:03:15,000 Then that's going to return true, 67 00:03:15,000 --> 00:03:16,003 otherwise it's going to return false. 68 00:03:16,003 --> 00:03:19,002 So this is a big Boolean operation right here. 69 00:03:19,002 --> 00:03:22,002 So it's going to skip the first row, 70 00:03:22,002 --> 00:03:24,000 but not the 0 width row, 71 00:03:24,000 --> 00:03:26,003 which has all of my headers in it. 72 00:03:26,003 --> 00:03:32,005 So let's save this and let's run it again. 73 00:03:32,005 --> 00:03:34,005 And now you can see that the data is correct, 74 00:03:34,005 --> 00:03:35,008 the headers are all correct, 75 00:03:35,008 --> 00:03:37,005 and I've skipped 15 rows, 76 00:03:37,005 --> 00:03:40,009 and I've only read in 15 rows of data. 77 00:03:40,009 --> 00:03:46,002 All right, let's try some of this with an Excel file 78 00:03:46,002 --> 00:03:49,009 and let's comment out the previous example. 79 00:03:49,009 --> 00:03:53,000 So I'm going to read the financial sample file 80 00:03:53,000 --> 00:03:54,002 that we've been using in the course, 81 00:03:54,002 --> 00:03:56,003 which, as we saw earlier, is pretty large, right? 82 00:03:56,003 --> 00:03:58,003 That's 700 rows of data. 83 00:03:58,003 --> 00:04:00,001 So to read the Excel file, 84 00:04:00,001 --> 00:04:02,004 similar to using read_csv, 85 00:04:02,004 --> 00:04:05,006 Pandas provides a read_excel function. 86 00:04:05,006 --> 00:04:07,006 So I'm going to go ahead and call that, 87 00:04:07,006 --> 00:04:09,006 so that's going to give me a data frame, 88 00:04:09,006 --> 00:04:14,005 I'm going to call read_excel. 89 00:04:14,005 --> 00:04:21,004 And I want to read the FinancialSample.xlsx file. 90 00:04:21,004 --> 00:04:23,001 I can also specify some nice things. 91 00:04:23,001 --> 00:04:27,000 I want to specify that I'm going to use only certain columns, 92 00:04:27,000 --> 00:04:32,004 and I'm interested in columns A:E and column H, 93 00:04:32,004 --> 00:04:37,000 and I'm only interested in 15 rows. 94 00:04:37,000 --> 00:04:39,007 All right, so let's print out the data frame 95 00:04:39,007 --> 00:04:40,005 when we do that, 96 00:04:40,005 --> 00:04:45,003 and let's also print out 97 00:04:45,003 --> 00:04:47,005 the data frame types, 98 00:04:47,005 --> 00:04:51,002 so we'll see what types of data are stored in the table. 99 00:04:51,002 --> 00:04:57,006 So let's run this. 100 00:04:57,006 --> 00:04:58,007 And oh, whoops, 101 00:04:58,007 --> 00:05:06,000 looks like I've got 102 00:05:06,000 --> 00:05:07,001 this wrong property. 103 00:05:07,001 --> 00:05:09,008 I believe it's actually dtypes. 104 00:05:09,008 --> 00:05:10,006 There we go. 105 00:05:10,006 --> 00:05:14,008 Let's try this again. 106 00:05:14,008 --> 00:05:15,006 Yep, there we go. 107 00:05:15,006 --> 00:05:18,000 All right, so you can see that when I run this, 108 00:05:18,000 --> 00:05:20,004 we have the columns, right? 109 00:05:20,004 --> 00:05:21,008 So here's all the columns, 110 00:05:21,008 --> 00:05:24,000 we've got 15 rows of data, 111 00:05:24,000 --> 00:05:26,002 and you can also see that I'm only getting 112 00:05:26,002 --> 00:05:31,003 columns A:E and column H. 113 00:05:31,003 --> 00:05:33,008 And we have the data types. 114 00:05:33,008 --> 00:05:37,001 So the text columns are objects, 115 00:05:37,001 --> 00:05:41,009 and these columns are float64s. 116 00:05:41,009 --> 00:05:45,000 We can also get information about an Excel file 117 00:05:45,000 --> 00:05:47,002 by using the ExcelFile class 118 00:05:47,002 --> 00:05:50,003 that is part of the Pandas module. 119 00:05:50,003 --> 00:05:54,006 So let's go ahead and comment this out. 120 00:05:54,006 --> 00:05:58,003 So in this case, I'm going to get a file result, 121 00:05:58,003 --> 00:05:59,008 and on the Pandas object 122 00:05:59,008 --> 00:06:04,008 I'm going to use the ExcelFile class, 123 00:06:04,008 --> 00:06:11,007 and I'm going to pass in the FinancialSample file. 124 00:06:11,007 --> 00:06:14,004 So this will give me a reference to the Excel file, 125 00:06:14,004 --> 00:06:15,006 and then I could do things 126 00:06:15,006 --> 00:06:17,009 like print out some information about the file. 127 00:06:17,009 --> 00:06:20,008 So I can examine the sheet names, for example. 128 00:06:20,008 --> 00:06:23,005 So the sheet names are:, 129 00:06:23,005 --> 00:06:27,009 and I'm going to write file.sheet_names. 130 00:06:27,009 --> 00:06:32,007 All right, let's go ahead and try that, 131 00:06:32,007 --> 00:06:36,002 and let's run. 132 00:06:36,002 --> 00:06:37,002 And we can see, sure enough, 133 00:06:37,002 --> 00:06:40,009 we have these SalesData and Sheet2 sheets. 134 00:06:40,009 --> 00:06:44,001 So, of course, once we've loaded the data into a data frame, 135 00:06:44,001 --> 00:06:46,007 we're going to want to perform some operations on it, 136 00:06:46,007 --> 00:06:50,000 and we will see how to do that later in the chapter.