1
00:00:00,004 --> 00:00:02,007
- [Instructor] The Pandas Library has built-in support

2
00:00:02,007 --> 00:00:06,008
for reading and writing a variety of data file types.

3
00:00:06,008 --> 00:00:10,008
In this example, we'll see how to read Excel and CSV files.

4
00:00:10,008 --> 00:00:14,003
So here in my pandas_read.py file,

5
00:00:14,003 --> 00:00:19,000
you can see I've already imported pandas as the pd alias,

6
00:00:19,000 --> 00:00:21,002
and this is pretty common among Pandas users,

7
00:00:21,002 --> 00:00:23,004
you'll probably see this code quite a bit.

8
00:00:23,004 --> 00:00:26,003
That makes it easier to refer to the library.

9
00:00:26,003 --> 00:00:28,000
So for our first example,

10
00:00:28,000 --> 00:00:32,000
let's read the content of our inventory.csv file,

11
00:00:32,000 --> 00:00:33,006
which you can see is located right here

12
00:00:33,006 --> 00:00:35,001
in the same directory.

13
00:00:35,001 --> 00:00:37,000
So to do this, you just need to use

14
00:00:37,000 --> 00:00:41,006
the built-in read_csv function that Pandas provides.

15
00:00:41,006 --> 00:00:44,007
That's going to build and return a data frame object.

16
00:00:44,007 --> 00:00:46,007
So I'm going to create a variable called df

17
00:00:46,007 --> 00:00:48,001
for the data frame.

18
00:00:48,001 --> 00:00:52,005
And on pandas, I'm going to call read_csv

19
00:00:52,005 --> 00:00:58,002
and specify the name of Inventory.csv.

20
00:00:58,002 --> 00:01:01,001
And then I'm just going to print out the data frame.

21
00:01:01,001 --> 00:01:02,003
So the result of this function

22
00:01:02,003 --> 00:01:04,007
will be a filled out data frame

23
00:01:04,007 --> 00:01:08,002
that contains the CSV file contents.

24
00:01:08,002 --> 00:01:14,005
So let's go ahead and run what we have in the terminal.

25
00:01:14,005 --> 00:01:18,002
And sure enough, you can see that when we run the code,

26
00:01:18,002 --> 00:01:21,009
the output is a nicely formatted table of data,

27
00:01:21,009 --> 00:01:25,004
which is the contents of the CSV file.

28
00:01:25,004 --> 00:01:27,004
So Pandas has created columns

29
00:01:27,004 --> 00:01:30,006
with the same names as the first row in the file,

30
00:01:30,006 --> 00:01:33,000
and has also created rows

31
00:01:33,000 --> 00:01:36,006
with numeric indexes starting at 0.

32
00:01:36,006 --> 00:01:41,002
And we can see that all 50 rows of data are present.

33
00:01:41,002 --> 00:01:43,007
So I mean, that's pretty simple, right?

34
00:01:43,007 --> 00:01:45,008
But we can also customize the process.

35
00:01:45,008 --> 00:01:47,007
So, for example, I can specify

36
00:01:47,007 --> 00:01:51,006
that I only want to read a certain number of rows,

37
00:01:51,006 --> 00:01:54,004
and I can specify that I want to skip a number of rows

38
00:01:54,004 --> 00:01:56,004
if I'm working with a large file.

39
00:01:56,004 --> 00:01:58,005
So let's go back to the code,

40
00:01:58,005 --> 00:02:01,001
and let's just read a portion of the file

41
00:02:01,001 --> 00:02:06,003
so I'm going to copy this example and paste it here.

42
00:02:06,003 --> 00:02:09,009
I'm going to specify that I'm going to skip rows

43
00:02:09,009 --> 00:02:12,005
by using the skiprows parameter.

44
00:02:12,005 --> 00:02:15,002
I'm going to skip the first 15 rows,

45
00:02:15,002 --> 00:02:18,002
and then I only want to read 15 rows

46
00:02:18,002 --> 00:02:20,004
so I'm going to use the nrows parameter

47
00:02:20,004 --> 00:02:24,002
to specify that I want to read 15 rows of data.

48
00:02:24,002 --> 00:02:27,008
And let's comment out the previous example.

49
00:02:27,008 --> 00:02:30,003
Now, but the problem here is that if I do this,

50
00:02:30,003 --> 00:02:32,005
I'm going to skip the header row,

51
00:02:32,005 --> 00:02:33,009
and that's going to cause an issue.

52
00:02:33,009 --> 00:02:38,001
So watch as I run this.

53
00:02:38,001 --> 00:02:40,007
Right, we can see that the headers are all messed up, right?

54
00:02:40,007 --> 00:02:41,009
It doesn't know what the headers are

55
00:02:41,009 --> 00:02:44,000
because we skipped over them.

56
00:02:44,000 --> 00:02:47,004
So to fix that, I can specify a function

57
00:02:47,004 --> 00:02:51,002
that returns true if a line should be skipped

58
00:02:51,002 --> 00:02:53,000
and false otherwise.

59
00:02:53,000 --> 00:02:55,008
And I'm going to use an inline lambda function

60
00:02:55,008 --> 00:02:57,002
to achieve this.

61
00:02:57,002 --> 00:02:59,003
So let's go ahead back here.

62
00:02:59,003 --> 00:03:03,003
And what I'm going to do is I'm going to write skiprows=

63
00:03:03,003 --> 00:03:06,001
and I'm going to write lambda x.

64
00:03:06,001 --> 00:03:09,006
So if x is greater than or equal to 1

65
00:03:09,006 --> 00:03:13,000
and x is less than 15, right?

66
00:03:13,000 --> 00:03:15,000
Then that's going to return true,

67
00:03:15,000 --> 00:03:16,003
otherwise it's going to return false.

68
00:03:16,003 --> 00:03:19,002
So this is a big Boolean operation right here.

69
00:03:19,002 --> 00:03:22,002
So it's going to skip the first row,

70
00:03:22,002 --> 00:03:24,000
but not the 0 width row,

71
00:03:24,000 --> 00:03:26,003
which has all of my headers in it.

72
00:03:26,003 --> 00:03:32,005
So let's save this and let's run it again.

73
00:03:32,005 --> 00:03:34,005
And now you can see that the data is correct,

74
00:03:34,005 --> 00:03:35,008
the headers are all correct,

75
00:03:35,008 --> 00:03:37,005
and I've skipped 15 rows,

76
00:03:37,005 --> 00:03:40,009
and I've only read in 15 rows of data.

77
00:03:40,009 --> 00:03:46,002
All right, let's try some of this with an Excel file

78
00:03:46,002 --> 00:03:49,009
and let's comment out the previous example.

79
00:03:49,009 --> 00:03:53,000
So I'm going to read the financial sample file

80
00:03:53,000 --> 00:03:54,002
that we've been using in the course,

81
00:03:54,002 --> 00:03:56,003
which, as we saw earlier, is pretty large, right?

82
00:03:56,003 --> 00:03:58,003
That's 700 rows of data.

83
00:03:58,003 --> 00:04:00,001
So to read the Excel file,

84
00:04:00,001 --> 00:04:02,004
similar to using read_csv,

85
00:04:02,004 --> 00:04:05,006
Pandas provides a read_excel function.

86
00:04:05,006 --> 00:04:07,006
So I'm going to go ahead and call that,

87
00:04:07,006 --> 00:04:09,006
so that's going to give me a data frame,

88
00:04:09,006 --> 00:04:14,005
I'm going to call read_excel.

89
00:04:14,005 --> 00:04:21,004
And I want to read the FinancialSample.xlsx file.

90
00:04:21,004 --> 00:04:23,001
I can also specify some nice things.

91
00:04:23,001 --> 00:04:27,000
I want to specify that I'm going to use only certain columns,

92
00:04:27,000 --> 00:04:32,004
and I'm interested in columns A:E and column H,

93
00:04:32,004 --> 00:04:37,000
and I'm only interested in 15 rows.

94
00:04:37,000 --> 00:04:39,007
All right, so let's print out the data frame

95
00:04:39,007 --> 00:04:40,005
when we do that,

96
00:04:40,005 --> 00:04:45,003
and let's also print out

97
00:04:45,003 --> 00:04:47,005
the data frame types,

98
00:04:47,005 --> 00:04:51,002
so we'll see what types of data are stored in the table.

99
00:04:51,002 --> 00:04:57,006
So let's run this.

100
00:04:57,006 --> 00:04:58,007
And oh, whoops,

101
00:04:58,007 --> 00:05:06,000
looks like I've got

102
00:05:06,000 --> 00:05:07,001
this wrong property.

103
00:05:07,001 --> 00:05:09,008
I believe it's actually dtypes.

104
00:05:09,008 --> 00:05:10,006
There we go.

105
00:05:10,006 --> 00:05:14,008
Let's try this again.

106
00:05:14,008 --> 00:05:15,006
Yep, there we go.

107
00:05:15,006 --> 00:05:18,000
All right, so you can see that when I run this,

108
00:05:18,000 --> 00:05:20,004
we have the columns, right?

109
00:05:20,004 --> 00:05:21,008
So here's all the columns,

110
00:05:21,008 --> 00:05:24,000
we've got 15 rows of data,

111
00:05:24,000 --> 00:05:26,002
and you can also see that I'm only getting

112
00:05:26,002 --> 00:05:31,003
columns A:E and column H.

113
00:05:31,003 --> 00:05:33,008
And we have the data types.

114
00:05:33,008 --> 00:05:37,001
So the text columns are objects,

115
00:05:37,001 --> 00:05:41,009
and these columns are float64s.

116
00:05:41,009 --> 00:05:45,000
We can also get information about an Excel file

117
00:05:45,000 --> 00:05:47,002
by using the ExcelFile class

118
00:05:47,002 --> 00:05:50,003
that is part of the Pandas module.

119
00:05:50,003 --> 00:05:54,006
So let's go ahead and comment this out.

120
00:05:54,006 --> 00:05:58,003
So in this case, I'm going to get a file result,

121
00:05:58,003 --> 00:05:59,008
and on the Pandas object

122
00:05:59,008 --> 00:06:04,008
I'm going to use the ExcelFile class,

123
00:06:04,008 --> 00:06:11,007
and I'm going to pass in the FinancialSample file.

124
00:06:11,007 --> 00:06:14,004
So this will give me a reference to the Excel file,

125
00:06:14,004 --> 00:06:15,006
and then I could do things

126
00:06:15,006 --> 00:06:17,009
like print out some information about the file.

127
00:06:17,009 --> 00:06:20,008
So I can examine the sheet names, for example.

128
00:06:20,008 --> 00:06:23,005
So the sheet names are:,

129
00:06:23,005 --> 00:06:27,009
and I'm going to write file.sheet_names.

130
00:06:27,009 --> 00:06:32,007
All right, let's go ahead and try that,

131
00:06:32,007 --> 00:06:36,002
and let's run.

132
00:06:36,002 --> 00:06:37,002
And we can see, sure enough,

133
00:06:37,002 --> 00:06:40,009
we have these SalesData and Sheet2 sheets.

134
00:06:40,009 --> 00:06:44,001
So, of course, once we've loaded the data into a data frame,

135
00:06:44,001 --> 00:06:46,007
we're going to want to perform some operations on it,

136
00:06:46,007 --> 00:06:50,000
and we will see how to do that later in the chapter.