Python Data Analysis: Introductory pandas Exercises (Part 3)



Python Data Analysis Basics



Exercise 1



Step 1. Go to https://www.kaggle.com/openfoodfacts/world-food-facts/data



Step 2. Download the dataset to your computer and unzip it.



Step 3. Use the tsv file and assign it to a dataframe called food

Code:

# Download the dataset from the URL above, unzip it, then read the tsv file and assign it to food
import pandas as pd
import numpy as np
food = pd.read_csv('en.openfoodfacts.org.products.tsv', sep='\t')

Output:

D:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3020: DtypeWarning: Columns (0,3,5,19,20,24,25,26,27,28,36,37,38,39,48) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
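This DtypeWarning only means pandas had to guess column types chunk by chunk. If you want to silence it, a minimal sketch (assuming the same file) is to read the file in one pass, or to declare an explicit dtype for the offending columns:

import pandas as pd

# Read the whole file in one pass so pandas infers a single dtype per column
food = pd.read_csv('en.openfoodfacts.org.products.tsv', sep='\t', low_memory=False)

# Or force specific columns to a known dtype, e.g. keep the barcode column as a string
food = pd.read_csv('en.openfoodfacts.org.products.tsv', sep='\t', dtype={'code': str})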



Step 4. See the first 5 entries

Code:

# Look at the first 5 rows
food.head()

Output:

code url creator created_t created_datetime last_modified_t last_modified_datetime product_name generic_name quantity fruits-vegetables-nuts_100g fruits-vegetables-nuts-estimate_100g collagen-meat-protein-ratio_100g cocoa_100g chlorophyl_100g carbon-footprint_100g nutrition-score-fr_100g nutrition-score-uk_100g glycemic-index_100g water-hardness_100g
0 3087 http://world-en.openfoodfacts.org/product/0000… openfoodfacts-contributors 1474103866 2016-09-17T09:17:46Z 1474103893 2016-09-17T09:18:13Z Farine de blé noir NaN 1kg NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 4530 http://world-en.openfoodfacts.org/product/0000… usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Banana Chips Sweetened (Whole) NaN NaN NaN NaN NaN NaN NaN NaN 14.0 14.0 NaN NaN
2 4559 http://world-en.openfoodfacts.org/product/0000… usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Peanuts NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN
3 16087 http://world-en.openfoodfacts.org/product/0000… usda-ndb-import 1489055731 2017-03-09T10:35:31Z 1489055731 2017-03-09T10:35:31Z Organic Salted Nut Mix NaN NaN NaN NaN NaN NaN NaN NaN 12.0 12.0 NaN NaN
4 16094 http://world-en.openfoodfacts.org/product/0000… usda-ndb-import 1489055653 2017-03-09T10:34:13Z 1489055653 2017-03-09T10:34:13Z Organic Polenta NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 163 columns



Step 5. What is the number of observations in the dataset?

Code:

# How many observations (rows) does the dataset have
food.shape[0]

Output:

356027



Step 6. What is the number of columns in the dataset?

Code:

# How many columns does the dataset have
food.shape[1]  # 163 (not echoed below because it is not the last expression in the cell)
food.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356027 entries, 0 to 356026
Columns: 163 entries, code to water-hardness_100g
dtypes: float64(107), object(56)
memory usage: 442.8+ MB



Step 7. Print the name of all the columns.

Code:

# All the column names
food.columns

Output:

Index(['code', 'url', 'creator', 'created_t', 'created_datetime',
       'last_modified_t', 'last_modified_datetime', 'product_name',
       'generic_name', 'quantity',
       ...
       'fruits-vegetables-nuts_100g', 'fruits-vegetables-nuts-estimate_100g',
       'collagen-meat-protein-ratio_100g', 'cocoa_100g', 'chlorophyl_100g',
       'carbon-footprint_100g', 'nutrition-score-fr_100g',
       'nutrition-score-uk_100g', 'glycemic-index_100g',
       'water-hardness_100g'],
      dtype='object', length=163)



Step 8. What is the name of the 105th column?

Code:

# Name of the 105th column
food.columns[104]

Output:

'-glucose_100g'



Step 9. What is the type of the observations of the 105th column?

Code:

# Data type of the 105th column
food.dtypes[food.columns[104]]

Output:

dtype('float64')
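Since the column label is not known ahead of time, purely positional indexing also works; a minimal sketch on the same food DataFrame:

# dtypes is indexed like the columns, so positional access gives the same answer
food.dtypes.iloc[104]      # dtype of the 105th column
food.iloc[:, 104].dtype    # or grab the column itself and ask for its dtype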



Step 10. How is the dataset indexed?

Code:

# How is the dataset indexed
food.index

Output:

RangeIndex(start=0, stop=356027, step=1)



Step 11. What is the product name of the 19th observation?

Code:

# What is the product_name of the 19th observation
food.values[18][7]

Output:

'Lotus Organic Brown Jasmine Rice'

The following two lines give the same result using different indexing methods:

food.iloc[18]['product_name']
'Lotus Organic Brown Jasmine Rice'
food.loc[18]['product_name']
'Lotus Organic Brown Jasmine Rice'
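Note that .loc[18] and .iloc[18] only agree here because the DataFrame still has its default RangeIndex: .loc selects by label while .iloc selects by position. A hedged sketch of the more direct, single-step forms:

# Label-based: row label 18, column label 'product_name'
food.loc[18, 'product_name']

# Position-based: 19th row, with the column position looked up explicitly
food.iloc[18, food.columns.get_loc('product_name')]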



Exercise 2 – Getting and Knowing your Data

This time we are going to pull data directly from the internet.



Step 1. Import the necessary libraries

Code:

# Import the libraries we need
import pandas as pd
import numpy as np



Step 2. Import the dataset from this address.



Step 3. Assign it to a variable called chipo.

Code:

# Import the dataset and assign it to the variable chipo
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep = '\t')



Step 4. See the first 10 entries

Code:

# Look at the first 10 rows
chipo.head(10)

Output:

order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans… $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou… $10.98
6 3 1 Side of Chips NaN $1.69
7 4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables… $11.75
8 4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Ch… $9.25
9 5 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto… $9.25



Step 5. What is the number of observations in the dataset?

Code:

# Solution 1
# Number of observations in the dataset
chipo.shape[0]  # entries <= 4622 observations

Output:

4622

The following code gives a similar result:

# Solution 2
chipo.info()  # entries <= 4622 observations

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
order_id              4622 non-null int64
quantity              4622 non-null int64
item_name             4622 non-null object
choice_description    3376 non-null object
item_price            4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.6+ KB



Step 6. What is the number of columns in the dataset?

Code:

# Number of columns in the dataset
chipo.shape[1]

Output:

5



Step 7. Print the name of all the columns.

Code:

# Print all the column names
chipo.columns

Output:

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')



Step 8. How is the dataset indexed?

Code:

# How is the dataset indexed
chipo.index

Output:

RangeIndex(start=0, stop=4622, step=1)



Step 9. Which was the most-ordered item?

Code:

# Which item was ordered the most
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)

Output:

order_id quantity
item_name
Chicken Bowl 713926 761
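A hedged shortcut on the same chipo DataFrame that skips the intermediate sort:

# Sum the quantity per item, then take the label with the largest total
chipo.groupby('item_name')['quantity'].sum().idxmax()   # 'Chicken Bowl'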



Step 10. For the most-ordered item, how many items were ordered?

Code:

# How many of the most-ordered item were ordered; same code as the previous step
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)

Output:

order_id quantity
item_name
Chicken Bowl 713926 761



Step 11. What was the most ordered item in the choice_description column?

Code:

# Which item in the choice_description column was ordered the most?
c = chipo.groupby('choice_description').sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)

Output:

order_id quantity
choice_description
[Diet Coke] 123455 159
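Note that groupby drops the rows where choice_description is NaN (only 3376 of the 4622 rows have a value, as the earlier info() output showed). A hedged one-liner that returns the same top entry:

# Quantity summed per choice_description, keeping only the single largest total
chipo.groupby('choice_description')['quantity'].sum().nlargest(1)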



Step 12. How many items were ordered in total?

Code:

# How many items were ordered in total?
total_items_orders = chipo.quantity.sum()
total_items_orders 

Output:

4972



Step 13. Turn the item price into a float



Step 13.a. Check the item price type

Code:

# Check the type of item_price before converting it to a float
chipo.item_price.dtype

Output:

dtype('O')



Step 13.b. Create a lambda function and change the type of item price

Code:

dollarizer = lambda x: float(x[1:-1])
chipo.item_price = chipo.item_price.apply(dollarizer)
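A hedged, vectorized alternative to the lambda above (run instead of it, not after it), assuming every price string starts with a '$':

# Strip the leading '$' and cast the whole column at once
chipo.item_price = chipo.item_price.str.lstrip('$').astype(float)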



Step 13.c. Check the item price type

Code:

chipo.item_price.dtype

Output:

dtype('float64')



Step 14. How much was the revenue for the period in the dataset?

Code:

# How much was the revenue for the period covered by the dataset?
revenue = (chipo['quantity'] * chipo['item_price']).sum()
print('Revenue was:  $' + str(np.round(revenue, 2))) 
# np.round(revenue, 2) rounds the float to two decimal places

Output:

Revenue was:  $39237.02



Step 15. How many orders were made in the period?

Code:

# How many orders were made in the period?
orders = chipo.order_id.value_counts().count()
# value_counts() tallies how many rows each distinct order_id has; count() then counts how many distinct order_ids there are
orders
orders

Output:

1834
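A hedged shortcut that returns the same count in one step:

# nunique() counts the distinct order_id values directly
chipo.order_id.nunique()   # 1834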



Step 16. What is the average revenue amount per order?

Code:

# Solution 1
# What is the average revenue per order?
chipo['revenue'] = (chipo['quantity'] * chipo['item_price'])
order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['revenue']

Output:

21.394231188658654

The code below does the same thing in a single line.

# Solution 2
chipo.groupby(by=['order_id']).sum().mean()['revenue']
21.394231188658654
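A hedged variant that computes the same figure without adding a revenue column to chipo, by grouping one Series by another:

# Per-row revenue, summed per order, then averaged across orders
(chipo.quantity * chipo.item_price).groupby(chipo.order_id).sum().mean()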



Step 17. How many different items are sold?

Code:

# How many different items were sold?
sold_items = chipo.item_name.value_counts().count()
sold_items

Output:

50



Exercise 3 – Getting and Knowing your Data

This time we are going to pull data directly from the internet.

(The remaining steps are similar to the ones above, so they are given without further commentary.)



Step 1. Import the necessary libraries

Code:

import pandas as pd



Step 2. Import the dataset from this address.



Step 3. Assign it to a variable called users and use the ‘user_id’ as index

Code:

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'
users = pd.read_table(url, sep='|', index_col = 'user_id')
# sep/delimiter is the character sequence used to split each line into fields
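For reference, read_table and read_csv share the same parser and differ only in their default separator, so an equivalent hedged sketch is:

# Same result with read_csv and an explicit '|' separator
users = pd.read_csv(url, sep='|', index_col='user_id')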



Step 4. See the first 25 entries

Code:

users.head(25)

Output:

age gender occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
6 42 M executive 98101
7 57 M administrator 91344
8 36 M administrator 05201
9 29 M student 01002
10 53 M lawyer 90703
11 39 F other 30329
12 28 F other 06405
13 47 M educator 29206
14 45 M scientist 55106
15 49 F educator 97301
16 21 M entertainment 10309
17 30 M programmer 06355
18 35 F other 37212
19 40 M librarian 02138
20 42 F homemaker 95660
21 26 M writer 30068
22 25 M writer 40206
23 30 F artist 48197
24 21 F artist 94533
25 39 M engineer 55107



Step 5. See the last 10 entries

Code:

users.tail(10)

Output:

age gender occupation zip_code
user_id
934 61 M engineer 22902
935 42 M doctor 66221
936 24 M other 32789
937 48 M educator 98072
938 38 F technician 55038
939 26 F student 33319
940 32 M administrator 02215
941 20 M student 97229
942 48 F librarian 78209
943 22 M student 77841



Step 6. What is the number of observations in the dataset?

Code:

users.shape[0]

Output:

943



Step 7. What is the number of columns in the dataset?

Code:

users.shape[1]

Output:

4



Step 8. Print the name of all the columns.

Code:

users.columns

Output:

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')



Step 9. How is the dataset indexed?

Code:

users.index

Output:

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
           dtype='int64', name='user_id', length=943)



Step 10. What is the data type of each column?

Code:

users.dtypes

Output:

age            int64
gender        object
occupation    object
zip_code      object
dtype: object



Step 11. Print only the occupation column

Code:

users['occupation']  # or users.occupation

Output:

user_id
1         technician
2              other
3             writer
4         technician
5              other
6          executive
7      administrator
8      administrator
9            student
10            lawyer
11             other
12             other
13          educator
14         scientist
15          educator
16     entertainment
17        programmer
18             other
19         librarian
20         homemaker
21            writer
22            writer
23            artist
24            artist
25          engineer
26          engineer
27         librarian
28            writer
29        programmer
30           student
           ...      
914            other
915    entertainment
916         engineer
917          student
918        scientist
919            other
920           artist
921          student
922    administrator
923          student
924            other
925         salesman
926    entertainment
927       programmer
928          student
929        scientist
930        scientist
931         educator
932         educator
933          student
934         engineer
935           doctor
936            other
937         educator
938       technician
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object



Step 12. How many different occupations are there in this dataset?

Code:

users.occupation.nunique()

Output:

21



Step 13. What is the most frequent occupation?

Code:

users.occupation.value_counts().head()

Output:

student          196
other            105
educator          95
administrator     79
engineer          67
Name: occupation, dtype: int64
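If only the single most frequent occupation is needed, a hedged one-liner:

# idxmax() returns the label with the highest count
users.occupation.value_counts().idxmax()   # 'student'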



Step 14. Summarize the DataFrame.

Code:

users.describe()

Output:

age
count 943.000000
mean 34.051962
std 12.192740
min 7.000000
25% 25.000000
50% 31.000000
75% 43.000000
max 73.000000



Step 15. Summarize all the columns

Code:

users.describe(include='all')  # Notice: by default, describe() summarizes only the numeric columns

Output:

age gender occupation zip_code
count 943.000000 943 943 943
unique NaN 2 21 795
top NaN M student 55414
freq NaN 670 196 9
mean 34.051962 NaN NaN NaN
std 12.192740 NaN NaN NaN
min 7.000000 NaN NaN NaN
25% 25.000000 NaN NaN NaN
50% 31.000000 NaN NaN NaN
75% 43.000000 NaN NaN NaN
max 73.000000 NaN NaN NaN



Step 16. Summarize only the occupation column

Code:

users.occupation.describe()

Output:

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object



Step 17. What is the mean age of users?

Code:

round(users.age.mean())

Output:

34



Step 18. What is the age with least occurrence?

Code:

users.age.value_counts().tail()  #7, 10, 11, 66 and 73 years -> only 1 occurrence

Output:

11    1
10    1
73    1
66    1
7     1
Name: age, dtype: int64
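tail() only shows the last five rows of the sorted counts; a hedged way to list every age that ties for the minimum number of occurrences:

# Keep each age whose count equals the smallest count observed
counts = users.age.value_counts()
counts[counts == counts.min()]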



Closing Remarks

This article was written in Anaconda's Jupyter Notebook, which later posts will keep using as well. It is genuinely convenient and I recommend it; let's keep learning together!



Copyright notice: this is an original article by baobaobao0000, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.