Python Data Analysis: Introductory pandas Exercises (Part 3)



Python Data Analysis Basics



Exercise 1



Step 1. Go to https://www.kaggle.com/openfoodfacts/world-food-facts/data



Step 2. Download the dataset to your computer and unzip it.



Step 3. Use the tsv file and assign it to a dataframe called food

Code:

# Download the dataset from the URL above, unzip it, then read the tsv file and assign it to food
import pandas as pd
import numpy as np
food = pd.read_csv('en.openfoodfacts.org.products.tsv', sep='\t')

Output:

D:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3020: DtypeWarning: Columns (0,3,5,19,20,24,25,26,27,28,36,37,38,39,48) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
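This DtypeWarning only means pandas had to guess column types chunk by chunk. If you want to silence it, a minimal sketch (assuming the same file) is to read the file in one pass, or to declare an explicit dtype for the offending columns:

import pandas as pd

# Read the whole file in one pass so pandas infers a single dtype per column
food = pd.read_csv('en.openfoodfacts.org.products.tsv', sep='\t', low_memory=False)

# Or force specific columns to a known dtype, e.g. keep the barcode column as a string
food = pd.read_csv('en.openfoodfacts.org.products.tsv', sep='\t', dtype={'code': str})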



Step 4. See the first 5 entries

Code:

# Look at the first 5 rows
food.head()

Output:

code url creator created_t created_datetime last_modified_t last_modified_datetime product_name generic_name quantity fruits-vegetables-nuts_100g fruits-vegetables-nuts-estimate_100g collagen-meat-protein-ratio_100g cocoa_100g chlorophyl_100g carbon-footprint_100g nutrition-score-fr_100g nutrition-score-uk_100g glycemic-index_100g water-hardness_100g
0 3087 http://world-en.openfoodfacts.org/product/0000… openfoodfacts-contributors 1474103866 2016-09-17T09:17:46Z 1474103893 2016-09-17T09:18:13Z Farine de blé noir NaN 1kg NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 4530 http://world-en.openfoodfacts.org/product/0000… usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Banana Chips Sweetened (Whole) NaN NaN NaN NaN NaN NaN NaN NaN 14.0 14.0 NaN NaN
2 4559 http://world-en.openfoodfacts.org/product/0000… usda-ndb-import 1489069957 2017-03-09T14:32:37Z 1489069957 2017-03-09T14:32:37Z Peanuts NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN NaN
3 16087 http://world-en.openfoodfacts.org/product/0000… usda-ndb-import 1489055731 2017-03-09T10:35:31Z 1489055731 2017-03-09T10:35:31Z Organic Salted Nut Mix NaN NaN NaN NaN NaN NaN NaN NaN 12.0 12.0 NaN NaN
4 16094 http://world-en.openfoodfacts.org/product/0000… usda-ndb-import 1489055653 2017-03-09T10:34:13Z 1489055653 2017-03-09T10:34:13Z Organic Polenta NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 163 columns



Step 5. What is the number of observations in the dataset?

Code:

# How many observations (rows) does the dataset have
food.shape[0]

Output:

356027



Step 6. What is the number of columns in the dataset?

Code:

# How many columns does the dataset have
food.shape[1]  # 163 (not echoed below because it is not the last expression in the cell)
food.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356027 entries, 0 to 356026
Columns: 163 entries, code to water-hardness_100g
dtypes: float64(107), object(56)
memory usage: 442.8+ MB



Step 7. Print the name of all the columns.

Code:

# All the column names
food.columns

Output:

Index(['code', 'url', 'creator', 'created_t', 'created_datetime',
       'last_modified_t', 'last_modified_datetime', 'product_name',
       'generic_name', 'quantity',
       ...
       'fruits-vegetables-nuts_100g', 'fruits-vegetables-nuts-estimate_100g',
       'collagen-meat-protein-ratio_100g', 'cocoa_100g', 'chlorophyl_100g',
       'carbon-footprint_100g', 'nutrition-score-fr_100g',
       'nutrition-score-uk_100g', 'glycemic-index_100g',
       'water-hardness_100g'],
      dtype='object', length=163)



Step 8. What is the name of the 105th column?

Code:

# Name of the 105th column
food.columns[104]

Output:

'-glucose_100g'



Step 9. What is the type of the observations of the 105th column?

Code:

# Data type of the 105th column
food.dtypes[food.columns[104]]

Output:

dtype('float64')
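Since the column label is not known ahead of time, purely positional indexing also works; a minimal sketch on the same food DataFrame:

# dtypes is indexed like the columns, so positional access gives the same answer
food.dtypes.iloc[104]      # dtype of the 105th column
food.iloc[:, 104].dtype    # or grab the column itself and ask for its dtype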



Step 10. How is the dataset indexed?

Code:

# How is the dataset indexed
food.index

Output:

RangeIndex(start=0, stop=356027, step=1)



Step 11. What is the product name of the 19th observation?

Code:

# What is the product_name of the 19th observation
food.values[18][7]

Output:

'Lotus Organic Brown Jasmine Rice'

The following two lines give the same result using different indexing methods:

food.iloc[18]['product_name']
'Lotus Organic Brown Jasmine Rice'
food.loc[18]['product_name']
'Lotus Organic Brown Jasmine Rice'
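Note that .loc[18] and .iloc[18] only agree here because the DataFrame still has its default RangeIndex: .loc selects by label while .iloc selects by position. A hedged sketch of the more direct, single-step forms:

# Label-based: row label 18, column label 'product_name'
food.loc[18, 'product_name']

# Position-based: 19th row, with the column position looked up explicitly
food.iloc[18, food.columns.get_loc('product_name')]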



Exercise 2 – Getting and Knowing your Data

This time we are going to pull data directly from the internet.



Step 1. Import the necessary libraries

Code:

# Import the libraries we need
import pandas as pd
import numpy as np



Step 2. Import the dataset from this address.



Step 3. Assign it to a variable called chipo.

Code:

# Import the dataset and assign it to the variable chipo
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep = '\t')



Step 4. See the first 10 entries

Code:

# Look at the first 10 rows
chipo.head(10)

Output:

order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans… $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou… $10.98
6 3 1 Side of Chips NaN $1.69
7 4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables… $11.75
8 4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Ch… $9.25
9 5 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto… $9.25



Step 5. What is the number of observations in the dataset?

Code:

# Solution 1
# Number of observations in the dataset
chipo.shape[0]  # entries <= 4622 observations

Output:

4622

The following code gives a similar result:

# Solution 2
chipo.info()  # entries <= 4622 observations

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
order_id              4622 non-null int64
quantity              4622 non-null int64
item_name             4622 non-null object
choice_description    3376 non-null object
item_price            4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.6+ KB



Step 6. What is the number of columns in the dataset?

Code:

# Number of columns in the dataset
chipo.shape[1]

Output:

5



Step 7. Print the name of all the columns.

Code:

# Print all the column names
chipo.columns

Output:

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')



Step 8. How is the dataset indexed?

Code:

# How is the dataset indexed
chipo.index

Output:

RangeIndex(start=0, stop=4622, step=1)



Step 9. Which was the most-ordered item?

Code:

# Which item was ordered the most
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)

Output:

order_id quantity
item_name
Chicken Bowl 713926 761
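A hedged shortcut on the same chipo DataFrame that skips the intermediate sort:

# Sum the quantity per item, then take the label with the largest total
chipo.groupby('item_name')['quantity'].sum().idxmax()   # 'Chicken Bowl'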



Step 10. For the most-ordered item, how many items were ordered?

Code:

# How many of the most-ordered item were ordered; same code as the previous step
c = chipo.groupby('item_name')
c = c.sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)

Output:

order_id quantity
item_name
Chicken Bowl 713926 761



Step 11. What was the most ordered item in the choice_description column?

Code:

# Which item in the choice_description column was ordered the most?
c = chipo.groupby('choice_description').sum()
c = c.sort_values(['quantity'], ascending=False)
c.head(1)

Output:

order_id quantity
choice_description
[Diet Coke] 123455 159
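Note that groupby drops the rows where choice_description is NaN (only 3376 of the 4622 rows have a value, as the earlier info() output showed). A hedged one-liner that returns the same top entry:

# Quantity summed per choice_description, keeping only the single largest total
chipo.groupby('choice_description')['quantity'].sum().nlargest(1)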



Step 12. How many items were ordered in total?

Code:

# How many items were ordered in total?
total_items_orders = chipo.quantity.sum()
total_items_orders 

Output:

4972



Step 13. Turn the item price into a float



Step 13.a. Check the item price type

Code:

# Check the type of item_price before converting it to a float
chipo.item_price.dtype

Output:

dtype('O')



Step 13.b. Create a lambda function and change the type of item price

Code:

dollarizer = lambda x: float(x[1:-1])
chipo.item_price = chipo.item_price.apply(dollarizer)
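A hedged, vectorized alternative to the lambda above (run instead of it, not after it), assuming every price string starts with a '$':

# Strip the leading '$' and cast the whole column at once
chipo.item_price = chipo.item_price.str.lstrip('$').astype(float)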



Step 13.c. Check the item price type

Code:

chipo.item_price.dtype

Output:

dtype('float64')



Step 14. How much was the revenue for the period in the dataset?

Code:

# How much was the revenue for the period covered by the dataset?
revenue = (chipo['quantity'] * chipo['item_price']).sum()
print('Revenue was:  $' + str(np.round(revenue, 2))) 
# np.round(revenue, 2) rounds the float to two decimal places

Output:

Revenue was:  $39237.02



Step 15. How many orders were made in the period?

Code:

# How many orders were made in the period?
orders = chipo.order_id.value_counts().count()
# value_counts() tallies how many rows each distinct order_id has; count() then counts how many distinct order_ids there are
orders
orders

Output:

1834
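A hedged shortcut that returns the same count in one step:

# nunique() counts the distinct order_id values directly
chipo.order_id.nunique()   # 1834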



Step 16. What is the average revenue amount per order?

Code:

# Solution 1
# What is the average revenue per order?
chipo['revenue'] = (chipo['quantity'] * chipo['item_price'])
order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['revenue']

Output:

21.394231188658654

The code below does the same thing in a single line.

# Solution 2
chipo.groupby(by=['order_id']).sum().mean()['revenue']
21.394231188658654
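A hedged variant that computes the same figure without adding a revenue column to chipo, by grouping one Series by another:

# Per-row revenue, summed per order, then averaged across orders
(chipo.quantity * chipo.item_price).groupby(chipo.order_id).sum().mean()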



Step 17. How many different items are sold?

Code:

# How many different items were sold?
sold_items = chipo.item_name.value_counts().count()
sold_items

Output:

50



Exercise 3 – Getting and Knowing your Data

This time we are going to pull data directly from the internet.

(The remaining steps are similar to the ones above, so they are given without further commentary.)



Step 1. Import the necessary libraries

Code:

import pandas as pd



Step 2. Import the dataset from this address.



Step 3. Assign it to a variable called users and use the ‘user_id’ as index

Code:

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'
users = pd.read_table(url, sep='|', index_col = 'user_id')
# sep/delimiter is the character sequence used to split each line into fields
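For reference, read_table and read_csv share the same parser and differ only in their default separator, so an equivalent hedged sketch is:

# Same result with read_csv and an explicit '|' separator
users = pd.read_csv(url, sep='|', index_col='user_id')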



Step 4. See the first 25 entries

Code:

users.head(25)

Output:

age gender occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
6 42 M executive 98101
7 57 M administrator 91344
8 36 M administrator 05201
9 29 M student 01002
10 53 M lawyer 90703
11 39 F other 30329
12 28 F other 06405
13 47 M educator 29206
14 45 M scientist 55106
15 49 F educator 97301
16 21 M entertainment 10309
17 30 M programmer 06355
18 35 F other 37212
19 40 M librarian 02138
20 42 F homemaker 95660
21 26 M writer 30068
22 25 M writer 40206
23 30 F artist 48197
24 21 F artist 94533
25 39 M engineer 55107



Step 5. See the last 10 entries

Code:

users.tail(10)

Output:

age gender occupation zip_code
user_id
934 61 M engineer 22902
935 42 M doctor 66221
936 24 M other 32789
937 48 M educator 98072
938 38 F technician 55038
939 26 F student 33319
940 32 M administrator 02215
941 20 M student 97229
942 48 F librarian 78209
943 22 M student 77841



Step 6. What is the number of observations in the dataset?

Code:

users.shape[0]

Output:

943



Step 7. What is the number of columns in the dataset?

Code:

users.shape[1]

Output:

4



Step 8. Print the name of all the columns.

Code:

users.columns

Output:

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')



Step 9. How is the dataset indexed?

Code:

users.index

Output:

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
           dtype='int64', name='user_id', length=943)



Step 10. What is the data type of each column?

Code:

users.dtypes

Output:

age            int64
gender        object
occupation    object
zip_code      object
dtype: object



Step 11. Print only the occupation column

Code:

users['occupation']  # or users.occupation

Output:

user_id
1         technician
2              other
3             writer
4         technician
5              other
6          executive
7      administrator
8      administrator
9            student
10            lawyer
11             other
12             other
13          educator
14         scientist
15          educator
16     entertainment
17        programmer
18             other
19         librarian
20         homemaker
21            writer
22            writer
23            artist
24            artist
25          engineer
26          engineer
27         librarian
28            writer
29        programmer
30           student
           ...      
914            other
915    entertainment
916         engineer
917          student
918        scientist
919            other
920           artist
921          student
922    administrator
923          student
924            other
925         salesman
926    entertainment
927       programmer
928          student
929        scientist
930        scientist
931         educator
932         educator
933          student
934         engineer
935           doctor
936            other
937         educator
938       technician
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object



Step 12. How many different occupations are there in this dataset?

Code:

users.occupation.nunique()

Output:

21



Step 13. What is the most frequent occupation?

Code:

users.occupation.value_counts().head()

Output:

student          196
other            105
educator          95
administrator     79
engineer          67
Name: occupation, dtype: int64
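If only the single most frequent occupation is needed, a hedged one-liner:

# idxmax() returns the label with the highest count
users.occupation.value_counts().idxmax()   # 'student'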



Step 14. Summarize the DataFrame.

Code:

users.describe()

Output:

age
count 943.000000
mean 34.051962
std 12.192740
min 7.000000
25% 25.000000
50% 31.000000
75% 43.000000
max 73.000000



Step 15. Summarize all the columns

Code:

users.describe(include='all')  # Notice: by default, describe() summarizes only the numeric columns

Output:

age gender occupation zip_code
count 943.000000 943 943 943
unique NaN 2 21 795
top NaN M student 55414
freq NaN 670 196 9
mean 34.051962 NaN NaN NaN
std 12.192740 NaN NaN NaN
min 7.000000 NaN NaN NaN
25% 25.000000 NaN NaN NaN
50% 31.000000 NaN NaN NaN
75% 43.000000 NaN NaN NaN
max 73.000000 NaN NaN NaN



Step 16. Summarize only the occupation column

Code:

users.occupation.describe()

Output:

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object



Step 17. What is the mean age of users?

Code:

round(users.age.mean())

Output:

34



Step 18. What is the age with least occurrence?

Code:

users.age.value_counts().tail()  #7, 10, 11, 66 and 73 years -> only 1 occurrence

Output:

11    1
10    1
73    1
66    1
7     1
Name: age, dtype: int64
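tail() only shows the last five rows of the sorted counts; a hedged way to list every age that ties for the minimum number of occurrences:

# Keep each age whose count equals the smallest count observed
counts = users.age.value_counts()
counts[counts == counts.min()]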



Closing Remarks

This article was written in Anaconda's Jupyter Notebook, which later posts will keep using as well. It is genuinely convenient and I recommend it; let's keep learning together!



Copyright notice: this is an original article by baobaobao0000, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.