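I. unique(): unique values
Purpose: Series.unique() returns the unique values of a Series in order of first appearance. Uniqueness is determined with a hash table, so the result is not sorted.
- Syntax: Series.unique()
- Returns: ndarray or ExtensionArray containing the unique values.
- Not available on DataFrame; call it on a single column (a Series) instead.
Note: for most dtypes the unique values come back as a NumPy array. If the Series is backed by an extension array, a new ExtensionArray of that type is returned, holding only the unique values. This covers:
- Categorical
- Period
- Datetime with timezone
- Interval
- Sparse
- IntegerNA (nullable integer)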
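The return types listed above can be checked with a short sketch. It assumes pandas 1.0 or later (for the nullable "Int64" dtype); the variable names and sample values are illustrative only.

```python
import numpy as np
import pandas as pd

# Plain integer data: unique() returns a NumPy array, in order of first appearance.
plain = pd.Series([3, 1, 3, 2]).unique()

# Categorical data: unique() returns a Categorical (an ExtensionArray).
cat = pd.Series(pd.Categorical(list("banana"))).unique()

# Timezone-aware datetimes: unique() returns a DatetimeArray.
stamps = pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]).tz_localize("UTC")
tz = pd.Series(stamps).unique()

# Nullable integers ("IntegerNA"): unique() returns an IntegerArray; <NA> counts as one value.
nullable = pd.Series([1, 1, None], dtype="Int64").unique()

for name, arr in [("plain", plain), ("cat", cat), ("tz", tz), ("nullable", nullable)]:
    print(name, type(arr).__name__, list(arr))
```

Only the first call yields an ndarray; the other three keep their extension type, so dtype-specific information such as categories or the timezone is preserved.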
import pandas as pd
# Unique values: .unique()
s = pd.Series(list('asdvasdcfgg'))
print("s = \n{0} \ntype(s) = {1}".format(s, type(s)))
print('-' * 200)
# Get an array of the unique values
sq = s.unique()
print("Array of unique values: sq = s.unique() = \n{0} \ntype(sq) = {1}".format(sq, type(sq)))
print('-' * 200)
# Wrap the array in pd.Series to get a new Series
s2 = pd.Series(sq)
print("s2 = \n{0} \ntype(s2) = {1}".format(s2, type(s2)))
print('-' * 200)
# Sort the array in place (ndarray.sort() modifies sq and returns None)
sq.sort()
print("sq after sorting: sq = \n{0} \ntype(sq) = {1}".format(sq, type(sq)))
s3 = pd.Series(sq)
print('-' * 50)
print("s3 = \n{0} \ntype(s3) = {1}".format(s3, type(s3)))
print('-' * 200)
Output:
s =
0 a
1 s
2 d
3 v
4 a
5 s
6 d
7 c
8 f
9 g
10 g
dtype: object
type(s) = <class 'pandas.core.series.Series'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Array of unique values: sq = s.unique() =
['a' 's' 'd' 'v' 'c' 'f' 'g']
type(sq) = <class 'numpy.ndarray'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
s2 =
0 a
1 s
2 d
3 v
4 c
5 f
6 g
dtype: object
type(s2) = <class 'pandas.core.series.Series'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sq after sorting: sq =
['a' 'c' 'd' 'f' 'g' 's' 'v']
type(sq) = <class 'numpy.ndarray'>
--------------------------------------------------
s3 =
0 a
1 c
2 d
3 f
4 g
5 s
6 v
dtype: object
type(s3) = <class 'pandas.core.series.Series'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
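Because DataFrame has no unique() method, the usual workarounds are to select a column first or to deduplicate whole rows. The following is a minimal sketch with made-up data; the names df, key1_unique and so on are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"key1": ["a", "a", "b", "b"],
                   "key2": [1, 1, 2, 3]})

# Unique values of one column: select the column (a Series) first.
key1_unique = df["key1"].unique()

# Number of distinct values per column.
distinct_counts = df.nunique()

# Unique rows: drop_duplicates() keeps the first occurrence of each row.
unique_rows = df.drop_duplicates()

# Sorted unique values without mutating the array returned by unique().
key2_sorted = sorted(df["key2"].unique())

print(list(key1_unique), key2_sorted)
print(unique_rows)
```

Using sorted() here returns a new list, which avoids the in-place sq.sort() used in the example above when the original order is still needed.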
II. value_counts(): value counts (e.g. word frequency)
1. Series
import pandas as pd
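# Value counts: .value_counts()
s = pd.Series(list('asdvasdcfgg'))
print("s = \n{0} \ntype(s) = {1}".format(s, type(s)))
print('-' * 200)
# Returns a new Series with the number of occurrences of each distinct value
# sort parameter: sort by count, default True; with sort=False the row order may vary across pandas versions
sc = s.value_counts(sort=False)  # the top-level pd.value_counts(s, sort=False) is equivalent, but deprecated since pandas 2.1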
print("sc = \n{0} \ntype(sc) = {1}".format(sc, type(sc)))
Output:
s =
0 a
1 s
2 d
3 v
4 a
5 s
6 d
7 c
8 f
9 g
10 g
dtype: object
type(s) = <class 'pandas.core.series.Series'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sc =
a 2
s 2
v 1
c 1
g 2
f 1
d 2
dtype: int64
type(sc) = <class 'pandas.core.series.Series'>
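The word-frequency use case mentioned in the heading can be sketched as follows. The sentence is made up for illustration; normalize and ascending are regular parameters of Series.value_counts().

```python
import pandas as pd

words = pd.Series("the cat and the hat and the bat".split())

# Counts sorted from most to least frequent (sort=True is the default).
counts = words.value_counts()

# Relative frequencies instead of raw counts.
freqs = words.value_counts(normalize=True)

# Least frequent first.
rare_first = words.value_counts(ascending=True)

print(counts.head(2))
```

With the default sort=True the most frequent word sits at the top, so counts.head(n) gives a simple top-n list.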
2. DataFrame
import pandas as pd
# Value counts: .value_counts() (DataFrame.value_counts requires pandas 1.1 or later)
df = pd.DataFrame({'key1': ['a', 'a', 3, 4, 5],
                   'key2': ['e', 'a', 'b', 'b', 'c'],
                   'key3': ['d', 'f', 'a', 3, 5]})
print("df = \n{0} \ntype(df) = {1}".format(df, type(df)))
print('-' * 200)
# Returns a Series (not a DataFrame) indexed by the distinct rows, counting how often each row occurs
# sort parameter: sort by count, default True
sc = df.value_counts(sort=False)
print("sc = \n{0} \ntype(sc) = {1}".format(sc, type(sc)))
Output:
df =
key1 key2 key3
0 a e d
1 a a f
2 3 b a
3 4 b 3
4 5 c 5
type(df) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sc =
key1  key2  key3
3     b     a       1
4     b     3       1
5     c     5       1
a     a     f       1
      e     d       1
dtype: int64
type(sc) = <class 'pandas.core.series.Series'>
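Counting whole rows is rarely informative when every row is distinct, as in the output above. A hedged sketch of counting combinations of selected columns instead, using the subset parameter of DataFrame.value_counts() (pandas 1.1 or later) on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"key1": ["a", "a", "b", "b", "b"],
                   "key2": ["x", "x", "x", "y", "y"],
                   "key3": [1, 2, 3, 4, 5]})

# Count distinct (key1, key2) combinations, ignoring key3.
pair_counts = df.value_counts(subset=["key1", "key2"])

# The same counts via groupby; only the ordering of the result may differ.
grouped = df.groupby(["key1", "key2"]).size()

print(pair_counts)
```

The result is a Series with a MultiIndex, so a single combination can be looked up with a tuple, for example pair_counts.loc[("a", "x")].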
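III. isin(): membership test
import numpy as np
import pandas as pd
# Membership test: .isin()
# Pass the candidate values as a list, e.g. [5, 14]
# Returns a boolean Series or DataFrame of the same shape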
s = pd.Series(np.arange(10, 15))
print("s = \n{0} \ntype(s) = {1}".format(s, type(s)))
print('-' * 50)
print("s.isin([5, 14]) = \n", s.isin([5, 14]))
print('-' * 200)
df = pd.DataFrame({'key1': list('asdcbvasd'),
'key2': np.arange(4, 13)})
print("df = \n{0} \ntype(df) = {1}".format(df, type(df)))
print('-' * 50)
print("df.isin(['a', 'bc', '10', 8]) = \n", df.isin(['a', 'bc', '10', 8]))
print('-' * 200)
Output:
s =
0 10
1 11
2 12
3 13
4 14
dtype: int32
type(s) = <class 'pandas.core.series.Series'>
--------------------------------------------------
s.isin([5, 14]) =
0 False
1 False
2 False
3 False
4 True
dtype: bool
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
df =
key1 key2
0 a 4
1 s 5
2 d 6
3 c 7
4 b 8
5 v 9
6 a 10
7 s 11
8 d 12
type(df) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df.isin(['a', 'bc', '10', 8]) =
key1 key2
0 True False
1 False False
2 False False
3 False False
4 False True
5 False False
6 True False
7 False False
8 False False
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
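The boolean result of isin() is mostly used as a filter. A minimal sketch on a shortened version of the frame above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key1": list("asdcb"),
                   "key2": np.arange(4, 9)})

# Keep only the rows whose key1 is in the list.
selected = df[df["key1"].isin(["a", "b"])]

# Negate the mask with ~ to keep everything else.
rest = df[~df["key1"].isin(["a", "b"])]

# A dict restricts each column to its own set of candidate values.
per_column = df.isin({"key1": ["a"], "key2": [8]})

print(selected)
print(per_column)
```

Passing a dict avoids accidental matches across columns, which can happen when a single flat list is checked against every column of the DataFrame.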
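IV. duplicated: finding and removing duplicates
import pandas as pd
# Deduplication: .duplicated()
# Using duplicated on a Series
s = pd.Series([1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 5, 5, 5])
# Flag duplicates: every repeat after the first occurrence is marked True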
data1 = s.duplicated()
print("data1 = s.duplicated() = \n", data1)
print('-' * 50)
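# Boolean indexing keeps the non-duplicated values (equivalent to s[~s.duplicated()])
print("s[s.duplicated() == False] = \n", s[s.duplicated() == False])
print('-' * 200)
# drop_duplicates removes the duplicates
# inplace parameter: whether to modify the original object, default False
s_re = s.drop_duplicates()
print("s_re = s.drop_duplicates() = \n{0}".format(s_re))
print('-' * 200)
# Using duplicated on a DataFrame (a row is flagged only if every column matches an earlier row)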
df = pd.DataFrame({'key1': ['a', 'a', 3, 4, 5],
'key2': ['a', 'a', 'b', 'b', 'c']})
print("df.duplicated() = \n", df.duplicated())
print('-' * 50)
print("df['key2'].duplicated() = \n", df['key2'].duplicated())
Output:
data1 = s.duplicated() =
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 False
9 False
10 True
11 True
12 True
dtype: bool
--------------------------------------------------
s[s.duplicated() == False] =
0 1
4 2
7 3
8 4
9 5
dtype: int64
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
s_re = s.drop_duplicates() =
0 1
4 2
7 3
8 4
9 5
dtype: int64
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
df.duplicated() =
0 False
1 True
2 False
3 False
4 False
dtype: bool
--------------------------------------------------
df['key2'].duplicated() =
0 False
1 True
2 False
3 True
4 False
Name: key2, dtype: bool
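Both duplicated() and drop_duplicates() accept a keep parameter, and the DataFrame versions also accept subset. The sketch below shows their effect on small made-up inputs; the DataFrame reuses the one from the example above.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3, 3])

first = s.duplicated()                      # default keep='first': later repeats are flagged
last = s.duplicated(keep="last")            # earlier repeats are flagged, the last one is kept
all_dupes = s.duplicated(keep=False)        # every member of a duplicated group is flagged
singletons = s.drop_duplicates(keep=False)  # only values that occur exactly once survive

df = pd.DataFrame({"key1": ["a", "a", 3, 4, 5],
                   "key2": ["a", "a", "b", "b", "c"]})

# Deduplicate on key2 only; the first row for each key2 value is kept.
by_key2 = df.drop_duplicates(subset=["key2"])

print(all_dupes.tolist())
print(by_key2)
```

keep=False is handy for inspecting all conflicting records rather than silently keeping one of them.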
V. replace: replacing values
import numpy as np
import pandas as pd
# Replacement: .replace()
# Replaces one value or several values in a single call
# Accepts a list or a dict
s = pd.Series(list('ascaazsd'))
print("s = \n", s)
print('-' * 200)
data1 = s.replace('a', np.nan)
print("data1 = s.replace('a', np.nan) = \n", data1)
print('-' * 200)
data2 = s.replace(['a', 's'], np.nan)
print("data2 = s.replace(['a', 's'], np.nan) = \n", data2)
print('-' * 200)
data3 = s.replace({'a': 'hello world!', 's': 123})
print("data3 = s.replace({'a': 'hello world!', 's': 123}) = \n", data3)
Output:
s =
0 a
1 s
2 c
3 a
4 a
5 z
6 s
7 d
dtype: object
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
data1 = s.replace('a', np.nan) =
0 NaN
1 s
2 c
3 NaN
4 NaN
5 z
6 s
7 d
dtype: object
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
data2 = s.replace(['a', 's'], np.nan) =
0 NaN
1 NaN
2 c
3 NaN
4 NaN
5 z
6 NaN
7 d
dtype: object
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
data3 = s.replace({'a': 'hello world!', 's': 123}) =
0 hello world!
1 123
2 c
3 hello world!
4 hello world!
5 z
6 123
7 d
dtype: object
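Beyond the scalar, list and dict forms shown above, replace() also supports list-to-list mapping, regular expressions and per-column replacement on a DataFrame. A minimal sketch with made-up data:

```python
import pandas as pd

nums = pd.Series([1, 2, 3])
# Two lists of equal length map element by element: 1 -> 10 and 2 -> 20.
mapped = nums.replace([1, 2], [10, 20])

fruit = pd.Series(["apple", "banana", "avocado"])
# regex=True treats the first argument as a pattern and replaces the matched part.
capitalised = fruit.replace(r"^a", "A", regex=True)

df = pd.DataFrame({"x": [1, 2], "y": [1, 2]})
# Nested dict: replace 1 with 100 in column x only.
only_x = df.replace({"x": {1: 100}})

print(mapped.tolist())
print(capitalised.tolist())
print(only_x)
```

As in the examples above, replace() returns a new object and leaves the original untouched unless inplace=True is passed.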
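Copyright notice: this is an original article by u013250861, released under the CC 4.0 BY-SA license. When reposting, include a link to the original source and this notice.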