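Pandas data manipulation, common functions (part 1): unique (distinct values of a Series), value_counts (value counts, word frequency), isin (membership test), duplicated (duplicate detection), replace (value replacement)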





I. unique(): unique values

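Purpose: unique() returns the unique values of a Series object, in order of first appearance. Uniqueness is determined with a hash table, so the result is not sorted.

  • Syntax: Series.unique()
  • Returns: ndarray or ExtensionArray. The unique values, returned as a NumPy array unless the Series is backed by an extension array.
  • Not available on DataFrame; call it on a single column (a Series) instead.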
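
Since unique() is a Series method, a DataFrame has to be handled one column at a time. The following is a minimal sketch of that workaround; the frame `df` and its column names are made up for illustration and are not from the original article.

```python
import pandas as pd

# Hypothetical example frame (not from the original article)
df = pd.DataFrame({"key1": [1, 1, 2, 2], "key2": ["x", "x", "y", "x"]})

# unique() is defined on Series, so call it on one column at a time
unique_per_column = {col: df[col].unique() for col in df.columns}
print(unique_per_column)

# For unique rows rather than unique column values, drop_duplicates() applies
unique_rows = df.drop_duplicates()
print(unique_rows)
```

The dictionary holds one array of distinct values per column, while drop_duplicates() keeps the first occurrence of every distinct row.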

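Note: the unique values are returned as a NumPy array. If the Series is backed by an extension array, a new ExtensionArray of that type is returned instead, containing only the unique values. This includes:

  • Categorical
  • Period
  • Datetime with timezone
  • Interval
  • IntegerNA (nullable integer)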
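
The difference in return type described in the note can be checked directly. This is a small sketch with made-up sample values; it only inspects the type of the result for a plain integer Series, a categorical Series, and a nullable-integer Series.

```python
import numpy as np
import pandas as pd

# Plain NumPy-backed Series: unique() returns a numpy.ndarray
plain = pd.Series([3, 1, 3, 2]).unique()

# Extension-array-backed Series: unique() returns an array of the same extension type
categorical = pd.Series(list("abca"), dtype="category").unique()
nullable_int = pd.Series([1, 2, 2, None], dtype="Int64").unique()

for name, values in [("int64", plain), ("category", categorical), ("Int64", nullable_int)]:
    print(name, "->", type(values).__name__)
```

In all three cases the values keep their order of first appearance; only the container type changes.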
import pandas as pd

# Unique values: .unique()

s = pd.Series(list('asdvasdcfgg'))
print("s = \n{0} \ntype(s) = {1}".format(s, type(s)))
print('-' * 200)

# Get an array of the unique values
sq = s.unique()
print("Array of unique values: sq = s.unique() = \n{0} \ntype(sq) = {1}".format(sq, type(sq)))
print('-' * 200)

# Wrap the array in pd.Series to get a new Series
s2 = pd.Series(sq)
print("s2 = \n{0} \ntype(s2) = {1}".format(s2, type(s2)))
print('-' * 200)

# Sort the array (ndarray.sort() sorts sq in place and returns None)
sq.sort()
print("sq after sorting: sq = \n{0} \ntype(sq) = {1}".format(sq, type(sq)))
s3 = pd.Series(sq)
print('-' * 50)
print("s3 = \n{0} \ntype(s3) = {1}".format(s3, type(s3)))
print('-' * 200)

Output:

s = 
0     a
1     s
2     d
3     v
4     a
5     s
6     d
7     c
8     f
9     g
10    g
dtype: object 
type(s) = <class 'pandas.core.series.Series'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Array of unique values: sq = s.unique() = 
['a' 's' 'd' 'v' 'c' 'f' 'g'] 
type(sq) = <class 'numpy.ndarray'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
s2 = 
0    a
1    s
2    d
3    v
4    c
5    f
6    g
dtype: object 
type(s2) = <class 'pandas.core.series.Series'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sq after sorting: sq = 
['a' 'c' 'd' 'f' 'g' 's' 'v'] 
type(sq) = <class 'numpy.ndarray'>
--------------------------------------------------
s3 = 
0    a
1    c
2    d
3    f
4    g
5    s
6    v
dtype: object 
type(s3) = <class 'pandas.core.series.Series'>
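
The example above sorts the array in place, which overwrites the first-appearance order. If both orders are needed, one option is np.sort(), which returns a sorted copy. This is a sketch of that alternative rather than part of the original code; dtype=object is set explicitly here only so that the result stays a NumPy array on pandas versions that default to a dedicated string dtype.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("asdvasdcfgg"), dtype=object)

sq = s.unique()           # order of first appearance
sorted_sq = np.sort(sq)   # new, sorted array; sq itself is left unchanged

print("first appearance:", sq)
print("sorted copy:     ", sorted_sq)
```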
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------




II. value_counts(): value counts (e.g. word frequency)



1. Series

import pandas as pd

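# Value counts: .value_counts()

s = pd.Series(list('asdvasdcfgg'))
print("s = \n{0} \ntype(s) = {1}".format(s, type(s)))
print('-' * 200)

# Returns a new Series with the number of occurrences of each distinct value
# sort parameter: sort by count (descending), default True
# With sort=False the result is not ordered by count, and the order can differ between pandas versions
sc = s.value_counts(sort=False)  # older code may call pd.value_counts(s, sort=False), deprecated in recent pandas
print("sc = \n{0} \ntype(sc) = {1}".format(sc, type(sc)))

Output: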

s = 
0     a
1     s
2     d
3     v
4     a
5     s
6     d
7     c
8     f
9     g
10    g
dtype: object 
type(s) = <class 'pandas.core.series.Series'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sc = 
a    2
s    2
v    1
c    1
g    2
f    1
d    2
dtype: int64 
type(sc) = <class 'pandas.core.series.Series'>
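
The heading mentions word frequency as a use case. A minimal sketch of that idea follows, assuming a whitespace-separated sample sentence that is made up for illustration; it also shows the normalize parameter, which returns relative frequencies instead of raw counts.

```python
import pandas as pd

text = "the cat saw the dog and the dog saw the cat"  # hypothetical sample text
words = pd.Series(text.split())

# Default sort=True: most frequent word first
counts = words.value_counts()

# normalize=True: relative frequencies that sum to 1
shares = words.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

Real text would usually need lower-casing and punctuation stripping before splitting; that step is left out here to keep the sketch short.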




2. DataFrame

import pandas as pd

# Value counts: .value_counts()
# DataFrame.value_counts() requires pandas 1.1 or later

df = pd.DataFrame({'key1': ['a', 'a', 3, 4, 5],
                   'key2': ['e', 'a', 'b', 'b', 'c'],
                   'key3': ['d', 'f', 'a', 3, 5]})
print("df = \n{0} \ntype(df) = {1}".format(df, type(df)))
print('-' * 200)

# Returns a Series (not a DataFrame) indexed by a MultiIndex:
# the number of occurrences of each distinct row
# sort parameter: sort by count (descending), default True
sc = df.value_counts(sort=False)
print("sc = \n{0} \ntype(sc) = {1}".format(sc, type(sc)))

Output:

df = 
  key1 key2 key3
0    a    e    d
1    a    a    f
2    3    b    a
3    4    b    3
4    5    c    5 
type(df) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sc = 
key1  key2  key3
3     b     a       1
4     b     3       1
5     c     5       1
a     a     f       1
      e     d       1
dtype: int64 
type(sc) = <class 'pandas.core.series.Series'>
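
In the output above every row is distinct, so all counts are 1. The sketch below uses a made-up frame with a repeated row to show the difference between counting whole rows and counting the values of one column; the frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical frame with one repeated row: (1, "x") appears twice
df = pd.DataFrame({"key1": [1, 1, 2, 2], "key2": ["x", "x", "y", "x"]})

# Whole-row counts: a Series indexed by (key1, key2)
row_counts = df.value_counts()
print(row_counts)

# Single-column counts: select the column first, then count
col_counts = df["key2"].value_counts()
print(col_counts)
```

Row counts are looked up with a tuple, for example row_counts.loc[(1, "x")], because the index has one level per column.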




III. isin(): membership test

import numpy as np
import pandas as pd

# Membership test: .isin()
# Pass the candidate values as a list
# Returns a boolean Series or DataFrame of the same shape

s = pd.Series(np.arange(10, 15))
print("s = \n{0} \ntype(s) = {1}".format(s, type(s)))
print('-' * 50)
print("s.isin([5, 14]) = \n", s.isin([5, 14]))
print('-' * 200)

df = pd.DataFrame({'key1': list('asdcbvasd'),
                   'key2': np.arange(4, 13)})
print("df = \n{0} \ntype(df) = {1}".format(df, type(df)))
print('-' * 50)
print("df.isin(['a', 'bc', '10', 8]) = \n", df.isin(['a', 'bc', '10', 8]))
print('-' * 200)

Output:

s = 
0    10
1    11
2    12
3    13
4    14
dtype: int32 
type(s) = <class 'pandas.core.series.Series'>
--------------------------------------------------
s.isin([5, 14]) = 
 0    False
1    False
2    False
3    False
4     True
dtype: bool
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
df = 
  key1  key2
0    a     4
1    s     5
2    d     6
3    c     7
4    b     8
5    v     9
6    a    10
7    s    11
8    d    12 
type(df) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df.isin(['a', 'bc', '10', 8]) = 
     key1   key2
0   True  False
1  False  False
2  False  False
3  False  False
4  False   True
5  False  False
6   True  False
7  False  False
8  False  False
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
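
In the result above, the string '10' does not match the integer 10 in key2: isin() compares values, and strings and integers are distinct. The following sketch, using the same frame, shows that point together with two common follow-ups: filtering rows with the boolean mask and passing a dict to restrict candidates per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key1": list("asdcbvasd"), "key2": np.arange(4, 13)})

# The boolean mask from isin() can be used to filter rows; ~ negates it
mask = df["key1"].isin(["a", "s"])
selected = df[mask]
excluded = df[~mask]

# A dict restricts the candidate values per column
per_column = df.isin({"key1": ["a"], "key2": [8, 10]})

# The string "10" is not the integer 10, so nothing in key2 matches
string_match = df["key2"].isin(["10"]).any()

print(selected)
print(per_column)
print("any match for '10':", string_match)
```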




IV. duplicated(): detecting and removing duplicates

import pandas as pd

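# Duplicates: .duplicated()

# Using duplicated() on a Series
s = pd.Series([1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 5, 5, 5])
# Flag each value that has already appeared earlier (True = duplicate)
data1 = s.duplicated()
print("data1 = s.duplicated() = \n", data1)
print('-' * 50)
# Boolean indexing keeps only the first occurrences (~s.duplicated() is the idiomatic equivalent)
print("s[s.duplicated() == False] = \n", s[s.duplicated() == False])
print('-' * 200)

# drop_duplicates() removes the duplicates
# inplace parameter: whether to modify the original object, default False
s_re = s.drop_duplicates()
print("s_re = s.drop_duplicates() = \n{0}".format(s_re))
print('-' * 200)

# Using duplicated() on a DataFrame (whole rows are compared)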
df = pd.DataFrame({'key1': ['a', 'a', 3, 4, 5],
                   'key2': ['a', 'a', 'b', 'b', 'c']})

print("df.duplicated() = \n", df.duplicated())
print('-' * 50)
print("df['key2'].duplicated() = \n", df['key2'].duplicated())

Output:

data1 = s.duplicated() = 
 0     False
1      True
2      True
3      True
4     False
5      True
6      True
7     False
8     False
9     False
10     True
11     True
12     True
dtype: bool
--------------------------------------------------
s[s.duplicated() == False] = 
 0    1
4    2
7    3
8    4
9    5
dtype: int64
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
s_re = s.drop_duplicates() = 
0    1
4    2
7    3
8    4
9    5
dtype: int64
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
df.duplicated() = 
 0    False
1     True
2    False
3    False
4    False
dtype: bool
--------------------------------------------------
df['key2'].duplicated() = 
 0    False
1     True
2    False
3     True
4    False
Name: key2, dtype: bool
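
The calls above rely on the default keep='first', which marks every occurrence after the first as a duplicate. A short sketch of the other keep options, and of the subset parameter of drop_duplicates() on a DataFrame, is given below; the sample data is made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3])

first = s.duplicated(keep="first")   # mark all but the first occurrence
last = s.duplicated(keep="last")     # mark all but the last occurrence
all_dups = s.duplicated(keep=False)  # mark every value that occurs more than once

print(pd.DataFrame({"value": s, "first": first, "last": last, "keep=False": all_dups}))

# Hypothetical frame: deduplicate on key2 only, keeping the last row of each group
df = pd.DataFrame({"key1": ["a", "a", "b", "c"], "key2": [1, 2, 2, 2]})
deduped = df.drop_duplicates(subset=["key2"], keep="last")
print(deduped)
```

The original index labels are preserved, so deduped keeps rows 0 and 3; reset_index(drop=True) can renumber them if needed.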




V. replace(): replacing values

import numpy as np
import pandas as pd

# Replacement: .replace()
# Can replace a single value or several values in one call
# Accepts a list or a dict

s = pd.Series(list('ascaazsd'))
print("s = \n", s)
print('-' * 200)
data1 = s.replace('a', np.nan)
print("data1 = s.replace('a', np.nan) = \n", data1)
print('-' * 200)
data2 = s.replace(['a', 's'], np.nan)
print("data2 = s.replace(['a', 's'], np.nan) = \n", data2)
print('-' * 200)

data3 = s.replace({'a': 'hello world!', 's': 123})
print("data3 = s.replace({'a': 'hello world!', 's': 123}) = \n", data3)

Output:

s = 
 0    a
1    s
2    c
3    a
4    a
5    z
6    s
7    d
dtype: object
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
data1 = s.replace('a', np.nan) = 
 0    NaN
1      s
2      c
3    NaN
4    NaN
5      z
6      s
7      d
dtype: object
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
data2 = s.replace(['a', 's'], np.nan) = 
 0    NaN
1    NaN
2      c
3    NaN
4    NaN
5      z
6    NaN
7      d
dtype: object
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
data3 = s.replace({'a': 'hello world!', 's': 123}) = 
 0    hello world!
1             123
2               c
3    hello world!
4    hello world!
5               z
6             123
7               d
dtype: object
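
Two further forms of replace() are worth a brief sketch, using the same Series as above: a list of targets paired element by element with a list of replacements, and pattern-based replacement with regex=True. The replacement values chosen here are arbitrary.

```python
import pandas as pd

s = pd.Series(list("ascaazsd"))

# List-to-list: each target is replaced by the value at the same position
paired = s.replace(["a", "s"], ["A", "S"])

# regex=True: the first argument is treated as a regular expression
pattern = s.replace(r"[az]", "_", regex=True)

print(paired.tolist())
print(pattern.tolist())
```

Both calls return a new Series and leave s unchanged, which matches the behaviour of the examples above.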







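Copyright notice: this is an original article by u013250861, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.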