Pandas Text Processing

admin

2024-02-02 07:42:59

Reading notes on 深入浅出Pandas

C11 Pandas Text Processing

11.1 Data types

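Pandas has two text dtypes: object and StringDtype. Before version 1.0, object was the only one, and Pandas assigned it to any column that mixed several types. Since version 1.0, the dedicated StringDtype is the officially recommended dtype for text.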

11.1.1 Text data types

By default, text data is inferred as the object dtype

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['a1', 'a1', 'a2', 'a2'],
    'B': ['b1', 'b2', None, 'b2'],
    'C': [1, 2, 3, 4],
    'D': [5, 6, None, 8],
    'E': [5, None, 7, 8],
})
'''A	B	C	D	E
0	a1	b1	1	5.0	5.0
1	a1	b2	2	6.0	NaN
2	a2	None	3	NaN	7.0
3	a2	b2	4	8.0	8.0
'''
df.dtypes
'''
A     object
B     object
C      int64
D    float64
E    float64
dtype: object
'''
# To get the string dtype, request it explicitly
pd.Series(['a', 'b', 'c'], dtype='string')
pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
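
To make the explicit request above concrete, here is a minimal sketch. The variable name s_explicit is invented for this illustration; the point is only that a missing value in a string-dtype Series is stored as a missing marker rather than as text.

```python
import pandas as pd

# Ask for the dedicated string dtype instead of relying on inference.
s_explicit = pd.Series(['a', 'b', None], dtype='string')

# The dtype is a StringDtype, and the None entry is treated as missing.
print(s_explicit.dtype)
print(s_explicit.isna().tolist())
```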

11.1.2 Type conversion

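To convert data to the string dtype, convert_dtypes() is the recommended method: it automatically converts each column to the most suitable of the newer supported dtypes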

df.convert_dtypes().dtypes
'''
A    string
B    string
C     Int64
D     Int64
E     Int64
dtype: object
'''
# astype also works
s = pd.Series(['a', 'b', 'c'])
s.astype('object') # convert to object
s.astype('string') # convert to string
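
A small sketch of the conversion described above, on a made-up two-column frame (df_demo and its column names are assumptions for illustration). It shows a text column becoming a string dtype and a float column of whole numbers becoming nullable Int64.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'name': ['a1', 'a2', None],
    'score': [1.0, 2.0, None],  # floats that all hold whole numbers
})

# convert_dtypes picks a nullable dtype for each column where it can.
converted = df_demo.convert_dtypes()
print(converted.dtypes)
```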

11.1.3 Differences between the types

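StringDtype behaves somewhat differently from object: on a StringDtype Series, methods that return numbers yield the nullable Int64 dtype and methods that return booleans yield the nullable boolean dtype, whereas on object the result dtype depends on whether missing values are present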

pd.Series(['a', None, 'b']).str.count('a')
'''
0    1.0
1    NaN
2    0.0
dtype: float64
'''
pd.Series(['a', None, 'b'], dtype='string').str.count('a')
'''
0       1
1    <NA>
2       0
dtype: Int64
'''
pd.Series(['a', None, 'b']).str.isdigit()
'''
0    False
1     None
2    False
dtype: object
'''
pd.Series(['a', None, 'b'], dtype='string').str.isdigit()
'''
0    False
1     <NA>
2    False
dtype: boolean
'''
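
The difference can also be checked programmatically. The sketch below uses hypothetical variable names and compares only the result dtypes of the same call on the two text dtypes.

```python
import pandas as pd

values = ['a', None, 'b']

# Same data, same method, two different text dtypes.
counts_object = pd.Series(values, dtype='object').str.count('a')
counts_string = pd.Series(values, dtype='string').str.count('a')

print(counts_object.dtype)  # float, because the missing value forces NaN
print(counts_string.dtype)  # nullable integer, missing value kept as <NA>
```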

11.2 String operations

Both Series and Index provide string-processing methods, reached through the str attribute. These methods automatically skip missing and NA values.

11.2.1 The .str accessor

Use the .str accessor to apply string operations to the contents

s = pd.Series(['A', 'Boy', 'C', np.nan], dtype='string')
s.str.lower()
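# For non-text data, convert first and then use the accessor
df.Q1.astype(str).str
# Convert to StringDtype
df.team.astype('string').str
# Most operations also work on index types such as df.index and df.columns
# Operate on the index
df.index.str.lower()
# Operate on the column labels
df.columns.str.lower()
# When chaining string operations, every step needs its own .str
# Strip leading and trailing whitespace, lowercase, replace spaces with underscores
df.columns.str.strip().str.lower().str.replace(' ', '_')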
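
As a runnable version of the chained cleanup above, here is a sketch on an empty frame with invented column labels (df_cols and the labels are assumptions). regex=False is passed so the space is treated as a literal character.

```python
import pandas as pd

df_cols = pd.DataFrame(columns=[' Team Name ', 'Q1 Score', 'q2 score '])

# Each step in the chain needs its own .str accessor.
df_cols.columns = (
    df_cols.columns
    .str.strip()                         # drop leading/trailing whitespace
    .str.lower()                         # lowercase
    .str.replace(' ', '_', regex=False)  # inner spaces -> underscores
)
print(list(df_cols.columns))
```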

11.2.2 Text case

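s.str.lower() # lowercase
s.str.upper() # uppercase
s.str.title() # capitalize each word
s.str.capitalize() # capitalize the first letter
s.str.swapcase() # swap upper and lower case
s.str.casefold() # lowercase more aggressively, also handles non-English text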

11.2.3 Text alignment

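s.str.center(10, fillchar='-') # center within width 10, padding with '-'
s.str.ljust(10, fillchar='-') # left-align
s.str.rjust(10, fillchar='-') # right-align
# Pad to a given width, choosing the side to pad and the fill character
# side can be 'left', 'right', or 'both'; the default is 'left'
s.str.pad(width=10, side='left', fillchar='-')
s.str.zfill(10) # left-pad with zeros up to width 10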
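
A short sketch of the padding methods on invented ID strings (ids is a hypothetical Series). It also illustrates that pad with side='left' adds fill characters on the left, which gives the same result as rjust.

```python
import pandas as pd

ids = pd.Series(['7', '42', '123'])

zero_padded = ids.str.zfill(5)                        # zeros added on the left
right_aligned = ids.str.rjust(5, fillchar='-')        # fill added on the left
padded_left = ids.str.pad(width=5, side='left', fillchar='-')

print(zero_padded.tolist())
print(right_aligned.tolist())
```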

11.2.4 Counting and encoding

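# Number of occurrences of the given pattern in each string
s.str.count('a')
# String length
s.str.len()
# Encode text to bytes
s.str.encode('utf-8')
# Decode bytes back to text (applies to a Series of bytes)
s.str.decode('utf-8')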
pd.Series(['年后']).str.encode('gbk').str.decode('gbk')
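
The round trip on the last line can be unpacked as follows. This is a sketch with hypothetical variable names: encode produces a Series of bytes objects, and decode is then applied to that bytes Series.

```python
import pandas as pd

text = pd.Series(['年后', 'abc'])

encoded = text.str.encode('gbk')     # each element is now a bytes object
decoded = encoded.str.decode('gbk')  # bytes -> text again

print(encoded.tolist())
print(decoded.tolist())
```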

11.2.5 Format checks

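s.str.isalpha() # all characters are letters
s.str.isnumeric() # all characters are numeric (includes fractions and other Unicode numerals, not only 0-9)
s.str.isalnum() # all characters are letters or digits
s.str.isdigit() # all characters are digits (includes superscript digits)
s.str.isdecimal() # all characters are decimal digits (this does not test for a decimal fraction)
s.str.isspace() # all characters are whitespace
s.str.islower() # all cased characters are lowercase
s.str.isupper() # all cased characters are uppercase
s.str.istitle() # the string is in title case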
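
The three numeric checks are easy to confuse, so here is a sketch comparing them on a few hand-picked strings. The Series name chars is hypothetical, and the object dtype is requested explicitly on the assumption that the checks then follow Python's own str methods.

```python
import pandas as pd

# '²' is a superscript digit, '½' is a fraction character, '1.5' contains a dot.
chars = pd.Series(['12', '²', '½', '1.5'], dtype='object')

is_decimal = chars.str.isdecimal()
is_digit = chars.str.isdigit()
is_numeric = chars.str.isnumeric()

print(pd.DataFrame({'value': chars, 'isdecimal': is_decimal,
                    'isdigit': is_digit, 'isnumeric': is_numeric}))
```

None of the three accepts '1.5', since the dot is not a numeric character.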

11.3 Advanced text processing

11.3.1 Text splitting

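Splitting and replacing are the most common text operations. Splitting a string produces a list, which can then be indexed or sliced to pick out the parts we want. The split parts can also be expanded into separate columns.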

s = pd.Series(['天_地_人', '你_我_她', np.nan, '风_水_火'], dtype='string')
s.str.split('_')
'''
0    [天, 地, 人]
1    [你, 我, 她]
2         <NA>
3    [风, 水, 火]
dtype: object
'''
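# After splitting, use get or [] to pick out parts. [] is Python list indexing and is more flexible: it can take a single item or a slice of several
s.str.split('_').str[1]
# get takes a single item only
s.str.split('_').str.get(1)
# [] supports slicing
s.str.split('_').str[1:3]
# Without a separator, the split is on whitespace
# n limits the number of splits, counting from the left; the remainder stays unsplit
s.str.split(n=2)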
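
To show the effect of n together with an explicit separator, here is a sketch on made-up data (s_split and parts are hypothetical names). With n=1 only the first separator is used, so each list has at most two items.

```python
import pandas as pd

s_split = pd.Series(['天_地_人', 'a_b_c_d'])

# Split once from the left; everything after the first '_' stays together.
parts = s_split.str.split('_', n=1)
remainder = parts.str[1]

print(parts.tolist())
print(remainder.tolist())
```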

11.3.2 Splitting and expanding

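After .str.split() turns the data into lists, pass expand=True to place the values at the same list position in the same column, forming a DataFrame. The n parameter limits the number of splits and so controls how many columns are produced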

s.str.split('_', expand=True)
'''0	1	2
0	天	地	人
1	你	我	她
2	<NA>	<NA>	<NA>
3	风	水	火
'''
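# Limit the number of splits; n=1 yields at most two columns
s.str.split('_', expand=True, n=1)
# rsplit works like split but starts from the right; without n, rsplit and split give the same output
# For more complex rules, pass a regular expression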
s = pd.Series(['你和我及他'])
s.str.split(r'和|及', expand=True)
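
The difference between split and rsplit only shows up once n is given. A sketch with invented data (paths, left, and right are hypothetical names):

```python
import pandas as pd

paths = pd.Series(['a_b_c', 'x_y_z'])

# One split from the left versus one split from the right.
left = paths.str.split('_', n=1, expand=True)
right = paths.str.rsplit('_', n=1, expand=True)

print(left)
print(right)
```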

11.3.3 Text slicing .str.slice()

.str.slice() extracts a substring by position, though s.str[] is still the recommended way to do this
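
A minimal sketch showing that the two spellings select the same characters. The codes Series is made up for illustration.

```python
import pandas as pd

codes = pd.Series(['AB-2024-001', 'CD-2023-117'])

years_slice = codes.str.slice(3, 7)  # positions 3 up to (not including) 7
years_index = codes.str[3:7]         # the same selection with [] syntax

print(years_slice.tolist())
```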

11.3.4 Text partitioning .str.partition()

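.str.partition() splits text at a separator into three parts (the text before the separator, the separator itself, and the text after it), returning a new DataFrame or a related type
The difference between partition and split is that partition keeps the separator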

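# Partition at the first separator from the left (the default separator is a space)
s.str.partition()
# Partition at the first separator from the right
s.str.rpartition()
# Specify the separator
s.str.partition('are')
# Return a column of tuples instead of a DataFrame
s.str.partition('you', expand=False)
# Partition an index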
idx = pd.Index(['A 123', 'B 345'])
idx.str.partition()
'''
MultiIndex([('A', ' ', '123'),('B', ' ', '345')],)
'''
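
A sketch of partition and rpartition on invented key/value strings (kv, first, and last are hypothetical names). It also shows what happens when the separator is absent: the whole string lands in the first column for partition and in the last column for rpartition.

```python
import pandas as pd

kv = pd.Series(['key=value=more', 'novalue'])

first = kv.str.partition('=')   # cut at the first '='
last = kv.str.rpartition('=')   # cut at the last '='

print(first)
print(last)
```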

11.3.5 Text replacement .str.replace()

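Similar to df.replace() and s.replace(), except that .str.replace() replaces substrings inside each string, while df.replace() and s.replace() match whole values by default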
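
The sketch below contrasts substring replacement with whole-value replacement on invented data (s_rep is a hypothetical name). The regex argument is passed explicitly on the assumption that its default has differed between pandas versions.

```python
import pandas as pd

s_rep = pd.Series(['foo bar', 'bar'])

# .str.replace works inside each string.
sub = s_rep.str.replace('bar', 'X', regex=False)

# Series.replace only touches values that equal 'bar' exactly.
whole = s_rep.replace('bar', 'X')

# A regular expression: collapse runs of whitespace into one underscore.
pattern = s_rep.str.replace(r'\s+', '_', regex=True)

print(sub.tolist(), whole.tolist(), pattern.tolist())
```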

11.3.6 Positional replacement .str.slice_replace()

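s.str.slice_replace(start=1, repl='TTT') # keep the first character, replace the rest with repl
s.str.slice_replace(stop=2, repl='TTT') # replace the first two characters with repl, keep the rest
s.str.slice_replace(start=1, stop=3, repl='TTT') # keep the first character and everything from the fourth on, replace the middle with repl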
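
Applied to a concrete string, the three calls above behave as follows. This is a sketch; s_sr is a hypothetical one-element Series.

```python
import pandas as pd

s_sr = pd.Series(['abcdef'])

tail = s_sr.str.slice_replace(start=1, repl='TTT')            # replace from position 1 on
head = s_sr.str.slice_replace(stop=2, repl='TTT')             # replace positions 0 and 1
middle = s_sr.str.slice_replace(start=1, stop=3, repl='TTT')  # replace positions 1 and 2

print(tail[0], head[0], middle[0])
```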

11.3.7 Repetition .str.repeat()
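
The notes give no example for this method, so the following is only a minimal sketch of how .str.repeat() is commonly used, with invented data. It accepts either one repeat count for all elements or one count per element.

```python
import pandas as pd

s_r = pd.Series(['a', 'b', 'c'])

same_count = s_r.str.repeat(2)             # every element repeated twice
per_element = s_r.str.repeat([1, 2, 3])    # a separate count for each element

print(same_count.tolist())
print(per_element.tolist())
```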

11.3.8 Text concatenation .str.cat()

Joins the elements of one text Series into a single string, or concatenates two text Series element by element

s = pd.Series(['x', 'y', 'z'])
s.str.cat()
# Join with commas
s.str.cat(sep=',')
# Missing values are skipped by default; na_rep supplies a placeholder for them
t = pd.Series(['h', 'i', np.nan, 'k'])
t.str.cat(sep=',', na_rep='-')
# Concatenate two Series
s.str.cat(t, na_rep='-')
# Index alignment when concatenating
h = pd.Series(['b', 'd', 'a'], index=[1, 0, 2])
s.str.cat(h, join='right')
s.str.cat(h, join='outer')
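
To see what index alignment does in practice, here is a sketch that rebuilds the same three Series under new, hypothetical names (s_cat, t_cat, h_cat) and inspects the results. Values are paired by index label, not by position.

```python
import numpy as np
import pandas as pd

s_cat = pd.Series(['x', 'y', 'z'])
t_cat = pd.Series(['h', 'i', np.nan, 'k'])
h_cat = pd.Series(['b', 'd', 'a'], index=[1, 0, 2])

# h_cat is reordered to match s_cat's index before concatenating.
aligned = s_cat.str.cat(h_cat, join='left')

# An outer join keeps label 3, which exists only in t_cat.
outer = s_cat.str.cat(t_cat, join='outer', na_rep='-')

print(aligned.tolist())
print(outer.tolist())
```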

11.3.9 Text search .str.findall(), .str.find()

findall returns, for each element, a list of all the matches found; find returns the index of the first match

# .str.findall() finds all occurrences of a pattern in the text
s = pd.Series(['One', 'Two', 'Three'])
s.str.findall('T')
'''
0     []
1    [T]
2    [T]
dtype: object
'''
# Case-sensitive
s.str.findall('ONE')
import re
s.str.findall('ONE', flags=re.IGNORECASE) # import re and ignore case
s.str.findall('o', flags=re.I) # find every 'o', ignoring case
s.str.findall('o$') # find an 'o' at the end of the string
s.str.findall('e')
'''
0       [e]
1        []
2    [e, e]
dtype: object
'''
# .str.find() returns the position of the match (counting from 0; -1 means no match)
s.str.find('One')
'''
0    0
1   -1
2   -1
dtype: int64
'''
s.str.find('e') # position of the first 'e' in each element
'''
0    2
1   -1
2    3
dtype: int64
'''
# .str.rfind() searches from the right
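
A sketch comparing find and rfind on the same data, under a hypothetical name (words). The two differ only when the substring occurs more than once.

```python
import pandas as pd

words = pd.Series(['One', 'Two', 'Three'])

first_e = words.str.find('e')   # position of the first 'e', -1 if absent
last_e = words.str.rfind('e')   # position of the last 'e', -1 if absent

print(first_e.tolist())
print(last_e.tolist())
```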

11.3.10 Text containment .str.contains(), .str.startswith() / .str.endswith(), .str.match()

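.str.contains() tests whether each string contains the pattern and returns a boolean Series. The pattern is treated as a regular expression by default, and na specifies how missing values are handled
.str.match() tests whether each string matches the regular expression, starting from the beginning of the string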

s = pd.Series(['One', 'Two', 'Three', np.nan])
s.str.contains('o')
'''
0    False
1     True
2    False
3      NaN
dtype: object
'''
# name contains A or C, ignoring case
df.loc[df.name.str.contains('A|C', flags=re.IGNORECASE)]
# Contains a digit
df.name.str.contains(r'\d')
# startswith / endswith
s.str.startswith('O', na=False)
# match
s = pd.Series(['1', '2', '3a', '4b', '03c'])
s.str.match(r'[0-9][a-z]')
'''
0    False
1    False
2     True
3     True
4    False
dtype: bool
'''
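
Because the pattern is a regular expression by default, characters such as '.' need care. The sketch below uses invented data (names) to show a literal search with regex=False, a case-insensitive alternation, and match anchored at the start, all with na=False so the result is a plain boolean mask.

```python
import pandas as pd

names = pd.Series(['A.1', 'b2', None, 'C3'])

# regex=False: '.' means a literal dot, not "any character".
has_dot = names.str.contains('.', regex=False, na=False)

# Alternation, ignoring case; missing values count as False.
has_a_or_c = names.str.contains('a|c', case=False, na=False)

# match anchors the pattern at the start of each string.
letter_digit = names.str.match(r'[a-z]\d', na=False)

print(names[has_a_or_c].tolist())
```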

11.3.11 Text extraction .str.extract()

.str.extract() uses a regular expression to pull data out of the text into separate columns.

pd.Series(['a1', 'b2', 'c3']).str.extract(r'([a-z])(\d)')
# One column is returned per capture group () in the pattern
'''0	1
0	a	1
1	b	2
2	c	3
'''
# Name the extracted columns with named groups
pd.Series(['a1', 'b2', 'c3']).str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
'''letter	digit
0	a	1
1	b	2
2	NaN	NaN
'''
# extractall returns every match, one row per match
s = pd.Series(['a1a2', 'b1b7', 'c1'])
s.str.extractall(r'([a-z])(\d)')
'''		0	1
	match		
0	0	a	1
	1	a	2
1	0	b	1
	1	b	7
2	0	c	1
'''
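
The following sketch puts extract and extractall side by side and checks the shape of what they return. The variable names are hypothetical; the group names letter and digit follow the column names shown in the output above.

```python
import pandas as pd

s_ex = pd.Series(['a1', 'b2', 'c3'])

# Named groups become column names; rows that do not match are filled with NaN.
named = s_ex.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')

# extractall returns one row per match, with an extra 'match' index level.
all_matches = pd.Series(['a1a2', 'b1b7', 'c1']).str.extractall(r'([a-z])(\d)')

print(named)
print(all_matches)
```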

11.3.12 Extracting dummy variables
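
The notes stop at the heading here. As a hedged sketch of what this usually refers to, .str.get_dummies() splits each string on a separator and returns one 0/1 indicator column per distinct value. The tags Series and the '|' separator are assumptions made for illustration.

```python
import pandas as pd

tags = pd.Series(['a|b', 'a', 'b|c'])

# One indicator column per distinct token found after splitting on '|'.
dummies = tags.str.get_dummies(sep='|')

print(dummies)
```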
