Python raised an error while I was splitting Chinese sentences:
File "C:\Users\Admin\anaconda3\envs\NLP\lib\re.py", line 215, in split
return _compile(pattern, flags).split(string, maxsplit)
File "C:\Users\Admin\anaconda3\envs\NLP\lib\re.py", line 288, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\Admin\anaconda3\envs\NLP\lib\sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\Admin\anaconda3\envs\NLP\lib\sre_parse.py", line 924, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "C:\Users\Admin\anaconda3\envs\NLP\lib\sre_parse.py", line 420, in _parse_sub
not nested and not items))
File "C:\Users\Admin\anaconda3\envs\NLP\lib\sre_parse.py", line 574, in _parse
raise source.error(msg, len(this) + 1 + len(that))
re.error: bad character range )- at position 15
The offending line:
txt_split = re.split(r'[,,.。!!;;::??、()- ]', txt_process.strip())
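I found the hint in another post: when splitting a string with re, the characters in the delimiter set should be arranged in ascending order of their code points (ASCII values). More precisely, the rule only applies to ranges: inside [...], a "-" between two characters is read as a range, and the start of the range must not be greater than its end. In my pattern, the fullwidth ")" (65289), "-" and the space (32) were parsed as a range running from 65289 down to 32, which is what "bad character range" at position 15 is complaining about.
The order in my original code was: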
print([ord(x) for x in ',,.。!!;;::??、()- '])
[65292, 44, 46, 12290, 65281, 33, 65307, 59, 65306, 58, 63, 65311, 12289, 65288, 65289, 45, 32]
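To see why the parser objects, here is a minimal sketch that reproduces the error using only the last three characters of the class. The names bad_class and err are mine and do not appear in the original code.

```python
import re

# Inside [...], "X-Y" is read as a range from X to Y.
# Here X is the fullwidth ")" (U+FF09, 65289) and Y is a space (32),
# so the range runs backwards and the pattern fails to compile.
bad_class = '[)- ]'

try:
    re.compile(bad_class)
    err = None
except re.error as exc:
    err = exc

print(err)                      # the "bad character range" message
print(ord(')') > ord(' '))     # True: the range start exceeds its end
```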
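After reordering the delimiters, the problem was solved~ (In sorted order, ",-." becomes an ascending range from 44 to 46, which still matches ",", "-" and ".".)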
txt_split = re.split(r'[ !,-.:;?、。!(),:;?]', txt_process.strip())
print([ord(x) for x in ' !,-.:;?、。!(),:;?'])
[32, 33, 44, 45, 46, 58, 59, 63, 12289, 12290, 65281, 65288, 65289, 65292, 65306, 65307, 65311]
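Sorting is probably not the only way out. As a sketch, assuming the goal is just to treat "-" as a literal delimiter, you can escape it (or move it to the very end of the class) and keep the original order. The sample string and the names pattern, sample and parts are hypothetical, made up for illustration.

```python
import re

# Original delimiter order, with the hyphen escaped so that it is a
# literal "-" rather than a range operator.
pattern = re.compile(r'[,,.。!!;;::??、()\- ]')

# Hypothetical input, standing in for txt_process.
sample = '你好,世界!今天-天气 不错。'

# Drop the empty string produced by the trailing "。".
parts = [p for p in pattern.split(sample.strip()) if p]
print(parts)  # ['你好', '世界', '今天', '天气', '不错']
```

With the hyphen escaped, the order of the other characters no longer matters, so adding a new delimiter later cannot accidentally create a range.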