Published: 2010-07-01
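Whoosh tokenizes text with regular expressions, so getting correct tokens is mostly a matter of writing a suitable regular expression.
And because Whoosh is pure Python, it is also easy to reimplement the tokenizer yourself or to plug in a third-party one.
Below are some regex-based examples. They may have rough edges and still need refining.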
# Tokenization test
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
from whoosh.analysis import RegexAnalyzer
# One token per CJK character, or a word that may contain internal dots (e.g. 3.141)
rex = RegexAnalyzer(r"([\u4e00-\u9fa5])|(\w+(\.?\w+)*)")
print([token.text for token in rex("hi 中 000 中文测试中文 there 3.141 big-time under_score")])
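To look at what this pattern matches without installing Whoosh, the sketch below applies the same expression with the standard `re` module. It assumes Python 3 and that a regex tokenizer simply emits each successive match as one token; the helper name `tokenize` is made up for illustration and is not part of Whoosh.

```python
import re

# Same expression as above: a single CJK character, or a "word" that may
# contain internal dots (so 3.141 stays together).
PATTERN = re.compile(r"([\u4e00-\u9fa5])|(\w+(\.?\w+)*)")


def tokenize(text):
    """Hypothetical helper: return every successive match as one token."""
    return [m.group(0) for m in PATTERN.finditer(text)]


tokens = tokenize("hi 中 000 中文测试中文 there 3.141 big-time under_score")
print(tokens)

# Caveat: \w also matches CJK characters, so a run that begins with Latin
# letters is not split at the script boundary.
mixed = tokenize("abc中文")
print(mixed)
```

Each Chinese character comes out as its own token, `3.141` stays whole, and `big-time` splits into `big` and `time`. The second call shows one of the rough edges: `abc中文` comes back as a single token, because the word branch of the pattern consumes the Chinese characters as well.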
# A complete demo
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import RegexAnalyzer
# Same analyzer as in the tokenization test: one token per CJK character
analyzer = RegexAnalyzer(r"([\u4e00-\u9fa5])|(\w+(\.?\w+)*)")
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True, analyzer=analyzer))
# create_in() expects the index directory to exist already
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title="First document", path="/a",
                    content="This is the first document we've added!")
writer.add_document(title="Second document", path="/b",
                    content="The second one 你 中文测试中文 is even more interesting!")
writer.commit()
# Search the content field for an English word, a single character and a two-character word
with ix.searcher() as searcher:
    for querystring in ("first", "你", "测试"):
        results = searcher.find("content", querystring)
        print(results[0])
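The last query is worth a closer look: the index only holds single-character tokens, yet a two-character query can still find the document. The sketch below models one way this can work with a toy inverted index built on the standard library. It is not Whoosh's implementation. It assumes Python 3, that the query string is run through the same tokenizer as the documents, and that every resulting term must be present; it ignores positions and scoring. The names `tokenize`, `index` and `search` are made up for illustration.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"([\u4e00-\u9fa5])|(\w+(\.?\w+)*)")


def tokenize(text):
    """Hypothetical helper: one token per regex match."""
    return [m.group(0) for m in PATTERN.finditer(text)]


docs = {
    "/a": "This is the first document we've added!",
    "/b": "The second one 你 中文测试中文 is even more interesting!",
}

# Toy inverted index: token -> set of document paths that contain it.
index = defaultdict(set)
for path, content in docs.items():
    for token in tokenize(content):
        index[token].add(path)


def search(query):
    """Return the paths of documents containing every token of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


hits_first = search("first")
hits_ni = search("你")
hits_test = search("测试")
print(hits_first, hits_ni, hits_test)
```

In this model the query `测试` is split into `测` and `试`, and only `/b` contains both. The model would also match a document where the two characters appear far apart, which is one reason a dedicated Chinese word segmenter can give better results than per-character tokens.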