Python Component Introduction: Whoosh (a Pure-Python Full-Text Search Component)


weixin_39939661


Article tags: Python component introduction

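This article shows how to use Whoosh's regular-expression tokenizer to tokenize Chinese and English text, and provides a simple example program that covers the complete flow from indexing documents to running search queries.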


Published: 2010-07-01

Whoosh's tokenization is based on regular expressions, so writing a suitable regular expression is all it takes to tokenize text correctly.

Of course, because Whoosh is pure Python, it is also easy to reimplement the tokenization module yourself or to plug in a third-party tokenizer.
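As a rough illustration of the kind of segmentation logic a hand-written tokenizer might implement, here is a minimal sketch that splits runs of Chinese characters into overlapping bigrams and lowercases ASCII words. It is plain Python and is not wired into Whoosh; the function name `bigram_tokens` and the bigram strategy are assumptions made for this example, not something the original article or Whoosh prescribes.

```python
import re

# Runs of CJK ideographs, or runs of ASCII word characters.
_RUNS = re.compile(r"[\u4e00-\u9fa5]+|[A-Za-z0-9_]+")


def bigram_tokens(text):
    """Yield lowercase ASCII words and overlapping bigrams of CJK runs.

    Hypothetical helper for illustration only.
    """
    for match in _RUNS.finditer(text):
        run = match.group(0)
        if "\u4e00" <= run[0] <= "\u9fa5":
            if len(run) == 1:
                # A lone ideograph is emitted as-is.
                yield run
            else:
                # Overlapping two-character windows over the run.
                for i in range(len(run) - 1):
                    yield run[i:i + 2]
        else:
            yield run.lower()


tokens = list(bigram_tokens("The second one 你中文测试中文 is"))
print(tokens)
```

Bigrams are one common compromise for Chinese text when no dictionary-based segmenter is available: a query such as 测试 then maps to a single token instead of two independent characters.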

Below are some regex-based examples. They may have shortcomings and still need refinement.

# Tokenization test

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from whoosh.analysis import RegexAnalyzer

# Match a single Chinese character, or a word that may contain internal dots (e.g. 3.141).
rex = RegexAnalyzer(r"([\u4e00-\u9fa5])|(\w+(\.?\w+)*)")

print([token.text for token in rex("hi 中 000 中文测试中文 there 3.141 big-time under_score")])
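To see what this pattern does on its own, the following sketch applies the same regular expression with Python's standard `re` module, without Whoosh. The helper name `regex_tokens` is an assumption for this example. It also shows one of the shortcomings mentioned above: in Python 3, `\w` matches Chinese characters too, so Chinese text that directly follows Latin letters is swallowed into a single token.

```python
import re

# Same pattern as in the Whoosh example: one Chinese character,
# or a word that may contain internal dots.
PATTERN = re.compile(r"([\u4e00-\u9fa5])|(\w+(\.?\w+)*)")


def regex_tokens(text):
    """Return the full text of every match, in order (hypothetical helper)."""
    return [m.group(0) for m in PATTERN.finditer(text)]


sample = "hi 中 000 中文测试中文 there 3.141 big-time under_score"
tokens = regex_tokens(sample)
print(tokens)
# ['hi', '中', '000', '中', '文', '测', '试', '中', '文',
#  'there', '3.141', 'big', 'time', 'under_score']

# Caveat: with no separator, the second alternative wins at 'a'
# and \w+ consumes the Chinese characters as well.
mixed = regex_tokens("abc中文")
print(mixed)  # ['abc中文']
```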

# A complete demo

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import RegexAnalyzer

analyzer = RegexAnalyzer(r"([\u4e00-\u9fa5])|(\w+(\.?\w+)*)")

schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))

# create_in() expects the index directory to exist already.
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(title="First document", path="/a",
                    content="This is the first document we've added!")
writer.add_document(title="Second document", path="/b",
                    content="The second one 你中文测试中文 is even more interesting!")
writer.commit()

# Each query is parsed against the "content" field; print the top hit.
searcher = ix.searcher()

results = searcher.find("content", "first")
print(results[0])

results = searcher.find("content", "你")
print(results[0])

results = searcher.find("content", "测试")
print(results[0])
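The flow in the demo (tokenize, index, search) can be mimicked with a toy inverted index in plain Python, which may help explain why a query like 测试 finds the second document even though every Chinese character is indexed as its own token. This is only a conceptual sketch and not how Whoosh is implemented: the names `tokenize`, `index`, and `search`, the lowercasing, and the rule that every query token must be present are all assumptions made for illustration.

```python
import re
from collections import defaultdict

PATTERN = re.compile(r"([\u4e00-\u9fa5])|(\w+(\.?\w+)*)")


def tokenize(text):
    # Lowercasing is an assumption of this sketch.
    return [m.group(0).lower() for m in PATTERN.finditer(text)]


docs = {
    "/a": "This is the first document we've added!",
    "/b": "The second one 你中文测试中文 is even more interesting!",
}

# Inverted index: token -> set of document paths containing it.
index = defaultdict(set)
for path, content in docs.items():
    for token in tokenize(content):
        index[token].add(path)


def search(query):
    """Return paths of documents containing every token of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


print(search("first"))  # ['/a']
print(search("你"))      # ['/b']
print(search("测试"))    # ['/b'] because both 测 and 试 occur in /b
```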
