Lucene-Based Site Search

Lucene-based site search
tangfulin <tangfulin@gmail.com>
imobile.com.cn

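Background
- Several modules need search (V1.0)
  - Different search fields
  - Different sort orders
  - Different update frequencies and index sizes
- Index rebuilds are needed (V1.5)
  - Abnormal search results (duplicate records, missing records)
  - Index files unexpectedly corrupted
  - Changes to the tokenization or ranking algorithm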

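Background (continued)
- A shorter update cycle is needed (timely updates) (V1.5)
  - Google refreshes its index of imobile every 30 minutes
  - V1.0 index update cycle: 15 minutes
  - V1.5 update cycle: 3 minutes expected, 1-5 minutes in practice
- Large indexes must be searchable (V1.5)
  - One database with more than 30 million records; the raw XML files total 14 GB
  - V1.0 index files: 13 GB
  - V1.5 index files: 3.9 GB
  - V1.5 full rebuild: 140 minutes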

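Background (continued, part 2)
- In some cases, serve as an alternative data source to the database (V2.0, on the way)
  - Similar to Taobao search: select, filter, and sort by multiple fields
  - Current solution: select the data from the database with SQL
  - Problems: because of how the business logic is designed, data may be scattered across several databases and tables; join queries; concurrency pressure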

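Goals
- Timely updates (3 minutes)
- Fast rebuilds (< 2 hours)
- Configurable (embrace changing requirements)
- Monitorable (operations-friendly)
- SLA: always writable, always readable; when something goes wrong, the only visible symptom is update latency
- High performance: able to withstand heavy traffic and concurrency pressure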

Progress
- April 1, 2009: search 2.0 init (happy holiday!)
- April 12: version number changed to 1.5
- June 1: search 1.5 starts its trial run in production (happy holiday again!)

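Basic ideas
- Separate index from storage; two-step read
- Separate reads from writes
- Separate update from rebuild
- Split the index into a big index and a small index
- Open each new small index fresh (rolling small index); reopen the big index
- Warm up new indexes
- More: http://blog.fulin.org/category/tech/lucene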

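Separating index from storage; two-step read
- The index stores only the id; every other field is indexed but not stored.
- Advantages:
  - Keeps the index size within an acceptable range
  - Speeds up index reads
  - Improves index cache efficiency
- Disadvantage:
  - A search needs an extra request to fetch the other required fields (Lucene + DB approach)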

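Separating reads from writes
- Advantages:
  - Lower programming complexity
  - Ensures availability and scalability of the search service (index files can be distributed to several machines that all serve requests)
  - Faster index updates
- Disadvantage:
  - Extra logic is needed to refresh the read-side index (reopen)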

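Separating update and rebuild
- While a rebuild runs, update keeps updating as usual
- The rebuild must fold the updates that arrive during the rebuild into the new index
- When the rebuild finishes, it notifies update to switch to the new index and continue updating
- Inter-process communication currently uses the most primitive approach: touching flag files on disk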
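The flag-file signalling mentioned above can be sketched in a few lines of Java. This is only an illustration of the idea, not the project's code: the class name `FlagFile` and the file name used in the usage note are assumptions.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Flag-file signalling between two daemons that share a directory (hypothetical helper). */
final class FlagFile {
    private final Path path;

    FlagFile(Path path) {
        this.path = path;
    }

    /** Producer side: the equivalent of `touch`; raising an already raised flag is harmless. */
    void raise() throws IOException {
        try {
            Files.createFile(path);
        } catch (FileAlreadyExistsException alreadyRaised) {
            // The flag is already set, nothing to do.
        }
    }

    /** Consumer side: returns true once per raised flag, because the file is removed. */
    boolean consume() throws IOException {
        return Files.deleteIfExists(path);
    }
}
```

A rebuilder would call `raise()` on something like `rebuild.finish.sign` (an assumed name), and the updater would poll `consume()` in its loop. The approach is crude but leaves state on disk that survives restarts and is easy to inspect by hand.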

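Splitting into a big index and a small index
To keep updates timely while reducing the I/O pressure caused by frequently syncing the index (from the write side to the read side), the index is split into a big index (history) and a small index (recent updates). Records in the small index are merged into the big index periodically. This lowers the sync frequency of the history index and reduces sync I/O without affecting search results.
Most search results exhibit temporal locality: recently updated records are more likely to rank near the top. So the following policy can be configured: if the small index alone yields enough results, skip the big index and return the results to the client directly, which speeds up the search.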

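Splitting into a big index and a small index (continued)
- Adding a new record:
  - Add to the small index
- Updating a record:
  - Delete from the big index (mark as deleted)
  - Delete from the small index (physical delete)
  - Add to the small index
- Deleting a record:
  - Delete from the big index (mark as deleted)
  - Delete from the small index (physical delete)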
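The add/update/delete rules and the "skip the big index when the small one has enough hits" policy can be modelled without Lucene. The sketch below is a toy in-memory model under stated assumptions: documents are plain strings, matching is a substring test, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the big/small split; not the real Lucene-backed implementation. */
final class SplitIndexModel {
    private final Map<String, String> big = new LinkedHashMap<>();
    private final Set<String> deletedInBig = new HashSet<>();    // mark-deleted ids
    private final Map<String, String> small = new LinkedHashMap<>();

    void add(String id, String doc) {
        small.put(id, doc);
    }

    void delete(String id) {
        if (big.containsKey(id)) {
            deletedInBig.add(id);   // big index: mark only
        }
        small.remove(id);           // small index: physical delete
    }

    void update(String id, String doc) {
        delete(id);
        add(id, doc);
    }

    /** Periodic merge: apply the pending deletes, then fold the small index into the big one. */
    void mergeSmallIntoBig() {
        big.keySet().removeAll(deletedInBig);
        deletedInBig.clear();
        big.putAll(small);
        small.clear();
    }

    /** Searches the small index first and touches the big index only if more hits are needed. */
    List<String> search(String term, int limit) {
        List<String> hits = new ArrayList<>();
        collect(small, term, limit, hits, false);
        if (hits.size() < limit) {
            collect(big, term, limit, hits, true);
        }
        return hits;
    }

    private void collect(Map<String, String> index, String term, int limit,
                         List<String> hits, boolean honourDeleteMarks) {
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (hits.size() >= limit) {
                return;
            }
            if (honourDeleteMarks && deletedInBig.contains(e.getKey())) {
                continue;
            }
            if (e.getValue().contains(term)) {
                hits.add(e.getValue());
            }
        }
    }
}
```

Marking instead of physically deleting in the big index is what lets the big index stay untouched, and therefore unsynced, between merges.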

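Index updates on the search side
- Each sync of the small index goes into a new directory
- The n (3) most recently opened small index directories are kept
- When a new index arrives: close the oldest one, open the new one, and mark it available after warm-up
- When a new small index arrives, reopen the big index (for the sake of simple logic, the big and small indexes are refreshed together)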

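Warming up new indexes
- Traverse the newly opened index once so that its data is read into memory
- Put it into service only after the warm-up completes
- This eliminates the slow first few searches on a newly opened index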

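Architecture diagram
[Figure: labels recovered from the slide]
- Components: IndexReceiver (port 1986), IndexRebuilder, IndexUpdater, Searcher (port 1985), configuration file, DAL
- Data: Rebuild xml data, Update xml data
- Flows: data-update sync notifications; search clients call the Searcher API to search; the admin backend issues the "start index rebuild" command; cron sends the rebuild data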

IndexReceiver
- Written in Java; daemon program; psgrep "pname=IndexReceiver2"
- Port 1986
- Uses Monkey as the underlying NIO framework
- Uses the SCGI protocol
- Accepted cmds:
  - receiveTaskFile
  - receiveRebuildFile
  - rebuildIndex
    - stage: start
    - stage: end
  - receiveDict (not implemented)

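IndexReceiver (continued)
- Writes byte[] docData to an xml file
  - Update files go to ConfigUtils.getTopTaskFileDir(indexId)
  - Rebuild files go to ConfigUtils.getRootRebuildFileDir(indexId)
  - During a rebuild, a copy of every update also goes to ConfigUtils.getRootRebuildUpdateFileDir(indexId)
- For atomicity, data is first written to .0.***.xml and renamed once the write completes
- The file name carries size information, i.e. how many doc records the file contains
- The receiver only writes; the updater and rebuilder later delete (back up) the files
- TODO: several receivers, to guarantee "always writable"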
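The write-then-rename step can be sketched with `java.nio.file`. This is a minimal illustration, not the receiver's code: the class name, the `baseName.docCount.xml` naming scheme, and the `isReady` filter are assumptions; only the `.0.` temporary prefix comes from the slide.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: publish an xml file so readers never see a half-written one. */
final class AtomicXmlWriter {

    /** Writes docData under a hidden temporary name, then renames it into place. */
    static Path write(Path dir, String baseName, int docCount, byte[] docData) throws IOException {
        String finalName = baseName + "." + docCount + ".xml";   // assumed naming scheme
        Path tmp = dir.resolve(".0." + finalName);
        Path dst = dir.resolve(finalName);
        Files.write(tmp, docData);
        // A rename within one directory is atomic on common file systems.
        return Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Consumers pick up only finished files: *.xml that does not start with a dot. */
    static boolean isReady(Path file) {
        String name = file.getFileName().toString();
        return name.endsWith(".xml") && !name.startsWith(".");
    }
}
```

Because the temporary file lives in the same directory as the final one, the rename never crosses a file system boundary, which is what makes `ATOMIC_MOVE` feasible.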

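IndexRebuilder
- Written in Java; daemon program; psgrep "pname=IndexRebuilder2"
- Uses Executors.newSingleThreadScheduledExecutor() to check periodically
  - When a new rebuild request is found, start a new thread to execute it
  - When a rebuild thread has finished or was forced to exit, clean it up
- AddShutdownHook
  - Close all open IndexWriter instances before exiting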
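The periodic check plus shutdown hook might look roughly like this. It is a sketch under assumptions: the class `PeriodicChecker` and its methods are invented, and the `cleanup` callback stands in for closing the open IndexWriter instances.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper around the single-threaded scheduler used by the daemons. */
final class PeriodicChecker {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Runs check repeatedly, waiting periodMillis between the end of one run and the next. */
    void start(Runnable check, long periodMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                check.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would cancel all later runs,
                // so report it and keep the schedule alive.
                System.err.println("periodic check failed: " + e);
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Registers cleanup to run when the JVM exits normally or receives SIGTERM. */
    void closeOnShutdown(Runnable cleanup) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            scheduler.shutdownNow();
            cleanup.run();
        }));
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

The try/catch inside the task matters: a `ScheduledExecutorService` suppresses every subsequent execution once one run throws, which would silently stop the daemon from noticing new requests.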

IndexRebuilder (continued)
- RebuildExecutor thread
  - Forced stop: ConfigUtils.getRootRebuildFileDir(indexId) + "/stop.sign"
  - Loops over the xml files under getRootRebuildFileDir(indexId)
  - When getRootRebuildFileDir(indexId) has no xml files left and the rebuild.end flag is present, the rebuild moves on to the next stage
  - Processes the update data collected during the rebuild: ConfigUtils.getRootRebuildUpdateFileDir(indexId)
  - OptimizeAndCloseIdx
  - SetRebuildFinishSign
  - Stop (switching indexes is done by IndexUpdater)
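One way to read the loop above is as a small decision function over the contents of the rebuild directory. The sketch below assumes exactly that; the class, the enum, and the rule that dot-prefixed files are still being written are illustrative, while `stop.sign` and `rebuild.end` are the names from the slide.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical decision step for a RebuildExecutor-style loop. */
final class RebuildPoller {
    enum Step { STOP, PROCESS_XML, NEXT_STAGE, WAIT }

    static Step next(Path rebuildDir) throws IOException {
        if (Files.exists(rebuildDir.resolve("stop.sign"))) {
            return Step.STOP;            // forced stop wins over everything else
        }
        if (hasReadyXml(rebuildDir)) {
            return Step.PROCESS_XML;     // drain pending data first
        }
        if (Files.exists(rebuildDir.resolve("rebuild.end"))) {
            return Step.NEXT_STAGE;      // no data left and the sender has finished
        }
        return Step.WAIT;                // nothing yet; sleep and poll again
    }

    private static boolean hasReadyXml(Path dir) throws IOException {
        // Assumption: names starting with a dot are temporary files still being written.
        DirectoryStream.Filter<Path> ready = p -> {
            String name = p.getFileName().toString();
            return name.endsWith(".xml") && !name.startsWith(".");
        };
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, ready)) {
            return files.iterator().hasNext();
        }
    }
}
```

Checking for xml files before `rebuild.end` ensures the end flag is honoured only after every received file has been processed.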

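IndexUpdater
- Written in Java; daemon program; psgrep "pname=IndexUpdater2"
- Uses Executors.newSingleThreadScheduledExecutor() to check periodically
  - When a new update index is found, start a new thread to execute it
  - When an update thread has been deleted or was forced to exit, clean it up
  - To delete one, simply remove the ConfigUtils.getTopTaskFileDir(indexId) directory
- AddShutdownHook
  - Close all open IndexWriter instances before exiting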

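IndexUpdater (continued)
- UpdateExecutor thread
  - Forced stop: ConfigUtils.getTopTaskFileDir(indexId) + "/stop.sign"
  - Deletion: remove the ConfigUtils.getTopTaskFileDir(indexId) directory
  - Checks whether a newly rebuilt index is available (details on the next slide)
  - Loops over the xml files under getTopTaskFileDir(indexId)
    - No xml files? Thread sleep, then look again
    - xml files present?
      - Do the big and small indexes need merging?
      - Process the xml files
  - ShootSnap
    - Copy a snapshot of the index to the snap directory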

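IndexUpdater (continued, part 2)
- UpdateExecutor thread (continued)
  - If a newly rebuilt index is available:
    - Back up oldTaskFiles and oldIdxFiles
    - Copy the new idx to the updater's src idx, and the rebuilder's update xml to the updater's task dir
    - Call SetRebuildTransIdxFinishSign() to tell the receiver that the idx transfer has finished
    - The receiver stops copying update xml to the rebuild update directory and does some cleanup
    - The updater thread returns to its normal update loop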

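Trans
- Bash script; one process per indexId, started and stopped by ControlCenter or monitor
- Watches the subdirectories under /home/data/search2/indexes/src-snap/${idx}
- If a subdirectory exists and contains copy.done.sign (the flag IndexUpdater touches once shootSnap completes), rsync the subdirectory to dest
- During rsync: first rsync the bigidx directory, then rsync all lucene idx files in the subdirectory; once everything has succeeded, rsync a trans.done.sign
- Delete the subdirectory after success
- Exit if the /home/data/search2/indexes/src-snap/${idx} directory is deleted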

Searcher
- Written in Java; daemon program; psgrep "pname=Search2"
- Port 1985
- Uses Monkey as the underlying NIO framework
- Uses the SCGI protocol
- Accepted cmds:
  - search
  - stat (statistics)
  - receiveDict (not implemented)

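Searcher (continued)
- First thing after startup: warmUpAllIndex (all subdirectories under ConfigUtils.getDestIndexBaseDir())
- MySearcherRoller:
  - Currently uses ParallelMultiSearcher, composed of the big index's IndexSearcher and the small index's IndexSearcher. The big index may be empty; the small index may not.
  - Currently maintains 3 ParallelMultiSearcher instances, i.e. 3 pairs of big-index and small-index IndexSearcher, arranged as a ring queue
  - Rolling the small index: watch the subdirectories under ConfigUtils.getDestIndexDir(indexId)
    - A subdirectory exists, is a lucene idx directory, and contains trans.done.sign
    - Close one old small-index IndexSearcher and open a new small-index IndexSearcher
    - Reopen the IndexReader of the big index's IndexSearcher
    - Create a new ParallelMultiSearcher from these 2 new IndexSearcher instances
    - Warm up, put it into the queue of available searchers, and finally advance the queue pointer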
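The ring queue of searchers can be illustrated with a generic fixed-size ring. This is a guess at the shape of MySearcherRoller, not its actual code: `SearcherRing`, `roll`, and `current` are invented names, and the type parameter stands in for ParallelMultiSearcher.

```java
/** Hypothetical fixed-size ring that always hands out the newest installed searcher. */
final class SearcherRing<S> {
    private final Object[] slots;
    private int newest = -1;   // index of the slot currently served; -1 while empty

    SearcherRing(int size) {
        slots = new Object[size];
    }

    /**
     * Installs an already warmed-up searcher in the oldest slot and returns the
     * searcher that was evicted from it (null while the ring is still filling).
     */
    synchronized S roll(S warmedUp) {
        int next = (newest + 1) % slots.length;
        @SuppressWarnings("unchecked")
        S evicted = (S) slots[next];
        slots[next] = warmedUp;
        newest = next;         // advance the pointer last, once the slot is ready
        return evicted;
    }

    /** The searcher new requests should use. */
    synchronized S current() {
        if (newest < 0) {
            throw new IllegalStateException("no searcher installed yet");
        }
        @SuppressWarnings("unchecked")
        S searcher = (S) slots[newest];
        return searcher;
    }
}
```

With a ring of 3, a searcher stays installed for two more rolls after it stops being current, which gives in-flight searches time to finish before the evicted instance is closed.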

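Searcher (continued, part 2)
- Precondition for closing a small index: no other thread is still using it
  - Method: counting (get +1, return -1; 0 means not in use)
  - Do the return in a try/finally block, so that a failed search still returns the searcher
- Close logic
  - The small index is closed normally
  - The big index is not closed; it waits to be reopened the next time it is used
- Skipping small-index updates: if several newer small indexes are waiting at once, open the newest and skip the others
- Error handling: anything can fail anywhere; what we must strive for is that, whenever possible, the service never stops, even at the cost of somewhat delayed updates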
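The get/return counting can be sketched as a small wrapper that defers the close until the last user has returned the resource. The class below is an assumption-laden illustration: `CountedResource` is an invented name and the `closer` callback stands in for closing a small-index IndexSearcher.

```java
/** Hypothetical reference-counted handle: close happens only when nobody is using it. */
final class CountedResource {
    private final Runnable closer;
    private int inUse;
    private boolean retired;
    private boolean closed;

    CountedResource(Runnable closer) {
        this.closer = closer;
    }

    /** get: +1. Returns false if the resource has already been retired. */
    synchronized boolean acquire() {
        if (retired) {
            return false;
        }
        inUse++;
        return true;
    }

    /** return: -1. Performs the deferred close when the last user leaves. */
    synchronized void release() {
        inUse--;
        closeIfIdle();
    }

    /** Requests a close: immediate if idle, otherwise on the last release(). */
    synchronized void retire() {
        retired = true;
        closeIfIdle();
    }

    synchronized boolean isClosed() {
        return closed;
    }

    private void closeIfIdle() {
        if (retired && inUse == 0 && !closed) {
            closed = true;
            closer.run();
        }
    }
}
```

A search thread would wrap its work as `if (r.acquire()) { try { /* search */ } finally { r.release(); } }`, so the count is returned even when the search throws.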

Searcher (continued, part 3)
- Stat statistics
  - http://searchadmin.imobile.com.cn/admin.index.php?a=search_status
  - Status of each index for the given indexId
  - Xml data
  - Backup directory
  - Log statistics
- stat.sh
  - Similar functionality

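Cleaner
- Bash script; one process per indexId, started and stopped by ControlCenter or monitor
- Deletes indexes that the Searcher has closed or skipped
- TODO: delete broken index directories left inconsistent for whatever reason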

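Monitor
- Bash script, started by ControlCenter
- Monitors whether each of the system's processes exists; if one is missing, starts a new one
- The pid files of all processes are under /home/data/search2/logs/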

ControlCenter
- Bash script
- Usage: ./controlCenter.sh {start|stop|restart} {all|receiver|updater|rebuilder|searcher|trans|cleaner}
  OR: ./controlCenter.sh {start|stop|restart} monitor
  OR: ./controlCenter.sh {mkdirs}
- On "stop all", the two service processes searcher and receiver are not stopped (use "stop all all")
- When deploying on a new machine, first run ./controlCenter.sh mkdirs to create the required directory structure

Lessons learned during development (1)
- Copying files with FileChannel.transferTo failed:
  - "An attempt is made to read up to count bytes starting at the given position in this channel's file and write them to the target channel. An invocation of this method may or may not transfer all of the requested bytes; whether or not it does so depends upon the natures and states of the channels. Fewer than the requested number of bytes are transferred if this channel's file contains fewer than count bytes starting at the given position, or if the target channel is non-blocking and it has fewer than count bytes free in its output buffer."
- Fix: check the copied size and resume from where the transfer stopped
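The "check the copied size and resume" fix amounts to calling transferTo in a loop. A minimal sketch follows; the class name `ChannelCopy` is an assumption, and treating a zero-byte transfer as an error is a simplification chosen for this example.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: copy a file with transferTo, resuming after partial transfers. */
final class ChannelCopy {

    /** Copies src to dst and returns the number of bytes copied. */
    static long copy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                // transferTo may move fewer bytes than requested, so continue
                // from the position actually reached.
                long moved = in.transferTo(position, size - position, out);
                if (moved <= 0) {
                    // Simplification: no progress usually means the source shrank.
                    throw new IOException("copy stalled at " + position + " of " + size + " bytes");
                }
                position += moved;
            }
            return position;
        }
    }
}
```

A single call without the loop works for small files on many platforms, which is why the bug tends to surface only with multi-gigabyte index files.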

Lessons learned during development (2)
- The kill problem
  - Never kill smart frog, don't kill -9 Java
  - Use Runtime.getRuntime().addShutdownHook to do the cleanup
  - A Lucene IndexWriter needs to be closed!

Lessons learned during development (3)

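Continuous improvement
- Detect configuration file changes and reload automatically
- Deploy Searcher and Receiver on multiple machines
- Split the tokenization module out into a service
- Improve the tokenization algorithm
- Data mining on search keywords (automatic discovery of new search terms)
- Search suggestions
- Search autocompletion
- Improve the ranking algorithm
- Search result evaluation???
- Real-time search (lucene 3.0)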

References
- Lucene
  - http://lucene.apache.org/
  - http://lucene.apache.org/java/2_4_1/api/index.html
- Chinese tokenization
  - http://code.google.com/p/paoding/
  - http://code.google.com/p/imdict-chinese-analyzer/
  - http://code.google.com/p/mmseg4j/
- Monkey (low-level asynchronous network I/O framework for Java)
- DAL

More discussion
- About imobile
  - Home page: http://www.imobile.com.cn/
  - About: http://www.imobile.com.cn/about.html
  - Team blog: http://team.imobile.com.cn/
- Longker (V1.0)
  - http://www.longker.org/
- About me (V1.5)
  - http://www.fulin.org/
  - http://twitter.com/tangfl
