Source: unknown  Date: 2015-11-30 15:53  Author: xxadmin
Before analyzing how Sphinx works, let me first clear up why the name Coreseek comes up so often. Sphinx does not support Chinese indexing and search out of the box, so Coreseek built the Coreseek full-text search server on top of Sphinx. It ships libmmseg, a Chinese word segmentation package designed for Sphinx that includes the mmseg segmenter, and it is currently the most widely used Chinese search solution based on Sphinx.

When Sphinx is used, full-text indexing is handed over to Sphinx: Sphinx returns the IDs of the records that contain the searched word, and those IDs are then used to locate the exact rows in the database. The whole process is shown in the figure below:

[Figure: overall search flow]

A Sphinx index file does not store the complete data, only an array made up of IDs and tokens. The index file cannot be viewed directly, but we can verify this with the search tool.

First, build the index:

    /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf
    Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
    Copyright (c) 2007-2011,
    Beijing Choice Software Technologies Inc (http://www.coreseek.com)

Then use search to look up the word test:

    /usr/local/coreseek/bin/search test -c /usr/local/coreseek/etc/sphinx.conf
    Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
    Copyright (c) 2007-2011,
    Beijing Choice Software Technologies Inc (http://www.coreseek.com)

    using config file '/usr/local/coreseek/etc/sphinx.conf'...
    index 'test1': query 'test ': returned 3 matches of 3 total in 0.050 sec

    displaying matches:
    1. document=1, weight=2421, group_id=1, date_added=Thu Jan 8 21:43:32 2015
        id=1
        group_id=1
        group_id2=5
        date_added=2015-01-08 21:43:32
        title=test one
        content=this is my test document number one. also checking search within phrases.
    2. document=2, weight=2421, group_id=1, date_added=Thu Jan 8 21:43:32 2015
        id=2
        group_id=1
        group_id2=6
        date_added=2015-01-08 21:43:32
        title=test two
        content=this is my test document number two
    3. document=4, weight=1442, group_id=2, date_added=Thu Jan 8 21:43:32 2015
        id=4
        group_id=2
        group_id2=8
        date_added=2015-01-08 21:43:32
        title=doc number four
        content=this is to test groups

    words:
    1. 'test': 3 documents, 5 hits

Then use search to look up the word this:

    /usr/local/coreseek/bin/search this -c /usr/local/coreseek/etc/sphinx.conf
    Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
    Copyright (c) 2007-2011,
    Beijing Choice Software Technologies Inc (http://www.coreseek.com)

    using config file '/usr/local/coreseek/etc/sphinx.conf'...
    index 'test1': query 'this ': returned 4 matches of 4 total in 0.000 sec

    displaying matches:
    1. document=1, weight=1304, group_id=1, date_added=Thu Jan 8 21:43:32 2015
        id=1
        group_id=1
        group_id2=5
        date_added=2015-01-08 21:43:32
        title=test one
        content=this is my test document number one. also checking search within phrases.
    2. document=2, weight=1304, group_id=1, date_added=Thu Jan 8 21:43:32 2015
        id=2
        group_id=1
        group_id2=6
        date_added=2015-01-08 21:43:32
        title=test two
        content=this is my test document number two
    3. document=3, weight=1304, group_id=2, date_added=Thu Jan 8 21:43:32 2015
        id=3
        group_id=2
        group_id2=7
        date_added=2015-01-08 21:43:32
        title=another doc
        content=this is another group
    4. document=4, weight=1304, group_id=2, date_added=Thu Jan 8 21:43:32 2015
        id=4
        group_id=2
        group_id2=8
        date_added=2015-01-08 21:43:32
        title=doc number four
        content=this is to test groups

    words:
    1. 'this': 4 documents, 4 hits

From this we can see that searching for a keyword mainly returns an array of table IDs together with their match weights.
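The two-step flow described above (the index hands back IDs, then the database is queried by ID) can be sketched in Python. This is only a toy model and not Sphinx's real index format or API: the in-memory sqlite3 database stands in for MySQL, and the names build_database, build_index, search_ids, and fetch_rows are hypothetical helpers made up for this illustration. The sample rows are the four documents from the search output.

```python
import re
import sqlite3
from collections import defaultdict

# Sample rows copied from the search output; sqlite3 stands in for MySQL here.
ROWS = [
    (1, "test one", "this is my test document number one. also checking search within phrases."),
    (2, "test two", "this is my test document number two"),
    (3, "another doc", "this is another group"),
    (4, "doc number four", "this is to test groups"),
]

def build_database(rows):
    """Create an in-memory table that holds the full documents."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
    conn.executemany("INSERT INTO documents VALUES (?, ?, ?)", rows)
    return conn

def build_index(rows):
    """Toy inverted index: token -> {doc_id: hit count}. No document text is stored."""
    index = defaultdict(dict)
    for doc_id, title, content in rows:
        for token in re.findall(r"\w+", f"{title} {content}".lower()):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

def search_ids(index, word):
    """Step 1: return only the IDs of documents containing the word."""
    return sorted(index.get(word.lower(), {}))

def fetch_rows(conn, ids):
    """Step 2: use those IDs to locate the full rows in the database."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, title FROM documents WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(sql, ids).fetchall()

conn = build_database(ROWS)
index = build_index(ROWS)
ids = search_ids(index, "test")
print(ids)
print(fetch_rows(conn, ids))
```

The toy index keeps only tokens, IDs, and hit counts, so the titles and contents have to come from the database in the second step, which mirrors the division of labor described in the article.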
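Note: you may already have spotted a critical problem. Once the Sphinx full-text index has been built, any rows newly added to MySQL cannot be found through Sphinx until indexer is run again. Even with the --rotate option, rebuilding takes a long time when there is a lot of data. How can this be solved? Tomorrow I will cover main and delta indexes, and how to use cron to add new data to the delta index automatically.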
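The staleness problem in the note can be reproduced with a small Python sketch. It is a toy model under stated assumptions, not Sphinx itself: build_index is a hypothetical stand-in for running indexer, and the table contents are made up for illustration.

```python
import re

def build_index(rows):
    """Toy stand-in for running the indexer: snapshot of token -> set of IDs."""
    index = {}
    for doc_id, text in rows.items():
        for token in re.findall(r"\w+", text.lower()):
            index.setdefault(token, set()).add(doc_id)
    return index

# Hypothetical table contents at indexing time.
table = {1: "test one", 2: "test two"}
index = build_index(table)

# A row is inserted into the database after the index was built.
table[3] = "test three"
stale_result = sorted(index.get("test", set()))

# Only a rebuild makes the new row searchable.
index = build_index(table)
fresh_result = sorted(index.get("test", set()))

print(stale_result)
print(fresh_result)
```

Because the index is a snapshot taken at build time, the new row stays invisible to search until the index is rebuilt, and a full rebuild costs time proportional to the whole table.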