Chapter 6 Experimental Results and Analysis
6.1 Experimental Results
Our experiments crawled 2,500 papers in total. Of these, 1,686 are cited by at least one other paper in the collection, accounting for 72,471 citations overall, i.e. each cited paper is cited roughly 43 times on average. Across these papers we extracted 160,046 comment sentences in total, i.e. about 95 comment sentences per cited paper, and each paper receives on average about 2.2 comment sentences per citing paper.
Given these ratios, if the interface ultimately needs to display 5 comments per paper, then a paper cited by only one or two other papers already obtains a sufficient comment set, and a paper cited by five or more papers obtains a comment set of very good quality.
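The averages quoted above follow directly from the raw corpus counts; as a sanity check (the variable names are ours, the figures are from the text):

```python
# Corpus statistics reported above.
papers_crawled = 2500
papers_cited = 1686         # papers cited at least once within the collection
total_citations = 72471     # citation instances among the crawled papers
comment_sentences = 160046  # comment sentences extracted in total

citations_per_cited_paper = total_citations / papers_cited   # ~43.0
comments_per_cited_paper = comment_sentences / papers_cited  # ~94.9
comments_per_citation = comment_sentences / total_citations  # ~2.2

# With ~2.2 comment sentences per citing paper, about 5 / 2.2 ~ 2.3
# citing papers suffice to fill the 5 comment slots shown in the UI.
citing_papers_needed = 5 / comments_per_citation
```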
6.2 Detailed Analysis
To better illustrate the effect of our system, we randomly select one paper with a relatively large number of comments as an example, and use it to show what the extracted comments and the generated summary contribute [Elkiss, et al., 2008].
Paper Name:
Three-level caching for efficient query processing in large Web search engines
As the title suggests, this paper uses three-level caching to handle the large query load of a web search engine.
Abstract:
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
The abstract first describes the heavy workload that search engines face; it then notes that existing two-level caching techniques have limitations, and that the authors build a three-level scheme by inserting an intermediate layer into the original cache hierarchy; finally it states that several algorithms are proposed and that the experimental results show good performance.
Reading the abstract alone already gives us an overview of the paper and its context.
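The lookup cascade described in the abstract can be sketched in code. This is only our illustrative reading of the scheme: the class name, the pair-admission rule, and the omitted ranking step are our assumptions, not the authors' implementation.

```python
from itertools import combinations

class ThreeLevelCache:
    """Illustrative three-level lookup cascade (our sketch, not the paper's code):
    level 1 caches full query results, level 2 caches intersections of
    inverted lists for term pairs, level 3 caches whole inverted lists."""

    def __init__(self, fetch_list):
        self.fetch_list = fetch_list  # term -> posting list, e.g. from disk
        self.result_cache = {}        # level 1: query -> result
        self.intersection_cache = {}  # level 2: term pair -> intersection
        self.list_cache = {}          # level 3: term -> inverted list

    def postings(self, term):
        if term not in self.list_cache:            # level 3
            self.list_cache[term] = self.fetch_list(term)
        return set(self.list_cache[term])

    def answer(self, terms):
        key = tuple(sorted(set(terms)))
        if key in self.result_cache:               # level 1 hit
            return self.result_cache[key]
        # Level 2: start from a cached pair intersection when available.
        docs = None
        for pair in combinations(key, 2):
            if pair in self.intersection_cache:
                docs = set(self.intersection_cache[pair])
                break
        if docs is None:
            docs = self.postings(key[0])
        for term in key:
            docs &= self.postings(term)
        if len(key) == 2:                          # admit the pair intersection
            self.intersection_cache[key] = sorted(docs)
        result = sorted(docs)  # ranking is omitted in this sketch
        self.result_cache[key] = result
        return result
```

A repeated query is then served from level 1, a superset query can reuse a level-2 intersection, and any other query falls back to the per-term lists at level 3.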
Comment:
(1)They may be considered separate and complementary to a cache-based approach. Raghavan and Sever [the cited paper], in one of the first papers on exploiting user query history, propose using a query base, built upon a set of persistent “optimal” queries submitted in the past, to improve the retrieval effectiveness for similar future queries. Markatos [10] shows the existence of temporal locality in queries, and compares the performance of different caching policies.
(2)Our results show that even under the fairly general framework adopted in this paper, geographic search queries can be evaluated in a highly efficient manner and in some cases as fast as the corresponding text-only queries. The query processor that we use and adapt to geographic search queries was built by Xiaohui Long, and earlier versions were used in [26, 27]. It supports variants of all the optimizations described in Subsection 1.
(3)the survey by Gaede and Günther in [17]. In particular, our algorithms employ spatial data organizations based on R*-tree [5], grid files [the cited paper], and space-filling curves - see [17, 36] and the references therein. A geographic search engine may appear similar to a Geographic Information System (GIS) [20] where documents are objects in space with additional non-spatial attributes (the words they contain).
Below we analyze each of these comments in turn.
Comment (1) does not address the source paper's three-level cache architecture as a whole; instead it highlights one of its techniques: exploiting user query history to build a query base from the more successful queries submitted in the past, in order to speed up the handling of similar future queries in a search engine. This comment neatly conveys one technique from the source paper, and also shows that the technique is not tied to three-level caching: it can be applied elsewhere, for example in personalized search.
Comment (2) shows that the citing work reused the query processor from the source paper to build a geographic search engine. From this comment we learn about follow-up work on the source paper and what it is useful for: the source paper's contribution is not limited to the three-level cache architecture, and its query-processing model may well be even more widely useful.
Comment (3) shows that the source paper provides a grid-file structure, which, combined with R*-trees and space-filling curves, can form specialized spatial data structures. This, too, represents a line of follow-up work on the source paper, and helps readers view the paper from a broader perspective.
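Among the structures named in comment (3), the space-filling curve is the easiest to make concrete. The following minimal Z-order (Morton) encoding is a generic illustration of the idea, unrelated to the source paper's actual code:

```python
def morton_encode(x, y, bits=16):
    """Interleave the bits of (x, y) into one Z-order key.

    Nearby points on the plane tend to receive nearby keys, so a
    one-dimensional index (e.g. a B-tree) can approximate
    two-dimensional locality -- the core idea of a space-filling curve.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x occupies even bit slots
        code |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit slots
    return code
```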
Impact-based Summary:
(1)This motivates the search for new techniques that can increase the number of queries per second that can be sustained on a given set of machines, and in addition to index compression and query pruning, caching techniques have been widely studied and deployed.
(2)Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels.
(3)To do so, the engine traverses the inverted list of each query term, and uses the information embedded in the inverted lists, about the number of occurrences of the terms in a document, their positions, and context, to compute a score for each document containing the search terms.
(4)Query characteristics: We first look at the distribution of the ratios and total costs for queries with various numbers of terms, by issuing these queries to our query processor with caching completely turned off.
(5)Thus, recent queries are analyzed by the greedy algorithm to allocate space in the cache for projections likely to be encountered in the future, and only these projections are allowed into the cache.
Finally, we analyze the impact-based summary we obtained; to save space, only the first five sentences are examined here.
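Sentence (5) mentions a greedy algorithm that decides which projections are allowed into the cache. As a hedged illustration of such a benefit-per-size admission policy (the scoring rule and names are our assumption, not the paper's algorithm):

```python
def greedy_admit(candidates, capacity):
    """Admit cache entries in decreasing benefit/size order until the
    budget is spent (a generic greedy sketch, not the paper's algorithm).

    candidates: (name, size, expected_benefit) tuples, where the benefit
    of a projection could be estimated from recent query frequencies.
    """
    admitted, used = [], 0
    for name, size, benefit in sorted(
            candidates, key=lambda c: c[2] / c[1], reverse=True):
        if used + size <= capacity:
            admitted.append(name)
            used += size
    return admitted
```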