Chapter 6 Experimental Results and Analysis
6.1 Experimental Results
Our experiments crawled 2,500 papers in total. Of these, 1,686 are cited by at least one other paper in the collection, accounting for 72,471 citations overall, i.e. each cited paper is cited roughly 43 times on average. Across these papers we extracted 160,046 comment sentences in total, i.e. about 95 comment sentences per cited paper, and each paper receives on average about 2.2 comment sentences per citing paper.
Given these ratios, if the interface ultimately needs to display 5 comments per paper, then a paper cited by only one or two other papers already obtains a sufficient comment set, and a paper cited by five or more papers obtains a comment set of very good quality.
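The averages quoted above follow directly from the raw corpus counts; as a sanity check (the variable names are ours, the figures are from the text):

```python
# Corpus statistics reported above.
papers_crawled = 2500
papers_cited = 1686         # papers cited at least once within the collection
total_citations = 72471     # citation instances among the crawled papers
comment_sentences = 160046  # comment sentences extracted in total

citations_per_cited_paper = total_citations / papers_cited   # ~43.0
comments_per_cited_paper = comment_sentences / papers_cited  # ~94.9
comments_per_citation = comment_sentences / total_citations  # ~2.2

# With ~2.2 comment sentences per citing paper, about 5 / 2.2 ~ 2.3
# citing papers suffice to fill the 5 comment slots shown in the UI.
citing_papers_needed = 5 / comments_per_citation
```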
6.2 Detailed Analysis
To better illustrate the effect of our system, we randomly select one paper with a relatively large number of comments as an example, and use it to show what the extracted comments and the generated summary contribute [Elkiss, et al., 2008].
Paper Name:
Three-level caching for efficient query processing in large Web search engines
As the title suggests, this paper uses three-level caching to handle the large query load of a web search engine.
Abstract:
Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level. We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.
The abstract first describes the heavy workload that search engines face; it then notes that existing two-level caching techniques have limitations, and that the authors build a three-level scheme by inserting an intermediate layer into the original cache hierarchy; finally it states that several algorithms are proposed and that the experimental results show good performance.
Reading the abstract alone already gives us an overview of the paper and its context.
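The lookup cascade described in the abstract can be sketched in code. This is only our illustrative reading of the scheme: the class name, the pair-admission rule, and the omitted ranking step are our assumptions, not the authors' implementation.

```python
from itertools import combinations

class ThreeLevelCache:
    """Illustrative three-level lookup cascade (our sketch, not the paper's code):
    level 1 caches full query results, level 2 caches intersections of
    inverted lists for term pairs, level 3 caches whole inverted lists."""

    def __init__(self, fetch_list):
        self.fetch_list = fetch_list  # term -> posting list, e.g. from disk
        self.result_cache = {}        # level 1: query -> result
        self.intersection_cache = {}  # level 2: term pair -> intersection
        self.list_cache = {}          # level 3: term -> inverted list

    def postings(self, term):
        if term not in self.list_cache:            # level 3
            self.list_cache[term] = self.fetch_list(term)
        return set(self.list_cache[term])

    def answer(self, terms):
        key = tuple(sorted(set(terms)))
        if key in self.result_cache:               # level 1 hit
            return self.result_cache[key]
        # Level 2: start from a cached pair intersection when available.
        docs = None
        for pair in combinations(key, 2):
            if pair in self.intersection_cache:
                docs = set(self.intersection_cache[pair])
                break
        if docs is None:
            docs = self.postings(key[0])
        for term in key:
            docs &= self.postings(term)
        if len(key) == 2:                          # admit the pair intersection
            self.intersection_cache[key] = sorted(docs)
        result = sorted(docs)  # ranking is omitted in this sketch
        self.result_cache[key] = result
        return result
```

A repeated query is then served from level 1, a superset query can reuse a level-2 intersection, and any other query falls back to the per-term lists at level 3.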
Comment:
(1)They may be considered separate and complementary to a cache-based approach. Raghavan and Sever [the cited paper], in one of the first papers on exploiting user query history, propose using a query base, built upon a set of persistent “optimal” queries submitted in the past, to improve the retrieval effectiveness for similar future queries. Markatos [10] shows the existence of temporal locality in queries, and compares the performance of different caching policies.
(2)Our results show that even under the fairly general framework adopted in this paper, geographic search queries can be evaluated in a highly efficient manner and in some cases as fast as the corresponding text-only queries. The query processor that we use and adapt to geographic search queries was built by Xiaohui Long, and earlier versions were used in [26, 27]. It supports variants of all the optimizations described in Subsection 1.
(3)the survey by Gaede and Günther in [17]. In particular, our algorithms employ spatial data organizations based on R*-tree [5], grid files [the cited paper], and space-filling curves - see [17, 36] and the references therein. A geographic search engine may appear similar to a Geographic Information System (GIS) [20] where documents are objects in space with additional non-spatial attributes (the words they contain).
Below we analyze each of these comments in turn.
Comment (1) does not address the source paper's three-level cache architecture as a whole; instead it highlights one of its techniques: exploiting user query history to build a query base from the more successful queries submitted in the past, in order to speed up the handling of similar future queries in a search engine. This comment neatly conveys one technique from the source paper, and also shows that the technique is not tied to three-level caching: it can be applied elsewhere, for example in personalized search.
Comment (2) shows that the citing work reused the query processor from the source paper to build a geographic search engine. From this comment we learn about follow-up work on the source paper and what it is useful for: the source paper's contribution is not limited to the three-level cache architecture, and its query-processing model may well be even more widely useful.
Comment (3) shows that the source paper provides a grid-file structure, which, combined with R*-trees and space-filling curves, can form specialized spatial data structures. This, too, represents a line of follow-up work on the source paper, and helps readers view the paper from a broader perspective.
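Among the structures named in comment (3), the space-filling curve is the easiest to make concrete. The following minimal Z-order (Morton) encoding is a generic illustration of the idea, unrelated to the source paper's actual code:

```python
def morton_encode(x, y, bits=16):
    """Interleave the bits of (x, y) into one Z-order key.

    Nearby points on the plane tend to receive nearby keys, so a
    one-dimensional index (e.g. a B-tree) can approximate
    two-dimensional locality -- the core idea of a space-filling curve.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x occupies even bit slots
        code |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit slots
    return code
```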
Impact-based Summary:
(1)This motivates the search for new techniques that can increase the number of queries per second that can be sustained on a given set of machines, and in addition to index compression and query pruning, caching techniques have been widely studied and deployed.
(2)Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels.
(3)To do so, the engine traverses the inverted list of each query term, and uses the information embedded in the inverted lists, about the number of occurrences of the terms in a document, their positions, and context, to compute a score for each document containing the search terms.
(4)Query characteristics: We first look at the distribution of the ratios and total costs for queries with various numbers of terms, by issuing these queries to our query processor with caching completely turned off.
(5)Thus, recent queries are analyzed by the greedy algorithm to allocate space in the cache for projections likely to be encountered in the future, and only these projections are allowed into the cache.
Finally, we analyze the impact-based summary we obtained; to save space, only the first five sentences are examined here.
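Sentence (5) mentions a greedy algorithm that decides which projections are allowed into the cache. As a hedged illustration of such a benefit-per-size admission policy (the scoring rule and names are our assumption, not the paper's algorithm):

```python
def greedy_admit(candidates, capacity):
    """Admit cache entries in decreasing benefit/size order until the
    budget is spent (a generic greedy sketch, not the paper's algorithm).

    candidates: (name, size, expected_benefit) tuples, where the benefit
    of a projection could be estimated from recent query frequencies.
    """
    admitted, used = [], 0
    for name, size, benefit in sorted(
            candidates, key=lambda c: c[2] / c[1], reverse=True):
        if used + size <= capacity:
            admitted.append(name)
            used += size
    return admitted
```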