Internet search result probabilities: Heaps' law and word associativity

Document Type

Article

Publication Date

2-1-2009

Abstract

We study how the number of internet search results returned for a multi-word query depends on the number of results returned when each word is searched for individually. We derive a model of search result counts for multi-word queries using the total number of pages indexed by Google, applying the Zipf power law to the distribution of words per page on the internet and Heaps' law to unique word counts. Based on data from 351 word pairs, each with exactly one hit when searched for together, and a Zipf law coefficient determined in other studies, we approximate the Heaps' law coefficient for the indexed World Wide Web (about 8 billion pages) to be β = 0.52. Previous studies used under 20,000 pages. We demonstrate through examples how the model can be used to automatically analyse the relatedness of word pairs, assigning each a value we call "strength of associativity". We demonstrate the validity of our method with word triplets and through two experiments conducted 8 months apart. We then use our model to compare the index sizes of competing search giants Yahoo and Google. © 2009 Taylor & Francis.
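
As a rough illustration of the quantities named in the abstract, the sketch below evaluates the general form of Heaps' law, V = K · n^β, and a naive independence baseline for the number of pages containing both words of a pair. This is not the model derived in the article, which is not reproduced in this record. The function names, the constant `k`, and the independence baseline are assumptions introduced here for illustration; only β = 0.52 and the index size of about 8 billion pages come from the abstract.

```python
def heaps_vocabulary(n_tokens: float, k: float, beta: float) -> float:
    """Heaps' law in its general form: distinct words V = k * n_tokens ** beta.

    k is a corpus-dependent constant; the abstract does not report one,
    so any value passed here is hypothetical.
    """
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    return k * n_tokens ** beta


def independent_joint_hits(hits_a: float, hits_b: float, total_pages: float) -> float:
    """Naive baseline (an assumption, not the article's model): expected number
    of pages containing both words if their occurrences were independent."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return hits_a * hits_b / total_pages


if __name__ == "__main__":
    BETA = 0.52          # Heaps' law coefficient reported in the abstract
    TOTAL_PAGES = 8e9    # approximate index size stated in the abstract

    # Under Heaps' law, doubling the text size multiplies vocabulary by 2 ** BETA,
    # independent of the hypothetical constant k.
    growth = heaps_vocabulary(2.0, 1.0, BETA) / heaps_vocabulary(1.0, 1.0, BETA)
    print(f"vocabulary growth factor when text doubles: {growth:.3f}")

    # Hypothetical individual hit counts for two words.
    print(independent_joint_hits(1_000, 2_000, TOTAL_PAGES))
```

A word pair whose observed joint hit count far exceeds such a baseline would be a candidate for strong association, although the article's "strength of associativity" is defined by its own model rather than by this baseline.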

Identifier

65849518568 (Scopus)

Publication Title

Journal of Quantitative Linguistics

External Full Text Location

https://doi.org/10.1080/09296170802514153

e-ISSN

1744-5035

ISSN

0929-6174

First Page

40

Last Page

66

Issue

1

Volume

16

