Google :: Search Engine Google
As a search engine, Google is a complete architecture for gathering web pages (crawling), indexing them, and performing search queries (searching) on those pages. Google Inc. is the company formed to offer the Google Search Engine to the web-searching community after Sergey Brin and Larry Page developed it at Stanford University. The Google Search Engine is a scalable, large-scale web search engine that efficiently crawls and indexes web content, producing text and hyperlink databases that are then accessed to return relevant, contextual search results in answer to user queries.
What set the Google Search Engine apart from other search engines was that it additionally used hypertext structure to determine a quality ranking for each unit of web content. This strategy allowed it to formulate and present better search results than had previously been available from rival search engines. There were problems, not least the need to deal effectively with uncontrolled hypertext collections, where the adage 'anyone can publish anything' was never more true. Throughout the existence of the web, both information and users have grown rapidly, which inevitably means that at any time there is a plethora of users inexperienced in the art of web research. The Google Search Engine was developed with the philosophy that all users, whatever their experience level, should be able to retrieve relevant results for the query terms they use.
The search engine extracts distinct terms from web content, such as words, phrases and non-contiguous word sequences, and indexes them. This allows future queries to access the Google databases and determine which documents have content most closely matching the search sequence entered. Once a list of relevant documents (a docList) containing all the search terms has been created, various algorithms determine the order in which the top 1,000 results are presented back to the searcher. The Google Search Engine thus produced improved search quality through the application of relevance and quality filtering.
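The core idea of indexing terms and then intersecting per-term document lists can be sketched in a few lines. This is a minimal illustration, not Google's implementation; the toy documents and helper names are my own.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def doc_list(index, query):
    """Return the docList: documents containing ALL the query terms."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "google search engine architecture",
    2: "web search engine crawling",
    3: "google web crawler",
}
index = build_inverted_index(docs)
print(sorted(doc_list(index, "search engine")))  # → [1, 2]
```

The docList produced here is what the ranking algorithms then order before the top results are shown.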
The Google Search Engine is not just the 'face' you see when you enter a search term on a Google homepage. A great deal of effort has gone into creating the databases that are accessed to return the required results. The web has to be crawled (web content read and analysed) and indexed before relevant results are available to be returned for the searcher to choose from.
Google :: Search Engine Google Wordle
Google Search Engine Presentation by Humagaia
Google Search Engine Design Requirements
In order for the search engine from Google to match and surpass its rivals there were certain design criteria that had to be taken into account:
Fast crawling technology was required in order to gather web documents and to keep them up-to-date.
The storage space for storing indexes and documents needed to be used efficiently.
The indexing of terabytes of data needed to be efficient.
Any query of the databases needed to be handled quickly (the most important aspect for the designers).
The data structures needed to be optimised for fast and efficient access.
As these design requirements were achieved, the Google Search Engine became the dominant player in the search engine market, surpassing its rivals very quickly.
How Search Works by Google's Matt Cutts
Google Search Engine Web Content Crawling
All search engines that maintain their own databases of web content pointers need a crawler or spider: a program that trawls the internet to find and index web content. In the Google design, a single URLServer passes lists of URLs (drawn from pages already crawled or from newly submitted URLs) to a number of Google crawlers, or Googlebots. To keep the access time for each site to a minimum, each crawler maintains its own DNS cache, and each keeps numerous connections open at once. As each fetch is performed, a series of queues moves the fetched information from state to state.
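The crawler loop described above — a frontier of URLs to visit, a per-crawler DNS cache, and newly discovered links fed back into the queue — can be sketched as follows. This is a toy, single-threaded model with a stubbed fetch function so it runs offline; the class and function names are mine, not Google's.

```python
from collections import deque
from urllib.parse import urlparse

def lookup_dns(host):
    # Stand-in for a real DNS resolution (kept offline for this sketch).
    return "192.0.2.1"

class Crawler:
    """Toy Googlebot: a URL frontier plus a per-crawler DNS cache."""
    def __init__(self, fetch):
        self.fetch = fetch           # callable: url -> (page_text, out_links)
        self.frontier = deque()      # URLs handed down by the URLServer
        self.dns_cache = {}          # host -> IP; avoids repeated lookups
        self.crawled = {}            # url -> page_text

    def resolve(self, url):
        host = urlparse(url).netloc
        if host not in self.dns_cache:          # only the first hit per host
            self.dns_cache[host] = lookup_dns(host)
        return self.dns_cache[host]

    def crawl(self, seed_urls):
        self.frontier.extend(seed_urls)
        while self.frontier:
            url = self.frontier.popleft()
            if url in self.crawled:
                continue
            self.resolve(url)
            text, links = self.fetch(url)
            self.crawled[url] = text
            self.frontier.extend(links)         # discovered URLs re-enter the queue
        return self.crawled

# Stubbed two-page "web" so the sketch is self-contained.
web = {
    "http://a.example/1": ("page one", ["http://a.example/2"]),
    "http://a.example/2": ("page two", []),
}
pages = Crawler(lambda u: web[u]).crawl(["http://a.example/1"])
```

The real system runs many such crawlers in parallel, each with hundreds of open connections, which is where the speed comes from.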
Overview Of How Search Engines Work
Search Engine And Web Crawler - Part 1
Search Engine And Web Crawlers Part 2
What Is Google PageRank?
Google Search Engine Web Content Indexing
Parsing (syntactic analysis of the grammatical structure of text to determine relationships between words and to infer meaning) - every word encountered passes to a storeserver, where it is compressed, assigned a wordID using a regularly updated lexicon, and indexed into databases ('barrels'). Each word occurrence in a document is converted into a set of 'hits' (limited to a maximum total per document). Hits record the word's position in the document, its font size and its capitalisation. The hits are categorised into:
Fancy hits – those that occur in the URL, title, anchor text and/or meta tags. The information recorded for them is capitalisation, font size (set to 7) and position. Anchor hits additionally have positional and docID information recorded.
Plain hits – everything else. These have capitalisation, relative font size and positional information recorded.
The hits are translated into a 'hit list' and distributed into forward barrels, creating a partially sorted forward index. Every link in every web page is parsed, and the important information determining where the link points to and from is stored in an anchor file.
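The hit-recording step above can be sketched with a small helper that records, for each word, a capped list of (position, capitalised, fancy) tuples and groups them into a forward index keyed by docID. The encoding is deliberately simplified (real hits are packed into compact bit fields); names here are illustrative only.

```python
from collections import defaultdict

def make_hits(doc_id, text, fancy=False, max_hits=100):
    """Record a capped hit list per word: (position, capitalised?, fancy?)."""
    hits = defaultdict(list)
    for pos, word in enumerate(text.split()):
        key = word.lower()
        if len(hits[key]) < max_hits:        # hits per word are capped
            hits[key].append((pos, word[0].isupper(), fancy))
    return hits

# Forward index: docID -> word -> hit list. Title text yields "fancy" hits.
forward_index = {
    42: make_hits(42, "Google Search Engine", fancy=True),
}
```

Sorting these forward barrels by wordID (the Sorter's job, described below) is what turns this docID-keyed structure into the word-keyed inverted index used at query time.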
It is important, therefore, to ensure that the URL and title contain the targeted keywords and that they are capitalised. The same goes for any anchor text you create to point to one of your documents (in RSS feeds, for instance). Additionally, emboldening, capitalising and increasing the font size of keywords and anchor text (where possible) will help raise your document up the rankings.
URL Resolver - this reads the anchors and converts relative URLs to absolute URLs, and then to docIDs. The anchor text is placed into the forward index, and the Google link database is populated with pairs of docIDs, which are used to compute the PageRank of every document.
Sorter – each 'forward barrel' (index) is sorted by wordID to produce an inverted index for titles, anchor hits and full document text (which is cached). An interim DumpLexicon is used to update the lexicon.
It is the combination of lexicon, inverted index and PageRank that is used to answer a Google query.
Search, Google, and Life: Sergey Brin Lecture
Google Search Engine Web Content Searching
The Google Search Engine is focused on providing quality search results, efficiently. The order of events that take place for a Google search results list to be presented back to the searcher is as follows:
1. Parse the query.
2. Convert the search words into wordIDs.
3. Search by every word for title and anchor text links.
4. Scan docLists that contain all the search terms or, if there are not enough results, a subset of the search terms.
5. Compute the Google PageRank for each document retrieved.
6. Repeat steps 4 and 5 until sufficient results are obtained or no more docLists are available.
7. Sort the documents in PageRank order and show only the top 1,000 results.
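The query-time flow above can be condensed into a short sketch: parse the query, look up each term's docList, intersect them (falling back to a subset of terms when the intersection is too small), rank the matches by PageRank and truncate. This is a toy approximation under my own naming, not Google's actual query processor.

```python
def search(query, index, pagerank, limit=1000):
    """Toy version of the listed steps: intersect docLists, rank, truncate."""
    terms = query.lower().split()                      # steps 1-2: parse the query
    doc_lists = [index.get(t, set()) for t in terms]   # step 3: look up each word
    if not doc_lists:
        return []
    matches = set.intersection(*doc_lists)             # step 4: docs with ALL terms
    if not matches:                                    # fall back to a subset of terms
        matches = set.union(*doc_lists)
    ranked = sorted(matches, key=lambda d: pagerank.get(d, 0.0),
                    reverse=True)                      # steps 5 and 7: rank and sort
    return ranked[:limit]                              # show only the top results

index = {"google": {1, 3}, "search": {1, 2}}
pagerank = {1: 0.6, 2: 0.3, 3: 0.1}
print(search("google search", index, pagerank))  # → [1]
```

In the real system the ranking combines PageRank with the IR score described under System Features, rather than PageRank alone.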
How Do Search Engines Decide How Web Sites Rank?
Google Search Engine System Features
The Google Search Engine uses a citation graph (a map of web hyperlinks, where a citation is a pointer from one site to another) to rapidly calculate an approximation of each page's importance and quality as a Google quality ranking (PageRank), which allows keyword search results to be prioritised. Links (backlinks) are normalised rather than counted as equal: an academic citation, or a link from another highly ranked web authority, is given greater importance when applied to a given page.
Google PageRank – a model of user behaviour, defined as the probability that a 'random surfer' will visit a certain page (the more links pointing to a page, the higher the probability that the surfer will find it). A damping factor, which can be applied to a single page or a group of pages, quantifies the probability that the random surfer will become bored and jump to another random page rather than follow a link. High PageRank is therefore obtained if a large number of pages link to the page, or if the pages that point to it have a high PageRank themselves.
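The published PageRank formula is PR(A) = (1 − d) + d · Σ PR(T)/C(T), summed over the pages T that link to A, where C(T) is the number of outgoing links on T and d is the damping factor (commonly 0.85). It can be computed by simple iteration; the sketch below uses a tiny hand-made link graph of my own.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) to a fixed point.
    `links` maps each page to the list of pages it links out to."""
    pages = set(links) | {t for outs in links.values() for t in outs}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        nxt = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q])
                           for q, outs in links.items() if p in outs)
            nxt[p] = (1 - d) + d * incoming
        pr = nxt
    return pr

# "a" and "b" both link to "c"; "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

As the text says: "c" ends up with the highest rank because two pages point to it, and "a" outranks "b" because a high-ranked page ("c") points to it while nothing points to "b".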
Anchor Text – the text used to describe the web content on another page to which there is a link. The Google Search Engine associates the link text with both the linking and the linked-to pages. Anchors often describe the content they point to better than that content's own page does. An additional advantage is that anchor text can exist for non-text documents such as images, programs, databases and videos, and can return results even where the content pointed to has not been crawled: thus giving even better quality search engine results for Google.
Location – the Google Search Engine records location and proximity information for all hits. This means that exact matches to search queries can be located, as well as close approximations to the exact keyword search phrase. These are given a weighted importance, so that query words occurring closest together in the text have a higher probability of producing a positive hit for the search phrase.
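The proximity weighting can be illustrated with a tiny scorer over the recorded hit positions: the closer the nearest pair of occurrences, the higher the score, with adjacent words (an exact phrase match) scoring highest. This is illustrative only; the actual engine bins proximities into a small number of weighted buckets.

```python
def proximity_score(positions_a, positions_b):
    """Score a two-word query by the closest pair of hit positions.
    A gap of 1 (adjacent words, i.e. an exact phrase) scores highest."""
    gap = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 / gap

# "search" hits at positions 2 and 9, "engine" at position 3:
# the closest pair (2, 3) is adjacent, so the score is maximal.
score = proximity_score([2, 9], [3])  # → 1.0
```

A document where the two words are five positions apart would score only 0.2 under this toy scheme, matching the idea that near-phrase matches outrank scattered occurrences.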
Characteristics of words – the citation link graph also records certain characteristics against each wordID. These include the font size and emboldening of the text, where larger and/or bolder text receives a higher weighting than the remainder of the page's content. This means that (HubPage) headings will rank higher than general content for a particular search phrase, and emboldened text will rank higher than normal text but (usually) lower than title text.
External Meta Information – the Google Search Engine also takes into account information that can be inferred about a document but that is not contained within it, such as:
The reputation of the source – on HubPages, for instance, the reputation of both the HubPages site as well as the author.
The update frequency of the content of the document – if you use RSS feeds in your document, for instance, and create another hub in a series or a new hub for an author, then the page will be updated and increase the update frequency as far as the Google Search Engine is concerned.
The quality of the content – this could be determined by whether the content is bookmarked (through bookmark sites or by using the bookmark tab) and how long a surfer stays on the document: both of which can be recorded in the Google databases.
The popularity of the document – measured by the number of reads and the duration of those reads.
The usage of the document.
And, as above, the citations.
The Google Search Engine maintains much more information about web documents than typical search engines do. Type-weights and count-weights are combined into an IR score which, together with the Google PageRank and proximity scores, allows the Google Search Engine to determine the order in which the most relevant query results are presented back to the searcher.
Google :: Search Engine Google Conclusion
This is a brief, simplified overview of the Google Search Engine and how it works. No one knows all of the tweaks or the nuances of its workings, as these are Google trade secrets. If you were told them, you would have to be shot!