Google :: Search Engine Google

Updated on March 26, 2012

As a search engine, Google is a complete architecture for gathering web pages (crawling), indexing them, and performing search queries (searching) on those pages. Google Inc. is the company that was formed to offer the Google Search Engine to the web searching community after Sergey Brin and Larry Page developed it at Stanford University. The Google Search Engine is a scalable web search engine that efficiently crawls and indexes web content, producing text and hyperlink databases that are then queried to return relevant, contextual results for user searches.

The difference between the Google Search Engine and other search engines was that it additionally used the hypertext structure of the web (the links between pages) to determine a quality ranking for each unit of web content. This strategy allowed the Google Search Engine to present better search results than were available from contemporary rival search engines. There were problems, not least the need to deal effectively with uncontrolled hypertext collections, where the adage 'anyone can publish anything' was nowhere more true. Throughout the existence of the web, both the amount of information and the number of users have grown rapidly. This inevitably means that at any time there are many users inexperienced in the art of web research. The Google Search Engine was developed with the philosophy that all users, whatever their experience level, should be able to retrieve relevant results for the query terms they use.

The search engine (Google) extracts distinct terms from web content such as words, phrases and non-contiguous word sequences and indexes them. This allows future queries to access the Google databases and determine those websites that have content most closely matching the search sequence entered. Once a list of relevant documents (docList), that contains all the search terms, has been created, various algorithms are employed to determine the order in which the top 1000 results will be presented back to the searcher. Thus the Google Search Engine produced improved search quality results through the application of relevance and quality filtering.

The Google Search Engine is not just the 'face' you see when you enter a search term on a Google homepage. A great deal of effort has gone into creating the databases that are accessed to return relevant results. The web has to be crawled (web content read and analysed) and indexed before relevant results can be returned for the searcher to choose from and read.

Google :: Search Engine Google Wordle

Google Search Engine Wordle by Humagaia

Google Search Engine Design Requirements

In order for the search engine from Google to match and surpass its rivals there were certain design criteria that had to be taken into account:

  • Fast crawling technology was required in order to gather web documents and to keep them up-to-date.

  • The storage space for storing indexes and documents needed to be used efficiently.

  • The indexing of terabytes of data needed to be efficient.

  • Any query of the databases needed to be handled quickly (the most important aspect for the designers).

  • The data structures needed to be optimised for fast and efficient access.

As these design requirements were achieved, the Google Search Engine became the dominant player in the search engine market, surpassing its rivals very quickly.

Google Search Engine Web Content Crawling

All search engines that maintain their own index of web content need a crawler or spider. These programs trawl through the internet to find web content for indexing. With the Google web crawler, a single URLServer passes lists of URLs (from those already crawled or from newly submitted URLs) to a number of Google crawlers, or Googlebots. To keep the access time for each website to a minimum, each crawler maintains its own DNS cache and keeps numerous connections open at once. As each 'fetch' is performed, a number of queues move the fetched information from state to state.
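To make the idea of a URL server feeding crawlers with their own DNS caches more concrete, here is a minimal sketch. It is not Google's code: the class names URLServer and Crawler, the FAKE_DNS table and the method names are all assumptions made for illustration, and no real network access takes place.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical stand-in for real DNS: maps hostnames to documentation-range IPs.
FAKE_DNS = {"example.com": "192.0.2.1", "example.org": "192.0.2.2"}


class URLServer:
    """Hands out batches of URLs waiting to be crawled."""

    def __init__(self, urls):
        self.pending = deque(urls)

    def next_batch(self, size):
        batch = []
        while self.pending and len(batch) < size:
            batch.append(self.pending.popleft())
        return batch


class Crawler:
    """One crawler with its own DNS cache, so each host is resolved once."""

    def __init__(self, resolve):
        self.resolve = resolve  # function: hostname -> IP address
        self.dns_cache = {}
        self.lookups = 0  # how many times the resolver was really called

    def address_for(self, url):
        host = urlsplit(url).hostname
        if host not in self.dns_cache:
            self.lookups += 1
            self.dns_cache[host] = self.resolve(host)
        return self.dns_cache[host]


server = URLServer([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/",
])
crawler = Crawler(FAKE_DNS.__getitem__)
addresses = [crawler.address_for(url) for url in server.next_batch(10)]
print(addresses, crawler.lookups)
```

Three URLs are handed out, but only two name lookups are made, because the second example.com URL is answered from the crawler's own cache.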

Google Search Engine Web Content Indexing

Parsing (syntactic analysis of the structure of text to determine relationships between words and to infer meaning) - fetched pages are passed to a storeserver, which compresses and stores them. The indexer then parses each stored document, assigns every word a wordID using a regularly updated lexicon, and indexes the words into databases ('barrels'). Each word encountered in a document is converted to a set of word occurrences or hits (limited to a maximum total). The hits record the word position in the document, the font size and the capitalisation. The hits are categorised into:

  • Fancy hits – those that occur in the URL, title, anchor text and / or meta tag. The information recorded for them is capitalisation, font size (set to 7) and position. Anchor hits have positional and docID information recorded for them.

  • Plain hits – everything else. These have capitalisation, relative font size and positional information recorded.

The hits are translated into a 'hit list' and distributed into forward barrels, creating a partially sorted forward index. All links in every web page are parsed, and the important information about where each link points from and to is stored in the anchor file.
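A rough sketch of how hits and a forward index might be represented is given below. The field names, the FANCY_SOURCES set and the forward_index function are assumptions for illustration only; the real hit encoding is far more compact and is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed labels for the places in which a 'fancy' hit can occur.
FANCY_SOURCES = {"url", "title", "anchor", "meta"}


@dataclass(frozen=True)
class Hit:
    position: int      # word position in the document
    capitalised: bool  # whether the word started with a capital
    font_size: int     # relative font size (7 marks a fancy hit)
    fancy: bool


def make_hit(word, position, source, font_size):
    fancy = source in FANCY_SOURCES
    return Hit(position, word[:1].isupper(), 7 if fancy else font_size, fancy)


def forward_index(doc_id, tokens, lexicon):
    """Build {doc_id: {word_id: [Hit, ...]}} from (word, source, font_size) tokens."""
    index = defaultdict(list)
    for position, (word, source, size) in enumerate(tokens):
        # New words receive the next free wordID in the shared lexicon.
        word_id = lexicon.setdefault(word.lower(), len(lexicon))
        index[word_id].append(make_hit(word, position, source, size))
    return {doc_id: dict(index)}


lexicon = {}
tokens = [("Google", "title", 3), ("search", "body", 2), ("google", "body", 2)]
fwd = forward_index(1, tokens, lexicon)
print(lexicon)
print(fwd[1][0])
```

The word 'google' ends up with two hits under one wordID: a fancy hit from the title and a plain hit from the body text.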

It is important therefore that you ensure that the URL and Title have the targeted keywords in them and that they are capitalised. This goes for any anchor text that you create to point to one of your documents (in RSS feeds for instance). Additionally, emboldening, capitalising and increasing the font size of keywords and anchor text (where possible) will assist in raising your document up the rankings.

URL Resolver - this reads the anchor file and converts relative URLs to absolute URLs, and these in turn to docIDs. The anchor text is passed to the forward index, associated with the document the link points to, and the Google link database is populated with pairs of docIDs, which are used to compute the PageRank for all documents.
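The resolving step can be illustrated with Python's standard urljoin function. This is only a sketch: the resolve_links function, the shape of the anchors list and the way docIDs are handed out are assumptions, not a description of Google's own resolver.

```python
from urllib.parse import urljoin


def resolve_links(anchors, doc_ids):
    """anchors: list of (source_url, href, anchor_text) tuples.

    Returns (links, anchor_text) where links is a list of
    (source docID, target docID) pairs and anchor_text maps each
    target docID to the texts of the links pointing at it.
    """
    links = []
    anchor_text = {}
    for source_url, href, text in anchors:
        target_url = urljoin(source_url, href)  # relative -> absolute URL
        src = doc_ids.setdefault(source_url, len(doc_ids))
        dst = doc_ids.setdefault(target_url, len(doc_ids))
        links.append((src, dst))
        anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text


doc_ids = {}
anchors = [
    ("http://example.com/a/index.html", "../b.html", "page b"),
    ("http://example.com/b.html", "/a/index.html", "page a"),
]
links, anchor_text = resolve_links(anchors, doc_ids)
print(links)
print(anchor_text)
```

The list of docID pairs is the kind of input a PageRank calculation needs, while the anchor text is filed under the page being pointed to.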

Sorter – Each 'forward barrel' (index) is sorted by wordID to produce an inverted index for title, anchor hits and full document text (which is cached). An interim DumpLexicon is used to update the Lexicon.
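As a simplified illustration of the sorting step, the sketch below turns a small forward index (document to words) into an inverted index (word to documents). The dictionary layout and the invert function are assumptions; real barrels are on-disk structures rather than Python dictionaries.

```python
def invert(forward):
    """Turn {doc_id: {word_id: [positions]}} into {word_id: [(doc_id, positions)]}."""
    postings = []
    for doc_id, words in forward.items():
        for word_id, positions in words.items():
            postings.append((word_id, doc_id, positions))
    # Sorting by wordID (then docID) is what groups each word's documents together.
    postings.sort(key=lambda p: (p[0], p[1]))
    inverted = {}
    for word_id, doc_id, positions in postings:
        inverted.setdefault(word_id, []).append((doc_id, positions))
    return inverted


forward = {2: {0: [0, 5], 1: [3]}, 1: {1: [2]}}
inverted = invert(forward)
print(inverted)
```

After sorting, every document containing a given wordID can be read off in one pass, which is what makes answering a query fast.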

It is the combination of lexicon, inverted index and PageRank that is used to answer a Google query.

Google Search Engine Web Content Searching

The Google Search Engine is focused on providing quality search results, efficiently. The order of events that take place for a Google search results list to be presented back to the searcher is as follows:

  1. Parse the query.

  2. Convert the search words into wordIDs.

  3. Search by every word for title and anchor text links.

  4. Scan docLists that contain all search terms or, if not enough results, scan a subset of search terms.

  5. Compute the rank of each document retrieved for the query (Google PageRank itself is worked out in advance and is one input to this rank).

  6. Repeat steps 4 and 5 until sufficient results are obtained or no more docLists are available.

  7. Sort the documents in rank order and show only the top 1000 results.
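The sequence of steps above can be sketched roughly as follows. This is a toy version under stated assumptions: the search function, the in-memory inverted index and the hit_count scoring function are all hypothetical, and the fallback to a subset of the search terms is left out for brevity.

```python
def search(query, lexicon, inverted, rank, k=1000):
    """Return up to k docIDs that contain every query word, best first."""
    word_ids = []
    for word in query.lower().split():   # steps 1 and 2: parse, map to wordIDs
        if word not in lexicon:
            return []                    # unknown word: nothing can match
        word_ids.append(lexicon[word])
    if not word_ids:
        return []
    # Step 4: keep only the documents that appear in every word's docList.
    doc_sets = [set(inverted.get(w, {})) for w in word_ids]
    matches = set.intersection(*doc_sets)
    # Steps 5 to 7: score each match, sort by score, keep the top k.
    return sorted(matches, key=lambda d: (-rank(d, word_ids), d))[:k]


lexicon = {"google": 0, "search": 1}
# Assumed layout: {word_id: {doc_id: [positions]}}
inverted = {0: {1: [0], 2: [0, 4], 3: [7]}, 1: {2: [1], 3: [8, 9, 10]}}


def hit_count(doc_id, word_ids):
    # Stand-in scoring function: simply counts the matching hits.
    return sum(len(inverted[w][doc_id]) for w in word_ids)


print(search("Google search", lexicon, inverted, hit_count))
print(search("google", lexicon, inverted, hit_count))
```

Counting hits is only a placeholder for the scoring step; any function that takes a docID and the query's wordIDs could be passed in its place.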

Google Search Engine System Features

The Google Search Engine makes use of a citation (pointer from one web content site to another) link graph, a map of web hyperlinks, to calculate rapidly an approximation of page importance and quality. This Google quality ranking (PageRank) allows keyword search results to be prioritised. Links (backlinks) are not all counted as equal: they are normalised, and a link from an academic source or from another highly ranked web authority is given greater importance when applied to a given web content page.

Google PageRank – is a model of user behaviour and was defined as the probability that a random surfer would visit a certain page (the more links that point to the page, the higher the probability that the surfer will find it). A damping factor models the chance that, at any step, the random surfer becomes bored with following links and jumps to a random page instead; it can be applied to all pages, to a single page or to a group of pages. High PageRank is therefore obtained if a large number of pages link to the page or if those that point to it have a high PageRank themselves.
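The random surfer idea can be tried out on a tiny link graph. The sketch below uses the common probability form of the calculation, in which all ranks add up to 1 and d = 0.85 is the chance of following a link rather than jumping to a random page. The pagerank function, the number of iterations and the treatment of pages with no outgoing links are assumptions for illustration, not Google's production method.

```python
def pagerank(links, d=0.85, iterations=100):
    """links: {page: [pages it links to]}. Returns {page: rank}, ranks sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share from surfers who jump at random.
        new = {p: (1.0 - d) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = d * ranks[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Assumed handling: a page with no links spreads its rank evenly.
                for t in pages:
                    new[t] += d * ranks[page] / n
        ranks = new
    return ranks


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C comes out on top because both A and B link to it, and A follows closely because it receives all of C's rank through a single link.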

Anchor Text – is the text used to describe the web content on another page to which there is a link. The Google Search Engine associates the link text to both the sender and receiver web pages. Often anchors provide better descriptions of the web content to which they are pointing than the web content pages themselves. An additional advantage is that anchor text can exist for non text-based documents such as images, programs, databases, videos etc and return results even where the content pointed to has not been crawled: thus giving even better quality search engine results for Google.

Location – the Google Search Engine records location and proximity information for all hits. This means that exact matches to search queries can be located as well as close approximations to the exact keyword search phrase. These are given a weighted importance so that those words that match the search query that are closest together in the text have a higher probability of causing a positive hit for the search phrase.
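One simple way to picture proximity weighting is to find the smallest gap between the positions of two query words in a document, and to give smaller gaps a larger weight. The min_gap and proximity_weight functions below are hypothetical, and the 1/gap weighting is an assumed choice made purely for illustration.

```python
def min_gap(positions_a, positions_b):
    """Smallest distance between a position in a and one in b (both sorted)."""
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance whichever list is currently behind.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_weight(gap):
    """Closer words score higher; no co-occurrence scores zero."""
    if gap is None:
        return 0.0
    return 1.0 / max(gap, 1)


adjacent = min_gap([2, 40], [3, 90])   # the two words appear side by side
distant = min_gap([2, 40], [60, 90])   # the two words are far apart
print(adjacent, proximity_weight(adjacent))
print(distant, proximity_weight(distant))
```

A document in which the query words sit next to each other therefore receives a much larger proximity weight than one in which they are twenty words apart.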

Characteristics of words – the hits recorded against each wordID also carry certain characteristics of the text. These include the font size and emboldening of the text, where a larger and / or bolder font receives a higher weighting in comparison to the remainder of the web content text of the page. This means that (HubPage) headings will rank higher than general content for a particular search phrase and that emboldened text will rank higher than normal text but lower (usually) than title text.

External Meta Information – the Google Search Engine also takes into account information that can be inferred about a document but that is not contained within it, such as:

  • The reputation of the source – on HubPages, for instance, the reputation of both the HubPages site as well as the author.

  • The update frequency of the content of the document – if you use RSS feeds in your document, for instance, and create another hub in a series or a new hub for an author, then the page will be updated and increase the update frequency as far as the Google Search Engine is concerned.

  • The quality of the content – this could be determined by whether the content is bookmarked (through bookmark sites or by using the bookmark tab) and how long a surfer stays on the document: both of which can be recorded in the Google databases.

  • The popularity of the document – this is judged by the number of reads and the duration of those reads.
  • The usage of the document.
  • The citations, as above.

The Google Search Engine maintains much more information about web documents than typical search engines did. The type-weights and count-weights are incorporated into an IR score which together with Google PageRank and proximity scores allow the Google Search Engine to determine the order in which the most relevant query results will be presented back to the query results screen.
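How type-weights, count-weights, PageRank and proximity might be brought together can be sketched as below. Every number here is invented: the TYPE_WEIGHTS table, the logarithmic count_weight, the rank_weight multiplier and the simple addition in final_score are assumptions, since the real weights and the way they are combined are not public.

```python
import math

# Hypothetical weights: hits in more prominent places count for more.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "url": 3.0, "plain": 1.0}


def count_weight(count):
    """Grows with the number of hits but tapers off (assumed: natural log)."""
    return math.log1p(count)


def ir_score(hit_counts):
    """hit_counts: {hit type: number of hits of that type in the document}."""
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in hit_counts.items())


def final_score(hit_counts, pagerank, proximity=0.0, rank_weight=10.0):
    """Assumed combination: a weighted sum of IR score, proximity and PageRank."""
    return ir_score(hit_counts) + proximity + rank_weight * pagerank


title_only = ir_score({"title": 1})
plain_only = ir_score({"plain": 1})
print(round(title_only, 3), round(plain_only, 3))
print(final_score({"plain": 2}, pagerank=0.3) > final_score({"plain": 2}, pagerank=0.1))
```

Because the count-weight tapers off, repeating a keyword a hundred times adds very little over repeating it a handful of times.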

Google :: Search Engine Google Conclusion

This is a brief, simplified overview of the Google Search Engine and how it works. No-one knows all of the tweaks that occur or the nuances of its workings as these are Google trade secrets. If you were told these you would have to be shot!

See also:

How To Google - homepage of "Google How To".

How To Google in English - for the English version index to "Google How To" subjects.
