
Search Engine Software in PHP and mySQL

Updated on November 2, 2009

"I'm going to make a search engine," you might say. Then I'll ask: what kind of search engine?

There are a few different types of search engines around, each made for a different purpose. I'll list the most common ones here, and if you know of more just leave a comment below.

  • General Internet Search Engine (like Google and Yahoo)
  • Site Search Engine (the search box you see on most web sites)
  • Database Search Engine (product search, cars, real-estate listings, find a hotel etc.)
  • Niche Search Engine (a subset of a general search engine, like a search across automotive sites)

This tutorial focuses on techniques for making a site search engine or a niche search engine. The methods used are also suitable for a general search engine, but they have to be adapted a bit to be usable on a multi-server setup with multiple terabytes of data.


Basic Building Blocks

The basic building blocks of a search engine are not complex at all. Look at the list below, and if you see something missing you can tell me by leaving a comment.

Doing The Search

When a user types a search phrase into the search box and hits the search button, the search engine takes the words searched for and looks them up in a database called the index.

This gives the search engine lists of websites, sorted by relevancy, one for every keyword or key phrase. By combining these lists it gets the search results to present to the user.
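Here's a rough sketch of that combining step. It assumes each keyword's list has already been read from the index as an array of page id to word count. The function name combineResults, the sample data and the choice to rank by total word count are all made up for illustration (it needs PHP 7 or later for the ?? operator).

```php
<?php
// Hypothetical helper: merge per-keyword result lists into one ranked list.
// $lists maps keyword => array( pageId => count), as read from the index.
function combineResults( array $lists){
    $scores = array();
    $hits   = array();
    foreach( $lists as $keyword => $pages){
        foreach( $pages as $pageId => $count){
            $scores[$pageId] = ($scores[$pageId] ?? 0) + $count;  // total word count
            $hits[$pageId]   = ($hits[$pageId] ?? 0) + 1;         // keywords found on page
        }
    }
    // Keep only the pages that contain every keyword searched for (AND search)
    $needed = count( $lists);
    foreach( $hits as $pageId => $n){
        if( $n < $needed){
            unset( $scores[$pageId]);
        }
    }
    arsort( $scores);  // highest total count first, page ids kept as keys
    return $scores;
}

// Made-up index data for the search phrase "search engine"
$lists = array(
    'search' => array( 1 => 5, 2 => 3, 3 => 1),
    'engine' => array( 1 => 2, 3 => 7),
);
print_r( combineResults( $lists));
```

Adding up the counts is the simplest possible relevancy score. A real engine would probably weigh in other things too, but the combining logic stays the same.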

Making the Index

Ok, searching was easy, but where do we get the index from? That's a little bit more tricky. We have to build and update the index from data gathered from the web. This is one of the major tasks of the search engine: keeping the index up to date.

Let's say that we have a database of all the webpages that need to be included in the index. To build the index we just read one page at a time, parse out all the words in the page, count the words and then store this list of word to count entries in a database together with a pointer to the actual web page they appear in.

The database table structure would look like this:

KEYWORD | COUNT | WEBSITE (see detailed database schema here)

When all pages have been parsed and the word to count entries have been stored in the same database table then you have an index!
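As a minimal sketch of the parse-and-count step, assuming the page is already in a string: the function name countWords and the example.com URL are just placeholders I picked, and the tag stripping is deliberately crude (it doesn't skip script or style blocks).

```php
<?php
// Hypothetical helper: turn one page into a list of word => count entries.
function countWords( $html){
    $text  = preg_replace( '/<[^>]*>/', ' ', $html);  // replace every tag with a space
    $words = str_word_count( strtolower( $text), 1);  // 1 = return the words as an array
    return array_count_values( $words);               // word => number of occurrences
}

$page = '<html><body><h1>Search engine</h1><p>A search engine has an index.</p></body></html>';

foreach( countWords( $page) as $keyword => $count){
    // Each of these rows would be stored as KEYWORD | COUNT | WEBSITE
    echo $keyword . ' | ' . $count . " | http://example.com/\n";
}
```

Tags are replaced with a space rather than just removed, otherwise words on each side of a tag would be glued together.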

Fetching the Web Pages

Making an index was easy when we already had all the webpages stored on disk. To get them onto the disk we have to download them from the website (URL) where they are located.

What you need to do is send a request over HTTP to the web server where the web page is located and ask for a copy of the page (GET). You'll use a URL to identify the webpage.

This can be easily done with the curl library in PHP. Basically you just tell curl that you want to fetch a specific URL in curl_init( URL), then you run the command curl_exec() and back you'll get the complete web page, or FALSE if the request failed.

This is easy! We have downloaded a web page.

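// Using curl to fetch a web page
$curls = curl_init( $url);
// Return the page as a string instead of printing it
curl_setopt( $curls, CURLOPT_RETURNTRANSFER, TRUE);

$pagecontents = curl_exec( $curls);
if( $pagecontents === FALSE){
    // The request failed, curl_error() tells you why
    echo 'Fetch failed: ' . curl_error( $curls);
}

curl_close( $curls);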

What do we Fetch

Fetching the pages to index was easy but how do we know what pages to fetch? This depends a lot on what type of search engine you are building.

At Search Hippo they use a list of all sites in authority directories like dmoz. You can compile a specific list of URLs using a general search engine or some other tool and use that as data for your niche search engine.

If you make a site search engine you might be able to list the files on the server disks. This way you get everything regardless of whether it's linked to or not. You may or may not want this.

When you build a general search engine there is no list you can use. There's no list of all pages of the Internet that you can buy. You have to follow links from one page to another trying your best to find all the pages of the internet, or at least the important pages.
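If you go the file listing route, something like the sketch below could work. It assumes the pages are plain .html or .htm files below one document root. The function name listSiteFiles is my own invention for this example.

```php
<?php
// Hypothetical helper: list every HTML file below a document root.
function listSiteFiles( $rootDir, array $extensions = array( 'html', 'htm')){
    $files = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator( $rootDir, FilesystemIterator::SKIP_DOTS)
    );
    foreach( $iterator as $file){
        // Keep regular files whose extension is in the allowed list
        if( $file->isFile() && in_array( strtolower( $file->getExtension()), $extensions)){
            $files[] = $file->getPathname();
        }
    }
    sort( $files);
    return $files;
}
```

You would call it with the document root of your site, for example listSiteFiles( '/var/www/html') if that happens to be where your files live. Note that the iterator throws an exception if the directory doesn't exist or can't be read.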

At SecretSearchEngineLabs.com, which I'm developing myself to prove my theories, I use regular expressions to parse all links from a webpage. If you look in the code listing below you'll find the actual code to do this.


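// Finds all <a href="..."> links in $inString and returns the number found.
// Each entry in $links holds the full tag [0], the href value [1] and the link text [2].
function parseForLinks( $inString, &$links){
    return preg_match_all( '/<a[^>]*href\s*?=\s*?[\"\'](.*?)[\"\'][^>]*>(.*?)<\/a>/is', $inString, $links, PREG_SET_ORDER);
}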
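To show how parsed links can drive the "follow links from one page to another" idea, here's a small sketch of a crawl queue. It is not the code running on my site. The function name crawl, the $fetch callback and the fake in-memory web are all assumptions made so the example runs without touching the network, and it only follows absolute http(s) links.

```php
<?php
// Hypothetical helper: breadth-first crawl starting at $startUrl.
// $fetch is any callable that takes a URL and returns the page HTML or FALSE.
function crawl( $startUrl, callable $fetch, $maxPages = 100){
    $queue   = array( $startUrl);
    $visited = array();
    while( $queue && count( $visited) < $maxPages){
        $url = array_shift( $queue);
        if( isset( $visited[$url])){
            continue;  // already fetched this page
        }
        $visited[$url] = true;
        $html = $fetch( $url);
        if( $html === false){
            continue;  // fetch failed, nothing to parse
        }
        // Same regular expression as parseForLinks() above
        preg_match_all( '/<a[^>]*href\s*?=\s*?[\"\'](.*?)[\"\'][^>]*>(.*?)<\/a>/is', $html, $links, PREG_SET_ORDER);
        foreach( $links as $link){
            $target = $link[1];
            // Relative links are skipped in this sketch
            if( preg_match( '/^https?:\/\//i', $target) && !isset( $visited[$target])){
                $queue[] = $target;
            }
        }
    }
    return array_keys( $visited);
}

// A made-up three page "web" so the example runs offline
$fakeWeb = array(
    'http://example.com/'  => '<a href="http://example.com/a">A</a> <a href="/relative">R</a>',
    'http://example.com/a' => '<a href="http://example.com/">Home</a> <a href=\'http://example.com/b\'>B</a>',
    'http://example.com/b' => 'no links here',
);
$fetch = function( $url) use ( $fakeWeb){
    return isset( $fakeWeb[$url]) ? $fakeWeb[$url] : false;
};
print_r( crawl( 'http://example.com/', $fetch));
```

In a real crawler the $fetch callback would wrap the curl code from earlier, and you would also want to resolve relative links and be polite to the servers you visit.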

Display the Results

There's one important component missing: actually displaying the search results to the user. To do this you need to have good titles and descriptions for the search results, which in turn means you need to have this information stored somewhere so you can just read it when needed.

For this you need a database table with information on web pages, with at least the following entries.

PAGEID | TITLE | URL | DESCRIPTION (see detailed database schema here)

You display the title, description and URL in the search results, and you need to create a link containing the URL so the user can click away to the website in question.
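A minimal sketch of rendering one result, assuming the row has already been read from the table above into an array keyed by column name. The function name renderResult, the HTML layout and the sample row are made up for this example.

```php
<?php
// Hypothetical helper: render one search result row as HTML.
function renderResult( array $row){
    // Escape everything that came from a crawled page before printing it
    $title = htmlspecialchars( $row['TITLE'], ENT_QUOTES, 'UTF-8');
    $url   = htmlspecialchars( $row['URL'], ENT_QUOTES, 'UTF-8');
    $desc  = htmlspecialchars( $row['DESCRIPTION'], ENT_QUOTES, 'UTF-8');
    return '<div class="result">'
         . '<a href="' . $url . '">' . $title . '</a><br>'
         . $desc . '<br>'
         . '<small>' . $url . '</small>'
         . "</div>\n";
}

// Made-up row for illustration
$row = array(
    'PAGEID'      => 1,
    'TITLE'       => 'Cars & <Trucks>',
    'URL'         => 'http://example.com/?a=1&b=2',
    'DESCRIPTION' => 'All about cars.',
);
echo renderResult( $row);
```

The escaping matters because titles and descriptions come from other people's pages, and you don't want their markup ending up in your results page.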

So where do you get this information from?

I'd say the title is almost without exception best taken from the title tag of the page; usually this tells you what the page is all about. The alternative is to use the description from dmoz or the Yahoo directory, or to parse out a suitable description from inside the page.

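// Parse the title tag from a html page (surrounding whitespace is trimmed)
preg_match( '/<title[^>]*>\s*(.*?)\s*<\/title>/is', $page_contents, $matches);
$title = isset( $matches[1]) ? $matches[1] : '';

// Parse the description from a html page (assumes name comes before content)
preg_match( '/<meta\s+name\s*=\s*"description"\s+content\s*=\s*"(.*?)"\s*\/?>/is', $page_contents, $matches);
$description = isset( $matches[1]) ? $matches[1] : '';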

Now You Have a Delicious Search Engine

Hey, you just made yourself a delicious search engine, nice!

The above tutorial is enough to get you started. You should now be able to make a simple search engine for a site or a directory.

Please take a look at the search engine software I'm developing for my next generation search engine Secret Search Engine Labs. Here you can see a PHP and mySQL powered general search engine in action.

If you have questions please post a comment below and I'll try my best to answer all of you. And if people are interested (please comment!) I will add more info and code to the tutorial.
