Search Engine Software in PHP and MySQL

Updated on November 2, 2009

sbyholm

"I'm going to make a search engine," you might say. Then I'll ask: what kind of search engine?

There are a few different types of search engines around, each made for a different purpose. I'll list the most common ones here, and if you know of more just leave a comment below.

General Internet Search Engine (like Google and Yahoo)
Site Search Engine (the search box you see on most web sites)
Database Search Engine (product search, cars, real-estate listings, find a hotel etc.)
Niche Search Engine (a subset of a general search engine, like an automotive sites search)

This tutorial focuses on techniques for making a site search engine or a niche search engine. The methods are also suitable for a general search engine, but they have to be adapted a bit to be usable on a multi-server setup with multiple terabytes of data.

Basic Building Blocks

The basic building blocks of a search engine are not complex at all. Look at the list below, and if you see something missing, tell me by leaving a comment.

Doing The Search

When a user types a search phrase in the search box and hits the search button, the search engine takes the words searched for and looks them up in a database called the index.

This gives the search engine lists of websites, sorted by relevancy, one for every keyword or key phrase. By combining these lists it gets the search results to present to the user.
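
As a rough sketch of the lookup-and-combine step, assume the index has already been loaded into a PHP array that maps each keyword to a list of pages and word counts. The function name searchIndex(), the array layout and the "add the counts together" ranking are all assumptions made for this example, not how any particular search engine ranks its results:

```php
<?php
// Hypothetical in-memory index: keyword => [page => word count].
// In a real engine these lists would be read from the index database.
function searchIndex(array $index, string $query): array
{
    // Split the search phrase into lowercase words.
    $words = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return [];
    }

    $scores = null;
    foreach ($words as $word) {
        $pages = $index[$word] ?? [];
        if ($scores === null) {
            // First word: start with its list of pages.
            $scores = $pages;
            continue;
        }
        // Later words: keep only pages that contain every word so far
        // and add the counts together as a crude relevancy score.
        $combined = [];
        foreach ($scores as $page => $score) {
            if (isset($pages[$page])) {
                $combined[$page] = $score + $pages[$page];
            }
        }
        $scores = $combined;
    }

    // Highest score first.
    arsort($scores);
    return $scores;
}

$index = [
    'php'    => ['example.com/a' => 3, 'example.com/b' => 1],
    'search' => ['example.com/a' => 2, 'example.com/c' => 5],
];
$results = searchIndex($index, 'PHP search');
```

Here only example.com/a contains both words, so it is the only result. Whether you require all words or accept pages matching any word is a design choice; requiring all words keeps the combining step simple.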

Making the Index

OK, searching was easy, but where do we get the index from? That's a little bit more tricky. We have to build and update the index from data gathered from the web. Keeping the index up to date is one of the major tasks of the search engine.

Let's say that we have a database of all the webpages that need to be included in the index. To build the index we read one page at a time, parse out all the words in the page, count them, and then store this list of word-to-count entries in a database together with a pointer to the web page they appear in.

The database table structure would look like this:

KEYWORD | COUNT | WEBSITE (see detailed database schema here)

When all pages have been parsed and the word-to-count entries have been stored in the same database table, you have an index!
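
The parse-and-count step could be sketched like this. It is a minimal example that only handles plain ASCII words; the function name buildIndexRows() and the array keys are made up for the illustration, and writing the rows to the database is left out:

```php
<?php
// Turn one page into KEYWORD | COUNT | WEBSITE rows.
// $website is whatever pointer you store for the page (a URL here).
function buildIndexRows(string $website, string $html): array
{
    // Drop the HTML tags and lowercase the remaining text.
    $text = strtolower(strip_tags($html));

    // Split on anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Count how many times each word appears.
    $counts = array_count_values($words);

    $rows = [];
    foreach ($counts as $keyword => $count) {
        // Numeric words come back as integer keys, so cast to string.
        $rows[] = ['keyword' => (string) $keyword, 'count' => $count, 'website' => $website];
    }
    return $rows;
}

$rows = buildIndexRows('http://example.com/', '<p>Search engines search the web</p>');
```

Note that strip_tags() keeps the text inside script and style tags, so a real indexer would want to remove those first.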

Fetching the Web Pages

Making an index was easy when we already had all the webpages stored on disk. To get them onto the disk we have to download them from the website (URL) where they are located.

What you need to do is send an HTTP GET request to the web server where the web page is located, asking for a copy of the page. The URL identifies which page you want.

This can easily be done with the curl library in PHP. Basically you tell curl which URL you want to fetch with curl_init( URL), then you run curl_exec() and you get back the complete web page, or FALSE if the request failed.

This is easy! We have downloaded a web page.

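// Using curl to fetch a web page
$curls = curl_init( $url);
// Return the page as a string instead of printing it
curl_setopt( $curls, CURLOPT_RETURNTRANSFER, TRUE);

// $page_contents holds the HTML, or FALSE if the request failed
$page_contents = curl_exec( $curls);

curl_close( $curls);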

What Do We Fetch?

Fetching the pages to index was easy but how do we know what pages to fetch? This depends a lot on what type of search engine you are building.

At Search Hippo they use a list of all sites in authority directories like dmoz. You can compile a specific list of URLs using a general search engine or some other tool and use that as data for your niche search engine.

If you make a site search engine you might be able to list the files on the server disks. This way you get everything, regardless of whether it's linked to or not. You may or may not want this.
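
Listing the files on disk could look something like the sketch below. The function name listSiteFiles() and the choice of file extensions are assumptions for this example; you still have to map each file path back to the URL it is served under.

```php
<?php
// List every HTML file below a document root, linked to or not.
// The list of extensions is an assumption; adjust it for your site.
function listSiteFiles(string $docRoot, array $extensions = ['html', 'htm']): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($docRoot, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        // $file is an SplFileInfo object for each entry found.
        if ($file->isFile() && in_array(strtolower($file->getExtension()), $extensions, true)) {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}
```

You would call it with your web root, for example listSiteFiles( $_SERVER['DOCUMENT_ROOT']), and then feed each file to the indexer.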

When you build a general search engine there is no list you can use. There's no list of all pages of the Internet that you can buy. You have to follow links from one page to another, trying your best to find all the pages of the Internet, or at least the important ones.

At SecretSearchEngineLabs.com, which I'm developing myself to prove my theories, I use regular expressions to parse all links from a webpage. The code listing below shows the actual code that does this.

function parseForLinks( $inString, &$links){
    return preg_match_all( '/<a[^>]*href\s*?=\s*?[\"\'](.*?)[\"\'][^>]*>(.*?)<\/a>/is', $inString, $links, PREG_SET_ORDER);
}
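
Because of PREG_SET_ORDER, each entry in $links holds the href in index 1 and the link text in index 2. Following links from page to page can then be sketched as a simple queue. This is only an illustration: crawl() and the $getLinks callback are hypothetical names, and a fake three-page site stands in for real downloads so the example runs without a network.

```php
<?php
// Breadth-first crawl: start from seed URLs and keep following links.
// $getLinks is a hypothetical callback that fetches a page and returns
// the URLs it links to, for example curl plus parseForLinks().
function crawl(array $seedUrls, callable $getLinks, int $maxPages = 100): array
{
    $queue = array_values(array_unique($seedUrls)); // pages waiting to be fetched
    $seen = array_fill_keys($queue, true);          // every URL queued so far
    $visited = [];                                  // pages fetched, in order

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $visited[] = $url;

        foreach ($getLinks($url) as $link) {
            // Only queue URLs we have not come across before.
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// A made-up three-page site standing in for real fetches.
$fakeWeb = [
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/', 'http://example.com/b'],
    'http://example.com/b' => [],
];
$pages = crawl(['http://example.com/'], function (string $url) use ($fakeWeb): array {
    return $fakeWeb[$url] ?? [];
});
```

A real crawler also has to turn relative links into absolute URLs before queuing them, which this sketch skips.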

Display the Results

There's one important component missing: actually displaying the search results to the user. To do this you need good titles and descriptions for the search results, which in turn means you need to have this information stored somewhere so you can just read it when needed.

For this you need a database table with information on web pages, with at least the following columns:

PAGEID | TITLE | URL | DESCRIPTION (see detailed database schema here)

You display the title, description and url in the search results and you need to create a link containing the URL in the results so the user can click away to the website in question.
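
A minimal sketch of turning one row from that table into HTML might look like this. The function name renderResult(), the array keys and the markup are assumptions for the example:

```php
<?php
// Build the HTML for one search result from a TITLE | URL | DESCRIPTION row.
// The array keys and the markup are assumptions for this example.
function renderResult(array $page): string
{
    // Escape everything: titles and descriptions come from other people's pages.
    $title = htmlspecialchars($page['title'], ENT_QUOTES, 'UTF-8');
    $url = htmlspecialchars($page['url'], ENT_QUOTES, 'UTF-8');
    $description = htmlspecialchars($page['description'], ENT_QUOTES, 'UTF-8');

    return "<div class=\"result\">\n"
        . "  <a href=\"$url\">$title</a>\n"
        . "  <p>$description</p>\n"
        . "  <cite>$url</cite>\n"
        . "</div>\n";
}

$html = renderResult([
    'title' => 'PHP & MySQL <tips>',
    'url' => 'http://example.com/?a=1&b=2',
    'description' => 'A page about search engines.',
]);
```

Escaping matters here because the text was written by someone else. You may also want to check that the URL starts with http:// or https:// before you link to it.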

So where do you get this information from?

I'd say the title is almost without exception best taken from the title tag of the page, since this usually tells you what the page is all about. For the description you can use the meta description tag of the page, the description from dmoz or the Yahoo directory, or parse out a suitable description from inside the page.

// Parse the title tag from a html page; the trimmed title ends up in $title_matches[1]
preg_match( '/<title>\s*(.*?)\s*<\/title>/is', $page_contents, $title_matches);

// Parse the description from a html page; the text ends up in $description_matches[1]
preg_match( '/<meta\s*?name\s*?=\s*?"description"\s*?content\s*?=\s*?"(.*?)"\s*?\/?>/is', $page_contents, $description_matches);

Now You Have a Delicious Search Engine

Hey, you just made yourself a delicious search engine, nice!

The above tutorial is enough to get you started. You should now be able to make a simple search engine for a site or a directory.

Please take a look at the search engine software I'm developing for my next generation search engine Secret Search Engine Labs. Here you can see a PHP and MySQL powered general search engine in action.

If you have questions please post a comment below and I'll try my best to answer all of you. And if people are interested (please comment!) I will add more info and code to the tutorial.
