Search Engine Software in PHP and MySQL

"I'm going to make a search engine," you might say. Then I'll ask: what kind of search engine?

There are a few different types of search engines around, each made for a different purpose. I'll list the most common ones here, and if you know of more just leave a comment below.

  • General Internet Search Engine (like Google and Yahoo)
  • Site Search Engine (the search box you see on most web sites)
  • Database Search Engine (product search, cars, real-estate listings, find a hotel etc.)
  • Niche Search Engine (a subset of a general search engine, like an automotive sites search)

This tutorial focuses on techniques for making a site search engine or a niche search engine. The methods used are also suitable for a general search engine, but they have to be adapted a bit to be usable on a multi-server setup with multiple terabytes of data.


Basic Building Blocks

The basic building blocks of a search engine are not complex at all. Look at this list, and if you see something missing you can tell me by leaving a comment below.

Doing The Search

When a user types a search phrase into the search box and hits the search button, the search engine takes the words searched for and looks them up in a database called the index.

This gives the search engine lists of websites, sorted by relevancy, one for every keyword or key phrase. By combining these lists it gets the search results to present to the user.
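
Here is a minimal sketch of the combining step. It assumes the index lookup has already returned, for every keyword, an array mapping a page id to a word count. The function name combineKeywordLists() and the scoring rule (add up the counts and keep only pages that contain every keyword) are my assumptions for illustration, not the only way to do it.

```php
<?php
// Hypothetical helper: combine per-keyword result lists into one ranked list.
// $lists maps keyword => array( pageId => count ), as looked up in the index.
function combineKeywordLists( array $lists) {
    $scores = array();   // pageId => summed count over all keywords
    $hits = array();     // pageId => number of keywords the page matched
    foreach( $lists as $keyword => $pages) {
        foreach( $pages as $pageId => $count) {
            $scores[$pageId] = ( isset( $scores[$pageId]) ? $scores[$pageId] : 0) + $count;
            $hits[$pageId] = ( isset( $hits[$pageId]) ? $hits[$pageId] : 0) + 1;
        }
    }
    // Keep only pages that matched every keyword (assumed AND semantics)
    $needed = count( $lists);
    foreach( $hits as $pageId => $n) {
        if( $n < $needed) {
            unset( $scores[$pageId]);
        }
    }
    arsort( $scores);   // highest combined count first, page ids kept as keys
    return $scores;
}

// Example data: page 2 only contains "php", so it drops out of the results
$results = combineKeywordLists( array(
    'php'    => array( 1 => 5, 2 => 1, 3 => 2),
    'search' => array( 1 => 2, 3 => 7),
));
```

A real engine would probably weigh more than raw word counts, but the idea of merging one list per keyword stays the same.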

Making the Index

Ok, searching was easy but where do we get the index from? That's a little bit more tricky. We have to build and update the index from data gathered from the web. This is one of the major tasks of the search engine, keeping the index up to date.

Let's say that we have a database of all web pages that need to be included in the index. To build the index we just read one page at a time, parse out all the words in the page, count the words, and then store this list of word-to-count entries in a database together with a pointer to the actual web page they appear in.

The database table structure would look like this:

KEYWORD | COUNT | WEBSITE (see detailed database schema here)

When all pages have been parsed and the word to count entries have been stored in the same database table then you have an index!
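
The word counting can be sketched roughly like this. The function names countWords() and indexRowsForPage() are made up for this example, the word splitting only handles plain ASCII letters and digits, and writing the rows to the database is left out.

```php
<?php
// Hypothetical helper: turn one page's HTML into keyword => count entries.
function countWords( $html) {
    // Put a space in front of every tag so words in neighbouring tags don't merge
    $text = strtolower( strip_tags( str_replace( '<', ' <', $html)));
    // Split on anything that is not a letter or digit (ASCII only in this sketch)
    $words = preg_split( '/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values( $words);
}

// Build rows of the form KEYWORD | COUNT | WEBSITE for one page
function indexRowsForPage( $url, $html) {
    $rows = array();
    foreach( countWords( $html) as $keyword => $count) {
        // Cast the key, since PHP turns numeric words like "2024" into integer keys
        $rows[] = array( 'keyword' => (string)$keyword, 'count' => $count, 'website' => $url);
    }
    return $rows;
}

$rows = indexRowsForPage( 'http://example.com/', '<p>PHP search, search engine</p>');
```

Each entry in $rows corresponds to one row in the index table, ready to be inserted with whatever database layer you use.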

Fetching the Web Pages

Making an index was easy when we already had all web pages stored on disk. Now, to get them to the disk, we have to somehow download them from the website (URL) where they are located.

What you need to do is send a request over HTTP to the web server where the web page is located and request a copy of the page (GET). You'll use a URL to identify the web page.

This is easily done with the curl library in PHP. Basically you just tell curl which URL you want to fetch with curl_init( URL), then you run curl_exec() and, with CURLOPT_RETURNTRANSFER set, back you'll get the complete web page, or FALSE if something went wrong.

This is easy! We have downloaded a web page.

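// Using curl to fetch a web page
$curls = curl_init( $url);
// Return the page as a string instead of printing it directly
curl_setopt( $curls, CURLOPT_RETURNTRANSFER, TRUE);

// $page_contents holds the page, or FALSE if the fetch failed
$page_contents = curl_exec( $curls);
if( $page_contents === FALSE) {
    echo 'Fetch failed: ' . curl_error( $curls);
}

curl_close( $curls);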

What Do We Fetch

Fetching the pages to index was easy but how do we know what pages to fetch? This depends a lot on what type of search engine you are building.

At Search Hippo they use a list of all sites in authority directories like dmoz. You can compile a specific list of URLs using a general search engine or some other tool and use that as data for your niche search engine.

If you make a site search engine you might be able to list the files on the server disks. This way you get everything, regardless of whether it's linked to or not. You may or may not want this.
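
Listing the files could look something like the sketch below. The function name listSiteFiles(), the document root you pass in and the list of file extensions are all assumptions you would adjust for your own server.

```php
<?php
// Hypothetical helper: list files under a document root that look like pages.
// $docRoot and the extension list are assumptions for this sketch.
function listSiteFiles( $docRoot, array $extensions = array( 'html', 'htm', 'php')) {
    $files = array();
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator( $docRoot, FilesystemIterator::SKIP_DOTS));
    foreach( $iterator as $file) {
        // $file is an SplFileInfo object for every entry below $docRoot
        if( $file->isFile() && in_array( strtolower( $file->getExtension()), $extensions)) {
            $files[] = $file->getPathname();
        }
    }
    sort( $files);
    return $files;
}
```

You would still have to map each file path back to the URL it is served under before storing it in the index.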

When you build a general search engine there is no list you can use. There's no list of all pages on the Internet that you can buy. You have to follow links from one page to another, trying your best to find all the pages on the Internet, or at least the important ones.

At SecretSearchEngineLabs.com, which I'm developing myself to prove my theories, I use regular expressions to parse all links from a web page. If you look in the code listing below you'll find the actual code to do this.


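// Collects every <a href="..."> link in $inString into $links and returns the number of links found.
// Each entry in $links holds the URL in [1] and the anchor text in [2].
function parseForLinks( $inString, &$links){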
    return preg_match_all( '/<a[^>]*href\s*?=\s*?[\"\'](.*?)[\"\'][^>]*>(.*?)<\/a>/is', $inString, $links, PREG_SET_ORDER);
}
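
To follow links from page to page you also need to remember which URLs you have already seen. Below is a rough sketch of that bookkeeping. The helper names extractLinkUrls() and addToQueue() are made up for this example, and relative links are simply skipped here, which a real crawler would have to resolve against the URL of the page.

```php
<?php
// Same regular expression as in parseForLinks(), repeated so this sketch runs on its own
function extractLinkUrls( $html) {
    preg_match_all( '/<a[^>]*href\s*?=\s*?[\"\'](.*?)[\"\'][^>]*>(.*?)<\/a>/is', $html, $links, PREG_SET_ORDER);
    $urls = array();
    foreach( $links as $link) {
        $urls[] = $link[1];   // $link[1] is the href value, $link[2] is the anchor text
    }
    return $urls;
}

// Hypothetical helper: add newly found absolute http(s) URLs to the list of pages to fetch
function addToQueue( array $queue, array $seen, array $urls) {
    foreach( $urls as $url) {
        if( preg_match( '/^https?:\/\//i', $url) && !isset( $seen[$url])) {
            $seen[$url] = TRUE;
            $queue[] = $url;
        }
    }
    return array( $queue, $seen);
}

$html = '<a href="http://example.com/a">A</a> <a href=\'/relative\'>R</a> <a href="http://example.com/a">A again</a>';
list( $queue, $seen) = addToQueue( array(), array(), extractLinkUrls( $html));
```

After this, $queue holds each new absolute URL once, and you can keep fetching pages from the queue and feeding their links back in.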

Display the Results

There's one important component missing: actually displaying the search results to the user. To do this you need good titles and descriptions for the search results, which in turn means you need to have this information stored somewhere so you can just read it when needed.

For this you need a database table with information on web pages, with at least the following entries.

PAGEID | TITLE | URL | DESCRIPTION (see detailed database schema here)

You display the title, description and URL in the search results, and you need to create a link containing the URL so the user can click away to the website in question.
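
A minimal sketch of rendering one result could look like this. The function name renderResult(), the array keys and the HTML markup are assumptions for illustration. The one thing I would not skip is escaping, since titles and descriptions come from other people's pages.

```php
<?php
// Hypothetical helper: render one row of the pages table as an HTML search result.
function renderResult( array $page) {
    // Escape everything taken from crawled pages before putting it into our own HTML
    $title = htmlspecialchars( $page['title'], ENT_QUOTES, 'UTF-8');
    $url = htmlspecialchars( $page['url'], ENT_QUOTES, 'UTF-8');
    $description = htmlspecialchars( $page['description'], ENT_QUOTES, 'UTF-8');
    return '<div class="result">'
         . '<a href="' . $url . '">' . $title . '</a><br>'
         . $description . '<br>'
         . '<small>' . $url . '</small>'
         . '</div>';
}

$resultHtml = renderResult( array(
    'title' => 'Fish & Chips',
    'url' => 'http://example.com/?a=1&b=2',
    'description' => 'A page about <b>fish</b>.',
));
```

You would call this once per row returned by the search and echo the pieces one after another.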

So where do you get this information from?

I'd say the title is almost without exception best taken from the title tag of the page; this usually tells you what the page is all about. The alternative is to use the description from dmoz or the Yahoo directory, or to parse out a suitable description from inside the page.

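// Parse the title tag from an HTML page; the title text ends up in $matches[1]
preg_match( '/<title>\s*?(.*?)\s*?<\/title>/is', $page_contents, $matches);
$title = isset( $matches[1]) ? $matches[1] : '';

// Parse the description from an HTML page; $matches is overwritten by each preg_match() call
preg_match( '/<meta\s*?name\s*?=\s*?"description"\s*?content\s*?=\s*?"(.*?)"\s*?\/?>/is', $page_contents, $matches);
$description = isset( $matches[1]) ? $matches[1] : '';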

Now You Have a Delicious Search Engine

Hey, you just made yourself a delicious search engine, nice!

The above tutorial is enough to get you started. You should now be able to make a simple search engine for a site or a directory.

Please take a look at the search engine software I'm developing for my next generation search engine, Secret Search Engine Labs. Here you can see a PHP and MySQL powered general search engine in action.

If you have questions please post a comment below and I'll try my best to answer all of you. And if people are interested (please comment!) I will add more info and code to the tutorial.

Comments

ben cumaio 6 years ago

hi...

I'm new to search engine development. I'd like to see the database structure of your search engine in this tutorial.


sbyholm 6 years ago from Finland Author

Hello Ben

I added a new tutorial describing the database schema of a basic search engine here:

http://hubpages.com/technology/Search-Engine-Datab...

See if that helps. And if not, post another question in the comments!


whoisbid 5 years ago

I enjoyed this. Good luck with the scraping!


noyon sharma 2 years ago

Boss, how do I create an auto login?


Johan 2 years ago

Hi Sbyholm,

I really like what you have written. Tell me, those fetch tags that are in PHP, do you perhaps have them in JavaScript? Also, do you maybe have a database schematic of the Search Engine Database Schemas that you talked about on another page?

Please advise.

Warm regards,

Johan@mystic-blue.co.za

PS. I would really appreciate your efforts.


sbyholm 2 years ago from Finland Author

I don't use any JavaScript, just server-side scripting with PHP... hmm, of course the third-party trackers and other widgets are JavaScript, but that's not really part of the search functionality.

A basic schema is described at http://hubpages.com/technology/Search-Engine-Datab...

When I have time (probably never) I might publish something more elaborate...
