Search Engine Software in PHP and mySQL
"I'm going to make a search engine," you might say. Then I'll ask: what kind of search engine?
There are several types of search engines around, each made for a different purpose. I'll list the most common ones here; if you know of more, just leave a comment below.
- General Internet Search Engine (like Google and Yahoo)
- Site Search Engine (the search box you see on most web sites)
- Database Search Engine (product search, cars, real-estate listings, hotel finders, etc.)
- Niche Search Engine (a subset of a general search engine, like an automotive sites search)
This tutorial focuses on techniques for making a site search engine or a niche search engine. The methods are also suitable for a general search engine, but they have to be adapted a bit to work on a multi-server setup with multiple terabytes of data.
Basic Building Blocks
The basic building blocks of a search engine are not complex at all. Look at this list, and if you see something missing, tell me by leaving a comment below.
Doing The Search
When a user types a search phrase into the search box and hits the search button, the search engine takes the words searched for and looks them up in a database called the index.
This gives the search engine one list of websites per keyword or key phrase, each sorted by relevancy. By combining these lists it gets the search results to present to the user.
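To make the combining step concrete, here is a minimal sketch, not the code of any real search engine. It assumes each keyword lookup returns an array mapping a page ID to a relevance score (here simply a word count), keeps only the pages that contain every keyword, and ranks them by the summed score. The function name combineResults and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: combine per-keyword result lists into one ranked list.
// $lists is an array of arrays, each mapping pageId => score for one keyword.
function combineResults(array $lists): array
{
    if (count($lists) === 0) {
        return [];
    }
    // Start with the first keyword's pages, then narrow down with each further keyword.
    $combined = array_shift($lists);
    foreach ($lists as $list) {
        $next = [];
        foreach ($combined as $pageId => $score) {
            if (isset($list[$pageId])) {
                // The page contains this keyword too: keep it and add up the scores.
                $next[$pageId] = $score + $list[$pageId];
            }
        }
        $combined = $next;
    }
    arsort($combined); // highest score first, page IDs preserved as keys
    return $combined;
}

// Made-up index data: keyword => (pageId => count)
$index = [
    'search' => [1 => 5, 2 => 1, 3 => 2],
    'engine' => [1 => 2, 3 => 4, 4 => 7],
];
$results = combineResults([$index['search'], $index['engine']]);
// Pages 1 and 3 contain both words; page 1 (score 7) ranks ahead of page 3 (score 6).
```

Requiring every keyword (an AND search) is only one possible choice. Adding up the scores of pages that match any keyword would give an OR search instead.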
Making the Index
OK, searching was easy, but where do we get the index from? That's a little bit more tricky. We have to build and update the index from data gathered from the web. Keeping the index up to date is one of the major tasks of a search engine.
Let's say that we have a database of all web pages that need to be included in the index. To build the index we read one page at a time, parse out all the words in the page, count how often each word occurs, and then store these word-to-count entries in a database together with a pointer to the web page they appear in.
The database table structure would look like this:
KEYWORD | COUNT | WEBSITE (see detailed database schema here)
When all pages have been parsed and their word-to-count entries have been stored in the same database table, you have an index!
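The parse-and-count step could look roughly like the sketch below. It is only an illustration under some assumptions: the function names countWords and indexRows are invented, only ASCII letters and digits count as word characters, and the rows are returned as arrays instead of being written to a database.

```php
<?php
// Hypothetical helper: turn one page's HTML into a word => count list.
function countWords(string $html): array
{
    // Drop script and style blocks first, their contents are not page text.
    $html = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', ' ', $html);
    // Replace tags with spaces so words in adjacent elements don't run together.
    $text = preg_replace('/<[^>]+>/', ' ', $html);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = strtolower($text);
    // Split on anything that is not an ASCII letter or digit (an assumption of this sketch).
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

// Build KEYWORD | COUNT | WEBSITE rows for one page.
function indexRows(string $url, string $html): array
{
    $rows = [];
    foreach (countWords($html) as $word => $count) {
        // Cast the key: PHP turns numeric words like "2010" into integer keys.
        $rows[] = ['keyword' => (string) $word, 'count' => $count, 'website' => $url];
    }
    return $rows;
}

$rows = indexRows(
    'http://example.com/',
    '<html><body><h1>Search engine</h1><p>A search box.</p></body></html>'
);
```

Each element of $rows corresponds to one row of the index table, so storing them is a matter of one insert per row.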
Fetching the Web Pages
Making an index was easy when we already had all the web pages stored on disk. To get them onto the disk we have to download them from the website (URL) where they are located.
What you need to do is send a request over HTTP to the web server where the page is located and request a copy of the page (GET). You use a URL to identify the web page.
This is easily done with the curl library in PHP. Basically you tell curl which URL you want to fetch with curl_init( URL), then you run curl_exec(), and back you get the complete web page, or false if the transfer failed.
This is easy! We have downloaded a web page.
// Using curl to fetch a web page
$curls = curl_init( $url);
curl_setopt( $curls, CURLOPT_RETURNTRANSFER, TRUE);
$pagecontents = curl_exec( $curls);
curl_close( $curls);
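A crawler will meet dead links, slow servers and redirects, so in practice the fetch usually gets wrapped with some error handling. The following is a sketch of one way to do that, not a definitive implementation. The function name fetchPage, the timeout values and the user agent string are all assumptions made for this example.

```php
<?php
// Hypothetical wrapper around curl: returns the page body, or null on any failure.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $curl = curl_init($url);
    if ($curl === false) {
        return null;
    }
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,             // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,                // but not forever
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,  // give up on servers that don't answer
        CURLOPT_TIMEOUT        => $timeoutSeconds,  // limit for the whole transfer
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1', // made-up name for this sketch
    ]);
    $body   = curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    // The handle is released when $curl goes out of scope.

    // Treat transfer errors and non-2xx responses (404, 500, ...) as failures.
    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}
```

With this in place, a call like fetchPage( $url) either gives you HTML to index or null, which tells you to skip the page and move on.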
What Do We Fetch
Fetching the pages to index was easy, but how do we know which pages to fetch? This depends a lot on what type of search engine you are building.
At Search Hippo they use a list of all sites in authority directories like dmoz. You can compile a specific list of URLs using a general search engine or some other tool and use that as data for your niche search engine.
If you make a site search engine you might be able to list the files on the server disks. This way you get everything, regardless of whether it's linked to or not. You may or may not want this.
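For the site search case, listing the files on disk might be sketched like this. The function name listSiteFiles and the default list of file extensions are assumptions for the example; adjust them to whatever your site actually serves.

```php
<?php
// Hypothetical helper: list every HTML/PHP file below a document root.
function listSiteFiles(string $docRoot, array $extensions = ['html', 'htm', 'php']): array
{
    $files = [];
    // Walk the directory tree recursively, skipping the "." and ".." entries.
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($docRoot, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && in_array(strtolower($file->getExtension()), $extensions, true)) {
            $files[] = $file->getPathname();
        }
    }
    sort($files); // stable, predictable order
    return $files;
}
```

Because this reads straight from disk it also finds pages nothing links to, which is exactly the "may or may not want this" trade-off mentioned above.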
When you build a general search engine there is no list you can use. There's no list of all the pages on the Internet that you can buy. You have to follow links from one page to another, trying your best to find all the pages on the Internet, or at least the important ones.
At SecretSearchEngineLabs.com, which I'm developing myself to prove my theories, I use regular expressions to parse all links from a web page. If you look in the code listing below you'll find the actual code to do this.
function parseForLinks( $inString, &$links){
    return preg_match_all( '/<a[^>]*href\s*?=\s*?[\"\'](.*?)[\"\'][^>]*>(.*?)<\/a>/is', $inString, $links, PREG_SET_ORDER);
}
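Following links from page to page can be sketched as a breadth-first crawl: keep a queue of URLs to visit and a set of URLs already seen. This is a simplified illustration and not the code running on any real site. The function name crawl is invented, the page fetcher is passed in as a callable so you can plug in whatever download code you use, and only absolute http and https links are followed.

```php
<?php
// Hypothetical sketch of a breadth-first crawl. $fetch is any callable that
// takes a URL and returns the page HTML as a string, or null on failure.
function crawl(string $startUrl, callable $fetch, int $maxPages = 100): array
{
    $queue   = [$startUrl];          // URLs waiting to be fetched
    $seen    = [$startUrl => true];  // URLs already queued, so we never loop
    $fetched = [];                   // url => html of every page downloaded

    while ($queue && count($fetched) < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if ($html === null) {
            continue; // dead link or error: skip it
        }
        $fetched[$url] = $html;

        // A simplified link pattern that captures only the href value.
        preg_match_all('/<a[^>]*href\s*=\s*["\'](.*?)["\']/is', $html, $matches);
        foreach ($matches[1] as $link) {
            // Relative links are ignored in this sketch; resolving them is left out.
            if (preg_match('#^https?://#i', $link) && !isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $fetched;
}

// Usage with a made-up two-page "web" instead of real HTTP requests:
$web = [
    'http://a.example/' => '<a href="http://b.example/">B</a> <a href="/relative">R</a>',
    'http://b.example/' => '<a href="http://a.example/">A</a> <a href="http://c.example/">C</a>',
];
$pages = crawl('http://a.example/', function ($u) use ($web) {
    return $web[$u] ?? null;
});
```

The $maxPages limit matters: without it a crawl of the open web never finishes.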
Display the Results
There's one important component missing: actually displaying the search results to the user. To do this you need good titles and descriptions for the search results, which in turn means you need to have this information stored somewhere so you can just read it when needed.
For this you need a database table with information on web pages, with at least the following columns:
PAGEID | TITLE | URL | DESCRIPTION (see detailed database schema here)
You display the title, description and URL in the search results, and you create a link containing the URL so the user can click through to the website in question.
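Since the title, description and URL all come from pages you don't control, it is worth escaping them before printing. Here is a small sketch of rendering one result; the function name renderResult and the HTML markup are assumptions, so style it however you like.

```php
<?php
// Hypothetical helper: render one search result as an HTML snippet.
function renderResult(string $title, string $url, string $description): string
{
    // Escape everything that came from a crawled page before echoing it,
    // otherwise a page title containing HTML could break your results page.
    $t = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $u = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    $d = htmlspecialchars($description, ENT_QUOTES, 'UTF-8');
    return "<div class=\"result\"><a href=\"$u\">$t</a><p>$d</p><cite>$u</cite></div>";
}
```

You would call this once per row read from the pages table and echo the returned string.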
So where do you get this information from?
I'd say the title is almost without exception best taken from the title tag of the page; this usually tells you what the page is all about. For the description you can use the page's meta description tag, use the description from dmoz or the Yahoo directory, or parse out a suitable description from inside the page.
// Parse the title tag from an HTML page
preg_match( '/<title>\s*?(.*?)\s*?<\/title>/is', $page_contents, $matches);
$title = isset( $matches[1]) ? trim( $matches[1]) : '';
// Parse the description from an HTML page
preg_match( '/<meta\s*?name\s*?=\s*?"description"\s*?content\s*?=\s*?"(.*?)"\s*?\/?>/is', $page_contents, $matches);
$description = isset( $matches[1]) ? trim( $matches[1]) : '';
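Many pages have no usable description tag, so you may want a fallback that parses a description out of the page itself. One simple approach, sketched here under assumptions, is to take the first stretch of visible text. The function name fallbackDescription and the 160 character limit are made up, and the length is counted in bytes, so multi-byte text would need the mb_ string functions instead.

```php
<?php
// Hypothetical fallback: build a description from the page's visible text.
function fallbackDescription(string $html, int $maxLength = 160): string
{
    // Drop script and style blocks, their contents are not visible text.
    $html = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', ' ', $html);
    // Prefer the body if there is one, so the title is not repeated.
    if (preg_match('/<body[^>]*>(.*)<\/body>/is', $html, $m)) {
        $html = $m[1];
    }
    // Replace tags with spaces, decode entities, collapse whitespace.
    $text = preg_replace('/<[^>]+>/', ' ', $html);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = trim(preg_replace('/\s+/', ' ', $text));

    if (strlen($text) <= $maxLength) {
        return $text;
    }
    // Cut at the last space before the limit so no word is chopped in half.
    $cut = substr($text, 0, $maxLength);
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = substr($cut, 0, $lastSpace);
    }
    return $cut . '...';
}
```

The first text on a page is often navigation, so treat this as a last resort after the meta description and directory descriptions.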
Now You Have a Delicious Search Engine
Hey, you just made yourself a delicious search engine, nice!
The above tutorial is enough to get you started. You should now be able to make a simple search engine for a site or a directory.
Please take a look at the search engine software I'm developing for my next generation search engine Secret Search Engine Labs. Here you can see a PHP and mySQL powered general search engine in action.
If you have questions please post a comment below and I'll try my best to answer all of you. And if people are interested (please comment!) I will add more info and code to the tutorial.