A Data Science Central Community
Originally posted here.
Let’s discuss search engines. We will mainly focus on the various sources of data that you might have to fetch, or be given, to build a search engine in the first place. So, whether you are an enthusiast or have to build a professional search engine from scratch, you have come to the right place!
The details of a search engine vary with its objective, but the core functionality remains the same: information retrieval. Here are some of the sources of data that you might be given, or might want to build a search engine for. At heart they all pose the same problem, but each calls for quite a different approach to solving it.
The first source is files and folders. There might be a computer full of files and folders from which you want to retrieve some useful information. Operating systems ship with their own search implementations out of the box. The need might not be restricted to a single computer, though. It may be a mount location in which you want to embed a customized search engine that retrieves documents based on your business objective. For example, let’s say there is a bunch of legal documents in a common location, or spread across computers in a network. To make the life of paralegals and lawyers easier, you want a mechanism that retrieves that information much more efficiently than what an operating system normally provides.
In this case, the traversal of information is relatively easy. The challenge lies in updating the indices so that new or modified information can be searched and retrieved nearly as soon as the changes are made, i.e. in near real time. It is also essential that the indexing mechanism is efficient with respect to space consumption. Nobody wants an engine that consumes half of the machine’s memory or disk just to hold its index.
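To make the index-update problem concrete, here is a minimal sketch of an incremental indexer. It assumes plain-text files, a naive whitespace tokenizer, and change detection by modification time; the class name `FileIndex` and its methods are hypothetical, chosen only for illustration. A real engine would more likely react to file-system notifications than rescan the tree.

```python
import os
from collections import defaultdict


class FileIndex:
    """Toy inverted index that re-reads only new or modified files."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of file paths
        self.mtimes = {}                  # path -> mtime at last indexing
        self.terms_by_path = {}           # path -> terms indexed for it

    def _remove(self, path):
        # Drop every posting that points at this path.
        for term in self.terms_by_path.pop(path, set()):
            self.postings[term].discard(path)
            if not self.postings[term]:
                del self.postings[term]
        self.mtimes.pop(path, None)

    def refresh(self, root):
        """Rescan root; return how many files were (re)indexed or removed."""
        seen = set()
        changed = 0
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                seen.add(path)
                mtime = os.stat(path).st_mtime_ns
                if self.mtimes.get(path) == mtime:
                    continue  # unchanged since the last scan
                self._remove(path)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    terms = set(f.read().lower().split())
                for term in terms:
                    self.postings[term].add(path)
                self.terms_by_path[path] = terms
                self.mtimes[path] = mtime
                changed += 1
        # Files that disappeared since the last scan leave the index too.
        for path in set(self.mtimes) - seen:
            self._remove(path)
            changed += 1
        return changed

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))
```

Calling `refresh` on a schedule keeps the index close to the state of the disk while skipping unchanged files. Note that the index stores only the set of terms per file, not the file contents, which is one simple way to limit its space consumption.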
So, it’s essential that this problem and its many facets be looked at closely before dealing with it.
Feeds are a much in-demand source of information nowadays. Everyone is looking at Twitter feeds as we speak, as they are a ready source of information. So, it follows that we might need an engine on top of them to retrieve the information we are looking for. One advantage of building a search engine for feeds is that the documents don’t change the way files and folders do. If a correction has to be made, the feed is simply republished. So, indexing is relatively simpler. Feeds are basically of two types: push and pull.
Push feeds are more commercially implemented. In a commodity setup, they operate in a subscription-based environment. They mostly provide APIs to pull the data. For example, the TOI app, or any similar app, pushes news feeds straight to your phone, where they appear as notifications. The challenge here is latency: the gap between when the feed was published and when it became available to you. This lag is one of the primary factors in how efficient your search engine can be. Mostly the data comes in the form of XML or JSON.
Pull feeds are the other type; Twitter, for instance, has a pull feed model. Most freely available sources of feeds allow you to pull data. Mostly the data comes in the form of XML or JSON. The main difference is that you have to check the URLs or the source of data yourself, instead of someone notifying you that a new feed has been pushed. Challenges come in the form of storing the content and managing its time-to-live (TTL). Since the data is volatile and storing all of it would be expensive, most systems tend to deal with only a subset of the data.
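The pull model with a TTL can be sketched as follows. This is only an illustration under stated assumptions: `FeedStore` and `poll_once` are hypothetical names, and `fetch` stands in for whatever call retrieves the feed (for example an HTTP request whose XML or JSON response has been parsed into `(item_id, item)` pairs).

```python
import time


class FeedStore:
    """Keeps feed items only until their time-to-live runs out."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.items = {}  # item id -> (expiry timestamp, item)

    def add(self, item_id, item, now=None):
        """Store an item; return False if it is already present."""
        now = time.time() if now is None else now
        if item_id in self.items:
            return False
        self.items[item_id] = (now + self.ttl, item)
        return True

    def purge(self, now=None):
        """Delete expired items and return how many were removed."""
        now = time.time() if now is None else now
        expired = [k for k, (expiry, _) in self.items.items() if expiry <= now]
        for key in expired:
            del self.items[key]
        return len(expired)


def poll_once(fetch, store, now=None):
    """Pull the feed once and return only the items not seen before."""
    store.purge(now)
    return [item for item_id, item in fetch() if store.add(item_id, item, now)]
```

Calling `poll_once` in a loop with a sleep between calls gives the "go and check yourself" behavior described above. One limitation of this sketch is that an item which has expired but is still listed by the source will be treated as new again on the next poll.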
The web should be visualized as a graph. The links in a webpage connect it to other webpages, so they serve as the edges, while the webpages themselves are the nodes. The challenge here is the traversal of this graph, usually termed crawling.
Basically, a web crawler is built around a queue. This queue stores the links you wish to traverse next. We start from a page (usually called the seed page), find out what links are present in that page, and store them in the queue. The crawler then visits those pages and does the same thing again. So, this part is algorithm based. The algorithm is called a spider algorithm.
Challenges may involve encountering deadlocks or infinite loops. For example, a website’s home page links to its Contact Us page, and the Contact Us page has a link that takes you back to the home page. A crawler that does not remember which pages it has already visited would bounce between the two forever.
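A minimal sketch of such a queue-based crawl, with a visited set to guard against the loop just described, might look like this. The function `crawl` and its `get_links` parameter are assumptions made for illustration: `get_links` stands in for the code that downloads a page and returns the links found in it.

```python
from collections import deque


def crawl(seed, get_links, max_pages=100):
    """Breadth-first traversal of the link graph starting at the seed page."""
    queue = deque([seed])  # links waiting to be visited
    visited = {seed}       # every link that has ever been queued
    order = []             # pages in the order they were crawled
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:  # skip pages already seen
                visited.add(link)
                queue.append(link)
    return order


# A tiny in-memory "website" in which Home and Contact Us link to each other.
site = {
    "/home": ["/contact", "/about"],
    "/contact": ["/home"],
    "/about": ["/contact"],
}
print(crawl("/home", lambda url: site.get(url, [])))
# prints ['/home', '/contact', '/about']
```

Because each link enters the queue at most once, the crawl terminates even though the pages link back to each other. The `max_pages` cap is an extra safety net for very large or dynamically generated sites.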
The data comes as HTML, so it’s important to use a good HTML parser. It is also essential to prioritize the queue so that recent content is crawled first, and to revisit pages so that the index adjusts to changes in them.
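As an illustration of the parsing step, the sketch below pulls the links out of a page using only Python’s standard library (`html.parser` and `urllib.parse.urljoin`). The names `LinkExtractor` and `extract_links` are hypothetical, and a production crawler may well prefer a more forgiving third-party parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The list returned by `extract_links` is what a crawler would append to its queue of links to traverse next.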
So, before laying out the various logic and algorithms for retrieving the information, it is important to know how to get this information in the first place. There can be many other sources apart from these, but only the most popular and trending ones have been discussed here.
References: http://cdn3.img.sputniknews.com/images/101916/22/1019162264.jpg for the cover image