Abstract:
The web is a large repository of information, and categorizing web documents is essential for searching and retrieving pages from it. Automatic classification of web pages is an effective way to handle the complexity of information retrieval from the internet. Although many automatic classification algorithms and systems have been presented, most existing approaches are computationally expensive. To overcome this challenge, we propose a parallel algorithm, built on the MapReduce programming model, that automatically categorizes web pages. The approach combines three components: a web crawler, the MapReduce programming model, and the proposed web page categorization approach. First, a web crawler collects pages from the World Wide Web, and the crawled pages are given directly as input to the MapReduce programming model. The MapReduce programming model, adapted to our proposed categorization approach, then finds the appropriate category for each web page according to its content. The experimental results show that our proposed parallel web page categorization approach achieves satisfactory results in finding the right category for a given web page.
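As an illustration of how crawled pages might flow through map and reduce phases, the following is a minimal single-process sketch in Python. It is not the categorization approach proposed in this paper, whose details are not given in the abstract. The keyword lists in `CATEGORY_KEYWORDS`, the function names, and the keyword-count scoring rule are all assumptions made only for this example.

```python
from collections import defaultdict

# Hypothetical keyword lists; the abstract does not specify which features are used.
CATEGORY_KEYWORDS = {
    "sports": {"football", "match", "league"},
    "technology": {"software", "computer", "network"},
}


def map_page(url, text):
    """Map step: emit ((url, category), 1) for every keyword occurrence."""
    for word in text.lower().split():
        for category, keywords in CATEGORY_KEYWORDS.items():
            if word in keywords:
                yield (url, category), 1


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the keyword hits for each (url, category) key."""
    return {key: sum(values) for key, values in groups.items()}


def categorize(pages):
    """Run map, shuffle, and reduce, then keep the highest-scoring category per page."""
    pairs = [pair for url, text in pages.items() for pair in map_page(url, text)]
    totals = reduce_counts(shuffle(pairs))
    best = {}
    for (url, category), score in totals.items():
        if url not in best or score > best[url][1]:
            best[url] = (category, score)
    return {url: category for url, (category, _) in best.items()}


if __name__ == "__main__":
    # Stand-in for crawler output: URL mapped to page text.
    crawled = {
        "page-a": "the football match in the league",
        "page-b": "new software for a computer network",
    }
    print(categorize(crawled))
```

In a real deployment the map and reduce functions would run on separate workers over partitions of the crawled pages. Here they run sequentially to keep the example self-contained. Pages that contain none of the keywords receive no category in this sketch.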