Research Article

Page Quality Optimization in Crawler's Queue through Employing Graph Traversal Algorithms

by Saedeh Tajbar-porshokohi, Fatemeh Ahmadi-abkenari
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 106 - Number 11
Year of Publication: 2014
Authors: Saedeh Tajbar-porshokohi, Fatemeh Ahmadi-abkenari
DOI: 10.5120/18563-9803

Saedeh Tajbar-porshokohi, Fatemeh Ahmadi-abkenari. Page Quality Optimization in Crawler's Queue through Employing Graph Traversal Algorithms. International Journal of Computer Applications. 106, 11 (November 2014), 13-19. DOI=10.5120/18563-9803

@article{ 10.5120/18563-9803,
author = { Saedeh Tajbar-porshokohi and Fatemeh Ahmadi-abkenari },
title = { Page Quality Optimization in Crawler's Queue through Employing Graph Traversal Algorithms },
journal = { International Journal of Computer Applications },
issue_date = { November 2014 },
volume = { 106 },
number = { 11 },
month = { November },
year = { 2014 },
issn = { 0975-8887 },
pages = { 13-19 },
numpages = {7},
url = { https://ijcaonline.org/archives/volume106/number11/18563-9803/ },
doi = { 10.5120/18563-9803 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:39:07.715202+05:30
%A Saedeh Tajbar-porshokohi
%A Fatemeh Ahmadi-abkenari
%T Page Quality Optimization in Crawler's Queue through Employing Graph Traversal Algorithms
%J International Journal of Computer Applications
%@ 0975-8887
%V 106
%N 11
%P 13-19
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In today's information era, the Web has become one of the most powerful and fastest means of communication and interaction among human beings. Search engines, as Web-based applications, traverse the Web automatically and retrieve the set of existing fresh and up-to-date documents. The process of retrieving, storing, categorizing and indexing is done automatically based on partially smart algorithms. Although many facts about the structure of these applications remain hidden as commercial secrets, the literature tries to find the best approaches for each module in the structure of search engines. Because today's Web surfers have limited time, providing them with the most relevant and freshest documents is the most significant challenge for search engines. To do so, every module in the search engine architecture should be designed to be as smart as possible, so that it not only yields the most relevant documents but also acts in a timely manner. Among these modules is the sensitive crawler component. One of the open issues in optimizing search engine performance is to reconfigure the crawling policy so that it follows the most promising out-links, those that carry content related to the source page. The crawler module is responsible for fetching pages for the ranking module. If the crawler fetches higher-quality pages with less content drift for indexing, the ranking module will perform faster. Because the Web has a graph structure, the way of traversing it is based on the literature on graph search methods. This paper experimentally employs different graph search methods, and different combinations of them, by issuing queries to the Google search engine and measuring the quality of the received pages at a fixed graph depth, in order to identify the best method with reasonable time and space complexity for use in the crawler component of a search engine architecture.
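
To make the idea of comparing traversal orders at a fixed graph depth concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say which search methods or quality measure the paper uses, so the toy link graph `LINKS`, the relevance scores `RELEVANCE`, and the three functions are hypothetical names introduced only for illustration. The sketch shows how breadth-first, depth-first, and best-first strategies fill a crawler's queue in different orders under the same depth limit.

```python
from collections import deque
import heapq

# Hypothetical toy link graph (assumption): page -> list of out-links.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [],
    "d": ["f"],
    "e": [],
    "f": [],
}

# Hypothetical relevance scores (assumption): higher means closer to the seed topic.
RELEVANCE = {"seed": 1.0, "a": 0.4, "b": 0.9, "c": 0.2, "d": 0.7, "e": 0.8, "f": 0.1}


def bfs_order(links, seed, max_depth):
    """Visit pages level by level, never expanding pages at max_depth."""
    order, seen = [], {seed}
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth == max_depth:
            continue  # depth limit reached: do not follow out-links
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


def dfs_order(links, seed, max_depth):
    """Follow one chain of out-links as far as max_depth before backtracking."""
    order, seen = [], set()
    stack = [(seed, 0)]
    while stack:
        page, depth = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        if depth == max_depth:
            continue
        # Push in reverse so the first out-link is explored first.
        for nxt in reversed(links.get(page, [])):
            if nxt not in seen:
                stack.append((nxt, depth + 1))
    return order


def best_first_order(links, relevance, seed, max_depth):
    """Always fetch the queued page with the highest relevance score next."""
    order, seen = [], {seed}
    heap = [(-relevance.get(seed, 0.0), seed, 0)]  # negate: heapq is a min-heap
    while heap:
        _, page, depth = heapq.heappop(heap)
        order.append(page)
        if depth == max_depth:
            continue
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (-relevance.get(nxt, 0.0), nxt, depth + 1))
    return order


if __name__ == "__main__":
    for name, order in [
        ("BFS", bfs_order(LINKS, "seed", 2)),
        ("DFS", dfs_order(LINKS, "seed", 2)),
        ("Best-first", best_first_order(LINKS, RELEVANCE, "seed", 2)),
    ]:
        print(name, order)
```

With the depth fixed at 2, all three strategies reach the same six pages and skip page `f`, but they fetch them in different orders. Under a limited crawl budget, that ordering determines which pages are indexed first, which is the kind of difference the paper's experiments evaluate.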

Index Terms

Computer Science
Information Sciences

Keywords

Graph Traversal Approaches, Search Engine Optimization (SEO), Web Crawler, Web Page Ranking Methods