Research Article

Crawling the Hidden Web: An Approach to Dynamic Web Indexing

by Moumie Soulemane, Mohammad Rafiuzzaman, Hasan Mahmud
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 55 - Number 1
Year of Publication: 2012
DOI: 10.5120/8717-7290

Moumie Soulemane, Mohammad Rafiuzzaman, Hasan Mahmud. Crawling the Hidden Web: An Approach to Dynamic Web Indexing. International Journal of Computer Applications. 55, 1 (October 2012), 7-15. DOI=10.5120/8717-7290

@article{ 10.5120/8717-7290,
author = { Moumie Soulemane and Mohammad Rafiuzzaman and Hasan Mahmud },
title = { Crawling the Hidden Web: An Approach to Dynamic Web Indexing },
journal = { International Journal of Computer Applications },
issue_date = { October 2012 },
volume = { 55 },
number = { 1 },
month = { October },
year = { 2012 },
issn = { 0975-8887 },
pages = { 7-15 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume55/number1/8717-7290/ },
doi = { 10.5120/8717-7290 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:56:08.292525+05:30
%A Moumie Soulemane
%A Mohammad Rafiuzzaman
%A Hasan Mahmud
%T Crawling the Hidden Web: An Approach to Dynamic Web Indexing
%J International Journal of Computer Applications
%@ 0975-8887
%V 55
%N 1
%P 7-15
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Most websites that hold online information are dynamic, and many traditional search engines cannot index them. As the number of such hidden web pages keeps growing, how to index them remains a matter of debate among researchers and practitioners in the web mining community. The features that enrich these dynamic pages also make them progressively harder to index. In this paper, we explain these features and the challenges they pose, and we present a framework for dynamic web indexing. We describe the implementation of the framework, the experimental setup, the development process, and the results obtained. We conclude by outlining possible future work: integrating Hadoop MapReduce with the framework to update and maintain the index.

Index Terms

Computer Science
Information Sciences

Keywords

Dynamic web pages, crawler, hidden web, index, Hadoop