Research Article

TubeExtractor: A Crawler and Converter for Generating Research DataSet from YouTube Videos

by Pooja Ajwani, Harshal Arolkar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 182 - Number 50
Year of Publication: 2019
Authors: Pooja Ajwani, Harshal Arolkar
DOI: 10.5120/ijca2019918738

Pooja Ajwani, Harshal Arolkar. TubeExtractor: A Crawler and Converter for Generating Research DataSet from YouTube Videos. International Journal of Computer Applications. 182, 50 (Apr 2019), 14-17. DOI=10.5120/ijca2019918738

@article{ 10.5120/ijca2019918738,
author = { Pooja Ajwani and Harshal Arolkar },
title = { TubeExtractor: A Crawler and Converter for Generating Research DataSet from YouTube Videos },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2019 },
volume = { 182 },
number = { 50 },
month = { Apr },
year = { 2019 },
issn = { 0975-8887 },
pages = { 14-17 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume182/number50/30537-2019918738/ },
doi = { 10.5120/ijca2019918738 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:14:52.721605+05:30
%A Pooja Ajwani
%A Harshal Arolkar
%T TubeExtractor: A Crawler and Converter for Generating Research DataSet from YouTube Videos
%J International Journal of Computer Applications
%@ 0975-8887
%V 182
%N 50
%P 14-17
%D 2019
%I Foundation of Computer Science (FCS), NY, USA
Abstract

With the advent of the internet and e-resources, the amount of data available to users has grown exponentially. Among content providers, YouTube ranks as the second most popular website in the world. Because YouTube data is easily accessible, many researchers collect YouTube videos as datasets for their research. Finding the right videos for data analysis is cumbersome, however, because YouTube hosts an enormous number of videos, and researchers spend a great deal of time assembling the dataset they need. To reduce that time, this paper proposes an open-source application, “TubeExtractor”. TubeExtractor allows researchers to download videos and their metadata from YouTube based on parameters the researcher provides. It also outputs a plain text file for each downloaded video, which researchers can use for further processing of their choice. The keywords used to download the videos are supplied to the crawler as a document generated by a keyphrase extractor algorithm. If the vtt (Video Text Tracks) file of a video to be downloaded is available, a plain text file is created from it using a two-step parser. TubeExtractor can thus save researchers considerable time.
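
The abstract does not describe the two steps of the parser, so the following is only a minimal sketch of how a vtt file could be reduced to plain text, not the authors' implementation. It assumes that step one keeps only the caption lines (dropping the WEBVTT header, NOTE blocks, cue identifiers, and timing lines) and that step two strips inline tags and collapses consecutive duplicate lines, which are common in auto-generated captions. The function names `extract_cue_lines`, `clean_lines`, and `vtt_to_text` are hypothetical.

```python
import html
import re

# A cue timing line, e.g. "00:00:01.000 --> 00:00:04.000 align:start".
# The hours field is optional in WebVTT timestamps.
TIMING = re.compile(r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+")
# Inline markup such as <c>, </c>, <v Speaker>, or <00:00:03.000>.
TAG = re.compile(r"<[^>]+>")


def extract_cue_lines(vtt_text):
    """Step 1 (assumed): keep only the text lines that follow a timing line."""
    lines = []
    in_cue = False
    for raw in vtt_text.splitlines():
        line = raw.strip()
        if not line:
            in_cue = False  # a blank line ends the current block
            continue
        if TIMING.match(line):
            in_cue = True  # caption text starts on the next line
            continue
        if in_cue:
            lines.append(line)
    return lines


def clean_lines(lines):
    """Step 2 (assumed): strip tags, decode entities, drop repeated lines."""
    cleaned = []
    for line in lines:
        text = html.unescape(TAG.sub("", line))
        text = " ".join(text.split())  # normalize whitespace
        if text and (not cleaned or cleaned[-1] != text):
            cleaned.append(text)
    return " ".join(cleaned)


def vtt_to_text(vtt_text):
    """Run both steps and return the captions as one plain-text string."""
    return clean_lines(extract_cue_lines(vtt_text))
```

Calling `vtt_to_text` on the contents of a downloaded .vtt file returns a single string that can be written to a .txt file. Splitting the work into two functions keeps structural filtering separate from text cleanup, so either step can be changed without touching the other.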

Index Terms

Computer Science
Information Sciences

Keywords

Crawler, keyphrase extractor, parser, youtube-dl, vtt, RAKE