Understand the source of data

Since our data comes from the YouTube API, we first use a notebook to inspect what the API returns and which fields we can use.

vertopal.com_introduce.pdf

From the information above, we decide which API endpoints to use.

Resources	Methods	Parameters	Quota Impact	Result
Search	search.list	snippet	100 units per request	Up to 50 results per request
Videos	videos.list	snippet	1 unit per part per request	Up to 50 results per request
Videos	videos.list	statistics	1 unit per part per request	Up to 50 results per request
Videos	videos.list	contentDetails	1 unit per part per request	Up to 50 results per request
Channels		snippet
		statistics
		topicDetails
		contentDetails
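
As a rough illustration of how the endpoints in the table are called, the sketch below builds request URLs for videos.list and channels.list in batches of at most 50 IDs. The helper names `chunk` and `build_list_url` are hypothetical, and the API key is a placeholder; only the base URL and the `part`, `id`, and `key` query parameters come from the YouTube Data API v3.

```python
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3"
MAX_IDS_PER_REQUEST = 50  # "Up to 50 results per request" in the table


def chunk(ids, size=MAX_IDS_PER_REQUEST):
    """Split a list of IDs into batches of at most `size` (hypothetical helper)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def build_list_url(resource, parts, ids, api_key):
    """Build a GET URL for videos.list or channels.list (hypothetical helper).

    resource: "videos" or "channels"
    parts:    values for the `part` parameter, e.g. ["snippet", "statistics"]
    ids:      up to 50 video or channel IDs
    """
    if len(ids) > MAX_IDS_PER_REQUEST:
        raise ValueError("at most 50 IDs per request")
    query = urlencode({
        "part": ",".join(parts),
        "id": ",".join(ids),
        "key": api_key,
    })
    return f"{API_BASE}/{resource}?{query}"


# Usage: 120 placeholder IDs are split into batches of 50, 50, and 20.
video_ids = [f"video{i}" for i in range(120)]
urls = [
    build_list_url("videos", ["snippet", "statistics", "contentDetails"],
                   batch, "YOUR_API_KEY")
    for batch in chunk(video_ids)
]
print(len(urls))  # 3
```

Requesting several parts in one call keeps the number of requests low, which matters because the quota is counted per request.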

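YouTube gives free accounts a quota of 10,000 units per day. If the whole quota is spent on a single endpoint, this means at most:

5,000 search results per day (100 requests x 50 results)
~166,000 video detail records per day (3 parts per request: about 3,333 requests x 50 results)
~125,000 channel detail records per day (4 parts per request: 2,500 requests x 50 results)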
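
The arithmetic behind these estimates can be checked with a short sketch. It assumes the costs listed in the table above (100 units per search request, 1 unit per part per request for videos and channels) and that the full daily quota goes to one endpoint; the function name `daily_capacity` is hypothetical.

```python
DAILY_QUOTA = 10_000        # quota units per day
RESULTS_PER_REQUEST = 50    # maximum results returned by one request


def daily_capacity(units_per_request):
    """Number of results obtainable per day at a given cost per request."""
    requests_per_day = DAILY_QUOTA // units_per_request
    return requests_per_day * RESULTS_PER_REQUEST


search_capacity = daily_capacity(100)  # search.list: 100 units per request
video_capacity = daily_capacity(3)     # videos.list with 3 parts
channel_capacity = daily_capacity(4)   # channels list with 4 parts

print(search_capacity)   # 5000
print(video_capacity)    # 166650
print(channel_capacity)  # 125000
```

In practice the quota is shared between all three endpoints, so the real daily numbers depend on how the requests are split.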

Although we cannot get video captions through the official YouTube API (downloading captions requires permission to edit the video), we can get them for free through youtube_transcript_api. Its drawbacks are a limit on the number of requests per hour and IP blocking. These can be worked around by pausing between requests with time.sleep() and by running the crawler on Kaggle or Colab.
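
A minimal sketch of the time.sleep() approach is shown below. It does not call youtube_transcript_api directly: the `fetch` argument is a hypothetical callable that the caller would wrap around whichever fetch function their installed version of that library provides. The delay values and retry count are assumptions, not measured limits.

```python
import time


def fetch_transcripts(video_ids, fetch, delay_seconds=2.0, max_retries=3,
                      sleep=time.sleep):
    """Fetch transcripts one by one, pausing between requests.

    fetch: callable taking a video ID and returning its transcript
           (hypothetical; meant to wrap youtube_transcript_api).
    sleep: injectable for testing; defaults to time.sleep.
    Returns (transcripts, failed_ids).
    """
    transcripts, failed = {}, []
    for video_id in video_ids:
        for attempt in range(max_retries):
            try:
                transcripts[video_id] = fetch(video_id)
                break
            except Exception:
                # Broad catch for the sketch: wait longer after each
                # failure (delay, 2 x delay, 4 x delay, ...).
                sleep(delay_seconds * 2 ** attempt)
        else:
            # Every attempt failed; record the ID so it can be retried later.
            failed.append(video_id)
        sleep(delay_seconds)  # fixed pause before the next video
    return transcripts, failed
```

Keeping the failed IDs separate makes it possible to retry them in a later session, for example from a different Kaggle or Colab runtime.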

In the first stage of the project, we focus on finding current trends, popular channels, and the general audience attitude, so we concentrate on the search, video, and channel APIs.

Stages of the project

Crawl 10,000 videos to get their brief introductory information ⇒ preprocess and analyze the information about these videos
Crawl the channels related to these videos ⇒ preprocess and analyze
From the list of channels above, select the top channels and crawl their top videos (including captions, comments, ...) ⇒ analyze: should we suggest this to the company?
Automate: every two weeks, crawl more videos and update the data of old videos
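
The channel filtering and the periodic update could be sketched as follows. The ranking metric (`viewCount`), the two-week interval, and the storage layout (a dict of video ID to last crawl time) are assumptions for illustration; the notes above do not fix any of them. The `statistics` fields are read as numeric strings, which is how the channels resource returns them.

```python
from datetime import datetime, timedelta

REFRESH_INTERVAL = timedelta(weeks=2)  # assumed meaning of the update cycle


def top_channels(channels, n=10, metric="viewCount"):
    """Return the n channels with the highest value of `metric`.

    channels: list of dicts shaped like channel items, where
              item["statistics"][metric] is a numeric string.
    Channels missing the metric are treated as 0.
    """
    return sorted(
        channels,
        key=lambda ch: int(ch["statistics"].get(metric, 0)),
        reverse=True,
    )[:n]


def due_for_refresh(last_crawled, now):
    """Return IDs of videos whose data is at least two weeks old.

    last_crawled: dict mapping video ID -> datetime of the last crawl
                  (hypothetical storage layout).
    """
    return [vid for vid, last in last_crawled.items()
            if now - last >= REFRESH_INTERVAL]
```

Other metrics such as `subscriberCount` can be passed as `metric`, though some channels hide that value, which is why missing fields default to 0 here.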