YouTube Scraper#
💡Overview of methods#
How to send web requests via the API?
- Using the googleapiclient package provided by Google
- Sending API requests by formatting URLs
How to get videos? Different API functions can be used:
- By searching: each search costs 100 units of quota
- By listing all videos with playlistItems and then searching within them: each listing request costs 1 unit of quota
Process for the playlistItems listing method:
1. Get the upload playlist ID from a channel ID
2. Get the video IDs under the upload playlist ID
3. Get all comments/chats for each video ID
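To make the quota difference concrete, here is a rough sketch of the arithmetic. The per-request costs are the figures quoted in the overview; the page size of 50, the helper name `quota_cost`, and the 1,000-video figure are assumptions chosen only for illustration.

```python
import math

# Costs as quoted in the overview above; verify them against your own quota page.
SEARCH_COST = 100        # units per search request
PLAYLIST_ITEMS_COST = 1  # units per playlistItems listing request
PAGE_SIZE = 50           # assumed number of results returned per request


def quota_cost(num_videos: int, cost_per_request: int, page_size: int = PAGE_SIZE) -> int:
    """Return the quota units needed to page through num_videos results."""
    requests_needed = math.ceil(num_videos / page_size)
    return requests_needed * cost_per_request


# Hypothetical workload: paging through 1,000 videos
search_units = quota_cost(1000, SEARCH_COST)           # 20 requests * 100 units
listing_units = quota_cost(1000, PLAYLIST_ITEMS_COST)  # 20 requests * 1 unit
print(search_units, listing_units)  # prints: 2000 20
```

Under these assumptions the listing route uses one hundredth of the quota that searching does for the same number of pages.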
Ask LLM#
Depending on how the request is sent, there are two ways to scrape videos:
Using googleapiclient, the API client provided by Google:#
googleapiclient is a Python library provided by Google that makes it easier to use Google's various APIs, including the YouTube Data API. It simplifies sending requests, retrieving data, and handling API responses.
googleapiclient.discovery module: The discovery module within this library provides tools for creating API client objects.
build function: The build function is the key function provided by the googleapiclient.discovery module. It creates a client object for a specific Google API, and that client object lets you make requests to the API and retrieve data.
from googleapiclient.discovery import build
api_key = 'YOUR_API_KEY'
keyword = "desired keyword"
youtube = build('youtube', 'v3', developerKey=api_key)
search_response = youtube.search().list(
    part='snippet',
    q=keyword,
    maxResults=50
).execute()
videos = search_response.get('items', [])
print(f'Number of videos found: {len(videos)}')
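Because the search above does not restrict the result type, the returned items may include channels and playlists as well as videos. The sketch below shows one way to keep only the video results. The helper name `extract_video_results` and the sample response are hypothetical; the sample is hand-written to resemble a search response and is not real API output.

```python
def extract_video_results(search_response: dict) -> list:
    """Return id/title pairs for the search items that are videos."""
    results = []
    for item in search_response.get('items', []):
        item_id = item.get('id', {})
        # Channels and playlists carry a different kind, so skip them
        if item_id.get('kind') != 'youtube#video':
            continue
        results.append({
            'video_id': item_id.get('videoId'),
            'title': item.get('snippet', {}).get('title'),
        })
    return results


# Hand-written sample shaped like a search response (illustrative only)
sample_response = {
    'items': [
        {'id': {'kind': 'youtube#video', 'videoId': 'vid001'},
         'snippet': {'title': 'First video'}},
        {'id': {'kind': 'youtube#channel', 'channelId': 'chan001'},
         'snippet': {'title': 'A channel'}},
        {'id': {'kind': 'youtube#video', 'videoId': 'vid002'},
         'snippet': {'title': 'Second video'}},
    ]
}

video_results = extract_video_results(sample_response)
print([v['video_id'] for v in video_results])  # prints: ['vid001', 'vid002']
```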
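Sending the API request by URL:#
You can also use Python's requests library directly to send HTTP requests, which gives more flexible control over the details of each request. With this approach you build the request URL manually and handle the API response yourself.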
...
keyword = "desired keyword"
url = f"https://www.googleapis.com/youtube/v3/search?part=snippet&q={keyword}&maxResults=25&type=video&key={api_key}"
response = requests.get(url)
if response.status_code == 200:
    search_results = response.json()
    for item in search_results.get('items', []):
        title = item['snippet']['title']
        ...
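When the URL is built by hand, keywords containing spaces or non-ASCII characters are safer if percent-encoded explicitly. The following is a minimal sketch using the standard library's `urlencode`; the helper name `build_search_url` and the constant `SEARCH_ENDPOINT` are hypothetical, and the query parameters are the ones used in the URL above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(keyword: str, api_key: str, max_results: int = 25) -> str:
    """Build a search URL with every query parameter percent-encoded."""
    params = {
        "part": "snippet",
        "q": keyword,
        "maxResults": max_results,
        "type": "video",
        "key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("desired keyword", "YOUR_API_KEY")
print(url)
# prints: https://www.googleapis.com/youtube/v3/search?part=snippet&q=desired+keyword&maxResults=25&type=video&key=YOUR_API_KEY
```

The resulting string can be passed to requests.get(url) as in the example above. Alternatively, requests.get accepts a params dictionary and performs the same encoding itself.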
Search videos metadata#
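Depending on the goal, there are two ways to scrape videos:
- Use the Search method described above to find videos that match a keyword (each search costs 100 units, and the daily quota is 10,000 units).
- Use playlistItems to list every video under a given playlist (each listing of up to 50 videos costs one unit), as in the following example: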
from googleapiclient.discovery import build
# api_key must be defined before build() is called; without it this cell fails with
# NameError: name 'api_key' is not defined
api_key = 'YOUR_API_KEY'
# Replace with the ID of the channel you want to retrieve videos from
channel_id = "UC5nwNW4KdC0SzrhF9BXEYOQ"
# Create a YouTube API client
youtube = build('youtube', 'v3', developerKey=api_key)
# Use the channels().list() method to get the default upload playlist ID of the channel
channel_response = youtube.channels().list(
    part='contentDetails',
    id=channel_id
).execute()
# Extract the upload playlist ID
if 'items' in channel_response:
    upload_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']
    # Use the playlistItems().list() method to get all video items from the upload playlist
    videos = []
    next_page_token = None
    limit = 0
    # Loop through the pages of the playlist items (videos)
    while True:
        # if limit > 5:  # Limit the number of pages to retrieve (adjust as needed)
        #     break
        playlist_items_response = youtube.playlistItems().list(
            part='snippet',
            maxResults=50,
            playlistId=upload_playlist_id,
            pageToken=next_page_token
        ).execute()
        videos.extend(playlist_items_response.get('items', []))
        next_page_token = playlist_items_response.get('nextPageToken')
        if not next_page_token:
            break
        # limit += 1
    # Print the number of videos retrieved
    print(f'Number of videos: {len(videos)}')
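The overview lists "getting video IDs under the upload playlist" as the next step. As a sketch of that step, the helper below pulls the video ID out of each playlist item. The name `extract_video_ids` and the sample items are hypothetical; the sample is hand-written to resemble playlist items requested with part='snippet', where the video ID sits under snippet.resourceId.

```python
def extract_video_ids(playlist_items: list) -> list:
    """Collect the video ID from each playlist item that has one."""
    video_ids = []
    for item in playlist_items:
        resource = item.get('snippet', {}).get('resourceId', {})
        video_id = resource.get('videoId')
        if video_id:  # skip items without a video reference
            video_ids.append(video_id)
    return video_ids


# Hand-written sample shaped like playlist items (illustrative only)
sample_items = [
    {'snippet': {'title': 'First upload',
                 'resourceId': {'kind': 'youtube#video', 'videoId': 'abc123'}}},
    {'snippet': {'title': 'Item without a resource'}},
    {'snippet': {'title': 'Second upload',
                 'resourceId': {'kind': 'youtube#video', 'videoId': 'def456'}}},
]

video_ids = extract_video_ids(sample_items)
print(video_ids)  # prints: ['abc123', 'def456']
```

With the code above, the same helper could be applied to the videos list once the loop has finished.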
Retrieve the upload playlist of each channel#
Upload Playlist ID: If the channel information is found, the code extracts the upload_playlist_id from the response. This is crucial because the upload playlist is a special playlist associated with a channel that contains all the videos uploaded to that channel. By retrieving the upload_playlist_id, you can later access and retrieve all the videos uploaded to the channel.
channel_response = youtube.channels().list(
    part='contentDetails',
    id=channel_id
).execute()
# Extract the upload playlist ID
if 'items' in channel_response:
    upload_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']
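If the response has no items, the snippet above leaves upload_playlist_id undefined, and later code that uses it would fail. One possible way to make that case explicit is a small helper that returns None instead. The name `get_upload_playlist_id` and the sample responses are hypothetical and hand-written for illustration.

```python
from typing import Optional


def get_upload_playlist_id(channel_response: dict) -> Optional[str]:
    """Return the upload playlist ID, or None when the channel was not found."""
    items = channel_response.get('items') or []
    if not items:
        return None
    related = items[0].get('contentDetails', {}).get('relatedPlaylists', {})
    return related.get('uploads')


# Hand-written samples shaped like channels().list() responses (illustrative only)
found = {'items': [{'contentDetails': {'relatedPlaylists': {'uploads': 'UU_example'}}}]}
not_found = {'pageInfo': {'totalResults': 0}}

print(get_upload_playlist_id(found))      # prints: UU_example
print(get_upload_playlist_id(not_found))  # prints: None
```

Callers can then check for None and skip the channel rather than hitting a NameError further down.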
Main scraper with channel_id and upload_playlist_id#
while True and limit: while True: creates an infinite loop, so the code inside the loop keeps executing until it is explicitly stopped with a break statement. limit is a variable, initially set to 0, that keeps track of the number of pages retrieved. The loop continues as long as limit is less than or equal to 5, so at most 6 pages (limit values 0 to 5) are requested.
nextPageToken: When you request a list of items (here, playlist items), the API may not return all of them at once, especially when there are many. Instead, it returns a limited number of items together with a nextPageToken. Passing this nextPageToken in the next request retrieves the next "page" of results, which lets you walk through large result sets efficiently.
break: break is a control flow statement that exits a loop early. In this code it stops the while loop in two cases: when limit exceeds 5 (the if limit > 5 check), or when the response contains no nextPageToken. This caps the number of pages retrieved and prevents an infinite loop or excessive API requests.
videos = []
next_page_token = None
limit = 0
# Loop through the pages of the playlist items (videos)
while True:
    if limit > 5:  # Limit the number of pages to retrieve (adjust as needed)
        break
    playlist_items_response = youtube.playlistItems().list(
        part='snippet',
        maxResults=50,
        playlistId=upload_playlist_id,
        pageToken=next_page_token
    ).execute()
    videos.extend(playlist_items_response.get('items', []))
    next_page_token = playlist_items_response.get('nextPageToken')
    if not next_page_token:
        break
    limit += 1
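The same pagination pattern can be separated from the API call so that it can be tried without network access. This is a minimal sketch, not the original author's implementation: `collect_pages`, `fetch_page`, `max_pages`, and the `FAKE_PAGES` data are all hypothetical names introduced for illustration.

```python
from typing import Callable, Optional


def collect_pages(fetch_page: Callable[[Optional[str]], dict],
                  max_pages: Optional[int] = None) -> list:
    """Follow nextPageToken until it runs out or max_pages pages were fetched."""
    items = []
    page_token = None
    pages_fetched = 0
    while max_pages is None or pages_fetched < max_pages:
        response = fetch_page(page_token)
        items.extend(response.get('items', []))
        pages_fetched += 1
        page_token = response.get('nextPageToken')
        if not page_token:
            break  # no further pages
    return items


# Fake three-page data source keyed by page token (illustrative only)
FAKE_PAGES = {
    None: {'items': [1, 2], 'nextPageToken': 'p2'},
    'p2': {'items': [3, 4], 'nextPageToken': 'p3'},
    'p3': {'items': [5]},
}

all_items = collect_pages(FAKE_PAGES.get)
first_two_pages = collect_pages(FAKE_PAGES.get, max_pages=2)
print(all_items, first_two_pages)  # prints: [1, 2, 3, 4, 5] [1, 2, 3, 4]
```

With the real client, fetch_page could be a small function that calls youtube.playlistItems().list(...) with pageToken set to the token it receives and returns the result of .execute(); max_pages=6 would then mirror the limit > 5 check above.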