YouTube Scraper

💡Overview of methods#

How to send web requests via the API?
- Using the googleapiclient package provided by Google
- Sending API requests by formatting URLs manually
How to get videos? Different API functions can be used:
- By searching: each search request costs 100 units of quota
- By listing all videos in a playlist with playlistItems and then searching within them: each list request costs 1 unit of quota
Steps of the playlistItems method
1. Get the uploads playlist ID from a channel ID
2. Get the video IDs under the uploads playlist ID
3. Get all comments/chats for each video ID
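To make the quota difference concrete, here is a rough sketch of the arithmetic. It assumes the costs stated above (100 units per search request, 1 unit per playlistItems request), at most 50 results per request, and a daily quota of 10,000 units. The function name `quota_cost` and the figure of 500 videos are made up for illustration.

```python
import math

SEARCH_COST = 100      # units per search request (as stated above)
PLAYLIST_COST = 1      # units per playlistItems request (as stated above)
PAGE_SIZE = 50         # assumed maximum number of results per request
DAILY_QUOTA = 10_000   # assumed daily quota in units


def quota_cost(n_videos, cost_per_call, page_size=PAGE_SIZE):
    """Units needed to page through n_videos results."""
    pages = math.ceil(n_videos / page_size)
    return pages * cost_per_call


search_units = quota_cost(500, SEARCH_COST)      # 10 pages * 100 units
playlist_units = quota_cost(500, PLAYLIST_COST)  # 10 pages * 1 unit
print(search_units, playlist_units)              # 1000 10
print(DAILY_QUOTA // SEARCH_COST)                # 100 search requests per day
```

Under these assumptions, collecting the same number of videos through playlistItems is 100 times cheaper than through search.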

Ask LLM#

Depending on how the requests are sent, there are two ways to scrape videos:

Using the Google-provided API library `googleapiclient`:#

googleapiclient is a Python library provided by Google. It makes it easier to use Google's various APIs, including the YouTube Data API: sending requests, retrieving data, and handling API responses.

googleapiclient.discovery Module: The googleapiclient library is a Python client library for interacting with various Google APIs, including the YouTube Data API. The discovery module within this library provides tools for creating API client objects.
build Function: The build function is a key function provided by the googleapiclient.discovery module. It is used to create a client object for interacting with a specific Google API. This client object allows you to make requests to the API and retrieve data.

from googleapiclient.discovery import build
api_key = 'YOUR_API_KEY'
keyword = "desired keyword"
youtube = build('youtube', 'v3', developerKey=api_key)
search_response = youtube.search().list(
    part='snippet',
    q=keyword,
    maxResults=50
).execute()
videos = search_response.get('items', [])
print(f'Number of videos found: {len(videos)}')
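Once the search response is available, the useful fields still have to be pulled out of each item. The following is a minimal sketch, assuming each search result carries its type in `id.kind` and, for videos, its ID in `id.videoId`. The helper name `extract_videos` and the sample dictionary are hypothetical and hand-written, not real API output.

```python
def extract_videos(search_response):
    """Return (video_id, title) pairs from a search response dict."""
    results = []
    for item in search_response.get('items', []):
        # Search results can also be channels or playlists; keep videos only
        if item.get('id', {}).get('kind') != 'youtube#video':
            continue
        results.append((item['id']['videoId'], item['snippet']['title']))
    return results


# Hand-written sample shaped like a search response (not real data)
sample_response = {
    'items': [
        {'id': {'kind': 'youtube#video', 'videoId': 'abc123'},
         'snippet': {'title': 'First video'}},
        {'id': {'kind': 'youtube#channel', 'channelId': 'UC_example'},
         'snippet': {'title': 'Some channel'}},
    ]
}
print(extract_videos(sample_response))  # [('abc123', 'First video')]
```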

Sending API requests via URL:#

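You can also send HTTP requests directly with Python's requests library, which gives more flexible control over the details of each request. With this approach you have to build the request URL by hand and handle the API response yourself.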

...
keyword = "desired keyword"
url = f"https://www.googleapis.com/youtube/v3/search?part=snippet&q={keyword}&maxResults=25&type=video&key={api_key}"
response = requests.get(url)
if response.status_code == 200:
    search_results = response.json()
    for item in search_results.get('items', []):
        title = item['snippet']['title']
        ...
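A keyword containing spaces or special characters has to be URL-encoded before it is placed in the query string. The sketch below builds the same search URL with the standard library's `urlencode`, so the encoding is handled automatically. The helper name `build_search_url` is hypothetical, and the parameters simply mirror the URL used above.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(keyword, api_key, max_results=25):
    """Build a search URL with properly encoded query parameters."""
    params = {
        "part": "snippet",
        "q": keyword,
        "maxResults": max_results,
        "type": "video",
        "key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("desired keyword", "YOUR_API_KEY")
print(url)
```

Alternatively, the same dictionary can be passed as `requests.get(SEARCH_ENDPOINT, params=params)`, which lets requests do the encoding.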

Search videos metadata#

Depending on the goal, there are two ways to collect videos:

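- Use the Search method described above to find videos that match a keyword (each search request costs 100 units, and the daily quota is 10,000 units).
- Use playlistItems to list every video under a playlist (each request lists up to 50 videos and costs 1 unit), as in the following example: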

from googleapiclient.discovery import build

# Replace with the ID of the channel you want to retrieve videos from
channel_id = "UC5nwNW4KdC0SzrhF9BXEYOQ"

# Create a YouTube API client (replace the placeholder with your own API key)
api_key = 'YOUR_API_KEY'
youtube = build('youtube', 'v3', developerKey=api_key)

# Use the channels().list() method to get the default upload playlist ID of the channel
channel_response = youtube.channels().list(
    part='contentDetails',
    id=channel_id
).execute()

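# Extract the upload playlist ID; stop early if the channel was not found,
# because the loop below cannot run without it
if not channel_response.get('items'):
    raise ValueError(f'Channel not found: {channel_id}')
upload_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']

# Use the playlistItems().list() method to get all video items from the upload playlist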

videos = []
next_page_token = None
limit = 0

# Loop through the pages of the playlist items (videos)
while True:
#    if limit > 5:  # Limit the number of pages to retrieve (adjust as needed)
#        break
    playlist_items_response = youtube.playlistItems().list(
        part='snippet',
        maxResults=50,
        playlistId=upload_playlist_id,
        pageToken=next_page_token
    ).execute()

    videos.extend(playlist_items_response.get('items', []))

    next_page_token = playlist_items_response.get('nextPageToken')
    if not next_page_token:
        break
#    limit += 1

# Print the number of videos retrieved
print(f'Number of videos: {len(videos)}')

Note: if api_key has not been defined before build() is called, the cell fails with:

NameError: name 'api_key' is not defined
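The items collected in `videos` are playlist entries, not bare video IDs. As a minimal sketch of step 2 of the overview, and assuming each playlist item stores its target under `snippet.resourceId`, the IDs could be extracted as follows. The helper name `extract_video_ids` and the sample items are hypothetical.

```python
def extract_video_ids(playlist_items):
    """Return the video IDs referenced by a list of playlist items."""
    video_ids = []
    for item in playlist_items:
        resource = item.get('snippet', {}).get('resourceId', {})
        # Skip entries that do not point at a video
        if resource.get('kind') == 'youtube#video':
            video_ids.append(resource['videoId'])
    return video_ids


# Hand-written sample shaped like playlistItems results (not real data)
sample_items = [
    {'snippet': {'title': 'A',
                 'resourceId': {'kind': 'youtube#video', 'videoId': 'vid_a'}}},
    {'snippet': {'title': 'B',
                 'resourceId': {'kind': 'youtube#video', 'videoId': 'vid_b'}}},
]
print(extract_video_ids(sample_items))  # ['vid_a', 'vid_b']
```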

Retrieve the upload playlist of each channel#

Upload Playlist ID: If the channel is found, the code extracts the upload_playlist_id from the response. The upload playlist is a special playlist attached to every channel that contains all the videos uploaded to it, so once you have the upload_playlist_id you can list every video the channel has uploaded.

channel_response = youtube.channels().list(
    part='contentDetails',
    id=channel_id
).execute()

# Extract the upload playlist ID
if 'items' in channel_response:
    upload_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']
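The snippet above leaves `upload_playlist_id` undefined when the channel is not found. One possible way to make that case explicit is to wrap the lookup in a small function that returns `None` instead. This is only a sketch: the function name `get_upload_playlist_id` and the sample response are hypothetical.

```python
def get_upload_playlist_id(channel_response):
    """Return the upload playlist ID, or None if no channel was returned."""
    items = channel_response.get('items') or []
    if not items:
        return None
    return items[0]['contentDetails']['relatedPlaylists']['uploads']


# Hand-written sample shaped like a channels response (not real data)
sample_channel_response = {
    'items': [
        {'contentDetails': {'relatedPlaylists': {'uploads': 'UU_example'}}}
    ]
}
print(get_upload_playlist_id(sample_channel_response))  # UU_example
print(get_upload_playlist_id({}))                       # None
```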

3. Main scraper with `channel_id` and `upload_playlist_id`#

while True and limit:
- while True: creates an infinite loop: the code inside keeps executing until it is explicitly stopped with a break statement.
- limit is a counter that starts at 0 and tracks the number of pages retrieved. The loop continues while limit is less than or equal to 5, so at most 6 pages (limit values 0 to 5) are fetched.
nextPageToken:
- When you request a list of items (here, playlist items), the API may not return them all at once, especially when there are many. Instead it returns a limited number of items (at most 50 per request here, set by maxResults) together with a nextPageToken.
- Passing this nextPageToken as pageToken in the next request retrieves the next “page” of results. When the response contains no nextPageToken, the last page has been reached.
break:
- break is a control flow statement that exits a loop early.
- In this code, break stops the loop in two cases: when limit exceeds 5 (the if limit > 5 check), or when no nextPageToken is available. This caps the number of pages retrieved and prevents an infinite loop or excessive API requests.

videos = []
next_page_token = None
limit = 0

# Loop through the pages of the playlist items (videos)
while True:
    if limit > 5:  # Limit the number of pages to retrieve (adjust as needed)
        break
    playlist_items_response = youtube.playlistItems().list(
        part='snippet',
        maxResults=50,  
        playlistId=upload_playlist_id,
        pageToken=next_page_token
    ).execute()

    videos.extend(playlist_items_response.get('items', []))

    next_page_token = playlist_items_response.get('nextPageToken')
    if not next_page_token:
        break
    limit += 1
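The same pagination pattern can be tried out without an API key by separating the loop from the request. The sketch below takes any function that maps a page token to a response dictionary. The names `collect_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical, and the fake pages stand in for real API responses.

```python
def collect_pages(fetch_page, max_pages=6):
    """Follow nextPageToken until it runs out or max_pages is reached."""
    items = []
    token = None
    for _ in range(max_pages):
        response = fetch_page(token)
        items.extend(response.get('items', []))
        token = response.get('nextPageToken')
        if not token:
            break  # last page reached
    return items


# Fake three-page source standing in for the API (not real data)
FAKE_PAGES = {
    None: {'items': [1, 2], 'nextPageToken': 'p2'},
    'p2': {'items': [3, 4], 'nextPageToken': 'p3'},
    'p3': {'items': [5]},
}


def fake_fetch(token):
    return FAKE_PAGES[token]


print(collect_pages(fake_fetch))               # [1, 2, 3, 4, 5]
print(collect_pages(fake_fetch, max_pages=2))  # [1, 2, 3, 4]
```

With the real client, `fetch_page` could be a small function that calls `youtube.playlistItems().list(...)` with `pageToken=token` and returns the result of `.execute()`.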