AS05. Improving YouTube Scraper with ChatGPT#
A workable YouTube scraper#
api_key = "YOUR YOUTUBE API KEY"
channel_id = "YOUTUBE CHANNEL ID"
Remember to uncomment the following code (remove the enclosing triple quotes) before running it
"""
from googleapiclient.discovery import build
# Create a YouTube API client
youtube = build('youtube', 'v3', developerKey=api_key)
# Use the channels().list() method to get the default upload playlist ID of the channel
channel_response = youtube.channels().list(
part='contentDetails',
id=channel_id
).execute()
# Extract the upload playlist ID
if 'items' in channel_response:
upload_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']
# Use the playlistItems().list() method to get all video items from the upload playlist
videos = []
next_page_token = None
limit = 0
# Loop through the pages of the playlist items (videos)
while True:
if limit > 5: # Limit the number of pages to retrieve (adjust as needed)
break
playlist_items_response = youtube.playlistItems().list(
part='snippet',
maxResults=50,
playlistId=upload_playlist_id,
pageToken=next_page_token
).execute()
videos.extend(playlist_items_response.get('items', []))
next_page_token = playlist_items_response.get('nextPageToken')
if not next_page_token:
break
limit += 1
# Print the number of videos retrieved
print(f'Number of videos: {len(videos)}')
"""
"\nfrom googleapiclient.discovery import build\n\n# Create a YouTube API client\nyoutube = build('youtube', 'v3', developerKey=api_key)\n\n# Use the channels().list() method to get the default upload playlist ID of the channel\nchannel_response = youtube.channels().list(\n part='contentDetails',\n id=channel_id\n).execute()\n\n# Extract the upload playlist ID\nif 'items' in channel_response:\n upload_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']\n\n # Use the playlistItems().list() method to get all video items from the upload playlist\n\nvideos = []\nnext_page_token = None\nlimit = 0\n\n# Loop through the pages of the playlist items (videos)\nwhile True:\n if limit > 5: # Limit the number of pages to retrieve (adjust as needed)\n break\n playlist_items_response = youtube.playlistItems().list(\n part='snippet',\n maxResults=50, \n playlistId=upload_playlist_id,\n pageToken=next_page_token\n ).execute()\n\n videos.extend(playlist_items_response.get('items', []))\n\n next_page_token = playlist_items_response.get('nextPageToken')\n if not next_page_token:\n break\n limit += 1\n\n# Print the number of videos retrieved\nprint(f'Number of videos: {len(videos)}')\n"
videos[0]
Running videos[0] at this point raises NameError: name 'videos' is not defined, because the sample code above is still commented out.
1. Get more YouTube data (40%)#
Choose one YouTube channel of a large news media outlet such as TVBS, 關鍵時刻, 三立新聞, …
Modify the above sample code if needed
Retrieve more than 10,000 videos' video IDs, descriptions, titles, and published times
Store the crawled data into a dataframe video_df with columns ['videoId', 'title', 'description', 'publishedAt']
Print the head of the dataframe
Print the shape of the dataframe
# your code here
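One possible sketch, assuming api_key and channel_id are defined as above; it reuses the upload-playlist approach from the sample code and keeps paging until the target count is reached (the 10,000 target is adjustable):

import pandas as pd
from googleapiclient.discovery import build

youtube = build('youtube', 'v3', developerKey=api_key)

# Look up the channel's upload playlist, as in the sample code
channel_response = youtube.channels().list(
    part='contentDetails', id=channel_id
).execute()
upload_playlist_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']

rows = []
next_page_token = None
while len(rows) < 10000:  # target size; adjust as needed
    response = youtube.playlistItems().list(
        part='snippet',
        maxResults=50,
        playlistId=upload_playlist_id,
        pageToken=next_page_token
    ).execute()
    for item in response.get('items', []):
        snippet = item['snippet']
        rows.append({
            'videoId': snippet['resourceId']['videoId'],
            'title': snippet['title'],
            'description': snippet['description'],
            'publishedAt': snippet['publishedAt'],
        })
    next_page_token = response.get('nextPageToken')
    if not next_page_token:  # no more pages in the playlist
        break

video_df = pd.DataFrame(rows)
print(video_df.head())
print(video_df.shape)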
2. Cleaning and observing data#
2.1 Detect if there are duplicated rows (10%)#
Print the number of duplicated rows by
print(f"There are {nDuplicated} rows")
Drop duplicated rows and print the shape of the cleaned dataframe
print("The shape of cleaned dataframe is", ...)
# your code here
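A minimal sketch using pandas' duplicated() and drop_duplicates(), assuming video_df from task 1:

# Count exact duplicate rows, then drop them
nDuplicated = video_df.duplicated().sum()
print(f"There are {nDuplicated} duplicated rows")

video_df = video_df.drop_duplicates().reset_index(drop=True)
print("The shape of cleaned dataframe is", video_df.shape)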
2.2 Plot data timeline (10%)#
Each video has a publishedAt variable, which is the upload time of the video.
Plot a bar chart to show the productivity of the channel by hour (0~23).
# your code here
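A minimal sketch: parse publishedAt with pandas, extract the hour, and plot the hourly counts with matplotlib:

import pandas as pd
import matplotlib.pyplot as plt

# publishedAt is an ISO 8601 UTC timestamp; parse it into datetimes.
# If local hours are wanted, convert first, e.g. published.dt.tz_convert('Asia/Taipei')
published = pd.to_datetime(video_df['publishedAt'])
hour_counts = published.dt.hour.value_counts().reindex(range(24), fill_value=0)

hour_counts.plot(kind='bar')
plt.xlabel('Hour of day (0~23)')
plt.ylabel('Number of videos')
plt.show()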
3. Filter videos (10%)#
Detect whether one or more keywords such as “以色列” or “Israel” appear in the video title or description. If so, select these videos and save them as a new dataframe (e.g., related_df)
# your code here
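A minimal sketch using a regex alternation with pandas' str.contains; the keyword list is just a starting point:

keywords = ['以色列', 'Israel']  # extend with other spellings as needed
pattern = '|'.join(keywords)

# Keep rows whose title or description mentions any keyword
mask = (video_df['title'].str.contains(pattern, case=False, na=False)
        | video_df['description'].str.contains(pattern, case=False, na=False))
related_df = video_df[mask].reset_index(drop=True)
print(related_df.shape)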
4. Scraping comments of these videos (30%)#
For the previously crawled Israel-related videos, scrape all the comments from these videos and save them in a dataframe. Each row of your data should represent a single comment, and you should retain the videoId in the dataframe so that you know which video each comment belongs to.
Print head() of the dataframe
Print the shape of the dataframe
# your code here
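A minimal sketch, assuming the youtube client from task 1 is still in scope; it pages through commentThreads() for each video and skips videos whose comments are disabled:

import pandas as pd
from googleapiclient.errors import HttpError

comment_rows = []
for video_id in related_df['videoId']:
    next_page_token = None
    while True:
        try:
            response = youtube.commentThreads().list(
                part='snippet',
                videoId=video_id,
                maxResults=100,
                pageToken=next_page_token
            ).execute()
        except HttpError:
            break  # comments disabled (or quota exhausted); skip this video
        for item in response.get('items', []):
            top = item['snippet']['topLevelComment']['snippet']
            comment_rows.append({
                'videoId': video_id,  # keep the link back to the video
                'author': top['authorDisplayName'],
                'text': top['textDisplay'],
                'publishedAt': top['publishedAt'],
            })
        next_page_token = response.get('nextPageToken')
        if not next_page_token:
            break

comment_df = pd.DataFrame(comment_rows)
print(comment_df.head())
print(comment_df.shape)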
5. Store as SQLite (Optional)#
Make use of ChatGPT to understand how to use SQLite
5.1 Dataframe to SQLite DB (10%)#
Store the dataframe into a table videos in an SQLite database
# your code here
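A minimal sketch using pandas' to_sql; the database file name youtube.db is an assumption:

import sqlite3

conn = sqlite3.connect('youtube.db')  # 'youtube.db' is a hypothetical file name
video_df.to_sql('videos', conn, if_exists='replace', index=False)
conn.close()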
5.2 SQLite Query 1: Counting records (10%)#
Send an SQLite query in Python to print the number of records in the videos table.
# your code here
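A minimal sketch, assuming the same youtube.db file as in 5.1:

import sqlite3

conn = sqlite3.connect('youtube.db')
count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
print(f"There are {count} records in the videos table")
conn.close()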
5.3 SQLite Query 2: Filter records (10%)#
Send an SQLite query in Python to count how many videos’ titles contain the keywords “以巴”, “巴勒斯坦”, or “以色列”
# your code here
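A minimal sketch using SQL LIKE patterns against the same database:

import sqlite3

conn = sqlite3.connect('youtube.db')
query = """
    SELECT COUNT(*) FROM videos
    WHERE title LIKE '%以巴%'
       OR title LIKE '%巴勒斯坦%'
       OR title LIKE '%以色列%'
"""
print(conn.execute(query).fetchone()[0])
conn.close()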