P08 HTML Parsing - Crawling PTT#
Using BeautifulSoup Extract Text without Tags http://stackoverflow.com/questions/23380171/using-beautifulsoup-extract-text-without-tags
Scraper: comparing beautifulsoup vs. scrapy
I. Introduction#
1.1 Get url content by requests#
Try fetching the following link: click it to observe the page in a browser, then retrieve it with requests.get().
import requests
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
print(url)
response = requests.get(url, timeout=(2, 3))
print(response.text[:500])
https://www.ptt.cc/bbs/Boy-Girl/index.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>看板 Boy-Girl 文章列表 - 批踢踢實業坊</title>
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//imag
1.2 Using Chrome DevTools to get right elements#
1.3 Parsing url content by beautifulsoup#
Beautifulsoup4 https://www.crummy.com/software/BeautifulSoup/bs4/doc/
II. Get links from a page#
2.1 Getting response#
import requests
from bs4 import BeautifulSoup
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
2.2 Converting webpage to a soup object#
soup = BeautifulSoup(response.text, 'html.parser')
print(type(soup))
print(soup.prettify())
<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>
看板 Boy-Girl 文章列表 - 批踢踢實業坊
</title>
<link href="//images.ptt.cc/bbs/v2.27/bbs-common.css" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/bbs-custom.css" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print" rel="stylesheet" type="text/css"/>
</head>
<body>
<div id="topbar-container">
<div class="bbs-content" id="topbar">
<a href="/bbs/" id="logo">
批踢踢實業坊
</a>
<span>
›
</span>
<a class="board" href="/bbs/Boy-Girl/index.html">
<span class="board-label">
看板
</span>
Boy-Girl
</a>
<a class="right small" href="/about.html">
關於我們
</a>
<a class="right small" href="/contact.html">
聯絡資訊
</a>
</div>
</div>
<div id="main-container">
<div id="action-bar-container">
<div class="action-bar">
<div class="btn-group btn-group-dir">
<a class="btn selected" href="/bbs/Boy-Girl/index.html">
看板
</a>
<a class="btn" href="/man/Boy-Girl/index.html">
精華區
</a>
</div>
<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/Boy-Girl/index1.html">
最舊
</a>
<a class="btn wide" href="/bbs/Boy-Girl/index6502.html">
‹ 上頁
</a>
<a class="btn wide disabled">
下頁 ›
</a>
<a class="btn wide" href="/bbs/Boy-Girl/index.html">
最新
</a>
</div>
</div>
</div>
<div class="r-list-container action-bar-margin bbs-screen">
<div class="search-bar">
<form action="search" id="search-bar" type="get">
<input class="query" name="q" placeholder="搜尋文章⋯" type="text" value=""/>
</form>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
8
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1743052562.A.D43.html">
Re: [討論] 有人被迫請客過嗎?
</a>
</div>
<div class="meta">
<div class="author">
Redwing13
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E8%A8%8E%E8%AB%96%5D+%E6%9C%89%E4%BA%BA%E8%A2%AB%E8%BF%AB%E8%AB%8B%E5%AE%A2%E9%81%8E%E5%97%8E%3F">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3ARedwing13">
搜尋看板內 Redwing13 的文章
</a>
</div>
</div>
</div>
<div class="date">
3/27
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
2
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1743054682.A.9D2.html">
Re: [討論] 有人被迫請客過嗎?
</a>
</div>
<div class="meta">
<div class="author">
CuLiZn5566
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E8%A8%8E%E8%AB%96%5D+%E6%9C%89%E4%BA%BA%E8%A2%AB%E8%BF%AB%E8%AB%8B%E5%AE%A2%E9%81%8E%E5%97%8E%3F">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3ACuLiZn5566">
搜尋看板內 CuLiZn5566 的文章
</a>
</div>
</div>
</div>
<div class="date">
3/27
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
1
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1743056882.A.2FE.html">
Re: [討論] 奇怪,為什麼不吃飯前就先講好AA?
</a>
</div>
<div class="meta">
<div class="author">
kousun
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E8%A8%8E%E8%AB%96%5D+%E5%A5%87%E6%80%AA%EF%BC%8C%E7%82%BA%E4%BB%80%E9%BA%BC%E4%B8%8D%E5%90%83%E9%A3%AF%E5%89%8D%E5%B0%B1%E5%85%88%E8%AC%9B%E5%A5%BDAA%3F">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Akousun">
搜尋看板內 kousun 的文章
</a>
</div>
</div>
</div>
<div class="date">
3/27
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-list-sep">
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
2
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1727313746.A.8E6.html">
[公告] fantasy03102 文章刪文
</a>
</div>
<div class="meta">
<div class="author">
gogin
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+fantasy03102+%E6%96%87%E7%AB%A0%E5%88%AA%E6%96%87">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Agogin">
搜尋看板內 gogin 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/26
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
5
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1729589160.A.FF1.html">
[公告] 板規四侮辱標準2025.01更新
</a>
</div>
<div class="meta">
<div class="author">
eulbos
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E6%9D%BF%E8%A6%8F%E5%9B%9B%E4%BE%AE%E8%BE%B1%E6%A8%99%E6%BA%962025.01%E6%9B%B4%E6%96%B0">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aeulbos">
搜尋看板內 eulbos 的文章
</a>
</div>
</div>
</div>
<div class="date">
10/22
</div>
<div class="mark">
M
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
1
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1732475209.A.AAE.html">
[公告] 關於肉搜/公開個資
</a>
</div>
<div class="meta">
<div class="author">
eulbos
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E9%97%9C%E6%96%BC%E8%82%89%E6%90%9C%2F%E5%85%AC%E9%96%8B%E5%80%8B%E8%B3%87">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aeulbos">
搜尋看板內 eulbos 的文章
</a>
</div>
</div>
</div>
<div class="date">
11/25
</div>
<div class="mark">
M
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f3">
15
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1735917789.A.F1A.html">
[公告] 置底檢舉暨閒聊區2025.01~
</a>
</div>
<div class="meta">
<div class="author">
eulbos
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E7%BD%AE%E5%BA%95%E6%AA%A2%E8%88%89%E6%9A%A8%E9%96%92%E8%81%8A%E5%8D%802025.01~">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aeulbos">
搜尋看板內 eulbos 的文章
</a>
</div>
</div>
</div>
<div class="date">
1/03
</div>
<div class="mark">
M
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
5
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1736495718.A.C52.html">
[公告] Boy-Girl板規 25.01.13
</a>
</div>
<div class="meta">
<div class="author">
eulbos
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+Boy-Girl%E6%9D%BF%E8%A6%8F+25.01.13">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aeulbos">
搜尋看板內 eulbos 的文章
</a>
</div>
</div>
</div>
<div class="date">
1/10
</div>
<div class="mark">
M
</div>
</div>
</div>
</div>
</div>
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-DZ6Y3BY9GW">
</script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-DZ6Y3BY9GW');
</script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-32365737-1', {
cookieDomain: 'ptt.cc',
legacyCookieDomain: 'ptt.cc'
});
ga('send', 'pageview');
</script>
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js">
</script>
<script src="//images.ptt.cc/bbs/v2.27/bbs.js">
</script>
<script>
(function(){function c(){var b=a.contentDocument||a.contentWindow.document;if(b){var d=b.createElement('script');d.innerHTML="window.__CF$cv$params={r:'926d1d420a298267',t:'MTc0MzA1OTkxMi4wMDAwMDA='};var a=document.createElement('script');a.nonce='';a.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js';document.getElementsByTagName('head')[0].appendChild(a);";b.getElementsByTagName('head')[0].appendChild(d)}}if(document.body){var a=document.createElement('iframe');a.height=1;a.width=1;a.style.position='absolute';a.style.top=0;a.style.left=0;a.style.border='none';a.style.visibility='hidden';document.body.appendChild(a);if('loading'!==document.readyState)c();else if(window.addEventListener)document.addEventListener('DOMContentLoaded',c);else{var e=document.onreadystatechange||function(){};document.onreadystatechange=function(b){e(b);'loading'!==document.readyState&&(document.onreadystatechange=e,c())}}}})();
</script>
</body>
</html>
2.3 Traversing an html file by soup#
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
<title>看板 Boy-Girl 文章列表 - 批踢踢實業坊</title>
title
看板 Boy-Girl 文章列表 - 批踢踢實業坊
head
2.4 Finding all links in the doc#
print(type(soup.find_all("a")))
print(len(soup.find_all("a")))
soup.find_all("a")[:5]
<class 'bs4.element.ResultSet'>
34
[<a href="/bbs/" id="logo">批踢踢實業坊</a>,
<a class="board" href="/bbs/Boy-Girl/index.html"><span class="board-label">看板 </span>Boy-Girl</a>,
<a class="right small" href="/about.html">關於我們</a>,
<a class="right small" href="/contact.html">聯絡資訊</a>,
<a class="btn selected" href="/bbs/Boy-Girl/index.html">看板</a>]
2.5 Extracting element content and attribute#
nodes_a = soup.find_all('a')
print(nodes_a[0])
print(nodes_a[0].text)
print(nodes_a[0].get('href'))
<a href="/bbs/" id="logo">批踢踢實業坊</a>
批踢踢實業坊
/bbs/
for link in soup.find_all('a')[:5]:
    print(link.get('href'))
/bbs/
/bbs/Boy-Girl/index.html
/about.html
/contact.html
/bbs/Boy-Girl/index.html
2.6 append() links into a list#
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
print(len(links))
34
III. Get article links from the first page#
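Note how the class is written in this line of code (class is a reserved word in Python, so BeautifulSoup uses the keyword argument class_):
for link in soup.find_all(class_ = "r-ent"):
str.strip() removes the whitespace at the beginning and end of a string; whitespace in the middle is kept.
astring = " 123123 \n123123 "
astring.strip()
[out]:"123123 \n123123"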
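As a quick illustration of the note on str.strip(), the sketch below contrasts it with its one-sided variants lstrip() and rstrip(). The sample string is made up for this example.

```python
# Hypothetical sample string with whitespace at both ends and in the middle
raw = "  123123 \n123123  \n"

stripped = raw.strip()      # removes leading and trailing whitespace only
left_only = raw.lstrip()    # removes leading whitespace only
right_only = raw.rstrip()   # removes trailing whitespace only

# repr() makes the remaining whitespace visible
print(repr(stripped))
print(repr(left_only))
print(repr(right_only))
```

In all three cases the space and newline in the middle of the string are left untouched.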
3.1 Get elements by a specific class#
# Just re-run this
import requests
from bs4 import BeautifulSoup
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
soup = BeautifulSoup(response.text, 'html.parser')
for div in soup.find_all(class_ = "r-ent")[:3]:
    print(div.prettify())
    print("-"*80)
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
8
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1743052562.A.D43.html">
Re: [討論] 有人被迫請客過嗎?
</a>
</div>
<div class="meta">
<div class="author">
Redwing13
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E8%A8%8E%E8%AB%96%5D+%E6%9C%89%E4%BA%BA%E8%A2%AB%E8%BF%AB%E8%AB%8B%E5%AE%A2%E9%81%8E%E5%97%8E%3F">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3ARedwing13">
搜尋看板內 Redwing13 的文章
</a>
</div>
</div>
</div>
<div class="date">
3/27
</div>
<div class="mark">
</div>
</div>
</div>
--------------------------------------------------------------------------------
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
2
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1743054682.A.9D2.html">
Re: [討論] 有人被迫請客過嗎?
</a>
</div>
<div class="meta">
<div class="author">
CuLiZn5566
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E8%A8%8E%E8%AB%96%5D+%E6%9C%89%E4%BA%BA%E8%A2%AB%E8%BF%AB%E8%AB%8B%E5%AE%A2%E9%81%8E%E5%97%8E%3F">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3ACuLiZn5566">
搜尋看板內 CuLiZn5566 的文章
</a>
</div>
</div>
</div>
<div class="date">
3/27
</div>
<div class="mark">
</div>
</div>
</div>
--------------------------------------------------------------------------------
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
1
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1743056882.A.2FE.html">
Re: [討論] 奇怪,為什麼不吃飯前就先講好AA?
</a>
</div>
<div class="meta">
<div class="author">
kousun
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E8%A8%8E%E8%AB%96%5D+%E5%A5%87%E6%80%AA%EF%BC%8C%E7%82%BA%E4%BB%80%E9%BA%BC%E4%B8%8D%E5%90%83%E9%A3%AF%E5%89%8D%E5%B0%B1%E5%85%88%E8%AC%9B%E5%A5%BDAA%3F">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Akousun">
搜尋看板內 kousun 的文章
</a>
</div>
</div>
</div>
<div class="date">
3/27
</div>
<div class="mark">
</div>
</div>
</div>
--------------------------------------------------------------------------------
3.2 soup.select() with CSS selector#
for div in soup.select(".r-ent")[:10]:
    print(div.get_text(strip = True))
    print("-"*80)
8Re: [討論] 有人被迫請客過嗎?Redwing13⋯搜尋同標題文章搜尋看板內 Redwing13 的文章3/27
--------------------------------------------------------------------------------
2Re: [討論] 有人被迫請客過嗎?CuLiZn5566⋯搜尋同標題文章搜尋看板內 CuLiZn5566 的文章3/27
--------------------------------------------------------------------------------
1Re: [討論] 奇怪,為什麼不吃飯前就先講好AA?kousun⋯搜尋同標題文章搜尋看板內 kousun 的文章3/27
--------------------------------------------------------------------------------
2[公告] fantasy03102 文章刪文gogin⋯搜尋同標題文章搜尋看板內 gogin 的文章9/26
--------------------------------------------------------------------------------
5[公告] 板規四侮辱標準2025.01更新eulbos⋯搜尋同標題文章搜尋看板內 eulbos 的文章10/22M
--------------------------------------------------------------------------------
1[公告] 關於肉搜/公開個資eulbos⋯搜尋同標題文章搜尋看板內 eulbos 的文章11/25M
--------------------------------------------------------------------------------
15[公告] 置底檢舉暨閒聊區2025.01~eulbos⋯搜尋同標題文章搜尋看板內 eulbos 的文章1/03M
--------------------------------------------------------------------------------
5[公告] Boy-Girl板規 25.01.13eulbos⋯搜尋同標題文章搜尋看板內 eulbos 的文章1/10M
--------------------------------------------------------------------------------
for div in soup.find_all(class_ = "r-ent")[:10]:
    print(div.get_text(strip = True))
    print("-"*80)
8Re: [討論] 有人被迫請客過嗎?Redwing13⋯搜尋同標題文章搜尋看板內 Redwing13 的文章3/27
--------------------------------------------------------------------------------
2Re: [討論] 有人被迫請客過嗎?CuLiZn5566⋯搜尋同標題文章搜尋看板內 CuLiZn5566 的文章3/27
--------------------------------------------------------------------------------
1Re: [討論] 奇怪,為什麼不吃飯前就先講好AA?kousun⋯搜尋同標題文章搜尋看板內 kousun 的文章3/27
--------------------------------------------------------------------------------
2[公告] fantasy03102 文章刪文gogin⋯搜尋同標題文章搜尋看板內 gogin 的文章9/26
--------------------------------------------------------------------------------
5[公告] 板規四侮辱標準2025.01更新eulbos⋯搜尋同標題文章搜尋看板內 eulbos 的文章10/22M
--------------------------------------------------------------------------------
1[公告] 關於肉搜/公開個資eulbos⋯搜尋同標題文章搜尋看板內 eulbos 的文章11/25M
--------------------------------------------------------------------------------
15[公告] 置底檢舉暨閒聊區2025.01~eulbos⋯搜尋同標題文章搜尋看板內 eulbos 的文章1/03M
--------------------------------------------------------------------------------
5[公告] Boy-Girl板規 25.01.13eulbos⋯搜尋同標題文章搜尋看板內 eulbos 的文章1/10M
--------------------------------------------------------------------------------
Get href of <a>#
for div in soup.find_all(class_ = "r-ent")[:5]:
    print(div.find(class_='nrec').text.strip())
    print(div.find(class_='date').text.strip())
    print(div.find(class_='author').text.strip())
    print(div.find(class_="title").text.strip())
    print(div.find(class_='title').a['href'])
    print("-"*80)
8
3/27
Redwing13
Re: [討論] 有人被迫請客過嗎?
/bbs/Boy-Girl/M.1743052562.A.D43.html
--------------------------------------------------------------------------------
2
3/27
CuLiZn5566
Re: [討論] 有人被迫請客過嗎?
/bbs/Boy-Girl/M.1743054682.A.9D2.html
--------------------------------------------------------------------------------
1
3/27
kousun
Re: [討論] 奇怪,為什麼不吃飯前就先講好AA?
/bbs/Boy-Girl/M.1743056882.A.2FE.html
--------------------------------------------------------------------------------
2
9/26
gogin
[公告] fantasy03102 文章刪文
/bbs/Boy-Girl/M.1727313746.A.8E6.html
--------------------------------------------------------------------------------
5
10/22
eulbos
[公告] 板規四侮辱標準2025.01更新
/bbs/Boy-Girl/M.1729589160.A.FF1.html
--------------------------------------------------------------------------------
3.3 Using try and except to handle exceptions#
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout = (1, 2))
soup = BeautifulSoup(response.text, 'html.parser')
for div in soup.find_all(class_ = "r-ent")[:3]:
    print(div.find(class_="title").text.strip())
    try:
        print(div.find(class_='title').a['href'])
    except TypeError:  # a removed post has no <a> tag, so .a is None
        print("The Page was removed")
Re: [討論] 有人被迫請客過嗎?
/bbs/Boy-Girl/M.1743052562.A.D43.html
Re: [討論] 有人被迫請客過嗎?
/bbs/Boy-Girl/M.1743054682.A.9D2.html
Re: [討論] 奇怪,為什麼不吃飯前就先講好AA?
/bbs/Boy-Girl/M.1743056882.A.2FE.html
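The try/except above guards against list entries whose title has no link, presumably because the post was removed. A minimal sketch of an alternative is to test for the missing <a> tag explicitly instead of catching the exception. The HTML below is hand-written for illustration and only imitates the structure of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one normal entry and one removed entry without an <a> tag
html = """
<div class="r-ent">
  <div class="title"><a href="/bbs/Boy-Girl/M.1.A.AAA.html">A normal post</a></div>
</div>
<div class="r-ent">
  <div class="title">(this post has been deleted)</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for div in soup.find_all(class_="r-ent"):
    title_div = div.find(class_="title")
    a_tag = title_div.a                      # None when the title has no <a> child
    href = a_tag["href"] if a_tag is not None else None
    results.append((title_div.text.strip(), href))

for title, href in results:
    print(title, "->", href)
```

Checking for None keeps the normal path free of exceptions, while try/except is shorter when removed posts are rare.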
3.4 Add the prefix url to each url#
pre = 'https://www.ptt.cc'
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
soup = BeautifulSoup(response.text, 'html.parser')
for div in soup.find_all(class_ = "r-ent"):
    print(div.find(class_="title").text.strip())
    try:
        print(pre + div.find(class_='title').a['href'])
    except TypeError:  # a removed post has no <a> tag, so .a is None
        pass
Re: [討論] 有人被迫請客過嗎?
https://www.ptt.cc/bbs/Boy-Girl/M.1743052562.A.D43.html
Re: [討論] 有人被迫請客過嗎?
https://www.ptt.cc/bbs/Boy-Girl/M.1743054682.A.9D2.html
Re: [討論] 奇怪,為什麼不吃飯前就先講好AA?
https://www.ptt.cc/bbs/Boy-Girl/M.1743056882.A.2FE.html
[公告] fantasy03102 文章刪文
https://www.ptt.cc/bbs/Boy-Girl/M.1727313746.A.8E6.html
[公告] 板規四侮辱標準2025.01更新
https://www.ptt.cc/bbs/Boy-Girl/M.1729589160.A.FF1.html
[公告] 關於肉搜/公開個資
https://www.ptt.cc/bbs/Boy-Girl/M.1732475209.A.AAE.html
[公告] 置底檢舉暨閒聊區2025.01~
https://www.ptt.cc/bbs/Boy-Girl/M.1735917789.A.F1A.html
[公告] Boy-Girl板規 25.01.13
https://www.ptt.cc/bbs/Boy-Girl/M.1736495718.A.C52.html
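Concatenating a fixed prefix works here because every href on the page starts with "/". As a hedged alternative, the standard library's urllib.parse.urljoin resolves a link against the URL of the page it was found on; the sketch below reuses one href from the output above, plus a made-up absolute URL to show that absolute links pass through unchanged.

```python
from urllib.parse import urljoin

base = "https://www.ptt.cc/bbs/Boy-Girl/index.html"   # page the href was found on
href = "/bbs/Boy-Girl/M.1743052562.A.D43.html"        # root-relative link from that page

full_url = urljoin(base, href)
print(full_url)

# An href that is already absolute is returned as-is (hypothetical URL)
absolute = urljoin(base, "https://example.com/page.html")
print(absolute)
```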
3.5 Appending scraped urls to a list#
links = []
pre = 'https://www.ptt.cc'
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
soup = BeautifulSoup(response.text, 'html.parser')
for div in soup.find_all(class_ = "r-ent"):
    try:
        links.append(pre + div.find(class_='title').a['href'])
    except TypeError:  # a removed post has no <a> tag, so .a is None
        pass
print(len(links))
8
IV. Get more links from more pages#
https://www.ptt.cc/bbs/Boy-Girl/M.1463279135.A.825.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463280694.A.659.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463280951.A.42C.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463281950.A.094.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463281956.A.CE1.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463286737.A.ECC.html
...
4.1 Get the last page#
import re
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(1, 2))
soup = BeautifulSoup(response.text, 'html.parser')
last2nd = soup.find(class_='action-bar').find(class_='btn-group-paging').find_all(class_='btn')[1].get('href')
print(last2nd)
lastpage = int(re.search(r"index(\d+)\.html", last2nd).group(1)) + 1  # raw string avoids invalid-escape warnings
print(lastpage)
/bbs/Boy-Girl/index6502.html
6503
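The page-number extraction can be isolated into a small helper, which also makes the no-match case explicit. This is only a sketch: the function name page_number is made up here, and the href is the "previous page" link printed above.

```python
import re

def page_number(href):
    """Return the integer N in '.../indexN.html', or None if the href has no number."""
    match = re.search(r"index(\d+)\.html", href)
    return int(match.group(1)) if match else None

prev_href = "/bbs/Boy-Girl/index6502.html"   # "previous page" link from the output above
lastpage = page_number(prev_href) + 1        # the newest page is one past the previous page
print(lastpage)

# The unnumbered index page has no page number to extract
print(page_number("/bbs/Boy-Girl/index.html"))
```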
4.2 Get the last 5 pages#
pre = 'https://www.ptt.cc'
links = []
lastpage = 5303  # hard-coded from an earlier run (matches the output below); use the value from 4.1 for the current board
for i in range(lastpage, lastpage-5, -1):
    url = 'https://www.ptt.cc/bbs/Boy-Girl/index{}.html'.format(i)
    response = requests.get(url, timeout=(1, 2))
    soup = BeautifulSoup(response.text, 'html.parser')
    for div in soup.find_all(class_='r-ent'):
        try:
            links.append(pre + div.find(class_='title').a['href'])
        except TypeError:  # a removed post has no <a> tag, so .a is None
            pass
    print(i, "\t", len(links))
5303 20
5302 40
5301 60
5300 80
5299 100
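The list of index URLs can be built separately from the crawling loop, which makes the page range easy to check before sending any request. A minimal sketch, assuming the URL pattern used above; the helper name last_page_urls and its parameters are made up for this example.

```python
def last_page_urls(lastpage, n, board="Boy-Girl"):
    """Return the URLs of the n newest index pages, newest first.

    Stops at page 1 if the board has fewer than n pages.
    """
    first = max(lastpage - n + 1, 1)
    return [
        "https://www.ptt.cc/bbs/{}/index{}.html".format(board, i)
        for i in range(lastpage, first - 1, -1)
    ]

urls = last_page_urls(5303, 5)
for u in urls:
    print(u)
```

Changing the second argument is then the only edit needed to crawl 10 pages instead of 5.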
Notes: break, continue, and pass#
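break: when the condition is met, exit the loop and stop iterating.
continue: when the condition is met, skip the rest of the current iteration and move on to the next one.
pass: do nothing.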
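The three statements can be seen side by side in one small loop. The numbers are arbitrary and chosen only to trigger each branch once.

```python
numbers = [1, 2, 3, 4, 5, 6]

seen = []
for n in numbers:
    if n == 2:
        continue   # skip 2 and go straight to the next iteration
    if n == 5:
        break      # leave the loop entirely; 5 and 6 are never processed
    if n == 3:
        pass       # placeholder that does nothing; execution simply continues
    seen.append(n)

print(seen)
```

Only 1, 3, and 4 reach the append: 2 is skipped by continue, and the loop ends at 5.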
V. Getting post content for each post link#
BeautifulSoup offers two ways to locate elements and their attributes:
soup.find() and soup.find_all(), which match by tag name and attributes
soup.select(), which accepts a CSS selector
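To compare the two styles, the sketch below extracts the same links once with find_all() and once with select(). The HTML is hand-written for illustration and only imitates the structure of a PTT list page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML imitating two entries of a list page
html = """
<div class="r-ent">
  <div class="title"><a href="/bbs/Boy-Girl/M.1.A.AAA.html">First post</a></div>
  <div class="author">alice</div>
</div>
<div class="r-ent">
  <div class="title"><a href="/bbs/Boy-Girl/M.2.A.BBB.html">Second post</a></div>
  <div class="author">bob</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Style 1: find_all() filters by class, then by tag name
by_find = [a["href"]
           for div in soup.find_all(class_="title")
           for a in div.find_all("a")]

# Style 2: select() expresses the same path as one CSS selector
by_select = [a["href"] for a in soup.select("div.title > a")]

# find() returns only the first match
first_author = soup.find(class_="author").text

print(by_find)
print(by_select)
print(first_author)
```

Both styles return the same elements here; a CSS selector tends to be more compact when the path spans several nested levels.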
len(links)
100
for link in links[:5]:
    print(link)
https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A.E09.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626154987.A.EE7.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626160555.A.F87.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626169226.A.60B.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A.611.html
for link in links[:5]:
    res = requests.get(link, timeout=(1,2))
    soup = BeautifulSoup(res.text, 'html.parser')
    print(type(soup))
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
5.1 Get metadata#
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    print(len(metas))
    print(metas[0].text)
    print(metas[1].text)
    print(metas[2].text)
https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A.E09.html
4
SaChiA5566 (煞氣ㄟ5566)
Boy-Girl
Re: [分享] 在歐兔徵友 遇到的情況
https://www.ptt.cc/bbs/Boy-Girl/M.1626154987.A.EE7.html
4
koil (雞兔哥)
Boy-Girl
[心情] 還不要見面的好
https://www.ptt.cc/bbs/Boy-Girl/M.1626160555.A.F87.html
4
xakg (夤)
Boy-Girl
Re: R: [分享] 在歐兔徵友 遇到的情況
https://www.ptt.cc/bbs/Boy-Girl/M.1626169226.A.60B.html
4
skywendy (天龍溫蒂)
Boy-Girl
Fw: [問卦] 為啥性無能能構成離婚事由
https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A.611.html
4
duck08 (一隻鴨子跳進湖)
Boy-Girl
[分享] 徵友不是只有氧氣
5.2 Assign data to variables#
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A.E09.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626154987.A.EE7.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626160555.A.F87.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626169226.A.60B.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A.611.html
5.3 Get all content#
alist = [1, 2, 3]
print(isinstance(alist, tuple))
: False
from bs4 import NavigableString
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    if len(metas) == 0:
        continue
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
    content = ''
    for text in soup.find(id='main-content'):
        # keep only bare text nodes; child tags (metadata, comments) are skipped
        if isinstance(text, NavigableString):
            content += text.strip()
    print(len(content)) # number of characters in the post body
https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A.E09.html
699
https://www.ptt.cc/bbs/Boy-Girl/M.1626154987.A.EE7.html
132
https://www.ptt.cc/bbs/Boy-Girl/M.1626160555.A.F87.html
948
https://www.ptt.cc/bbs/Boy-Girl/M.1626169226.A.60B.html
277
https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A.611.html
810
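The isinstance() check is what separates the post body from the metadata and comment blocks: iterating over a tag yields its direct children, and only the bare text nodes are NavigableString objects. The sketch below shows this on hand-written HTML that merely imitates the layout of a post page; the class names are reused from the code above, the content is made up.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical HTML: metadata and a comment are child tags, the body is bare text
html = (
    '<div id="main-content">'
    '<div class="article-metaline"><span class="article-meta-value">someone</span></div>'
    'Body line one.\nBody line two.\n'
    '<span class="f2">signature</span>'
    '<div class="push">a comment</div>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
main = soup.find(id="main-content")

content = ""
for child in main:                          # direct children only
    if isinstance(child, NavigableString):  # text node, not a Tag
        content += child.strip()

print(repr(content))
```

Text nested inside child tags (the author name, the signature, the comment) never reaches content, because those children are Tag objects.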
5.4 Save them all to a dictionary and append to a list#
from bs4 import NavigableString
all_post = []
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    if len(metas) == 0:
        continue
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
    content = ''
    for text in soup.find(id='main-content'):
        if isinstance(text, NavigableString):
            content += text.strip()
    all_post.append({'author':author,
                     'link':link,
                     'title':title,
                     'timestamp':timestamp,
                     'content':content})
https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A.E09.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626154987.A.EE7.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626160555.A.F87.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626169226.A.60B.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A.611.html
import pandas as pd
pd.DataFrame(all_post)
 | author | link | title | timestamp | content |
---|---|---|---|---|---|
0 | SaChiA5566 (煞氣ㄟ5566) | https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A... | Re: [分享] 在歐兔徵友 遇到的情況 | Tue Jul 13 12:54:12 2021 | 光你打的字\n\n其實就會覺得你是一個蠻吸引人的人\n\n其實文章文字都會或多或少透露出這個... |
1 | koil (雞兔哥) | https://www.ptt.cc/bbs/Boy-Girl/M.1626154987.A... | [心情] 還不要見面的好 | Tue Jul 13 13:43:05 2021 | 最近玩線上遊戲\n愛上了一個女網友\n每天都會打開看著她的人物發呆\n等她上線私訊她\n漸漸... |
2 | xakg (夤) | https://www.ptt.cc/bbs/Boy-Girl/M.1626160555.A... | Re: R: [分享] 在歐兔徵友 遇到的情況 | Tue Jul 13 15:15:51 2021 | 其實我看完原PO的文章,我第一個也是反思自己。\n\n很怕自己也會落入這種狀態。\n「開車、... |
3 | skywendy (天龍溫蒂) | https://www.ptt.cc/bbs/Boy-Girl/M.1626169226.A... | Fw: [問卦] 為啥性無能能構成離婚事由 | Tue Jul 13 17:40:25 2021 | 作者: skywendy (天龍溫蒂) 看板: Gossiping\n標題: [問卦] 為啥... |
4 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... |
VI. Scraping comments#
from bs4 import NavigableString
all_post = []
for link in links[:5]:
    print(link)
    # Store the response in `res`, the name parsed on the next line. (The recorded
    # output below came from a run where a stale `res` left over from 5.4 was parsed
    # for every link, which is why each row repeats the same post.)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    if len(metas) == 0:
        continue
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
    content = ''
    for text in soup.find(id='main-content'):
        if isinstance(text, NavigableString):
            content += text.strip()
    comments = []
    for push in soup.find_all(class_="push"):
        push_tag = push.find(class_='push-tag').text
        push_userid = push.find(class_='push-userid').text
        push_content = push.find(class_='push-content').text
        push_ipdatetime = push.find(class_='push-ipdatetime').text
        # print(push_userid, push_content, push_ipdatetime)
        comments.append({'tag': push_tag,'userid':push_userid,
                         'content':push_content,
                         'timestamp':push_ipdatetime})
    all_post.append({'author':author,
                     'link':link,
                     'title':title,
                     'timestamp':timestamp,
                     'content':content,
                     'comments':comments})
https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A.E09.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626154987.A.EE7.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626160555.A.F87.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626169226.A.60B.html
https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A.611.html
pd.DataFrame(all_post)
 | author | link | title | timestamp | content | comments |
---|---|---|---|---|---|---|
0 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | [{'tag': '推 ', 'userid': 'death840922', 'conte... |
1 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626154987.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | [{'tag': '推 ', 'userid': 'death840922', 'conte... |
2 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626160555.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | [{'tag': '推 ', 'userid': 'death840922', 'conte... |
3 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626169226.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | [{'tag': '推 ', 'userid': 'death840922', 'conte... |
4 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | [{'tag': '推 ', 'userid': 'death840922', 'conte... |
pd.DataFrame(all_post).explode('comments')
 | author | link | title | timestamp | content | comments |
---|---|---|---|---|---|---|
0 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | {'tag': '推 ', 'userid': 'death840922', 'conten... |
0 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | {'tag': '→ ', 'userid': 'death840922', 'conten... |
0 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | {'tag': '→ ', 'userid': 'plutox', 'content': '... |
0 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | {'tag': '推 ', 'userid': 'zero955147', 'content... |
0 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626152059.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | {'tag': '→ ', 'userid': 'zero955147', 'content... |
... | ... | ... | ... | ... | ... | ... |
4 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | {'tag': '推 ', 'userid': 'valinor', 'content': ... |
4 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | {'tag': '→ ', 'userid': 'yyqq999', 'content': ... |
4 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | {'tag': '→ ', 'userid': 'yyqq999', 'content': ... |
4 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | {'tag': '→ ', 'userid': 'yyqq999', 'content': ... |
4 | duck08 (一隻鴨子跳進湖) | https://www.ptt.cc/bbs/Boy-Girl/M.1626173787.A... | [分享] 徵友不是只有氧氣 | Tue Jul 13 18:56:22 2021 | 看到一連串的氧氣版徵友討論\n許多人的重點放在於[男徵女]遠遠大過於[女徵男]\n認為供需失... | {'tag': '推 ', 'userid': 'j0987 ', 'conte... |
665 rows × 6 columns
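explode() leaves each comment as a dictionary inside a single column. If one column per comment field is wanted, the nested structure can also be flattened in plain Python before building the DataFrame. This is a sketch on made-up data shaped like all_post, not the scraped posts themselves.

```python
# Hypothetical data with the same shape as all_post (fields shortened)
posts = [
    {"title": "Post A",
     "comments": [{"tag": "推", "userid": "u1", "content": "nice"},
                  {"tag": "→", "userid": "u2", "content": "ok"}]},
    {"title": "Post B", "comments": []},
]

# One row per comment: the post's title followed by the comment's own fields
rows = [
    {"title": post["title"], **comment}
    for post in posts
    for comment in post["comments"]
]

for row in rows:
    print(row)
```

Note one difference: a post without comments produces no row here, whereas explode() keeps such a post as a single row with a missing value.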
VII. Saving data#
import pickle
# with open('ptt1092_testing.pkl', 'wb') as fout:  # 'wb': pickle writes binary data
#     pickle.dump(all_post, fout)
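A minimal round-trip sketch of the commented-out step: dump the list of posts to a file, then load it back and check that nothing was lost. The data and the file name ptt_posts.pkl are placeholders for this example, and the file is written to a temporary directory so that nothing in the working directory is overwritten.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the scraped all_post list
all_post = [{"author": "someone", "title": "a title", "comments": []}]

path = os.path.join(tempfile.mkdtemp(), "ptt_posts.pkl")  # placeholder file name

with open(path, "wb") as fout:      # binary write mode
    pickle.dump(all_post, fout)

with open(path, "rb") as fin:       # binary read mode
    restored = pickle.load(fin)

print(restored == all_post)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.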