P08 HTML Parsing - Crawling PTT#
Using BeautifulSoup Extract Text without Tags http://stackoverflow.com/questions/23380171/using-beautifulsoup-extract-text-without-tags
Scraper: comparing beautifulsoup vs. scrapy
I. Introduction#
1.1 Get url content by requests#
Try fetching the following link: open it in a browser to see what the page looks like, then retrieve it with
requests.get().
import requests
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
print(url)
response = requests.get(url, timeout=(2, 3))
print(response.text[:500])
https://www.ptt.cc/bbs/Boy-Girl/index.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>看板 Boy-Girl 文章列表 - 批踢踢實業坊</title>
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//imag
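The timeout=(2, 3) argument above gives requests a 2-second connect timeout and a 3-second read timeout. A minimal sketch of a more defensive fetch, assuming a hypothetical helper named fetch_html (it is not part of the original notebook), might check the HTTP status and catch network errors instead of letting them stop the script:

```python
import requests

def fetch_html(url, timeout=(2, 3)):
    """Return the page text, or None if the request fails.

    fetch_html is a hypothetical helper name used only for illustration.
    timeout is (connect timeout, read timeout) in seconds.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as err:
        # Covers timeouts, connection errors, malformed URLs, and HTTP errors
        print("Request failed:", err)
        return None
    return response.text
```

Calling fetch_html('https://www.ptt.cc/bbs/Boy-Girl/index.html') should return the same text as response.text above when the request succeeds.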
1.2 Using Chrome DevTools to find the right elements#
1.3 Parsing url content by beautifulsoup#
Beautifulsoup4 https://www.crummy.com/software/BeautifulSoup/bs4/doc/
II. Get links from a page#
2.1 Getting response#
import requests
from bs4 import BeautifulSoup
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
2.2 Converting webpage to a soup object#
soup = BeautifulSoup(response.text, 'html.parser')
print(type(soup))
print(soup.prettify())
<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>
看板 Boy-Girl 文章列表 - 批踢踢實業坊
</title>
<link href="//images.ptt.cc/bbs/v2.27/bbs-common.css" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/bbs-custom.css" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen" rel="stylesheet" type="text/css"/>
<link href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print" rel="stylesheet" type="text/css"/>
</head>
<body>
<div id="topbar-container">
<div class="bbs-content" id="topbar">
<a href="/bbs/" id="logo">
批踢踢實業坊
</a>
<span>
›
</span>
<a class="board" href="/bbs/Boy-Girl/index.html">
<span class="board-label">
看板
</span>
Boy-Girl
</a>
<a class="right small" href="/about.html">
關於我們
</a>
<a class="right small" href="/contact.html">
聯絡資訊
</a>
</div>
</div>
<div id="main-container">
<div id="action-bar-container">
<div class="action-bar">
<div class="btn-group btn-group-dir">
<a class="btn selected" href="/bbs/Boy-Girl/index.html">
看板
</a>
<a class="btn" href="/man/Boy-Girl/index.html">
精華區
</a>
</div>
<div class="btn-group btn-group-paging">
<a class="btn wide" href="/bbs/Boy-Girl/index1.html">
最舊
</a>
<a class="btn wide" href="/bbs/Boy-Girl/index6499.html">
‹ 上頁
</a>
<a class="btn wide disabled">
下頁 ›
</a>
<a class="btn wide" href="/bbs/Boy-Girl/index.html">
最新
</a>
</div>
</div>
</div>
<div class="r-list-container action-bar-margin bbs-screen">
<div class="search-bar">
<form action="search" id="search-bar" type="get">
<input class="query" name="q" placeholder="搜尋文章⋯" type="text" value=""/>
</form>
</div>
<div class="r-ent">
<div class="nrec">
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758188186.A.4FB.html">
[分享] 未婚伴侶親密關係成長工作坊
</a>
</div>
<div class="meta">
<div class="author">
sylviechen
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%88%86%E4%BA%AB%5D+%E6%9C%AA%E5%A9%9A%E4%BC%B4%E4%BE%B6%E8%A6%AA%E5%AF%86%E9%97%9C%E4%BF%82%E6%88%90%E9%95%B7%E5%B7%A5%E4%BD%9C%E5%9D%8A">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Asylviechen">
搜尋看板內 sylviechen 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/18
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
5
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758192054.A.346.html">
Re: [心情] 交友軟體好難用
</a>
</div>
<div class="meta">
<div class="author">
ppgod
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%BF%83%E6%83%85%5D+%E4%BA%A4%E5%8F%8B%E8%BB%9F%E9%AB%94%E5%A5%BD%E9%9B%A3%E7%94%A8">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Appgod">
搜尋看板內 ppgod 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/18
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f3">
14
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758198648.A.F32.html">
[閒聊] 另一半都是從哪裡認識的?
</a>
</div>
<div class="meta">
<div class="author">
sooge
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E9%96%92%E8%81%8A%5D+%E5%8F%A6%E4%B8%80%E5%8D%8A%E9%83%BD%E6%98%AF%E5%BE%9E%E5%93%AA%E8%A3%A1%E8%AA%8D%E8%AD%98%E7%9A%84%EF%BC%9F">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Asooge">
搜尋看板內 sooge 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/18
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f3">
36
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758204069.A.C93.html">
[求助] 雙方家長第一次見面 該約在哪
</a>
</div>
<div class="meta">
<div class="author">
LaAc
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E6%B1%82%E5%8A%A9%5D+%E9%9B%99%E6%96%B9%E5%AE%B6%E9%95%B7%E7%AC%AC%E4%B8%80%E6%AC%A1%E8%A6%8B%E9%9D%A2+%E8%A9%B2%E7%B4%84%E5%9C%A8%E5%93%AA">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3ALaAc">
搜尋看板內 LaAc 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/18
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f3">
58
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758248477.A.5AA.html">
[討論] 不想公開、邊界感、磨合時聽而不回應
</a>
</div>
<div class="meta">
<div class="author">
Benson8891
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E8%A8%8E%E8%AB%96%5D+%E4%B8%8D%E6%83%B3%E5%85%AC%E9%96%8B%E3%80%81%E9%82%8A%E7%95%8C%E6%84%9F%E3%80%81%E7%A3%A8%E5%90%88%E6%99%82%E8%81%BD%E8%80%8C%E4%B8%8D%E5%9B%9E%E6%87%89">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3ABenson8891">
搜尋看板內 Benson8891 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/19
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
3
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758254313.A.321.html">
Re: [心情] 交友軟體好難用
</a>
</div>
<div class="meta">
<div class="author">
outuse
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%BF%83%E6%83%85%5D+%E4%BA%A4%E5%8F%8B%E8%BB%9F%E9%AB%94%E5%A5%BD%E9%9B%A3%E7%94%A8">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aoutuse">
搜尋看板內 outuse 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/19
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
6
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758267252.A.6EB.html">
[閒聊]當渣男的感覺好爽喔
</a>
</div>
<div class="meta">
<div class="author">
bmwg8
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E9%96%92%E8%81%8A%5D%E7%95%B6%E6%B8%A3%E7%94%B7%E7%9A%84%E6%84%9F%E8%A6%BA%E5%A5%BD%E7%88%BD%E5%96%94">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Abmwg8">
搜尋看板內 bmwg8 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/19
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f3">
10
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758268423.A.106.html">
[閒聊] 請勇敢當個普信男
</a>
</div>
<div class="meta">
<div class="author">
bmwg8
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E9%96%92%E8%81%8A%5D+%E8%AB%8B%E5%8B%87%E6%95%A2%E7%95%B6%E5%80%8B%E6%99%AE%E4%BF%A1%E7%94%B7">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Abmwg8">
搜尋看板內 bmwg8 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/19
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
2
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758276398.A.22E.html">
Re: [心情] 交友軟體好難用
</a>
</div>
<div class="meta">
<div class="author">
princessws
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%BF%83%E6%83%85%5D+%E4%BA%A4%E5%8F%8B%E8%BB%9F%E9%AB%94%E5%A5%BD%E9%9B%A3%E7%94%A8">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aprincessws">
搜尋看板內 princessws 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/19
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
2
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758277884.A.D77.html">
Re: [心情] 交友軟體好難用
</a>
</div>
<div class="meta">
<div class="author">
kobe8bryant
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%BF%83%E6%83%85%5D+%E4%BA%A4%E5%8F%8B%E8%BB%9F%E9%AB%94%E5%A5%BD%E9%9B%A3%E7%94%A8">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Akobe8bryant">
搜尋看板內 kobe8bryant 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/19
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
3
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758340941.A.F0E.html">
Re: [心情] 交友軟體好難用
</a>
</div>
<div class="meta">
<div class="author">
CuLiZn5566
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%BF%83%E6%83%85%5D+%E4%BA%A4%E5%8F%8B%E8%BB%9F%E9%AB%94%E5%A5%BD%E9%9B%A3%E7%94%A8">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3ACuLiZn5566">
搜尋看板內 CuLiZn5566 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/20
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
5
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758344055.A.A6E.html">
Re: [閒聊] 請勇敢當個普信男
</a>
</div>
<div class="meta">
<div class="author">
CuLiZn5566
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E9%96%92%E8%81%8A%5D+%E8%AB%8B%E5%8B%87%E6%95%A2%E7%95%B6%E5%80%8B%E6%99%AE%E4%BF%A1%E7%94%B7">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3ACuLiZn5566">
搜尋看板內 CuLiZn5566 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/20
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
1
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758357165.A.3B5.html">
Re: [心情] 交友軟體好難用
</a>
</div>
<div class="meta">
<div class="author">
Subaru5566
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%BF%83%E6%83%85%5D+%E4%BA%A4%E5%8F%8B%E8%BB%9F%E9%AB%94%E5%A5%BD%E9%9B%A3%E7%94%A8">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3ASubaru5566">
搜尋看板內 Subaru5566 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/20
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f3">
10
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758380453.A.393.html">
[閒聊] GetMarry板主連署宣傳
</a>
</div>
<div class="meta">
<div class="author">
NomeL
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E9%96%92%E8%81%8A%5D+GetMarry%E6%9D%BF%E4%B8%BB%E9%80%A3%E7%BD%B2%E5%AE%A3%E5%82%B3">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3ANomeL">
搜尋看板內 NomeL 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/20
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
1
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758421309.A.31A.html">
Re: [求助] 雙方家長第一次見面 該約在哪
</a>
</div>
<div class="meta">
<div class="author">
b122771
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E6%B1%82%E5%8A%A9%5D+%E9%9B%99%E6%96%B9%E5%AE%B6%E9%95%B7%E7%AC%AC%E4%B8%80%E6%AC%A1%E8%A6%8B%E9%9D%A2+%E8%A9%B2%E7%B4%84%E5%9C%A8%E5%93%AA">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Ab122771">
搜尋看板內 b122771 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/21
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f3">
21
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758422017.A.F67.html">
同居時房子裝潢的觀念?
</a>
</div>
<div class="meta">
<div class="author">
ginholan
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%E5%90%8C%E5%B1%85%E6%99%82%E6%88%BF%E5%AD%90%E8%A3%9D%E6%BD%A2%E7%9A%84%E8%A7%80%E5%BF%B5%EF%BC%9F">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aginholan">
搜尋看板內 ginholan 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/21
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758453856.A.EF0.html">
Re: 同居時房子裝潢的觀念?
</a>
</div>
<div class="meta">
<div class="author">
jamo
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%E5%90%8C%E5%B1%85%E6%99%82%E6%88%BF%E5%AD%90%E8%A3%9D%E6%BD%A2%E7%9A%84%E8%A7%80%E5%BF%B5%EF%BC%9F">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Ajamo">
搜尋看板內 jamo 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/21
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758455447.A.B9F.html">
Re: [討論] 不想公開、邊界感、磨合時聽而不回應
</a>
</div>
<div class="meta">
<div class="author">
FlyinDance56
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E8%A8%8E%E8%AB%96%5D+%E4%B8%8D%E6%83%B3%E5%85%AC%E9%96%8B%E3%80%81%E9%82%8A%E7%95%8C%E6%84%9F%E3%80%81%E7%A3%A8%E5%90%88%E6%99%82%E8%81%BD%E8%80%8C%E4%B8%8D%E5%9B%9E%E6%87%89">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3AFlyinDance56">
搜尋看板內 FlyinDance56 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/21
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
1
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758461250.A.5D6.html">
[求助] 女生好朋友跟仇人在一起,心情很複雜
</a>
</div>
<div class="meta">
<div class="author">
wanderer80
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E6%B1%82%E5%8A%A9%5D+%E5%A5%B3%E7%94%9F%E5%A5%BD%E6%9C%8B%E5%8F%8B%E8%B7%9F%E4%BB%87%E4%BA%BA%E5%9C%A8%E4%B8%80%E8%B5%B7%EF%BC%8C%E5%BF%83%E6%83%85%E5%BE%88%E8%A4%87%E9%9B%9C">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Awanderer80">
搜尋看板內 wanderer80 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/21
</div>
<div class="mark">
</div>
</div>
</div>
<div class="r-list-sep">
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
7
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1729589160.A.FF1.html">
[公告] 板規四侮辱標準2025.01更新
</a>
</div>
<div class="meta">
<div class="author">
eulbos
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E6%9D%BF%E8%A6%8F%E5%9B%9B%E4%BE%AE%E8%BE%B1%E6%A8%99%E6%BA%962025.01%E6%9B%B4%E6%96%B0">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aeulbos">
搜尋看板內 eulbos 的文章
</a>
</div>
</div>
</div>
<div class="date">
10/22
</div>
<div class="mark">
M
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
2
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1732475209.A.AAE.html">
[公告] 關於肉搜/公開個資
</a>
</div>
<div class="meta">
<div class="author">
eulbos
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E9%97%9C%E6%96%BC%E8%82%89%E6%90%9C%2F%E5%85%AC%E9%96%8B%E5%80%8B%E8%B3%87">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aeulbos">
搜尋看板內 eulbos 的文章
</a>
</div>
</div>
</div>
<div class="date">
11/25
</div>
<div class="mark">
M
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f3">
47
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1735917789.A.F1A.html">
[公告] 置底檢舉暨閒聊區2025.01~
</a>
</div>
<div class="meta">
<div class="author">
eulbos
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E7%BD%AE%E5%BA%95%E6%AA%A2%E8%88%89%E6%9A%A8%E9%96%92%E8%81%8A%E5%8D%802025.01~">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aeulbos">
搜尋看板內 eulbos 的文章
</a>
</div>
</div>
</div>
<div class="date">
1/03
</div>
<div class="mark">
M
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
5
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1736495718.A.C52.html">
[公告] Boy-Girl板規 25.01.13
</a>
</div>
<div class="meta">
<div class="author">
eulbos
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+Boy-Girl%E6%9D%BF%E8%A6%8F+25.01.13">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aeulbos">
搜尋看板內 eulbos 的文章
</a>
</div>
</div>
</div>
<div class="date">
1/10
</div>
<div class="mark">
M
</div>
</div>
</div>
<div class="r-ent">
<div class="nrec">
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1757140752.A.AFC.html">
[公告] 9月水桶公告
</a>
</div>
<div class="meta">
<div class="author">
eulbos
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+9%E6%9C%88%E6%B0%B4%E6%A1%B6%E5%85%AC%E5%91%8A">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Aeulbos">
搜尋看板內 eulbos 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/06
</div>
<div class="mark">
M
</div>
</div>
</div>
</div>
</div>
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-DZ6Y3BY9GW">
</script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-DZ6Y3BY9GW');
</script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-32365737-1', {
cookieDomain: 'ptt.cc',
legacyCookieDomain: 'ptt.cc'
});
ga('send', 'pageview');
</script>
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js">
</script>
<script src="//images.ptt.cc/bbs/v2.27/bbs.js">
</script>
<script>
(function(){function c(){var b=a.contentDocument||a.contentWindow.document;if(b){var d=b.createElement('script');d.innerHTML="window.__CF$cv$params={r:'982a19e6e834cb82',t:'MTc1ODQ2MzM1NQ=='};var a=document.createElement('script');a.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js';document.getElementsByTagName('head')[0].appendChild(a);";b.getElementsByTagName('head')[0].appendChild(d)}}if(document.body){var a=document.createElement('iframe');a.height=1;a.width=1;a.style.position='absolute';a.style.top=0;a.style.left=0;a.style.border='none';a.style.visibility='hidden';document.body.appendChild(a);if('loading'!==document.readyState)c();else if(window.addEventListener)document.addEventListener('DOMContentLoaded',c);else{var e=document.onreadystatechange||function(){};document.onreadystatechange=function(b){e(b);'loading'!==document.readyState&&(document.onreadystatechange=e,c())}}}})();
</script>
</body>
</html>
2.3 Traversing an html file by soup#
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
<title>看板 Boy-Girl 文章列表 - 批踢踢實業坊</title>
title
看板 Boy-Girl 文章列表 - 批踢踢實業坊
head
2.4 Finding all links in the doc#
print(type(soup.find_all("a")))
print(len(soup.find_all("a")))
soup.find_all("a")[:5]
<class 'bs4.element.ResultSet'>
82
[<a href="/bbs/" id="logo">批踢踢實業坊</a>,
<a class="board" href="/bbs/Boy-Girl/index.html"><span class="board-label">看板 </span>Boy-Girl</a>,
<a class="right small" href="/about.html">關於我們</a>,
<a class="right small" href="/contact.html">聯絡資訊</a>,
<a class="btn selected" href="/bbs/Boy-Girl/index.html">看板</a>]
2.5 Extracting element content and attribute#
nodes_a = soup.find_all('a')
print(nodes_a[0])
print(nodes_a[0].text)
print(nodes_a[0].get('href'))
<a href="/bbs/" id="logo">批踢踢實業坊</a>
批踢踢實業坊
/bbs/
for link in soup.find_all('a')[:5]:
    print(link.get('href'))
/bbs/
/bbs/Boy-Girl/index.html
/about.html
/contact.html
/bbs/Boy-Girl/index.html
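Not every <a> tag carries an href; the disabled paging button in the page above is one example. The sketch below uses a small inline HTML string (made up for illustration, modelled on those paging buttons) to show how .get() and square-bracket access differ when the attribute is missing:

```python
from bs4 import BeautifulSoup

# Two anchors modelled on the paging buttons; the second has no href
html = '''
<a class="btn wide" href="/bbs/Boy-Girl/index6499.html">prev</a>
<a class="btn wide disabled">next</a>
'''
soup_demo = BeautifulSoup(html, 'html.parser')
first, second = soup_demo.find_all('a')

print(first.get('href'))       # the attribute value
print(second.get('href'))      # None, because the attribute is missing
print(second.get('href', ''))  # a default value can be supplied instead

try:
    second['href']             # square brackets raise KeyError when the attribute is missing
except KeyError:
    print("no href attribute")
```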
2.6 append() links into a list#
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
print(len(links))
82
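Because link.get('href') returns None for an <a> tag without an href, the list built above can contain None entries. One possible variation, sketched here on a made-up inline HTML string, passes href=True to find_all() so that only anchors with an href are collected:

```python
from bs4 import BeautifulSoup

# Inline sample (an assumption for illustration): the middle anchor has no href
html = '''
<a href="/bbs/" id="logo">home</a>
<a class="btn wide disabled">next</a>
<a class="right small" href="/about.html">about</a>
'''
soup_demo = BeautifulSoup(html, 'html.parser')

# href=True keeps only <a> tags that actually carry an href attribute
links_demo = [a['href'] for a in soup_demo.find_all('a', href=True)]
print(links_demo)
```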
III. Get article links from the first page#
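Note how the class is written in this line of code (class_ with a trailing underscore, because class is a reserved word in Python):
for link in soup.find_all(class_ = "r-ent"):
str.strip() removes the whitespace at the beginning and end of a string.
astring = " 123123 \n123123 "
astring.strip()
[out]:"123123 \n123123"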
3.1 Get elements by a specific class#
# Just re-run this
import requests
from bs4 import BeautifulSoup
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
soup = BeautifulSoup(response.text, 'html.parser')
for div in soup.find_all(class_ = "r-ent")[:3]:
    print(div.prettify())
    print("-"*80)
<div class="r-ent">
<div class="nrec">
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758188186.A.4FB.html">
[分享] 未婚伴侶親密關係成長工作坊
</a>
</div>
<div class="meta">
<div class="author">
sylviechen
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%88%86%E4%BA%AB%5D+%E6%9C%AA%E5%A9%9A%E4%BC%B4%E4%BE%B6%E8%A6%AA%E5%AF%86%E9%97%9C%E4%BF%82%E6%88%90%E9%95%B7%E5%B7%A5%E4%BD%9C%E5%9D%8A">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Asylviechen">
搜尋看板內 sylviechen 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/18
</div>
<div class="mark">
</div>
</div>
</div>
--------------------------------------------------------------------------------
<div class="r-ent">
<div class="nrec">
<span class="hl f2">
5
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758192054.A.346.html">
Re: [心情] 交友軟體好難用
</a>
</div>
<div class="meta">
<div class="author">
ppgod
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E5%BF%83%E6%83%85%5D+%E4%BA%A4%E5%8F%8B%E8%BB%9F%E9%AB%94%E5%A5%BD%E9%9B%A3%E7%94%A8">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Appgod">
搜尋看板內 ppgod 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/18
</div>
<div class="mark">
</div>
</div>
</div>
--------------------------------------------------------------------------------
<div class="r-ent">
<div class="nrec">
<span class="hl f3">
14
</span>
</div>
<div class="title">
<a href="/bbs/Boy-Girl/M.1758198648.A.F32.html">
[閒聊] 另一半都是從哪裡認識的?
</a>
</div>
<div class="meta">
<div class="author">
sooge
</div>
<div class="article-menu">
<div class="trigger">
⋯
</div>
<div class="dropdown">
<div class="item">
<a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E9%96%92%E8%81%8A%5D+%E5%8F%A6%E4%B8%80%E5%8D%8A%E9%83%BD%E6%98%AF%E5%BE%9E%E5%93%AA%E8%A3%A1%E8%AA%8D%E8%AD%98%E7%9A%84%EF%BC%9F">
搜尋同標題文章
</a>
</div>
<div class="item">
<a href="/bbs/Boy-Girl/search?q=author%3Asooge">
搜尋看板內 sooge 的文章
</a>
</div>
</div>
</div>
<div class="date">
9/18
</div>
<div class="mark">
</div>
</div>
</div>
--------------------------------------------------------------------------------
3.2 soup.select() with CSS selector#
for div in soup.select(".r-ent")[:10]:
    print(div.get_text(strip = True))
    print("-"*80)
[分享] 未婚伴侶親密關係成長工作坊sylviechen⋯搜尋同標題文章搜尋看板內 sylviechen 的文章9/18
--------------------------------------------------------------------------------
5Re: [心情] 交友軟體好難用ppgod⋯搜尋同標題文章搜尋看板內 ppgod 的文章9/18
--------------------------------------------------------------------------------
14[閒聊] 另一半都是從哪裡認識的?sooge⋯搜尋同標題文章搜尋看板內 sooge 的文章9/18
--------------------------------------------------------------------------------
36[求助] 雙方家長第一次見面 該約在哪LaAc⋯搜尋同標題文章搜尋看板內 LaAc 的文章9/18
--------------------------------------------------------------------------------
58[討論] 不想公開、邊界感、磨合時聽而不回應Benson8891⋯搜尋同標題文章搜尋看板內 Benson8891 的文章9/19
--------------------------------------------------------------------------------
3Re: [心情] 交友軟體好難用outuse⋯搜尋同標題文章搜尋看板內 outuse 的文章9/19
--------------------------------------------------------------------------------
6[閒聊]當渣男的感覺好爽喔bmwg8⋯搜尋同標題文章搜尋看板內 bmwg8 的文章9/19
--------------------------------------------------------------------------------
10[閒聊] 請勇敢當個普信男bmwg8⋯搜尋同標題文章搜尋看板內 bmwg8 的文章9/19
--------------------------------------------------------------------------------
2Re: [心情] 交友軟體好難用princessws⋯搜尋同標題文章搜尋看板內 princessws 的文章9/19
--------------------------------------------------------------------------------
2Re: [心情] 交友軟體好難用kobe8bryant⋯搜尋同標題文章搜尋看板內 kobe8bryant 的文章9/19
--------------------------------------------------------------------------------
for div in soup.find_all(class_ = "r-ent")[:10]:
    print(div.get_text(strip = True))
    print("-"*80)
[分享] 未婚伴侶親密關係成長工作坊sylviechen⋯搜尋同標題文章搜尋看板內 sylviechen 的文章9/18
--------------------------------------------------------------------------------
5Re: [心情] 交友軟體好難用ppgod⋯搜尋同標題文章搜尋看板內 ppgod 的文章9/18
--------------------------------------------------------------------------------
14[閒聊] 另一半都是從哪裡認識的?sooge⋯搜尋同標題文章搜尋看板內 sooge 的文章9/18
--------------------------------------------------------------------------------
36[求助] 雙方家長第一次見面 該約在哪LaAc⋯搜尋同標題文章搜尋看板內 LaAc 的文章9/18
--------------------------------------------------------------------------------
58[討論] 不想公開、邊界感、磨合時聽而不回應Benson8891⋯搜尋同標題文章搜尋看板內 Benson8891 的文章9/19
--------------------------------------------------------------------------------
3Re: [心情] 交友軟體好難用outuse⋯搜尋同標題文章搜尋看板內 outuse 的文章9/19
--------------------------------------------------------------------------------
6[閒聊]當渣男的感覺好爽喔bmwg8⋯搜尋同標題文章搜尋看板內 bmwg8 的文章9/19
--------------------------------------------------------------------------------
10[閒聊] 請勇敢當個普信男bmwg8⋯搜尋同標題文章搜尋看板內 bmwg8 的文章9/19
--------------------------------------------------------------------------------
2Re: [心情] 交友軟體好難用princessws⋯搜尋同標題文章搜尋看板內 princessws 的文章9/19
--------------------------------------------------------------------------------
2Re: [心情] 交友軟體好難用kobe8bryant⋯搜尋同標題文章搜尋看板內 kobe8bryant 的文章9/19
--------------------------------------------------------------------------------
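A CSS selector can also describe nesting, which avoids picking up the search links inside each entry. The following is a minimal sketch on a trimmed-down inline copy of one r-ent block (the link texts are placeholders, not the real titles):

```python
from bs4 import BeautifulSoup

html = '''
<div class="r-ent">
  <div class="title"><a href="/bbs/Boy-Girl/M.1758188186.A.4FB.html">post A</a></div>
  <div class="meta"><div class="author">sylviechen</div>
    <div class="item"><a href="/bbs/Boy-Girl/search?q=author%3Asylviechen">search</a></div>
  </div>
</div>
'''
soup_demo = BeautifulSoup(html, 'html.parser')

# ".r-ent .title a" matches <a> tags inside class="title" inside class="r-ent",
# so the search link under class="meta" is skipped
title_links = soup_demo.select(".r-ent .title a")
for a in title_links:
    print(a.get_text(strip=True), a['href'])

# select_one() returns the first match, or None when nothing matches
author = soup_demo.select_one(".r-ent .author")
print(author.get_text(strip=True))
```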
Get href of <a>#
for div in soup.find_all(class_ = "r-ent")[:5]:
    print(div.find(class_='nrec').text.strip())
    print(div.find(class_='date').text.strip())
    print(div.find(class_='author').text.strip())
    print(div.find(class_="title").text.strip())
    print(div.find(class_='title').a['href'])
    print("-"*80)
9/18
sylviechen
[分享] 未婚伴侶親密關係成長工作坊
/bbs/Boy-Girl/M.1758188186.A.4FB.html
--------------------------------------------------------------------------------
5
9/18
ppgod
Re: [心情] 交友軟體好難用
/bbs/Boy-Girl/M.1758192054.A.346.html
--------------------------------------------------------------------------------
14
9/18
sooge
[閒聊] 另一半都是從哪裡認識的?
/bbs/Boy-Girl/M.1758198648.A.F32.html
--------------------------------------------------------------------------------
36
9/18
LaAc
[求助] 雙方家長第一次見面 該約在哪
/bbs/Boy-Girl/M.1758204069.A.C93.html
--------------------------------------------------------------------------------
58
9/19
Benson8891
[討論] 不想公開、邊界感、磨合時聽而不回應
/bbs/Boy-Girl/M.1758248477.A.5AA.html
--------------------------------------------------------------------------------
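Instead of printing each field, the same lookups can fill one dictionary per article, which is easier to reuse later. This is a sketch under assumptions: the inline HTML is a shortened copy of one entry with a placeholder title, and the dictionary keys are names chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<div class="r-ent">
  <div class="nrec"><span class="hl f2">5</span></div>
  <div class="title"><a href="/bbs/Boy-Girl/M.1758192054.A.346.html">Re: sample title</a></div>
  <div class="meta"><div class="author">ppgod</div><div class="date"> 9/18</div></div>
</div>
'''
soup_demo = BeautifulSoup(html, 'html.parser')

articles = []
for div in soup_demo.find_all(class_="r-ent"):
    title_div = div.find(class_="title")
    articles.append({
        "nrec": div.find(class_="nrec").get_text(strip=True),
        "date": div.find(class_="date").get_text(strip=True),
        "author": div.find(class_="author").get_text(strip=True),
        "title": title_div.get_text(strip=True),
        # title_div.a is None when the entry has no link
        "href": title_div.a["href"] if title_div.a else None,
    })
print(articles)
```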
3.3 Using try and except to handle exception#
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout = (1, 2))
soup = BeautifulSoup(response.text, 'html.parser')
for div in soup.find_all(class_ = "r-ent")[:3]:
    print(div.find(class_="title").text.strip())
    try:
        print(div.find(class_='title').a['href'])
    except TypeError:
        # A removed post has no <a> tag, so .a is None and None['href'] raises TypeError
        print("The Page was removed")
[分享] 未婚伴侶親密關係成長工作坊
/bbs/Boy-Girl/M.1758188186.A.4FB.html
Re: [心情] 交友軟體好難用
/bbs/Boy-Girl/M.1758192054.A.346.html
[閒聊] 另一半都是從哪裡認識的?
/bbs/Boy-Girl/M.1758198648.A.F32.html
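An alternative to try and except is to test for the missing <a> tag explicitly. The sketch below assumes, for illustration only, that a removed post appears as a title div with plain text and no link; the second entry in the inline HTML is made up to imitate that case.

```python
from bs4 import BeautifulSoup

html = '''
<div class="r-ent"><div class="title"><a href="/bbs/Boy-Girl/M.1758188186.A.4FB.html">kept post</a></div></div>
<div class="r-ent"><div class="title">(removed post)</div></div>
'''
soup_demo = BeautifulSoup(html, 'html.parser')

results = []
for div in soup_demo.find_all(class_="r-ent"):
    a = div.find(class_="title").a   # None when the title holds no <a> tag
    if a is None:
        results.append(None)
        print("The Page was removed")
    else:
        results.append(a['href'])
        print(a['href'])
```

The explicit check only handles the one situation it names, so any other unexpected error would still surface instead of being silently hidden.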
3.4 Add the prefix url to each url#
pre = 'https://www.ptt.cc'
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
soup = BeautifulSoup(response.text, 'html.parser')
for div in soup.find_all(class_ = "r-ent"):
    print(div.find(class_="title").text.strip())
    try:
        print(pre + div.find(class_='title').a['href'])
    except TypeError:
        # No <a> tag in the title (removed post): skip the link
        pass
[分享] 未婚伴侶親密關係成長工作坊
https://www.ptt.cc/bbs/Boy-Girl/M.1758188186.A.4FB.html
Re: [心情] 交友軟體好難用
https://www.ptt.cc/bbs/Boy-Girl/M.1758192054.A.346.html
[閒聊] 另一半都是從哪裡認識的?
https://www.ptt.cc/bbs/Boy-Girl/M.1758198648.A.F32.html
[求助] 雙方家長第一次見面 該約在哪
https://www.ptt.cc/bbs/Boy-Girl/M.1758204069.A.C93.html
[討論] 不想公開、邊界感、磨合時聽而不回應
https://www.ptt.cc/bbs/Boy-Girl/M.1758248477.A.5AA.html
Re: [心情] 交友軟體好難用
https://www.ptt.cc/bbs/Boy-Girl/M.1758254313.A.321.html
[閒聊]當渣男的感覺好爽喔
https://www.ptt.cc/bbs/Boy-Girl/M.1758267252.A.6EB.html
[閒聊] 請勇敢當個普信男
https://www.ptt.cc/bbs/Boy-Girl/M.1758268423.A.106.html
Re: [心情] 交友軟體好難用
https://www.ptt.cc/bbs/Boy-Girl/M.1758276398.A.22E.html
Re: [心情] 交友軟體好難用
https://www.ptt.cc/bbs/Boy-Girl/M.1758277884.A.D77.html
Re: [心情] 交友軟體好難用
https://www.ptt.cc/bbs/Boy-Girl/M.1758340941.A.F0E.html
Re: [閒聊] 請勇敢當個普信男
https://www.ptt.cc/bbs/Boy-Girl/M.1758344055.A.A6E.html
Re: [心情] 交友軟體好難用
https://www.ptt.cc/bbs/Boy-Girl/M.1758357165.A.3B5.html
[閒聊] GetMarry板主連署宣傳
https://www.ptt.cc/bbs/Boy-Girl/M.1758380453.A.393.html
Re: [求助] 雙方家長第一次見面 該約在哪
https://www.ptt.cc/bbs/Boy-Girl/M.1758421309.A.31A.html
同居時房子裝潢的觀念?
https://www.ptt.cc/bbs/Boy-Girl/M.1758422017.A.F67.html
Re: 同居時房子裝潢的觀念?
https://www.ptt.cc/bbs/Boy-Girl/M.1758453856.A.EF0.html
Re: [討論] 不想公開、邊界感、磨合時聽而不回應
https://www.ptt.cc/bbs/Boy-Girl/M.1758455447.A.B9F.html
[求助] 女生好朋友跟仇人在一起,心情很複雜
https://www.ptt.cc/bbs/Boy-Girl/M.1758461250.A.5D6.html
[公告] 板規四侮辱標準2025.01更新
https://www.ptt.cc/bbs/Boy-Girl/M.1729589160.A.FF1.html
[公告] 關於肉搜/公開個資
https://www.ptt.cc/bbs/Boy-Girl/M.1732475209.A.AAE.html
[公告] 置底檢舉暨閒聊區2025.01~
https://www.ptt.cc/bbs/Boy-Girl/M.1735917789.A.F1A.html
[公告] Boy-Girl板規 25.01.13
https://www.ptt.cc/bbs/Boy-Girl/M.1736495718.A.C52.html
[公告] 9月水桶公告
https://www.ptt.cc/bbs/Boy-Girl/M.1757140752.A.AFC.html
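Plain string concatenation works here because every href on the index page starts with `/bbs/...`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves an href against the page it came from and leaves an already absolute URL untouched. The helper name `absolutize` below is made up for illustration; the sample hrefs are taken from the output above.

```python
from urllib.parse import urljoin

BASE = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'

def absolutize(href, base=BASE):
    """Resolve a scraped href against the page it was found on."""
    return urljoin(base, href)

# A site-relative href, as found on the index page.
full = absolutize('/bbs/Boy-Girl/M.1758188186.A.4FB.html')
print(full)

# An href that is already absolute is returned unchanged.
already_full = absolutize('https://www.ptt.cc/bbs/Boy-Girl/M.1758192054.A.346.html')
print(already_full)
```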
3.5 Appending scraped urls to a list#
links = []
pre = 'https://www.ptt.cc'
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
soup = BeautifulSoup(response.text, 'html.parser')
for div in soup.find_all(class_="r-ent"):
    try:
        links.append(pre + div.find(class_='title').a['href'])
    except TypeError:  # removed post: no <a> inside the title div
        pass
print(len(links))
24
IV. Get more links from more pages#
https://www.ptt.cc/bbs/Boy-Girl/M.1463279135.A.825.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463280694.A.659.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463280951.A.42C.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463281950.A.094.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463281956.A.CE1.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463286737.A.ECC.html
...
4.1 Get the last page#
import re
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(1, 2))
soup = BeautifulSoup(response.text, 'html.parser')
# The second paging button links to the previous page
last2nd = soup.find(class_='action-bar').find(class_='btn-group-paging').find_all(class_='btn')[1].get('href')
print(last2nd)
# The newest page is one past the previous page's number (raw string avoids an invalid-escape warning)
lastpage = int(re.search(r"index(\d+)\.html", last2nd).group(1)) + 1
print(lastpage)
/bbs/Boy-Girl/index6499.html
6500
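The page-number extraction can also be wrapped in a small function that returns `None` instead of raising when the href carries no number. This is a minimal sketch, not the tutorial's own code; `page_number` is a hypothetical helper name.

```python
import re

def page_number(href):
    """Return the integer N in '.../indexN.html', or None if there is no number."""
    match = re.search(r'index(\d+)\.html$', href)
    return int(match.group(1)) if match else None

# The "previous page" href from the output above.
prev_page = page_number('/bbs/Boy-Girl/index6499.html')
print(prev_page + 1)                            # number of the newest page

# The bare index URL has no number, so the helper returns None.
print(page_number('/bbs/Boy-Girl/index.html'))
```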
4.2 Get the last 5 pages#
pre = 'https://www.ptt.cc'
links = []
lastpage = 5303  # page number from an earlier run; section 4.1 computes the current value
for i in range(lastpage, lastpage-5, -1):  # the 5 most recent pages, newest first
    url = 'https://www.ptt.cc/bbs/Boy-Girl/index{}.html'.format(i)
    response = requests.get(url, timeout=(1, 2))
    soup = BeautifulSoup(response.text, 'html.parser')
    for div in soup.find_all(class_='r-ent'):
        try:
            links.append(pre + div.find(class_='title').a['href'])
        except TypeError:  # removed post: no <a> inside the title div
            pass
    print(i, "\t", len(links))
5303 20
5302 40
5301 60
5300 80
5299 100
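The list of page URLs can be built separately from the fetching, which makes the page count an explicit parameter. The following is a sketch under the assumption that index pages are numbered from 1 upward; `recent_index_urls` and its `board` parameter are illustrative names, not part of the original code.

```python
def recent_index_urls(lastpage, n, board='Boy-Girl'):
    """URLs of the n most recent index pages, newest first."""
    template = 'https://www.ptt.cc/bbs/{}/index{}.html'
    stop = max(lastpage - n, 0)  # never go below page 1
    return [template.format(board, i) for i in range(lastpage, stop, -1)]

urls = recent_index_urls(5303, 5)
for u in urls:
    print(u)

# Asking for more pages than exist simply returns every page.
short = recent_index_urls(3, 5)
print(len(short))
```

When looping over such a list with `requests.get`, a short `time.sleep` between requests is a common courtesy to the server.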
Notes: break, continue, and pass#
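break: when the condition is met, exit the loop immediately and run no further iterations.
continue: when the condition is met, skip the rest of the current iteration and go on to the next one.
pass: do nothing.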
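A small loop makes the difference between the three statements visible. The numbers are arbitrary and chosen only for illustration.

```python
visited = []
for n in range(1, 8):
    if n == 2:
        pass          # placeholder: nothing happens, execution falls through
    if n % 3 == 0:
        continue      # skip 3 and 6 and go straight to the next n
    if n == 7:
        break         # leave the loop before 7 is recorded
    visited.append(n)

print(visited)
```

In the scraping loops above, `pass` inside `except` plays the same role: a post without a link is silently skipped.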
V. Getting post content for each post link#
BeautifulSoup offers two ways to locate elements and their attributes:
soup.find() and soup.find_all(), which match by tag name and attributes
soup.select(), which takes a CSS selector
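The two styles can be compared on a small, hand-written snippet. The markup below is a hypothetical, trimmed-down imitation of a PTT index page (the hrefs are made up), so the sketch runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup in the shape of a PTT index page.
html = '''
<div class="r-ent"><div class="title"><a href="/bbs/Boy-Girl/M.1.A.001.html">first post</a></div></div>
<div class="r-ent"><div class="title">(removed post)</div></div>
<div class="r-ent"><div class="title"><a href="/bbs/Boy-Girl/M.2.A.002.html">second post</a></div></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# find_all() + find(): walk each entry and check whether a link exists.
by_find = []
for div in soup.find_all(class_='r-ent'):
    a = div.find('a')
    if a is not None:
        by_find.append(a['href'])

# select(): one CSS selector that only matches <a> tags inside a title div.
by_select = [a['href'] for a in soup.select('div.r-ent div.title a')]

print(by_find)
print(by_select)
```

Because the selector only matches entries that contain an `<a>`, the removed post is skipped without any `try`/`except`.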
len(links)
100
for link in links[:5]:
    print(link)
https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A.ED2.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631097298.A.B3F.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631103574.A.641.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631106523.A.A13.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A.0E0.html
for link in links[:5]:
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, 'html.parser')
    print(type(soup))
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
5.1 Get metadata#
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    print(len(metas))
    print(metas[0].text)
    print(metas[1].text)
    print(metas[2].text)
https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A.ED2.html
4
CuLiZn5566 (同理心5566)
Boy-Girl
Re: [討論] 台灣現在還有重男輕女的觀念嗎?
https://www.ptt.cc/bbs/Boy-Girl/M.1631097298.A.B3F.html
4
tobeacat (香嚕噜)
Boy-Girl
Re: [討論] 台灣現在還有重男輕女的觀念嗎?
https://www.ptt.cc/bbs/Boy-Girl/M.1631103574.A.641.html
4
moshenisshit (嘻嘻)
Boy-Girl
Re: [討論] 台灣現在還有重男輕女的觀念嗎?
https://www.ptt.cc/bbs/Boy-Girl/M.1631106523.A.A13.html
4
BoBoLung (泡泡龍)
Boy-Girl
Re: [討論] 台灣現在還有重男輕女的觀念嗎?
https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A.0E0.html
4
upon515 (沒有你的天氣怎麼能好)
Boy-Girl
Re: [討論] 台灣現在還有重男輕女的觀念嗎?
5.2 Assign data to variables#
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    # metas holds four values in order: author, board, title, timestamp
    metas = soup.find_all(class_='article-meta-value')
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A.ED2.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631097298.A.B3F.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631103574.A.641.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631106523.A.A13.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A.0E0.html
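The metadata values are plain strings, so they usually need one more parsing step before analysis. The sketch below splits the author field into user ID and nickname and turns the timestamp into a `datetime`. It assumes the `userid (nickname)` and `Wed Sep 8 14:46:20 2021` shapes seen in this tutorial's output and English day and month names; `split_author` and `parse_timestamp` are hypothetical helper names.

```python
import re
from datetime import datetime

def split_author(author):
    """Split 'userid (nickname)' into its two parts; nickname may be empty."""
    match = re.match(r'^(\S+)\s*\((.*)\)\s*$', author)
    if match is None:
        return author.strip(), ''
    return match.group(1), match.group(2)

def parse_timestamp(timestamp):
    """Parse a string such as 'Wed Sep 8 14:46:20 2021' into a datetime."""
    # Whitespace in the format matches one or more spaces in the input.
    return datetime.strptime(timestamp.strip(), '%a %b %d %H:%M:%S %Y')

userid, nickname = split_author('CuLiZn5566 (同理心5566)')
posted = parse_timestamp('Wed Sep 8 14:46:20 2021')
print(userid, nickname, posted.isoformat())
```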
5.3 Get all content#
alist = [1, 2, 3]
print(isinstance(alist, tuple))
False
from bs4 import NavigableString
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    if len(metas) == 0:  # no metadata block: skip this page
        continue
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
    content = ''
    # Keep only the text nodes that sit directly under #main-content
    for text in soup.find(id='main-content'):
        if isinstance(text, NavigableString):
            content += text.strip()
    print(len(content))  # number of characters in the post body
https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A.ED2.html
216
https://www.ptt.cc/bbs/Boy-Girl/M.1631097298.A.B3F.html
1603
https://www.ptt.cc/bbs/Boy-Girl/M.1631103574.A.641.html
213
https://www.ptt.cc/bbs/Boy-Girl/M.1631106523.A.A13.html
1000
https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A.0E0.html
2327
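The `isinstance(text, NavigableString)` test works because iterating over a tag yields its direct children: bare text comes back as `NavigableString`, nested elements as `Tag`. The sketch below shows the effect on a hypothetical, hand-written imitation of a post page, so it runs offline.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup in the rough shape of a PTT post page.
html = (
    '<div id="main-content">'
    '<div class="article-metaline"><span class="article-meta-value">someone</span></div>'
    'First paragraph.\n'
    '<span class="f2">signature line</span>'
    'Second paragraph.\n'
    '<div class="push"><span class="push-content">: a comment</span></div>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
main = soup.find(id='main-content')

# Direct text children only: metadata, signature, and comments are nested
# inside their own tags and are therefore left out.
own_text = ''.join(
    child.strip() for child in main if isinstance(child, NavigableString)
)

# get_text() collects the text of every descendant as well.
everything = main.get_text()

print(own_text)
print(len(own_text) < len(everything))
```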
5.4 Save them all to a dictionary and append to a list#
from bs4 import NavigableString
all_post = []
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    if len(metas) == 0:
        continue
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
    content = ''
    for text in soup.find(id='main-content'):
        if isinstance(text, NavigableString):
            content += text.strip()
    all_post.append({'author': author,
                     'link': link,
                     'title': title,
                     'timestamp': timestamp,
                     'content': content})
https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A.ED2.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631097298.A.B3F.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631103574.A.641.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631106523.A.A13.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A.0E0.html
import pandas as pd
pd.DataFrame(all_post)
| | author | link | title | timestamp | content |
|---|---|---|---|---|---|
| 0 | CuLiZn5566 (同理心5566) | https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 14:46:20 2021 | 啊這種事情兩個人說好就好\n\n有那麼麻煩嗎???\n\n\n到底有哪對情侶夫妻真的每天在那... |
| 1 | tobeacat (香嚕噜) | https://www.ptt.cc/bbs/Boy-Girl/M.1631097298.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 18:34:56 2021 | 我家是傳產,早年賺很多->被中國便宜貨打趴,生意慘淡—>消費者回頭買品質好的台貨\n—>新冠... |
| 2 | moshenisshit (嘻嘻) | https://www.ptt.cc/bbs/Boy-Girl/M.1631103574.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 20:19:32 2021 | 說到這個分遺產喔,\n\n我來講講我親戚遇到美味自助餐的饕客的故事,\n\n我親戚男生,家裡... |
| 3 | BoBoLung (泡泡龍) | https://www.ptt.cc/bbs/Boy-Girl/M.1631106523.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:08:41 2021 | 就說台男真的是很可憐QQ\n\n說貓大死要錢的人是誰?\n\n是他媽?男的女的,女的\n\n... |
| 4 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... |
VI. Scraping comments#
from bs4 import NavigableString
all_post = []
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))  # bind to res: the next line parses res.text
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    if len(metas) == 0:
        continue
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
    content = ''
    for text in soup.find(id='main-content'):
        if isinstance(text, NavigableString):
            content += text.strip()
    # Each comment is a div.push holding a tag, user ID, content, and IP/time
    comments = []
    for push in soup.find_all(class_="push"):
        push_tag = push.find(class_='push-tag').text
        push_userid = push.find(class_='push-userid').text
        push_content = push.find(class_='push-content').text
        push_ipdatetime = push.find(class_='push-ipdatetime').text
        # print(push_userid, push_content, push_ipdatetime)
        comments.append({'tag': push_tag, 'userid': push_userid,
                         'content': push_content,
                         'timestamp': push_ipdatetime})
    all_post.append({'author': author,
                     'link': link,
                     'title': title,
                     'timestamp': timestamp,
                     'content': content,
                     'comments': comments})
https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A.ED2.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631097298.A.B3F.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631103574.A.641.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631106523.A.A13.html
https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A.0E0.html
pd.DataFrame(all_post)
| | author | link | title | timestamp | content | comments |
|---|---|---|---|---|---|---|
| 0 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | [{'tag': '推 ', 'userid': 'death840922', 'conte... |
| 1 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631097298.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | [{'tag': '推 ', 'userid': 'death840922', 'conte... |
| 2 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631103574.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | [{'tag': '推 ', 'userid': 'death840922', 'conte... |
| 3 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631106523.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | [{'tag': '推 ', 'userid': 'death840922', 'conte... |
| 4 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | [{'tag': '推 ', 'userid': 'death840922', 'conte... |
pd.DataFrame(all_post).explode('comments')
| | author | link | title | timestamp | content | comments |
|---|---|---|---|---|---|---|
| 0 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | {'tag': '推 ', 'userid': 'death840922', 'conten... |
| 0 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | {'tag': '噓 ', 'userid': 'newtypeL9', 'content'... |
| 0 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | {'tag': '噓 ', 'userid': 'wingthink', 'content'... |
| 0 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | {'tag': '推 ', 'userid': 'marktak', 'content': ... |
| 0 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631083582.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | {'tag': '推 ', 'userid': 'seaping', 'content': ... |
| ... | ... | ... | ... | ... | ... | ... |
| 4 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | {'tag': '→ ', 'userid': 'seaping', 'content': ... |
| 4 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | {'tag': '→ ', 'userid': 'seaping', 'content': ... |
| 4 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | {'tag': '→ ', 'userid': 'seaping', 'content': ... |
| 4 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | {'tag': '噓 ', 'userid': 'IMISSA', 'content': '... |
| 4 | upon515 (沒有你的天氣怎麼能好) | https://www.ptt.cc/bbs/Boy-Girl/M.1631106987.A... | Re: [討論] 台灣現在還有重男輕女的觀念嗎? | Wed Sep 8 21:16:21 2021 | 個人不喜歡做發文佔版面這種事,但以推文來說實在太長,\n\n手眼腦皆殘如我,對長推文感到苦手... | {'tag': '噓 ', 'userid': 'osmanthusjo', 'conten... |
65 rows × 6 columns
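After `explode`, each row holds one comment, but the `comments` cell is still a dictionary. One possible way to get fully flat rows is to flatten the nested structure in plain Python before building the DataFrame. This is a sketch on made-up sample data shaped like `all_post`; the `comment_` key prefix and the function name `flatten_comments` are illustrative choices.

```python
def flatten_comments(posts):
    """Yield one flat dict per comment, repeating the post-level fields."""
    for post in posts:
        post_fields = {k: v for k, v in post.items() if k != 'comments'}
        for comment in post.get('comments', []):
            row = dict(post_fields)
            # Prefix comment keys so they do not overwrite the post's own
            # 'content' and 'timestamp'.
            row.update({'comment_' + k: v for k, v in comment.items()})
            yield row

# Made-up sample in the same shape as all_post.
sample = [
    {'author': 'alice', 'link': 'https://example.com/1', 'title': 't1',
     'timestamp': 'Wed Sep 8 14:46:20 2021', 'content': 'body',
     'comments': [
         {'tag': '推 ', 'userid': 'bob', 'content': ': nice', 'timestamp': ' 09/08 15:00'},
         {'tag': '噓 ', 'userid': 'carol', 'content': ': no', 'timestamp': ' 09/08 15:05'},
     ]},
    {'author': 'dave', 'link': 'https://example.com/2', 'title': 't2',
     'timestamp': 'Wed Sep 8 18:34:56 2021', 'content': 'body2', 'comments': []},
]

rows = list(flatten_comments(sample))
print(len(rows))
print(rows[0]['author'], rows[0]['comment_userid'])
```

Passing `rows` to `pd.DataFrame` then gives one column per field. Note that a post without comments produces no row here, whereas `explode` keeps such a post as a single row with a missing value.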
VII. Saving data#
import pickle
# with open('ptt1092_testing.pkl', 'wb') as fout:  # pickle needs binary mode ('wb')
#     pickle.dump(all_post, fout)
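A round trip shows that the nested list of dictionaries survives pickling unchanged. The sketch below uses a made-up stand-in for `all_post` and writes to a temporary directory; the file name `posts_demo.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the scraped all_post list.
posts = [{'author': 'alice', 'title': 't1', 'comments': [{'userid': 'bob'}]}]

path = os.path.join(tempfile.mkdtemp(), 'posts_demo.pkl')  # hypothetical file name

with open(path, 'wb') as fout:   # pickle writes bytes, so open in binary mode
    pickle.dump(posts, fout)

with open(path, 'rb') as fin:    # read back in binary mode as well
    restored = pickle.load(fin)

print(restored == posts)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.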