# P08 HTML Parsing - Crawling PTT

* Using BeautifulSoup Extract Text without Tags http://stackoverflow.com/questions/23380171/using-beautifulsoup-extract-text-without-tags
* Scraper: [comparing beautifulsoup vs. scrapy](https://stackoverflow.com/questions/19687421/difference-between-beautifulsoup-and-scrapy-crawler)
* [comparing beautifulsoup vs. scrapy 2](https://blog.michaelyin.info/scrapy-tutorial-1-scrapy-vs-beautiful-soup/)

## I. Introduction

### 1.1 Get url content by requests
* Try to get the following links (click and observe them, get them by `request.get()`)
    * https://www.ptt.cc/bbs/Boy-Girl/index.html
    * https://www.ptt.cc/bbs/Gossiping/index.html
    * http://ecshweb.pchome.com.tw/search/v3.3/?q=iphone

In [4]:
import requests
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
print(url)

response = requests.get(url, timeout=(2, 3))
print(response.text[:500])

https://www.ptt.cc/bbs/Boy-Girl/index.html
<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>看板 Boy-Girl 文章列表 - 批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//imag


### 1.2 Using Chrome DevTools to get right elements

* [Slide: Inspecting HTML by Chrome DevTools](https://docs.google.com/presentation/d/e/2PACX-1vSrIfJQJpr_24wwIjMaTMKiq_xrhZ5n-J26G7xbXC1HIMMKWfMsm6zFWOsX8NxNEN_S46z9PnsASj32/pub?start=false&loop=false&delayms=3000)

### 1.3 Parsing url content by beautifulsoup

* Beautifulsoup4 https://www.crummy.com/software/BeautifulSoup/bs4/doc/

## II. Get links from a page

### 2.1 Getting response

In [2]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))

### 2.2 Converting webpage to a soup object

In [3]:
soup = BeautifulSoup(response.text, 'html.parser')
print(type(soup))
print(soup.prettify())

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   看板 Boy-Girl 文章列表 - 批踢踢實業坊
  </title>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-common.css" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-custom.css" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen" rel="stylesheet" type="text/css"/>
  <link href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print" rel="stylesheet" type="text/css"/>
 </head>
 <body>
  <div id="topbar-container">
   <div class="bbs-content" id="topbar">
    <a href="/bbs/" id="logo">
     批踢踢實業坊
    </a>
    <span>
     ›
    </span>
    <a class="board" href="/bbs/Boy-Girl/index.html">
     <span class="board-label">
      看板
     </span>
     Boy-Girl
    <

### 2.3 Traversing an html file by soup

In [4]:
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)

<title>看板 Boy-Girl 文章列表 - 批踢踢實業坊</title>
title
看板 Boy-Girl 文章列表 - 批踢踢實業坊
head


### 2.4 Finding all links in the doc

In [5]:
print(type(soup.find_all("a")))
print(len(soup.find_all("a")))
soup.find_all("a")[:5]

<class 'bs4.element.ResultSet'>
40


[<a href="/bbs/" id="logo">批踢踢實業坊</a>,
 <a class="board" href="/bbs/Boy-Girl/index.html"><span class="board-label">看板 </span>Boy-Girl</a>,
 <a class="right small" href="/about.html">關於我們</a>,
 <a class="right small" href="/contact.html">聯絡資訊</a>,
 <a class="btn selected" href="/bbs/Boy-Girl/index.html">看板</a>]

### 2.5 Extracting element content and attribute

In [6]:
nodes_a = soup.find_all('a')
print(nodes_a[0])
print(nodes_a[0].text)
print(nodes_a[0].get('href'))

<a href="/bbs/" id="logo">批踢踢實業坊</a>
批踢踢實業坊
/bbs/


In [7]:
for link in soup.find_all('a')[:5]:
    print(link.get('href'))

/bbs/
/bbs/Boy-Girl/index.html
/about.html
/contact.html
/bbs/Boy-Girl/index.html


### 2.6 `append()` links into a list

In [8]:
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
print(len(links))

40


## III. Get article links from the first page
* 注意這行code的class的寫法`for link in soup.find_all(class_ = "r-ent"):`
* `str.strip()`可以幫你把一個字串前後的空白拿掉。
```
astring = "     123123   \n123123     "
astring.strip()
[out]:"123123   \n123123"
```



### 3.1 Get elements by a specific class

In [9]:
# Just re-run this
import requests
from bs4 import BeautifulSoup

url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
soup = BeautifulSoup(response.text, 'html.parser')

In [10]:
for div in soup.find_all(class_ = "r-ent")[:3]:
    print(div.prettify())
    print("-"*80)

<div class="r-ent">
 <div class="nrec">
  <span class="hl f3">
   28
  </span>
 </div>
 <div class="title">
  <a href="/bbs/Boy-Girl/M.1636276845.A.58A.html">
   [求助] 因無法陪伴而被劈腿是否該放手
  </a>
 </div>
 <div class="meta">
  <div class="author">
   EEBGL
  </div>
  <div class="article-menu">
   <div class="trigger">
    ⋯
   </div>
   <div class="dropdown">
    <div class="item">
     <a href="/bbs/Boy-Girl/search?q=thread%3A%5B%E6%B1%82%E5%8A%A9%5D+%E5%9B%A0%E7%84%A1%E6%B3%95%E9%99%AA%E4%BC%B4%E8%80%8C%E8%A2%AB%E5%8A%88%E8%85%BF%E6%98%AF%E5%90%A6%E8%A9%B2%E6%94%BE%E6%89%8B">
      搜尋同標題文章
     </a>
    </div>
    <div class="item">
     <a href="/bbs/Boy-Girl/search?q=author%3AEEBGL">
      搜尋看板內 EEBGL 的文章
     </a>
    </div>
   </div>
  </div>
  <div class="date">
   11/07
  </div>
  <div class="mark">
  </div>
 </div>
</div>

--------------------------------------------------------------------------------
<div class="r-ent">
 <div class="nrec">
  <span class="hl f2">
   1
  </span>
 </di

#### 3.2 `soup.select()` with CSS selector

In [11]:
for div in soup.select(".r-ent")[:10]:
    print(div.get_text(strip = True))
    print("-"*80)

28[求助] 因無法陪伴而被劈腿是否該放手EEBGL⋯搜尋同標題文章搜尋看板內 EEBGL 的文章11/07
--------------------------------------------------------------------------------
1Re: [求助] 曖昧到最後無疾而終cloud72426⋯搜尋同標題文章搜尋看板內 cloud72426 的文章11/07
--------------------------------------------------------------------------------
(本文已被刪除) [ss61512tw]-11/07
--------------------------------------------------------------------------------
14[討論] 你們是怎麼度過尬聊期的A22813079⋯搜尋同標題文章搜尋看板內 A22813079 的文章11/07
--------------------------------------------------------------------------------
[討論] 板上的人是不是都太年輕啊？pttkobe⋯搜尋同標題文章搜尋看板內 pttkobe 的文章11/07
--------------------------------------------------------------------------------
[討論] 如果用學術的角度切入kittor⋯搜尋同標題文章搜尋看板內 kittor 的文章11/07
--------------------------------------------------------------------------------
6[公告] 關於新制板規說明（必讀）snda⋯搜尋同標題文章搜尋看板內 snda 的文章5/11!
--------------------------------------------------------------------------------
[公告] 檢舉格式教學ChenDao⋯搜尋同標題文章搜尋看板內 ChenDao 的文章4/08!
----------------------

In [12]:
for div in soup.find_all(class_ = "r-ent")[:10]:
    print(div.get_text(strip = True))
    print("-"*80)

28[求助] 因無法陪伴而被劈腿是否該放手EEBGL⋯搜尋同標題文章搜尋看板內 EEBGL 的文章11/07
--------------------------------------------------------------------------------
1Re: [求助] 曖昧到最後無疾而終cloud72426⋯搜尋同標題文章搜尋看板內 cloud72426 的文章11/07
--------------------------------------------------------------------------------
(本文已被刪除) [ss61512tw]-11/07
--------------------------------------------------------------------------------
14[討論] 你們是怎麼度過尬聊期的A22813079⋯搜尋同標題文章搜尋看板內 A22813079 的文章11/07
--------------------------------------------------------------------------------
[討論] 板上的人是不是都太年輕啊？pttkobe⋯搜尋同標題文章搜尋看板內 pttkobe 的文章11/07
--------------------------------------------------------------------------------
[討論] 如果用學術的角度切入kittor⋯搜尋同標題文章搜尋看板內 kittor 的文章11/07
--------------------------------------------------------------------------------
6[公告] 關於新制板規說明（必讀）snda⋯搜尋同標題文章搜尋看板內 snda 的文章5/11!
--------------------------------------------------------------------------------
[公告] 檢舉格式教學ChenDao⋯搜尋同標題文章搜尋看板內 ChenDao 的文章4/08!
----------------------

#### Get `href` of `<a>`

In [5]:
for div in soup.find_all(class_ = "r-ent")[:5]:
    print(div.find(class_='nrec').text.strip())
    print(div.find(class_='date').text.strip())
    print(div.find(class_='author').text.strip())
    print(div.find(class_="title").text.strip())
    print(div.find(class_='title').a['href'])
    print("-"*80)

NameError: name 'soup' is not defined

### 3.3 Using `try` and `except` to handle exception

In [15]:
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout = (1, 2))
soup = BeautifulSoup(response.text, 'html.parser')

for div in soup.find_all(class_ = "r-ent")[:3]:
    print(div.find(class_="title").text.strip())
    try:
        print(div.find(class_='title').a['href'])
    except:
        print("The Page was removed")

[求助] 因無法陪伴而被劈腿是否該放手
/bbs/Boy-Girl/M.1636276845.A.58A.html
Re: [求助] 曖昧到最後無疾而終
/bbs/Boy-Girl/M.1636277263.A.F73.html
(本文已被刪除) [ss61512tw]
The Page was removed


### 3.4 Add the prefix url to each url

In [16]:
pre = 'https://www.ptt.cc'
url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
soup = BeautifulSoup(response.text, 'html.parser')
for div in soup.find_all(class_ = "r-ent"):
    print(div.find(class_="title").text.strip())
    try:
        print(pre + div.find(class_='title').a['href'])
    except:
        pass

[求助] 因無法陪伴而被劈腿是否該放手
https://www.ptt.cc/bbs/Boy-Girl/M.1636276845.A.58A.html
Re: [求助] 曖昧到最後無疾而終
https://www.ptt.cc/bbs/Boy-Girl/M.1636277263.A.F73.html
(本文已被刪除) [ss61512tw]
[討論] 你們是怎麼度過尬聊期的
https://www.ptt.cc/bbs/Boy-Girl/M.1636283415.A.644.html
[討論] 板上的人是不是都太年輕啊？
https://www.ptt.cc/bbs/Boy-Girl/M.1636287054.A.17D.html
[討論] 如果用學術的角度切入
https://www.ptt.cc/bbs/Boy-Girl/M.1636290612.A.9DB.html
[公告] 關於新制板規說明（必讀）
https://www.ptt.cc/bbs/Boy-Girl/M.1399818891.A.D72.html
[公告] 檢舉格式教學
https://www.ptt.cc/bbs/Boy-Girl/M.1554727003.A.986.html
[公告] 有關於"問卷文申請"
https://www.ptt.cc/bbs/Boy-Girl/M.1560802697.A.8E6.html
[公告] 男女板板規
https://www.ptt.cc/bbs/Boy-Girl/M.1565730811.A.BDC.html
Fw: [情報] 34th 小天使招考（9/5～9/15）
https://www.ptt.cc/bbs/Boy-Girl/M.1631115662.A.014.html


### 3.5 Appending scraped urls to a list

In [17]:
links = []
pre = 'https://www.ptt.cc'

url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(2, 3))
soup = BeautifulSoup(response.text, 'html.parser')

for div in soup.find_all(class_ = "r-ent"):
    try:
        links.append(pre + div.find(class_='title').a['href'])
    except:
        pass
print(len(links))

10


## IV. Get more links from more pages
```
https://www.ptt.cc/bbs/Boy-Girl/M.1463279135.A.825.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463280694.A.659.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463280951.A.42C.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463281950.A.094.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463281956.A.CE1.html
https://www.ptt.cc/bbs/Boy-Girl/M.1463286737.A.ECC.html
...
```

### 4.1 Get the last page

In [18]:
import re

url = 'https://www.ptt.cc/bbs/Boy-Girl/index.html'
response = requests.get(url, timeout=(1, 2))
soup = BeautifulSoup(response.text, 'html.parser')
last2nd = soup.find(class_='action-bar').find(class_='btn-group-paging').find_all(class_='btn')[1].get('href')
print(last2nd)
lastpage = int(re.search("index(.+?)\.html", last2nd).group(1)) + 1
print(lastpage)

/bbs/Boy-Girl/index5537.html
5538


### 4.2 Get the last 10 pages

In [19]:
pre = 'https://www.ptt.cc'
links = []
lastpage = 5303
for i in range(lastpage, lastpage-5, -1):
    url = 'https://www.ptt.cc/bbs/Boy-Girl/index{}.html'.format(i)
    response = requests.get(url, timeout=(1, 2))
    soup = BeautifulSoup(response.text, 'html.parser')
    for div in soup.find_all(class_='r-ent'):
        try:
            links.append(pre + div.find(class_='title').a['href'])
        except:
            pass
    print(i, "\t", len(links))


5303 	 20
5302 	 40
5301 	 60
5300 	 80
5299 	 100
5298 	 120
5297 	 140
5296 	 160
5295 	 180
5294 	 200


### Notes: break, continue, and pass
* `break`: 符合條件的時候，就跳出迴圈，別再執行了。
* `continue`: 符合條件的時候，就不要執行這一圈了，接下去執行下一圈
* `pass`: 不做任何事

## V. Getting post content for each post link
* Beautifulsoup has two ways to detect elements and its attributes
    * `soup.find()`, `soup.find_all()`
    * `soup.select()` <-- You can input CSS selector by this way

In [20]:
len(links)

200

In [21]:
for link in links[:5]:
    print(link)

https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A.B37.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620031172.A.255.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620032241.A.610.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620034856.A.1BD.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A.4AE.html


In [22]:
for link in links[:5]:
    res = requests.get(link, timeout=(1,2))
    soup = BeautifulSoup(res.text, 'html.parser')
    print(type(soup))

<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>
<class 'bs4.BeautifulSoup'>


### 5.1 Get metadata

In [23]:
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    print(len(metas))
    print(metas[0].text)
    print(metas[1].text)
    print(metas[2].text)

https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A.B37.html
4
talkmyself (休息中)
Boy-Girl
Re: [討論] 男生真的會重視女生的付出嗎
https://www.ptt.cc/bbs/Boy-Girl/M.1620031172.A.255.html
4
ciza (執行力才是主道)
Boy-Girl
Re: [討論] 男生真的會重視女生的付出嗎
https://www.ptt.cc/bbs/Boy-Girl/M.1620032241.A.610.html
4
sumade (斬卍凱蒂貓卍佛)
Boy-Girl
Re: [討論] 男生真的會重視女生的付出嗎
https://www.ptt.cc/bbs/Boy-Girl/M.1620034856.A.1BD.html
4
eulbos (心想事成)
Boy-Girl
Re: [討論] 如果另一半騎車讓你覺得很危險
https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A.4AE.html
4
foam0406 (南山南)
Boy-Girl
[討論] 會不喜歡吃飯吃很慢的異性嗎


### 5.2 Assign data to variables

In [24]:
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text

https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A.B37.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620031172.A.255.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620032241.A.610.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620034856.A.1BD.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A.4AE.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620039732.A.98A.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620040499.A.8A3.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620040898.A.285.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620042170.A.A25.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620042553.A.EC5.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620045051.A.48A.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620055542.A.837.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620060517.A.9EB.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620061213.A.C17.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620063054.A.D99.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620065259.A.F5D.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620085306.A.B07.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620093356.A.4

KeyboardInterrupt: 

### 5.3 Get all content
```
alist = [1, 2, 3]
print(isinstance(alist, tuple))
: False
```

In [25]:
from bs4 import NavigableString
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    if len(metas) == 0: 
        continue
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
    
    content = ''
    for text in soup.find(id='main-content'):
        if isinstance(text, NavigableString):
            content += text.strip()
    print(len(content)) # calculating number of words

https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A.B37.html
421
https://www.ptt.cc/bbs/Boy-Girl/M.1620031172.A.255.html
557
https://www.ptt.cc/bbs/Boy-Girl/M.1620032241.A.610.html
1090


### 5.4 Save them all to a dictionary and append to a list

In [26]:
from bs4 import NavigableString
all_post = []
for link in links[:5]:
    print(link)
    res = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    if len(metas) == 0: 
        continue
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
    
    content = ''
    for text in soup.find(id='main-content'):
        if isinstance(text, NavigableString):
            content += text.strip()
    
    all_post.append({'author':author, 
                     'link':link,
                      'title':title, 
                      'timestamp':timestamp, 
                      'content':content}) 

https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A.B37.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620031172.A.255.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620032241.A.610.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620034856.A.1BD.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A.4AE.html


In [27]:
import pandas as pd
pd.DataFrame(all_post)

Unnamed: 0,author,link,title,timestamp,content
0,talkmyself (休息中),https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A...,Re: [討論] 男生真的會重視女生的付出嗎,Mon May 3 15:31:03 2021,"你是邏輯有問題嗎?\n男生的付出""只有""金錢? 直接輕描淡寫的帶過\n女生的付出洋洋灑灑長篇..."
1,ciza (執行力才是主道),https://www.ptt.cc/bbs/Boy-Girl/M.1620031172.A...,Re: [討論] 男生真的會重視女生的付出嗎,Mon May 3 16:39:29 2021,不要仇女啦\n\n好女孩可能就只是你還沒遇到而已啦\n\n像我就很感謝我老婆的付出阿\n\n...
2,sumade (斬卍凱蒂貓卍佛),https://www.ptt.cc/bbs/Boy-Girl/M.1620032241.A...,Re: [討論] 男生真的會重視女生的付出嗎,Mon May 3 16:57:19 2021,看到你推這句 就知道你根本沒把大家的意見聽進去\n\n只覺得自己好可憐 為什麼鄉民都欺負...
3,eulbos (心想事成),https://www.ptt.cc/bbs/Boy-Girl/M.1620034856.A...,Re: [討論] 如果另一半騎車讓你覺得很危險,Mon May 3 17:40:53 2021,我阿肥啦，因為開車技術太差不敢交女朋友QQ\n\n阿肥建議啦，真的要花點心思練車才好開車門也...
4,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...


## VI. Scraping comments


In [28]:
from bs4 import NavigableString
all_post = []
for link in links[:5]:
    print(link)
    resulrests = requests.get(link, timeout=(1, 2))
    soup = BeautifulSoup(res.text, "html.parser")
    metas = soup.find_all(class_='article-meta-value')
    if len(metas) == 0: 
        continue
    author = metas[0].text
    title = metas[-2].text
    timestamp = metas[-1].text
    
    content = ''
    for text in soup.find(id='main-content'):
        if isinstance(text, NavigableString):
            content += text.strip()
    
    comments = []
    for push in soup.find_all(class_="push"):
        push_tag = push.find(class_='push-tag').text
        push_userid = push.find(class_='push-userid').text
        push_content = push.find(class_='push-content').text
        push_ipdatetime = push.find(class_='push-ipdatetime').text

        #print push_userid, push_content, push_ipdatetime
        comments.append({'tag': push_tag,'userid':push_userid, 
                         'content':push_content, 
                         'timestamp':push_ipdatetime})
    all_post.append({'author':author, 
                     'link':link,
                      'title':title, 
                      'timestamp':timestamp, 
                      'content':content, 
                      'comments':comments}) 

https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A.B37.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620031172.A.255.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620032241.A.610.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620034856.A.1BD.html
https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A.4AE.html


In [29]:
pd.DataFrame(all_post)

Unnamed: 0,author,link,title,timestamp,content,comments
0,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"[{'tag': '推 ', 'userid': 'mintsnow', 'content'..."
1,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620031172.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"[{'tag': '推 ', 'userid': 'mintsnow', 'content'..."
2,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620032241.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"[{'tag': '推 ', 'userid': 'mintsnow', 'content'..."
3,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620034856.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"[{'tag': '推 ', 'userid': 'mintsnow', 'content'..."
4,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"[{'tag': '推 ', 'userid': 'mintsnow', 'content'..."


In [30]:
pd.DataFrame(all_post).explode('comments')

Unnamed: 0,author,link,title,timestamp,content,comments
0,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"{'tag': '推 ', 'userid': 'mintsnow', 'content':..."
0,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"{'tag': '→ ', 'userid': 'onizuka7kimo', 'conte..."
0,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"{'tag': '推 ', 'userid': 'wl843907', 'content':..."
0,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"{'tag': '推 ', 'userid': 'mind324', 'content': ..."
0,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620027067.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"{'tag': '推 ', 'userid': 'Alan1988', 'content':..."
...,...,...,...,...,...,...
4,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"{'tag': '推 ', 'userid': 'fanfan0113', 'content..."
4,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"{'tag': '推 ', 'userid': 'edc3', 'content': ': ..."
4,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"{'tag': '推 ', 'userid': 'fix78', 'content': ':..."
4,foam0406 (南山南),https://www.ptt.cc/bbs/Boy-Girl/M.1620039447.A...,[討論] 會不喜歡吃飯吃很慢的異性嗎,Mon May 3 18:57:25 2021,如題\n小魯認識一個女生\n之前跟她吃火鍋\n下班吃的\n\n\n\n七點開始吃\n我大概八...,"{'tag': '→ ', 'userid': 'windgaia', 'content':..."


## VII. Saving data

In [32]:
import pickle
# with open('ptt1092_testing.pkl', 'wb') as fout:  # Python 3: open(..., 'wb')
#     pickle.dump(all_post, fout)

## VIII. Scraping with Cookies

In [1]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
response = requests.get(url, timeout=(1, 2))
print(response.text)

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print">




	</head>
    <body>
		
<div class="bbs-screen bbs-content">
    <div class="over18-notice">
        <p>本網站已依網站內容分級規定處理</p>

        <p>警告︰您即將進入之看板內容需滿十八歲方可瀏覽。</p>

        <p>若您尚未年滿十八歲，請點選離開。若您已滿十八歲，亦不可將本區之內容派發、傳閱、出售、出租、交給或借予年齡未滿18歲的人士瀏覽，或將本網站內容向該人士出示、播放或放映。</p>
    </div>
</div>

<div class="bbs-screen bbs-content center clear">
    <form action="/ask/over18"

### 8.1 Adding cookies

In [2]:
cookies = {'over18': '1'}
res = requests.get(url, timeout=(1, 2), cookies = cookies)
res.text

'<!DOCTYPE html>\n<html>\n\t<head>\n\t\t<meta charset="utf-8">\n\t\t\n\n<meta name="viewport" content="width=device-width, initial-scale=1">\n\n<title>看板 Gossiping 文章列表 - 批踢踢實業坊</title>\n\n<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">\n<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">\n<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">\n<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen">\n<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print">\n\n\n\n\n\t</head>\n    <body>\n\t\t\n<div id="topbar-container">\n\t<div id="topbar" class="bbs-content">\n\t\t<a id="logo" href="/bbs/">批踢踢實業坊</a>\n\t\t<span>&rsaquo;</span>\n\t\t<a class="board" href="/bbs/Gossiping/index.html"><span class="board-label">看板 </span>Gossiping</a>\n\t\t<a class="right small" href="/about.html