How can you scrape hundreds of session topics, speakers, and their job titles/companies in one go from multiple dynamically loaded pages?
Table of Contents
Case Background: Parsing the Snowflake Summit 2025 Pages
Setup: Environment and Dependencies
Step 1: Scrape Speaker Details
Step 2: Load the Session List, Extract the Core Fields, and Strip Mojibake and Stray Symbols
Step 3: Export to CSV
Appendix: Complete Code
This article uses Snowflake Summit 2025 as its case study and demonstrates how to capture, in a single run, all 572 sessions and 160+ speakers, including:
- Session track (Track)
- Session title
- Session description (Description)
- Speaker name and avatar URL
- Speaker job title (Job Title) and company
Case Background: Parsing the Snowflake Summit 2025 Pages
Session Catalog URL (the session list page):
https://reg.snowflake.com/flow/snowflake/summit25/sessions/page/catalog
- The page is rendered dynamically; you have to keep clicking the "Show more" button to load further cards.
- Each session card's HTML structure looks roughly like this:
<div class="rf-tile-wrapper">
  <div class="rf-tile">
    <!-- Banner area: the Track is visible in the img src -->
    <div class="rf-tile-banner">
      <img src="…_Breakout-Session_….png" alt="… banner">
    </div>
    <!-- Body area: avatars + title + description -->
    <div class="rf-tile-body">
      <div class="rf-tile-avatars">
        <button class="rf-tile-avatar" aria-label="Virendra Singh speaker for the '…' session">
          <img class="rf-tile-avatar-img" src="…/Virendra.jpg">
        </button>
        <!-- possibly multiple avatars -->
      </div>
      <h4 class="rf-tile-title">
        <a>Session title, with a numbering prefix</a>
      </h4>
      <p class="rf-tile-info rf-tile-line-two">……Session Description……</p>
    </div>
    <!-- Footer area: Learn More button (omitted) -->
  </div>
</div>
- Each click of "Show more" dynamically loads about 50 more cards, up to the final total of 572.
- Speaker Catalog URL (the speaker list page):
https://reg.snowflake.com/flow/snowflake/summit25/speakers/page/catalog
- This page loads in full on first render (no pagination or lazy loading to worry about).
- Each speaker card is a speaker-tile-container holding the name, job title, and company:
<div class="speaker-tile-container" tabindex="-1">
  <div class="attendee-tile no-border">
    <div class="attendee-tile-image no-avatar" role="button">
      <img src="…/christiankleinerman.jpg" alt="Christian Kleinerman">
    </div>
    <div class="attendee-tile-text-container">
      <button class="attendee-tile-name" aria-label="Christian Kleinerman">Christian Kleinerman</button>
      <p class="attendee-tile-role">
        <span class="attendee-tile-role-job-title">EVP of Product</span><br>
        <span class="attendee-tile-role-company">Snowflake</span>
      </p>
    </div>
  </div>
</div>
Once we have the DOM of these two pages, we can capture every detail of the 572 sessions and 160+ speakers.
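Before writing any Selenium code, the Track extraction can be prototyped on a sample banner URL: the scraper later simply takes the second `_`-separated segment of the banner img src and runs it through the cleaning helper. The URL below is a made-up stand-in that follows the naming pattern shown above.

```python
import re

def clean_text(raw: str) -> str:
    # Same cleaning helper used throughout this article:
    # keep printable ASCII only, then normalize dash spacing.
    ascii_only = re.sub(r"[^\x20-\x7E]", "", raw)
    ascii_only = re.sub(r"\?\-|-\-", "-", ascii_only)
    ascii_only = re.sub(r"\s*-\s*", " - ", ascii_only)
    return ascii_only.strip()

def track_from_src(src: str) -> str:
    # The banner filename embeds the track between underscores,
    # e.g. "…_Breakout-Session_….png" -> "Breakout-Session".
    return clean_text(src.split("_")[1] if "_" in src else "")

# Hypothetical example URL following the observed naming pattern
src = "https://example.com/banners/summit25_Breakout-Session_01.png"
print(track_from_src(src))  # Breakout - Session  (clean_text pads the dash)
```

Note that clean_text adds spaces around the hyphen, so "Breakout-Session" comes out as "Breakout - Session".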
Setup: Environment and Dependencies
1. Python version
Python 3.8 or later is recommended.
2. Install Selenium
pip install selenium
3. Download ChromeDriver
- The driver version must match your local Chrome browser; you can generally download it from https://chromedriver.chromium.org/ and unpack it somewhere on your PATH.
- Windows users can drop it into C:\Windows\, or pass an absolute path directly in the script.
Step 1: Scrape Speaker Details
We first spin up a dedicated WebDriver for the Speaker Catalog page, collect each speaker's name, job title, and company, and store them in a dictionary for easy lookup later.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import re

def clean_text(raw: str) -> str:
    """
    Clean a string: strip non-printable/non-ASCII characters and
    normalize spacing around dashes.
    """
    ascii_only = re.sub(r"[^\x20-\x7E]", "", raw)
    ascii_only = re.sub(r"\?\-|-\-", "-", ascii_only)
    ascii_only = re.sub(r"\s*-\s*", " - ", ascii_only)
    return ascii_only.strip()

def build_speaker_details_map(speakers_url: str, timeout: int = 30) -> dict:
    """
    1. Open the Speaker Catalog page
    2. Wait for at least one speaker container to appear
    3. Iterate over every .speaker-tile-container, pulling out name, job title, company
    4. Return { "Speaker Name": {"job_title": "...", "company": "..."}, ... }
    """
    # 0. Start the WebDriver
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)
    speaker_map = {}
    try:
        driver.get(speakers_url)
        # Wait until at least one speaker card has rendered
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.speaker-tile-container"))
        )
        # Grab every .speaker-tile-container
        containers = driver.find_elements(By.CSS_SELECTOR, "div.speaker-tile-container")
        print(f"[Speaker] Found {len(containers)} speakers in total.")
        for cont in containers:
            # Name
            try:
                name_btn = cont.find_element(By.CSS_SELECTOR, "button.attendee-tile-name")
                raw_name = name_btn.get_attribute("aria-label") or name_btn.text.strip()
                name = clean_text(raw_name)
            except NoSuchElementException:
                continue
            # Job title
            try:
                job_elem = cont.find_element(By.CSS_SELECTOR, "span.attendee-tile-role-job-title")
                job_title = clean_text(job_elem.text.strip())
            except NoSuchElementException:
                job_title = ""
            # Company
            try:
                comp_elem = cont.find_element(By.CSS_SELECTOR, "span.attendee-tile-role-company")
                company = clean_text(comp_elem.text.strip())
            except NoSuchElementException:
                company = ""
            speaker_map[name] = {
                "job_title": job_title,
                "company": company
            }
    except TimeoutException:
        print("[Speaker] Timed out without locating any speaker card; check the URL or selectors.")
    finally:
        driver.quit()
    return speaker_map

if __name__ == "__main__":
    speakers_url = "https://reg.snowflake.com/flow/snowflake/summit25/speakers/page/catalog"
    speaker_details_map = build_speaker_details_map(speakers_url, timeout=30)
    # Show the first 5 speakers as a sanity check
    for i, (k, v) in enumerate(speaker_details_map.items()):
        if i >= 5:
            break
        print(f"{k} - Job title: {v['job_title']} ; Company: {v['company']}")
This pass over all the speakers fills the Python dictionary speaker_details_map, shaped like:
{
    "Sridhar Ramaswamy": {"job_title": "Chief Executive Officer", "company": "Snowflake"},
    "Christian Kleinerman": {"job_title": "EVP of Product", "company": "Snowflake"},
    ...
}
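Because a session speaker may be missing from the catalog page, downstream lookups should not assume every name exists. A minimal sketch of the defensive lookup used later, seeded with the sample entries shown above:

```python
# Sample entries, as shown above
speaker_details_map = {
    "Sridhar Ramaswamy": {"job_title": "Chief Executive Officer", "company": "Snowflake"},
    "Christian Kleinerman": {"job_title": "EVP of Product", "company": "Snowflake"},
}

def lookup(name: str) -> tuple:
    # .get with a {} default means unknown names yield empty strings
    # instead of raising KeyError.
    info = speaker_details_map.get(name, {})
    return info.get("job_title", ""), info.get("company", "")

print(lookup("Christian Kleinerman"))  # ('EVP of Product', 'Snowflake')
print(lookup("Unknown Person"))        # ('', '')
```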
If the live page had pagination or a "Show more" button, we would need the same kind of click-loop helper; in this case, however, the Speakers page loads everything at once.
Step 2: Load the Session List and Extract the Core Fields
Next, a second, independent WebDriver visits the Session Catalog page to collect all 572 session cards. The key steps: scroll and click "Show more" in a loop until the card count stops growing, then walk every <div class="rf-tile-wrapper"> and extract:
- Track (split out of the <img src="…_Breakout-Session_…png"> URL)
- Session Title (h4.rf-tile-title > a)
- Description (p.rf-tile-info.rf-tile-line-two)
- Speakers (the name from each button.rf-tile-avatar's aria-label, plus the avatar URL from the nested <img>)
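The aria-label parsing step can be sanity-checked in isolation before running the full scraper. The label below is a hypothetical example that follows the pattern observed in the DOM:

```python
import re

ARIA_RE = re.compile(r"^(.+?) speaker\s+for\s+the\s+'(.+)' session$")

def parse_aria(aria: str):
    # Returns (speaker_name, session_title), or ("", "") when the
    # label does not match the expected pattern.
    m = ARIA_RE.match(aria)
    return (m.group(1).strip(), m.group(2)) if m else ("", "")

# Hypothetical label following the observed pattern
print(parse_aria("Virendra Singh speaker for the 'Data Cloud' session"))
# ('Virendra Singh', 'Data Cloud')
print(parse_aria("malformed label"))
# ('', '')
```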
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import (
    NoSuchElementException,
    ElementClickInterceptedException,
    TimeoutException
)
import csv   # needed for the CSV export below
import time
import re

def clean_text(raw: str) -> str:
    ascii_only = re.sub(r"[^\x20-\x7E]", "", raw)
    ascii_only = re.sub(r"\?\-|-\-", "-", ascii_only)
    ascii_only = re.sub(r"\s*-\s*", " - ", ascii_only)
    return ascii_only.strip()

def expand_all_sessions(driver, timeout=15):
    """
    Click "Show more" in a loop until no new cards load.
    """
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.rf-tile-wrapper"))
    )
    last_count = len(driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper"))
    print(f"[Session] Initial session card count: {last_count}")
    while True:
        try:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)
            btn = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, "button.show-more-btn"))
            )
            btn.click()
            try:
                WebDriverWait(driver, timeout).until(
                    lambda d: len(d.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper")) > last_count
                )
                new_count = len(driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper"))
                print(f"[Session] Loaded more sessions; current total: {new_count}")
                last_count = new_count
            except TimeoutException:
                print("[Session] No further cards loaded; done.")
                break
        except (TimeoutException, NoSuchElementException):
            print("[Session] \"Show more\" button not found or timed out; exiting.")
            break
        except ElementClickInterceptedException:
            print("[Session] Click intercepted; retrying…")
            time.sleep(2)
            continue

def scrape_sessions(driver, speaker_map: dict):
    """
    Extract the fields below from every session card, enriching each speaker
    with the job title/company from speaker_map:
    - track, session_title, description, speakers: [ {"name","avatar_url","job_title","company"}, ... ]
    """
    sessions_data = []
    wrappers = driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper")
    print(f"[Session] Captured {len(wrappers)} session cards in total.")
    for sess in wrappers:
        # ——— Track ———
        try:
            img = sess.find_element(By.CSS_SELECTOR, "div.rf-tile-banner img")
            src = img.get_attribute("src") or ""
            track = clean_text(src.split("_")[1] if "_" in src else "")
        except NoSuchElementException:
            track = ""
        # ——— Session Title ———
        try:
            raw_title = sess.find_element(By.CSS_SELECTOR, "h4.rf-tile-title a").text.strip()
            session_title = clean_text(raw_title)
        except NoSuchElementException:
            session_title = ""
        # ——— Description ———
        try:
            raw_desc = sess.find_element(By.CSS_SELECTOR, "p.rf-tile-info.rf-tile-line-two").text.strip()
            description = clean_text(raw_desc)
        except NoSuchElementException:
            description = ""
        # ——— Speakers ———
        speakers = []
        avatar_buttons = sess.find_elements(By.CSS_SELECTOR, "button.rf-tile-avatar")
        for btn in avatar_buttons:
            aria = btn.get_attribute("aria-label") or ""
            m = re.match(r"^(.+?) speaker\s+for\s+the\s+'(.+)' session$", aria)
            raw_name = m.group(1).strip() if m else ""
            name = clean_text(raw_name)
            try:
                avatar_img = btn.find_element(By.CSS_SELECTOR, "img.rf-tile-avatar-img")
                avatar_url = avatar_img.get_attribute("src")
            except NoSuchElementException:
                avatar_url = ""
            # Look up the job title/company in speaker_map
            job_title = ""
            company = ""
            if name in speaker_map:
                job_title = speaker_map[name].get("job_title", "")
                company = speaker_map[name].get("company", "")
            if name:
                speakers.append({
                    "name": name,
                    "avatar_url": avatar_url,
                    "job_title": job_title,
                    "company": company
                })
        sessions_data.append({
            "track"        : track,
            "session_title": session_title,
            "description"  : description,
            "speakers"     : speakers
        })
    return sessions_data

if __name__ == "__main__":
    # —— Step 1: scrape the speaker details first ——
    # build_speaker_details_map() is the function defined in Step 1 above.
    speakers_url = "https://reg.snowflake.com/flow/snowflake/summit25/speakers/page/catalog"
    speaker_map = build_speaker_details_map(speakers_url, timeout=30)
    # —— Step 2: scrape the sessions ——
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)
    try:
        sessions_url = "https://reg.snowflake.com/flow/snowflake/summit25/sessions/page/catalog?tab.sessioncatalogtab=1714168666431001NNiH"
        driver.get(sessions_url)
        expand_all_sessions(driver, timeout=30)
        sessions_data = scrape_sessions(driver, speaker_map)
    finally:
        driver.quit()
    # —— Step 3: write the results to CSV ——
    with open("sessions_full_with_speaker_details.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = [
            "Track",
            "Session Title",
            "Description",
            "Speaker Names",
            "Speaker Avatars",
            "Speaker Job Titles",
            "Speaker Companies"
        ]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for sess in sessions_data:
            names      = "; ".join([sp["name"] for sp in sess["speakers"]])
            avatars    = "; ".join([sp["avatar_url"] for sp in sess["speakers"]])
            job_titles = "; ".join([sp["job_title"] for sp in sess["speakers"]])
            companies  = "; ".join([sp["company"] for sp in sess["speakers"]])
            writer.writerow({
                "Track"             : sess["track"],
                "Session Title"     : sess["session_title"],
                "Description"       : sess["description"],
                "Speaker Names"     : names,
                "Speaker Avatars"   : avatars,
                "Speaker Job Titles": job_titles,
                "Speaker Companies" : companies
            })
    print(f"Extracted {len(sessions_data)} sessions (with speaker details) and saved them to sessions_full_with_speaker_details.csv")
Recap of the core approach:
- Two independent WebDriver instances handle the two pages separately, avoiding conflicts in the JS framework:
  - build_speaker_details_map(...) scrapes every speaker's name, job title, and company.
  - scrape_sessions(...) scrapes every session card and uses each speaker's name as the key to look up job_title and company in the dictionary built earlier.
- Loop on "Show more" until the card count stops growing; only then are all 572 sessions guaranteed to be loaded. The current card count is printed before and after each click, which makes debugging and progress tracking easier.
- String cleaning (clean_text):
  - Strips mojibake such as "鈥淪/鈥?", keeping only ASCII 32–126
  - Normalizes spacing around dashes and question marks
  - (Just a simple removal pass; nothing fancy)
- Data output:
  - Four base CSV columns: Track / Session Title / Description / Speaker Names
  - A session may have multiple speakers, joined into one cell with semicolons (;)
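The behavior of clean_text can be verified on a couple of representative strings (the inputs here are made-up illustrations):

```python
import re

def clean_text(raw: str) -> str:
    # Keep printable ASCII (0x20–0x7E), collapse "?-"/"--" to "-",
    # then put single spaces around each remaining dash.
    ascii_only = re.sub(r"[^\x20-\x7E]", "", raw)
    ascii_only = re.sub(r"\?\-|-\-", "-", ascii_only)
    ascii_only = re.sub(r"\s*-\s*", " - ", ascii_only)
    return ascii_only.strip()

print(clean_text("AI-Ready Data"))   # AI - Ready Data
print(clean_text("Caf\u00e9 Talk"))  # Caf Talk  (non-ASCII dropped)
```

Note that the cleaning is deliberately blunt: accented characters are dropped outright rather than transliterated, which is acceptable here because the catalog content is essentially English.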
Step 3: Export to CSV
Speaker Names, Speaker Avatars, Speaker Job Titles, and Speaker Companies each join a session's multiple speakers into a single cell with ;, which makes them easy to split later in Excel or Pandas.
with open("sessions_full_with_speaker_details.csv", "w", newline="", encoding="utf-8") as csvfile:
    fieldnames = [
        "Track",
        "Session Title",
        "Description",
        "Speaker Names",
        "Speaker Avatars",
        "Speaker Job Titles",
        "Speaker Companies"
    ]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for sess in sessions_data:
        names      = "; ".join([sp["name"] for sp in sess["speakers"]])
        avatars    = "; ".join([sp["avatar_url"] for sp in sess["speakers"]])
        job_titles = "; ".join([sp["job_title"] for sp in sess["speakers"]])
        companies  = "; ".join([sp["company"] for sp in sess["speakers"]])
        writer.writerow({
            "Track"             : sess["track"],
            "Session Title"     : sess["session_title"],
            "Description"       : sess["description"],
            "Speaker Names"     : names,
            "Speaker Avatars"   : avatars,
            "Speaker Job Titles": job_titles,
            "Speaker Companies" : companies
        })
The resulting CSV looks like this (some data has been removed, as required):
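To recover per-speaker rows from the semicolon-joined cells, the CSV can be read back and split with the standard library alone (a minimal sketch on a hypothetical one-row sample; Pandas' Series.str.split would work equally well):

```python
import csv
import io

# A one-row sample in the exported format (hypothetical values)
sample = (
    "Session Title,Speaker Names,Speaker Companies\n"
    "Keynote,\"Ana Lee; Bo Chen\",\"Acme; Globex\"\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    names = row["Speaker Names"].split("; ")
    companies = row["Speaker Companies"].split("; ")
    # Pair each speaker back up with their company
    for name, company in zip(names, companies):
        print(f"{name} -> {company}")
# Ana Lee -> Acme
# Bo Chen -> Globex
```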
Feel free to discuss scraping techniques in the comments — I'm still a scraping novice and happy to learn from the experts there.
Appendix: Complete Code
# -*- coding: utf-8 -*-
"""
One script that:
1. Scrapes the Speakers page into a {name: {job_title, company}} dictionary
2. Scrapes the Sessions page, looping on "Show more" to collect all 572 sessions
3. Cleans mojibake out of titles / descriptions / speaker names
4. Enriches every speaker of every session with job title + company
5. Writes the result to sessions_full_with_speaker_details.csv
"""
import re
import time
import csv
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import (
    NoSuchElementException,
    ElementClickInterceptedException,
    TimeoutException
)

def clean_text(raw: str) -> str:
    ascii_only = re.sub(r"[^\x20-\x7E]", "", raw)
    ascii_only = re.sub(r"\?\-|-\-", "-", ascii_only)
    ascii_only = re.sub(r"\s*-\s*", " - ", ascii_only)
    return ascii_only.strip()

def build_speaker_details_map(speakers_url: str, timeout: int = 30) -> dict:
    """
    Scrape the Speaker Catalog; return { "Name": {"job_title": "...", "company": "..."}, ... }
    """
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)
    speaker_map = {}
    try:
        driver.get(speakers_url)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.speaker-tile-container"))
        )
        containers = driver.find_elements(By.CSS_SELECTOR, "div.speaker-tile-container")
        print(f"[Speaker] Found {len(containers)} speaker cards.")
        for cont in containers:
            try:
                name_btn = cont.find_element(By.CSS_SELECTOR, "button.attendee-tile-name")
                raw_name = name_btn.get_attribute("aria-label") or name_btn.text.strip()
                name = clean_text(raw_name)
            except NoSuchElementException:
                continue
            try:
                job_elem = cont.find_element(By.CSS_SELECTOR, "span.attendee-tile-role-job-title")
                job_title = clean_text(job_elem.text.strip())
            except NoSuchElementException:
                job_title = ""
            try:
                comp_elem = cont.find_element(By.CSS_SELECTOR, "span.attendee-tile-role-company")
                company = clean_text(comp_elem.text.strip())
            except NoSuchElementException:
                company = ""
            speaker_map[name] = {"job_title": job_title, "company": company}
    except TimeoutException:
        print("[Speaker] Timed out without locating any speaker card!")
    finally:
        driver.quit()
    return speaker_map

def expand_all_sessions(driver, timeout: int = 30):
    """
    Click "Show more" in a loop until the card count stops growing.
    """
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.rf-tile-wrapper"))
    )
    last_count = len(driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper"))
    print(f"[Session] Initial session count: {last_count}")
    while True:
        try:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)
            btn = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, "button.show-more-btn"))
            )
            btn.click()
            try:
                WebDriverWait(driver, timeout).until(
                    lambda d: len(d.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper")) > last_count
                )
                new_count = len(driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper"))
                print(f"[Session] Loaded more sessions; current total: {new_count}")
                last_count = new_count
            except TimeoutException:
                print("[Session] Nothing more to load; fully expanded.")
                break
        except (TimeoutException, NoSuchElementException):
            print("[Session] \"Show more\" button not found or timed out; exiting loop.")
            break
        except ElementClickInterceptedException:
            print("[Session] Click intercepted; retrying shortly…")
            time.sleep(2)
            continue

def scrape_sessions(driver, speaker_map: dict) -> list:
    """
    Extract every session and enrich each speaker with job title + company.
    Returns a list of dicts.
    """
    data = []
    wrappers = driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper")
    print(f"[Session] Captured {len(wrappers)} session cards.")
    for sess in wrappers:
        # 1. Track
        try:
            img = sess.find_element(By.CSS_SELECTOR, "div.rf-tile-banner img")
            src = img.get_attribute("src") or ""
            track = clean_text(src.split("_")[1] if "_" in src else "")
        except NoSuchElementException:
            track = ""
        # 2. Session Title
        try:
            raw_title = sess.find_element(By.CSS_SELECTOR, "h4.rf-tile-title a").text.strip()
            session_title = clean_text(raw_title)
        except NoSuchElementException:
            session_title = ""
        # 3. Description
        try:
            raw_desc = sess.find_element(By.CSS_SELECTOR, "p.rf-tile-info.rf-tile-line-two").text.strip()
            description = clean_text(raw_desc)
        except NoSuchElementException:
            description = ""
        # 4. Speakers
        speakers = []
        avatar_buttons = sess.find_elements(By.CSS_SELECTOR, "button.rf-tile-avatar")
        for btn in avatar_buttons:
            aria = btn.get_attribute("aria-label") or ""
            m = re.match(r"^(.+?) speaker\s+for\s+the\s+'(.+)' session$", aria)
            raw_name = m.group(1).strip() if m else ""
            name = clean_text(raw_name)
            try:
                avatar_img = btn.find_element(By.CSS_SELECTOR, "img.rf-tile-avatar-img")
                avatar_url = avatar_img.get_attribute("src")
            except NoSuchElementException:
                avatar_url = ""
            job_title = speaker_map.get(name, {}).get("job_title", "")
            company = speaker_map.get(name, {}).get("company", "")
            if name:
                speakers.append({
                    "name": name,
                    "avatar_url": avatar_url,
                    "job_title": job_title,
                    "company": company
                })
        data.append({
            "track"        : track,
            "session_title": session_title,
            "description"  : description,
            "speakers"     : speakers
        })
    return data

if __name__ == "__main__":
    # —— 1. Scrape the speaker details ——
    speakers_url = "https://reg.snowflake.com/flow/snowflake/summit25/speakers/page/catalog"
    speaker_map = build_speaker_details_map(speakers_url, timeout=30)
    # —— 2. Scrape the sessions ——
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)
    try:
        sessions_url = "https://reg.snowflake.com/flow/snowflake/summit25/sessions/page/catalog?tab.sessioncatalogtab=1714168666431001NNiH"
        driver.get(sessions_url)
        expand_all_sessions(driver, timeout=30)
        sessions_data = scrape_sessions(driver, speaker_map)
    finally:
        driver.quit()
    # —— 3. Export to CSV ——
    with open("sessions_full_with_speaker_details.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = [
            "Track",
            "Session Title",
            "Description",
            "Speaker Names",
            "Speaker Avatars",
            "Speaker Job Titles",
            "Speaker Companies"
        ]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for sess in sessions_data:
            names      = "; ".join([sp["name"] for sp in sess["speakers"]])
            avatars    = "; ".join([sp["avatar_url"] for sp in sess["speakers"]])
            job_titles = "; ".join([sp["job_title"] for sp in sess["speakers"]])
            companies  = "; ".join([sp["company"] for sp in sess["speakers"]])
            writer.writerow({
                "Track"             : sess["track"],
                "Session Title"     : sess["session_title"],
                "Description"       : sess["description"],
                "Speaker Names"     : names,
                "Speaker Avatars"   : avatars,
                "Speaker Job Titles": job_titles,
                "Speaker Companies" : companies
            })
    print(f"Extracted {len(sessions_data)} sessions (with speaker details); saved as sessions_full_with_speaker_details.csv")