Scrape Dynamic Pages with Python + Selenium: A Walkthrough

How do you collect hundreds of session topics, speakers, and their job titles/companies from dynamically loaded pages in one pass?

Contents

Case background: parsing the Snowflake Summit 2025 pages

Setup: environment and dependencies

Step 1: scrape speaker details (Speakers)

Step 2: load the session list (Sessions) and extract the core fields, stripping mojibake and stray symbols

Step 3: export to CSV

Appendix: complete code

This article uses Snowflake Summit 2025 as its case study and shows how to collect, in one pass, 572 sessions and 160+ speakers, including:

  • Session track (Track)

  • Session title

  • Session description (Description)

  • Speaker names and avatar URLs

  • Speaker job titles (Job Title) and companies

Case background: parsing the Snowflake Summit 2025 pages

Session Catalog URL (session list page):

https://reg.snowflake.com/flow/snowflake/summit25/sessions/page/catalog
  • The page renders dynamically; you have to keep clicking the "Show more" button to load further cards

  • Each session card's HTML looks roughly like this:

<div class="rf-tile-wrapper">
  <div class="rf-tile">
    <!-- Banner area: the track is visible in the img src -->
    <div class="rf-tile-banner">
      <img src="…_Breakout-Session_….png" alt="… banner">
    </div>
    <!-- Body area: avatars + title + description -->
    <div class="rf-tile-body">
      <div class="rf-tile-avatars">
        <button class="rf-tile-avatar" aria-label="Virendra Singh speaker for the '…' session">
          <img class="rf-tile-avatar-img" src="…/Virendra.jpg">
        </button>
        <!-- there may be multiple avatars -->
      </div>
      <h4 class="rf-tile-title">
        <a>session title, with numbering</a>
      </h4>
      <p class="rf-tile-info rf-tile-line-two">……Session Description……</p>
    </div>
    <!-- Footer area: Learn More button (omitted) -->
  </div>
</div>
  • Each click on "Show more" loads roughly 50 more cards, for a final total of 572

  • Speaker Catalog URL (speaker list page):

https://reg.snowflake.com/flow/snowflake/summit25/speakers/page/catalog
  • The page loads in one go (pagination/lazy loading is not a concern here)

  • Each speaker card is a speaker-tile-container element

containing the name, job title, and company:

<div class="speaker-tile-container" tabindex="-1">
  <div class="attendee-tile no-border">
    <div class="attendee-tile-image no-avatar" role="button">
      <img src="…/christiankleinerman.jpg" alt="Christian Kleinerman">
    </div>
    <div class="attendee-tile-text-container">
      <button class="attendee-tile-name" aria-label="Christian Kleinerman">Christian Kleinerman</button>
      <p class="attendee-tile-role">
        <span class="attendee-tile-role-job-title">EVP of Product</span><br>
        <span class="attendee-tile-role-company">Snowflake</span>
      </p>
    </div>
  </div>
</div>

Once the DOM of these two pages is captured, every detail of the 572 sessions and 160+ speakers is within reach.
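Before wiring up Selenium, the selectors can be sanity-checked offline. The sketch below feeds a trimmed copy of the sample speaker card above through Python's stdlib html.parser; the HTML string is illustrative, not live page data:

```python
# Driver-free sanity check: parse the sample speaker-card HTML with the
# stdlib html.parser to confirm the class names we will target with Selenium.
from html.parser import HTMLParser

SAMPLE = """
<div class="speaker-tile-container">
  <button class="attendee-tile-name" aria-label="Christian Kleinerman">Christian Kleinerman</button>
  <span class="attendee-tile-role-job-title">EVP of Product</span>
  <span class="attendee-tile-role-company">Snowflake</span>
</div>
"""

class SpeakerCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._field = None   # which field the next text chunk belongs to
        self.record = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if "attendee-tile-name" in cls:
            self._field = "name"
        elif "attendee-tile-role-job-title" in cls:
            self._field = "job_title"
        elif "attendee-tile-role-company" in cls:
            self._field = "company"

    def handle_data(self, data):
        if self._field and data.strip():
            self.record[self._field] = data.strip()
            self._field = None

parser = SpeakerCardParser()
parser.feed(SAMPLE)
print(parser.record)
# {'name': 'Christian Kleinerman', 'job_title': 'EVP of Product', 'company': 'Snowflake'}
```

If the class names on the live site change, this cheap check fails long before a full Selenium run does.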

Setup: environment and dependencies

1. Python version

    Python 3.8 or later is recommended.

2. Install Selenium

pip install selenium

3. Download ChromeDriver

  • The version must match your local Chrome. It can usually be downloaded from https://chromedriver.chromium.org/ and unzipped somewhere on your PATH. (Selenium 4.6+ also ships with Selenium Manager, which can fetch a matching driver automatically.)

  • Windows users can drop it into C:\Windows\ or pass an absolute path in the script.
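A quick way to confirm the driver is actually reachable before running anything; this stdlib-only sketch assumes the default executable name "chromedriver" (adjust if yours differs):

```python
# Check whether a chromedriver binary is visible on PATH (stdlib only).
# The executable name "chromedriver" is the usual default; adjust if needed.
import shutil
from typing import Optional

def find_chromedriver() -> Optional[str]:
    """Return the full path to chromedriver if it is on PATH, else None."""
    return shutil.which("chromedriver")

path = find_chromedriver()
print(path or "chromedriver not on PATH - pass an absolute path to webdriver.Chrome instead")
```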

Step 1: scrape speaker details (Speakers)

We first start a dedicated WebDriver just for the Speaker Catalog page, collect every speaker's name, job title, and company, and store them in a dict for easy lookup later.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import re

def clean_text(raw: str) -> str:
    """
    Clean a string: drop characters outside printable ASCII and
    normalize spacing around dashes.
    """
    ascii_only = re.sub(r"[^\x20-\x7E]", "", raw)
    ascii_only = re.sub(r"\?\-|-\-", "-", ascii_only)
    ascii_only = re.sub(r"\s*-\s*", " - ", ascii_only)
    return ascii_only.strip()

def build_speaker_details_map(speakers_url: str, timeout: int = 30) -> dict:
    """
    1. Open the Speaker Catalog page
    2. Wait for at least one speaker container to appear
    3. Iterate over all .speaker-tile-container elements and pull out name, job title, company
    4. Return { "Speaker Name": {"job_title": "...", "company": "..."}, ... }
    """
    # 0. Start the WebDriver
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)

    speaker_map = {}
    try:
        driver.get(speakers_url)
        # Wait until at least one speaker card has rendered
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.speaker-tile-container"))
        )

        # Collect all .speaker-tile-container elements
        containers = driver.find_elements(By.CSS_SELECTOR, "div.speaker-tile-container")
        print(f"[Speaker] Found {len(containers)} speakers in total.")

        for cont in containers:
            # Name
            try:
                name_btn = cont.find_element(By.CSS_SELECTOR, "button.attendee-tile-name")
                raw_name = name_btn.get_attribute("aria-label") or name_btn.text.strip()
                name = clean_text(raw_name)
            except NoSuchElementException:
                continue

            # Job title
            try:
                job_elem = cont.find_element(By.CSS_SELECTOR, "span.attendee-tile-role-job-title")
                job_title = clean_text(job_elem.text.strip())
            except NoSuchElementException:
                job_title = ""

            # Company
            try:
                comp_elem = cont.find_element(By.CSS_SELECTOR, "span.attendee-tile-role-company")
                company = clean_text(comp_elem.text.strip())
            except NoSuchElementException:
                company = ""

            speaker_map[name] = {
                "job_title": job_title,
                "company": company
            }
    except TimeoutException:
        print("[Speaker] Timed out without locating any speaker cards; check the URL or the selectors.")
    finally:
        driver.quit()

    return speaker_map

if __name__ == "__main__":
    speakers_url = "https://reg.snowflake.com/flow/snowflake/summit25/speakers/page/catalog"
    speaker_details_map = build_speaker_details_map(speakers_url, timeout=30)

    # Print the first 5 speakers as a sanity check
    for i, (k, v) in enumerate(speaker_details_map.items()):
        if i >= 5: break
        print(f"{k} - job title: {v['job_title']} ; company: {v['company']}")

This walks through every speaker and fills the Python dict speaker_details_map, e.g.:

{
  "Sridhar Ramaswamy": {"job_title": "Chief Executive Officer", "company": "Snowflake"},
  "Christian Kleinerman": {"job_title": "EVP of Product",     "company": "Snowflake"},
  ...
}

If the live page used pagination or its own "Show more" logic, we would need a click loop here too; in this case, though, the Speakers page loads everything at once.
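Lookups against this dict should never crash on a missing name. A small sketch of the pattern used later when enriching sessions (the sample entry is illustrative):

```python
# Safe lookup into the speaker map: unknown names yield empty strings
# instead of raising KeyError. Sample data is illustrative.
speaker_details_map = {
    "Sridhar Ramaswamy": {"job_title": "Chief Executive Officer", "company": "Snowflake"},
}

def lookup_speaker(speaker_map: dict, name: str) -> tuple:
    info = speaker_map.get(name, {})
    return info.get("job_title", ""), info.get("company", "")

print(lookup_speaker(speaker_details_map, "Sridhar Ramaswamy"))  # ('Chief Executive Officer', 'Snowflake')
print(lookup_speaker(speaker_details_map, "Unknown Person"))     # ('', '')
```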

Step 2: load the session list (Sessions) and extract the core fields

Next, a second, independent WebDriver visits the Session Catalog page to collect all 572 session cards. The key steps: scroll and click "Show more" in a loop until the card count stops growing, then iterate over every <div class="rf-tile-wrapper"> and extract:

  • Track (split out of <img src="…_Breakout-Session_…png">)

  • Session Title (h4.rf-tile-title > a)

  • Description (p.rf-tile-info.rf-tile-line-two)

  • Speakers (the name from the aria-label of each button.rf-tile-avatar, plus the avatar URL from the nested <img>)
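Both the Track split and the aria-label regex can be exercised without a browser. This sketch uses the same logic as the scraper below; the sample URL and label are made up:

```python
import re

def parse_track(src: str) -> str:
    # The track sits between the first and second underscore of the banner image URL
    return src.split("_")[1] if "_" in src else ""

def parse_speaker_name(aria_label: str) -> str:
    # aria-label format: "<Name> speaker for the '<Session>' session"
    m = re.match(r"^(.+?) speaker\s+for\s+the\s+'(.+)' session$", aria_label)
    return m.group(1).strip() if m else ""

print(parse_track("https://example.com/img/Summit25_Breakout-Session_v2.png"))
# Breakout-Session
print(parse_speaker_name("Virendra Singh speaker for the 'Data Pipelines' session"))
# Virendra Singh
```

In the real scraper the track string additionally passes through clean_text, which reformats the dash spacing.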

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import (
    NoSuchElementException,
    ElementClickInterceptedException,
    TimeoutException
)
import time
import re
import csv  # needed for the CSV export in the __main__ block below

def clean_text(raw: str) -> str:
    ascii_only = re.sub(r"[^\x20-\x7E]", "", raw)
    ascii_only = re.sub(r"\?\-|-\-", "-", ascii_only)
    ascii_only = re.sub(r"\s*-\s*", " - ", ascii_only)
    return ascii_only.strip()

def expand_all_sessions(driver, timeout=15):
    """
    Click "Show more" in a loop until no new items appear.
    """
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.rf-tile-wrapper"))
    )
    last_count = len(driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper"))
    print(f"[Session] Initial session card count: {last_count}")

    while True:
        try:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)

            btn = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, "button.show-more-btn"))
            )
            btn.click()

            try:
                WebDriverWait(driver, timeout).until(
                    lambda d: len(d.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper")) > last_count
                )
                new_count = len(driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper"))
                print(f"[Session] Loaded more sessions; current total: {new_count}")
                last_count = new_count
            except TimeoutException:
                print("[Session] Nothing new loaded; all sessions are in.")
                break
        except (TimeoutException, NoSuchElementException):
            print("[Session] 'Show more' button not found or timed out; exiting.")
            break
        except ElementClickInterceptedException:
            print("[Session] Click intercepted; retrying...")
            time.sleep(2)
            continue

def scrape_sessions(driver, speaker_map: dict):
    """
    Extract the fields below from every session card and enrich each speaker
    with job title/company from speaker_map:
    - track, session_title, description, speakers: [ {"name","avatar_url","job_title","company"}, ... ]
    """
    sessions_data = []
    wrappers = driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper")
    print(f"[Session] Scraped {len(wrappers)} session cards in total.")

    for sess in wrappers:
        # ——— Track ———
        try:
            img = sess.find_element(By.CSS_SELECTOR, "div.rf-tile-banner img")
            src = img.get_attribute("src") or ""
            track = clean_text(src.split("_")[1] if "_" in src else "")
        except NoSuchElementException:
            track = ""

        # ——— Session Title ———
        try:
            raw_title = sess.find_element(By.CSS_SELECTOR, "h4.rf-tile-title a").text.strip()
            session_title = clean_text(raw_title)
        except NoSuchElementException:
            session_title = ""

        # ——— Description ———
        try:
            raw_desc = sess.find_element(By.CSS_SELECTOR, "p.rf-tile-info.rf-tile-line-two").text.strip()
            description = clean_text(raw_desc)
        except NoSuchElementException:
            description = ""

        # ——— Speakers ———
        speakers = []
        avatar_buttons = sess.find_elements(By.CSS_SELECTOR, "button.rf-tile-avatar")
        for btn in avatar_buttons:
            aria = btn.get_attribute("aria-label") or ""
            m = re.match(r"^(.+?) speaker\s+for\s+the\s+'(.+)' session$", aria)
            raw_name = m.group(1).strip() if m else ""
            name = clean_text(raw_name)

            try:
                avatar_img = btn.find_element(By.CSS_SELECTOR, "img.rf-tile-avatar-img")
                avatar_url = avatar_img.get_attribute("src")
            except NoSuchElementException:
                avatar_url = ""

            # Look up job title/company in speaker_map
            job_title = ""
            company   = ""
            if name in speaker_map:
                job_title = speaker_map[name].get("job_title", "")
                company   = speaker_map[name].get("company", "")

            if name:
                speakers.append({
                    "name": name,
                    "avatar_url": avatar_url,
                    "job_title": job_title,
                    "company": company
                })

        sessions_data.append({
            "track"        : track,
            "session_title": session_title,
            "description"  : description,
            "speakers"     : speakers
        })

    return sessions_data

if __name__ == "__main__":
    # —— Step 1: scrape the speaker details first (build_speaker_details_map from the previous section) —— 
    speakers_url = "https://reg.snowflake.com/flow/snowflake/summit25/speakers/page/catalog"
    speaker_map = build_speaker_details_map(speakers_url, timeout=30)

    # —— Step 2: scrape the sessions —— 
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)

    try:
        sessions_url = "https://reg.snowflake.com/flow/snowflake/summit25/sessions/page/catalog?tab.sessioncatalogtab=1714168666431001NNiH"
        driver.get(sessions_url)
        expand_all_sessions(driver, timeout=30)
        sessions_data = scrape_sessions(driver, speaker_map)
    finally:
        driver.quit()

    # —— Step 3: write the results to CSV —— 
    with open("sessions_full_with_speaker_details.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = [
            "Track",
            "Session Title",
            "Description",
            "Speaker Names",
            "Speaker Avatars",
            "Speaker Job Titles",
            "Speaker Companies"
        ]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        for sess in sessions_data:
            names      = "; ".join([sp["name"] for sp in sess["speakers"]])
            avatars    = "; ".join([sp["avatar_url"] for sp in sess["speakers"]])
            job_titles = "; ".join([sp["job_title"] for sp in sess["speakers"]])
            companies  = "; ".join([sp["company"] for sp in sess["speakers"]])

            writer.writerow({
                "Track"             : sess["track"],
                "Session Title"     : sess["session_title"],
                "Description"       : sess["description"],
                "Speaker Names"     : names,
                "Speaker Avatars"   : avatars,
                "Speaker Job Titles": job_titles,
                "Speaker Companies" : companies
            })

    print(f"Extracted {len(sessions_data)} sessions (with speaker details) and saved them to sessions_full_with_speaker_details.csv")
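The stop condition in expand_all_sessions ("keep clicking until the count stops growing") can be isolated and tested without a browser. A minimal sketch with a stubbed loader standing in for the "Show more" click; expand_until_stable, fake_load_more, and the batch sizes are hypothetical names for illustration:

```python
# Generic "expand until stable" loop: call load_more() until the item
# count stops increasing. load_more here is a stub that mimics a page
# serving ~50 cards per click up to a fixed total.
def expand_until_stable(count_items, load_more, max_rounds: int = 100) -> int:
    last = count_items()
    for _ in range(max_rounds):
        load_more()
        new = count_items()
        if new <= last:          # nothing new appeared -> fully expanded
            break
        last = new
    return last

items = list(range(50))          # pretend the first 50 cards are already loaded

def fake_load_more(total: int = 172, batch: int = 50):
    items.extend(range(len(items), min(len(items) + batch, total)))

final = expand_until_stable(lambda: len(items), fake_load_more)
print(final)  # 172
```

The real function adds scrolling, explicit waits, and retry handling around the same skeleton.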

Recap of the core ideas:

  • Two independent WebDriver instances handle the two pages separately, avoiding JS-framework conflicts:

    • build_speaker_details_map(...) → scrapes every speaker's name, job title, and company.

    • scrape_sessions(...) → scrapes every session card, then uses each speaker's name as the key to pull job_title and company from the dict built earlier

  • The "Show more" loop stops only once the card count stops growing, which guarantees all 572 sessions are captured. The current card count is printed before and after every click, which helps with debugging and tracking progress.

  • String cleaning (clean_text):

    • strips mojibake such as "鈥淪/鈥?", keeping only ASCII 32–126

    • normalizes spacing around dashes and question marks

    • (deliberately simple; removal only)

  • Data output

      • Seven CSV columns: Track / Session Title / Description / Speaker Names / Speaker Avatars / Speaker Job Titles / Speaker Companies

      • A session may have several speakers; their values are joined with a semicolon (;)
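The effect of clean_text on a mojibake-laden title can be checked directly; the sample string below is fabricated to mimic the "鈥" artifacts:

```python
import re

def clean_text(raw: str) -> str:
    # Same steps as in the article: keep printable ASCII, collapse
    # "?-"/"--" into "-", then normalize spacing around dashes.
    ascii_only = re.sub(r"[^\x20-\x7E]", "", raw)
    ascii_only = re.sub(r"\?\-|-\-", "-", ascii_only)
    ascii_only = re.sub(r"\s*-\s*", " - ", ascii_only)
    return ascii_only.strip()

print(clean_text("Keynote鈥?- AI at Scale"))  # Keynote - AI at Scale
print(clean_text("A -- B"))                   # A - B
```

Note the trade-off: because everything outside 0x20–0x7E is dropped, any legitimate non-ASCII characters in titles are lost as well.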

Step 3: export to CSV

Speaker Names, Speaker Avatars, Speaker Job Titles, and Speaker Companies each pack the values for multiple speakers into a single cell joined with "; ", which makes them easy to split apart later in Excel or pandas.
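Reading those cells back is then a one-liner per column; a small sketch with plain str.split (pandas' Series.str.split works the same way):

```python
# Split a "; "-joined cell back into per-speaker values.
cell = "Sridhar Ramaswamy; Christian Kleinerman"
names = [part.strip() for part in cell.split(";") if part.strip()]
print(names)  # ['Sridhar Ramaswamy', 'Christian Kleinerman']
```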

    with open("sessions_full_with_speaker_details.csv", "w", newline="", encoding="utf-8") as csvfile:
        fieldnames = [
            "Track",
            "Session Title",
            "Description",
            "Speaker Names",
            "Speaker Avatars",
            "Speaker Job Titles",
            "Speaker Companies"
        ]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
    
        for sess in sessions_data:
            names      = "; ".join([sp["name"] for sp in sess["speakers"]])
            avatars    = "; ".join([sp["avatar_url"] for sp in sess["speakers"]])
            job_titles = "; ".join([sp["job_title"] for sp in sess["speakers"]])
            companies  = "; ".join([sp["company"] for sp in sess["speakers"]])
    
            writer.writerow({
                "Track"             : sess["track"],
                "Session Title"     : sess["session_title"],
                "Description"       : sess["description"],
                "Speaker Names"     : names,
                "Speaker Avatars"   : avatars,
                "Speaker Job Titles": job_titles,
                "Speaker Companies" : companies
            })
    

The resulting CSV looks like this (some data has been removed as required):

Feel free to discuss scraping techniques in the comments; I'm a scraping beginner myself and happy to learn from the experts there.

Appendix: complete code

    # -*- coding: utf-8 -*-
    """
    One script that:
      1. Scrapes the Speakers page into a {name: {job_title, company}} dict
      2. Scrapes the Sessions page, clicking "Show more" until all 572 sessions load
      3. Cleans mojibake out of titles / descriptions / speaker names
      4. Adds job title + company to every speaker of every session
      5. Writes everything to sessions_full_with_speaker_details.csv
    """
    
    import re
    import time
    import csv
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import (
        NoSuchElementException,
        ElementClickInterceptedException,
        TimeoutException
    )
    
    def clean_text(raw: str) -> str:
        ascii_only = re.sub(r"[^\x20-\x7E]", "", raw)
        ascii_only = re.sub(r"\?\-|-\-", "-", ascii_only)
        ascii_only = re.sub(r"\s*-\s*", " - ", ascii_only)
        return ascii_only.strip()
    
    def build_speaker_details_map(speakers_url: str, timeout: int = 30) -> dict:
        """
        Scrape the Speaker Catalog; return { "Name": {"job_title": "...", "company": "..."}, ... }
        """
        options = Options()
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")
        driver = webdriver.Chrome(options=options)
    
        speaker_map = {}
        try:
            driver.get(speakers_url)
            WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "div.speaker-tile-container"))
            )
            containers = driver.find_elements(By.CSS_SELECTOR, "div.speaker-tile-container")
            print(f"[Speaker] Found {len(containers)} speaker cards.")
    
            for cont in containers:
                try:
                    name_btn = cont.find_element(By.CSS_SELECTOR, "button.attendee-tile-name")
                    raw_name = name_btn.get_attribute("aria-label") or name_btn.text.strip()
                    name = clean_text(raw_name)
                except NoSuchElementException:
                    continue
    
                try:
                    job_elem = cont.find_element(By.CSS_SELECTOR, "span.attendee-tile-role-job-title")
                    job_title = clean_text(job_elem.text.strip())
                except NoSuchElementException:
                    job_title = ""
    
                try:
                    comp_elem = cont.find_element(By.CSS_SELECTOR, "span.attendee-tile-role-company")
                    company = clean_text(comp_elem.text.strip())
                except NoSuchElementException:
                    company = ""
    
                speaker_map[name] = {"job_title": job_title, "company": company}
    
        except TimeoutException:
            print("[Speaker] Timed out; could not locate any speaker cards!")
        finally:
            driver.quit()
    
        return speaker_map
    
    def expand_all_sessions(driver, timeout: int = 30):
        """
        Click "Show more" in a loop until the number of session cards stops growing.
        """
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.rf-tile-wrapper"))
        )
        last_count = len(driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper"))
        print(f"[Session] Initial session count: {last_count}")
    
        while True:
            try:
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(1)
    
                btn = WebDriverWait(driver, timeout).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, "button.show-more-btn"))
                )
                btn.click()
    
                try:
                    WebDriverWait(driver, timeout).until(
                        lambda d: len(d.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper")) > last_count
                    )
                    new_count = len(driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper"))
                    print(f"[Session] Loaded more sessions; current total: {new_count}")
                    last_count = new_count
                except TimeoutException:
                    print("[Session] Nothing more to load; fully expanded.")
                    break
            except (TimeoutException, NoSuchElementException):
                print("[Session] 'Show more' button not found or timed out; leaving the loop.")
                break
            except ElementClickInterceptedException:
                print("[Session] Click intercepted; retrying shortly...")
                time.sleep(2)
                continue
    
    def scrape_sessions(driver, speaker_map: dict) -> list:
        """
        Extract all session info and add job title + company for every speaker. Returns a list of dicts.
        """
        data = []
        wrappers = driver.find_elements(By.CSS_SELECTOR, "div.rf-tile-wrapper")
        print(f"[Session] Scraped {len(wrappers)} session cards.")
    
        for sess in wrappers:
            # 1. Track
            try:
                img = sess.find_element(By.CSS_SELECTOR, "div.rf-tile-banner img")
                src = img.get_attribute("src") or ""
                track = clean_text(src.split("_")[1] if "_" in src else "")
            except NoSuchElementException:
                track = ""
    
            # 2. Session Title
            try:
                raw_title = sess.find_element(By.CSS_SELECTOR, "h4.rf-tile-title a").text.strip()
                session_title = clean_text(raw_title)
            except NoSuchElementException:
                session_title = ""
    
            # 3. Description
            try:
                raw_desc = sess.find_element(By.CSS_SELECTOR, "p.rf-tile-info.rf-tile-line-two").text.strip()
                description = clean_text(raw_desc)
            except NoSuchElementException:
                description = ""
    
            # 4. Speakers
            speakers = []
            avatar_buttons = sess.find_elements(By.CSS_SELECTOR, "button.rf-tile-avatar")
            for btn in avatar_buttons:
                aria = btn.get_attribute("aria-label") or ""
                m = re.match(r"^(.+?) speaker\s+for\s+the\s+'(.+)' session$", aria)
                raw_name = m.group(1).strip() if m else ""
                name = clean_text(raw_name)
    
                try:
                    avatar_img = btn.find_element(By.CSS_SELECTOR, "img.rf-tile-avatar-img")
                    avatar_url = avatar_img.get_attribute("src")
                except NoSuchElementException:
                    avatar_url = ""
    
                job_title = speaker_map.get(name, {}).get("job_title", "")
                company   = speaker_map.get(name, {}).get("company", "")
    
                if name:
                    speakers.append({
                        "name": name,
                        "avatar_url": avatar_url,
                        "job_title": job_title,
                        "company": company
                    })
    
            data.append({
                "track"        : track,
                "session_title": session_title,
                "description"  : description,
                "speakers"     : speakers
            })
    
        return data
    
    if __name__ == "__main__":
        # —— 1. Scrape the speaker details —— 
        speakers_url = "https://reg.snowflake.com/flow/snowflake/summit25/speakers/page/catalog"
        speaker_map = build_speaker_details_map(speakers_url, timeout=30)
    
        # —— 2. Scrape the sessions —— 
        options = Options()
        options.add_argument("--headless")
        options.add_argument("--disable-gpu")
        options.add_argument("--no-sandbox")
        driver = webdriver.Chrome(options=options)
    
        try:
            sessions_url = "https://reg.snowflake.com/flow/snowflake/summit25/sessions/page/catalog?tab.sessioncatalogtab=1714168666431001NNiH"
            driver.get(sessions_url)
            expand_all_sessions(driver, timeout=30)
            sessions_data = scrape_sessions(driver, speaker_map)
        finally:
            driver.quit()
    
        # —— 3. Export to CSV —— 
        with open("sessions_full_with_speaker_details.csv", "w", newline="", encoding="utf-8") as csvfile:
            fieldnames = [
                "Track",
                "Session Title",
                "Description",
                "Speaker Names",
                "Speaker Avatars",
                "Speaker Job Titles",
                "Speaker Companies"
            ]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
    
            for sess in sessions_data:
                names      = "; ".join([sp["name"] for sp in sess["speakers"]])
                avatars    = "; ".join([sp["avatar_url"] for sp in sess["speakers"]])
                job_titles = "; ".join([sp["job_title"] for sp in sess["speakers"]])
                companies  = "; ".join([sp["company"] for sp in sess["speakers"]])
    
                writer.writerow({
                    "Track"             : sess["track"],
                    "Session Title"     : sess["session_title"],
                    "Description"       : sess["description"],
                    "Speaker Names"     : names,
                    "Speaker Avatars"   : avatars,
                    "Speaker Job Titles": job_titles,
                    "Speaker Companies" : companies
                })
    
        print(f"Extracted {len(sessions_data)} sessions (with speaker details); saved as sessions_full_with_speaker_details.csv")
    
