[Python] Scraping the email addresses of every journal on 万维书刊网

Author: CC下载站 | Date: 2021-07-30 | Views: 78 | Category: Programming

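I had a paper to write and submit, but some journals charge a fee for submissions, so I scraped the email addresses of all the journals that a certain website lists as free of page charges.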

I then wrote a short script to scrape them in bulk and save the results to local CSV files, so that the emails can be sent in batches later.
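
As a side note on the CSV step, here is a minimal sketch of writing such rows with the standard `csv` module, which quotes any cell that contains a comma. The helper name `write_contact_rows` and the sample row are hypothetical and not part of the original script; the only assumption carried over is that each record is a title plus a dict of contact fields.

```python
import csv
import io


def write_contact_rows(handle, rows):
    """Write (title, contact) pairs as CSV, one cell per contact field."""
    writer = csv.writer(handle)
    for title, contact in rows:
        # Flatten the dict into "key=value" cells; csv.writer quotes any
        # cell that contains a comma, so the columns stay aligned.
        writer.writerow([title] + [f'{k}={v}' for k, v in contact.items()])


# Hypothetical sample row, not real data from the site.
sample = [('Example Journal, Series A', {'Email': 'editor@example.org'})]
buffer = io.StringIO()
write_contact_rows(buffer, sample)
print(buffer.getvalue())
```

When writing to a real file rather than a `StringIO`, the `csv` documentation recommends opening it with `newline=''`.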

Because there are many categories and scraping is slow, I went straight to multithreading.

# -*- coding: utf-8 -*-
"""
-------------------------------------------------
@ Author ：Lan
@ Blog ：www.lanol.cn
@ Date ： 2021/7/30
@ Description：I'm in charge of my Code
-------------------------------------------------
"""
import random
import time

import requests
import parsel
import threading


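def start_down(target, value):
    # Fetch one category page and collect the links to its journal detail pages.
    html = parsel.Selector(requests.get(f'http://*.com/{target}').text)
    # NOTE: the tail of this XPath was garbled in the source ("[email protected]");
    # "/a/@href" is a reconstruction and may need adjusting to the real markup.
    tou_di_url = html.xpath("//li[@class='bu']/a/@href").extract()
    with open(f'{value.replace("/", "-")}.csv', 'a+', encoding='gbk') as f:
        for content_url in tou_di_url:
            try:
                content_html = parsel.Selector(requests.get(f'http://*.com/{content_url}').text)
                title = content_html.xpath(
                    "//div[@class='jjianjie']/div[@class='jjianjietitle']/h1[@class='jname']/text()").extract_first()
                # Keep only journals whose title mentions email submission.
                # 'Email投稿' is matched against the page text, so it stays in Chinese.
                if title and 'Email投稿' in title:
                    # Pair the text of the second <p> elements (keys) with the third <p> elements (values).
                    contact = dict(zip((i.replace(' ', '').replace('\r', '').replace('\n', '') for i in
                                        content_html.xpath("//div[@class='sclistclass']//p[2]/text()").extract()),
                                       (i.replace(' ', '').replace('\r', '').replace('\n', '') for i in
                                        content_html.xpath("//div[@class='sclistclass']//p[3]/text()").extract())))
                    print(title, contact)
                    f.write(f'{title},{contact}\n')
                    # Flush before sleeping so the row reaches disk right away.
                    f.flush()
                    time.sleep(random.randint(1, 4))
            except Exception as e:
                # Report the failure, then back off before the next journal.
                print(f'Failed to process {content_url}: {e}')
                time.sleep(random.randint(1, 4))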


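if __name__ == '__main__':
    url = 'http://*.com/NoLayoutFee.aspx?pg=1&hxid=8&typeid=27'
    type_html = parsel.Selector(requests.get(url).text)
    # Category names and category links come from the same <a> elements.
    types = type_html.xpath("//div[@class='typenamelist']/p/a/text()").extract()
    # NOTE: this XPath was garbled in the source ("[email protected]");
    # "/p/a/@href" is reconstructed from the matching text() query above.
    urls = type_html.xpath("//div[@class='typenamelist']/p/a/@href").extract()
    # Start one thread per category.
    for index, value in enumerate(types):
        print(f'Collecting category {value}')
        threading.Thread(target=start_down, args=(urls[index], value)).start()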
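
The script starts one unbounded thread per category and never waits for them explicitly. As an alternative (not the author's method), here is a rough sketch using `concurrent.futures.ThreadPoolExecutor` to cap the number of concurrent requests and wait for every task to finish. `crawl_category` is a hypothetical stand-in for `start_down`, and the paths and names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_category(path, name):
    """Hypothetical stand-in for start_down; it only returns a label here."""
    return f'{name}: {path}'


def crawl_all(paths, names, max_workers=4):
    """Run one crawl task per category on a bounded pool and wait for all."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() pairs each path with its name and preserves input order.
        return list(pool.map(crawl_category, paths, names))


if __name__ == '__main__':
    # Placeholder paths and names, not real values from the site.
    print(crawl_all(['a.aspx', 'b.aspx'], ['Category A', 'Category B']))
```

A smaller `max_workers` also keeps the load on the target site lower, which fits the random sleeps the script already uses.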