
[Python] [Crawler] Scraping all P2P download links from the MSDN (itellyou) site with Python

Author: CC下载站 | Date: 2020-03-27 | Category: Programming & Development

Today the new MSDN (itellyou) site opened registration. After trying it out, I found that you are now forced to watch a 30-second ad before every download, so I decided to scrape the resources ahead of time for later use.

First, a look at the end result:

1. Site analysis

1.1 Fetching https://msdn.itellyou.cn/ directly yields eight IDs, one for each of the eight categories in the sidebar.

1.2 Each time a category is expanded, a POST request is sent.

The payload is one of the eight IDs obtained above.

1.3 The response to this request contains another set of IDs together with the corresponding resource names, as the sketch below shows.
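Steps 1.1 through 1.3 can be sanity-checked with plain requests before committing to a spider. This is a minimal sketch, not the article's code; the data-menuguid attribute name is an assumption (the XPath in the original write-up was garbled by the page's email-obfuscation script), so confirm it against the sidebar links in your browser's dev tools:

import re

import requests

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'

# Step 1.1: pull the eight category GUIDs off the home-page sidebar.
# data-menuguid is an assumption -- verify the attribute name in dev tools.
html = session.get('https://msdn.itellyou.cn/').text
category_ids = re.findall(r'data-menuguid="([^"]+)"', html)
print(len(category_ids))  # expect 8

# Steps 1.2/1.3: POSTing one of those IDs to Category/Index returns the
# resources in that category, each with its own id and name.
resources = session.post('https://msdn.itellyou.cn/Category/Index',
                         data={'id': category_ids[0]}).json()
for r in resources:
    print(r['id'], r['name'])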

1.4 Clicking to expand a resource reveals two more POST requests:


1.4.1 The first, GetLang, fetches the languages available for the resource. It sends an ID, and its response contains yet another ID, which is the lang value used below.

1.4.2 The second, GetList, takes three parameters:

(1) id: comparing values shows this is the same resource ID we have been using all along.

(2) lang: short for language; its value comes from the GetLang response above.

(3) filter: corresponds to whether the checkbox in the bottom-left corner of the page is ticked.

1.4.3 At this point the response already contains the download addresses:
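Continuing the requests sketch from step 1.3, steps 1.4.1 through 1.4.3 chain two more POSTs; the id/lang/filter parameters mirror what the spider code below sends:

# Continues the session from the earlier sketch.
resource_id = resources[0]['id']  # one id from the Category/Index response

# Step 1.4.1: GetLang maps a resource id to its available languages.
langs = session.post('https://msdn.itellyou.cn/Category/GetLang',
                     data={'id': resource_id}).json()['result']

# Steps 1.4.2/1.4.3: GetList takes the resource id, one language id and the
# filter flag; its result holds the final download names and URLs.
for lang in langs:
    listing = session.post('https://msdn.itellyou.cn/Category/GetList',
                           data={'id': resource_id,
                                 'lang': lang['id'],
                                 'filter': 'true'}).json()['result']
    for entry in listing:
        print(entry['name'], entry['url'])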

That concludes the analysis; time to write the code.

2. For speed I went with the Scrapy framework. The code follows.

爬虫.py (the spider):

# -*- coding: utf-8 -*-
import json

import scrapy

from msdn.items import MsdnItem


class MsdndownSpider(scrapy.Spider):
    name = 'msdndown'
    allowed_domains = ['msdn.itellyou.cn']
    start_urls = ['http://msdn.itellyou.cn/']

    def parse(self, response):
        # The attribute name in the original post was mangled by an
        # email-obfuscation script; data-menuguid is a reconstruction --
        # verify it against the live page.
        self.index = response.xpath(
            '//h4[@class="panel-title"]/a/@data-menuguid').extract()
        # self.index_title = response.xpath('//h4[@class="panel-title"]/a/text()').extract()
        url = 'https://msdn.itellyou.cn/Category/Index'
        for i in self.index:
            yield scrapy.FormRequest(url=url, formdata={'id': i}, dont_filter=True,
                                     callback=self.Get_Lang, meta={'id': i})

    def Get_Lang(self, response):
        id_info = json.loads(response.text)
        url = 'https://msdn.itellyou.cn/Category/GetLang'
        for i in id_info:  # iterate over the resource list
            lang = i['id']  # resource ID
            title = i['name']  # resource name
            # Next request: fetch the list of language IDs for this resource
            yield scrapy.FormRequest(url=url, formdata={'id': lang}, dont_filter=True,
                                     callback=self.Get_List,
                                     meta={'id': lang, 'title': title})

    def Get_List(self, response):
        lang = json.loads(response.text)['result']
        id = response.meta['id']
        title = response.meta['title']
        url = 'https://msdn.itellyou.cn/Category/GetList'
        # Skip resources with no languages; otherwise fetch the download links
        if len(lang) != 0:
            # Iterate over the language IDs
            for i in lang:
                data = {
                    'id': id,
                    'lang': i['id'],
                    'filter': 'true'
                }
                yield scrapy.FormRequest(url=url, formdata=data, dont_filter=True,
                                         callback=self.Get_Down,
                                         meta={'name': title, 'lang': i['lang']})

    def Get_Down(self, response):
        response_json = json.loads(response.text)['result']
        # Yield one item per download link; the original returned a single
        # item after the loop, which kept only the last link per response.
        for i in response_json:
            item = MsdnItem()
            item['name'] = i['name']
            item['url'] = i['url']
            print(i['name'] + "--------------" + i['url'])  # debug output so the run doesn't look idle
            yield item

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MsdnItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    url = scrapy.Field()

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for msdn project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'msdn'

SPIDER_MODULES = ['msdn.spiders']
NEWSPIDER_MODULE = 'msdn.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'msdn (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'msdn.middlewares.MsdnSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'msdn.middlewares.MsdnDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'msdn.pipelines.MsdnPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MsdnPipeline(object):
    def __init__(self):
        self.file = open('msdnc.csv', 'a+', encoding='utf8')

    def process_item(self, item, spider):
        title = item['name']
        url = item['url']
        self.file.write(title + '*' + url + '\n')
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the crawl ends; the original defined
        # down_item, which is never invoked, so the file handle leaked.
        self.file.close()
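One caveat about the pipeline above: joining the name and URL with a bare '*' breaks as soon as a resource name contains that character. A sketch of the same pipeline written against Python's csv module (the MsdnCsvPipeline name is mine; register whichever class you use in ITEM_PIPELINES) sidesteps this:

# -*- coding: utf-8 -*-
import csv


class MsdnCsvPipeline(object):
    def open_spider(self, spider):
        # newline='' keeps the csv module from doubling line endings on Windows
        self.file = open('msdnc.csv', 'a+', encoding='utf8', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # csv quoting handles names that contain the delimiter
        self.writer.writerow([item['name'], item['url']])
        return item

    def close_spider(self, spider):
        self.file.close()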

main.py (the launcher):

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'msdndown'])
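With this layout, running python main.py from the directory containing scrapy.cfg starts the crawl; it is equivalent to typing scrapy crawl msdndown in a shell.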

3. The finished package can be downloaded here:


CSDN password: lan666 | Size: 60 KB. Scanned by antivirus software and found clean; download with confidence.
