
[Python] [Crawler] Scraping all P2P download links from the MSDN (itellyou) site with Python

Author: CC下载站 | Date: 2020-03-27 | Category: Programming & Development

Today the new MSDN (itellyou) site opened registration. After trying it out, I found that you are now forced to watch a 30-second ad before every download, so I decided to scrape the resources ahead of time for later use.

First, a look at the end result:

1. Site analysis

1.1 Fetching https://msdn.itellyou.cn/ directly yields eight IDs, one for each of the eight categories in the sidebar.

1.2 Each time a category is expanded, a POST request is sent.

The payload is one of the eight IDs obtained above.

1.3 The response to this request contains another set of IDs together with the corresponding resource names, as the sketch below shows.
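Steps 1.1 through 1.3 can be sanity-checked with plain requests before committing to a spider. This is a minimal sketch, not the article's code; the data-menuguid attribute name is an assumption (the XPath in the original write-up was garbled by the page's email-obfuscation script), so confirm it against the sidebar links in your browser's dev tools:

import re

import requests

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'

# Step 1.1: pull the eight category GUIDs off the home-page sidebar.
# data-menuguid is an assumption -- verify the attribute name in dev tools.
html = session.get('https://msdn.itellyou.cn/').text
category_ids = re.findall(r'data-menuguid="([^"]+)"', html)
print(len(category_ids))  # expect 8

# Steps 1.2/1.3: POSTing one of those IDs to Category/Index returns the
# resources in that category, each with its own id and name.
resources = session.post('https://msdn.itellyou.cn/Category/Index',
                         data={'id': category_ids[0]}).json()
for r in resources:
    print(r['id'], r['name'])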

1.4 Clicking to expand a resource reveals two more POST requests:


1.4.1 The first, GetLang, fetches the languages available for the resource. It sends an ID, and its response contains yet another ID, which is the lang value used below.

1.4.2 The second, GetList, takes three parameters:

(1) id: comparing values shows this is the same resource ID we have been using all along.

(2) lang: short for language; its value comes from the GetLang response above.

(3) filter: corresponds to whether the checkbox in the bottom-left corner of the page is ticked.

1.4.3 At this point the response already contains the download addresses:
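Continuing the requests sketch from step 1.3, steps 1.4.1 through 1.4.3 chain two more POSTs; the id/lang/filter parameters mirror what the spider code below sends:

# Continues the session from the earlier sketch.
resource_id = resources[0]['id']  # one id from the Category/Index response

# Step 1.4.1: GetLang maps a resource id to its available languages.
langs = session.post('https://msdn.itellyou.cn/Category/GetLang',
                     data={'id': resource_id}).json()['result']

# Steps 1.4.2/1.4.3: GetList takes the resource id, one language id and the
# filter flag; its result holds the final download names and URLs.
for lang in langs:
    listing = session.post('https://msdn.itellyou.cn/Category/GetList',
                           data={'id': resource_id,
                                 'lang': lang['id'],
                                 'filter': 'true'}).json()['result']
    for entry in listing:
        print(entry['name'], entry['url'])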

That concludes the analysis; time to write the code.

2. For speed I went with the Scrapy framework. The code follows.

爬虫.py (the spider):

# -*- coding: utf-8 -*-
import json

import scrapy

from msdn.items import MsdnItem


class MsdndownSpider(scrapy.Spider):
    name = 'msdndown'
    allowed_domains = ['msdn.itellyou.cn']
    start_urls = ['http://msdn.itellyou.cn/']

    def parse(self, response):
        # The attribute name in the original post was mangled by an
        # email-obfuscation script; data-menuguid is a reconstruction --
        # verify it against the live page.
        self.index = response.xpath(
            '//h4[@class="panel-title"]/a/@data-menuguid').extract()
        # self.index_title = response.xpath('//h4[@class="panel-title"]/a/text()').extract()
        url = 'https://msdn.itellyou.cn/Category/Index'
        for i in self.index:
            yield scrapy.FormRequest(url=url, formdata={'id': i}, dont_filter=True,
                                     callback=self.Get_Lang, meta={'id': i})

    def Get_Lang(self, response):
        id_info = json.loads(response.text)
        url = 'https://msdn.itellyou.cn/Category/GetLang'
        for i in id_info:  # iterate over the resource list
            lang = i['id']  # resource ID
            title = i['name']  # resource name
            # Next request: fetch the list of language IDs for this resource
            yield scrapy.FormRequest(url=url, formdata={'id': lang}, dont_filter=True,
                                     callback=self.Get_List,
                                     meta={'id': lang, 'title': title})

    def Get_List(self, response):
        lang = json.loads(response.text)['result']
        id = response.meta['id']
        title = response.meta['title']
        url = 'https://msdn.itellyou.cn/Category/GetList'
        # Skip resources with no languages; otherwise fetch the download links
        if len(lang) != 0:
            # Iterate over the language IDs
            for i in lang:
                data = {
                    'id': id,
                    'lang': i['id'],
                    'filter': 'true'
                }
                yield scrapy.FormRequest(url=url, formdata=data, dont_filter=True,
                                         callback=self.Get_Down,
                                         meta={'name': title, 'lang': i['lang']})

    def Get_Down(self, response):
        response_json = json.loads(response.text)['result']
        # Yield one item per download link; the original returned a single
        # item after the loop, which kept only the last link per response.
        for i in response_json:
            item = MsdnItem()
            item['name'] = i['name']
            item['url'] = i['url']
            print(i['name'] + "--------------" + i['url'])  # debug output so the run doesn't look idle
            yield item

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MsdnItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    url = scrapy.Field()

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for msdn project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'msdn'

SPIDER_MODULES = ['msdn.spiders']
NEWSPIDER_MODULE = 'msdn.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'msdn (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'msdn.middlewares.MsdnSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'msdn.middlewares.MsdnDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'msdn.pipelines.MsdnPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MsdnPipeline(object):
    def __init__(self):
        self.file = open('msdnc.csv', 'a+', encoding='utf8')

    def process_item(self, item, spider):
        title = item['name']
        url = item['url']
        self.file.write(title + '*' + url + '\n')
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the crawl ends; the original defined
        # down_item, which is never invoked, so the file handle leaked.
        self.file.close()
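One caveat about the pipeline above: joining the name and URL with a bare '*' breaks as soon as a resource name contains that character. A sketch of the same pipeline written against Python's csv module (the MsdnCsvPipeline name is mine; register whichever class you use in ITEM_PIPELINES) sidesteps this:

# -*- coding: utf-8 -*-
import csv


class MsdnCsvPipeline(object):
    def open_spider(self, spider):
        # newline='' keeps the csv module from doubling line endings on Windows
        self.file = open('msdnc.csv', 'a+', encoding='utf8', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        # csv quoting handles names that contain the delimiter
        self.writer.writerow([item['name'], item['url']])
        return item

    def close_spider(self, spider):
        self.file.close()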

main.py (the launcher):

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'msdndown'])
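With this layout, running python main.py from the directory containing scrapy.cfg starts the crawl; it is equivalent to typing scrapy crawl msdndown in a shell.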

3. The finished package can be downloaded here:


CSDN password: lan666 | Size: 60 KB. Scanned by antivirus software and found clean; download with confidence.
