I am scraping the Madrid Assembly's website, which is built with aspx, and I have no idea how to simulate clicks on the links I need in order to get the corresponding politicians from them.
```python
import scrapy


class AsambleaMadrid(scrapy.Spider):

    name = "Asamblea_Madrid"
    start_urls = ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

    def parse(self, response):
        for id in response.css('div#moduloBusqueda div.sangria div.sangria ul li a::attr(id)'):
            target = id.extract()
            url = "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx"
            formdata = {'__EVENTTARGET': target,
'__VIEWSTATE': '/wEPDwUBMA9kFgJmD2QWAgIBD2QWBAIBD2QWAgIGD2QWAmYPZBYCAgMPZBYCAgMPFgIeE1ByZXZpb3VzQ29udHJvbE1vZGULKYgBTWljcm9zb2Z0LlNoYXJlUG9pbnQuV2ViQ29udHJvbHMuU1BDb250cm9sTW9kZSwgTWljcm9zb2Z0LlNoYXJlUG9pbnQsIFZlcnNpb249MTQuMC4wLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49NzFlOWJjZTExMWU5NDI5YwFkAgMPZBYMAgMPZBYGBSZnXzM2ZWEwMzEwXzg5M2RfNGExOV85ZWQxXzg4YTEzM2QwNjQyMw9kFgJmD2QWAgIBDxYCHgtfIUl0ZW1Db3VudAIEFghmD2QWAgIBDw8WBB4PQ29tbWFuZEFyZ3VtZW50BTRHcnVwbyBQYXJsYW1lbnRhcmlvIFBvcHVsYXIgZGUgbGEgQXNhbWJsZWEgZGUgTWFkcmlkHgRUZXh0BTRHcnVwbyBQYXJsYW1lbnRhcmlvIFBvcHVsYXIgZGUgbGEgQXNhbWJsZWEgZGUgTWFkcmlkZGQCAQ9kFgICAQ8PFgQfAgUeR3J1cG8gUGFybGFtZW50YXJpbyBTb2NpYWxpc3RhHwMFHkdydXBvIFBhcmxhbWVudGFyaW8gU29jaWFsaXN0YWRkAgIPZBYCAgEPDxYEHwIFL0dydXBvIFBhcmxhbWVudGFyaW8gUG9kZW1vcyBDb211bmlkYWQgZGUgTWFkcmlkHwMFL0dydXBvIFBhcmxhbWVudGFyaW8gUG9kZW1vcyBDb211bmlkYWQgZGUgTWFkcmlkZGQCAw9kFgICAQ8PFgQfAgUhR3J1cG8gUGFybGFtZW50YXJpbyBkZSBDaXVkYWRhbm9zHwMFIUdydXBvIFBhcmxhbWVudGFyaW8gZGUgQ2l1ZGFkYW5vc2RkBSZnX2MxNTFkMGIxXzY2YWZfNDhjY185MWM3X2JlOGUxMTZkN2Q1Mg9kFgRmDxYCHgdWaXNpYmxlaGQCAQ8WAh8EaGQFJmdfZTBmYWViMTVfOGI3Nl80MjgyX2ExYjFfNTI3ZDIwNjk1ODY2D2QWBGYPFgIfBGhkAgEPFgIfBGhkAhEPZBYCAgEPZBYEZg9kFgICAQ8WAh8EaBYCZg9kFgQCAg9kFgQCAQ8WAh8EaGQCAw8WCB4TQ2xpZW50T25DbGlja1NjcmlwdAW7AWphdmFTY3JpcHQ6Q29yZUludm9rZSgnVGFrZU9mZmxpbmVUb0NsaWVudFJlYWwnLDEsIDEsICdodHRwOlx1MDAyZlx1MDAyZnd3dy5hc2FtYmxlYW1hZHJpZC5lc1x1MDAyZkVTXHUwMDJmUXVlRXNMYUFzYW1ibGVhXHUwMDJmQ29tcG9zaWNpb25kZWxhQXNhbWJsZWFcdTAwMmZMb3NEaXB1dGFkb3MnLCAtMSwgLTEsICcnLCAnJykeGENsaWVudE9uQ2xpY2tOYXZpZ2F0ZVVybGQeKENsaWVudE9uQ2xpY2tTY3JpcHRDb250YWluaW5nUHJlZml4ZWRVcmxkHgxIaWRkZW5TY3JpcHQFIVRha2VPZmZsaW5lRGlzYWJsZWQoMSwgMSwgLTEsIC0xKWQCAw8PFgoeCUFjY2Vzc0tleQUBLx4PQXJyb3dJbWFnZVdpZHRoAgUeEEFycm93SW1hZ2VIZWlnaHQCAx4RQXJyb3dJbWFnZU9mZnNldFhmHhFBcnJvd0ltYWdlT2Zmc2V0WQLrA2RkAgEPZBYCAgUPZBYCAgEPEBYCHwRoZBQrAQBkAhcPZBYIZg8PFgQfAwUPRW5nbGlzaCBWZXJzaW9uHgtOYXZpZ2F0ZVVybAVfL0VOL1F1ZUVzTGFBc2FtYmxlYS9Db21wb3NpY2lvbmRlbGFBc2FtYmxlYS9Mb3NEaXB1dGFkb3MvUGFnZXMvUmVsYWNpb25BbGZhYmV0aWNhRGlwdXRhZG9zLmFzcHhkZAICDw8WBB8DBQZQcmVuc2EfDgUyL0VTL0JpZW52ZW5pZGFQcmVuc2EvUGFnaW5hcy9CaWVudmVuaWRhUHJlbnNhLmFzcHhkZAIEDw8WBB8DBRpJZGVudGlmaWNhY2nDs24gZGUgVXN1YXJpbx8OBTQvRVMvQXJlYVVzdWFyaW9zL1BhZ2luYXMvSWRlbnRpZmljYWNpb25Vc3Vhcmlvcy5hc3B4ZGQCBg8PFgQfAwUGQ29ycmVvHw4FKGh0dHA6Ly9vdXRsb29rLmNvbS9vd2EvYXNhbWJsZWFtYWRyaWQuZXNkZAIlD2QWAgIDD2QWAgIBDxYCHwALKwQBZAI1D2QWAgIHD2QWAgIBDw8WAh8EaGQWAgIDD2QWAmYPZBYCAgMPZBYCAgUPDxYEHgZIZWlnaHQbAAAAAAAAeUABAAAAHgRfIVNCAoABZBYCAgEPPCsACQEADxYEHg1QYXRoU2VwYXJhdG9yBAgeDU5ldmVyRXhwYW5kZWRnZGQCSQ9kFgICAg9kFgICAQ9kFgICAw8WAh8ACysEAWQYAgVBY3RsMDAkUGxhY2VIb2xkZXJMZWZ0TmF2QmFyJFVJVmVyc2lvbmVkQ29udGVudDMkVjRRdWlja0xhdW5jaE1lbnUPD2QFKUNvbXBvc2ljacOzbiBkZSBsYSBBc2FtYmxlYVxMb3MgRGlwdXRhZG9zZAVHY3RsMDAkUGxhY2VIb2xkZXJUb3BOYXZCYXIkUGxhY2VIb2xkZXJIb3Jpem9udGFsTmF2JFRvcE5hdmlnYXRpb25NZW51VjQPD2QFGkluaWNpb1xRdcOpIGVzIGxhIEFzYW1ibGVhZJ',
'__EVENTVALIDATION': '/wEWCALIhqvYAwKh2YVvAuDF1KUDAqCK1bUOAqCKybkPAqCKnbQCAqCKsZEJAvejv84Dtkx5dCFr3QGqQD2wsFQh8nP3iq8',
'__VIEWSTATEGENERATOR': 'BAB98CB3',
'__REQUESTDIGEST': '0x476239970DCFDABDBBDF638A1F9B026BD43022A10D1D757B05F1071FF3104459B4666F96A47B4845D625BCB2BE0D88C6E150945E8F5D82C189B56A0DA4BC859D'}
            yield scrapy.FormRequest(url=url, formdata=formdata, callback=self.takeEachParty)

    def takeEachParty(self, response):
        print(response.css('ul.listadoVert02 ul li::text').extract())
```
Going into the source code of the website, I can see how the links look and how they send the JavaScript query. This is one of the links I need to access:
<a id="ctl00_m_g_36ea0310_893d_4a19_9ed1_88a133d06423_ctl00_Repeater1_ctl00_lnk_Grupo" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl00$lnk_Grupo", "", true, "", "", false, true))">Grupo Parlamentario Popular de la Asamblea de Madrid</a>
I have read so many articles about this, but probably the problem is my ignorance on the subject.

Thanks in advance.
**EDIT:**

**SOLUTION:** I finally did it! Translating Padraic Cunningham's very helpful code into the Scrapy way. As I tagged the question for Scrapy, I want to post the result in case someone runs into the same problem I had.

So here it goes:
```python
import scrapy
import js2xml


class AsambleaMadrid(scrapy.Spider):

    name = "AsambleaMadrid"
    start_urls = ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

    def parse(self, response):
        source = response
        hrefs = response.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href").extract()
        form_data = self.validate(source)
        for ref in hrefs:
            # js2xml allows us to parse the JS function and its params, and so to grab the __EVENTTARGET
            js_xml = js2xml.parse(ref)
            _id = js_xml.xpath(
                "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0]
            form_data["__EVENTTARGET"] = _id.text
            url_diputado = 'http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx'
            # the proper way to send a POST in Scrapy is by using FormRequest
            yield scrapy.FormRequest(url=url_diputado, formdata=form_data, callback=self.extract_parties, method='POST')

    def validate(self, source):
        # these fields are the minimum required, as they cannot be hardcoded
        data = {"__VIEWSTATEGENERATOR": source.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0].extract(),
                "__EVENTVALIDATION": source.xpath("//*[@id='__EVENTVALIDATION']/@value")[0].extract(),
                "__VIEWSTATE": source.xpath("//*[@id='__VIEWSTATE']/@value")[0].extract(),
                "__REQUESTDIGEST": source.xpath("//*[@id='__REQUESTDIGEST']/@value")[0].extract()}
        return data

    def extract_parties(self, response):
        names = response.xpath("//ul[@class='listadoVert02']/ul/li/a/text()").extract()
        print(names)
```
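In case you want to try the spider above without creating a whole Scrapy project, here is a minimal, hypothetical runner; it assumes the `AsambleaMadrid` class is defined in the same file:

```python
# Minimal sketch: run the AsambleaMadrid spider above without a Scrapy project.
from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(AsambleaMadrid)
    process.start()  # the script blocks here until the crawl finishes
```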
I hope it is clear. Thanks again, everybody!
4 Answers
p3rjfoxz1#
If you look at the data posted to the form in Chrome or Firebug, you can see that many fields are passed in the POST request. A few of them are essential and must be parsed from the original page; parsing the ids from the `div.sangria ul li a` tags is not enough, as the data actually posted is slightly different. What gets posted is the first argument of the JavaScript function `WebForm_DoPostBackWithOptions`, which sits in the *href*, not in the *id* attribute. Sometimes all the underscores are simply replaced with dollar signs, in which case a `str.replace` is enough to get them in the correct order; in this case we could parse the value out with a regex, but I like the js2xml library, which parses a JavaScript function and its arguments into an XML tree.
The code below, using the requests library, demonstrates how to get the required data from the initial request and then reach all the pages we need:
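A sketch of that requests-based flow, reconstructed from the Scrapy translation in the question's edit; treat the structure and the `validate` helper as an approximation rather than the exact original:

```python
import requests
import js2xml
from lxml import html

URL = ("http://www.asambleamadrid.es/ES/QueEsLaAsamblea/"
       "ComposiciondelaAsamblea/LosDiputados/Paginas/"
       "RelacionAlfabeticaDiputados.aspx")


def validate(tree):
    # the minimum required hidden fields; they cannot be hardcoded
    return {
        "__VIEWSTATE": tree.xpath("//*[@id='__VIEWSTATE']/@value")[0],
        "__VIEWSTATEGENERATOR": tree.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0],
        "__EVENTVALIDATION": tree.xpath("//*[@id='__EVENTVALIDATION']/@value")[0],
        "__REQUESTDIGEST": tree.xpath("//*[@id='__REQUESTDIGEST']/@value")[0],
    }


with requests.Session() as session:
    tree = html.fromstring(session.get(URL).content)
    hrefs = tree.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
    for href in hrefs:
        # parse the JS call to pull out the real, dollar-separated __EVENTTARGET
        js = js2xml.parse(href)
        target = js.xpath("//identifier[@name='WebForm_PostBackOptions']"
                          "/following-sibling::arguments/string[starts-with(., 'ctl')]")[0].text
        form_data = validate(tree)
        form_data["__EVENTTARGET"] = target
        page = html.fromstring(session.post(URL, data=form_data).content)
        print(page.xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))
```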
If we run the code above, we will see the different text output from all the anchor tags.
You can add the exact same logic to your spider; I used requests just to show you a working example. You should also be aware that not every asp.net site behaves the same way; you may have to re-validate for every POST, as in this related answer.
xtfmy6hx2#
I think Scrapy's `FormRequest.from_response` can help you a lot here (maybe it is not the best approach, but you will get the idea). Try something like this:
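A minimal sketch of that idea; the regex extraction of the postback target and the callback name `parse_party` are my additions, not part of the original suggestion:

```python
import re

import scrapy


class AsambleaFromResponse(scrapy.Spider):
    # hypothetical spider name; the URL is taken from the question
    name = "asamblea_from_response"
    start_urls = ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

    def parse(self, response):
        for href in response.css('#moduloBusqueda div.sangria ul li a::attr(href)').extract():
            # grab the dollar-separated control name from WebForm_PostBackOptions("...", ...)
            match = re.search(r'\("(ctl[^"]+)"', href)
            if not match:
                continue
            # from_response pre-populates __VIEWSTATE, __EVENTVALIDATION, etc.
            # from the page's <form>, so we only override __EVENTTARGET
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'__EVENTTARGET': match.group(1)},
                dont_click=True,  # we are faking a JS postback, not clicking a submit button
                callback=self.parse_party,
            )

    def parse_party(self, response):
        for name in response.css('ul.listadoVert02 ul li a::text').extract():
            yield {'name': name}
```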
h7wcgrx33#
The best way to deal with ASP.NET sites is to use the `scrapy.FormRequest` class. You do not have to worry about keeping track of the hidden ASP.NET state fields whose names start with `__` (`__VIEWSTATE`, `__EVENTVALIDATION`, and friends):

> The FormRequest class extends the base Request with functionality for dealing with HTML forms. It uses lxml.html forms to pre-populate form fields with form data from Response objects.
Source: https://docs.scrapy.org/en/latest/topics/request-response.html

Here is how you would use it for your project; a sketch follows below:
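As a minimal illustration, assuming you are inside `scrapy shell <url>` on the deputies page (the `__EVENTTARGET` value below is a hypothetical placeholder), you can check that `from_response` really picks up the hidden fields for you:

```python
# Inside `scrapy shell <url>` a `response` object is already available.
from scrapy import FormRequest

req = FormRequest.from_response(
    response,
    formdata={'__EVENTTARGET': 'ctl00$placeholder$lnk_Grupo'},  # hypothetical target
    dont_click=True,
)
# The urlencoded body already contains __VIEWSTATE, __EVENTVALIDATION, ...
print(req.body[:200])
```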
f0ofjuux4#
Agreed with ELRuLL: Firebug is your best friend when scraping. If you want to avoid simulating the JS, then you need to carefully reproduce all the parameters/headers that are being sent.
For example, from what I can see, as `__EVENTTARGET` you are sending only the id, `ctl00_m_g_36ea0310_893d_4a19_9ed1_88a133d06423_ctl00_Repeater2_ctl01_lnk_Diputado`, while through Firebug we can see what actually gets posted:
[Firebug screenshot of the posted form data, showing the dollar-separated `__EVENTTARGET`]

This may or may not be the cause; just replicate and test.
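To make the difference concrete, here is the same control name in both forms, following the pattern of the link shown in the question (note that a blind `str.replace('_', '$')` would also corrupt the underscores inside the GUID):

```python
# id attribute on the page vs. the value actually posted as __EVENTTARGET:
# only the separators between control names change, not the GUID's underscores.
id_attr = "ctl00_m_g_36ea0310_893d_4a19_9ed1_88a133d06423_ctl00_Repeater2_ctl01_lnk_Diputado"
posted = "ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater2$ctl01$lnk_Diputado"
```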
Firebug link, just in case.