4 Finish The LegoSpyder
4.1 Define Items
Analyzing the JSON that the Search API returns, the result format looks roughly like the sketch below. This is a reconstruction from the fields the spider in section 4.3 actually reads, with values taken from the sample output in section 5, so treat it as illustrative rather than a verbatim capture:
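{
    "count": 10,
    "moreData": true,
    "products": [
        {
            "productId": "4655",
            "productName": "Quick Fix Station",
            "productImage": "http://cache.lego.com/images/shop/prod/4655-0000-XX-12-1.jpg",
            "buildingInstructions": [
                {
                    "description": "BI, 4655 IN",
                    "pdfLocation": "http://cache.lego.com/bigdownloads/buildinginstructions/4234989.pdf",
                    "frontpageInfo": "http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/4234989.png"
                }
            ]
        }
    ]
}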
Considering the data that needs to be saved and the convenience of downloading the files, the item classes are designed as follows. The final items.py:
from scrapy import Field, Item

class LegoBaseItem(Item):
    # Fields shared by every item. file_urls and files are the standard
    # field names that Scrapy's FilesPipeline reads and populates;
    # file_paths records where each downloaded file ended up on disk.
    productId = Field()
    file_urls = Field()
    files = Field()
    file_paths = Field()

class LegoProductItem(LegoBaseItem):
    # One LEGO set as returned by the Search API.
    productName = Field()
    productImage = Field()
    theme = Field()
    year = Field()

class LegoBuildingInstructionsItem(LegoBaseItem):
    # One building-instructions PDF plus its front-page image.
    description = Field()
4.2 Modify Pipelines
Since we only need to download images and PDFs, a single FilesPipeline is enough. To keep the documents manageable, files of different types are stored in different subdirectories. The final pipelines.py:
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline

class LegoFilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        file_guid = request.url.split('/')[-1]
        # group is the subdirectory the downloaded file is stored under,
        # derived from the file extension
        group = request.url.split('.')[-1]
        if group not in ['jpg', 'jpeg', 'png', 'gif', 'pdf']:
            group = 'unknown'
        return '%s/%s' % (group, file_guid)

    def get_media_requests(self, item, info):
        if 'file_urls' in item:
            for file_url in item['file_urls']:
                yield Request(file_url)

    def item_completed(self, results, item, info):
        if 'file_urls' in item:
            file_paths = [x['path'] for ok, x in results if ok]
            if not file_paths:
                # file_paths is empty at this point, so report the URLs instead
                raise DropItem('File download failed: %s' % item['file_urls'])
            item['file_paths'] = file_paths
        return item
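For the custom pipeline to run, it has to be enabled in settings.py, and FilesPipeline needs a storage root. A minimal sketch, assuming the project package is named LegoSpyder and taking Download as the storage directory (the directory the downloaded files show up in when the spider runs):

# Assumes the project package is LegoSpyder; adjust the dotted path
# to match your own project.
ITEM_PIPELINES = {
    'LegoSpyder.pipelines.LegoFilePipeline': 1,
}
# Root directory for FilesPipeline; the jpg/png/pdf subdirectories
# returned by file_path() are created beneath it.
FILES_STORE = 'Download'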
4.3 Refine the Spider
Finally, the spider BuildingInstructionSpider.py needs to be reworked so that it recursively crawls the images and PDFs for every search result. Some themes contain more than 10 products (the number of results a single search returns), so moreData tells us whether to keep searching with the same criteria: for example, if a request with fromIndex=0 comes back with count=10 and moreData=true, the next request is issued with fromIndex=10. The final spider:
import json
from urllib.parse import urlparse, parse_qs  # on Python 2: from urlparse import urlparse, parse_qs

import scrapy

# Adjust the import path to match your project package name.
from LegoSpyder.items import LegoProductItem, LegoBuildingInstructionsItem

class BuildingInstructionsSpider(scrapy.Spider):
    name = 'lego'
    search_base_url = "https://wwwsecure.us.lego.com//service/biservice/searchbythemeandyear?fromIndex=%s&onlyAlternatives=false&theme=%s&year=%s"
    allowed_domains = ['lego.com']
    start_urls = [
        'https://wwwsecure.us.lego.com/en-us/service/buildinginstructions',
    ]

    def parse(self, response):
        # The landing page embeds the available themes and years as JSON in
        # data attributes; fan out one search request per theme/year pair.
        year_str = response.css('div.product-search::attr(data-search-years)').extract_first()
        theme_str = response.css('div.product-search::attr(data-search-themes)').extract_first()
        for theme in json.loads(theme_str):
            for year in json.loads(year_str):
                search_url = self.search_base_url % (0, theme['Key'], year)
                yield scrapy.Request(search_url, callback=self.parse_search)

    def parse_search(self, response):
        result = json.loads(response.body)
        # Recover theme/year/fromIndex from the request's query string.
        param = parse_qs(urlparse(response.url).query)
        if result['count']:
            for p in result['products']:
                product = LegoProductItem()
                product['theme'] = param['theme'][0]
                product['year'] = param['year'][0]
                product['productId'] = p['productId']
                product['productName'] = p['productName']
                product['productImage'] = p['productImage']
                product['file_urls'] = [p['productImage']]
                yield product
                for bi in p['buildingInstructions']:
                    instruction = LegoBuildingInstructionsItem()
                    instruction['productId'] = p['productId']
                    instruction['description'] = bi['description']
                    instruction['file_urls'] = [bi['pdfLocation'], bi['frontpageInfo']]
                    yield instruction
        if result['moreData']:
            # Page through: next fromIndex = current fromIndex + results returned.
            search_next = self.search_base_url % (int(param['fromIndex'][0]) + int(result['count']), param['theme'][0], param['year'][0])
            yield scrapy.Request(search_next, callback=self.parse_search)
5 Run!
Modify main.py to:

from scrapy import cmdline

cmdline.execute("scrapy crawl lego -o lego.jl".split())
Run it, and you will see jpg, png and pdf files steadily appearing under the Download directory, and all the item data ends up in lego.jl:
{"file_paths": ["jpg/4655-0000-XX-12-1.jpg"], "productName": "Quick Fix Station", "productImage": "http://cache.lego.com/images/shop/prod/4655-0000-XX-12-1.jpg", "theme": "10000-20070", "file_urls": ["http://cache.lego.com/images/shop/prod/4655-0000-XX-12-1.jpg"], "year": "2003", "productId": "4655"}
{"productId": "4655", "file_paths": ["pdf/4234989.pdf", "png/4234989.png"], "file_urls": ["http://cache.lego.com/bigdownloads/buildinginstructions/4234989.pdf", "http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/4234989.png"], "description": "BI, 4655 IN"}
{"productId": "4655", "file_paths": ["pdf/4235788.pdf", "png/4235788.png"], "file_urls": ["http://cache.lego.com/bigdownloads/buildinginstructions/4235788.pdf", "http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/4235788.png"], "description": "BI, 4655 NA"}
{"file_paths": ["jpg/4657-0000-XX-12-1.jpg"], "productName": "Fire Squad HQ", "productImage": "http://cache.lego.com/images/shop/prod/4657-0000-XX-12-1.jpg", "theme": "10000-20070", "file_urls": ["http://cache.lego.com/images/shop/prod/4657-0000-XX-12-1.jpg"], "year": "2003", "productId": "4657"}
{"productId": "4657", "file_paths": ["pdf/4234999.pdf", "png/4234999.png"], "file_urls": ["http://cache.lego.com/bigdownloads/buildinginstructions/4234999.pdf", "http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/4234999.png"], "description": "BI, 4657 IN"}
…
6 Some Bugs
6.1 Error processing file from
2016-10-23 21:14:35 [scrapy] ERROR: File (unknown-error): Error processing file from <GET https://mi-od-live-s.legocdn.com/r/www/r/service/-/media/franchises/customer%20service/technic/8052_b_part2.pdf?l.r2=704137121> referred in <None>
Some download links carry URL query parameters, so the following lines in pipelines.py:

file_guid = request.url.split('/')[-1]
group = request.url.split('.')[-1]
must be changed to (splitting on '?' first drops the query string before the file name and extension are extracted):

file_guid = request.url.split('?')[0].split('/')[-1]
group = request.url.split('?')[0].split('.')[-1]
6.2 WARNING: Received more bytes than download warn size (33554432) in request
Some PDF files are large and slow to download: they trip the 32 MB warn size and can exceed the default DOWNLOAD_TIMEOUT of 180 seconds, so some downloads fail. Modify settings.py:

DOWNLOAD_MAXSIZE = 0      # disable the response size limit
DOWNLOAD_TIMEOUT = 18000  # allow up to 18000 seconds per download