4 Finish The LegoSpyder
4.1 Define Items
Analyzing the JSON that the Search API returns, the result format looks roughly like the sketch below. This is a reconstruction from the fields the spider in section 4.3 actually reads, with values taken from the sample output in section 5, so treat it as illustrative rather than a verbatim capture:
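{
    "count": 10,
    "moreData": true,
    "products": [
        {
            "productId": "4655",
            "productName": "Quick Fix Station",
            "productImage": "http://cache.lego.com/images/shop/prod/4655-0000-XX-12-1.jpg",
            "buildingInstructions": [
                {
                    "description": "BI, 4655 IN",
                    "pdfLocation": "http://cache.lego.com/bigdownloads/buildinginstructions/4234989.pdf",
                    "frontpageInfo": "http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/4234989.png"
                }
            ]
        }
    ]
}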
Considering the data that needs to be saved and the convenience of downloading the files, the item classes are designed as follows. The final items.py:
from scrapy import Field, Item

class LegoBaseItem(Item):
    # Fields shared by every item. file_urls and files are the standard
    # field names that Scrapy's FilesPipeline reads and populates;
    # file_paths records where each downloaded file ended up on disk.
    productId = Field()
    file_urls = Field()
    files = Field()
    file_paths = Field()

class LegoProductItem(LegoBaseItem):
    # One LEGO set as returned by the Search API.
    productName = Field()
    productImage = Field()
    theme = Field()
    year = Field()

class LegoBuildingInstructionsItem(LegoBaseItem):
    # One building-instructions PDF plus its front-page image.
    description = Field()
4.2 Modify Pipelines
Since we only need to download images and PDFs, a single FilesPipeline is enough. To keep the documents manageable, files of different types are stored in different subdirectories. The final pipelines.py:
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline

class LegoFilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        file_guid = request.url.split('/')[-1]
        # group is the subdirectory the downloaded file is stored under,
        # derived from the file extension
        group = request.url.split('.')[-1]
        if group not in ['jpg', 'jpeg', 'png', 'gif', 'pdf']:
            group = 'unknown'
        return '%s/%s' % (group, file_guid)

    def get_media_requests(self, item, info):
        if 'file_urls' in item:
            for file_url in item['file_urls']:
                yield Request(file_url)

    def item_completed(self, results, item, info):
        if 'file_urls' in item:
            file_paths = [x['path'] for ok, x in results if ok]
            if not file_paths:
                # file_paths is empty at this point, so report the URLs instead
                raise DropItem('File download failed: %s' % item['file_urls'])
            item['file_paths'] = file_paths
        return item
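For the custom pipeline to run, it has to be enabled in settings.py, and FilesPipeline needs a storage root. A minimal sketch, assuming the project package is named LegoSpyder and taking Download as the storage directory (the directory the downloaded files show up in when the spider runs):

# Assumes the project package is LegoSpyder; adjust the dotted path
# to match your own project.
ITEM_PIPELINES = {
    'LegoSpyder.pipelines.LegoFilePipeline': 1,
}
# Root directory for FilesPipeline; the jpg/png/pdf subdirectories
# returned by file_path() are created beneath it.
FILES_STORE = 'Download'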
4.3 Refine the Spider
Finally, the spider BuildingInstructionSpider.py needs to be reworked so that it recursively crawls the images and PDFs for every search result. Some themes contain more than 10 products (the number of results a single search returns), so moreData tells us whether to keep searching with the same criteria: for example, if a request with fromIndex=0 comes back with count=10 and moreData=true, the next request is issued with fromIndex=10. The final spider:
import json
from urllib.parse import urlparse, parse_qs  # on Python 2: from urlparse import urlparse, parse_qs

import scrapy

# Adjust the import path to match your project package name.
from LegoSpyder.items import LegoProductItem, LegoBuildingInstructionsItem

class BuildingInstructionsSpider(scrapy.Spider):
    name = 'lego'
    search_base_url = "https://wwwsecure.us.lego.com//service/biservice/searchbythemeandyear?fromIndex=%s&onlyAlternatives=false&theme=%s&year=%s"
    allowed_domains = ['lego.com']
    start_urls = [
        'https://wwwsecure.us.lego.com/en-us/service/buildinginstructions',
    ]

    def parse(self, response):
        # The landing page embeds the available themes and years as JSON in
        # data attributes; fan out one search request per theme/year pair.
        year_str = response.css('div.product-search::attr(data-search-years)').extract_first()
        theme_str = response.css('div.product-search::attr(data-search-themes)').extract_first()
        for theme in json.loads(theme_str):
            for year in json.loads(year_str):
                search_url = self.search_base_url % (0, theme['Key'], year)
                yield scrapy.Request(search_url, callback=self.parse_search)

    def parse_search(self, response):
        result = json.loads(response.body)
        # Recover theme/year/fromIndex from the request's query string.
        param = parse_qs(urlparse(response.url).query)
        if result['count']:
            for p in result['products']:
                product = LegoProductItem()
                product['theme'] = param['theme'][0]
                product['year'] = param['year'][0]
                product['productId'] = p['productId']
                product['productName'] = p['productName']
                product['productImage'] = p['productImage']
                product['file_urls'] = [p['productImage']]
                yield product
                for bi in p['buildingInstructions']:
                    instruction = LegoBuildingInstructionsItem()
                    instruction['productId'] = p['productId']
                    instruction['description'] = bi['description']
                    instruction['file_urls'] = [bi['pdfLocation'], bi['frontpageInfo']]
                    yield instruction
        if result['moreData']:
            # Page through: next fromIndex = current fromIndex + results returned.
            search_next = self.search_base_url % (int(param['fromIndex'][0]) + int(result['count']), param['theme'][0], param['year'][0])
            yield scrapy.Request(search_next, callback=self.parse_search)
5 Run!
Modify main.py to:

from scrapy import cmdline

cmdline.execute("scrapy crawl lego -o lego.jl".split())
Run it, and you will see jpg, png and pdf files steadily appearing under the Download directory, and all the item data ends up in lego.jl:
{"file_paths": ["jpg/4655-0000-XX-12-1.jpg"], "productName": "Quick Fix Station", "productImage": "http://cache.lego.com/images/shop/prod/4655-0000-XX-12-1.jpg", "theme": "10000-20070", "file_urls": ["http://cache.lego.com/images/shop/prod/4655-0000-XX-12-1.jpg"], "year": "2003", "productId": "4655"}
{"productId": "4655", "file_paths": ["pdf/4234989.pdf", "png/4234989.png"], "file_urls": ["http://cache.lego.com/bigdownloads/buildinginstructions/4234989.pdf", "http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/4234989.png"], "description": "BI, 4655 IN"}
{"productId": "4655", "file_paths": ["pdf/4235788.pdf", "png/4235788.png"], "file_urls": ["http://cache.lego.com/bigdownloads/buildinginstructions/4235788.pdf", "http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/4235788.png"], "description": "BI, 4655 NA"}
{"file_paths": ["jpg/4657-0000-XX-12-1.jpg"], "productName": "Fire Squad HQ", "productImage": "http://cache.lego.com/images/shop/prod/4657-0000-XX-12-1.jpg", "theme": "10000-20070", "file_urls": ["http://cache.lego.com/images/shop/prod/4657-0000-XX-12-1.jpg"], "year": "2003", "productId": "4657"}
{"productId": "4657", "file_paths": ["pdf/4234999.pdf", "png/4234999.png"], "file_urls": ["http://cache.lego.com/bigdownloads/buildinginstructions/4234999.pdf", "http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/4234999.png"], "description": "BI, 4657 IN"}
…
6 Some Bugs
6.1 Error processing file from
2016-10-23 21:14:35 [scrapy] ERROR: File (unknown-error): Error processing file from <GET https://mi-od-live-s.legocdn.com/r/www/r/service/-/media/franchises/customer%20service/technic/8052_b_part2.pdf?l.r2=704137121> referred in <None>
Some download links carry URL query parameters, so the following lines in pipelines.py:

file_guid = request.url.split('/')[-1]
group = request.url.split('.')[-1]
must be changed to (splitting on '?' first drops the query string before the file name and extension are extracted):

file_guid = request.url.split('?')[0].split('/')[-1]
group = request.url.split('?')[0].split('.')[-1]
6.2 WARNING: Received more bytes than download warn size (33554432) in request
Some PDF files are large and slow to download: they trip the 32 MB warn size and can exceed the default DOWNLOAD_TIMEOUT of 180 seconds, so some downloads fail. Modify settings.py:

DOWNLOAD_MAXSIZE = 0      # disable the response size limit
DOWNLOAD_TIMEOUT = 18000  # allow up to 18000 seconds per download