Building LegoSpyder Part II - Master Gloomymoon's R2D2

2. Installing Scrapy

直接用pip安装即可：

1	$ pip install Scrapy

如果出现如下错误：
exceptions.ImportError: No module named win32api

需要手动安装pypiwin32:

1	$ pip install pypiwin32

如果需要生成下载预览图片的缩略图，请手动安装image：

1	$ pip install image

3 Building Spider

3.0 Create Scrapy Project

进入项目目录，在命令行输入如下命令就可以创建一个空白的Scrapy项目LegoSpyder：

1	$ scrapy startproject LegoSpyder

生成的项目框架如下所示：

LegoSpyder/
    scrapy.cfg
    LegoSpyder/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

3.1 Create Spider

然后在spiders目录下新建BuildingInstructionsSpider.py文件，这就是我们爬虫程序。
根据之前对Lego网站的分析，就能够获取所有Set的相关json数据：

import scrapy
import json


class BuildingInstructionsSpider(scrapy.Spider):
    name = 'lego'
    search_base_url = "https://wwwsecure.us.lego.com//service/biservice/searchbythemeandyear?fromIndex=0&onlyAlternatives=false&theme=%s&year=%s"
    allowed_domains = ['lego.com']
    start_urls = [
        'https://wwwsecure.us.lego.com/en-us/service/buildinginstructions',
    ]

    def parse(self, response):
        year_str = response.css('div.product-search::attr(data-search-years)').extract_first()
        theme_str = response.css('div.product-search::attr(data-search-themes)').extract_first()
        for theme in json.loads(theme_str):
            #print theme
            for year in json.loads(year_str):
                #print year
                search_url = self.search_base_url % (theme['Key'], year)
                yield scrapy.Request(search_url, callback=self.parse_search)


    def parse_search(self, response):
        print response.body

为了方便在PyCharm中进行调试，在项目的根目录创建main.py：

1 2	from scrapy import cmdline cmdline.execute("scrapy crawl lego".split())

这样从main.py运行就能够调试spider的脚本了，而不用从命令行用scrapy crawl的方式运行爬虫。
Run之后就能看到所有的Set信息：
2016-10-22 11:19:15 [scrapy] INFO: Spider opened 2016-10-22 11:19:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2016-10-22 11:19:15 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024 2016-10-22 11:19:17 [scrapy] DEBUG: Crawled (200) <GET https://wwwsecure.us.lego.com/en-us/service/buildinginstructions> (referer: None) 2016-10-22 11:19:17 [scrapy] DEBUG: Crawled (200) <GET https://wwwsecure.us.lego.com//service/biservice/searchbythemeandyear?fromIndex=0&onlyAlternatives=false&theme=10000-20070&year=1995> (referer: https://wwwsecure.us.lego.com/en-us/service/buildinginstructions) {"count":0,"moreData":false,"products":[],"totalCount":0,"years":["2004","2003"],"themes":["10000-20032","500-346"]} 2016-10-22 11:19:17 [scrapy] DEBUG: Crawled (200) <GET https://wwwsecure.us.lego.com//service/biservice/searchbythemeandyear?fromIndex=0&onlyAlternatives=false&theme=10000-20070&year=2003> (referer: https://wwwsecure.us.lego.com/en-us/service/buildinginstructions) {"count":2,"moreData":false,"products":[{"productId":"4655","productName":"Quick Fix Station","productImage":"http://cache.lego.com/images/shop/prod/4655-0000-XX-12-1.jpg","buildingInstructions":[{"description":"BI, 4655 IN","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/4234989.pdf","downloadSize":"1.23 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/4234989.png","isAlternative":false},{"description":"BI, 4655 NA","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/4235788.pdf","downloadSize":"1.26 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/4235788.png","isAlternative":false}],"themeName":"4juniors","launchYear":2003},{"productId": ... } ],"totalCount":2,"years":["2004","2003"],"themes":["10000-20070","10000-20031","10000-20051","10000-20057","10000-20009","10000-20017","10000-20016","10000-20008","10000-20216","10000-20106","10000-20105","10000-20084","10000-20129","10000-20238","10000-20003","10000-20189","10000-20039","10000-20081","10000-20040","10000-20080","10000-20037","10000-20035","10000-20056","10000-20002","10000-20124","10000-20071"]} ...

3.2 Download Images

Scrapy自带File和image下载的模块，我们先创建一个TestSpider来实现封面图片的下载功能。
修改items.py，新增LegoImageItem类：

import scrapy

class LegoImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()

修改pipelines.py，新增LegoImagePipeline类：

class LegoImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image download failed: %s' % image_paths)
        item['image_paths'] = image_paths
        return item

修改spiders.BuildingInstructionsSpider.py，增加一个新的TestSpider：

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'https://wwwsecure.us.lego.com//service/biservice/searchbythemeandyear?fromIndex=0&onlyAlternatives=false&theme=10000-20127&year=2015',
    ]

    def parse(self, response):
        result = json.loads(response.body)
        print result['count'], result['moreData']
        item = LegoImageItem()
        url_list = []
        for product in result['products']:
            #print product
            url_list.append(product['productImage'])
        print url_list
        item['image_urls'] = url_list
        yield item

最后修改settings.py，增加相关的配置信息：

ITEM_PIPELINES = {
    'LegoSpyder.pipelines.LegoImagePipeline': 1,
}
IMAGES_STORE = '\\Download\\Test' #图片下载目录
IMAGES_EXPIRES = 90

运行后就可以在相应的目录下看到抓取的图片文件了

2. Installing Scrapy

3 Building Spider

3.0 Create Scrapy Project

3.1 Create Spider

3.2 Download Images

FEATURED TAGS