0. Preface
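On October 8 I received my newly bought Architecture Studio set, and a couple of days ago I stumbled on the fact that Lego.com provides the Building Instructions for every set.
It is a search page where the instructions for any set can be looked up by theme and year, which gave me a wicked idea: download all of the instructions with a crawler.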
1. Analyzing the page
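Before crawling, the first step is to analyze the structure of the page. Watching the Net traffic in Firebug shows that the actual query request goes to this URL:
The result it returns is a simple JSON document: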
{"count":8,"moreData":false,"products":[
{"productId":"70166","productName":"Spyclops Infiltration","productImage":"http://cache.lego.com/images/shop/prod/70166-0000-XX-12-1.jpg","buildingInstructions":[{"description":"BI 3003/32- 70166 V29","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6112710.pdf","downloadSize":"4.77 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6112710.png","isAlternative":false},{"description":"BI 3003/32- 70166 V39","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6112711.pdf","downloadSize":"4.71 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6112711.png","isAlternative":false}],"themeName":"Agents","launchYear":2015},
{"productId":"70167","productName":"Invizable Gold Getaway","productImage":"http://cache.lego.com/images/shop/prod/70167-0000-XX-12-1.jpg","buildingInstructions":[{"description":"BI 3004/60+4/65+115g 70167 V29","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6112723.pdf","downloadSize":"5.85 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6112723.png","isAlternative":false},{"description":"BI 3004/60+4/65+115g 70167 V39","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6112724.pdf","downloadSize":"5.79 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6112724.png","isAlternative":false}],"themeName":"Agents","launchYear":2015},
{"productId":"70168","productName":"Drillex Diamond Job","productImage":"http://cache.lego.com/images/shop/prod/70168-0000-XX-12-1.jpg","buildingInstructions":[{"description":"BI 3004/72+4- 70168 V29","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6119298.pdf","downloadSize":"7.06 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6119298.png","isAlternative":false},{"description":"BI 3004/72+4- 70168 V39","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6119299.pdf","downloadSize":"7 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6119299.png","isAlternative":false}],"themeName":"Agents","launchYear":2015},
{"productId":"70169","productName":"Agent Stealth Patrol","productImage":"http://cache.lego.com/images/shop/prod/70169-0000-XX-12-1.jpg","buildingInstructions":[{"description":"BI 3017/100+4/65+200g- 70169 V39","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6115430.pdf","downloadSize":"11.22 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6115430.png","isAlternative":false},{"description":"BI 3017/100+4/65+200g-70169 V29","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6115427.pdf","downloadSize":"12.11 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6115427.png","isAlternative":false}],"themeName":"Agents","launchYear":2015},
{"productId":"70170","productName":"UltraCopter vs. AntiMatter","productImage":"http://cache.lego.com/images/shop/prod/70170-0000-XX-12-1.jpg","buildingInstructions":[{"description":"BI 3019/100+4/65+200g- 70170 V29","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6115458.pdf","downloadSize":"17.36 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6115458.png","isAlternative":false},{"description":"BI 3019/100+4/65+200g- 70170 V39","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6115472.pdf","downloadSize":"14.74 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6115472.png","isAlternative":false}],"themeName":"Agents","launchYear":2015},
{"productId":"70171","productName":"Ultrasonic Showdown","productImage":"http://cache.lego.com/images/shop/prod/70171-0000-XX-12-1.jpg","buildingInstructions":[{"description":"BI 3004/60+4/65+115g - 70171 V29","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6127741.pdf","downloadSize":"5.83 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6127741.png","isAlternative":false},{"description":"BI 3004/60+4/65+115g - 70171 V39","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6127743.pdf","downloadSize":"5.79 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6127743.png","isAlternative":false}],"themeName":"Agents","launchYear":2015},
{"productId":"70172","productName":"AntiMatter’s Portal Hideout","productImage":"http://cache.lego.com/images/shop/prod/70172-0000-XX-12-1.jpg","buildingInstructions":[{"description":"BI 3016/96+4/65+200g - 70172 V29","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6129665.pdf","downloadSize":"12.45 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6129665.png","isAlternative":false},{"description":"BI 3016/96+4/65+200g - 70172 V39","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6129666.pdf","downloadSize":"10.57 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6129666.png","isAlternative":false}],"themeName":"Agents","launchYear":2015},
{"productId":"70173","productName":"Ultra Agents Ocean HQ","productImage":"http://cache.lego.com/images/shop/prod/70173-0000-XX-12-1.jpg","buildingInstructions":[{"description":"BI 3019/176+4/65+200 - 70173 V29","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6129674.pdf","downloadSize":"24 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6129674.png","isAlternative":false},{"description":"BI 3019/176+4/65+200 - 70173 V39","pdfLocation":"http://cache.lego.com/bigdownloads/buildinginstructions/6129675.pdf","downloadSize":"22.14 Mb","frontpageInfo":"http://cache.lego.com/bigdownloads/buildinginstructions/fpimg/6129675.png","isAlternative":false}],"themeName":"Agents","launchYear":2015}
],"totalCount":8,"years":["2015","2014","2009","2008"],"themes":["10000-20127","10000-20057","10000-20009","10000-20263","10000-20264","10000-20248","10000-20147","10000-20246","10000-20243","10000-20020","10000-20226","10000-20018","10000-20062","10000-20016","10000-20008","10000-20174","10000-20216","10000-20229","10000-20218","10000-20090","10000-20194","10000-20245","10000-20155","10000-20019","10000-20094","10000-20230","10000-20223","10000-20005","10000-20239","10000-20219","10000-20244","10000-20139","10000-20238","10000-20003","10000-20055","10000-20000","10000-20140","10000-20128","10000-20221","10000-20041","10000-20231","10000-20225","10000-20049","10000-20039","10000-20242","10000-20222","10000-20241","10000-20056","10000-20101","10000-20002"]}
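With this one simple request we can get every set's theme, launch year, name, set number, box image and, most importantly, the PDF building instructions. The only open question is which values the theme parameter in the URL can take.
Going back to the HTML of the search page, it turns out that the DIV of the search box already tells us every theme and its corresponding parameter value: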
<div class="product-search ng-scope" … data-search-themes="[{"Label":"4juniors","Key":"10000-20070"},{"Label":"Adventurers","Key":"10000-20031"},{"Label":"Agents","Key":"10000-20127"},{"Label":"Alpha Team","Key":"10000-20085"},{"Label":"Angry Birds","Key":"10000-20251"},{"Label":"Aqua Raiders","Key":"10000-20113"},{"Label":"Aquazone","Key":"10000-20032"},{"Label":"Artic","Key":"10000-20097"},{"Label":"Atlantis","Key":"10000-20115"},{"Label":"Avatar TM","Key":"10000-20068"},{"Label":"Batman TM","Key":"10000-20114"},{"Label":"Belville","Key":"10000-20051"},{"Label":…,"Key":"10000-20002"},{"Label":"Time Cruisers","Key":"10000-20096"},{"Label":"Town","Key":"500-714"},{"Label":"Toy Story TM","Key":"10000-20109"},{"Label":"Trains","Key":"500-717"},{"Label":"Transport","Key":"500-720"},{"Label":"Vikings","Key":"10000-20102"},{"Label":"Wild West","Key":"10000-20015"},{"Label":"Williams TM","Key":"10000-20124"},{"Label":"World City","Key":"10000-20071"},{"Label":"World Racers","Key":"10000-20130"},{"Label":"X-Treme","Key":"10000-20098"},{"Label":"ZNAP","Key":"10000-20064"}]
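One way to pull the theme list out of that attribute is sketched below, using only the Python standard library. It assumes that the raw page source stores the JSON entity-encoded (with &quot;) inside data-search-themes, which is how such an attribute normally has to be written; ThemeAttrParser, parse_themes and SAMPLE_HTML are hypothetical names and data made up for this example.

```python
import json
from html.parser import HTMLParser


class ThemeAttrParser(HTMLParser):
    """Collects the JSON stored in the data-search-themes attribute."""

    def __init__(self):
        super().__init__()
        self.themes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already unescaped &quot; in attribute values here.
        raw = dict(attrs).get("data-search-themes")
        if raw:
            self.themes = json.loads(raw)


def parse_themes(html_text):
    """Return a {label: key} mapping of all themes found in the page."""
    parser = ThemeAttrParser()
    parser.feed(html_text)
    return {theme["Label"]: theme["Key"] for theme in parser.themes}


# Tiny stand-in for the real page, with two of the themes listed above.
SAMPLE_HTML = (
    '<div class="product-search ng-scope" data-search-themes="'
    '[{&quot;Label&quot;:&quot;Agents&quot;,&quot;Key&quot;:&quot;10000-20127&quot;},'
    '{&quot;Label&quot;:&quot;Town&quot;,&quot;Key&quot;:&quot;500-714&quot;}]"></div>'
)

print(parse_themes(SAMPLE_HTML))
```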
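Everything is in place. The next step is to plan how the data will be crawled:
- First, get the label and the corresponding key of every theme
- Substitute each key and year into the URL and send a request for every combination
- Extract the information for every set from the response
- If a set's image and PDF are not already stored locally, download them
- Re-run the crawl from the first step at regular intervals to stay in sync with the official site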
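The plan above could be sketched roughly as follows. This is not the original crawler, only an outline under a few assumptions: the query URL is not reproduced here, so the request is hidden behind a caller-supplied fetch_json(theme_key, year) function that returns the response text; pending_downloads, download and the one-folder-per-set layout are likewise hypothetical choices for this example.

```python
import json
import os
import urllib.request
from urllib.parse import urlparse


def pending_downloads(themes, years, fetch_json, dest_dir):
    """Yield (url, local_path) for every image/PDF that is not on disk yet.

    themes: {label: key} mapping; years: iterable of year strings;
    fetch_json: hypothetical callable (theme_key, year) -> response text.
    """
    seen = set()
    for key in themes.values():
        for year in years:
            data = json.loads(fetch_json(key, year))
            for product in data.get("products", []):
                urls = [product.get("productImage")]
                urls += [bi.get("pdfLocation")
                         for bi in product.get("buildingInstructions", [])]
                for url in urls:
                    if not url or url in seen:
                        continue  # skip missing fields and repeated files
                    seen.add(url)
                    filename = os.path.basename(urlparse(url).path)
                    local = os.path.join(dest_dir, product["productId"], filename)
                    if not os.path.exists(local):
                        yield url, local


def download(url, local_path):
    """Fetch one file, creating the set's folder on first use."""
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    urllib.request.urlretrieve(url, local_path)


def sync(themes, years, fetch_json, dest_dir):
    """One crawl pass; run it periodically to pick up newly published sets."""
    for url, local_path in pending_downloads(themes, years, fetch_json, dest_dir):
        download(url, local_path)
```

Because files that already exist locally are skipped, running sync again later only transfers what has been added since the previous pass.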