A Summary of Common Scrapy Spider Problems

Running and debugging a spider conveniently

Add a cmdline call to the spider file so it can be run (and debugged) directly:

import scrapy.cmdline

# Your spider class...

def main():
    # Equivalent to running "scrapy crawl your_spider_name" from the shell
    scrapy.cmdline.execute(['scrapy', 'crawl', 'your_spider_name'])

if __name__ == '__main__':
    main()
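Alternatively, a minimal sketch using Scrapy's CrawlerProcess, which runs the spider in-process and also loads your project settings (YourSpider stands in for your own spider class defined above):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main():
    # get_project_settings() reads settings.py of the current Scrapy project
    process = CrawlerProcess(get_project_settings())
    process.crawl(YourSpider)  # placeholder: your spider class
    process.start()            # blocks until the crawl finishes

if __name__ == '__main__':
    main()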

Storing item data in a MySQL database

Create a MySQLStorePipeline in pipelines.py and register it in ITEM_PIPELINES in settings.py (a registration sketch follows the pipeline code):

import MySQLdb
from MyProject.settings import MYSQL_HOST, MYSQL_DBNAME, MYSQL_USER, MYSQL_PASSWD

class MySQLStorePipeline(object):
    """A pipeline to store the item in a MySQL database."""

    def __init__(self):
        self.conn = MySQLdb.connect(host=MYSQL_HOST, user=MYSQL_USER,
                                    passwd=MYSQL_PASSWD, db=MYSQL_DBNAME,
                                    charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()  # or self.conn.cursor(MySQLdb.cursors.DictCursor)

    def process_item(self, item, spider):
        try:
            # Parameterized query: the driver escapes the values safely
            self.cursor.execute(
                "INSERT INTO books(name, author) VALUES (%s, %s)",
                (item['name'], item['author']))
            self.conn.commit()
        except MySQLdb.Error as e:
            print('Error %d: %s' % (e.args[0], e.args[1]))
        return item
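And the registration side, a minimal sketch of settings.py; the MYSQL_* values are example assumptions (use your own credentials, and replace MyProject with your project name):

# settings.py
MYSQL_HOST = 'localhost'   # assumed values -- substitute your own
MYSQL_DBNAME = 'scrapy_db'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'password'

ITEM_PIPELINES = {
    'MyProject.pipelines.MySQLStorePipeline': 300,  # 0-1000; lower runs first
}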

Reading start_requests from a MySQL database

import MySQLdb
import MySQLdb.cursors
import scrapy
from MyProject.settings import MYSQL_HOST, MYSQL_DBNAME, MYSQL_USER, MYSQL_PASSWD

def start_requests(self):
    """Yield start requests built from rows stored in a MySQL database."""
    conn = MySQLdb.connect(host=MYSQL_HOST, user=MYSQL_USER, passwd=MYSQL_PASSWD,
                           db=MYSQL_DBNAME, charset='utf8', use_unicode=True)
    cursor = conn.cursor(MySQLdb.cursors.DictCursor)  # rows come back as dicts
    cursor.execute('SELECT * FROM scrapyed_table')
    rows = cursor.fetchall()
    cursor.close()  # rows are already materialized, so close before yielding
    conn.close()
    for row in rows:
        yield scrapy.Request(row['url'], callback=self.parse)
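For reference, a one-time setup sketch of the table the snippet reads from; this schema is hypothetical, and only the url column is actually required:

import MySQLdb
from MyProject.settings import MYSQL_HOST, MYSQL_DBNAME, MYSQL_USER, MYSQL_PASSWD

# Hypothetical schema for 'scrapyed_table'; any table with a 'url' column works
conn = MySQLdb.connect(host=MYSQL_HOST, user=MYSQL_USER, passwd=MYSQL_PASSWD,
                       db=MYSQL_DBNAME, charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS scrapyed_table (
        id  INT AUTO_INCREMENT PRIMARY KEY,
        url VARCHAR(255) NOT NULL
    )
""")
conn.commit()
cur.close()
conn.close()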

Passing information with the meta parameter of Request

Redirects (among other things) can change the original start_url by the time the response arrives; the meta parameter of Request lets you carry the original URL along with the request.

from scrapy import Request

def start_requests(self):
    start_url = 'your_scrapy_start_url'
    # Stash the original URL in meta; it travels with the request
    yield Request(start_url, self.parse, meta={'start_url': start_url})

def parse(self, response):
    item = YourItem()
    # response.url may be the post-redirect URL; meta still holds the original
    item['start_url'] = response.meta['start_url']
    yield item
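On Scrapy 1.7 and later, cb_kwargs is a cleaner way to pass data to a callback; a minimal sketch of the same idea:

def start_requests(self):
    start_url = 'your_scrapy_start_url'
    # Each cb_kwargs entry becomes a keyword argument of the callback
    yield Request(start_url, self.parse, cb_kwargs={'start_url': start_url})

def parse(self, response, start_url):
    item = YourItem()
    item['start_url'] = start_url
    yield item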

Writing log output to a file

Set LOG_FILE in settings.py:

LOG_FILE = "log.txt"  # write log output to log.txt
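Related logging settings can be combined with it; a small sketch with example values:

LOG_LEVEL = "INFO"      # suppress DEBUG noise (Scrapy's default level is DEBUG)
LOG_ENCODING = "utf-8"  # encoding of the log file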
