A List of Handy Python Scripts

1. Strip all HTML tags from an HTML file, keeping only the plain text

import re

def remove_html_tags(text):
    """Remove html tags from a string"""
    # Non-greedy match: everything from a '<' up to the nearest '>'.
    clean = re.compile(r'<.*?>')
    return clean.sub('', text)
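
The regular expression above is quick, but it treats any `<...>` run as a tag and, because `.` does not match newlines, it leaves tags that span several lines untouched. As an alternative that is not part of the original script, here is a minimal sketch built on the standard library's `html.parser`. The names `_TextExtractor` and `remove_html_tags_parsed` are made up for illustration. Note that this version also keeps the text inside `<script>` and `<style>` elements.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes it is fed."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def remove_html_tags_parsed(html):
    """Return the text content of an HTML string (illustrative name)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(remove_html_tags_parsed('<p>Hello, <b>world</b>!</p>'))  # Hello, world!
```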

2. Extract the domain name from a URL

Use the urllib.parse module (named urlparse in Python 2): Parse URLs into components

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def extractDomainFromURL(url):
    """Get domain name from url"""
    parsed_uri = urlparse(url)
    # netloc is the network-location part of the URL, e.g. 'www.revotu.com'.
    domain = parsed_uri.netloc
    return domain
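
To see what this returns in practice, here is a short usage sketch. The example paths, port, and user name are invented for illustration. Two things are worth knowing: `netloc` includes any user info and port, whereas the `hostname` attribute gives just the host, and a URL without a scheme (no `//`) yields an empty `netloc`.

```python
from urllib.parse import urlparse


def extractDomainFromURL(url):
    """Get domain name from url (same logic as the function above)."""
    return urlparse(url).netloc


# Example URLs are made up for illustration.
print(extractDomainFromURL('http://www.revotu.com/some/path?q=1'))
# www.revotu.com

parts = urlparse('http://user@www.revotu.com:8080/')
print(parts.netloc)    # user@www.revotu.com:8080
print(parts.hostname)  # www.revotu.com

# Without a scheme there is no '//' marker, so netloc comes back empty.
print(repr(extractDomainFromURL('www.revotu.com/path')))  # ''
```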

3. Get the status code of an HTTP request (200, 404, etc.)

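HTTP has more than the GET method (which requests the headers plus the body); there is also the HEAD method, which requests only the headers.
Use the http.client module (named httplib in Python 2): HTTP protocol client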

import http.client  # Python 2: import httplib

def get_status_code(host, path="/"):
    """ This function retrieves the status code of a website by requesting
        HEAD data from the host. This means that it only requests the headers.
        If the host cannot be reached or something else goes wrong, it returns
        None instead.
    """
    try:
        conn = http.client.HTTPConnection(host)
        conn.request("HEAD", path)
        return conn.getresponse().status
    except Exception:  # Python 2 used StandardError, which no longer exists
        return None

print(get_status_code("www.revotu.com"))  # prints 200
print(get_status_code("www.revotu.com", "/nonexistant"))  # prints 404
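
If you want to try the HEAD approach without depending on a live website, one option is to point it at a throwaway local server. The sketch below is only an illustration under a few assumptions: the `_Handler` class, the `timeout` parameter, and the narrower `except` clause are additions that are not in the original script.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def get_status_code(host, path="/", timeout=5):
    """HEAD-request `path` on `host`; return the status code or None."""
    try:
        conn = http.client.HTTPConnection(host, timeout=timeout)
        try:
            conn.request("HEAD", path)
            return conn.getresponse().status
        finally:
            conn.close()
    except (http.client.HTTPException, OSError):
        return None


class _Handler(BaseHTTPRequestHandler):
    """Hypothetical test handler: 200 for '/', 404 for anything else."""

    def do_HEAD(self):
        self.send_response(200 if self.path == "/" else 404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), _Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host = "127.0.0.1:%d" % server.server_port

ok = get_status_code(host)
missing = get_status_code(host, "/nonexistent")

server.shutdown()
server.server_close()
unreachable = get_status_code(host)  # nothing is listening any more

print(ok, missing, unreachable)
```

Because only the headers travel over the wire, this check stays cheap even when the page body is large.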

Use the requests module: HTTP for Humans

import requests

def get_status_code(url):
    try:
        r = requests.head(url)
        return r.status_code
    except requests.RequestException:  # Python 2 version caught StandardError
        return None

print(get_status_code('http://www.revotu.com/'))  # prints 200
print(get_status_code('http://www.revotu.com/nonexistant'))  # prints 404

Follow my WeChat official account: Python大师兄