4. Strings and Text

Tags: python

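4.1.1 Problem

You need to split a string into fields, but the delimiters (and the whitespace around them) are not consistent throughout the string.

4.1.2 Solution

The split() method of string objects only suits simple cases. When there are several possible delimiters, or a variable amount of whitespace around them, re.split() is the better choice: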

import re

line = 'asdf fjdk; afed, fjek,asdf, foo'
a = re.split(r'[;,\s]\s*', line)
print(a)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

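The pattern above treats any of the following as a delimiter:
[;,\s]   one semicolon, comma, or whitespace character
\s*      followed by any amount of additional whitespace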
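
To make the difference from str.split() concrete, here is a small sketch using the same sample line. It also shows that wrapping the delimiter alternatives in a capture group makes re.split() keep the matched delimiters in the result; the variable names (fields, values, delimiters) are only illustrative.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# str.split() accepts one fixed separator, so punctuation stays glued to the fields.
naive = line.split(' ')
print(naive)       # ['asdf', 'fjdk;', 'afed,', 'fjek,asdf,', 'foo']

# With a capture group, re.split() also returns the text that matched the group.
fields = re.split(r'(;|,|\s)\s*', line)
values = fields[::2]       # every even index is a field
delimiters = fields[1::2]  # every odd index is the captured delimiter
print(values)      # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
print(delimiters)  # [' ', ';', ',', ',', ',']
```

If the delimiters are not needed but grouping is, a non-capturing group such as (?:;|,|\s) keeps them out of the result.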

4.2.1 Problem

You need to check whether a string starts or ends with specific text.

4.2.2 Solution

str.startswith()        # match at the start
str.endswith()          # match at the end

>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False
>>> url = 'http://www.zhourudong.cn'
>>> url.startswith('http:')
True

To check against several choices at once, pass a tuple of possibilities:

>>> import os
>>> filenames = os.listdir('.')
>>> filenames
[ 'Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h' ]
>>> [name for name in filenames if name.endswith(('.c', '.h')) ]
['foo.c', 'spam.c', 'spam.h']
>>> any(name.endswith('.py') for name in filenames)
True
>>>
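
One detail worth checking when the choices come from a list: startswith() and endswith() accept a string or a tuple, but not a list. The sketch below uses a hypothetical list named choices and converts it with tuple() before the call.

```python
choices = ['http:', 'https:', 'ftp:']   # hypothetical list of prefixes
url = 'http://www.python.org'

error = None
try:
    url.startswith(choices)             # a list is rejected
except TypeError as exc:
    error = exc

matched = url.startswith(tuple(choices))  # convert to a tuple first
print(type(error).__name__, matched)      # TypeError True
```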

Alternative approaches

# Approach 1: slicing (works, but is less readable)
>>> filename = 'spam.txt'
>>> filename[-4:] == '.txt'
True
>>> url = 'http://www.python.org'
>>> url[:5] == 'http:' or url[:6] == 'https:' or url[:4] == 'ftp:'
True
>>>
# Approach 2: regular expressions
>>> import re
>>> url = 'http://www.python.org'
>>> re.match('http:|https:|ftp:', url)
<_sre.SRE_Match object at 0x101253098>

Checking whether a directory contains files of certain types (by suffix)

if any(name.endswith(('.c', '.h')) for name in listdir(dirname)):
...
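
A runnable version of that check might look like the sketch below. The helper name has_c_sources and the temporary directory are assumptions made purely so the example can run anywhere; they are not part of the original snippet.

```python
import os
import tempfile

def has_c_sources(dirname):
    """Return True if dirname contains at least one .c or .h file."""
    return any(name.endswith(('.c', '.h')) for name in os.listdir(dirname))

# Use a throwaway directory so the example does not depend on local files.
with tempfile.TemporaryDirectory() as tmp:
    before = has_c_sources(tmp)                      # empty directory
    open(os.path.join(tmp, 'spam.h'), 'w').close()   # create an empty header
    after = has_c_sources(tmp)

print(before, after)   # False True
```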

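4.3 Matching Strings Using Shell Wildcard Patterns

4.3.1 Problem

You want to match text strings using the wildcard patterns commonly used in Unix shells (for example *.py, Dat[0-9]*.csv, and so on).

4.3.2 Solution

The fnmatch module provides two functions, fnmatch() and fnmatchcase(), that can be used for this kind of matching. They are used as follows: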

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
>>> names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
>>> [name for name in names if fnmatch(name, 'Dat*.csv')]
['Dat1.csv', 'Dat2.csv']
>>>

# Note: fnmatch() follows the case-sensitivity rules of the underlying operating system
>>> # On OS X (Mac)
>>> fnmatch('foo.txt', '*.TXT')
False
>>> # On Windows
>>> fnmatch('foo.txt', '*.TXT')
True
>>>
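
fnmatchcase() is imported above but not demonstrated. As a sketch, it compares exactly according to the case you supply, regardless of platform, and it works on any strings, not just filenames. The addresses list is made-up sample data.

```python
from fnmatch import fnmatchcase

# Case is always significant with fnmatchcase(), on every platform.
same_case = fnmatchcase('foo.txt', '*.txt')
other_case = fnmatchcase('foo.txt', '*.TXT')
print(same_case, other_case)   # True False

# Wildcards can also filter ordinary (non-filename) strings.
addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]
streets = [addr for addr in addresses if fnmatchcase(addr, '* ST')]
clark_54xx = [addr for addr in addresses
              if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]
print(streets)      # ['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']
print(clark_54xx)   # ['5412 N CLARK ST']
```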

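4.4 Matching and Searching for Text Patterns

4.4.1 Problem

You need to match or search text for a specific pattern.

4.4.2 Solution

If the text you want to match is a simple literal, the basic string methods are usually enough, for example str.find(), str.endswith(), str.startswith(), and similar: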

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> # Exact match
>>> text == 'yeah'
False
>>> # Match at start or end
>>> text.startswith('yeah')
True
>>> text.endswith('no')
False
>>> # Search for the location of the first occurrence
>>> text.find('no')
10
>>>

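More complicated matching calls for regular expressions and the re module. To illustrate the basic mechanics, suppose you want to match dates written as digits, such as 11/27/2012. You could do it like this: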

>>> text1 = '11/27/2012'
>>> text2 = 'Nov 27, 2012'
>>>
>>> import re
>>> # Simple matching: \d+ means match one or more digits
>>> if re.match(r'\d+/\d+/\d+', text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if re.match(r'\d+/\d+/\d+', text2):
...     print('yes')
... else:
...     print('no')
...
no
>>>
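
When the same pattern is used repeatedly, it is common to compile it once and reuse the pattern object. The sketch below also adds capture groups to pull the date apart, uses findall() to search a whole string, and shows that match() only anchors at the start of the string. The name datepat and the sample strings are illustrative.

```python
import re

# Compile once; the parentheses define three capture groups.
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

m = datepat.match('11/27/2012')
month, day, year = m.groups()
print(month, day, year)   # 11 27 2012

# findall() scans the whole string and returns one tuple per match.
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
found = datepat.findall(text)
print(found)              # [('11', '27', '2012'), ('3', '13', '2013')]

# match() checks only the beginning, so trailing junk is accepted...
loose = datepat.match('11/27/2012abcdef')
# ...unless the pattern is anchored at the end with $.
strict = re.match(r'(\d+)/(\d+)/(\d+)$', '11/27/2012abcdef')
print(loose is not None, strict is not None)   # True False
```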

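4.5 Searching and Replacing Text

4.5.1 Problem

You want to search for a text pattern in a string and replace it.

4.5.2 Solution

For simple literal patterns, use the str.replace() method directly. For example: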

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> text.replace('yeah', 'yep')
'yep, but no, but yep, but no, but yep'
>>>

For more complicated patterns, use the sub() function in the re module. To illustrate, suppose you want to rewrite dates of the form 11/27/2012 as 2012-11-27. For example:

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>>

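Note: the first argument to sub() is the pattern to match, and the second is the replacement pattern. Backslashed digits such as \3 refer to the capture group numbers in the pattern.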
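
The replacement does not have to be a string. As a sketch, sub() also accepts a callback that receives each match object, which helps when the replacement needs some computation; here a hypothetical helper iso_date zero-pads the month and day. subn() is shown as well, since it also reports how many substitutions were made.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

def iso_date(m):
    """Build YYYY-MM-DD from a match, zero-padding month and day."""
    month, day, year = m.groups()
    return '{}-{:0>2}-{:0>2}'.format(year, month, day)

padded = datepat.sub(iso_date, text)
print(padded)   # Today is 2012-11-27. PyCon starts 2013-03-13.

# subn() returns the new text together with the substitution count.
newtext, count = datepat.subn(r'\3-\1-\2', text)
print(count)    # 2
```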

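4.6 Searching and Replacing Case-Insensitive Text

4.6.1 Problem

You need to search for and replace text in a case-insensitive manner.

4.6.2 Solution

To ignore case in text operations, use the re module and pass the re.IGNORECASE flag to the operations. For example: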

>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']
>>> re.sub('python', 'snake', text, flags=re.IGNORECASE)
'UPPER snake, lower snake, Mixed snake'
>>>

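The re.sub() example reveals a limitation: the replacement text does not automatically follow the case of the text it replaces. To fix this, you may need a helper function such as the following: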

def matchcase(word):
	def replace(m):
		text = m.group()
		if text.isupper():
			return word.upper()
		elif text.islower():
			return word.lower()
		elif text[0].isupper():
			return word.capitalize()
		else:
			return word
	return replace

Here is how the function above is used:

>>> re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
'UPPER SNAKE, lower snake, Mixed Snake'
>>>
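
The re.IGNORECASE flag used in this section can be supplied in a few equivalent ways. The sketch below shows two of them: attaching the flag to a compiled pattern, and embedding the inline (?i) marker at the start of the pattern string. Which one to prefer is mostly a matter of taste.

```python
import re

text = 'UPPER PYTHON, lower python, Mixed Python'

# Option 1: attach the flag when compiling, then reuse the pattern object.
pat = re.compile('python', re.IGNORECASE)
compiled_hits = pat.findall(text)

# Option 2: embed the flag in the pattern itself with (?i) at the start.
inline_hits = re.findall('(?i)python', text)

print(compiled_hits)   # ['PYTHON', 'python', 'Python']
print(inline_hits)     # ['PYTHON', 'python', 'Python']
```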

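4.8 Matching Across Multiple Lines

4.8.1 Problem

You are trying to match a block of text with a regular expression, and the match needs to span multiple lines.

4.8.2 Solution

This problem typically arises when you use the dot (.) to match any character and forget that the dot does not match newlines. For example, suppose you are trying to match C-style comments: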

>>> comment = re.compile(r'/\*(.*?)\*/')
>>> text1 = '/* this is a comment */'
>>> text2 = '''/* this is a
... multiline comment */
... '''
>>>
>>> comment.findall(text1)
[' this is a comment ']
>>> comment.findall(text2)
[]
>>>
# Fix: compile with re.DOTALL so that the dot also matches newlines
>>> comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
>>> comment.findall(text2)
[' this is a\nmultiline comment ']
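
An alternative that avoids the flag is to spell out the newline in the pattern itself. The sketch below uses a non-capturing group, (?:.|\n), so that "any character or a newline" is matched without adding an extra capture group. It should give the same result as the re.DOTALL version for this input.

```python
import re

text2 = '/* this is a\nmultiline comment */\n'

# (?:.|\n) matches any character including newline; *? keeps it non-greedy.
comment = re.compile(r'/\*((?:.|\n)*?)\*/')
explicit = comment.findall(text2)

# Same search with the flag, for comparison.
dotall = re.compile(r'/\*(.*?)\*/', re.DOTALL).findall(text2)

print(explicit)            # [' this is a\nmultiline comment ']
print(explicit == dotall)  # True
```

re.DOTALL is fine for simple cases; the explicit form can be easier to reason about when a large pattern is assembled from several pieces and only some of them should cross line boundaries.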

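4.11 Stripping Unwanted Characters from Strings

4.11.1 Problem

You want to strip unwanted characters, such as whitespace, from the beginning, end, or middle of a text string.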
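
The notes stop at the problem statement here. As a hedged sketch of one common approach (not necessarily the solution the notes intended), the strip() family handles the ends of a string, while replace() or re.sub() is needed for characters in the middle. The sample strings are made up.

```python
import re

s = '  hello world \n'
both = s.strip()     # whitespace removed from both ends
left = s.lstrip()    # only the left side
right = s.rstrip()   # only the right side
print(repr(both), repr(left), repr(right))
# 'hello world' 'hello world \n' '  hello world'

# strip() also accepts the set of characters to remove.
t = '-----hello====='
print(t.strip('-='))   # hello

# strip() never touches the middle of the string; use replace() or re.sub().
u = 'hello     world'
removed = u.replace(' ', '')         # drop every space
collapsed = re.sub(r'\s+', ' ', u)   # collapse runs of whitespace to one space
print(removed, '|', collapsed)       # helloworld | hello world
```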
