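Efficient Text Search: Mastering grep and Regular Expressions

This article takes a close look at the Linux command-line tool grep and shows how to combine it with regular expressions for efficient text searching and pattern matching. It covers basic usage, common options, regular expression syntax (anchors, character sets, quantifiers), and the differences between grep, egrep, and fgrep, with practical examples along the way.

grep is one of the most useful commands in the Linux terminal environment. The name stands for "global regular expression print": grep reads its input line by line and prints every line that matches a given pattern, which can range from a fixed string to a complex rule. This guide walks through grep's common options and then shows how to use regular expressions for more advanced searches.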

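Prerequisites

To follow this guide you need a computer running Linux, either a remote server reached over SSH or your local machine. The examples were verified on Ubuntu 20.04 but should work on any mainstream Linux distribution.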

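Getting the Example Files

This tutorial uses the text of the GNU General Public License version 3 (GPL-3) and the BSD license for its demonstrations. Obtain the files as follows:

  • Get the GPL-3 file:

    • On Ubuntu: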
      cp /usr/share/common-licenses/GPL-3 .
      
    • On other Linux systems, download it with curl:
      curl -o GPL-3 https://www.gnu.org/licenses/gpl-3.0.txt
      
  • Get the BSD license file:

    • On Ubuntu (or another Debian-based system):
      cp /usr/share/common-licenses/BSD .
      
    • On other systems, create it with cat:
      cat << 'EOF' > BSD
      Copyright (c) The Regents of the University of California.
      All rights reserved.
      
      Redistribution and use in source and binary forms, with or without
      modification, are permitted provided that the following conditions
      are met:
      1. Redistributions of source code must retain the above copyright
         notice, this list of conditions and the following disclaimer.
      2. Redistributions in binary form must reproduce the above copyright
         notice, this list of conditions and the following disclaimer in the
         documentation and/or other materials provided with the distribution.
      3. Neither the name of the University nor the names of its contributors
         may be used to endorse or promote products derived from this software
         without specific prior written permission.
      
      THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
      ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
      IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
      ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
      FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
      DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
      OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
      HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
      LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
      OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
      SUCH DAMAGE.
      EOF
      

Basic grep Usage

In its simplest form, grep searches a text file for a literal pattern.

Basic Search

grep "GNU" GPL-3

Sample output:

                    GNU GENERAL PUBLIC LICENSE
  The GNU General Public License is a free, copyleft license for
the GNU General Public License is intended to guarantee your freedom to
GNU General Public License for most of our software; it applies also to
  Developers that use the GNU GPL protect your rights with two steps:
  "This License" refers to version 3 of the GNU General Public License.
  13. Use with the GNU Affero General Public License.
under version 3 of the GNU Affero General Public License into a single
...

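Common Options

grep provides a number of options that modify its search behavior:

  • -i or --ignore-case: ignores case differences between the pattern and the input.

    grep -i "license" GPL-3
    

    Sample output (includes LICENSE, license, License, and so on):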

                        GNU GENERAL PUBLIC LICENSE
     of this license document, but changing it is not allowed.
      The GNU General Public License is a free, copyleft license for
      The licenses for most software and other practical works are designed
    the GNU General Public License is intended to guarantee your freedom to
    ...
    
  • -v or --invert-match: prints every line that does not contain the pattern.

    grep -v "the" BSD
    

    Sample output (lines that do not contain the lowercase string "the"):

    All rights reserved.
    
    Redistribution and use in source and binary forms, with or without
    are met:
        may be used to endorse or promote products derived from this software
        without specific prior written permission.
    
    THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
    ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    ...
    
  • -n or --line-number: prefixes each matching line with its line number.

    grep -vn "the" BSD
    

    Sample output:

    2:All rights reserved.
    3:
    4:Redistribution and use in source and binary forms, with or without
    6:are met:
    13:   may be used to endorse or promote products derived from this software
    14:   without specific prior written permission.
    15:
    16:THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
    17:ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    ...
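
The options above can also be combined in a single invocation. The following is a minimal sketch that runs against a small throwaway file; the file contents and the variable names are made up for illustration, chosen so that the results are easy to check by eye.

```shell
# Hypothetical four-line sample file, created only for this demonstration.
sample=$(mktemp)
printf '%s\n' 'GNU License' 'the quick fox' 'All rights reserved.' 'license terms' > "$sample"

# -i ignores case and -n adds line numbers: lines 1 and 4 match.
case_insensitive=$(grep -in 'license' "$sample")
printf '%s\n' "$case_insensitive"

# -v inverts the match and -n numbers the surviving lines:
# only line 2 contains "the", so it is the only line dropped.
without_the=$(grep -vn 'the' "$sample")
printf '%s\n' "$without_the"
```

Single-letter options can be bundled (-in, -vn) or written separately (-i -n); both forms behave the same.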
    

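Regular Expressions

A regular expression is a text string that describes a search pattern; it is the source of most of grep's power.

Literal Matching

A literal match specifies exactly the characters to look for. The patterns GNU and the used above are both literal matches.

Anchor Matching (Anchors)

Anchors are special characters that specify where in the line a match must occur:

  • ^ (start of line): the pattern must match at the beginning of a line.

    grep "^GNU" GPL-3
    

    Sample output: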

    GNU General Public License for most of our software; it applies also to
    GNU General Public License, you may choose any version ever published
    
  • $ (end of line): the pattern must match at the end of a line.

    grep "and$" GPL-3
    

    Sample output:

    that there is no warranty for this free software.  For both users' and
      The precise terms and conditions for copying, distribution and
      License.  Each licensee is addressed as "you".  "Licensees" and
    receive it, in any medium, provided that you conspicuously and
    ...
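
To see both anchors at work on input that is small enough to verify by hand, the sketch below uses a made-up five-line file (the contents and variable names are assumptions, not part of the license files). It also shows that the two anchors together, with nothing in between, match empty lines.

```shell
# Hypothetical sample: line 3 is intentionally empty.
sample=$(mktemp)
printf '%s\n' 'GNU tools' 'uses GNU tools' '' 'copying and' 'and more' > "$sample"

# ^ anchors to the start of the line: only line 1 begins with GNU.
starts_with_gnu=$(grep -n '^GNU' "$sample")

# $ anchors to the end of the line: only line 4 ends with "and".
ends_with_and=$(grep -n 'and$' "$sample")

# ^$ leaves no room for any character, so it matches empty lines only.
empty_lines=$(grep -n '^$' "$sample")

printf '%s\n' "$starts_with_gnu" "$ends_with_and" "$empty_lines"
```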
    

Matching Any Character

  • . (any single character): matches any one character at that position (except a newline).
    grep "..cept" GPL-3
    
    Sample output:
    use, which is precisely where it is most unacceptable.  Therefore, we
    infringement under applicable copyright law, except executing it on a
    tells the user that there is no warranty for the work (except to the
    License by making exceptions from one or more of its conditions.
    ...
    

Character Sets

Square brackets [] specify that the position may hold any one of the characters listed inside them.

  • [wo] (matches 'w' or 'o'):

    grep "t[wo]o" GPL-3
    

    Sample output:

    your programs, too.
    freedoms that you received.  You must make sure that they, too, receive
      Developers that use the GNU GPL protect your rights with two steps:
    a computer network, with no transfer of a copy, is not conveying.
    ...
    
  • [^c] (matches any character except 'c'):

    grep "[^c]ode" GPL-3
    

    Sample output:

      1. Source Code.
        model, to give anyone who possesses the object code either (1) a
    the only significant mode of use of the product.
    notice like this when it starts in an interactive mode:
    
  • [A-Z] (character range): matches any uppercase letter from A to Z.

    grep "^[A-Z]" GPL-3
    

    Sample output:

    GNU General Public License for most of our software; it applies also to
    States should not allow patents to restrict development and use of
    License.  Each licensee is addressed as "you".  "Licensees" and
    Component, and (b) serves only to enable use of the work with that
    ...
    

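    Tip: the POSIX character class [[:upper:]] is usually the more accurate and recommended choice, because it handles letters correctly across locales, for example: grep "^[[:upper:]]" GPL-3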
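
A line can match [^c]ode even though it contains the word "code", as long as some other part of the line satisfies the pattern (for example "mode", or "Code" with an uppercase C). One way to check which part actually matched is the -o option, which GNU and BSD grep provide to print only the matched text. The input lines in this sketch are made up for illustration.

```shell
# -o prints each matched substring on its own line instead of the whole line.
# Line 1 matches through "model", line 2 through "Code", line 3 not at all.
matched=$(printf '%s\n' \
  'the object code of this model' \
  '1. Source Code.' \
  'plain code only' | grep -o '[^c]ode')

printf '%s\n' "$matched"
```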

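Repetition

  • * (zero or more times): repeats the preceding character or expression zero or more times. In the command below the parentheses are literal characters, because basic regular expressions treat unescaped ( and ) literally.
    grep "([A-Za-z ]*)" GPL-3
    
    Sample output: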
     Copyright (C) 2007 Free Software Foundation, Inc.
    distribution (with or without modification), making available to the
    than the work as a whole, that (a) is included in the normal form of
    Component, and (b) serves only to enable use of the work with that
    ...
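
Because * also allows zero repetitions, the same pattern matches an empty pair of parentheses. The following sketch illustrates this with a hypothetical five-line file; only the parenthesized content decides whether a line matches.

```shell
# Hypothetical sample lines covering the interesting cases.
sample=$(mktemp)
printf '%s\n' 'empty ()' 'item (a)' '(with or without modification)' \
  'step (1)' 'no parentheses' > "$sample"

# ( and ) are literal in a basic regular expression.
# [A-Za-z ]* accepts any run of letters and spaces, including an empty one,
# so "()" matches, while "(1)" does not because a digit is not in the set.
star_matches=$(grep -n '([A-Za-z ]*)' "$sample")
printf '%s\n' "$star_matches"
```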
    

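Escaping Metacharacters

A backslash \ escapes a metacharacter that would otherwise have a special meaning, so that it is matched as a literal character.

grep "^[A-Z].*\.$" GPL-3

Sample output (lines that begin with an uppercase letter and end with a literal period):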

Source.
License by making exceptions from one or more of its conditions.
License would be to refrain entirely from conveying the Program.
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
SUCH DAMAGES.
Also add information on how to contact you by electronic and paper mail.
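
The effect of escaping is easiest to see by running the same pattern with and without the backslash. This sketch uses a made-up three-line file and the standard -c option, which prints the number of matching lines instead of the lines themselves.

```shell
# Hypothetical sample: only the first line contains a literal "1.0".
sample=$(mktemp)
printf '%s\n' 'version 1.0' 'version 1x0' 'version 100' > "$sample"

# Unescaped, . matches any single character, so all three lines match.
unescaped_count=$(grep -c '1.0' "$sample")

# Escaped, \. matches only a literal period.
escaped_count=$(grep -c '1\.0' "$sample")

echo "unescaped: $unescaped_count, escaped: $escaped_count"
```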

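Extended Regular Expressions (grep -E / egrep)

The grep -E option, or the egrep command, enables extended regular expressions, which provide additional operators without the need for backslashes.

Grouping

Parentheses () combine an expression into a single unit.

grep -E "(grouping)" file.txt
# Equivalent to grep "\(grouping\)" file.txt or egrep "(grouping)" file.txt

Alternation

The | character means "or"; it is usually combined with grouping.

grep -E "(GPL|General Public License)" GPL-3

Sample output: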

  The GNU General Public License is a free, copyleft license for
the GNU General Public License is intended to guarantee your freedom to
GNU General Public License for most of our software; it applies also to
price.  Our General Public Licensees are designed to make sure that you
  Developers that use the GNU GPL protect your rights with two steps:
  For the developers' and authors' protection, the GPL clearly explains
authors' sake, the GPL requires that modified versions be marked as
have designed this version of the GPL to prohibit the practice for those
...
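
Grouping matters most when alternation is combined with anchors or surrounding text, because | has the lowest precedence and otherwise splits the entire pattern. The sketch below compares a grouped and an ungrouped pattern on a hypothetical four-line file; -c reports the number of matching lines.

```shell
# Hypothetical sample lines.
sample=$(mktemp)
printf '%s\n' 'GPL license' 'BSD license' 'GPL' 'my BSD license' > "$sample"

# Grouped: the anchors and " license" apply to both alternatives,
# so only the first two lines match.
grouped=$(grep -cE '^(GPL|BSD) license$' "$sample")

# Ungrouped: the pattern reads as "^GPL" OR "BSD license$",
# so every line in the sample matches.
ungrouped=$(grep -cE '^GPL|BSD license$' "$sample")

echo "grouped: $grouped, ungrouped: $ungrouped"
```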

Quantifiers

  • ? (zero or one time): makes the preceding character or expression optional.

    grep -E "(copy)?right" GPL-3
    

    Sample output (matches both copyright and right):

     Copyright (C) 2007 Free Software Foundation, Inc.
      To protect your rights, we need to prevent others from denying you
    these rights or asking you to surrender the rights.  Therefore, you have
    know their rights.
    ...
    
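  • + (one or more times): matches the preceding expression one or more times.

    grep -E "free[^[:space:]]+" GPL-3
    

    Sample output (matches "free" followed by one or more non-whitespace characters):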

      The GNU General Public License is a free, copyleft license for
    to take away your freedom to share and change the works.  By contrast,
    the GNU General Public License is intended to guarantee your freedom to
      When we speak of free software, we are referring to freedom, not
    ...
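
The difference between ? and + shows up clearly when both are applied to the same character. The following sketch uses a made-up word list; the words and variable names are assumptions chosen for illustration.

```shell
# Hypothetical word list with zero, one, and two "u" characters.
sample=$(mktemp)
printf '%s\n' 'color' 'colour' 'colouur' 'colr' > "$sample"

# u? allows zero or one "u": color and colour match.
optional_u=$(grep -E 'colou?r' "$sample")

# u+ requires one or more "u": colour and colouur match.
repeated_u=$(grep -E 'colou+r' "$sample")

printf 'optional:\n%s\nrepeated:\n%s\n' "$optional_u" "$repeated_u"
```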
    

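Specifying Repetition Counts

Braces {} specify an exact count or a range of repetitions:

  • {n}: matches exactly n times.
  • {n,m}: matches between n and m times.
  • {n,}: matches at least n times.

grep -E "[AEIOUaeiou]{3}" GPL-3 # lines containing three consecutive vowels
grep -E "[[:alpha:]]{16,20}" GPL-3 # lines containing a run of at least 16 letters (a single match consumes at most 20)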
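
Interval expressions are a natural fit for fixed-width formats. As a sketch, the pattern below looks for dates written as four digits, two digits, and two digits separated by hyphens; the date format and sample lines are assumptions. It also uses two options not covered above: -o prints only the matched text, and -x requires the pattern to match the whole line.

```shell
# Hypothetical sample lines.
sample=$(mktemp)
printf '%s\n' '2024-01-15' 'released 2023-12-01 today' '24-1-5' > "$sample"

# {4} and {2} fix the number of digits in each part of the date.
date_pattern='[0-9]{4}-[0-9]{2}-[0-9]{2}'

# -o extracts every date found anywhere in a line.
dates_found=$(grep -oE "$date_pattern" "$sample")

# -x keeps only the lines that consist of a date and nothing else.
whole_line_dates=$(grep -xE "$date_pattern" "$sample")

printf 'found:\n%s\nwhole line:\n%s\n' "$dates_found" "$whole_line_dates"
```

The pattern checks only the shape of the text, not whether the date is valid; a value such as 2024-99-99 would also match.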

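Practical Use Cases

grep combined with regular expressions is widely used in day-to-day development and system administration:

  1. Validating CSV fields: check that each line of a CSV file has the expected number of comma-separated fields (five non-empty fields in this example).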

    grep -E "^[^,]+,[^,]+,[^,]+,[^,]+,[^,]+$" yourfile.csv
    
  2. Filtering logs by level: select the log lines that contain a particular level, such as ERROR.

    grep "ERROR" logs.txt
    
  3. Searching source code for a function: recursively search a directory for every occurrence of a function name.

    grep -r "calculateTotal" /path/to/source/code/directory
    
  4. Matching URLs or email addresses: find URLs or email addresses in text.

    grep -E "https?://[^ ]+" yourfile.txt # matches HTTP/HTTPS URLs
    # grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" yourfile.txt # matches email addresses
    
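  5. NLP preprocessing, filtering stop words: drop the lines that contain common stop words, for example in a natural language processing (NLP) task. The -w option restricts matches to whole words; without it, the alternative a would exclude every line that contains the letter "a".

    grep -vwE "the|and|a" yourfile.txt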
    
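  6. Detecting near-duplicates or typos: match suspicious patterns such as a doubled character. The \w shorthand and back-references in extended regular expressions are GNU grep extensions.

    grep -E "(\w)\1" yourfile.txt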
    
  7. Named entities or common phrases: find a specific phrase in text.

    grep -E "named entity recognition" yourfile.txt
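
The five-field CSV pattern from item 1 can be written more compactly with an interval expression, and inverting it turns the check into a report of the rows that fail. The following is a sketch under the same assumption as item 1 (five non-empty fields, no quoted commas); the CSV contents are made up.

```shell
# Hypothetical CSV: row 3 has only four fields, row 4 has an empty field.
csv=$(mktemp)
printf '%s\n' \
  'id,name,city,age,email' \
  '1,Ann,Oslo,34,ann@example.com' \
  '2,Bob,Rome,41' \
  '3,Cy,,29,cy@example.com' > "$csv"

# (,[^,]+){4} repeats ",field" four times, giving five non-empty fields in total.
# -v inverts the match and -n adds line numbers, so the output lists invalid rows.
invalid_rows=$(grep -vnE '^[^,]+(,[^,]+){4}$' "$csv")
printf '%s\n' "$invalid_rows"
```

This approach treats every comma as a separator, so it is not suitable for CSV files whose fields contain quoted commas.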
    

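Common Mistakes and Debugging

Some common problems come up when using grep with regular expressions:

  1. Unescaped metacharacters: to match characters such as * or . literally, escape them with a backslash \ (for example \*). The same applies to + and ? in extended regular expressions; in basic regular expressions they are already literal, and GNU grep treats \+ and \? as quantifiers.
  2. Matching empty lines or lines that contain only whitespace (\s is a GNU extension; [[:space:]] is the portable equivalent):
    grep -E "^\s*$" yourfile.txt
    
  3. Matching tab (\t) and carriage return (\r) characters: grep -E does not interpret \t or \r as control characters, so pass the actual character to grep instead:
    grep "$(printf '\t')" yourfile.txt   # lines containing a tab
    grep "$(printf '\r')" yourfile.txt   # lines containing a carriage return
    
  4. Incorrectly quoted patterns: when a pattern contains spaces or special characters, enclose it in single or double quotes so that the shell does not interpret it.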
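
Control characters are a frequent source of confusion, because the basic and extended syntaxes of grep do not define \t or \r as escapes for them. One portable approach, sketched below, is to let printf produce the real character and pass it to grep through a variable. The sample file and variable names are assumptions made for this demonstration.

```shell
# Hypothetical sample: one plain line, one with a tab, one with a CRLF ending.
sample=$(mktemp)
printf 'plain line\n' > "$sample"
printf 'tab\there\n' >> "$sample"
printf 'windows line\r\n' >> "$sample"

# printf emits the real control characters; command substitution keeps them
# (it only strips trailing newlines).
tab=$(printf '\t')
cr=$(printf '\r')

# Count lines containing a tab, and lines ending in a carriage return.
tab_lines=$(grep -c "$tab" "$sample")
cr_lines=$(grep -c "$cr\$" "$sample")

echo "tab lines: $tab_lines, CR-terminated lines: $cr_lines"
```

A non-zero count of CR-terminated lines usually indicates a file saved with Windows line endings, which is a common reason why a pattern anchored with $ fails to match.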

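Differences Between grep, egrep, and fgrep

Command  Description                Features                               Use case                                                     Example command
grep     Basic pattern matching     Supports basic regular expressions     General-purpose pattern matching                             grep "pattern" file.txt
egrep    Extended pattern matching  Supports extended regular expressions  Complex patterns (no need to escape ?, +, |, (), and so on)  egrep "pattern" file.txt
fgrep    Fixed-string matching      No regular expression support          Matching fixed strings (fast)                                fgrep "pattern" file.txt

Note: egrep is equivalent to grep -E, and fgrep is equivalent to grep -F. The main difference between the three is the kind of pattern they accept: grep uses basic regular expressions, egrep uses extended regular expressions, and fgrep does not support regular expressions at all, treating the pattern as a literal string, which is more efficient when matching fixed text. Recent GNU grep releases mark egrep and fgrep as obsolescent and recommend grep -E and grep -F instead.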

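Handling Multi-Line Patterns

grep is line-oriented by design, so it is not well suited to patterns that span several lines. For those cases, more capable tools such as awk or perl can be used:

  • awk: a text-processing language that can handle more complex text streams.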

    awk '/pattern/ {print $0}' yourfile.txt
    
  • perl: a general-purpose scripting language with very powerful regular expressions, including multi-line matching.

    perl -0777 -ne 'print if /pattern/s' yourfile.txt
    

    Here -0777 makes perl "slurp" the whole file as a single string, and the s modifier lets . in the pattern match any character, including a newline.
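
awk can also look across line boundaries when its record separator is changed. As a sketch, setting RS to the empty string switches awk to paragraph mode, where each block of text separated by blank lines is one record. The sample text and the words alpha and omega are placeholders chosen for illustration.

```shell
# Hypothetical sample: three paragraphs separated by blank lines.
sample=$(mktemp)
printf '%s\n' 'alpha starts here' 'and omega ends here' '' \
  'alpha alone' '' 'omega alone' > "$sample"

# RS = "" makes each blank-line-separated paragraph a single record,
# so both tests see the whole paragraph rather than one line at a time.
matching_paragraphs=$(awk 'BEGIN { RS = "" } /alpha/ && /omega/' "$sample")
printf '%s\n' "$matching_paragraphs"
```

Only the first paragraph is printed, because it is the only one that contains both words, even though they sit on different lines.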

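Frequently Asked Questions (FAQs)

  1. What is the difference between grep and egrep? grep uses basic regular expressions by default, while egrep (or grep -E) uses the richer extended regular expressions, in which metacharacters such as ?, +, |, and () work without being escaped.

  2. Can I search multiple files with grep? Yes. List several file names or use a wildcard: grep "pattern" file1.txt file2.txt or grep "pattern" *.txt

  3. How do I find lines that do not match a pattern? Use the -v or --invert-match option: grep -v "pattern" yourfile.txt

  4. How do I include line numbers in grep output? Use the -n or --line-number option: grep -n "pattern" yourfile.txt

  5. Why does my grep regular expression not work as expected? Check the following:

    • The regular expression syntax is correct.
    • The right options are in use (for example, -E for extended regular expressions).
    • Special characters (such as ., *, +, and ?) are escaped when they should match literally.
    • The pattern is quoted correctly (wrapped in quotes so that the shell does not interpret it).
  6. How do I search for a pattern that contains spaces or special characters? Enclose a pattern that contains spaces in single or double quotes: grep "My pattern" yourfile.txt. Prefer single quotes when the pattern contains characters the shell would expand, and use a backslash \ to make a regular expression metacharacter literal, for example grep 'pattern with \$pecial char' yourfile.txt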
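
When grep is given more than one file, as in question 2, it prefixes each match with the name of the file it came from. The sketch below demonstrates this in a temporary directory; the file names and contents are made up.

```shell
# Hypothetical files in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' 'alpha' 'TODO: fix' > "$dir/a.txt"
printf '%s\n' 'TODO: test' 'beta' > "$dir/b.txt"
printf '%s\n' 'nothing here' > "$dir/c.txt"

# With several files, each match is reported as file:line-number:text.
# The subshell changes directory so the output shows short relative names.
matches=$(cd "$dir" && grep -n 'TODO' *.txt)
printf '%s\n' "$matches"
```

Files without a match, such as c.txt here, do not appear in the output at all.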

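Summary

grep is a powerful tool for finding patterns in files or across a file system hierarchy. Fluency with its options and with regular expression syntax pays off in everyday log analysis, code search, and data cleaning. Regular expressions are a fundamental concept in computing, and understanding them carries over to other areas, such as advanced search and replace in text editors and data validation in programming languages.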
