Text Processing

Natural Language Processing, or NLP, is a research area in the intersection of computer science, linguistics, and artificial intelligence. The main goal of NLP is enabling computers to 'understand' natural language to execute useful tasks; the methods used to reach this goal are also called NLP. Text Mining refers to the extraction of information from texts; it can be thought of as a subset of NLP.

A corpus is a collection of texts designed to be processed using language analysis tools. A sequence of NLP steps is called an NLP pipeline, and pipelines are commonly used to process corpora. The main elements of a pipeline are morphological analysis, syntactic analysis, and semantic interpretation of natural language. The preprocessing steps needed at the beginning of the NLP pipeline depend on whether the natural language to be analyzed is recorded as speech or as text. For speech, preprocessing typically involves phonological analysis, whereas for text, optical character recognition (OCR), tokenization (i.e., the splitting of text into its elements on the word level), and sentence splitting (i.e., the splitting of text into its sentences) might be required.
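
To illustrate the last two steps, the following sketch splits a short example text first into sentences and then into word-level tokens using nothing but simple string operations; real tokenizers are more sophisticated and handle abbreviations, punctuation, and similar edge cases.

# A minimal sketch of sentence splitting and tokenization (illustrative only)
text = "The appeal is dismissed. The appellant bears the costs."

sentences = [s.strip() for s in text.split(".") if s.strip()]
# ['The appeal is dismissed', 'The appellant bears the costs']

tokens = [word for sentence in sentences for word in sentence.split()]
# ['The', 'appeal', 'is', 'dismissed', 'The', 'appellant', 'bears', 'the', 'costs']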

Depending on the strategy used to solve processing tasks, we distinguish between rule-based procedures and statistical procedures (not only in NLP). As the growth in computing power has only recently made the statistical analysis of large datasets possible, current research focuses largely on statistical procedures. To get started with text mining, however, rule-based procedures are better suited.


Specifically: Regular Expressions (Regex)

Regular expressions, or regex, are one of the oldest and most powerful tools of rule-based text mining. With their help, we can search texts for patterns that we have specified beforehand. Every regex setup therefore has at least three components, illustrated by the short sketch below the list:

  1. a text to be searched;

  2. a pattern to be searched in the text; and

  3. a function performing the search, which receives the text and the pattern as arguments (or which is called as a method on the text or the pattern) and returns the search results.
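
A minimal sketch of these three components in Python (the functions of the re module used here are introduced in detail below):

import re                             # Python's regex module (see below)
text = "As stated in Art. 20(1)..."   # 1. the text to be searched
pattern = r"Art\. \d+"                # 2. the pattern to be searched for
re.search(pattern, text)              # 3. the search function, returning the result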

Since only those patterns are found that conform to our specification, regex are primarily useful for searching texts from domains that use a fairly controlled vocabulary and a well-understood grammar, such as judicial decisions in the legal domain.

Most high-level programming languages offer some regex-based methods for text manipulation that we can call directly on String objects. In addition to these built-ins, there are usually specialized regex modules. These modules typically conform to some general regex principles and offer the classic regex functionalities following the syntax of their host programming language.

Python's standard regex module is called re and can be imported using that name. The full documentation is part of the official Python documentation of the re module; in the following, only a few central syntax elements are presented.

# Importing Python's regex module
import re
# Basic functionality
# -------------------

# Specifying patterns (potentially escaping special characters)
pattern = "Art\. \d+"               # \d stands for a digit

# Specifying the String to be searched (normally read from a file)
string = "As stated in Art. 20(1) and Art. 21(1) of the..."

# Look only for the first match - return match object or match as a String
re.search(pattern, string)          # match object if a match is found, else None
re.search(pattern, string).group(0) # 'Art. 20'

# Search for all matches - return list of matches
re.findall(pattern, string)         # ['Art. 20', 'Art. 21']

# Change the search configuration using flags (separate individual flags with a | )
re.search(pattern, string, re.DOTALL | re.IGNORECASE)
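# (re.DOTALL lets . also match newlines; re.IGNORECASE makes the match case-insensitive)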


# Alternative search option: regex objects (more efficient if regex are used many times)
p = re.compile(pattern)             # compiling a regex object...
result = p.findall(string)          # ...to then call methods on it
# Pattern Construction
# ---------------------------

# Special Characters: . ^ $ * + ? { } [ ] \ | ( ) \A \b \B \d \D \s \S \w \W \Z

# Structuring the Expression
# --------------------------
# .          - any character except a newline (any character at all with re.DOTALL)
# \.         - a fullstop (special characters must be escaped)
# [...]      - a set of characters to be matched
# [^...]     - a set of characters not to be matched
# [a-z]      - range notation for frequently used sets of characters
#              (here: the lowercase letters a through z)
# \W         - abbreviation for [^a-zA-Z0-9_]
#              (works similarly for the other letter-name special characters)
# (...)      - capturing group, accessible using .group(1), .group(2), etc.
# (?:...)    - non-capturing group (not accessible later)
# (...|...)  - several options within a group
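
# Example (sketch): capturing vs. non-capturing groups
m = re.search(r"(Art\.) (?:\d+)", "Art. 20")
m.group(0)                           # 'Art. 20' - the entire match
m.group(1)                           # 'Art.'    - the first capturing group
# (the non-capturing group (?:\d+) is not accessible via .group())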

# Number specifications (after characters or groups)
# ---------------------------
# *          - 0 or more
# +          - 1 or more
# ?          - 0 or 1; placed after another quantifier (e.g., *, +, {min,max}),
#              it makes that quantifier non-greedy ('as few as possible')
# (nothing)  - exactly 1
# {n}        - exactly n
# {min,max}  - between min and max, inclusive
#              (if one border entry is missing, the interval is half-open)
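
# Example (sketch): greedy vs. non-greedy quantifiers
re.findall(r"\(.+\)",  "(1) and (2)")    # ['(1) and (2)'] - greedy: as much as possible
re.findall(r"\(.+?\)", "(1) and (2)")    # ['(1)', '(2)']  - non-greedy: as little as possible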

# Context-dependent matching
# --------------------------
# (?=...)    - Positive Lookahead Assertion (match only if ... matches afterwards)
# (?!...)    - Negative Lookahead Assertion (match only if ... doesn't match afterwards)
# (?<=...)   - Positive Lookbehind Assertion (match only if ... matches before)
# (?<!...)   - Negative Lookbehind Assertion (match only if ... doesn't match before)

string = "As the Bundesverfassungsgericht has recently emphasized..."
testpattern = r"[^\W]+?[Gg]ericht.*?(?=\s)"   # Checking your understanding
re.findall(testpattern, string);           # ['Bundesverfassungsgericht']
# Backslash Intricacies
# ---------------------

challenge = "\section{Title}"        # contains a literal backslash ('\s' is not a valid Python escape)

re.search('\section', challenge)     
# No match since \s is treated as a special character by the re module
re.search('\\section', challenge)    
# No match since Python turns '\\' into a single '\', so re again sees the special character \s
re.search('\\\\section', challenge)  
# Match in normal notation
re.search(r'\\section', challenge);  
# Raw String Notation (no special treatment of \ in Python)
# Other Text Munging
# ------------------

msg = "Hello, World!"

# Splitting text on matches
sep = r"\s"
re.split(sep, msg)                   # Returns a list of the split text parts

# Replacing matches
pat = "Hello"
repl = "Bonjour"
re.sub(pat, repl, msg)               # Returns a new String with matches replaced
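
# Example (sketch): substitution with a capturing group and a backreference
re.sub(r"Art\. (\d+)", r"Article \1", "See Art. 20 and Art. 21")
# 'See Article 20 and Article 21'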

# Entirely without regex
# ----------------------
# (to see more functions, press 'Tab' when the cursor is placed at the String variable)


# Simple splitting (only exact matches, no abstract patterns or special characters)
msg.split(" ")                               # equivalent to re.split(" ", msg)

# Testing the start or end of a String for matches
msg.endswith('!')                            # True
msg.startswith('Ola')                        # False

# Simple replacements
msg.replace('Hello', 'Bonjour')              # equivalent to re.sub(pat, repl, msg)

# Counting characters
msg.count('l')                               # 3
{char:msg.count(char) for char in set(msg)}; # Dict with chars and their frequencies
