Java Regular Expressions


Regular Expressions in Java

A regular expression, or regex, is a specific kind of text pattern that can be used in many advanced applications and from almost any programming language. Regular expressions are used for searching, editing, and manipulating data.

For example, using regular expressions we can verify whether an input string matches a given text pattern, or find a particular set of characters within a large body of text. They are also used to replace and rearrange blocks of text, or to split a big chunk of data into smaller pieces. Regular expressions are generally not language specific; they follow a similar syntax in most programming languages, with slight variations.

Regular expressions are powerful tools that can reduce the processing time of a job when your program needs to manipulate or extract text. Used skillfully, they let us perform many tasks that would otherwise be impractical.

Regular expressions are so useful in real-life computing that various systems and languages have evolved to provide both a basic and an extended standard for the grammar and syntax of modern regular expressions. Regular expression processors are also found in most search engines, in the search-and-replace dialogs of word processors and text editors, and in command-line utilities that process text input.

Here's a set of strings that have a few things in common:

  • A string
  • A longer string
  • A much longer string

Note that each of these strings begins with A and ends with string. The Java Regular Expressions API helps you pull out these elements, see the pattern among them, and do interesting things with the information you've gleaned. The Regular Expressions API has three core classes that you use almost all the time:

  • Pattern describes a string pattern.
  • Matcher tests a string to see if it matches the pattern.
  • PatternSyntaxException tells you that something wasn't acceptable about the pattern that you tried to define.

You'll begin working on a simple regular-expressions pattern that uses these classes shortly. But first, take a look at the regex pattern syntax.

Regex pattern syntax

A regex pattern describes the structure of the string that the expression tries to find in an input string. The pattern syntax can look strange to the uninitiated, but once you understand it, you'll find it easier to decipher. Table 1 lists some of the most common regex constructs that you use in pattern strings.

Table 1. Common regex constructs

Regex construct    What qualifies as a match
.                  Any character
?                  Zero (0) or one (1) of what came before
*                  Zero (0) or more of what came before
+                  One (1) or more of what came before
[]                 Any one of the characters or ranges listed inside the brackets
^                  Inside [], negation of whatever follows (that is, "not whatever")
\d                 Any digit (alternatively, [0-9])
\D                 Any nondigit (alternatively, [^0-9])
\s                 Any whitespace character (alternatively, [ \t\n\x0B\f\r])
\S                 Any nonwhitespace character (alternatively, [^\s])
\w                 Any word character (alternatively, [a-zA-Z_0-9])
\W                 Any nonword character (alternatively, [^\w])

The ?, *, and + constructs are called quantifiers, because they quantify what comes before them. Constructs like \d are predefined character classes. Any character that doesn't have special meaning in a pattern is a literal and matches itself.
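As a quick illustration of a few of these constructs, here is a minimal sketch. The class name ConstructsDemo and the helper fullMatch() are made up for this example; only Pattern.matches() comes from the Java API. Note that in Java source code each backslash in a pattern has to be doubled.

```java
import java.util.regex.Pattern;

final class ConstructsDemo {
    // Hypothetical helper: true when the WHOLE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("colou?r", "color")); // true: ? makes the u optional
        System.out.println(fullMatch("\\d+", "2024"));     // true: one or more digits
        System.out.println(fullMatch("\\w*", ""));         // true: * accepts zero characters
        System.out.println(fullMatch("\\s", "x"));         // false: x is not whitespace
        System.out.println(fullMatch("[^0-9]", "7"));      // false: ^ negates the class
    }
}
```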

Pattern matching

Armed with the pattern syntax in Table 1, you can work through the simple example in Listing 1, using the classes in the Java Regular Expressions API.

Listing 1. Pattern matching with regex
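import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Compile the pattern once, then test it against the whole input string
Pattern pattern = Pattern.compile("[Aa].*string");
Matcher matcher = pattern.matcher("A string");
boolean didMatch = matcher.matches();
Logger.getAnonymousLogger().info("Matched: " + didMatch);
// start() and end() are only valid after a successful match
int patternStartIndex = matcher.start();
Logger.getAnonymousLogger().info("Start: " + patternStartIndex);
int patternEndIndex = matcher.end();
Logger.getAnonymousLogger().info("End: " + patternEndIndex);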

First, Listing 1 creates a Pattern instance by calling compile(), a static method on Pattern, with a string literal representing the pattern you want to match. That literal uses the regex pattern syntax. In this example, the English translation of the pattern is: Find a string of the form A or a followed by zero or more characters, followed by string.

Methods for matching

Next, Listing 1 calls matcher() on Pattern. That call creates a Matcher instance for the input string you passed in. When you then call a matching method such as matches(), the Matcher searches that string for matches against the pattern you compiled. Every Java language string is an indexed collection of characters, starting with 0 and ending with the string length minus one. The Matcher scans the string, starting at 0, and looks for matches against the pattern. After that process is complete, the Matcher contains information about matches found (or not found) in the input string. You can access that information by calling various methods on Matcher:

  • matches() tells you if the entire input sequence was an exact match for the pattern.
  • start() tells you the index value in the string where the matched string starts.
  • end() tells you the index value in the string where the matched string ends, plus one.

Listing 1 finds a single match starting at 0 and ending at 7. Thus, the call to matches() returns true, the call to start() returns 0, and the call to end() returns 8.

lookingAt() versus matches()

If your string has more characters after the part that matches the pattern, you can use lookingAt() instead of matches(). The lookingAt() method checks whether the beginning of the input matches the pattern, without requiring the whole input to match. For example, consider the following string:

a string with more than just the pattern.

If you search this string for a.*string, you get a match if you use lookingAt(). But if you use matches(), it returns false, because there's more to the string than what's in the pattern.
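The difference can be sketched as follows. The class and method names (LookingAtDemo, wholeMatch, prefixMatch) are invented for this example; matches() and lookingAt() are the Matcher methods described above. The last line is an extra case, not from the text, showing that lookingAt() is anchored at the start of the input.

```java
import java.util.regex.Pattern;

final class LookingAtDemo {
    // True only if the entire input matches the pattern.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the input starts with something that matches the pattern.
    static boolean prefixMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        String input = "a string with more than just the pattern.";
        System.out.println(prefixMatch("a.*string", input));          // true
        System.out.println(wholeMatch("a.*string", input));           // false
        System.out.println(prefixMatch("a.*string", "not a string")); // false: must match at index 0
    }
}
```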

Complex patterns in regex

Simple searches are easy with the regex classes, but you can also do highly sophisticated things with the Regular Expressions API. Wikis, for example, rely heavily on regular expressions. Wiki content is based on string input from users, which is parsed and formatted using regular expressions. Any user can create a link to another topic in a wiki by entering a wiki word, which is typically a series of concatenated words, each of which begins with an uppercase letter, like this:

MyWikiWord

Suppose a user inputs the following string:

Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.

You could search for wiki words in this string with a regex pattern like this:

[A-Z][a-z]*([A-Z][a-z]*)+

And here's code to search for wiki words:

String input = "Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.";
Pattern pattern = Pattern.compile("[A-Z][a-z]*([A-Z][a-z]*)+");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
  Logger.getAnonymousLogger().info("Found this wiki word: " + matcher.group());
}

Run this code, and you can see the three wiki words in your console.

Replacing strings

Searching for matches is useful, but you also can manipulate strings after you find a match for them. You can do that by replacing matched strings with something else, just as you might search for text in a word-processing program and replace it with other text. Matcher has a couple of methods for replacing string elements:

  • replaceAll() replaces all matches with a specified string.
  • replaceFirst() replaces only the first match with a specified string.

Using Matcher's replace methods is straightforward:

String input = "Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.";
Pattern pattern = Pattern.compile("[A-Z][a-z]*([A-Z][a-z]*)+");
Matcher matcher = pattern.matcher(input);
Logger.getAnonymousLogger().info("Before: " + input);
String result = matcher.replaceAll("replacement");
Logger.getAnonymousLogger().info("After: " + result);

This code finds wiki words, as before. When the Matcher finds a match, it replaces the wiki word text with its replacement. When you run the code, you can see the following on your console:

Before: Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.
After: Here is a replacement followed by replacement, then replacement.

If you had used replaceFirst(), you would have seen this:

Before: Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.
After: Here is a replacement followed by AnotherWikiWord, then SomeWikiWord.
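For completeness, the replaceFirst() variant might look like the sketch below. The wrapper class ReplaceFirstDemo and its method name are hypothetical; the pattern and input are the ones used throughout this section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplaceFirstDemo {
    private static final Pattern WIKI_WORD =
            Pattern.compile("[A-Z][a-z]*([A-Z][a-z]*)+");

    // Replaces only the first wiki word found in the input.
    static String replaceFirstWikiWord(String input, String replacement) {
        Matcher matcher = WIKI_WORD.matcher(input);
        return matcher.replaceFirst(replacement);
    }

    public static void main(String[] args) {
        String input = "Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.";
        System.out.println("Before: " + input);
        System.out.println("After: " + replaceFirstWikiWord(input, "replacement"));
    }
}
```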

Matching and manipulating groups

When you search for matches against a regex pattern, you can get information about what you found. You've seen some of that capability with the start() and end() methods on Matcher. But it's also possible to reference matches by capturing groups. In each pattern, you typically create groups by enclosing parts of the pattern in parentheses. Groups are numbered from left to right, starting with 1 (group 0 represents the entire match). The code in Listing 2 replaces each wiki word with a string that "wraps" the word:

Listing 2. Matching groups
String input = "Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.";
Pattern pattern = Pattern.compile("[A-Z][a-z]*([A-Z][a-z]*)+");
Matcher matcher = pattern.matcher(input);
Logger.getAnonymousLogger().info("Before: " + input);
String result = matcher.replaceAll("blah$0blah");
Logger.getAnonymousLogger().info("After: " + result);

Run the Listing 2 code, and you get the following console output:

Before: Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.
After: Here is a blahWikiWordblah followed by blahAnotherWikiWordblah, then blahSomeWikiWordblah.

Listing 2 references the entire match by including $0 in the replacement string. Any portion of a replacement string of the form $int refers to the group identified by the integer (so $1 refers to group 1, and so on). In other words, $0 is equivalent to matcher.group(0);. You could accomplish the same replacement goal by using other methods. Rather than calling replaceAll(), you could do this:

StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
 matcher.appendReplacement(buffer, "blah$0blah");
}
matcher.appendTail(buffer);
Logger.getAnonymousLogger().info("After: " + buffer.toString());

And you'd get the same result:

Before: Here is a WikiWord followed by AnotherWikiWord, then SomeWikiWord.
After: Here is a blahWikiWordblah followed by blahAnotherWikiWordblah, then blahSomeWikiWordblah.
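The examples so far only use $0. To show numbered groups such as $1 and $2, here is a small sketch that swaps two captured words. The "last, first" name format, the class GroupReferenceDemo, and the method swapNames() are assumptions made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupReferenceDemo {
    // Group 1 captures the word before the comma, group 2 the word after it.
    private static final Pattern LAST_FIRST = Pattern.compile("(\\w+), (\\w+)");

    // Rewrites "Last, First" as "First Last" using group references.
    static String swapNames(String input) {
        return LAST_FIRST.matcher(input).replaceAll("$2 $1");
    }

    public static void main(String[] args) {
        Matcher matcher = LAST_FIRST.matcher("Doe, John");
        if (matcher.matches()) {
            System.out.println(matcher.group(0)); // Doe, John (the entire match)
            System.out.println(matcher.group(1)); // Doe
            System.out.println(matcher.group(2)); // John
        }
        System.out.println(swapNames("Doe, John")); // John Doe
    }
}
```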

Common Uses

Regular expressions are used in a wide variety of text processing tasks and, more generally, in string processing, where the data need not be textual. Common applications include data validation, data scraping, data wrangling, simple parsing, syntax highlighting, and many other tasks. While regular expressions would be useful in Internet search engines, processing them across an entire database could consume excessive computer resources, depending on the complexity and design of the regex. Some important uses of regular expressions are:

  • URL validation
  • Email validation
  • Validation of numbers, characters, and special characters
  • Internet address validation
  • Extracting information from text such as code, log files, spreadsheets, or documents
  • Searching and replacing strings
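Two of these uses, validation and extraction, can be sketched in a few lines. The patterns below are deliberately simple and are not production-grade validators; the class UsesDemo, its method names, and the sample log text are all invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UsesDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Validation: the whole input must consist of one or more digits.
    static boolean isUnsignedInteger(String input) {
        return DIGITS.matcher(input).matches();
    }

    // Extraction: collect every run of digits found anywhere in the text.
    static List<String> extractNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher matcher = DIGITS.matcher(text);
        while (matcher.find()) {
            numbers.add(matcher.group());
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(isUnsignedInteger("12345"));             // true
        System.out.println(isUnsignedInteger("12a45"));             // false
        System.out.println(extractNumbers("error 404 at line 17")); // [404, 17]
    }
}
```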

Note:

  • Keep in mind that when using regular expressions, everything is essentially a character, and we are writing patterns to match a specific sequence of characters (also known as a string).
  • Generally, patterns are written in ASCII, which includes letters, digits, punctuation, and other symbols on your keyboard like %#$@!
  • Unicode characters can be used to match any type of international text.

Understanding the Regex Engine

One of the most important things to do is to look at how a regex engine works internally, because knowing how the engine works will help us craft better regexes more easily. It will also help us quickly understand why a particular regex does not work the way it was initially expected to. This saves a lot of guesswork and head scratching when we need to write more complex regexes. There are two kinds of regular expression engines:

  • Text-directed engines
  • Regex-directed engines
  1. Almost all regex flavors are based on regex-directed engines. This is because certain very useful features, such as lazy quantifiers and back references, can only be implemented in regex-directed engines.
  2. You can easily find out whether the regex flavor you intend to use has a text-directed or regex-directed engine. If back references and/or lazy quantifiers are available, you can be certain the engine is regex-directed.
  3. You can do the test by applying the regex «text|text not» to the string “text not”. If the resulting match is only “text”, the engine is regex-directed. If the result is “text not”, then it is text-directed. The reason behind this is that the regex-directed engine is eager: it reports the first alternative that succeeds.
  4. Before looking into the examples provided, understanding how the regex engine works will enable you to use its full power and help you avoid common mistakes.
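The test from point 3 can be tried directly against java.util.regex. This is only a sketch; the class EngineKindDemo and the helper firstMatch() are names made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EngineKindDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // The first alternative that succeeds wins, so the shorter match is reported.
        System.out.println(firstMatch("text|text not", "text not")); // text
    }
}
```

Getting back only "text" is consistent with java.util.regex being a regex-directed engine, which also fits its support for back references and lazy quantifiers.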

The Regex-Directed Engine Always Returns the Leftmost Match

There are some important points to be noted when working with engines:

  • A regex-directed engine will always return the leftmost match, even if a more suitable match could be found later.
  • When applying a regex to a string, the engine will start at the first character of the string. It will try all possible variations of the regular expression at the first character. Only if all possibilities have been tried and found to fail will the engine continue with the second character in the text. Otherwise it will stop there.
  • Again, it will try all possible variations of the regex, in exactly the same order, until it finds a match.
  • The result is that the regex-directed engine will return the leftmost match.

Let us consider an example: applying the regex «book» to the string “He bought a bookshelf for his book.” The engine tries to match the first token in the regex, «b», to the first character in the string, “H”. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the «b» with the “e”. This fails too, as does matching the «b» with the space. Arriving at the 4th character in the string, «b» matches “b”. The engine then tries to match the second token, «o», to the 5th character, “o”. This succeeds too. But then the third token, «o», fails to match the 6th character, “u”. At that point, the engine knows the regex cannot be matched starting at the 4th character in the string. So it continues with the 5th: “o”. Again, «b» fails to match here and the engine carries on. At the 13th character in the string, «b» again matches “b”.

The engine then proceeds to attempt to match the remainder of the regex starting at the 13th character and finds that «o» matches “o”, the next «o» matches “o”, and «k» matches “k”. The complete regex could be matched starting at character 13. The engine is eager to report a match. It will therefore report the first four letters of “bookshelf” as a valid match. The engine never proceeds beyond this point to see if there are any “better” matches, such as the standalone word “book” at the end of the sentence. In this first example of the engine’s internals, our regex engine simply appears to work like a regular text search routine.

A text-directed engine would have returned the same result too. However, it is important that you can follow the steps the engine takes in your mind. In the following examples, the way the engine works will have a profound impact on the matches it finds. Some of the results may be surprising, but they are always logical and predetermined once you know how the engine works.
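The walkthrough above can be checked with a short sketch. Keep in mind that Java indexes are zero-based, so the 13th character is index 12. The class LeftmostMatchDemo and the helper firstMatchStart() are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LeftmostMatchDemo {
    // Returns the zero-based start index of the first match, or -1 if there is none.
    static int firstMatchStart(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        String input = "He bought a bookshelf for his book.";
        Matcher matcher = Pattern.compile("book").matcher(input);
        while (matcher.find()) {
            // First iteration: start=12, end=16 (inside "bookshelf").
            // Second iteration: start=30, end=34 (the standalone "book").
            System.out.println("start=" + matcher.start() + ", end=" + matcher.end());
        }
    }
}
```

The first call to find() reports the match inside "bookshelf"; the standalone word is only reached by calling find() again.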

Algorithms Used in Regex

When using regex, there are at least three different algorithms that decide whether and how a given regular expression matches a string.

Converting NFA to DFA

This is the first and quickest method. It is based on a result in formal language theory that allows every non-deterministic finite automaton (NFA) to be transformed into a deterministic finite automaton (DFA). The DFA can be constructed explicitly and then run on the input string one symbol at a time.
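To make "run on the input string one symbol at a time" concrete, here is a minimal sketch of an explicit DFA for the small expression ab*c. The automaton is written by hand for illustration rather than derived from an NFA, and it is not how java.util.regex works internally; the class name TinyDfa and the state numbering are assumptions of this example.

```java
final class TinyDfa {
    // States for the expression ab*c:
    //   0 = start, 1 = seen 'a' (loops on 'b'), 2 = accepting, -1 = dead
    static boolean accepts(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case 0:
                    state = (c == 'a') ? 1 : -1;
                    break;
                case 1:
                    state = (c == 'b') ? 1 : (c == 'c') ? 2 : -1;
                    break;
                default:
                    state = -1; // nothing may follow the accepting state
            }
            if (state == -1) {
                return false; // dead state: no later symbol can rescue the match
            }
        }
        return state == 2;
    }

    public static void main(String[] args) {
        System.out.println(accepts("abbbc")); // true
        System.out.println(accepts("ac"));    // true
        System.out.println(accepts("abca"));  // false
    }
}
```

Each input symbol causes exactly one state transition, which is why a DFA runs in time proportional to the length of the input.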

Simulating NFA Directly (DFA/NFA Algorithm)

The next method is to simulate the NFA directly, constructing each DFA state on demand and then discarding it at the next step. This keeps the DFA implicit and avoids the exponential construction cost, but the running cost rises to O(mn), where m is the size of the expression and n is the length of the input. The explicit and implicit approaches are called the DFA algorithm and the NFA algorithm, respectively. Adding caching to the NFA algorithm is often called the "lazy DFA" algorithm, or just the DFA algorithm without making a distinction.

Backtracking

The third and final algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing at times. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification. This kind of expression can force the algorithm to consider an exponentially increasing number of sub-cases. It can also lead to a security problem called Regular expression Denial of Service (ReDoS).
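The following sketch lets you try the expression (a|aa)*b against inputs made only of the letter a, which can never match because the final b is missing. No timings are shown, since they depend on the JDK version and the machine, and a given engine may optimize this particular case. The class BacktrackingDemo and its helpers are hypothetical names.

```java
import java.util.regex.Pattern;

final class BacktrackingDemo {
    // The expression from the text: alternation inside an unbounded repetition.
    private static final Pattern RISKY = Pattern.compile("(a|aa)*b");

    // Builds a string of n 'a' characters with no trailing 'b'.
    static String onlyAs(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.toString();
    }

    static boolean matches(String input) {
        return RISKY.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Keep n small: on a backtracking engine the work may grow very quickly.
        for (int n = 10; n <= 25; n += 5) {
            long t0 = System.nanoTime();
            boolean matched = matches(onlyAs(n));
            long micros = (System.nanoTime() - t0) / 1_000;
            System.out.println("n=" + n + " matched=" + matched + " time=" + micros + "us");
        }
    }
}
```

The input length is kept small on purpose; when patterns like this are applied to untrusted input, the length of that input is worth bounding.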

Common Regex Syntaxes

A regular expression is, at bottom, a string pattern that describes text. These descriptions can be applied in several ways. The basic language constructs include character classes, quantifiers, and metacharacters. The section below explains the various options we can use to define a regular expression.

  • String Literals

String literals are used to search for an exact match in the text. For example, to search for the text “test”, we can simply write the code below. Here the text and the regex are the same.

Pattern.matches("test", "test")
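One detail worth keeping in mind is that Pattern.matches() requires the whole text to match the literal. The sketch below contrasts it with Matcher.find(), which looks for the literal anywhere in the text. The class LiteralMatchDemo and its two helper methods are invented for this example.

```java
import java.util.regex.Pattern;

final class LiteralMatchDemo {
    // True only if the text is exactly the literal.
    static boolean exact(String literal, String text) {
        return Pattern.matches(literal, text);
    }

    // True if the literal occurs anywhere inside the text.
    static boolean contains(String literal, String text) {
        return Pattern.compile(literal).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(exact("test", "test"));        // true
        System.out.println(exact("test", "a test"));      // false: extra characters
        System.out.println(contains("test", "a test"));   // true
    }
}
```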

  • Character Classes

A character class matches a single character in the input text against a set of allowed characters. A character class has no relation to the class construct or to class files in Java. Examples:

  1. [Tt]est matches every occurrence of the string “test” that begins with either a lowercase or an uppercase “T”.
  2. The string "A@BAND@YEA@U" matches the pattern "[ABC]@." twice, even though the string contains three @ signs.
  3. The second @ is not part of any match, because it is preceded by D and not by A, B, or C.
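The second and third points can be verified with a short sketch that lists every match. The class CharClassCount and the helper findAll() are names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharClassCount {
    // Collects every non-overlapping match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The @ after D is skipped because D is not in [ABC].
        System.out.println(findAll("[ABC]@.", "A@BAND@YEA@U")); // [A@B, A@U]
    }
}
```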

A few more samples:

Pattern.matches("[pqr]", "abcd"); returns false, because the text is not a single p, q, or r.
Pattern.matches("[pqr]", "r"); returns true, because r is found.
Pattern.matches("[pqr]", "pq"); returns false, because the class matches exactly one of the characters, not both.

The metacharacters [ and ] (left and right brackets) are used to specify a character class inside a regular expression. Sometimes we want to limit the characters that produce matches to a specific set of characters. Here is a list of the various character class constructs:

  1. Simple: consists of a group of characters placed side by side and matches only those characters.

Example: [abc] matches characters a, b, and c.

Let's take a look at the following example:

[csw] cave matches c in [csw] with c in cave.

  2. Negation: starts with the ^ metacharacter and matches only those characters not in that class.

Example: [^abc] matches all characters except a, b, and c.

Let's see a second example:

[^csw] cave matches a, v, and e with their counterparts in cave.

  3. Range: consists of all characters starting with the character on the left of a hyphen metacharacter (-) and finishing with the character on the right of the hyphen, matching only those characters in that range.

Example: [a-z] matches all lowercase alphabetic characters.

Let's see another example:

[a-c] clown matches c in [a-c] with c in clown.

  4. Union: consists of multiple nested character classes and matches all characters that belong to the resulting union.

Example: [a-d[m-p]] matches characters a through d and m through p.

Let's see a second example:

[ab[c-e]] abcdef matches a, b, c, d, and e with their counterparts in abcdef.

  5. Intersection: consists of the characters common to all nested classes and matches only those common characters.

Example: [a-z&&[d-f]] matches characters d, e, and f.

Another example:

[aeiouy&&[y]] party matches y in [aeiouy&&[y]] with y in party.

  6. Subtraction: consists of all characters except those indicated in nested negation character classes and matches the remaining characters.

Example: [a-z&&[^m-p]] matches characters a through l and q through z.

Here is a second example:

[a-f&&[^a-c]&&[^e]] abcdefg matches d and f with their counterparts in abcdefg.
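The character class examples above can be tried out with the sketch below, which collects every character of the input that the class matches. The class CharClassOps and the helper matchedChars() are hypothetical names; the patterns and inputs mirror the examples in this list, apart from the "lmnopq" input, which is added here for the subtraction case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharClassOps {
    // Concatenates every match of the (single-character) class found in the input.
    static String matchedChars(String charClass, String input) {
        StringBuilder found = new StringBuilder();
        Matcher matcher = Pattern.compile(charClass).matcher(input);
        while (matcher.find()) {
            found.append(matcher.group());
        }
        return found.toString();
    }

    public static void main(String[] args) {
        System.out.println(matchedChars("[csw]", "cave"));                   // c   (simple)
        System.out.println(matchedChars("[^csw]", "cave"));                  // ave (negation)
        System.out.println(matchedChars("[a-c]", "clown"));                  // c   (range)
        System.out.println(matchedChars("[ab[c-e]]", "abcdef"));             // abcde (union)
        System.out.println(matchedChars("[aeiouy&&[y]]", "party"));          // y   (intersection)
        System.out.println(matchedChars("[a-z&&[^m-p]]", "lmnopq"));         // lq  (subtraction)
        System.out.println(matchedChars("[a-f&&[^a-c]&&[^e]]", "abcdefg"));  // df  (subtraction)
    }
}
```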


About Author

TekSlate