regex.txt

Regular expressions

= a method of using a sequence of characters to define a search to match strings.
For use in "find and replace" operations.

For example:
- Find a particular word
- Find something that looks like a date
- Find something that looks like an email address or a phone number
etc.

Think about the wild card character you may be using in searches.
It exists in regular expressions too, with many more features to specify exactly what you are looking for.

Steps of a regular expression
- match on exact string of characters (text or number) or certain type of characters (eg upper case letters, digits, spaces)
- match patterns that repeat a given number of times
- capture (and possibly replace) the parts that match the pattern

xkcd - RegEx saves the day: https://xkcd.com/208/


1 - INTRO


Small demo
on https://regex101.com/


On text
1-888-924-8924
(888) 924 8924
888-924-8924
888 924 8924
8889248924

508-944-123
978-3-16-148410-0
Thomas (he/him)
regex101.com

We are looking for phone numbers. We could look for digits  \d or \d* , but there are times people write their number
with parentheses or dashes. Looking for parentheses  \(   isn't enough.

With regular expressions, we can start building queries that exactly match what we're looking for.

For example, I can type something like
\d to look for a digit (a number)
\d* for a digit repeated several times
\d{3} for a digit repeated exactly 3 times
\d{3}-\d{3}-\d{4} etc.

or
Thomas
or
a range of characters: [a-h]
[a-h]{2}
[a-z]{3}\)

Don't worry too much about writing any of this down, we'll go through all of them today.

What's interesting to note here, is that a regex is composed of

- literal characters
    character that mean what they usually mean - an a is the letter a etc.
- meta characters or tokens
    characters that have special meaning, e.g. "repeat this many times" or "any character within this range" or "any digit"
    Note that if you want meta characters to be literal, you have to ESCAPE them
    e.g. \)
    
    Confusingly, some meta characters need to be escaped to be treated as meta characters, e.g. \d ...
    
Please take a minute to locate the back slash on your keyboard, you're going to need it.


2 - SYNTAX ELEMENTS


Let's look at the quick reference on regex101.com/
It lists the tokens we can use

    [ABC] matches A or B or C.
    [A-Z] matches any upper case letter.
    
        To illustrate the difference, search for
            - ho
            - [ho]
            - [h-o]
    
    [A-Za-z] matches any upper or lower case letter.
    [A-Za-z0-9] matches any upper or lower case letter or any digit.

    . matches any character.
    \d matches any single digit.
    \w matches any part of word character (equivalent to [A-Za-z0-9]).
    \s matches any space, tab, or newline.
    \S matches anything that ISN'T a space, tab or newline
    \D matches anything that ISN'T a digit
    
    \ to escape e.g. \.com (because . is any character)
    
    
Then we have meta characters to specify boundaries

    ^ = start of the line anchor
    $ = end of the line anchor
    \b a word boundary
    
        e.g. mark will match
            market
            marketing
            remarkable
            mark
        vs \bmark
           \bmark\b


**** QUESTION *****************************

What will

^[Oo]rgani.e\b

match?

*******************************************


Explain the elements


https://regexper.com --> to visually explain what a regex does


Some more tokens (quantifier section on regex101)

* matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc.
+ matches the preceding element one or more times. For example, ab+c matches “abc”, “abbbc” but not “ac”.
? matches when the preceding character appears zero or one time.
{VALUE}

| = or
/i case-insensitive


**** EXERCICE ****************************************
Using pen and paper for now, try figuring out the 
following before testing them on regex101

1. What will the regular expression Fr[ea]nc[eh] match?

    French
    France
    Frence
    Franch

2. How do you match the whole words colour and color (case insensitive)?

    \b[Cc]olou?r\b|\bCOLOU?R\b
    /colou?r/i
    
3. How would you match the date format dd-MM-yyyy?

    \b\d{2}-\d{2}-\d{4}\b
    
4. How would you match publication formats such as British Library : London, 2015 and Manchester University Press: Manchester, 1999?

    .* ?: .*, \d{4}
    
*******************************************************


3 - MATCHING AND EXTRACTING STRINGS


We're going to use the Carpentries Code of Conduct as a sample text to work on.
Copy and paste https://github.com/LibraryCarpentry/lc-data-intro/blob/gh-pages/data/swcCoC.md into regex101

(not the actual CoC - we have made a few changes for this exercice)


- community
    we get 6 matches
    
- community (with a space at the end)
    why do we get fewer matches?
    excluding community-led
    
- Are we sure there are only 6 times community? What could we change?
    Does it match Community? How can we match it?
    
- If I shorten it to [Cc]ommuni  I get more matches, why?
    Matches communication, etc.
    

Let's try more complex stuff now.

**** EXERCICE 1 ****************************************
Using the CoC, see if you can extract all email addresses

Hint: go step by step
- start with what all emails have in common (@)
- think about what an email should always contain

Solution:
[\w.-]+@[\w.-]+\.[\w]{2,3}
(although there are now TLDs with more than 3 characters...)

*********************************************************


**** EXERCICE 2 ****************************************
Using the CoC, see if you can extract all phone numbers

- start without area code
- include area code, with dash, with parentheses
- try to add a country code (+1) as well
    Hint, country codes can be up to 3 digits

Solution:
(\+?\d{1,3}( |-))?\(?\d{3}\)?(\s|-)?\d{3}-\d{4}

*********************************************************


4 - QUIZ

If enough time, try going through https://librarycarpentry.org/lc-data-intro/03-quiz/index.html

- Check answers
- Note your correct answers

Are there any answers you don't understand?


APARTE - greedy vs lazy

Lazy expressions will match as few times as possible
Greedy expression will match as many times as possible

Example


<em>Hello World</em>

Greedy:
<.+>

Lazy:
<.+?>


5- CONCLUSION

Regex used in several programming langages and software.

In Google Sheets, can be used using the REGEXEXTRACT function

    If enough time, extra exercice
        - Download
        https://github.com/LibraryCarpentry/lc-data-intro/blob/gh-pages/files/PLS_FY17.zip (2017 public library survey)
        - Open it in Google Sheets
        - Create a new column and extract the latitude and longitude in the ADDRESS column using a regex
        
    Solution: =REGEXEXTRACT(G2,"\d+\.\d+, -?\d+\.\d+")
    
    
In OpenRefine, see the OR lesson


Some real life library examples
https://acrl.ala.org/techconnect/post/fear-no-longer-regular-expressions/


Library of common examples
- Check the library tab to the left of regex101
- O'Reilly: Regular Expressions Cookbook [show ebook]
- https://www.pgdp.net/wiki/Regex_Cookbook
- Google search


There are local variants of regex. If you are using regex within a programming language, eg python, perl, and it's not working,
look at the doc.
    Comparison of regex dialects: https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816
See the "flavor" tab at the left of regex101.com


Want to test yourself? https://regexcrossword.com/