Regular expressions
= a method of using a sequence of characters to define a search to match strings.
For use in "find and replace" operations.
For example:
- Find a particular word
- Find something that looks like a date
- Find something that looks like an email address or a phone number
etc.
Think about the wild card character you may be using in searches.
It exists in regular expressions too, with many more features to specify exactly what you are looking for.
Steps of a regular expression
- match on exact string of characters (text or number) or certain type of characters (eg upper case letters, digits, spaces)
- match patterns that repeat a given number of times
- capture (and possibly replace) the parts that match the pattern
xkcd - RegEx saves the day: https://xkcd.com/208/
1 - INTRO
Small demo
on https://regex101.com/
On text
1-888-924-8924
(888) 924 8924
888-924-8924
888 924 8924
8889248924
508-944-123
978-3-16-148410-0
Thomas (he/him)
regex101.com
We are looking for phone numbers. We could look for digits with \d or \d*, but people often write their number
with parentheses or dashes. Looking for parentheses with \( isn't enough either.
With regular expressions, we can start building queries that exactly match what we're looking for.
For example, I can type something like
\d to look for a digit (a number)
\d* for a digit repeated zero or more times (\d+ for one or more times)
\d{3} for a digit repeated exactly 3 times
\d{3}-\d{3}-\d{4} etc.
or
Thomas
or
a range of characters: [a-h]
[a-h]{2}
[a-z]{3}\)
Don't worry too much about writing any of this down, we'll go through all of them today.
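The same patterns can also be tried outside regex101. This is a minimal sketch, assuming Python and its built-in re module (neither is part of this lesson); the sample text is the demo text above.

```python
import re

# The demo text from the start of this section.
sample = """1-888-924-8924
(888) 924 8924
888-924-8924
888 924 8924
8889248924
508-944-123
978-3-16-148410-0
Thomas (he/him)
regex101.com"""

# Three digits, a dash, three digits, a dash, four digits.
dashed_numbers = re.findall(r"\d{3}-\d{3}-\d{4}", sample)
print(dashed_numbers)

# Two consecutive letters in the range a to h.
letter_pairs = re.findall(r"[a-h]{2}", "Thomas (he/him)")
print(letter_pairs)

# Three lower case letters followed by a literal closing parenthesis.
before_paren = re.findall(r"[a-z]{3}\)", "Thomas (he/him)")
print(before_paren)
```

The dashed pattern only finds the numbers written with dashes; the other ways of writing a phone number need a more flexible pattern.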
Note that a regex is composed of
- literal characters
characters that mean what they usually mean - an a is the letter a etc.
- meta characters or tokens
characters that have special meaning, e.g. "repeat this many times" or "any character within this range" or "any digit"
Note that if you want meta characters to be literal, you have to ESCAPE them
e.g. \)
Confusingly, some meta characters need to be escaped to be treated as meta characters, e.g. \d ...
Please take a minute to locate the backslash on your keyboard, you're going to need it.
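To see what escaping changes, here is a small sketch, again assuming Python's re module. The second string in the test text (regex101xcom) is made up for illustration.

```python
import re

text = "regex101.com regex101xcom"

# Unescaped, the dot is a meta character: it matches any character.
unescaped = re.findall(r"regex101.com", text)

# Escaped, the dot is a literal full stop.
escaped = re.findall(r"regex101\.com", text)

# re.escape() adds the backslashes needed to treat a whole string literally.
literal_pattern = re.escape("(he/him)")
found = re.search(literal_pattern, "Thomas (he/him)")

print(unescaped)
print(escaped)
print(found.group())
```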
2 - SYNTAX ELEMENTS
Let's look at the quick reference on regex101.com/
It lists the tokens we can use
[ABC] matches A or B or C.
[A-Z] matches any upper case letter.
To illustrate the difference, search for
- ho
- [ho]
- [h-o]
[A-Za-z] matches any upper or lower case letter.
[A-Za-z0-9] matches any upper or lower case letter or any digit.
. matches any character.
\d matches any single digit.
\w matches any word character: a letter, a digit or an underscore (equivalent to [A-Za-z0-9_]).
\s matches any space, tab, or newline.
\S matches anything that ISN'T a space, tab or newline
\D matches anything that ISN'T a digit
\ to escape e.g. \.com (because . is any character)
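The difference between ho, [ho] and [h-o] can be checked with a short sketch, assuming Python's re module; the test string "hold on" is made up. Note that in Python \w also matches accented letters unless the re.ASCII flag is set.

```python
import re

text = "hold on"

# The letter h followed by the letter o.
sequence = re.findall(r"ho", text)

# Any single character that is h or o.
either_letter = re.findall(r"[ho]", text)

# Any single character in the range h to o.
letter_range = re.findall(r"[h-o]", text)

# \w matches the underscore as well; [A-Za-z0-9] does not.
word_chars = re.findall(r"\w", "a_1 -")
alnum_only = re.findall(r"[A-Za-z0-9]", "a_1 -")

print(sequence, either_letter, letter_range)
print(word_chars, alnum_only)
```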
Then we have meta characters to specify boundaries
^ = start of the line anchor
$ = end of the line anchor
\b a word boundary
e.g. mark will match
market
marketing
remarkable
mark
vs \bmark
\bmark\b
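The effect of the word boundary can be counted with a short sketch, assuming Python's re module, on the four words listed above.

```python
import re

words = "market marketing remarkable mark"

# mark anywhere, including inside remarkable.
anywhere = re.findall(r"mark", words)

# mark only at the start of a word.
word_start = re.findall(r"\bmark", words)

# mark only as a whole word.
whole_word = re.findall(r"\bmark\b", words)

print(len(anywhere), len(word_start), len(whole_word))
```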
**** QUESTION *****************************
What will
^[Oo]rgani.e\b
match?
*******************************************
Explain the elements
https://regexper.com --> to visually explain what a regex does
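One way to check the answer to the question is a short sketch, assuming Python's re module; the four test lines are made up for illustration. The re.MULTILINE flag makes ^ match at the start of every line rather than only at the start of the text.

```python
import re

lines = """Organise the event
organize it soon
Organised last year
We organise events"""

# ^        start of a line
# [Oo]     upper or lower case o
# rgani    literal characters
# .        any single character (s, z, ...)
# e        literal e
# \b       the word must end here
matches = re.findall(r"^[Oo]rgani.e\b", lines, flags=re.MULTILINE)
print(matches)
```

"Organised" is rejected by \b because the word continues after the e, and the last line is rejected by ^ because the word is not at the start of the line.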
Some more tokens (quantifier section on regex101)
* matches the preceding element zero or more times. For example, ab*c matches “ac”, “abc”, “abbbc”, etc.
+ matches the preceding element one or more times. For example, ab+c matches “abc”, “abbbc” but not “ac”.
? matches when the preceding character appears zero or one time.
{VALUE} matches the preceding element exactly VALUE times. {MIN,MAX} matches it between MIN and MAX times.
| = or
/i case-insensitive
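The quantifiers above can be tried in a short sketch, assuming Python's re module; the test strings other than ac, abc and abbbc are made up. Python has no /.../i notation: the equivalent of /i is the re.IGNORECASE flag.

```python
import re

candidates = ["ac", "abc", "abbbc"]

# * : zero or more b, so all three candidates match.
star = [s for s in candidates if re.fullmatch(r"ab*c", s)]

# + : one or more b, so "ac" is rejected.
plus = [s for s in candidates if re.fullmatch(r"ab+c", s)]

# ? : zero or one u.
optional = re.findall(r"colou?r", "color colour colouur")

# {3} : exactly three digits in a row.
exact = re.findall(r"\d{3}", "12 345 6789")

# | : one alternative or the other.
either = re.findall(r"cat|dog", "cat, dog, bird")

# Case-insensitive matching.
ignore_case = re.findall(r"colou?r", "Color COLOUR", flags=re.IGNORECASE)

print(star, plus, optional, exact, either, ignore_case)
```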
**** EXERCISE ****************************************
Using pen and paper for now, try figuring out the
following before testing them on regex101
1. What will the regular expression Fr[ea]nc[eh] match?
French
France
Frence
Franch
2. How do you match the whole words colour and color (case insensitive)?
\b[Cc]olou?r\b|\bCOLOU?R\b
/\bcolou?r\b/i
3. How would you match the date format dd-MM-yyyy?
\b\d{2}-\d{2}-\d{4}\b
4. How would you match publication formats such as British Library : London, 2015 and Manchester University Press: Manchester, 1999?
.* ?: .*, \d{4}
*******************************************************
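After the pen and paper attempt, the answers can be checked with a short sketch, assuming Python's re module. The test sentences for questions 2 and 3 are made up for illustration.

```python
import re

# 1. Each character class accepts either letter, so all four words match.
words = ["French", "France", "Frence", "Franch"]
q1 = [w for w in words if re.fullmatch(r"Fr[ea]nc[eh]", w)]

# 2. Whole words only, any case: colourful is not matched.
q2 = re.findall(r"\bcolou?r\b", "Colour, COLOR and colourful",
                flags=re.IGNORECASE)

# 3. dd-MM-yyyy: a date that starts with the year is not matched.
q3 = re.findall(r"\b\d{2}-\d{2}-\d{4}\b", "Due 25-12-2023, not 2023-12-25")

# 4. Publisher, optional space, colon, place, comma, four-digit year.
publications = [
    "British Library : London, 2015",
    "Manchester University Press: Manchester, 1999",
]
q4 = [p for p in publications if re.fullmatch(r".* ?: .*, \d{4}", p)]

print(q1, q2, q3, q4, sep="\n")
```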
3 - MATCHING AND EXTRACTING STRINGS
We're going to use the Carpentries Code of Conduct as a sample text to work on.
Copy and paste https://github.com/LibraryCarpentry/lc-data-intro/blob/gh-pages/data/swcCoC.md into regex101
(not the actual CoC - we have made a few changes for this exercise)
- community
we get 6 matches
- community (with a space at the end)
why do we get fewer matches?
excluding community-led
- Are we sure there are only 6 times community? What could we change?
Does it match Community? How can we match it?
- If I shorten it to [Cc]ommuni I get more matches, why?
Matches communication, etc.
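The same steps can be reproduced on a short sentence. This sketch assumes Python's re module, and the sentence is made up for illustration (it is not taken from the CoC), so the counts differ from the ones on regex101.

```python
import re

text = "Our community is community-led. Community members value communication."

# Exact lower case string, also found inside community-led.
plain = re.findall(r"community", text)

# A trailing space excludes community-led.
with_space = re.findall(r"community ", text)

# A character class accepts both Community and community.
any_case = re.findall(r"[Cc]ommunity", text)

# A shorter pattern also matches the start of communication.
shortened = re.findall(r"[Cc]ommuni", text)

print(len(plain), len(with_space), len(any_case), len(shortened))
```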
Let's try more complex stuff now.
**** EXERCISE 1 ****************************************
Using the CoC, see if you can extract all email addresses
Hint: go step by step
- start with what all emails have in common (@)
- think about what an email should always contain
Solution:
[\w.-]+@[\w.-]+\.[\w]{2,3}
(although there are now TLDs with more than 3 characters...)
*********************************************************
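The email solution and its limit on long TLDs can be seen in a short sketch, assuming Python's re module. The addresses are made up for illustration, and the variant with {2,} is only one possible adjustment, not part of the lesson's solution.

```python
import re

text = "Contact admin@example.org, jane.doe@lib.example.ac.uk or info@example.museum"

# The solution from the exercise: the last part is 2 or 3 word characters.
solution = re.findall(r"[\w.-]+@[\w.-]+\.[\w]{2,3}", text)

# Assumed variant: allow 2 or more characters in the last part.
longer_tld = re.findall(r"[\w.-]+@[\w.-]+\.\w{2,}", text)

print(solution)
print(longer_tld)
```

With {2,3} the last address is cut off after three letters of the TLD; with {2,} it is captured whole.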
**** EXERCISE 2 ****************************************
Using the CoC, see if you can extract all phone numbers
- start without area code
- include area code, with dash, with parentheses
- try to add a country code (+1) as well
Hint, country codes can be up to 3 digits
Solution:
(\+?\d{1,3}( |-))?\(?\d{3}\)?(\s|-)?\d{3}-\d{4}
*********************************************************
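The phone number solution can be tested line by line in a short sketch, assuming Python's re module. The candidate numbers are variations on the demo numbers from the intro, and the two with a country code or parentheses plus a dash are made up.

```python
import re

# The solution from the exercise, unchanged.
phone = re.compile(r"(\+?\d{1,3}( |-))?\(?\d{3}\)?(\s|-)?\d{3}-\d{4}")

candidates = [
    "1-888-924-8924",
    "+1 888-924-8924",
    "(888) 924-8924",
    "888-924-8924",
    "(888) 924 8924",
    "888 924 8924",
    "8889248924",
]

# The pattern contains groups, so use search() and group(0)
# to get the whole match rather than the group contents.
results = {}
for number in candidates:
    match = phone.search(number)
    results[number] = match.group(0) if match else None

for number, found in results.items():
    print(f"{number!r:20} -> {found!r}")
```

The solution requires a dash before the last four digits, so numbers written only with spaces, or with no separator at all, are not matched.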
4 - QUIZ
If enough time, try going through https://librarycarpentry.org/lc-data-intro/03-quiz/index.html
- Check answers
- Note your correct answers
Are there any answers you don't understand?
ASIDE - greedy vs lazy
Lazy expressions will match as few times as possible
Greedy expressions will match as many times as possible
Quantifiers are greedy by default; adding ? after a quantifier makes it lazy.
Example
<em>Hello World</em>
Greedy:
<.+>
Lazy:
<.+?>
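The difference shows up clearly when both patterns are run on the example. A minimal sketch, assuming Python's re module:

```python
import re

html = "<em>Hello World</em>"

# Greedy: .+ runs as far as it can, up to the last >.
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first > it can.
lazy = re.findall(r"<.+?>", html)

print(greedy)
print(lazy)
```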
5- CONCLUSION
Regex is used in many programming languages and software tools.
In Google Sheets, it is available through the REGEXEXTRACT function
If there is enough time, extra exercise
- Download
https://github.com/LibraryCarpentry/lc-data-intro/blob/gh-pages/files/PLS_FY17.zip (2017 public library survey)
- Open it in Google Sheets
- Create a new column and extract the latitude and longitude from the ADDRESS column using a regex
Solution: =REGEXEXTRACT(G2,"\d+\.\d+, -?\d+\.\d+")
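The same pattern works outside Google Sheets. This is a minimal sketch, assuming Python's re module; the address string is invented for illustration and may not reflect the exact layout of the ADDRESS column in the survey file.

```python
import re

# Hypothetical cell value, not copied from the survey data.
address = "123 MAIN ST, ANYTOWN, AK 99501 (61.2181, -149.9003)"

# Same pattern as in the REGEXEXTRACT solution:
# decimal number, comma and space, optional minus sign, decimal number.
match = re.search(r"\d+\.\d+, -?\d+\.\d+", address)
coordinates = match.group(0) if match else None
print(coordinates)
```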
For OpenRefine, see the OpenRefine lesson
Some real life library examples
https://acrl.ala.org/techconnect/post/fear-no-longer-regular-expressions/
Library of common examples
- Check the library tab to the left of regex101
- O'Reilly: Regular Expressions Cookbook [show ebook]
- https://www.pgdp.net/wiki/Regex_Cookbook
- Google search
There are local variants (dialects) of regex. If you are using regex within a programming language, e.g. Python or Perl, and it's not working,
check that language's documentation.
Comparison of regex dialects: https://gist.github.com/CMCDragonkai/6c933f4a7d713ef712145c5eb94a1816
See the "flavor" tab at the left of regex101.com
Want to test yourself? https://regexcrossword.com/