joemccray

Regex Class

Feb 22nd, 2017
  1. #######################
  2. # VMs for this course #
  3. #######################
  4. https://s3.amazonaws.com/infosecaddictsvirtualmachines/Win7x64.zip
  5. username: workshop
  6. password: password
  7.  
  8. https://s3.amazonaws.com/infosecaddictsvirtualmachines/InfoSecAddictsVM.zip
  9. user: infosecaddicts
  10. pass: infosecaddicts
  11.  
  12. You don't have to, but you can do the updates in the Win7 VM (yes, it is a lot of updates).
  13.  
You'll need to create a directory in the Win7 VM called "c:\ps".
  15.  
  16. In this file you will also need to change the text '192.168.200.144' to the IP address of your Ubuntu host.
  17.  
  18.  
  19.  
  20.  
  21.  
  22. ##############################################
  23. # Log Analysis with Linux command-line tools #
  24. ##############################################
The following command-line executables are found on macOS as well as most Linux distributions.
  26.  
  27. cat – prints the content of a file in the terminal window
  28. grep – searches and filters based on patterns
  29. awk – can sort each row into fields and display only what is needed
  30. sed – performs find and replace functions
  31. sort – arranges output in an order
  32. uniq – compares adjacent lines and can report, filter or provide a count of duplicates
  33.  
  34.  
  35. ##############
  36. # Cisco Logs #
  37. ##############
  38.  
  39. wget https://s3.amazonaws.com/infosecaddictsfiles/cisco.log
  40.  
  41.  
  42. AWK Basics
  43. ----------
  44. To quickly demonstrate the print feature in awk, we can instruct it to show only the 5th word of each line. Here we will print $5. Only the last 4 lines are being shown for brevity.
  45.  
  46. cat cisco.log | awk '{print $5}' | tail -n 4
  47.  
  48.  
  49.  
  50.  
Looking at a large file would still produce a large amount of output. A more useful thing to do might be to output every entry found in "$5", group them together, count them, then sort them from the greatest to least number of occurrences. This can be done by piping the output through "sort", using "uniq -c" to count the like entries, then using "sort -rn" to sort it in reverse order.
  52.  
  53. cat cisco.log | awk '{print $5}'| sort | uniq -c | sort -rn
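
For comparison, the same group, count, and sort idea can be sketched in Python with collections.Counter. This is only an illustration: the sample lines are made up (they are not taken from cisco.log), and count_field is a hypothetical helper that assumes whitespace-separated fields like the ones awk sees.

```python
from collections import Counter

# Hypothetical sample lines, invented for illustration (not from cisco.log).
sample_lines = [
    "Feb 22 10:00:01 router1 %LINEPROTO-5-UPDOWN: Line protocol changed",
    "Feb 22 10:00:02 router1 %SYS-5-CONFIG_I: Configured from console",
    "Feb 22 10:00:03 router1 %LINEPROTO-5-UPDOWN: Line protocol changed",
    "short line",
]

def count_field(lines, field_number):
    """Count the values of a 1-based whitespace-separated field, like awk's $N."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Unlike awk, which prints an empty line, skip lines that are too short.
        if len(fields) >= field_number:
            counts[fields[field_number - 1]] += 1
    return counts

field_counts = count_field(sample_lines, 5)

# most_common() plays the role of "sort | uniq -c | sort -rn".
for value, count in field_counts.most_common():
    print("%7d %s" % (count, value))
```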
  54.  
  55.  
  56.  
  57.  
While that's sort of cool, it is obvious that we have some garbage in our output. Evidently we have a few lines that aren't conforming to the output we expect to see in $5. We can insert grep to filter the file prior to feeding it to awk. This ensures that we are at least looking at lines of text that contain "facility-level-mnemonic". The pattern is quoted so the shell does not try to expand the brackets and asterisks.

cat cisco.log | grep '%[a-zA-Z]*-[0-9]-[a-zA-Z]*' | awk '{print $5}' | sort | uniq -c | sort -rn
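
To see what that pattern actually matches, here is a small Python sketch that applies the same regular expression with the re module. The two input lines are invented for illustration and are not from cisco.log.

```python
import re

# Same pattern as the grep command: %FACILITY-LEVEL-MNEMONIC
pattern = re.compile(r'%[a-zA-Z]*-[0-9]-[a-zA-Z]*')

# Hypothetical lines, invented for illustration.
lines = [
    "Feb 22 10:00:01 router1 %LINEPROTO-5-UPDOWN: Line protocol on Interface changed",
    "this line is garbage",
]

# Keep only the lines that contain the pattern (what grep does).
matching = [line for line in lines if pattern.search(line)]

# Pull out just the matched text from each kept line.
tags = [pattern.search(line).group(0) for line in matching]
print(tags)
```

Note that the match stops before the colon, because ":" is not part of the character class.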
  61.  
  62.  
  63.  
  64.  
  65.  
  66. Now that the output is cleaned up a bit, it is a good time to investigate some of the entries that appear most often. One way to see all occurrences is to use grep.
  67.  
  68. cat cisco.log | grep %LINEPROTO-5-UPDOWN:
  69.  
  70. cat cisco.log | grep %LINEPROTO-5-UPDOWN:| awk '{print $10}' | sort | uniq -c | sort -rn
  71.  
  72. cat cisco.log | grep %LINEPROTO-5-UPDOWN:| sed 's/,//g' | awk '{print $10}' | sort | uniq -c | sort -rn
  73.  
  74. cat cisco.log | grep %LINEPROTO-5-UPDOWN:| sed 's/,//g' | awk '{print $10 " changed to " $14}' | sort | uniq -c | sort -rn
  75.  
  76.  
  77.  
  78.  
  79. #########
  80. # EGrep #
  81. #########
  82.  
  83.  
  84.  
  85.  
  86.  
  87. #####################
  88. # Powershell Basics #
  89. #####################
  90.  
PowerShell is Microsoft's command shell and scripting language. It has shipped with Windows since Windows 7 and Windows Server 2008 R2, and was available as a separate download for earlier releases such as Vista.

PowerShell script files end in .ps1.

An important note is that, by default, you cannot double-click a PowerShell script to execute it.
  96.  
  97. To open a PowerShell command prompt either hit Windows Key + R and type in PowerShell or Start -> All Programs -> Accessories -> Windows PowerShell -> Windows PowerShell.
  98.  
  99. dir
  100. cd
  101. ls
  102. cd c:\
  103.  
  104.  
  105. To obtain a list of cmdlets, use the Get-Command cmdlet
  106.  
  107. Get-Command
  108.  
  109.  
  110.  
  111. You can use the Get-Alias cmdlet to see a full list of aliased commands.
  112.  
  113. Get-Alias
  114.  
  115.  
  116.  
Don't worry, you won't blow up your machine with PowerShell.
Get-Process | Stop-Process          # Don't press [ENTER]! What would this command do?
Get-Process | Stop-Process -WhatIf  # -WhatIf shows what would happen without doing it
  120.  
  121.  
  122. To get help with a cmdlet, use the Get-Help cmdlet along with the cmdlet you want information about.
  123.  
  124. Get-Help Get-Command
  125.  
Get-Help Get-Service -Online

Get-Service -Name TermService, Spooler

Get-Service -N BITS
  131.  
  132.  
  133.  
PowerShell variables begin with the $ symbol. First, let's create a variable:

$serv = Get-Service -N Spooler
  137.  
  138. To see the value of a variable you can just call it in the terminal.
  139.  
  140. $serv
  141.  
  142. $serv.gettype().fullname
  143.  
  144.  
Get-Member is another extremely useful cmdlet that will enumerate the available methods and properties of an object. You can pipe the object to Get-Member or pass it in with the -InputObject parameter.
  146.  
  147. $serv | Get-Member
  148.  
  149. Get-Member -InputObject $serv
  150.  
  151.  
  152.  
  153.  
  154.  
  155. Let’s use a method and a property with our object.
  156.  
  157. $serv.Status
  158. $serv.Stop()
  159. $serv.Refresh()
  160. $serv.Status
  161. $serv.Start()
  162. $serv.Refresh()
  163. $serv.Status
  164.  
  165.  
  166.  
  167.  
  168. Methods can return properties and properties can have sub properties. You can chain them together by appending them to the first call.
  169.  
  170.  
  171.  
  172.  
  173.  
  174.  
- Run a cmdlet through a pipe and refer to the current object as $_
  176. Get-Service | where-object { $_.Status -eq "Running"}
  177.  
  178.  
  179. Variables
  180. ---------
  181.  
$vs1 = 1
$vs1.GetType().Name


$vs1 = "string "
$vs1.GetType().Name
  188.  
  189.  
  190.  
  191. - Get a listing of variables
  192. Get-variable
Get-ChildItem variable:
  194.  
  195.  
  196.  
  197. For Loops
  198. ---------
  199. 1..5 | ForEach-Object { $Sum = 0 } { $Sum += $_ }
  200.  
  201.  
  202.  
  203.  
$Numbers = 4..7
# 'return' skips the current item; 'continue' here would stop the whole pipeline.
1..10 | ForEach-Object { if ($Numbers -contains $_) { return }; $_ }
  207.  
  208.  
  209.  
  210.  
  211. foreach ($i in (1..10)){
  212. if ($i -gt 5){
  213. continue
  214. }
  215. $i
}
  217.  
  218.  
  219.  
  220.  
  221.  
  222.  
  223. PSDrives
  224. --------
  225.  
To get a list of the PSDrives currently available on a system, use the Get-PSDrive cmdlet.

To get a list of the providers available to the current session with the modules it has loaded, use the Get-PSProvider cmdlet.
  229.  
  230. The default PSDrives created when a Shell Session is started are:
  231.  
  232. - Alias - Represent all aliases valid for the current PowerShell session.
  233.  
  234. - Cert - Certificate store for the user represented in Current Location.
  235.  
  236. - Env - All environment variables for the current PowerShell Session
  237.  
- Function - All functions available for the current PowerShell session
  239.  
  240. - HKLM - Registry HKey Local Machine registry hive
  241.  
  242. - HKCU - Registry HKCU Current user hive
  243.  
  244. - WSMan - WinRM (Windows Remote Management) configuration and credentials
  245.  
  246.  
  247.  
  248.  
  249. Playing with WMI
  250. ----------------
  251.  
  252. # List all namespaces in the default root/cimv2
  253. Get-WmiObject -Class __namespace | Select-Object Name
  254.  
  255.  
  256. # List all namespaces under root/microsoft
  257. Get-WmiObject -Class __namespace -Namespace root/microsoft | Select-Object Name
  258.  
  259. # To list classes under the default namespace
  260. Get-WmiObject -List *
  261.  
  262. # To filter classes with the word network in their name
  263. Get-WmiObject -List *network*
  264.  
  265.  
  266. # To list classes in another namespace
  267. Get-WmiObject -List * -Namespace root/microsoft/homenet
  268.  
  269.  
  270. # To get a description of a class
  271. (Get-WmiObject -list win32_service -Amended).qualifiers | Select-Object name, value | ft -AutoSize -Wrap
  272.  
  273.  
  274.  
  275.  
  276. PowerShell treats WMI objects the same as .Net Objects so we can use Select-Object, Where-Object, ForEach-Object and Formatting cmdlets like we do with any other .Net object type.
  277.  
  278. In the case of WMI with Get-WMIObject we also have the ability to use filters based on WQL Operators with the -Filter parameter
  279.  
  280. $wmishare = [wmiclass] "win32_process"
  281. $wmishare.Methods
  282.  
  283.  
  284. Invoke-WMIMethod -class Win32_Process -Name create -ArgumentList 'calc.exe'
  285.  
  286.  
  287.  
  288.  
  289.  
  290.  
  291.  
  292.  
  293. Get-PSProvider Registry
  294.  
  295. - To list sub-keys of a registry path
  296. Get-childItem -Path hkcu:\
  297.  
  298. - To copy a key and all sub-keys
  299. Copy-Item -Path 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion' -Destination hkcu: -Recurse
  300.  
  301. - To create a key
  302. New-Item -Path HKCU:\_DeleteMe
  303.  
  304. - To Remove keys
  305. Remove-Item -Path HKCU:\_DeleteMe
  306. Remove-Item -Path HKCU:\CurrentVersion
  307.  
  308.  
  309.  
  310.  
  311. - Selecting Objects
  312. - Selecting specific Objects from a list
Get-Process | Sort-Object workingset -Descending
  314.  
  315. $str = "my string"
  316. $str.contains(" ")
  317.  
  318. - Selecting a range of objects from a list
  319. Get-Process | Sort-Object workingset -Descending | Select-Object -Index (0..4)
  320.  
  321. - Creating/Renaming a property
  322. Get-Process | Select-Object -Property name,@{name = 'PID'; expression = {$_.id}}
  323.  
  324.  
  325. Get-Process | Sort-Object workingset -Descending | Select-Object -Index 0,1,2,3,4
  326.  
  327. #############################
  328. # Simple Event Log Analysis #
  329. #############################
  330.  
  331. Step 1: Dump the event logs
  332. ---------------------------
  333. The first thing to do is to dump them into a format that facilitates later processing with Windows PowerShell.
  334.  
To dump the event log, you can use the Get-EventLog and the Export-Clixml cmdlets if you are working with a traditional event log such as the Security, Application, or System event logs.
If you need to work with one of the trace logs, use the Get-WinEvent and the Export-Clixml cmdlets.
  337.  
  338. Get-EventLog -LogName application | Export-Clixml Applog.xml
  339.  
  340. type .\Applog.xml
  341.  
  342. $logs = "system","application","security"
  343.  
  344. The % symbol is an alias for the Foreach-Object cmdlet. It is often used when working interactively from the Windows PowerShell console
  345.  
  346. $logs | % { get-eventlog -LogName $_ | Export-Clixml "$_.xml" }
  347.  
  348.  
  349.  
  350.  
  351.  
  352. Step 2: Import the event log of interest
  353. ----------------------------------------
  354. To parse the event logs, use the Import-Clixml cmdlet to read the stored XML files.
  355. Store the results in a variable.
  356. Let's take a look at the commandlets Where-Object, Group-Object, and Select-Object.
  357.  
  358. The following two commands first read the exported security log contents into a variable named $seclog, and then the five oldest entries are obtained.
  359.  
  360. $seclog = Import-Clixml security.xml
  361.  
  362. $seclog | select -Last 5
  363.  
  364.  
  365. Cool trick from one of our students named Adam. This command allows you to look at the logs for the last 24 hours:
  366.  
  367. Get-EventLog Application -After (Get-Date).AddDays(-1)
  368.  
  369. You can use '-after' and '-before' to filter date ranges
  370.  
One thing you must keep in mind is that once you export the security log to XML, it is no longer protected by anything more than the NTFS and share permissions that are assigned to the location where you store everything.
  372. By default, an ordinary user does not have permission to read the security log.
  373.  
  374.  
  375.  
  376.  
  377. Step 3: Drill into a specific entry
  378. -----------------------------------
  379. To view the entire contents of a specific event log entry, choose that entry, send the results to the Format-List cmdlet, and choose all of the properties.
  380.  
  381.  
  382. $seclog | select -first 1 | fl *
  383.  
  384. The message property contains the SID, account name, user domain, and privileges that are assigned for the new login.
  385.  
  386.  
  387. ($seclog | select -first 1).message
  388.  
  389. (($seclog | select -first 1).message).gettype()
  390.  
  391.  
  392.  
  393. In the *nix world you often want a count of something (wc -l).
  394. How often is the SeSecurityPrivilege privilege mentioned in the message property?
  395. To obtain this information, pipe the contents of the security log to a Where-Object to filter the events, and then send the results to the Measure-Object cmdlet to determine the number of events:
  396.  
  397. $seclog | ? { $_.message -match 'SeSecurityPrivilege'} | measure
  398.  
To check which event IDs the entries containing SeSecurityPrivilege belong to, use Group-Object to group the matches by the EventID property.
  400.  
  401.  
  402. $seclog | ? { $_.message -match 'SeSecurityPrivilege'} | group eventid
  403.  
Importing the stored XML into a variable yields a collection of event log entries, so the Count property is also available.
  405. Use the count property to determine the total number of entries in the event log.
  406.  
  407. $seclog.Count
  408.  
  409.  
  410.  
  411.  
  412.  
  413.  
  414. ############################
  415. # Simple Log File Analysis #
  416. ############################
  417.  
  418.  
You'll need to create the directory c:\ps and download the sample IIS log from http://pastebin.com/raw.php?i=LBn64cyA
  420.  
  421.  
  422. mkdir c:\ps
  423. cd c:\ps
  424. (new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=LBn64cyA", "c:\ps\u_ex1104.log")
  425.  
  426.  
  427.  
  428.  
  429.  
  430.  
  431.  
  432.  
  433. ###############################################
  434. # Intrusion Analysis Using Windows PowerShell #
  435. ###############################################
  436.  
  437. Download sample file http://pastebin.com/raw.php?i=ysnhXxTV into the c:\ps directory
  438.  
  439.  
  440.  
  441.  
  442.  
  443. (new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=ysnhXxTV", "c:\ps\CiscoLogFileExamples.txt")
  444.  
  445. Select-String 192.168.208.63 .\CiscoLogFileExamples.txt
  446.  
  447.  
  448.  
  449.  
  450. The Select-String cmdlet searches for text and text patterns in input strings and files. You can use it like Grep in UNIX and Findstr in Windows.
  451.  
  452. Select-String 192.168.208.63 .\CiscoLogFileExamples.txt | select line
  453.  
  454.  
  455.  
  456.  
  457. To see how many connections are made when analyzing a single host, the output from that can be piped to another command: Measure-Object.
  458.  
  459. Select-String 192.168.208.63 .\CiscoLogFileExamples.txt | select line | Measure-Object
  460.  
  461.  
  462.  
  463. To select all IP addresses in the file expand the matches property, select the value, get unique values and measure the output.
  464.  
Select-String "\b(?:\d{1,3}\.){3}\d{1,3}\b" .\CiscoLogFileExamples.txt | select -ExpandProperty matches | select -ExpandProperty value | Sort-Object -Unique | Measure-Object
  466.  
  467.  
  468.  
  469. Removing Measure-Object shows all the individual IPs instead of just the count of the IP addresses. The Measure-Object command counts the IP addresses.
  470.  
Select-String "\b(?:\d{1,3}\.){3}\d{1,3}\b" .\CiscoLogFileExamples.txt | select -ExpandProperty matches | select -ExpandProperty value | Sort-Object -Unique
  472.  
  473.  
To determine which IP addresses communicate the most, drop the final Sort-Object -Unique stage so that every matched value is kept. Then pipe the values to Group-Object (alias: group) to group identical IP addresses, and sort the groups with the alias for Sort-Object: sort count -des.
This counts each IP address, sorts the results in descending order of count, and delivers the output to the shell.

Select-String "\b(?:\d{1,3}\.){3}\d{1,3}\b" .\CiscoLogFileExamples.txt | select -ExpandProperty matches | select value | group value | sort count -des
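
The same "extract, group, sort by count" pipeline can be sketched in Python using the identical regular expression. The log text below is invented for illustration (it is not from CiscoLogFileExamples.txt). Keep in mind that this pattern only checks the shape of an address, so something like 999.999.999.999 would also match.

```python
import re
from collections import Counter

# Same pattern as in the Select-String commands.
ip_pattern = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

# Hypothetical log text, invented for illustration.
log_text = (
    "Built inbound TCP connection from 192.168.208.63 to 10.0.0.5\n"
    "Teardown TCP connection from 192.168.208.63 to 10.0.0.7\n"
    "Built outbound UDP connection from 10.0.0.5 to 192.168.208.63\n"
)

# findall() returns every match; Counter groups and counts them.
ip_counts = Counter(ip_pattern.findall(log_text))

# Equivalent of "Sort-Object -Unique | Measure-Object"
unique_ips = sorted(ip_counts)
print(len(unique_ips))

# Equivalent of "group value | sort count -des"
for ip, count in ip_counts.most_common():
    print("%5d %s" % (count, ip))
```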
  478.  
  479.  
  480.  
  481.  
The command below reads the logging settings of the Windows Firewall. Logging should be enabled through Group Policy (GPO) so that there is something to analyze.
The command shows that the firewall log is at:
%systemroot%\system32\LogFiles\Firewall\pfirewall.log. To open the file, PowerShell must be run with administrative privileges.


The first step is to get the log path into a variable using script logic.
Thankfully, PowerShell has a built-in integrated scripting environment, the PowerShell ISE (powershell_ise.exe).

netsh advfirewall show allprofiles | Select-String FileName | select -ExpandProperty line | Select-String "%systemroot%.+\.log" | select -ExpandProperty matches | select -ExpandProperty value | sort -uniq
  491.  
  492.  
  493. ##############################################
  494. # Parsing Log files using windows PowerShell #
  495. ##############################################
  496.  
  497. Download the sample IIS log http://pastebin.com/LBn64cyA
  498.  
  499.  
  500. (new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=LBn64cyA", "c:\ps\u_ex1104.log")
  501.  
  502. Get-Content ".\*log" | ? { ($_ | Select-String "WebDAV")}
  503.  
  504.  
  505.  
  506. The above command would give us all the WebDAV requests.
  507.  
To narrow this down further, for example to a particular HTTP method (or any other string, such as a user name), add a second condition:
  509.  
  510. Get-Content ".\*log" | ? { ($_ | Select-String "WebDAV") -and ($_ | Select-String "OPTIONS")}
  511.  
  512.  
  513.  
  514. Some more options that will be more commonly required :
  515.  
  516. For Outlook Web Access : Replace WebDAV with OWA
  517.  
  518. For EAS : Replace WebDAV with Microsoft-server-activesync
  519.  
  520. For ECP : Replace WebDAV with ECP
  521.  
  522.  
  523.  
  524.  
  525.  
  526.  
  527.  
  528. ####################################################################
  529. # Windows PowerShell: Extracting Strings Using Regular Expressions #
  530. ####################################################################
  531. To build a script that will extract data from a text file and place the extracted text into another file, we need three main elements:
  532.  
  533. 1) The input file that will be parsed
  534.  
  535. (new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=rDN3CMLc", "c:\ps\emails.txt")
  536. (new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=XySD8Mi2", "c:\ps\ip_addresses.txt")
  537. (new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=v5Yq66sH", "c:\ps\URL_addresses.txt")
  538.  
  539. 2) The regular expression that the input file will be compared against
  540.  
  541. 3) The output file for where the extracted data will be placed.
  542.  
  543. Windows PowerShell has a “select-string” cmdlet which can be used to quickly scan a file to see if a certain string value exists.
  544. Using some of the parameters of this cmdlet, we are able to search through a file to see whether any strings match a certain pattern, and then output the results to a separate file.
  545.  
  546. To demonstrate this concept, below is a Windows PowerShell script I created to search through a text file for strings that match the Regular Expression (or RegEx for short) pattern belonging to e-mail addresses.
  547.  
$input_path = 'c:\ps\emails.txt'
$output_file = 'c:\ps\extracted_addresses.txt'
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
  552.  
  553. In this script, we have the following variables:
  554.  
  555. 1) $input_path to hold the path to the input file we want to parse
  556.  
  557. 2) $output_file to hold the path to the file we want the results to be stored in
  558.  
  559. 3) $regex to hold the regular expression pattern to be used when the strings are being matched.
  560.  
  561. The select-string cmdlet contains various parameters as follows:
  562.  
  563. 1) “-Path” which takes as input the full path to the input file
  564.  
  565. 2) “-Pattern” which takes as input the regular expression used in the matching process
  566.  
3) "-AllMatches" which returns every match on a line (without this parameter only the first match on each line is returned). The results are piped to "$_.Matches" and then "$_.Value", which take the value of each match.
  568.  
  569. Using “>” the results are written to the destination specified in the $output_file variable.
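
The three elements (input file, regular expression, output file) carry over to other languages as well. Here is a rough Python sketch of the same idea, using the same e-mail pattern. The helper name extract_to_file, the temporary directory, and the sample text are all assumptions made for this illustration.

```python
import os
import re
import tempfile

# Same e-mail pattern as the PowerShell script.
email_regex = re.compile(r'\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b')

def extract_to_file(input_path, output_path, regex):
    """Write every match of regex found in input_path to output_path, one per line."""
    with open(input_path) as source:
        matches = regex.findall(source.read())
    with open(output_path, "w") as destination:
        for match in matches:
            destination.write(match + "\n")
    return matches

# Hypothetical input, written to a temporary directory for the demo.
work_dir = tempfile.mkdtemp()
input_path = os.path.join(work_dir, "emails.txt")
output_path = os.path.join(work_dir, "extracted_addresses.txt")
with open(input_path, "w") as handle:
    handle.write("Contact alice@example.com or bob.smith@mail.example.org.\n"
                 "This line has no address.\n")

addresses = extract_to_file(input_path, output_path, email_regex)
print(addresses)
```

Because the pattern limits the last part of the domain to 2-4 letters ({2,4}), addresses with longer top-level domains would be missed. Widen that range if you need them.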
  570.  
  571. Here are two further examples of this script which incorporate a regular expression for extracting IP addresses and URLs.
  572.  
  573. IP addresses
  574. ------------
  575. For the purposes of this example, I ran the tracert command to trace the route from my host to google.com and saved the results into a file called ip_addresses.txt. You may choose to use this script for extracting IP addresses from router logs, firewall logs, debug logs, etc.
  576.  
$input_path = 'c:\ps\ip_addresses.txt'
$output_file = 'c:\ps\extracted_ip_addresses.txt'
$regex = '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
  581.  
  582.  
  583. URLs
  584. ----
  585. For the purposes of this example, I created a couple of dummy web server log entries and saved them into URL_addresses.txt.
  586. You may choose to use this script for extracting URL addresses from proxy logs, network packet capture logs, debug logs, etc.
  587.  
$input_path = 'c:\ps\URL_addresses.txt'
$output_file = 'c:\ps\extracted_URL_addresses.txt'
$regex = '([a-zA-Z]{3,})://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)*?'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
  592.  
  593.  
  594. In addition to the examples above, many other types of strings can be extracted using this script.
  595. All you need to do is switch the regular expression in the “$regex” variable!
  596. In fact, the beauty of such a PowerShell script is its simplicity and speed of execution.
  597.  
  598.  
  599. ###################
  600. # Regex in Python #
  601. ###################
  602.  
  603.  
  604.  
  605.  
  606. **************************************************
  607. * What is Regular Expression and how is it used? *
  608. **************************************************
  609.  
  610.  
Simply put, a regular expression is a sequence of characters mainly used to find and replace patterns in a string or file.
  612.  
  613.  
  614. Regular expressions use two types of characters:
  615.  
  616. a) Meta characters: As the name suggests, these characters have a special meaning, similar to * in wildcard.
  617.  
  618. b) Literals (like a,b,1,2…)
  619.  
  620.  
In Python, the "re" module provides support for regular expressions, so you need to import re before you can use them.
  622.  
  623.  
  624. Use this code --> import re
  625.  
  626.  
  627.  
  628.  
  629. The most common uses of regular expressions are:
  630. --------------------------------------------------
  631.  
  632. - Search a string (search and match)
  633. - Finding a string (findall)
  634. - Break string into a sub strings (split)
  635. - Replace part of a string (sub)
  636.  
  637.  
  638.  
  639. Let's look at the methods that library "re" provides to perform these tasks.
  640.  
  641.  
  642.  
  643. ****************************************************
  644. * What are various methods of Regular Expressions? *
  645. ****************************************************
  646.  
  647.  
The 're' package provides multiple methods to perform queries on an input string. Here are the most commonly used methods, which are discussed below:
  649.  
  650. re.match()
  651. re.search()
  652. re.findall()
  653. re.split()
  654. re.sub()
  655. re.compile()
  656.  
  657. Let's look at them one by one.
  658.  
  659.  
  660. re.match(pattern, string):
  661. -------------------------------------------------
  662.  
This method finds a match only if it occurs at the start of the string. For example, calling match() on the string 'AV Analytics ESET AV' and looking for the pattern 'AV' will match. However, if we look for only 'Analytics', the pattern will not match. Let's perform it in Python now.
  664.  
  665. Code
  666.  
  667. import re
  668. result = re.match(r'AV', 'AV Analytics ESET AV')
  669. print result
  670.  
  671. Output:
  672. <_sre.SRE_Match object at 0x0000000009BE4370>
  673.  
Above, it shows that a pattern match has been found. To print the matching string we'll use the group() method, which returns the matched text. Use "r" at the start of the pattern string; it designates a Python raw string.
  675.  
  676.  
  677. result = re.match(r'AV', 'AV Analytics ESET AV')
  678. print result.group(0)
  679.  
  680. Output:
  681. AV
  682.  
  683.  
Let's now find 'Analytics' in the given string. Here the string does not start with 'Analytics', so match() should return no match. Let's see what we get:
  685.  
  686.  
  687. Code
  688.  
  689. result = re.match(r'Analytics', 'AV Analytics ESET AV')
  690. print result
  691.  
  692.  
  693. Output:
  694. None
  695.  
  696.  
  697. There are methods like start() and end() to know the start and end position of matching pattern in the string.
  698.  
  699. Code
  700.  
  701. result = re.match(r'AV', 'AV Analytics ESET AV')
  702. print result.start()
  703. print result.end()
  704.  
  705. Output:
  706. 0
  707. 2
  708.  
Above you can see the start and end positions of the matching pattern 'AV' in the string. These positions are often helpful when manipulating the string.
  710.  
  711.  
  712.  
  713.  
  714.  
  715. re.search(pattern, string):
  716. -----------------------------------------------------
  717.  
  718.  
  719. It is similar to match() but it doesn't restrict us to find matches at the beginning of the string only. Unlike previous method, here searching for pattern ‘Analytics' will return a match.
  720.  
  721. Code
  722.  
  723. result = re.search(r'Analytics', 'AV Analytics ESET AV')
  724. print result.group(0)
  725.  
  726. Output:
  727. Analytics
  728.  
Here you can see that the search() method is able to find a pattern at any position in the string, but it only returns the first occurrence of the search pattern.
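
Since both match() and search() return None when nothing is found, calling group() directly on the result can fail. A small sketch of guarding against that is shown here. The helper first_match is a hypothetical name used only for this example, and print() is written with parentheses so the snippet also runs on Python 3.

```python
import re

text = 'AV Analytics ESET AV'

def first_match(pattern, string, anchored=False):
    """Return the matched text, or None if there is no match."""
    # re.match only succeeds at position 0; re.search scans the whole string.
    finder = re.match if anchored else re.search
    result = finder(pattern, string)
    return result.group(0) if result else None

anchored_result = first_match(r'Analytics', text, anchored=True)
searched_result = first_match(r'Analytics', text)

print(anchored_result)
print(searched_result)
```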
  730.  
  731.  
  732.  
  733.  
  734.  
  735.  
  736. re.findall (pattern, string):
  737. ------------------------------------------------------
  738.  
  739.  
It returns a list of all non-overlapping matches of the pattern. It has no constraint of searching from the start or end. If we use findall to search for 'AV' in the given string, it will return both occurrences of AV. re.findall() is a convenient default when you only need the matched text, but note that it returns plain strings rather than match objects, so use re.search() or re.match() when you need positions or groups.
  741.  
  742.  
  743. Code
  744.  
  745. result = re.findall(r'AV', 'AV Analytics ESET AV')
  746. print result
  747.  
  748. Output:
  749. ['AV', 'AV']
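
If you also need to know where each occurrence sits, re.finditer() is worth a look: it yields one match object per occurrence. The sketch below reuses the same string. It also shows that findall() returns tuples of groups once the pattern contains capturing parentheses.

```python
import re

text = 'AV Analytics ESET AV'

# finditer() yields match objects, so start() and end() are available.
occurrences = [(m.group(0), m.start(), m.end()) for m in re.finditer(r'AV', text)]
print(occurrences)

# With capturing groups, findall() returns the groups instead of the whole match.
pairs = re.findall(r'(A)(V)', text)
print(pairs)
```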
  750.  
  751.  
  752.  
  753.  
  754.  
  755. re.split(pattern, string, [maxsplit=0]):
  756. ------------------------------------------------------
  757.  
  758.  
  759.  
This method splits a string at each occurrence of the given pattern.
  761.  
  762.  
  763. Code
  764.  
  765. result=re.split(r'y','Analytics')
  766. result
  767.  
  768. Output:
  769. ['Anal', 'tics']
  770.  
Above, we have split the string "Analytics" by "y". Method split() has another argument, "maxsplit", with a default value of zero. With the default it performs every split that can be done, but if we give maxsplit a value, it performs at most that many splits. Let's look at the example below:
  772.  
  773.  
  774. Code
  775.  
  776. result=re.split(r's','Analytics eset')
  777. print result
  778.  
  779. Output:
['Analytic', ' e', 'et'] # All possible splits on "s" were performed; note the space kept in ' e'.
  781.  
  782. Code
  783.  
  784. result=re.split(r's','Analytics eset',maxsplit=1)
  785. result
  786.  
  787. Output:
['Analytic', ' eset']
  789.  
Here, you can see that we have fixed maxsplit to 1. As a result, the list has only two values, whereas the first example has three.
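
The examples above split on a single fixed character, but the first argument is a full regular expression. As a small sketch (the input string is made up for illustration), here is a split on "comma or semicolon followed by optional whitespace":

```python
import re

record = 'alpha, beta;gamma'

# Split on a comma or semicolon plus any whitespace that follows it.
fields = re.split(r'[,;]\s*', record)
print(fields)

# maxsplit=1 stops after the first delimiter.
head_and_rest = re.split(r'[,;]\s*', record, maxsplit=1)
print(head_and_rest)

# A capturing group around the delimiter keeps the delimiters in the result.
with_delimiters = re.split(r'([,;])\s*', record)
print(with_delimiters)
```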
  791.  
  792.  
  793.  
  794.  
  795. re.sub(pattern, repl, string):
  796. ----------------------------------------------------------
  797.  
This method searches for a pattern and replaces each match with a new substring. If the pattern is not found, the string is returned unchanged.
  799.  
  800. Code
  801.  
result=re.sub(r'Ruby','Python','Joe likes Ruby')
result

Output:
'Joe likes Python'
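
A few variations on re.sub() that tend to come up, sketched with made-up strings: the optional count argument limits how many replacements are made, re.subn() also reports how many replacements happened, and the pattern can be a character class rather than a fixed word.

```python
import re

text = 'Joe likes Ruby, and Ruby likes Joe'

# Replace every occurrence (the default).
all_replaced = re.sub(r'Ruby', 'Python', text)

# Replace only the first occurrence.
first_only = re.sub(r'Ruby', 'Python', text, count=1)

# subn() returns the new string together with the number of replacements.
new_text, how_many = re.subn(r'Ruby', 'Python', text)

# The pattern does not have to be a fixed word: mask every digit.
masked = re.sub(r'\d', 'X', 'PIN 1234')

print(all_replaced)
print(first_only)
print(how_many)
print(masked)
```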
  806.  
  807.  
  808.  
  809.  
  810.  
re.compile(pattern, flags=0):
  812. ----------------------------------------------------------
  813.  
  814.  
We can compile a regular expression pattern into a pattern object, which can then be used for pattern matching. This also lets us search for a pattern again without rewriting it.
  816.  
  817.  
  818. Code
  819.  
  820. import re
  821. pattern=re.compile('XSS')
result=pattern.findall('XSS is Cross Site Scripting, XSS')
print result
result2=pattern.findall('XSS is Cross Site Scripting, SQLi is Sql Injection')
print result2

Output:
  827. ['XSS', 'XSS']
  828. ['XSS']
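
re.compile() also accepts flags. As a sketch, the compiled pattern below uses re.IGNORECASE so that one pattern object catches "XSS", "xss", and "Xss" across several strings. The input lines are made up for illustration.

```python
import re

# Compile once with a flag, then reuse the pattern object.
xss_pattern = re.compile(r'xss', re.IGNORECASE)

# Hypothetical input lines, invented for illustration.
lines = [
    'XSS is Cross Site Scripting',
    'Stored xss and reflected Xss',
    'SQLi is Sql Injection',
]

hits = [xss_pattern.findall(line) for line in lines]
print(hits)
```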
  829.  
So far, we have looked at various regular expression methods using a constant pattern (fixed characters). But what if we do not have a constant search pattern and we want to return a specific set of characters (defined by a rule) from a string? Don't be intimidated.
  831.  
  832. This can easily be solved by defining an expression with the help of pattern operators (meta and literal characters). Let's look at the most common pattern operators.
  833.  
  834.  
  835.  
  836.  
  837.  
  838. **********************************************
  839. * What are the most commonly used operators? *
  840. **********************************************
  841.  
  842.  
Regular expressions can specify patterns, not just fixed characters. Here are the most commonly used operators that help to build an expression representing the required characters in a string or file. They are commonly used in web scraping and text mining to extract required information.
  844.  
Operators     Description
.             Matches any single character except newline '\n'.
?             Matches 0 or 1 occurrence of the pattern to its left.
+             Matches 1 or more occurrences of the pattern to its left.
*             Matches 0 or more occurrences of the pattern to its left.
\w            Matches an alphanumeric character or underscore; \W (upper case W) matches any other character.
\d            Matches a digit [0-9]; \D (upper case D) matches a non-digit.
\s            Matches a single whitespace character (space, newline, return, tab, form feed); \S (upper case S) matches any non-whitespace character.
\b            Matches the boundary between a word and a non-word character; \B is the opposite of \b.
[..]          Matches any single character in the square brackets; [^..] matches any single character not in the square brackets.
\             Escapes characters with special meaning, e.g. \. to match a period or \+ to match a plus sign.
^ and $       Match the start and the end of the string respectively.
{n,m}         Matches at least n and at most m occurrences of the preceding expression; {,m} matches from 0 up to m occurrences.
a|b           Matches either a or b.
( )           Groups regular expressions and returns the matched text.
\t, \n, \r    Match tab, newline, return.
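
A quick way to get a feel for the operators above is to try a few of them against a throwaway string. The sketch below does that; the sample text and variable names are made up for illustration, and print() is used so it runs on Python 3.

```python
import re

text = 'Error 404 at 10:15, retry in 3s.'

# \d+ : one or more digits
numbers = re.findall(r'\d+', text)

# \b and {n,m} : whole numbers made of 2 to 3 digits
two_or_three = re.findall(r'\b\d{2,3}\b', text)

# ? : the "u" is optional
spellings = re.findall(r'colou?r', 'color colour')

# | : either alternative
animals = re.findall(r'cat|dog', 'cat dog bird')

# \. : a literal period rather than "any character"
dots = re.findall(r'\.', text)

print(numbers, two_or_three, spellings, animals, dots)
```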
  861.  
  862.  
For more details on the meta characters "(", ")", "|" and others, you can refer to this link (https://docs.python.org/2/library/re.html).

Now, let's understand the pattern operators by looking at the examples below.
  866.  
  867.  
  868.  
  869. ****************************************
  870. * Some Examples of Regular Expressions *
  871. ****************************************
  872.  
  873. ******************************************************
  874. * Problem 1: Return the first word of a given string *
  875. ******************************************************
  876.  
  877.  
  878. Solution-1 Extract each character (using "\w")
  879. ---------------------------------------------------------------------------
  880.  
  881. Code
  882.  
  883. import re
  884. result=re.findall(r'.','Python is the best scripting language')
  885. print result
  886.  
  887. Output:
  888. ['P', 'y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'b', 'e', 's', 't', ' ', 's', 'c', 'r', 'i', 'p', 't', 'i', 'n', 'g', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e']
  889.  
  890.  
Above, the spaces are also extracted. To avoid this, use "\w" instead of ".".
  892.  
  893.  
  894. Code
  895.  
  896. result=re.findall(r'\w','Python is the best scripting language')
  897. print result
  898.  
  899. Output:
  900. ['P', 'y', 't', 'h', 'o', 'n', 'i', 's', 't', 'h', 'e', 'b', 'e', 's', 't', 's', 'c', 'r', 'i', 'p', 't', 'i', 'n', 'g', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e']
  901.  
  902.  
  903.  
  904.  
  905. Solution-2 Extract each word (using "*" or "+")
  906. ---------------------------------------------------------------------------
  907.  
  908. Code
  909.  
  910. result=re.findall(r'\w*','Python is the best scripting language')
  911. print result
  912.  
  913. Output:
  914. ['Python', '', 'is', '', 'the', '', 'best', '', 'scripting', '', 'language', '']
  915.  
  916.  
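Again, the result is not clean: it contains empty strings, because "*" matches zero or more occurrences of the pattern to its left, so an empty match is reported at each space and at the end of the string. To remove these empty strings we will go with "+".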
  918.  
  919. Code
  920.  
  921. result=re.findall(r'\w+','Python is the best scripting language')
  922. print result
  923. Output:
  924. ['Python', 'is', 'the', 'best', 'scripting', 'language']
  925.  
  926.  
  927.  
  928.  
  929.  
Solution-3 Extract the first word (using "^")
  931. -------------------------------------------------------------------------------------
  932.  
  933.  
  934. Code
  935.  
  936. result=re.findall(r'^\w+','Python is the best scripting language')
  937. print result
  938.  
  939. Output:
  940. ['Python']
  941.  
If we use "$" instead of "^", it will return the last word of the string. Let's look at it.
  943.  
  944. Code
  945.  
  946. result=re.findall(r'\w+$','Python is the best scripting language')
  947. print result
  948. Output:
['language']
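
When the input has several lines (a log file, for example), "^" and "$" only anchor to the whole string by default. The sketch below assumes a made-up two-line string and shows how the re.MULTILINE flag makes them match at every line instead; print() is used so it runs on Python 3.

```python
import re

text = 'Python is the best\nRegex saves time'

# Default: ^ and $ anchor to the start and end of the whole string.
first_word = re.findall(r'^\w+', text)
last_word = re.findall(r'\w+$', text)

# With re.MULTILINE they anchor to the start and end of every line.
first_words = re.findall(r'^\w+', text, re.MULTILINE)
last_words = re.findall(r'\w+$', text, re.MULTILINE)

print(first_word, last_word)
print(first_words, last_words)
```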
  950.  
  951.  
  952.  
  953.  
  954.  
***********************************************************
* Problem 2: Return the first two characters of each word *
***********************************************************
  958.  
  959.  
  960.  
  961.  
Solution-1 Extract two consecutive characters at a time, excluding spaces (using "\w")
  963. ------------------------------------------------------------------------------------------------------
  964.  
  965. Code
  966.  
  967. result=re.findall(r'\w\w','Python is the best')
  968. print result
  969.  
  970. Output:
['Py', 'th', 'on', 'is', 'th', 'be', 'st']
  972.  
  973.  
  974.  
  975.  
  976.  
  977.  
Solution-2 Extract the two consecutive characters at the start of each word boundary (using "\b")
  979. ------------------------------------------------------------------------------------------------------
  980.  
  981. Code
  982.  
  983. result=re.findall(r'\b\w.','Python is the best')
  984. print result
  985.  
  986. Output:
['Py', 'is', 'th', 'be']
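
One thing to watch with the pattern above: "." matches any character, so a one-letter word drags the following space into the result. The sketch below compares three variants on a made-up string that contains a one-letter word; print() is used so it runs on Python 3.

```python
import re

text = 'I am Joe'

# "." happily matches the space after the one-letter word "I".
with_dot = re.findall(r'\b\w.', text)

# Requiring two word characters skips one-letter words entirely.
two_word_chars = re.findall(r'\b\w\w', text)

# {1,2} keeps one-letter words without picking up the space.
up_to_two = re.findall(r'\b\w{1,2}', text)

print(with_dot, two_word_chars, up_to_two)
```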
  988.  
  989.  
  990.  
  991.  
  992.  
  993.  
  994. ********************************************************
  995. * Problem 3: Return the domain type of given email-ids *
  996. ********************************************************
  997.  
  998.  
To explain it simply, I will again go with a stepwise approach:
  1000.  
  1001.  
  1002.  
  1003.  
  1004.  
  1005. Solution-1 Extract all characters after "@"
  1006. ------------------------------------------------------------------------------------------------------------------
  1007.  
  1008. Code
  1009.  
  1010. result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.com, test.first@strategicsec.com, first.test@rest.biz')
  1011. print result
  1012.  
Output:
['@gmail', '@test', '@strategicsec', '@rest']
  1014.  
  1015.  
  1016.  
Above, you can see that the ".com" and ".biz" parts are not extracted. To add them, we will go with the code below. The period is escaped as "\." so that it matches only a literal period.


Code

result=re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.com, test.first@strategicsec.com, first.test@rest.biz')
  1021. print result
  1022.  
  1023. Output:
  1024. ['@gmail.com', '@test.com', '@strategicsec.com', '@rest.biz']
  1025.  
  1026.  
  1027.  
  1028.  
  1029.  
  1030.  
Solution-2 Extract only the domain type using "( )"
  1032. -----------------------------------------------------------------------------------------------------------------------
  1033.  
  1034.  
  1035. Code
  1036.  
result=re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.com, test.first@strategicsec.com, first.test@rest.biz')
  1038. print result
  1039.  
  1040. Output:
  1041. ['com', 'com', 'com', 'biz']
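
The sketch below shows why escaping the period matters and how a non-capturing group (?:...) can cope with domains that have more than two parts. The e-mail addresses are made up for illustration (one of them is deliberately malformed), and print() is used so it runs on Python 3.

```python
import re

emails = 'abc.test@gmail.com, xyz@mail.example.co.uk, bad@examplecom'

# Unescaped "." matches any character, so the malformed address slips through.
loose = re.findall(r'@\w+.\w+', emails)

# Escaped "\." requires a real period.
strict = re.findall(r'@\w+\.\w+', emails)

# Full domain: a label followed by one or more ".label" parts.
domains = re.findall(r'@(\w+(?:\.\w+)+)', emails)

# Domain type only: capture the last label.
domain_types = re.findall(r'@\w+(?:\.\w+)*\.(\w+)', emails)

print(loose)
print(strict)
print(domains)
print(domain_types)
```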
  1042.  
  1043.  
  1044.  
  1045.  
  1046.  
  1047.  
  1048. ********************************************
  1049. * Problem 4: Return date from given string *
  1050. ********************************************
  1051.  
  1052.  
Here we will use "\d" to extract digits.
  1054.  
  1055.  
  1056. Solution:
  1057. ----------------------------------------------------------------------------------------------------------------------
  1058.  
  1059. Code
  1060.  
  1061. result=re.findall(r'\d{2}-\d{2}-\d{4}','Joe 34-3456 12-05-2007, XYZ 56-4532 11-11-2016, ABC 67-8945 12-01-2009')
  1062. print result
  1063.  
  1064. Output:
  1065. ['12-05-2007', '11-11-2016', '12-01-2009']
  1066.  
If you want to extract only the year, parentheses "( )" will again help you.
  1068.  
  1069.  
  1070. Code
  1071.  
  1072.  
  1073. result=re.findall(r'\d{2}-\d{2}-(\d{4})','Joe 34-3456 12-05-2007, XYZ 56-4532 11-11-2016, ABC 67-8945 12-01-2009')
  1074. print result
  1075.  
  1076. Output:
  1077. ['2007', '2016', '2009']
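
When a pattern has more than one group, re.findall() returns a list of tuples, one tuple per match. The sketch below reuses the same sample string and simply wraps each part of the date in its own group; the variable names are assumptions, and print() is used so it runs on Python 3.

```python
import re

text = 'Joe 34-3456 12-05-2007, XYZ 56-4532 11-11-2016, ABC 67-8945 12-01-2009'

# Three groups -> each match becomes a 3-tuple of strings.
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)
print(parts)

# Pull the last element of each tuple to get the years as integers.
years = [int(year) for _, _, year in parts]
print(years)
```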
  1078.  
  1079.  
  1080.  
  1081.  
  1082.  
  1083. *******************************************************************
* Problem 5: Return all words of a string that start with a vowel *
  1085. *******************************************************************
  1086.  
  1087.  
  1088.  
  1089.  
Solution-1 Return each word
  1091. -----------------------------------------------------------------------------------------------------------------
  1092.  
  1093. Code
  1094.  
  1095. result=re.findall(r'\w+','Python is the best')
  1096. print result
  1097.  
  1098. Output:
  1099. ['Python', 'is', 'the', 'best']
  1100.  
  1101.  
  1102.  
  1103.  
  1104.  
Solution-2 Return words that start with a vowel (using [])
  1106. ------------------------------------------------------------------------------------------------------------------
  1107.  
  1108. Code
  1109.  
result=re.findall(r'[aeiouAEIOU]\w*','I love Python')
  1111. print result
  1112.  
  1113. Output:
  1114. ['I', 'ove', 'on']
  1115.  
Above, you can see that it has returned "ove" and "on" from the middle of words. To drop these two, we need to use "\b" for the word boundary.
  1117.  
  1118.  
  1119.  
  1120.  
  1121.  
Solution-3 Return words that start with a vowel (using "\b")
  1123. ------------------------------------------------------------------------------------------------------------------
  1124.  
  1125. Code
  1126.  
result=re.findall(r'\b[aeiouAEIOU]\w*','I love Python')
  1128. print result
  1129.  
  1130. Output:
  1131. ['I']
  1132.  
  1133.  
In a similar way, we can extract words that start with a consonant by using "^" within the square brackets.
  1135.  
  1136.  
  1137. Code
  1138.  
  1139. result=re.findall(r'\b[^aeiouAEIOU]\w+','I love Python')
  1140. print result
  1141.  
  1142. Output:
  1143. [' love', ' Python']
  1144.  
Above, you can see that it has returned words starting with a space. To drop the space from the output, include a space in the square brackets [].
  1146.  
  1147.  
  1148. Code
  1149.  
  1150. result=re.findall(r'\b[^aeiouAEIOU ]\w+','I love Python')
  1151. print result
  1152.  
  1153. Output:
  1154. ['love', 'Python']
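
Instead of listing both upper and lower case vowels, the re.IGNORECASE flag can do the case folding. The sketch below is one possible variation on the patterns above, using a made-up sentence; the consonant version uses a negative lookahead (?!...) so that spaces and digits never count as a "consonant". print() is used so it runs on Python 3.

```python
import re

text = 'I love Python and Open source'

# Words starting with a vowel, in either case.
vowel_words = re.findall(r'\b[aeiou]\w*', text, re.IGNORECASE)

# Words starting with a letter that is not a vowel.
consonant_words = re.findall(r'\b(?![aeiou])[a-z]\w*', text, re.IGNORECASE)

print(vowel_words)
print(consonant_words)
```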
  1155.  
  1156.  
  1157.  
  1158.  
  1159.  
  1160.  
***********************************************************************************************
* Problem 6: Validate a phone number (phone number must have 10 digits and start with 8 or 9) *
***********************************************************************************************
  1164.  
  1165.  
We have a list of phone numbers in the list "li", and here we will validate them using regular expressions.
  1167.  
  1168.  
  1169.  
  1170.  
  1171. Solution
  1172. -------------------------------------------------------------------------------------------------------------------------------------
  1173.  
  1174.  
  1175. Code
  1176.  
import re
li=['9999999999','999999-999','99999x9999']
for val in li:
    # re.match only anchors at the start, so the length check rejects longer strings
    if re.match(r'[89][0-9]{9}',val) and len(val) == 10:
        print 'yes'
    else:
        print 'no'
  1184.  
  1185.  
  1186. Output:
  1187. yes
  1188. no
  1189. no
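
An alternative sketch that avoids the separate length check: re.fullmatch() only succeeds when the whole string matches the pattern. This assumes Python 3.4 or later, where fullmatch() is available; the helper name is_valid_phone and the two extra test numbers are made up for illustration.

```python
import re

# Compiled once: an 8 or 9 followed by exactly nine more digits.
PHONE = re.compile(r'[89][0-9]{9}')

def is_valid_phone(value):
    """Return True only if the entire string is a valid number."""
    return PHONE.fullmatch(value) is not None

numbers = ['9999999999', '999999-999', '99999x9999', '99999999990', '7999999999']
results = [is_valid_phone(n) for n in numbers]
print(results)
```

The fourth number has 11 digits and the fifth starts with 7, so both are rejected along with the two malformed entries.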
  1190.  
  1191.  
  1192.  
  1193.  
  1194.  
  1195. ******************************************************
  1196. * Problem 7: Split a string with multiple delimiters *
  1197. ******************************************************
  1198.  
  1199.  
  1200.  
  1201. Solution
  1202. ---------------------------------------------------------------------------------------------------------------------------
  1203.  
  1204.  
  1205. Code
  1206.  
  1207. import re
  1208. line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").
  1209. result= re.split(r'[;,\s]', line)
  1210. print result
  1211.  
  1212. Output:
  1213. ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
  1214.  
  1215.  
  1216.  
We can also use the method re.sub() to replace these multiple delimiters with a single one, such as a space " ".
  1218.  
  1219.  
  1220. Code
  1221.  
  1222. import re
  1223. line = 'asdf fjdk;afed,fjek,asdf,foo'
  1224. result= re.sub(r'[;,\s]',' ', line)
  1225. print result
  1226.  
  1227. Output:
  1228. asdf fjdk afed fjek asdf foo
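
If delimiters can appear back to back (two spaces, or "; "), splitting on a single character leaves empty strings in the result. The sketch below uses a made-up line to show the difference that adding "+" to the character class makes; print() is used so it runs on Python 3.

```python
import re

line = 'asdf  fjdk; afed,fjek,,asdf'

# One delimiter at a time: adjacent delimiters produce empty strings.
single = re.split(r'[;,\s]', line)

# One or more delimiters treated as a single separator.
collapsed = re.split(r'[;,\s]+', line)

print(single)
print(collapsed)
```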
  1229.  
  1230.  
  1231.  
  1232.  
  1233. **************************************************
  1234. * Problem 8: Retrieve Information from HTML file *
  1235. **************************************************
  1236.  
  1237.  
  1238.  
I want to extract information from an HTML file (see the sample data below). Here we need to extract the information between <td> and </td>, except for the first numerical index. I have assumed here that the HTML code below is stored in a string named str.
  1240.  
  1241.  
  1242.  
  1243. Sample HTML file (str)
  1244.  
  1245. <tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>
  1246. <tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
  1247. <tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
  1248. <tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>
  1249. <tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr>
  1250. <tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr align="center"><td>7</td> <td>Michael</td> <td>Emily</td></tr>

Solution:
  1253.  
  1254.  
  1255.  
  1256. Code
  1257.  
  1258. result=re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>',str)
  1259. print result
  1260.  
  1261. Output:
  1262. [('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava'), ('Ethan', 'Mia'), ('Michael', 'Emily')]
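
Here is a self-contained sketch of the same idea that can be pasted and run as is. It assumes only the first two sample rows, stores them in a variable called html_rows (a made-up name, chosen to avoid shadowing the built-in str), and uses "\s*" so that the amount of whitespace between cells does not matter. print() is used so it runs on Python 3.

```python
import re

# Two of the sample rows, stored in a hypothetical variable.
html_rows = (
    '<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>\n'
    '<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>\n'
)

# Skip the numeric index cell, then capture the next two cells.
pattern = re.compile(r'<td>\d+</td>\s*<td>(\w+)</td>\s*<td>(\w+)</td>')
names = pattern.findall(html_rows)
print(names)
```

A regular expression like this works for tidy, predictable markup such as the sample; for messy real-world pages an HTML parser is usually the safer choice.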
  1263.  
  1264.  
  1265.  
You can read an HTML file using the urllib2 library (see the code below).
  1267.  
  1268.  
  1269. Code
  1270.  
  1271. import urllib2
  1272. response = urllib2.urlopen('')
  1273. html = response.read()