Web Scraping & Data Parsing with Python
joemccray - Apr 29th, 2015
###########################################
# Required libs to make the tutorial work #
###########################################
sudo apt-get install -y python-pip python-virtualenv python-lxml libxslt1-dev libxml2 libxml2-dev python-bs4
virtualenv venv
source venv/bin/activate
pip install beautifulsoup4
pip install lxml
pip install cssselect
pip install requests




#######################
# Regular Expressions #
#######################
References:
http://www.pythonforbeginners.com/regex/regular-expressions-in-python
http://www.tutorialspoint.com/python/python_reg_expressions.htm


re.findall - finds every occurrence of whatever you are searching for
---------------------------------------------------------------------
python
import re
s = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', s)
for email in emails:
    # do something with each found email string
    print email

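When the pattern contains capture groups, findall returns tuples of the groups instead of whole matches. A minimal sketch (same sample string as above), splitting each address into a (user, host) pair:

```python
import re

# Parentheses in the pattern make findall return (user, host) tuples.
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
for user, host in pairs:
    print(user + ' -> ' + host)
```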


re.search - finds the first occurrence of a pattern within a string.
--------------------------------------------------------------------
python
import re
s = 'an example word:cat!!'

match = re.search(r'word:www', s)

# If-statement after search() tests if it succeeded
if match:
    print 'found', match.group()
else:
    print 'did not find'    ## Hit enter 2 times




s = 'an example word:cat!!'

match = re.search(r'word:cat', s)

# If-statement after search() tests if it succeeded
if match:
    print 'found', match.group()
else:
    print 'did not find'    ## Hit enter 2 times




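A related function worth knowing is re.match, which only matches at the beginning of the string, while re.search scans the whole string. A short sketch using the same sample string:

```python
import re

s = 'an example word:cat!!'

# re.match anchors at the start of the string, so this returns None...
print(re.match(r'word:cat', s))

# ...while re.search finds the pattern anywhere in the string.
m = re.search(r'word:cat', s)
if m:
    print('found ' + m.group())
```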
re.sub - replaces all occurrences of the pattern in the string
--------------------------------------------------------------
import re
text = "Python for beginner is a very cool website"
result = re.sub("cool", "good", text)
print result



re.compile - compiles a pattern into a reusable regex object
------------------------------------------------------------
vi namecheck.py


#!/usr/bin/python
import re

# Match any character that is NOT a letter, whitespace, or a period
name_check = re.compile(r"[^A-Za-z\s.]")

name = raw_input("Please, enter your name: ")

while name_check.search(name):
    print "Please enter your name correctly!"
    name = raw_input("Please, enter your name: ")




Quick website connect via urllib2 and IP address parse via re.compile
---------------------------------------------------------------------
vi getip.py


#!/usr/bin/env python
import re
import urllib2

def getIP():
    ip_checker_url = "http://checkip.dyndns.org/"
    address_regexp = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
    response = urllib2.urlopen(ip_checker_url).read()
    result = address_regexp.search(response)

    if result:
        return result.group()
    else:
        return None

################
# Web Scraping #
################

Reference:
http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/


vi urlget.py
----------------------------------------------------
#!/usr/bin/env python
from bs4 import BeautifulSoup

import requests

url = raw_input("Enter a website to extract the URL's from: ")

r = requests.get("http://" + url)

data = r.text

soup = BeautifulSoup(data, 'lxml')

for link in soup.find_all('a'):
    print(link.get('href'))

-----------------------------------------------------

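If bs4 is not available, the same href extraction can be sketched with only the standard library's html.parser module (on Python 2 the import is `from HTMLParser import HTMLParser`):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href attribute of every <a> tag we encounter.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

html = '<a href="/one">one</a> <p>x</p> <a href="/two">two</a>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```

BeautifulSoup is still the more robust choice on messy real-world HTML; this is just the dependency-free fallback.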




Reference:
http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python




-----------------------------------------------------

Reference:
http://www.pythonforbeginners.com/beautifulsoup/scraping-websites-with-beautifulsoup



-----------------------------------------------------

Reference:
http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/



-----------------------------------------------------

Reference:
http://www.pythonforbeginners.com/api/python-api-and-json

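Most APIs like the one in the reference above return JSON, which the stdlib json module turns into plain dicts and lists. A minimal sketch with a hypothetical response body (real code would first fetch it, e.g. with requests.get(url).text):

```python
import json

# Hypothetical API response body, hard-coded for illustration.
body = '{"videos": [{"title": "Intro", "views": 1085}, {"title": "Regex", "views": 250}]}'

data = json.loads(body)
for video in data['videos']:
    print('%s: %d views' % (video['title'], video['views']))
```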
-----------------------------------------------------
Reference:
http://www.pythonforbeginners.com/scripts/using-the-youtube-api/



-----------------------------------------------------
Reference:
http://www.pythonforbeginners.com/code-snippets-source-code/imdb-crawler


#############
# XML Intro #
#############
Basics:
http://effbot.org/zone/element-index.htm
http://lxml.de/tutorial.html
http://lxml.de/parsing.html


Reference:
http://www.blog.pythonlibrary.org/2010/11/12/python-parsing-xml-with-minidom/

vi example.xml
-------------------------------
<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications
    with XML.</description>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A former architect battles corporate zombies,
    an evil sorceress, and her own childhood to become queen
    of the world.</description>
  </book>
  <book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
    <description>After the collapse of a nanotechnology
    society in England, the young survivors lay the
    foundation for a new society.</description>
  </book>
  <book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-03-10</publish_date>
    <description>In post-apocalypse England, the mysterious
    agent known only as Oberon helps to create a new life
    for the inhabitants of London. Sequel to Maeve
    Ascendant.</description>
  </book>
  <book id="bk105">
    <author>Corets, Eva</author>
    <title>The Sundered Grail</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-09-10</publish_date>
    <description>The two daughters of Maeve, half-sisters,
    battle one another for control of England. Sequel to
    Oberon's Legacy.</description>
  </book>
  <book id="bk106">
    <author>Randall, Cynthia</author>
    <title>Lover Birds</title>
    <genre>Romance</genre>
    <price>4.95</price>
    <publish_date>2000-09-02</publish_date>
    <description>When Carla meets Paul at an ornithology
    conference, tempers fly as feathers get ruffled.</description>
  </book>
  <book id="bk107">
    <author>Thurman, Paula</author>
    <title>Splish Splash</title>
    <genre>Romance</genre>
    <price>4.95</price>
    <publish_date>2000-11-02</publish_date>
    <description>A deep sea diver finds true love twenty
    thousand leagues beneath the sea.</description>
  </book>
  <book id="bk108">
    <author>Knorr, Stefan</author>
    <title>Creepy Crawlies</title>
    <genre>Horror</genre>
    <price>4.95</price>
    <publish_date>2000-12-06</publish_date>
    <description>An anthology of horror stories about roaches,
    centipedes, scorpions and other insects.</description>
  </book>
  <book id="bk109">
    <author>Kress, Peter</author>
    <title>Paradox Lost</title>
    <genre>Science Fiction</genre>
    <price>6.95</price>
    <publish_date>2000-11-02</publish_date>
    <description>After an inadvertant trip through a Heisenberg
    Uncertainty Device, James Salway discovers the problems
    of being quantum.</description>
  </book>
  <book id="bk110">
    <author>O'Brien, Tim</author>
    <title>Microsoft .NET: The Programming Bible</title>
    <genre>Computer</genre>
    <price>36.95</price>
    <publish_date>2000-12-09</publish_date>
    <description>Microsoft's .NET initiative is explored in
    detail in this deep programmer's reference.</description>
  </book>
  <book id="bk111">
    <author>O'Brien, Tim</author>
    <title>MSXML3: A Comprehensive Guide</title>
    <genre>Computer</genre>
    <price>36.95</price>
    <publish_date>2000-12-01</publish_date>
    <description>The Microsoft MSXML3 parser is covered in
    detail, with attention to XML DOM interfaces, XSLT processing,
    SAX and more.</description>
  </book>
  <book id="bk112">
    <author>Galos, Mike</author>
    <title>Visual Studio 7: A Comprehensive Guide</title>
    <genre>Computer</genre>
    <price>49.95</price>
    <publish_date>2001-04-16</publish_date>
    <description>Microsoft Visual Studio 7 is explored in depth,
    looking at how Visual Basic, Visual C++, C#, and ASP+ are
    integrated into a comprehensive development
    environment.</description>
  </book>
</catalog>

-------------------------------



vi xmlparse.py
-------------------------------
#!/usr/bin/python
import xml.dom.minidom as minidom


def getTitles(xml):
    """
    Print out all titles found in xml
    """
    doc = minidom.parse(xml)
    node = doc.documentElement
    books = doc.getElementsByTagName("book")

    titles = []
    for book in books:
        titleObj = book.getElementsByTagName("title")[0]
        titles.append(titleObj)

    for title in titles:
        nodes = title.childNodes
        for node in nodes:
            if node.nodeType == node.TEXT_NODE:
                print node.data

if __name__ == "__main__":
    document = 'example.xml'
    getTitles(document)



This code is one short function that accepts a single argument: the XML file. We import the minidom module under a shorter name to make it easier to reference, then parse the XML. We use getElementsByTagName to grab the book elements and iterate over the result to pull out their title elements. Because this yields title objects rather than strings, the second nested for loop walks each title's child nodes and extracts the plain text.
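For comparison, the stdlib xml.etree.ElementTree module does the node/text bookkeeping for us: findall walks the <book> children and .text yields the plain string directly, so no childNodes/TEXT_NODE loop is needed. A minimal sketch of the same title extraction:

```python
import xml.etree.ElementTree as ET

def get_titles(source):
    # ET.parse accepts a filename or an open file object.
    root = ET.parse(source).getroot()
    # .find('title').text gives the plain text of each book's title.
    return [book.find('title').text for book in root.findall('book')]

# Usage against the sample file above:
# print(get_titles('example.xml'))
```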


-------------------------------
#####################
# Python and SQLite #
#####################
http://zetcode.com/db/sqlitepythontutorial/

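The basic sqlite3 workflow covered in the tutorial linked above can be sketched in a few lines: connect (here to an in-memory database for illustration), create a table, insert with parameterized queries, then select:

```python
import sqlite3

# In-memory database; use a filename like 'books.db' to persist data.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE books (title TEXT, price REAL)')

# Parameterized inserts (the ? placeholders) avoid SQL injection.
cur.executemany('INSERT INTO books VALUES (?, ?)',
                [('Midnight Rain', 5.95), ('Paradox Lost', 6.95)])
con.commit()

cur.execute('SELECT title FROM books WHERE price > 6')
print(cur.fetchall())
```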

####################
# Python and MySQL #
####################
Reference:
http://zetcode.com/db/mysqlpython/

################################
# Lesson 16: Parsing XML Files #
################################

/---------------------------------------------------/
--------------------PARSING XML FILES----------------
/---------------------------------------------------/


Type the following commands:
---------------------------------------------------------------------------------------------------------

wget https://s3.amazonaws.com/SecureNinja/Python/samplescan.xml

wget https://s3.amazonaws.com/SecureNinja/Python/application.xml

wget https://s3.amazonaws.com/SecureNinja/Python/security.xml

wget https://s3.amazonaws.com/SecureNinja/Python/system.xml

wget https://s3.amazonaws.com/SecureNinja/Python/sc_xml.xml



-------------TASK 1------------
vi readxml1.py

#!/usr/bin/python
from xml.dom import minidom

xmldoc = minidom.parse('sc_xml.xml')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('host')

for node in nodes:
    os = node.getElementsByTagName('os')[0]
    osclasses = os.getElementsByTagName('osclass')
    for osclass in osclasses:
        if osclass.attributes['osfamily'].value == 'Windows' and osclass.attributes['osgen'].value == 'XP':
            try:
                hostname = node.getElementsByTagName('hostnames')[0].getElementsByTagName('hostname')[0].attributes['name'].value
                print '%-8s: %s -> %-8s: %s' % ('Host', hostname, 'OS', os.getElementsByTagName('osmatch')[0].attributes['name'].value)
            except IndexError:
                print '%-8s: %s -> %-8s: %s' % ('Host', 'Unable to find Hostname', 'OS', os.getElementsByTagName('osmatch')[0].attributes['name'].value)




-------------TASK 2------------
vi readxml2.py

#!/usr/bin/python
from xml.dom import minidom

xmldoc = minidom.parse('sc_xml.xml')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('host')

for node in nodes:
    portsNode = node.getElementsByTagName('ports')[0]
    ports = portsNode.getElementsByTagName('port')
    for port in ports:
        if port.attributes['portid'].value == '22' and port.attributes['protocol'].value == 'tcp':
            state = port.getElementsByTagName('state')[0]
            if state.attributes['state'].value == 'open':
                try:
                    hostname = node.getElementsByTagName('hostnames')[0].getElementsByTagName('hostname')[0].attributes['name'].value
                    print '%-8s: %s -> %-8s: %s' % ('Host', hostname, 'Ports', 'open : tcp : 22')
                except IndexError:
                    print '%-8s: %s -> %-8s: %s' % ('Host', 'Unable to find Hostname', 'Ports', 'open : tcp : 22')



-------------TASK 3------------
vi readxml3.py

#!/usr/bin/python
from xml.dom import minidom

xmldoc = minidom.parse('sc_xml.xml')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('host')

for node in nodes:
    portsNode = node.getElementsByTagName('ports')[0]
    ports = portsNode.getElementsByTagName('port')
    flag = 0    # only report the first matching port per host
    for port in ports:
        if flag == 0:
            if port.attributes['protocol'].value == 'tcp' and (port.attributes['portid'].value == '443' or port.attributes['portid'].value == '80'):
                state = port.getElementsByTagName('state')[0]
                if state.attributes['state'].value == 'open':
                    try:
                        hostname = node.getElementsByTagName('hostnames')[0].getElementsByTagName('hostname')[0].attributes['name'].value
                        print '%-8s: %s -> %-8s: %s' % ('Host', hostname, 'Ports', 'open : tcp : ' + port.attributes['portid'].value)
                    except IndexError:
                        print '%-8s: %s -> %-8s: %s' % ('Host', 'Unable to find Hostname', 'Ports', 'open : tcp : ' + port.attributes['portid'].value)
                    flag = 1



-------------TASK 4------------
vi readxml4.py

#!/usr/bin/python
from xml.dom import minidom

xmldoc = minidom.parse('sc_xml.xml')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('host')

for node in nodes:
    flag = 0
    naddress = ''
    addresses = node.getElementsByTagName('address')
    for address in addresses:
        if address.attributes['addrtype'].value == 'ipv4' and address.attributes['addr'].value[0:6] == '10.57.':
            naddress = address.attributes['addr'].value
            flag = 1
    if flag == 1:
        portsNode = node.getElementsByTagName('ports')[0]
        ports = portsNode.getElementsByTagName('port')
        flag = 0
        status = {}    # remembers the open SSH port found for this host
        for port in ports:
            if port.attributes['protocol'].value == 'tcp' and port.attributes['portid'].value[0:2] == '22':
                state = port.getElementsByTagName('state')[0]
                if "open" in state.attributes['state'].value:
                    status[0] = state.attributes['state'].value
                    status[1] = port.attributes['portid'].value
                    flag = 1
                else:
                    flag = 0
            if port.attributes['protocol'].value == 'tcp' and flag == 1:
                if port.attributes['portid'].value == '80' or port.attributes['portid'].value == '443':
                    state = port.getElementsByTagName('state')[0]
                    if state.attributes['state'].value == 'open':
                        flag = 0
                        try:
                            hostname = node.getElementsByTagName('hostnames')[0].getElementsByTagName('hostname')[0].attributes['name'].value
                            print '%-8s: %s -> %-8s: %s -> %-8s: %s' % ('Host', hostname, 'IP', naddress, 'Ports', status[0] + ' : tcp : ' + status[1] + ' and open : tcp : ' + port.attributes['portid'].value)
                        except IndexError:
                            print '%-8s: %s -> %-8s: %s -> %-8s: %s' % ('Host', 'Unable to find Hostname', 'IP', naddress, 'Ports', status[0] + ' : tcp : ' + status[1] + ' and open : tcp : ' + port.attributes['portid'].value)



################################
# Lesson 17: Parsing EVTX Logs #
################################
/---------------------------------------------------/
--------------------PARSING EVTX FILES---------------
/---------------------------------------------------/


Type the following commands:
---------------------------------------------------------------------------------------------------------

wget https://s3.amazonaws.com/SecureNinja/Python/Program-Inventory.evtx

wget https://s3.amazonaws.com/SecureNinja/Python/WIN-M751BADISCT_Application.evtx

wget https://s3.amazonaws.com/SecureNinja/Python/WIN-M751BADISCT_Security.evtx

wget https://s3.amazonaws.com/SecureNinja/Python/WIN-M751BADISCT_System.evtx



-------------TASK 1------------
vi readevtx1.py

import mmap
import re
import contextlib
import sys
import HTMLParser
from xml.dom import minidom

from Evtx.Evtx import FileHeader
from Evtx.Views import evtx_file_xml_view

# Demo of HTML entity unescaping
pars = HTMLParser.HTMLParser()
print pars.unescape('<Data Name="MaxPasswordAge">&amp;12856;"</Data>')

file_name = str(raw_input('Enter EVTX file name without extension : '))
file_name = 'WIN-M751BADISCT_System'    # hard-coded default overrides the prompt

with open(file_name + '.evtx', 'r') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0,
                            access=mmap.ACCESS_READ)) as buf:
        fh = FileHeader(buf, 0x0)
        xml_file = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?><Events>"
        try:
            for xml, record in evtx_file_xml_view(fh):
                xml_file += xml
        except:
            pass

xml_file += "</Events>"
xml_file = re.sub('<NULL>', '<NULL></NULL>', xml_file)
xml_file = re.sub('<local>', '<local></local>', xml_file)
xml_file = re.sub('&amp;', '&amp;', xml_file)
f = open(file_name + '.xml', 'w')
f.write(xml_file)
f.close()

try:
    xmldoc = minidom.parse(file_name + '.xml')
except:
    sys.exit('Invalid file...')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('Event')


event_num = int(raw_input('How many events do you want to show : '))
length = len(nodes) - 1
event_id = 0
if event_num > length:
    sys.exit('You have entered an invalid num...')
while True:
    if event_num > 0 and length > -1:
        try:
            event_id = nodes[length].getElementsByTagName('EventID')[0].childNodes[0].nodeValue
            try:
                print '%-8s: %s - %-8s: %s' % ('Event ID', event_id, 'Event', nodes[length].getElementsByTagName('string')[1].childNodes[0].nodeValue)
            except:
                print '%-8s: %s - %-8s: %s' % ('Event ID', event_id, 'Event', 'Name not found')
            event_num -= 1
            length -= 1
        except:
            length -= 1
    else:
        sys.exit('...Search Complete...')


-------------TASK 2------------
vi readevtx2.py

import mmap
import re
import contextlib
import sys
import HTMLParser
from xml.dom import minidom

from Evtx.Evtx import FileHeader
from Evtx.Views import evtx_file_xml_view

# Demo of HTML entity unescaping
pars = HTMLParser.HTMLParser()
print pars.unescape('<Data Name="MaxPasswordAge">&amp;12856;"</Data>')

file_name = str(raw_input('Enter EVTX file name without extension : '))
file_name = 'WIN-M751BADISCT_System'    # hard-coded default overrides the prompt

with open(file_name + '.evtx', 'r') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0,
                            access=mmap.ACCESS_READ)) as buf:
        fh = FileHeader(buf, 0x0)
        xml_file = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?><Events>"
        try:
            for xml, record in evtx_file_xml_view(fh):
                xml_file += xml
        except:
            pass

xml_file += "</Events>"
xml_file = re.sub('<NULL>', '<NULL></NULL>', xml_file)
xml_file = re.sub('<local>', '<local></local>', xml_file)
xml_file = re.sub('&amp;', '&amp;', xml_file)
f = open(file_name + '.xml', 'w')
f.write(xml_file)
f.close()

try:
    xmldoc = minidom.parse(file_name + '.xml')
except:
    sys.exit('Invalid file...')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('Event')

event = int(raw_input('Enter Event ID : '))
event_id = 0
for node in nodes:
    try:
        event_id = node.getElementsByTagName('EventID')[0].childNodes[0].nodeValue
        if int(event_id) == event:
            try:
                print '%-8s: %s - %-8s: %s' % ('Event ID', event_id, 'Event', node.getElementsByTagName('string')[1].childNodes[0].nodeValue)
            except:
                print '%-8s: %s - %-8s: %s' % ('Event ID', event_id, 'Event', 'Name not found')
    except:
        continue
sys.exit('...Search Complete...')


-------------TASK 3------------
vi readevtx3.py

import mmap
import re
import contextlib
import sys
import HTMLParser
from xml.dom import minidom

from Evtx.Evtx import FileHeader
from Evtx.Views import evtx_file_xml_view

# Demo of HTML entity unescaping
pars = HTMLParser.HTMLParser()
print pars.unescape('<Data Name="MaxPasswordAge">&amp;12856;"</Data>')

file_name = str(raw_input('Enter EVTX file name without extension : '))
file_name = 'WIN-M751BADISCT_System'    # hard-coded default overrides the prompt

with open(file_name + '.evtx', 'r') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0,
                            access=mmap.ACCESS_READ)) as buf:
        fh = FileHeader(buf, 0x0)
        xml_file = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?><Events>"
        try:
            for xml, record in evtx_file_xml_view(fh):
                xml_file += xml
        except:
            pass

xml_file += "</Events>"
xml_file = re.sub('<NULL>', '<NULL></NULL>', xml_file)
xml_file = re.sub('<local>', '<local></local>', xml_file)
xml_file = re.sub('&amp;', '&amp;', xml_file)
f = open(file_name + '.xml', 'w')
f.write(xml_file)
f.close()

try:
    xmldoc = minidom.parse(file_name + '.xml')
except:
    sys.exit('Invalid file...')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('Event')

event = int(raw_input('Enter Event ID : '))
event_id = 0
event_count = 0
for node in nodes:
    try:
        event_id = node.getElementsByTagName('EventID')[0].childNodes[0].nodeValue
        if int(event_id) == event:
            event_count += 1
    except:
        continue
print '%-8s: %s - %-8s: %s' % ('Event ID', event, 'Count', event_count)
sys.exit('...Search Complete...')


-------------TASK 4------------
vi readevtx4.py

import mmap
import re
import contextlib
import sys
import HTMLParser
from xml.dom import minidom
from operator import itemgetter

from Evtx.Evtx import FileHeader
from Evtx.Views import evtx_file_xml_view

# Demo of HTML entity unescaping
pars = HTMLParser.HTMLParser()
print pars.unescape('<Data Name="MaxPasswordAge">&amp;12856;"</Data>')

file_name = str(raw_input('Enter EVTX file name without extension : '))
file_name = 'WIN-M751BADISCT_System'    # hard-coded default overrides the prompt

with open(file_name + '.evtx', 'r') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0,
                            access=mmap.ACCESS_READ)) as buf:
        fh = FileHeader(buf, 0x0)
        xml_file = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?><Events>"
        try:
            for xml, record in evtx_file_xml_view(fh):
                xml_file += xml
        except:
            pass

xml_file += "</Events>"
xml_file = re.sub('<NULL>', '<NULL></NULL>', xml_file)
xml_file = re.sub('<local>', '<local></local>', xml_file)
xml_file = re.sub('&amp;', '&amp;', xml_file)
f = open(file_name + '.xml', 'w')
f.write(xml_file)
f.close()

try:
    xmldoc = minidom.parse(file_name + '.xml')
except:
    sys.exit('Invalid file...')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('Event')

events = []
event_id = 0
count = 0
for node in nodes:
    try:
        event_id = node.getElementsByTagName('EventID')[0].childNodes[0].nodeValue
        try:
            events.append({'event_id': int(event_id), 'event_name': node.getElementsByTagName('string')[1].childNodes[0].nodeValue})
        except:
            events.append({'event_id': int(event_id), 'event_name': 'Name not found...'})
        count += 1
    except:
        continue
events = sorted(events, key=itemgetter('event_id'))
for e in events:
    print e
sys.exit('...Search Complete...')


############################
# XML to MySQL with Python #
############################
http://programmazioneit.altervista.org/Programmazione_script/Python/How-to-insert-XML-data-into-MYSQL-through-PYTHON.php
http://stackoverflow.com/questions/10128921/xml-to-mysql-using-python
http://stackoverflow.com/questions/15784208/storing-data-from-xml-to-database-using-python
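The flow in the links above can be sketched end to end: parse XML, then insert one row per element. This sketch uses the stdlib sqlite3 module as a stand-in for MySQL; with MySQLdb, essentially only the connect() call and the %s parameter style would change. The inline XML snippet reuses the catalog format from the XML Intro section:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Trimmed-down version of the example.xml catalog used earlier.
xml_data = """<catalog>
  <book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
  <book id="bk102"><title>Midnight Rain</title><price>5.95</price></book>
</catalog>"""

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE books (id TEXT, title TEXT, price REAL)')

# One INSERT per <book> element, pulling the id attribute and child text.
for book in ET.fromstring(xml_data).findall('book'):
    cur.execute('INSERT INTO books VALUES (?, ?, ?)',
                (book.get('id'), book.find('title').text,
                 float(book.find('price').text)))
con.commit()

cur.execute('SELECT id, price FROM books ORDER BY id')
print(cur.fetchall())
```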



#################
# Numpy & Scipy #
#################
http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf



#################
# Data Analysis #
#################
Reference:
http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/

Reference (Lesson 1):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/01%20-%20Lesson.ipynb

Reference (Lesson 2):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/02%20-%20Lesson.ipynb

Reference (Lesson 3):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/03%20-%20Lesson.ipynb

Reference (Lesson 4):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/04%20-%20Lesson.ipynb

Reference (Lesson 5):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/05%20-%20Lesson.ipynb

Reference (Lesson 6):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/06%20-%20Lesson.ipynb

Reference (Lesson 7):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/07%20-%20Lesson.ipynb

Reference (Lesson 8):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/08%20-%20Lesson.ipynb

Reference (Lesson 9):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/09%20-%20Lesson.ipynb

Reference (Lesson 10):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/10%20-%20Lesson.ipynb


##################################
# Data Visualization with Python #
##################################
Reference:
http://jakevdp.github.io/mpl_tutorial/


Reference:
http://machinelearningmastery.com/machine-learning-in-python-step-by-step/


sudo apt install -y python-scipy python-numpy python-matplotlib python-matplotlib-data python-pandas python-sklearn python-sklearn-pandas python-sklearn-lib python-scikits-learn


vi libcheck.py

----------------------------------------------------------
#!/usr/bin/env python

# Check the versions of libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
----------------------------------------------------------



python

import pandas
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC



url = "https://s3.amazonaws.com/infosecaddictsfiles/sampleSubmission.csv"
names = ["Id", "Prediction1", "Prediction2", "Prediction3", "Prediction4", "Prediction5", "Prediction6", "Prediction7", "Prediction8", "Prediction9"]
dataset = pandas.read_csv(url, names=names)

  923.  
  924. Summarize the Dataset
  925. ---------------------
  926. We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
  927.  
  928. >>> print(dataset.shape)
  929.  
  930.  
  931.  
  932.  
  933. >>> print(dataset.head(20))
  934.  
  935. You should see the first 20 rows of the data:
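As a quick sketch of what shape and head() return (using hypothetical stand-in data -- the column layout mirrors sampleSubmission.csv, but the values are made up):

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: same column layout, made-up values.
names = ["Id"] + ["Prediction%d" % i for i in range(1, 10)]
rows = [[n] + [0.0] * 9 for n in range(1, 6)]
dataset = pd.DataFrame(rows, columns=names)

print(dataset.shape)    # (number of rows, number of columns)
print(dataset.head(3))  # first 3 rows
```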
  936.  
  937.  
  938.  
  939.  
  940.  
  941. Statistical Summary
  942. -------------------
  943.  
  944. Now we can take a look at a summary of each attribute.
  945.  
  946. This includes the count, mean, the min and max values as well as some percentiles.
  947.  
  948. >>> print(dataset.describe())
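A minimal sketch of what describe() reports, using a small made-up column:

```python
import pandas as pd

df = pd.DataFrame({"Prediction1": [0.1, 0.3, 0.5, 0.9]})
summary = df.describe()
# Rows of the summary: count, mean, std, min, 25%, 50%, 75%, max
print(summary)
```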
  949.  
  950.  
  951.  
  952.  
  953.  
  954. Class Distribution
  955. ------------------
  956. Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count. Note that this assumes your dataset has a label column named 'class' -- the sample submission file above does not, so substitute the name of your own label column.
  957.  
  958. >>> print(dataset.groupby('class').size())
  959.  
  960. On a labeled dataset such as iris, you would see that each class has the same number of instances.
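On a small hypothetical labeled frame (the 'class' column here is just an example), groupby(...).size() counts the rows per class like this:

```python
import pandas as pd

# Hypothetical labeled data: three rows of class "a", two of class "b".
df = pd.DataFrame({"class": ["a", "b", "a", "b", "a"]})
counts = df.groupby("class").size()
print(counts)  # a: 3 rows, b: 2 rows
```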
  961.  
  962.  
  963.  
  964.  
  965. Data Visualization
  966. ------------------
  967.  
  968. We now have a basic idea about the data. We need to extend that with some visualizations.
  969.  
  970. We are going to look at two types of plots:
  971.  
  972. - Univariate plots to better understand each attribute.
  973. - Multivariate plots to better understand the relationships between attributes.
  974.  
  975.  
  976. Univariate Plots
  977.  
  978. We start with some univariate plots, that is, plots of each individual variable.
  979.  
  980. Given that the input variables are numeric, we can create box and whisker plots of each.
  981.  
  982. >>> dataset.plot(kind='box', subplots=True, layout=(2,5), sharex=False, sharey=False)
  983. >>> plt.show()
  984.  
  985. This gives us a much clearer idea of the distribution of the input attributes:
  986.  
  987.  
  988.  
  989. ******************* INSERT DIAGRAM SCREENSHOT *******************
  990.  
  991.  
  992.  
  993. We can also create a histogram of each input variable to get an idea of the distribution.
  994.  
  995.  
  996. >>> dataset.hist()
  997. >>> plt.show()
  998.  
  999. It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.
  1000.  
  1001.  
  1002. ******************* INSERT DIAGRAM SCREENSHOT *******************
  1003.  
  1004.  
  1005.  
  1006.  
  1007. Multivariate Plots
  1008. ------------------
  1009. Now we can look at the interactions between the variables.
  1010.  
  1011. First let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.
  1012.  
  1013.  
  1014. >>> scatter_matrix(dataset)
  1015. >>> plt.show()
  1016.  
  1017. Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.
  1018.  
  1019. ******************* INSERT DIAGRAM SCREENSHOT *******************
  1020.  
  1021.  
  1022.  
  1023.  
  1024. Create a Validation Dataset
  1025. ---------------------------
  1026.  
  1027. We need to know whether the model we created is any good.
  1028.  
  1029. Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.
  1030.  
  1031. That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.
  1032.  
  1033. We will split the loaded dataset into two: 80% of which we will use to train our models and 20% that we will hold back as a validation dataset. In the snippet below the first four columns are used as features (X) and the fifth as the label (Y); adjust these slices to match your own data.
  1034.  
  1035.  
  1036. >>> array = dataset.values
  1037. >>> X = array[:,0:4]
  1038. >>> Y = array[:,4]
  1039. >>> validation_size = 0.20
  1040. >>> seed = 7
  1041. >>> X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
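To see the 80/20 split in action, here is a sketch on synthetic data (50 made-up samples with 2 features), checking the resulting set sizes:

```python
import numpy as np
from sklearn import model_selection

# Synthetic stand-in: 50 samples, 2 features, binary labels.
X = np.arange(100).reshape(50, 2)
Y = np.arange(50) % 2
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=7)
print(len(X_train), len(X_validation))  # 40 10
```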
  1042.  
  1043.  
  1044.  
  1045. Test Harness
  1046. ------------
  1047. We will use 10-fold cross validation to estimate accuracy.
  1048.  
  1049. This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.
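The splitting described above can be sketched directly with the KFold splitter (on 20 synthetic samples, each of the 10 folds holds out 2 samples for testing and trains on the remaining 18):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)  # 20 synthetic samples
kfold = KFold(n_splits=10)
splits = list(kfold.split(X))
print(len(splits))                            # 10 train/test pairs
print(len(splits[0][0]), len(splits[0][1]))   # 18 train, 2 test per fold
```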
  1050.  
  1051.  
  1052. >>> seed = 7
  1053. >>> scoring = 'accuracy'
  1054.  
  1055. We are using the metric of ‘accuracy‘ to evaluate models.
  1056. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate).
  1057. We will be using the scoring variable when we build and evaluate each model next.
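The accuracy ratio can be seen on a tiny made-up example, where 4 of 5 hypothetical predictions are correct:

```python
from sklearn.metrics import accuracy_score

# 4 of 5 hypothetical predictions match the true labels -> 80% accuracy.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
acc = accuracy_score(y_true, y_pred)
print("%.0f%%" % (acc * 100))
```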
  1058.  
  1059.  
  1060.  
  1061.  
  1062. Build Models
  1063. ------------
  1064. We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.
  1065.  
  1066. Let’s evaluate 6 different algorithms:
  1067.  
  1068. - Logistic Regression (LR)
  1069. - Linear Discriminant Analysis (LDA)
  1070. - K-Nearest Neighbors (KNN).
  1071. - Classification and Regression Trees (CART).
  1072. - Gaussian Naive Bayes (NB).
  1073. - Support Vector Machines (SVM).
  1074.  
  1075. This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms. We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable.
  1076.  
  1077. Let’s build and evaluate our six models:
  1078.  
  1079.  
  1080.  
  1081. # Spot Check Algorithms
  1082. models = []
  1083. models.append(('LR', LogisticRegression()))
  1084. models.append(('LDA', LinearDiscriminantAnalysis()))
  1085. models.append(('KNN', KNeighborsClassifier()))
  1086. models.append(('CART', DecisionTreeClassifier()))
  1087. models.append(('NB', GaussianNB()))
  1088. models.append(('SVM', SVC()))
  1089. # evaluate each model in turn
  1090. results = []
  1091. names = []
  1092. for name, model in models:
  1093.     kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
  1094.     cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
  1095.     results.append(cv_results)
  1096.     names.append(name)
  1097.     msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
  1098.     print(msg)
  1099.  
  1100.  
  1101.  
  1102.  
  1103. Select Best Model
  1104. -----------------
  1105. We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
  1106.  
  1107. Running the example above, we get the following raw results:
  1108.  
  1109.  
  1110. ******************* INSERT DIAGRAM SCREENSHOT *******************
  1111.  
  1112.  
  1113. We can see that it looks like KNN has the largest estimated accuracy score.
  1114.  
  1115. We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model.
  1116. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
  1117.  
  1118.  
  1119.  
  1120. # Compare Algorithms
  1121. fig = plt.figure()
  1122. fig.suptitle('Algorithm Comparison')
  1123. ax = fig.add_subplot(111)
  1124. plt.boxplot(results)
  1125. ax.set_xticklabels(names)
  1126. plt.show()
  1127.  
  1128. You can see that the box and whisker plots are squashed at the top of the range, with many samples achieving 100% accuracy.
  1129.  
  1130.  
  1131. Make Predictions
  1132. ----------------