XPath Fundamentals

Last week we went through a light introduction to Python. This week we are going to take a look at XPaths. We need to learn XPaths if we are to build a web crawler that can scrape data from websites.

XPaths are fairly straightforward and easy to learn. Follow these steps:

1. Download the Chrome browser and open it.
2. The default page should be Google, so right-click on the "Google" logo.
3. Select Inspect. The Developer Tools panel should pop up with the logo's element highlighted.


4. Right-click on the highlighted element.
5. Select Copy >> Copy XPath.
6. Paste the path into any text editor. It should look like this:
//*[@id="hplogo"]

There are two types of XPaths: absolute and relative. The XPath shown above is relative. The absolute XPath of the Google logo is:
/html[1]/body[1]/div[1]/div[8]/span[1]/center[1]/div[1]/img[1]
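To make the difference concrete, here is a minimal Python sketch that finds the same element both ways. The page below is a made-up, heavily simplified stand-in (not Google's real markup), and it uses the standard library's xml.etree.ElementTree, which only supports a subset of XPath and always searches from the root element, so the paths are written starting with "." instead of "/html[1]".

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified page; not Google's real markup.
PAGE = """
<html>
  <body>
    <div>
      <span>
        <img id="hplogo" alt="Google"/>
      </span>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# Absolute style: spell out every step from the root down to the image.
by_position = root.find("./body[1]/div[1]/span[1]/img[1]")

# Relative style: search the whole tree for any element with this id.
by_id = root.find(".//*[@id='hplogo']")

# Both expressions should land on the very same element.
print(by_position is by_id)
print(by_id.tag, by_id.get("alt"))
```

Both lookups return the same img element; the only difference is how much of the page layout each expression has to know about.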

Page structure is subject to constant change, and an absolute XPath breaks as soon as any element along its route is moved, so it is always good to use a relative XPath whenever possible. Breaking down the syntax for the relative XPath:

//* selects ALL elements in the DOM.
[@id="hplogo"] keeps only those that have an attribute named id that is set to hplogo.

When selecting elements in a DOM it is good practice to refer to their ids when possible. Ids are meant to be unique within a page and normally do not change.
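Here is a small sketch of why the id-based form tends to survive page changes. The two pages are hypothetical: the second one simply wraps the logo in one extra div, the kind of thing a redesign might do. The helper name found is just for illustration, and the paths start with "." because Python's built-in xml.etree.ElementTree searches from the root element and supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page. The second wraps the logo
# in one extra <div>, as a site redesign might.
BEFORE = "<html><body><div><img id='hplogo'/></div></body></html>"
AFTER = "<html><body><div><div><img id='hplogo'/></div></div></body></html>"

POSITIONAL = "./body[1]/div[1]/img[1]"  # depends on the exact nesting
BY_ID = ".//*[@id='hplogo']"            # depends only on the id attribute


def found(page: str, xpath: str) -> bool:
    """Return True if the XPath matches at least one element in the page."""
    return ET.fromstring(page).find(xpath) is not None


for name, page in (("before", BEFORE), ("after", AFTER)):
    print(name, "| positional:", found(page, POSITIONAL), "| by id:", found(page, BY_ID))
```

In this sketch the positional path stops matching once the extra div appears, while the id-based path keeps working on both versions.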

More information on XPath syntax can be found here:
https://www.w3schools.com/xml/xpath_intro.asp

A good extension for creating, editing and testing XPaths is ChroPath. You can learn all about it from this YouTube video:
https://www.youtube.com/watch?v=ikC4gLt0u8M

Next time, we will take what we've learned in last week's and this week's posts and combine it to build ourselves a Python web scraper.
