Tutorial:
Setting up
Download the .zip file for your operating system (OS), then unzip it. Do not move any files out of their folders. Keep the application inside its folder so that it can access ChromeDriver correctly. Also, do not rename any folders stored next to the application.
To start using the software, simply open the MonkeyScrape application.
Finding XPATH
This section explains how to obtain the XPATH of an element.
An XPATH directs the scraper towards the element that you want to deal with. It's like a set of directions to navigate the page.
1. Open a test window
2. Navigate to the page you desire
3. Next, open developer tools (shortcut: Ctrl+Shift+I)
4. At the top left of the developer tools window, there is an icon showing a cursor and a box (it is to the left of the devices icon). Click that icon (or use the shortcut: Ctrl+Shift+C)
5. Select the item you desire on the page
6. The element should now be highlighted in the developer tools window
7. Right click the highlighted element in the developer tools window, go to Copy, and select "Copy full XPATH".
NOTE: Ensure that you select "Copy full XPATH" and NOT "Copy XPATH".
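To make the idea of "directions" more concrete, here is a minimal sketch in Python that follows a full XPATH step by step through a tiny, made-up page. It only uses the standard library's xml.etree.ElementTree, which understands a small subset of XPath. The page markup and the helper name find_by_full_xpath are assumptions for illustration; they are not part of MonkeyScrape.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up page (hypothetical markup, for illustration only).
PAGE = """
<html>
  <body>
    <div><p>Welcome</p></div>
    <div>
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </div>
  </body>
</html>
""".strip()


def find_by_full_xpath(root, full_xpath):
    """Follow a full XPATH such as /html/body/div[2]/p[1] from the root element.

    ElementTree searches relative to the element it is given, so the leading
    "/html" step (the root itself) is replaced with ".".
    """
    prefix = "/" + root.tag
    if full_xpath != prefix and not full_xpath.startswith(prefix + "/"):
        raise ValueError("XPATH must start at the root element: " + prefix)
    relative = "." + full_xpath[len(prefix):]
    return root.find(relative)


root = ET.fromstring(PAGE)
# Directions: html -> body -> second div -> first p
element = find_by_full_xpath(root, "/html/body/div[2]/p[1]")
print(element.text)  # First paragraph
```

Each step of the XPATH names a tag, and a number in square brackets picks which one of several identical tags to follow.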
Choose URL
This command lets you enter a URL. When the script reaches this command, the chromedriver window navigates to that address. Enter either the entire address (beginning with "http") or an address beginning with "www". If the address begins with "www", "https" is used instead of "http" because it is more secure.
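The rule for addresses can be sketched as a small Python function. This is only an illustration of the behaviour described above, not MonkeyScrape's actual code; the function name normalise_url is made up, and raising an error for any other kind of input is an assumption.

```python
def normalise_url(address):
    """Return the address with a scheme, defaulting to https for "www" input."""
    address = address.strip()
    if address.startswith(("http://", "https://")):
        return address                  # a full address is used unchanged
    if address.startswith("www"):
        return "https://" + address     # "www" addresses get https
    # Assumption: anything else is rejected. The real tool may behave differently.
    raise ValueError("Address should begin with http or www: " + address)


print(normalise_url("www.example.com"))     # https://www.example.com
print(normalise_url("http://example.com"))  # http://example.com
```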
Open Link
This command lets you paste the XPATH of an element that has a link attached to it. When the script reaches this command, the program redirects the chromedriver window to that link. For example, <a> tags commonly have an attribute called 'href'. This 'href' holds the address that the programmer of the website wants your browser to go to.
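As a rough illustration of what "a link attached to an element" means, the Python sketch below reads the 'href' of an <a> tag from a made-up snippet of a page using the standard library. The snippet and the address in it are hypothetical, and this is not how MonkeyScrape itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet of a page containing one link.
SNIPPET = '<p>Read the <a href="https://example.com/docs">documentation</a>.</p>'

paragraph = ET.fromstring(SNIPPET)
link = paragraph.find("./a")   # locate the <a> element
target = link.get("href")      # the address the link points to
print(target)                  # https://example.com/docs
```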
Get Text
This command lets you enter the XPATH of an element you wish to copy text from. When the script reaches this command, the text from the given element is stored temporarily. It is recommended that you follow this command with Save to a variable or Append to a variable so that you do not lose your text. Each time this command runs, it overwrites the text that was stored temporarily.
Click button
This command lets you enter the XPATH of the element you wish to click. It is not limited to buttons: it clicks whichever element the XPATH points to when the script reaches this command.
Delete element
This command lets you enter the XPATH of an element you wish to delete. If the element has child elements nested inside it, they are deleted along with it. This command can be quite helpful when scraping a list of items, for example.
Let's say you need to scrape the following data shown below.
Product | Price | Stock |
---|---|---|
Towel | $5.00 | 100 |
Blanket | $10.00 | 200 |
Plates | $15.00 | 150 |
... | ... | ... |
Let's say that this is a small excerpt from a website with thousands more products. It would take a long time to get the XPATH for each item. Therefore, we can use a cool trick to save us a lot of time.
We want to get all of the products from this website. So what we do is get the XPATH of the first item in the list or table. The XPATH will look something like this: "/html/body/main/div/div[6]/table/tbody/tr[2]/td[1]". This XPATH points to the cell containing "Towel" (the first cell of the first data row below). We can copy this text and save it to a variable.
Product | Price | Stock |
---|---|---|
Towel | $5.00 | 100 |
Blanket | $10.00 | 200 |
Plates | $15.00 | 150 |
... | ... | ... |
Now, we read the XPATH from right to left and look for square brackets. Within the given XPATH, we have "td[1]", "tr[2]", and "div[6]". A quick Google search (for example "td html tag" or "div html tag") would show us that we want to focus on "tr", since it stands for table row. In other words, "tr[2]" is the row we wish to remove. By removing it, the second row moves up to fill the gap left by the first row. If we delete the first row, the outcome is shown below.
Product | Price | Stock |
---|---|---|
Blanket | $10.00 | 200 |
Plates | $15.00 | 150 |
... | ... | ... |
As you can see, the XPATH now points to "Blanket", because that row has moved to the top of the table. This is because an XPATH always points to a given position on the page, not to particular content. Therefore, instead of changing the XPATH, we can adapt the website's layout to fit our needs. Don't worry, this is all done locally and it doesn't affect other users' experience.
By using a loop, we can repeat these commands as many times as we need. This can save us a lot of time when it comes to large chunks of data. This trick often works because programmers (especially in large companies) need some kind of organisation in their websites. There's often a pattern you can figure out by looking at their code. This example can be repeated with e-commerce websites and even video streaming platforms.
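The trick can be tried out away from the browser with a short Python sketch. It uses the standard library's xml.etree.ElementTree on a made-up copy of the example table, so the markup and variable names are assumptions; MonkeyScrape performs the same steps in the chromedriver window instead.

```python
import xml.etree.ElementTree as ET

# Made-up markup for the example table; the header is tr[1], "Towel" is tr[2].
TABLE = """
<table>
  <tbody>
    <tr><td>Product</td><td>Price</td><td>Stock</td></tr>
    <tr><td>Towel</td><td>$5.00</td><td>100</td></tr>
    <tr><td>Blanket</td><td>$10.00</td><td>200</td></tr>
    <tr><td>Plates</td><td>$15.00</td><td>150</td></tr>
  </tbody>
</table>
""".strip()

table = ET.fromstring(TABLE)
tbody = table.find("./tbody")

products = []
for _ in range(3):                              # Start Loop (3 repetitions)
    row = tbody.find("./tr[2]")                 # always the same position
    products.append(row.find("./td[1]").text)   # Get Text, then Append to a variable
    tbody.remove(row)                           # Delete element: the next row moves up
                                                # End Loop

print(products)  # ['Towel', 'Blanket', 'Plates']
```

The XPATH never changes inside the loop. Only the table changes, so the same position yields a new product on every repetition.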
Wait time
This command will allow the chromedriver window to wait for a given amount of time. This can be useful if you need certain portions of the website to load in, or if you're waiting for a certain thing to occur. On execution of this command, any automated interaction will be halted until the given time has elapsed.
Go to previous page
This command returns the chromedriver window to the previous page. It needs no further input.
Write text to input box
This command asks for three things: the XPATH of an input box, the text you want to enter, and an optional code for pressing Enter or Tab afterwards. You can leave the code empty (so that no key is pressed after the text is entered), or you can enter one of the codes shown next to each key (uE007 for Enter and uE004 for Tab). If you enter a code, the matching key is pressed after the text has been typed.
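The codes look like Unicode code points: browser automation tools based on WebDriver use the special characters U+E007 and U+E004 to stand for the Enter and Tab keys. The Python sketch below shows how such a code could be turned into the character that gets sent after the text. How MonkeyScrape handles the code internally is an assumption, and the function names are made up.

```python
def key_from_code(code):
    """Turn a code such as "uE007" into the single character it stands for."""
    if not code:
        return ""                        # empty code: no key is pressed
    if len(code) != 5 or code[0].lower() != "u":
        raise ValueError("Expected a code like uE007, got: " + code)
    return chr(int(code[1:], 16))        # "E007" -> the character U+E007


def text_to_send(text, code=""):
    """Return the text followed by the optional key press, as one string."""
    return text + key_from_code(code)


print(repr(text_to_send("towels", "uE007")))  # 'towels\ue007'
print(repr(text_to_send("towels")))           # 'towels'
```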
Save to a variable
This command lets you save text to a given variable. If the variable doesn't exist, it is created and the text is saved to it. If the variable already exists, its contents are overwritten. In other words, saving to a variable replaces anything already stored in it.
Append to a variable
This command will allow you to either add to an existing variable, or create a variable and add to it. This can store multiple text items without overwriting. For example, you can create a price variable and append different prices to it using this command.
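The difference between saving and appending can be modelled in a few lines of Python. This is a sketch of the behaviour described above, not MonkeyScrape's own code: the class ScriptState and its method names are hypothetical, and storing every variable as a list of text items is an assumption.

```python
class ScriptState:
    """Toy model of the temporary text and the named variables."""

    def __init__(self):
        self.last_text = None   # text held temporarily by Get Text
        self.variables = {}     # variable name -> list of text items

    def get_text(self, text):
        self.last_text = text   # overwrites the previous temporary text

    def save(self, name):
        self.variables[name] = [self.last_text]   # replaces old contents

    def append(self, name):
        # Creates the variable if needed, then adds to the end.
        self.variables.setdefault(name, []).append(self.last_text)


state = ScriptState()
state.get_text("$5.00")
state.append("price")
state.get_text("$10.00")
state.append("price")       # both prices are kept
state.get_text("Towel")
state.save("product")
state.get_text("Blanket")
state.save("product")       # "Towel" is overwritten

print(state.variables)  # {'price': ['$5.00', '$10.00'], 'product': ['Blanket']}
```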
Looping
Looping is achieved by placing commands between the Start Loop and End Loop commands. The Start Loop command lets you select the number of times the enclosed commands are repeated.
NOTE: DO NOT USE A HIGH NUMBER OF LOOPS AS DAMAGE TO A COMPUTER IS POSSIBLE. If this is unavoidable, try splitting the loop into smaller loops.
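Splitting a large loop into smaller ones is simple arithmetic, sketched below in Python. The function name split_loop and the batch size of 300 are only examples; this document does not state what counts as a "high" number of loops.

```python
def split_loop(total, batch_size):
    """Split `total` repetitions into loops of at most `batch_size` each."""
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    full, rest = divmod(total, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


# 1000 repetitions as four smaller loops.
print(split_loop(1000, 300))  # [300, 300, 300, 100]
```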
Output to .txt
This command asks which variables to output to a text file. A new text file is created and stored within the "unsaved files" folder. It is given a generic name, which you can change afterwards.
Output to .json
This command asks which variables to output to a JSON file. A new JSON file is created and stored within the "unsaved files" folder. It is given a generic name, which you can change afterwards.
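To give a feel for the two output formats, here is a minimal Python sketch that writes the same variables to a text file and to a JSON file. The file names, the layout of the text file, and the function names are all assumptions made for illustration; the files MonkeyScrape produces may look different. The example writes to a temporary folder rather than to "unsaved files".

```python
import json
import tempfile
from pathlib import Path


def output_txt(variables, folder, filename="output.txt"):
    """Write one line per variable, e.g. "price: $5.00, $10.00"."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / filename
    lines = [f"{name}: {', '.join(values)}" for name, values in variables.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def output_json(variables, folder, filename="output.json"):
    """Write the variables as a JSON object of name -> list of text items."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / filename
    path.write_text(json.dumps(variables, indent=2), encoding="utf-8")
    return path


variables = {"product": ["Towel", "Blanket"], "price": ["$5.00", "$10.00"]}

with tempfile.TemporaryDirectory() as folder:
    txt_path = output_txt(variables, folder)
    json_path = output_json(variables, folder)
    print(txt_path.read_text(encoding="utf-8"))
    print(json_path.read_text(encoding="utf-8"))
```

A text file is easiest to read by eye, while a JSON file is easier to load into other programs.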