Suppose you are looking for a course to take, and the sheer number of online courses leaves you confused. Instead of clicking on every course and going back and forth to compare them, you can scrape the data, if the site allows it, and make your comparisons more conveniently.
Let us say that you are searching for online data science courses on Reed, at https://www.reed.co.uk/courses/data-science. Before any web scraping, right-click on the page and choose Inspect. If it is an HTML web page, you will see how it is structured and where the data you are interested in sits. From that point on, you can use Python's Beautiful Soup library to extract your data! Let us first extract the course provider names and the course links into one data frame:
# Load the necessary libraries
import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Let us extract the first 9 pages of search results
pages = range(1, 10)

# Create a list called "full" where you save your data
full = []
for page in pages:
    # The f-string puts the current page number into the URL
    url = f'https://www.reed.co.uk/courses/data-science?pageno={page}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # The course list is embedded in the page as a JSON-LD script tag
    course = soup.find('script', type='application/ld+json')
    provider = [el['provider'] for el in json.loads(course.text)['itemListElement']]
    # extend() keeps "full" as one flat list across all pages
    full.extend(provider)

# Create a data frame with that list
data = pd.DataFrame(full)
data.columns = ['type', 'provider_name', 'course_links']
data.head()
Output:
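To see why the data frame ends up with three columns, it can help to look at the JSON-LD text on its own. The sample below is a hand-written stand-in, not real Reed data: its shape (an `itemListElement` list whose items carry a `provider` object with `@type`, `name` and `url`) is an assumption inferred from the code above, and the live page may differ. The helper name `extract_providers` is also made up for this sketch.

```python
import json

# Hypothetical stand-in for the text inside <script type="application/ld+json">.
# The field names are assumed from the scraping code, not copied from Reed.
SAMPLE_JSON_LD = """
{
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "Course",
     "description": "Intro to data science",
     "provider": {"@type": "Organization", "name": "Provider A",
                  "url": "https://example.com/a"}},
    {"@type": "Course",
     "description": "Machine learning basics",
     "provider": {"@type": "Organization", "name": "Provider B",
                  "url": "https://example.com/b"}}
  ]
}
"""


def extract_providers(json_ld_text):
    """Return the 'provider' object of every item in the list."""
    items = json.loads(json_ld_text)["itemListElement"]
    return [item["provider"] for item in items]


providers = extract_providers(SAMPLE_JSON_LD)
for p in providers:
    # Each provider is a dict with three keys, hence three columns.
    print(sorted(p.keys()), p["name"])
```

Under this assumed shape, every provider is a dictionary with three keys, which is where the three column names come from once the list is passed to `pd.DataFrame`.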
Let us also extract the course descriptions:
# json is needed to parse the JSON-LD text
import json

# The first 9 pages again
pages = range(1, 10)

full2 = []
for page in pages:
    # The f-string puts the current page number into the URL
    url = f'https://www.reed.co.uk/courses/data-science?pageno={page}'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    course_description = soup.find('script', type='application/ld+json')
    description = [el['description'] for el in json.loads(course_description.text)['itemListElement']]
    # extend() keeps "full2" as one flat list across all pages
    full2.extend(description)

# Create a data frame with the descriptions
df2 = pd.DataFrame(full2)
df2.columns = ['course_description']
df2.head()
Output:
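Since both snippets read the same JSON-LD block, the provider and the description of each course could also be collected in one pass and saved for later comparison. The following is a minimal sketch using only the standard library. The names `course_rows` and `rows_to_csv` and the sample items are hypothetical, and the field layout is the same assumption the code above makes about the page.

```python
import csv
import io

# Hypothetical items shaped like the entries of 'itemListElement'.
sample_items = [
    {"description": "Intro to data science",
     "provider": {"@type": "Organization", "name": "Provider A",
                  "url": "https://example.com/a"}},
    {"description": "Machine learning basics",
     "provider": {"@type": "Organization", "name": "Provider B",
                  "url": "https://example.com/b"}},
]

FIELDS = ["provider_name", "course_links", "course_description"]


def course_rows(items):
    """Flatten each item into one row with provider and description."""
    return [
        {
            "provider_name": item["provider"]["name"],
            "course_links": item["provider"]["url"],
            "course_description": item["description"],
        }
        for item in items
    ]


def rows_to_csv(rows):
    """Serialise the rows to CSV text (write it to a file to keep it)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = course_rows(sample_items)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The same `rows` list can also be handed to `pd.DataFrame(rows)` if you prefer to keep everything in one data frame.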
We can do the same thing for job advertising websites, so long as it is allowed :), which can make a job search easier and help you understand what most employers want. Overall, it is a skill that you will not regret having!
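One rough way to check whether scraping is allowed is to read the site's robots.txt, alongside its terms of service. The sketch below uses Python's built-in `urllib.robotparser` on a made-up robots.txt so that it runs offline. The rules, the `example.com` URLs and the helper name `is_allowed` are assumptions for illustration, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Allow: /courses/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/courses/data-science"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/account/settings"))
```

For a live site, the parser's `set_url()` and `read()` methods can fetch the real file instead of parsing a string. A permissive robots.txt is only one signal, so the site's terms still apply.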
Cheers!