Hello World! My name is Francisco, fcoterroba on the Internet, and today I bring you a post in which we are going to look, in theory and in practice, at a topic that is trending right now: web scraping.
We are going to need a little bit of Python, so I recommend that you first go through some of my previous posts: how to build a calculator in the terminal, and a project to generate QR codes.
We also built a currency converter with a GUI using free, public APIs.
Finally, I recommend the penultimate post that I have uploaded to date, in which we made a link shortener with a graphical interface that lets you open and copy the generated links automatically.
And, if you feel motivated and strong, continue with the most complicated post so far: programming a fully functional Twitter bot!
Before starting my whole tirade, I wanted to wish you a happy New Year. I hope with all my heart that you were able to spend the holidays surrounded by your loved ones.
I also wish you the best for 2021!
I am also going to announce a couple of small changes to the website.
- I will stop uploading posts weekly and start uploading them when I can.
- The recommendations section will remain as it is, with 30 songs and 30 videos, BUT it will be updated monthly, on the first Monday of every month.
I am doing this because, with my studies, I cannot dedicate as much time as I would like to the page. That is why I prefer to leave it in the background for the moment, finish my studies and then return to it. Of course, you can keep up with everything I do on my social networks. You have the links at the end of the entry.
Before starting, and although I will explain things as we go, I recommend visiting a post that I uploaded more than a month ago, in which I explain many of the computer terms we use most in our day to day, since in this post you will see words that may not sound familiar. 🤯 You can read the post here.
I also want to remind you that a few months ago I uploaded a very interesting video to my YouTube channel, focused on home automation. Specifically, we connect, install and configure a smart light bulb 💡 that you can recolor, turn off, turn on and much more, simply by using your mobile phone and/or voice assistants such as Google, Alexa, etc. 👇🏻
What is web scraping?
Web scraping consists of automatically browsing a website and extracting information from it. This can be very useful for many things and beneficial for almost any business. Today, I don't think there is a single successful company that doesn't do it or doesn't want to do it.
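To make "extracting information" a bit more concrete, here is a minimal sketch that uses only Python's standard library. The HTML fragment and the "price" class are invented for the example, so treat it as an illustration of the idea rather than as how any real shop marks up its pages.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<html><body>
  <h1>PS5 controller</h1>
  <div class="price">69,99</div>
  <div class="stock">In stock</div>
</body></html>
"""


class PriceExtractor(HTMLParser):
    """Collects the text found inside every <div class="price">."""

    def __init__(self):
        super().__init__()
        self._inside_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "div" and ("class", "price") in attrs:
            self._inside_price = True

    def handle_endtag(self, tag):
        # Simplification: assumes price divs contain no nested divs.
        if tag == "div":
            self._inside_price = False

    def handle_data(self, data):
        if self._inside_price and data.strip():
            self.prices.append(data.strip())


extractor = PriceExtractor()
extractor.feed(SAMPLE_HTML)
print(extractor.prices)
```

A real crawler does the same thing at a larger scale: it downloads the HTML, walks through it and keeps only the pieces it cares about.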
Software programmed for scraping is often called a bot, spider or crawler. Anyone can build a crawler, since there are tools for creating one that do not require programming knowledge. Of course, these tools will never give you all the flexibility that you would have if you developed the crawler in a programming language. Later we will see the technologies and tools most used to create these little bugs.
Perhaps the typical captchas or the typical “I'm not a robot” checks that we normally encounter when filling out an online form have come to mind. Exactly! Crawlers are what those checks protect against, because in addition to collecting information, crawlers can also be used to fill out forms, create fake accounts or perform any action on the network automatically.
It is true that there is a legal and ethical debate about scraping. Below we look at the legal risks involved and how to scrape responsibly.
It is not so simple to say whether scraping is legal or not. It depends a lot on each case and, even then, most of the time it will not be very clear. Therefore, before scraping you always have to accept the risk, unlikely as it is, that the scraped company may file a complaint. If the company reacts at all, it will most likely block the bot or send some kind of warning.
Of course, before starting to scrape, it is more than advisable to check if an API is provided with which to access the data without scraping. For example, Idealista, the well-known home rental and buying website, offers an API so that both sides win. On the one hand, making requests to an API requires less development than scraping and, on the other hand, Idealista avoids all that “unwanted” traffic. Additionally, by opening their data to the public, they give the opportunity for someone to develop something powerful for them.
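To illustrate why an API usually means less work, here is a small sketch. The JSON payload is completely hypothetical (it is not Idealista's real response format, which I am not reproducing here); the point is only that structured data can be read in a couple of lines, with no HTML to dig through.

```python
import json

# Hypothetical API response, invented for this example.
SAMPLE_RESPONSE = """
{"listings": [
    {"id": 1, "city": "Madrid", "price_eur": 950},
    {"id": 2, "city": "Sevilla", "price_eur": 700}
]}
"""

# With an API the data already comes structured: no tags, no CSS classes.
data = json.loads(SAMPLE_RESPONSE)
prices_by_city = {item["city"]: item["price_eur"] for item in data["listings"]}
print(prices_by_city)
```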
If you are out of luck and they do not provide an API, you will have to read the terms and conditions to verify that they say nothing against the automatic extraction of information. In any case, before starting to scrape, you will have to consider whether the number of requests needed is excessive and whether the purpose of the project harms the business of the page. The scraped company is more likely to cause problems if you are profiting from its content, since you can be sued for violating intellectual property rights. This is especially true if the content you want to extract is behind a user login since, in that case, it is almost certainly not publicly available information.
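Besides the terms and conditions, many sites publish a robots.txt file that tells bots which paths they should stay away from. It is not mentioned above and it is not a substitute for reading the terms, but checking it is a cheap extra step. The sketch below uses Python's standard urllib.robotparser with an invented robots.txt and an invented bot name ("my-price-bot"), so it runs without touching any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 5
"""

parser = RobotFileParser()
# parse() takes the file's lines; against a real site you would use
# set_url(".../robots.txt") followed by read() instead.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

BOT_NAME = "my-price-bot"  # hypothetical user agent name
allowed_product = parser.can_fetch(BOT_NAME, "https://example.com/product/123")
allowed_checkout = parser.can_fetch(BOT_NAME, "https://example.com/checkout/cart")
delay = parser.crawl_delay(BOT_NAME)

print(f"Product page allowed: {allowed_product}")
print(f"Checkout page allowed: {allowed_checkout}")
print(f"Seconds to wait between requests: {delay}")
```

If a crawl delay is declared, waiting at least that long between requests (for example with time.sleep) is a simple way to keep the number of requests reasonable.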
In conclusion, in order to avoid scares, it is advisable to contact the company to tell them what the project consists of or even reach an agreement with them. Furthermore, it is always a good idea to seek advice from an expert lawyer.
Project: Web scraping Amazon, MediaMarkt & PcComponentes
Let's say that I want to buy a PS5 game and a controller. But, since I'm not pressed for time, I want to wait and see when they get cheaper at MediaMarkt.
Alright. So the exercise will cover:
- MediaMarkt:
- Game and Controller
We open our favorite code editor and start a project in Python! You can structure the project however you like, but I start it like this:

def MediaMarkt():
    print("Hola MediaMarkt")

if __name__ == '__main__':
    MediaMarkt()
There are many Python libraries for web scraping, such as Selenium, Scrapy, etc. My favorite, though, is BeautifulSoup, because it is simple to use and still very capable.
So yes, as you can imagine, you have to install that package. To do so, run the following in the terminal:
pip3 install beautifulsoup4
Next, we import the package on the first line, before any functions:
from bs4 import BeautifulSoup
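A quick way to check that the install worked, and to see what find() gives back, is to parse a tiny piece of HTML written by hand. The fragment below is invented for the example; only the BeautifulSoup calls themselves are real.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this example.
SAMPLE_HTML = '<div class="product"><div class="big">69,99</div></div>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag...
price_tag = soup.find("div", {"class": "big"})
# ...or None when nothing matches.
missing_tag = soup.find("div", {"class": "does-not-exist"})

print(price_tag.get_text(strip=True))
print(missing_tag)
```

That None is worth remembering: if a site changes its layout, the lookup stops matching and any code that uses the result without checking it will fail.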
And let's start scraping!
To make the output more informative, the first thing we are going to do is print a header with the current date and time. Python's standard library includes a module for that, datetime:

from bs4 import BeautifulSoup
from datetime import datetime

now = datetime.now()

def MediaMarkt():
    print(f"Precios MediaMarkt -- {now.date()} {now.time()}")

if __name__ == '__main__':
    MediaMarkt()
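The header above prints now.time() as is, which includes microseconds. If you prefer a cleaner timestamp, one option is strftime. This is only a sketch: build_header is a helper name I made up, and the fixed date is there just to make the example reproducible.

```python
from datetime import datetime


def build_header(shop: str, moment: datetime) -> str:
    """Return a header line such as 'Precios MediaMarkt -- 2021-01-04 18:30:00'."""
    timestamp = moment.strftime("%Y-%m-%d %H:%M:%S")
    return f"Precios {shop} -- {timestamp}"


# Fixed example date; in the real script you would pass datetime.now().
example_moment = datetime(2021, 1, 4, 18, 30, 0)
header = build_header("MediaMarkt", example_moment)
print(header)
```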
Next we add the following code, which downloads each product page and looks up the element that holds the price by its class:

from bs4 import BeautifulSoup
import urllib.request
from datetime import datetime

now = datetime.now()

def print_price(label, url):
    # Download the product page and parse it with Python's built-in HTML parser.
    with urllib.request.urlopen(url) as page:
        soup = BeautifulSoup(page, 'html.parser')
    # The price is looked up by the 'big' class; find() returns None if nothing matches.
    price_tag = soup.find('div', {'class': 'big'})
    if price_tag is None:
        print(f"No se ha encontrado el precio: {label}")
    else:
        print(f"{label} cuesta {price_tag.get_text(strip=True)}€")

def MediaMarkt():
    print(f"Precios MediaMarkt -- {now.date()} {now.time()}")
    mando = 'https://www.mediamarkt.es/es/product/_mando-sony-ps5-dualsense%E2%84%A2-wireless-controller-blanco-1487502.html'
    juego = 'https://www.mediamarkt.es/es/product/_ps5-marvel-s-spider-man-miles-morales-1487492.html'
    print_price("El mando", mando)
    print_price("El juego", juego)

if __name__ == '__main__':
    MediaMarkt()
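The script prints each price as text. Since the goal was to wait until the products get cheaper, a possible next step is turning that text into a number and comparing it with the price you are willing to pay. This is a sketch under one loud assumption: that the scraped text uses Spanish formatting (a comma for decimals, a dot for thousands). parse_price and is_bargain are names I made up for the example.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(text: str) -> Optional[Decimal]:
    """Convert scraped text such as '69,99' or '1.299,00 €' into a Decimal."""
    cleaned = text.replace("€", "").strip()
    # Assumption: '.' separates thousands and ',' separates decimals.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # The text was not a number (for example, the page layout changed).
        return None


def is_bargain(text: str, target: Decimal) -> bool:
    """Return True when the scraped price is at or below the target price."""
    price = parse_price(text)
    return price is not None and price <= target


target_price = Decimal("60")
print(parse_price("69,99"))
print(is_bargain("69,99", target_price))
print(is_bargain("59,90", target_price))
```

If the site ever shows prices with a dot as the decimal separator, this conversion would give wrong numbers, so check a few real values by hand before trusting it.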
And that would be all for today! I hope you liked doing it as much as I did! I also hope you have a great week and we'll see you here soon! Greetings, and remember to follow me on social networks such as Twitter, Facebook, Instagram and LinkedIn. 🤟🏻
sources: Aukera