https://faroukibrahim-fii.github.io/reading-notes/
Web scraping is a technique to automatically access and extract large amounts of information from a website, which can save a huge amount of time and effort.
Turnstile data is compiled every week from May 2010 to present, so hundreds of .txt files exist on the site.
The first thing that we need to do is to figure out where we can locate the links to the files we want to download inside the multiple levels of HTML tags.
We start by importing the following libraries.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
Next, we set the url to the website and access the site with our requests library.
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
If the access was successful, printing the response should show status code 200 (for example, <Response [200]>).
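The notes jump from the request straight to the variable link, so here is a minimal sketch of the missing parsing step. It assumes the data files can be identified by filtering the page's anchor tags for .txt hrefs, rather than by the exact tag index on the live page:

```python
# Parse the page and collect every link that points at a turnstile .txt file.
soup = BeautifulSoup(response.text, 'html.parser')
txt_links = [a['href'] for a in soup.find_all('a', href=True) if a['href'].endswith('.txt')]

# Take the first data file; this is the href referred to below as `link`.
link = txt_links[0]
```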
This code saves the first text file, ‘data/nyct/turnstile/turnstile_180922.txt’ to our variable link. The full url to download the data is actually ‘http://web.mta.info/developers/data/nyct/turnstile/turnstile_180922.txt’ which I discovered by clicking on the first data file on the website as a test.
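With link in hand, a hedged sketch of the download step follows; the base URL prefix comes from the full URL described above, and the local filename is an assumption:

```python
# The href is site-relative, so prepend the base URL described above.
download_url = 'http://web.mta.info/developers/' + link

# Save the file locally under its own name, e.g. ./turnstile_180922.txt.
urllib.request.urlretrieve(download_url, './' + link.split('/')[-1])
```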
Last but not least, we should include this line of code to pause for a second between requests so that we are not spamming the website.
time.sleep(1)
The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet.
A simple yet powerful approach to extracting information from web pages is text pattern matching, for example with the UNIX grep command or the regular-expression facilities of languages such as Python or Perl.
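As a rough illustration of the pattern-matching approach (the URL is the turnstile page from above, and the regex is an assumption about how the links appear in the raw HTML):

```python
import re
import requests

# Fetch the raw HTML and pull out every href ending in .txt with a regular expression.
html = requests.get('http://web.mta.info/developers/turnstile.html').text
txt_links = re.findall(r'href="([^"]+\.txt)"', html)
print(txt_links[:5])
```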
Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
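A minimal sketch of the socket-programming variant, issuing a raw HTTP GET by hand (example.com is just a placeholder host):

```python
import socket

# Send a hand-written HTTP request over a plain socket and read back the raw response.
host = 'example.com'
request = 'GET / HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n'.format(host)

with socket.create_connection((host, 80)) as sock:
    sock.sendall(request.encode())
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

print(b''.join(chunks)[:200])  # status line and the first few headers
```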
Many websites have large collections of pages generated dynamically from an underlying structured source like a database. Data of the same category are typically encoded into similar pages by a common script or template.
By embedding a full-fledged web browser, such as Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts.
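In current practice this is usually done with a browser-automation library such as Selenium; a minimal sketch, assuming a Chrome driver is installed locally:

```python
from selenium import webdriver

# Drive a real browser so JavaScript-generated content is rendered before we read the HTML.
driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get('http://web.mta.info/developers/turnstile.html')

html = driver.page_source  # fully rendered page, including client-side content
driver.quit()
```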
There are several companies that have developed vertical specific harvesting platforms. These platforms create and monitor a multitude of “bots” for specific verticals with no “man in the loop” (no direct human involvement), and no work related to a specific target site.
The pages being scraped may embrace metadata or semantic markups and annotations, which can be used to locate specific data snippets.
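For instance, if a page embeds machine-readable metadata such as JSON-LD, it can be located directly. A sketch, assuming the page actually carries such a block:

```python
import json
from bs4 import BeautifulSoup

# Look for a JSON-LD <script> block and parse it; many pages expose structured data this way.
soup = BeautifulSoup(html, 'html.parser')  # `html` fetched earlier
tag = soup.find('script', type='application/ld+json')
if tag:
    metadata = json.loads(tag.string)
    print(metadata)
```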
There are many software tools available that can be used to customize web-scraping solutions. This software may attempt to automatically recognize the data structure of a page or provide a recording interface that removes the necessity to manually write web-scraping code.
Web spiders should ideally follow the robots.txt file for a website while scraping. It has specific rules for good behavior, such as how frequently you can scrape and which pages may or may not be scraped. Some websites allow only Google to scrape them and block every other crawler.
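Python's standard library can check robots.txt before fetching a page; a small sketch, assuming the file lives at the root of the domain used earlier:

```python
from urllib import robotparser

# Read the site's robots.txt and ask whether a generic crawler may fetch a given page.
rp = robotparser.RobotFileParser()
rp.set_url('http://web.mta.info/robots.txt')
rp.read()

print(rp.can_fetch('*', 'http://web.mta.info/developers/turnstile.html'))
```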
Web scraping bots fetch data very fast, but it is easy for a site to detect your scraper, as humans cannot browse that fast. The faster you crawl, the worse it is for everyone.
Humans browsing a site act somewhat randomly and rarely repeat the same task over and over, whereas scraping bots tend to follow the same crawling pattern because that is how they are programmed, unless told otherwise. One common mitigation, shown below, is to randomize the delay between requests.
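A sketch of randomized pauses so the crawl rate and rhythm look less mechanical (the 2-6 second range is an arbitrary assumption):

```python
import random
import time

urls_to_fetch = ['http://web.mta.info/developers/turnstile.html']  # placeholder list of pages

for url in urls_to_fetch:
    # ... fetch and process the page here ...
    # Pause for a random interval so requests are not perfectly regular.
    time.sleep(random.uniform(2, 6))
```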
When scraping, your IP address is visible to the site, so it can tell that a single address is making many requests and collecting data.
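Requests can be routed through a proxy so they do not all originate from one address; a sketch with a placeholder proxy (a real scraper would rotate through a pool):

```python
import requests

# Send the request through a proxy; the address below is a documentation-range placeholder.
proxies = {
    'http': 'http://203.0.113.10:8080',
    'https': 'http://203.0.113.10:8080',
}
response = requests.get('http://web.mta.info/developers/turnstile.html',
                        proxies=proxies, timeout=10)
```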
A user agent is a string that tells the server which web browser is being used. If the user agent is not set, many websites won't let you view their content. Every request made from a web browser contains a User-Agent header, and sending the same user agent with every request makes a bot easy to detect. You can find your own user agent by typing 'what is my user agent' into Google's search bar.
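Setting the header with requests looks like this; the string below is just an example browser user agent, not a specific recommendation:

```python
import requests

# Send a browser-like User-Agent instead of the default python-requests one.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0 Safari/537.36'
}
response = requests.get('http://web.mta.info/developers/turnstile.html', headers=headers)
```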
If none of the methods above work, the website is probably checking whether you are a real browser.
Honeypots are systems set up to lure hackers and detect any hacking attempts that try to gain information. It is usually an application that imitates the behavior of a real system.
Some websites make things tricky for scrapers by serving slightly different layouts. For example, pages 1-20 of a listing might use one layout while the remaining pages use another.
Logging in is essentially getting permission to access certain web pages, and some websites, such as Indeed, do not grant access without it.
If a page is protected by login, the scraper would have to send some information or cookies along with each request to view the page.
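With requests this usually means using a session, which carries the cookies from the login across later requests; a sketch with a hypothetical login URL and form fields:

```python
import requests

# Log in once with a session so its cookies are sent automatically with later requests.
# The URL and form field names below are hypothetical placeholders.
session = requests.Session()
session.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})

# Pages behind the login can now be fetched with the same session.
page = session.get('https://example.com/protected-page')
```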
Many websites use anti-scraping measures, and if you scrape a website at a large scale it will eventually block you.
How do websites detect web scraping?
Websites can use different mechanisms to distinguish a scraper or spider from a normal user. Some of these methods are enumerated below:
How can you tell that a website has blocked your scraper?
If any of the following appear on the site you are crawling, it usually means you have been blocked or banned.