Why do web crawlers need crawler IPs?

In today's data-saturated era, public data is used ever more widely across industries, and the demand for data collection grows by the day. The market keeps expanding, but the supply of supporting technical talent cannot keep up. As a result, more and more people are entering the web crawler field.

Today we will cover the basics of the crawler IPs used in data crawling. High-quality IPs are an important prerequisite for the stable operation of a crawler.

Crawler IP Overview

An IP address is a unique address used to identify a device on the Internet or a local network. A crawler IP is also known as a proxy server. Its main function is to act as an intermediate layer that communicates with the target server on behalf of the user. When the client interacts with the server, much like with a packet-capture tool, requests initiated by the client pass through the crawler IP server and are forwarded by it. When the target server receives the request, the IP it sees is the address of the crawler IP server, thus hiding the real IP.
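
As a quick check, here is a minimal sketch (assuming the requests library, the public httpbin.org echo service, and a placeholder crawler IP address) showing that the target server sees the proxy's IP rather than your own:

    import requests

    # Placeholder crawler IP address; replace with a real one.
    proxy = {
        'http': 'http://112.74.202.247:16816',
        'https': 'http://112.74.202.247:16816',
    }

    # Without a crawler IP, httpbin.org echoes your real IP.
    print(requests.get('https://httpbin.org/ip').json())

    # With a crawler IP, the 'origin' field shows the proxy server's IP.
    print(requests.get('https://httpbin.org/ip', proxies=proxy).json())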

Crawler IP function

Crawler IPs are used in a wide range of scenarios, the main ones being:

Breaking through access restrictions: some websites restrict access based on the user's IP address; a crawler IP can bypass these restrictions.

Crawler data collection: in batch data collection, a crawler program needs crawler IPs to avoid being blocked or throttled by the target website, as shown in the rotation sketch after this list.

Improving network security: hiding the real IP address behind a crawler IP reduces exposure to security threats such as hacking and phishing.
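
As a minimal sketch of the collection scenario, the snippet below rotates requests across a hypothetical pool of crawler IPs (the addresses and the fetch helper are illustrative, not from any specific provider):

    import random
    import requests

    # Hypothetical pool of crawler IPs; in practice it comes from your provider.
    PROXY_POOL = [
        'http://112.74.202.247:16816',
        'http://112.74.202.248:16816',
        'http://112.74.202.249:16816',
    ]

    def fetch(url):
        # Pick a random crawler IP per request so repeated requests
        # do not all originate from the same address.
        address = random.choice(PROXY_POOL)
        proxies = {'http': address, 'https': address}
        return requests.get(url, proxies=proxies, timeout=10)

    print(fetch('https://httpbin.org/ip').json())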

Crawler IP classification

Crawler IPs can be classified by different characteristics, such as anonymity, supported protocol, geographic location, usage, and quality level. Here we mainly introduce the first two.

Classification by anonymity

Highly anonymous crawler IP: also called a completely anonymous crawler IP. It completely hides the client's real IP address and other identifying information; the server cannot tell which client the request came from and sees only the crawler IP server's address. This type offers high speed and stability, but is usually a paid service.

Ordinary anonymous crawler IP: also called an anonymous crawler IP. It hides the client's IP address but exposes other request information, such as HTTP request headers. The server can recognize that the request came through a proxy but cannot trace the client's real IP. This type is slower and less stable.

Transparent crawler IP: also called an ordinary crawler IP. It hides neither the client's IP address nor other request information, so the server can obtain the real IP that initiated the request. Transparent crawler IPs therefore have little practical use and are rarely used.
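
To see which of these levels a given crawler IP provides, you can inspect the headers the server actually receives. Here is a sketch, assuming httpbin.org/headers and a placeholder proxy; Via and X-Forwarded-For are typical tells, but exact behavior depends on the proxy:

    import requests

    # Placeholder crawler IP; replace with a real one.
    proxy = {
        'http': 'http://112.74.202.247:16816',
        'https': 'http://112.74.202.247:16816',
    }

    # httpbin.org/headers echoes back the headers it received.
    headers = requests.get('https://httpbin.org/headers',
                           proxies=proxy).json()['headers']

    # A transparent proxy typically leaks the client IP in X-Forwarded-For,
    # an ordinary anonymous proxy may reveal itself via a Via header,
    # and a highly anonymous proxy adds neither.
    for name in ('Via', 'X-Forwarded-For', 'X-Real-Ip'):
        print(name, '->', headers.get(name, '(absent)'))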

Classification by supported protocol

HTTP crawler IP: forwards request packets over the HTTP protocol. It is mainly used to request web pages, access restricted websites, and improve user anonymity.

HTTPS crawler IP: forwards request packets over the HTTPS protocol. It helps the client and server establish a secure communication channel and is mainly used to transmit encrypted, private data.

FTP crawler IP: forwards FTP requests, mainly for uploading, downloading, and caching data. It can provide anonymity, access control, speed optimization, and other functions when accessing FTP servers.

SOCKS crawler IP: can forward any type of network request. It supports multiple authentication methods and is a general-purpose crawler IP; see the sketch below.
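
As a brief sketch, requests can use a SOCKS crawler IP once the optional SOCKS support is installed (pip install requests[socks]); the endpoint below is a placeholder:

    import requests

    # Placeholder SOCKS5 endpoint; replace with your own crawler IP.
    proxy = {
        'http': 'socks5://112.74.202.247:1080',
        'https': 'socks5://112.74.202.247:1080',
    }
    print(requests.get('https://httpbin.org/ip', proxies=proxy).json())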

Use of crawler IP

In a previous article, we covered the use of five network request libraries: urllib, requests, httpx, aiohttp, and websocket. Here we introduce how to set a crawler IP with each of them.

After purchasing a crawler IP, the platform usually provides an IP address, port number, username, and password. In code, the crawler IP is usually represented as a dictionary, in the following format:

Common crawler IP format:

    proxy = {
        'http': 'http://%(ip)s:%(port)s' % {'ip': ip, 'port': port},
        'https': 'http://%(ip)s:%(port)s' % {'ip': ip, 'port': port},
    }

The format when using username and password authentication is:

 proxy = { "http": "http://%(user)s:%(pwd)s@%(ip)s:%(port)s/" % {"user": username, "pwd": password, "ip":ip,"port":port}, "https": "http://%(user)s:%(pwd)s@%(ip)s:%(port)s/" % {"user": username, "pwd": password, "ip":ip,"port":port} }

For example:

    proxy = {
        'http': 'http://112.74.202.247:16816',
        'https': 'http://112.74.202.247:16816',
    }
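
To avoid repeating the formatting, a small hypothetical helper (make_proxy is our own name, not a library function) can build either dictionary:

    # Hypothetical helper that builds the proxy dictionary in either format.
    def make_proxy(ip, port, username=None, password=None):
        if username and password:
            address = 'http://%(user)s:%(pwd)s@%(ip)s:%(port)s' % {
                'user': username, 'pwd': password, 'ip': ip, 'port': port}
        else:
            address = 'http://%(ip)s:%(port)s' % {'ip': ip, 'port': port}
        return {'http': address, 'https': address}

    print(make_proxy('112.74.202.247', 16816))
    print(make_proxy('112.74.202.247', 16816, 'user', 'secret'))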

urllib

To set a crawler IP with urllib, you need a ProxyHandler object. First create a ProxyHandler and pass it the address and port of the crawler IP server. Then create an Opener with build_opener, passing in the ProxyHandler. Finally, call install_opener to install the Opener globally, so that all subsequent urlopen requests use it.

    import urllib.request

    # Crawler IP address (proxy server)
    proxy = {
        'http': 'http://112.74.202.247:16816',
        'https': 'http://112.74.202.247:16816',
    }
    proxy_handler = urllib.request.ProxyHandler(proxy)
    opener = urllib.request.build_opener(proxy_handler)
    urllib.request.install_opener(opener)
    response = urllib.request.urlopen('http://jshk.com.cn/').read().decode('utf-8')
    print(response)  # xxx"origin": "112.74.202.247"xxx

After setting the crawler IP and requesting the page, you can see that the origin value in the response has become the crawler IP we set. This means the crawler IP was set successfully: the IP obtained by the web server is the address of the crawler IP server.
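
If you would rather not change the global opener, the same ProxyHandler can be used per request through opener.open; a minimal sketch:

    import urllib.request

    proxy = {
        'http': 'http://112.74.202.247:16816',
        'https': 'http://112.74.202.247:16816',
    }
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))

    # Only requests made through this opener use the crawler IP;
    # urllib.request.urlopen elsewhere is unaffected.
    response = opener.open('https://httpbin.org/ip')
    print(response.read().decode('utf-8'))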

requests

Setting a crawler IP with urllib is rather cumbersome, which is a drawback of the library. In requests it is very simple: the request functions accept a proxies parameter.

    import requests

    proxy = {
        'http': 'http://112.74.202.247:16816',
        'https': 'http://112.74.202.247:16816',
    }
    url = 'http://jshk.com.cn/'
    response = requests.get(url, proxies=proxy)
    print(response.json())  # xxx'origin': '112.74.202.247'xxx
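
If many requests share the same crawler IP, you can set it once on a Session instead of passing proxies to every call; a short sketch:

    import requests

    session = requests.Session()
    # Proxies set on the session apply to every request it makes.
    session.proxies.update({
        'http': 'http://112.74.202.247:16816',
        'https': 'http://112.74.202.247:16816',
    })
    print(session.get('https://httpbin.org/ip').json())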

httpx

httpx works much like requests, and the crawler IP is set the same way. The only difference is the key names in the crawler IP dictionary.

    import httpx

    proxy = {
        'http://': 'http://112.74.202.247:16816',
        'https://': 'http://112.74.202.247:16816',
    }
    url = 'http://jshk.com.cn/'
    response = httpx.get(url, proxies=proxy)
    print(response.json())  # xxx'origin': '112.74.202.247'xxx

The key names in the proxy dictionary change from http and https to http:// and https://.

Setting a crawler IP on an httpx Client works the same way, through the proxies parameter.

    import httpx

    proxy = {
        'http://': 'http://112.74.202.247:16816',
        'https://': 'http://112.74.202.247:16816',
    }
    with httpx.Client(proxies=proxy) as client:
        response = client.get('https://httpbin.org/get')
        print(response.json())  # xxx'origin': '112.74.202.247'xxx
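
The asynchronous client accepts the same mapping. Here is a sketch, assuming the httpx version used in this article (newer releases have renamed the proxies parameter):

    import asyncio
    import httpx

    proxy = {
        'http://': 'http://112.74.202.247:16816',
        'https://': 'http://112.74.202.247:16816',
    }

    async def main():
        # AsyncClient takes the same proxies mapping as the sync Client.
        async with httpx.AsyncClient(proxies=proxy) as client:
            response = await client.get('https://httpbin.org/get')
            print(response.json())

    asyncio.run(main())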

aiohttp

Setting a crawler IP in aiohttp differs from the other libraries: the crawler IP is a string, passed through the proxy parameter of the request method.

    import aiohttp
    import asyncio

    proxy = "http://112.74.202.247:16816"

    async def main():
        async with aiohttp.ClientSession() as session:
            async with session.get('https://httpbin.org/get', proxy=proxy) as response:
                print(await response.json())  # xxx'origin': '112.74.202.247'xxx

    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        loop.run_until_complete(main())
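
For crawler IPs that require authentication, aiohttp accepts a proxy_auth argument; here is a sketch with placeholder credentials:

    import aiohttp
    import asyncio

    async def main():
        # Placeholder credentials; aiohttp sends them as a
        # Proxy-Authorization header.
        auth = aiohttp.BasicAuth('username', 'password')
        async with aiohttp.ClientSession() as session:
            async with session.get('https://httpbin.org/get',
                                   proxy='http://112.74.202.247:16816',
                                   proxy_auth=auth) as response:
                print(await response.json())

    asyncio.run(main())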

websocket

There are two ways to set a crawler IP with the websocket-client library:

The first: collect the proxy parameters in a dictionary and unpack them into run_forever.

    from websocket import WebSocketApp

    def on_message(ws, message):
        # Executed when a message is received
        print(message)

    def on_open(ws):
        # Executed when the connection is opened
        ws.send("Hello, WebSocket!")  # send a message

    if __name__ == "__main__":
        proxies = {
            "http_proxy_host": "112.74.202.247",
            "http_proxy_port": 16816,
        }
        ws = WebSocketApp("ws://echo.websocket.org/", on_message=on_message)
        ws.on_open = on_open
        ws.run_forever(**proxies)

The second: pass the proxy parameters directly as keyword arguments to run_forever.

    ws = WebSocketApp("ws://echo.websocket.org/", on_message=on_message)
    ws.run_forever(http_proxy_host="112.74.202.247", http_proxy_port=16816)
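
If the crawler IP requires authentication, websocket-client also accepts an http_proxy_auth tuple in run_forever; a sketch with placeholder credentials:

    ws = WebSocketApp("ws://echo.websocket.org/")
    # http_proxy_auth takes a (username, password) tuple.
    ws.run_forever(http_proxy_host="112.74.202.247",
                   http_proxy_port=16816,
                   http_proxy_auth=("username", "password"))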

Summary

In crawler development, crawler IPs are very important. Whether for large-scale batch collection or a simple crawler script, using crawler IPs improves a program's stability, provided the IPs are of high quality. High-quality crawler IPs help developers solve many problems and are a tool every crawler developer must learn to use.
