What exactly is the proxy IP that the crawler needs?

When crawling certain websites, we often route requests through proxy IPs so that the crawler is not blocked. A common way to obtain proxy addresses is to pull free proxies from well-known proxy providers. These providers generally offer transparent proxies, anonymous proxies, and high-anonymity proxies. What are the differences between these types, and how do we choose among them? This article explains the principles behind the various kinds of proxy IP.

1. Proxy types

There are four types of proxies. In addition to the transparent, anonymous, and high-anonymity proxies mentioned above, there are also distorting (obfuscated) proxies. Ranked by how well they protect the client, the order is high-anonymity > distorting > anonymous > transparent.

2. Proxy principles

The proxy type mainly depends on how the proxy server is configured: different configurations produce different proxy types. Three variables are decisive: REMOTE_ADDR, HTTP_VIA, and HTTP_X_FORWARDED_FOR.

1) REMOTE_ADDR
REMOTE_ADDR represents the client's IP. Its value is not supplied by the client; the server sets it based on the address of the connection it receives.

If you use a browser to access a website directly, the website's web server (Nginx, Apache, etc.) will set REMOTE_ADDR to the client's IP address.

If we configure a proxy in the browser, the request first goes to the proxy server, which then forwards it to the target website. In that case the website's web server sets REMOTE_ADDR to the IP of the proxy server.

2) X-Forwarded-For (XFF)
X-Forwarded-For is an HTTP extension header used to indicate the real IP address of the HTTP requester. When the client uses a proxy, the web server sees only the proxy's address and does not know the client's real IP. To make the real address available, a proxy server usually adds an X-Forwarded-For header containing the client's IP address.

The format of the X-Forwarded-For request header is as follows:

 X-Forwarded-For: client, proxy1, proxy2

Here client is the IP address of the client; proxy1 is the IP of the first proxy the request passes through (the one farthest from the server); proxy2 is the IP of the next proxy in the chain. As the format shows, there can be several layers of proxies between the client and the server.

If an HTTP request passes through three proxies Proxy1, Proxy2, and Proxy3 before reaching the server, with IPs IP1, IP2, and IP3 respectively, and the user's real IP is IP0, then according to the XFF standard, the server will eventually receive the following information:

 X-Forwarded-For: IP0, IP1, IP2

Proxy3 connects directly to the server and appends IP2 to XFF, indicating that it is forwarding the request on behalf of Proxy2. IP3 does not appear in the list; the server obtains it from the Remote Address field instead. HTTP runs on top of a TCP connection, and the HTTP protocol itself has no notion of an IP address. Remote Address comes from the TCP connection and is the IP of the device that opened the TCP connection to the server, which is IP3 in this example.
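The chain described above can be simulated in a few lines. This is only an illustrative sketch: the function name forward_through_proxies is made up for this example, and it assumes every proxy in the chain follows the XFF convention of appending the address of the peer it received the request from.

```python
def forward_through_proxies(client_ip, proxy_ips):
    """Simulate a request passing through a chain of XFF-aware proxies.

    Returns (x_forwarded_for, remote_addr) as the origin server would see them.
    """
    xff = []          # entries accumulated in the X-Forwarded-For header
    peer = client_ip  # address of whoever opened the current TCP connection
    for proxy_ip in proxy_ips:
        # Each proxy appends the address of the peer it received the request
        # from, then opens its own TCP connection to the next hop.
        xff.append(peer)
        peer = proxy_ip
    # The last hop's address never appears in XFF; the server reads it from
    # the TCP connection (REMOTE_ADDR).
    return ", ".join(xff), peer


xff_header, remote_addr = forward_through_proxies("IP0", ["IP1", "IP2", "IP3"])
print("X-Forwarded-For:", xff_header)
print("REMOTE_ADDR:", remote_addr)
```

With the three proxies from the example, the header comes out as IP0, IP1, IP2 and the remote address is IP3, matching the description above.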

3) HTTP_VIA
Via is a header in the HTTP protocol that records the proxies and gateways an HTTP request passes through. Each proxy server on the path adds its own entry: one proxy yields one entry, two proxies yield two entries.
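As a rough illustration of how the Via value grows hop by hop, the sketch below appends one entry per proxy. The helper append_via and the proxy names are hypothetical; the "protocol-version name" entry format follows the usual shape of the header, but real proxies may use pseudonyms or add comments.

```python
def append_via(existing_via, proxy_name, protocol_version="1.1"):
    """Return the Via value after one more proxy has handled the request."""
    entry = f"{protocol_version} {proxy_name}"
    # The first proxy creates the header; later proxies append to it.
    return f"{existing_via}, {entry}" if existing_via else entry


via = ""
for name in ["proxy1.example", "proxy2.example"]:  # hypothetical proxy names
    via = append_via(via, name)
print("Via:", via)
```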

3. Differences between proxy types

1) Transparent Proxy
The proxy server configuration is as follows:

 REMOTE_ADDR = Proxy IP
 HTTP_VIA = Proxy IP
 HTTP_X_FORWARDED_FOR = Your IP

A transparent proxy puts its own IP in REMOTE_ADDR and so nominally "hides" the client's IP address, but the server can still read the client's real IP from HTTP_X_FORWARDED_FOR.

2) Anonymous Proxy
The proxy server configuration is as follows:

 REMOTE_ADDR = Proxy IP
 HTTP_VIA = Proxy IP
 HTTP_X_FORWARDED_FOR = Proxy IP

An anonymous proxy hides the client's IP address. When one is in use, the server can tell that the client is behind a proxy, but it cannot learn the client's real IP address.

3) Distorting Proxy
The proxy server configuration is as follows:

 REMOTE_ADDR = Proxy IP
 HTTP_VIA = Proxy IP
 HTTP_X_FORWARDED_FOR = Random IP address

The principle is similar to that of an anonymous proxy, but the disguise is more convincing. If the client uses a distorting proxy, the server can still tell that a proxy is in use, but the client IP address it sees is a fake one.

4) Elite Proxy or High-Anonymity Proxy
The proxy server configuration is as follows:

 REMOTE_ADDR = Proxy IP
 HTTP_VIA = not determined
 HTTP_X_FORWARDED_FOR = not determined

A high-anonymity proxy not only prevents the server from obtaining the client's real IP address, it also leaves the server unable to tell whether the client is using a proxy at all.
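The four configurations above can be turned into a small check for testing a proxy you are about to use. The sketch below is an assumption-laden illustration, not a standard tool: classify_proxy is a hypothetical helper, and it presumes you already know your own real IP and can see the three values as the target server received them (for example from an echo endpoint you control).

```python
def classify_proxy(remote_addr, via, x_forwarded_for, real_ip):
    """Classify a proxy from the values the server observed.

    via and x_forwarded_for are None or "" when the header was absent.
    """
    if remote_addr == real_ip:
        return "no proxy"  # the server saw our own address directly
    if not via and not x_forwarded_for:
        return "high-anonymity"  # nothing in the request reveals a proxy
    forwarded = (
        [ip.strip() for ip in x_forwarded_for.split(",")]
        if x_forwarded_for
        else []
    )
    if real_ip in forwarded:
        return "transparent"  # our real IP is leaked through XFF
    if not forwarded or forwarded == [remote_addr]:
        return "anonymous"  # proxy is visible, XFF only names the proxy
    return "distorting"  # proxy is visible, XFF carries some other address


# Example addresses come from the reserved documentation ranges.
print(classify_proxy("203.0.113.10", "1.1 proxy", "198.51.100.7", "198.51.100.7"))
print(classify_proxy("203.0.113.10", None, None, "198.51.100.7"))
```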

4. Proxy selection

An ordinary anonymous proxy hides the client's real IP but modifies the request, so although the visited website cannot see the client's IP address, it can still tell that a proxy is in use. Some websites that apply additional IP-detection techniques may nevertheless be able to find the client's IP.

A high-anonymity proxy does not change the client's request, so to the server it looks as if a real client browser is accessing it directly. The client's real IP is hidden, and the server has no indication that a proxy is in use.

Therefore, when a crawler needs a proxy IP, choose an ordinary anonymous proxy or, better, a high-anonymity proxy. In addition, if you do not want the proxy server to be able to read the data, it is recommended to use a proxy that supports the HTTPS protocol.
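Once a proxy has been chosen, wiring it into a crawler is straightforward. The following is a minimal sketch using only Python's standard library; the proxy address and the User-Agent string are placeholders invented for this example, and the actual request is left commented out because that proxy does not exist.

```python
import urllib.request

# Hypothetical proxy address (from a reserved documentation range);
# replace it with a proxy you actually have access to.
PROXY = "http://203.0.113.10:8080"

# Route both plain HTTP and HTTPS requests through the proxy.
proxy_handler = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(proxy_handler)
opener.addheaders = [("User-Agent", "example-crawler/0.1")]  # placeholder value

# To fetch a page through the proxy (not executed here):
# with opener.open("https://example.com/", timeout=10) as resp:
#     html = resp.read()
```

For https:// URLs the request is normally tunnelled through the proxy, so the proxy relays encrypted traffic rather than readable page content.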

This article comes from online submissions and does not represent the analysis of kookeey. If you have any questions, please contact us
