The world of proxy connections can get complicated once you start automating software. But what if you use multiple proxies to run something like Scrapebox, which targets Google? Google is pretty good at detecting bots and stopping them in their tracks.
I am also often asked questions about individual software: “Will application X work with your proxy?” Generally speaking the answer is yes, but I would rather explain why that is than leave you to accept it blindly. So first I will go over the important factors you may encounter with proxies and what they mean, then I will cover some of the most popular applications that work with proxies and their requirements.
HTTP vs SOCKS4 vs SOCKS5
This is the first, and probably the most important, compatibility issue: the type of connection the proxy can use. SOCKS is the more general-purpose option. A SOCKS proxy server sits in the middle between the client and the destination server; for example, if you are using a product like Scrapebox, the proxy sits between you and Google. SOCKS itself stands for SOCKet Secure.
The main difference between SOCKS4 and SOCKS5 is that SOCKS5 adds authentication. A SOCKS4 proxy cannot be used with a login and password, nor can it pass authentication information along to the target server. In other words, if you want to scrape data from a page that requires a login to access, you will want a SOCKS5 proxy server.
What about HTTP? HTTP is more specialized and therefore more limited. You might recognize HTTP as the beginning of a public URL; that's because it's the standard protocol for ordinary web traffic.
SOCKS is a protocol for passing traffic from one machine to another without interpreting the data. It just passes it along from point A to point B to point C.
An HTTP proxy at point B, however, has the opportunity to interpret the traffic it forwards. This can be useful for simplifying certain aspects of scraping. For example, if you are scraping Amazon, the HTTP proxy can identify and cache common page elements to minimize the amount of content your scraper needs to download from Amazon itself.
The trade-off is that HTTP proxies are limited to HTTP traffic. If you try to reach a server that doesn't accept HTTP connections, but your software requires an HTTP proxy, you won't be able to establish a connection in the first place.
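To make the difference concrete, here is a minimal sketch of pointing the same request at an HTTP proxy versus a SOCKS5 proxy using Python's requests library (SOCKS support comes from the optional PySocks extra). The proxy hosts, ports, and credentials below are placeholders, not real endpoints.

```python
# Minimal sketch: the same request sent through an HTTP proxy and a
# SOCKS5 proxy. Hosts, ports, and credentials are placeholders.
import requests  # SOCKS support needs: pip install "requests[socks]"

# HTTP proxy: the proxy understands and can interpret the web traffic.
http_proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

# SOCKS5 proxy: the proxy blindly tunnels bytes, and (unlike SOCKS4)
# SOCKS5 lets you attach a username and password.
socks_proxies = {
    "http": "socks5://user:pass@proxy.example.com:1080",
    "https": "socks5://user:pass@proxy.example.com:1080",
}

resp = requests.get("https://example.com", proxies=http_proxies, timeout=10)
print(resp.status_code)

resp = requests.get("https://example.com", proxies=socks_proxies, timeout=10)
print(resp.status_code)
```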
Communication Port
Ports are another component of internet communications that most people ignore unless they have a reason to mess with them. They are essentially like radio frequencies or TV channels. Another analogy is an apartment building: the IP address is the building's street address, and the port specifies the individual apartment.
Different ports are often used to distinguish between the services used to establish a connection.
- Port 21 is usually used for FTP connections
- Port 22 is used for SSH connections
- Port 53 is used for DNS services
- Port 80 is used almost exclusively for HTTP traffic, which matters for proxies as well.
If your proxy only supports HTTP, it will generally be limited to web ports like 80. If the proxy uses SOCKS, it can usually tunnel to any port, so you can adjust the port based on the requirements of your target.
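As an illustration, here is a rough sketch of reaching a non-web port through a SOCKS5 proxy, assuming the PySocks library; the proxy address, credentials, and target host are placeholders.

```python
# Rough sketch: tunneling to an arbitrary port (FTP's port 21) through a
# SOCKS5 proxy with PySocks (pip install PySocks). All addresses and
# credentials below are placeholders.
import socks  # PySocks

s = socks.socksocket()
s.set_proxy(socks.SOCKS5, "proxy.example.com", 1080,
            username="user", password="pass")
s.connect(("ftp.example.com", 21))  # any target port works over SOCKS
print(s.recv(1024))                 # FTP servers send a greeting banner
s.close()
```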
Data Security
This is another concern you may have with a proxy server, but it has nothing to do with the SOCKS and Port factors above. It’s all about the security of connecting through a proxy.
A lot of public proxies are not secure at all. Many route traffic through Eastern European servers that inject ads into the traffic or slap an overlay over the pages you load. You never know what kind of software might be running on that server, snooping on the connections being made and the data being sent.
In contrast, private proxies tend to offer greater security, because the proxy servers themselves are hosted in more trustworthy locations.
They are also designed for more advanced users who do not want their data snooped on. You may also need a secure connection to access certain websites, especially those that require authentication through SOCKS5. Always avoid putting sensitive login information through an unsecured proxy.
Anonymity
The issue of anonymity is at the heart of the concept of proxy connections. A lot of people use proxies for simple web browsing because they don’t want their home IP address associated with their browsing habits. They may just not want to be tracked by a large entity like Facebook, Google, or a large advertising network.
Or they may be doing something questionable, or outright illegal, and want to hide from law enforcement or the NSA. Perceived anonymity breeds a false sense of security, built on the idea that hiding behind a proxy makes you untraceable.
Proxy servers offer different levels of anonymity. Some forward pretty much all of the identifying information you would normally send and provide essentially no anonymity at all: they pass your real IP address along to the destination server, so anyone who wants to track you can find it there.
More anonymous proxies don't forward as much information. They won't reveal your IP address, but they will reveal that the connection is coming through a proxy. The destination server knows someone is connecting via a proxy, but not the originating IP address.
The highest level of anonymity comes from top-tier proxies that mimic real connections. These don’t even reveal that they are proxies, although sometimes user behavior gives them away.
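One practical way to test this, sketched below, is to send a request through the proxy to a header-echo service such as httpbin.org and see which identifying headers survive. The proxy URL is a placeholder.

```python
# Sketch: gauge a proxy's anonymity level by asking httpbin.org what it
# sees. A transparent proxy leaks your address in headers such as
# X-Forwarded-For or Via; an elite proxy does not. Placeholder proxy URL.
import requests

proxy = "http://user:pass@proxy.example.com:8080"
proxies = {"http": proxy, "https": proxy}

headers_seen = requests.get("https://httpbin.org/headers",
                            proxies=proxies, timeout=10).json()["headers"]
ip_seen = requests.get("https://httpbin.org/ip",
                       proxies=proxies, timeout=10).json()["origin"]

print("IP the target sees:", ip_seen)
for h in ("X-Forwarded-For", "Via", "Forwarded"):
    if h in headers_seen:
        print(f"{h} is present - this proxy is not fully anonymous")
```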
Ability to bypass search engine blocks
This is the reason people call certain proxies "Google-safe." All it means is that the proxy's IP address is not known to be a proxy server and has never been abused in the past. Google has active anti-proxy and anti-bot measures in place and will throttle or block your connection if it detects abuse or malicious activity.
Whether a proxy is Google-safe isn't necessarily a property of the proxy itself; it's usually more a matter of user behavior. If you make a lot of similar, repeated requests from a single IP address, it starts to look like a bot. If you vary the IP address behind those requests and change their timing, it starts to look more like a natural user. This is why you should use a list of proxies instead of a single proxy, and why you should set up delays and asynchronous connections, as sketched below.
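Here is a minimal sketch of that advice: rotate through a proxy list and add a randomized delay between requests. The proxy addresses and queries are placeholders.

```python
# Minimal sketch: rotate proxies and randomize timing so repeated
# requests look less bot-like. Proxies and queries are placeholders.
import itertools
import random
import time

import requests

proxy_list = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
rotation = itertools.cycle(proxy_list)

for query in ["blue widgets", "red widgets", "green widgets"]:
    proxy = next(rotation)  # a different exit IP for each request
    resp = requests.get("https://www.google.com/search",
                        params={"q": query},
                        proxies={"http": proxy, "https": proxy},
                        timeout=10)
    print(query, resp.status_code)
    time.sleep(random.uniform(5, 15))  # vary the delay between requests
```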
IP Location
The last factor is simply a matter of where the proxy server is coming from. There are two main categories for this.
The first category is geography. If you are trying to log into a US-centric website, then using a proxy server located in Ukraine is probably not a good idea. Many websites that are frequently attacked by scrapers will block foreign IPs, or reroute them to a foreign version of the site, which is of no value for your needs.
The second category is the source of the IP: is it coming from a data center, or from a residential connection? This is probably the most important factor on this list. Many large entities, such as Google, Amazon, and e-commerce sites, detect when a connection is made from a data center; it is one way they spot proxy and scraping abuse. It is always better to come from a residential IP, because that looks more like their typical user's behavior.
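If you want to check what a given proxy's exit IP looks like to the outside world, one rough approach, sketched below, is to query a free IP-lookup service such as ip-api.com through the proxy; its country, ISP, and hosting/proxy flags give a hint about geography and data-center origin. The proxy URL is a placeholder, and the flags should be treated as hints rather than guarantees.

```python
# Rough sketch: check the proxy's apparent country and whether its exit IP
# looks like a data-center address, using the free ip-api.com service
# (HTTP only on the free tier). Placeholder proxy URL; flags are hints.
import requests

proxy = "http://user:pass@proxy.example.com:8080"

info = requests.get(
    "http://ip-api.com/json/",
    params={"fields": "status,query,country,isp,proxy,hosting"},
    proxies={"http": proxy, "https": proxy},
    timeout=10,
).json()

print("Exit IP:     ", info.get("query"))
print("Country:     ", info.get("country"))
print("ISP:         ", info.get("isp"))
print("Data center? ", info.get("hosting"))
print("Known proxy? ", info.get("proxy"))
```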
Applications and their compatibility
There are a bunch of common applications you might want to use with a proxy. Most of them scrape data automatically in some form, while others submit it in bulk. Generally, sites don't like robots doing this kind of thing, because that's how spam and fake accounts are created. I'm not here to judge your use of these tools; I'm sure you know what you're doing.
I am also not responsible for how you choose to use a proxy. All I do is review some common programs and then tell you their requirements. As a disclaimer, I do not necessarily support or condone black hat use of the following applications; what you do is up to you.
Scrapebox
This is probably one of the best-known tools in both black hat and white hat operations; it is a very powerful data harvester, used by black hat SEOs and Fortune 500 companies alike. Its multithreaded operation supports many simultaneous connections, and it is Google-safe as long as you use it correctly. Of course, you can still get blocked depending on your usage, which is why you need a lot of proxies, varied and asynchronous requests, and submission delays. Use it with caution.
- Supports both HTTP and SOCKS connections.
- Both private and public proxies are supported, but private proxies are preferred.
- It is strongly recommended that you use a large rotating proxy list rather than a short static list.
XRumer
This is another link-building SEO application; it focuses on web forums that still carry some residual SEO value. It also targets blog comments, journal guestbooks, link directories, social networks, social bookmarking sites, and so on. It includes CAPTCHA bypasses for many common systems, including text question-and-answer systems. To avoid being flagged as spam, it attempts to tailor posts to the subject of the target forum or forums.
- Supports both HTTP and SOCKS connections.
- Prefer private proxies to avoid trying to use previously banned IP addresses.
SEnuke TNG
SEnuke TNG is the successor to SEnukeX, an older SEO program it was built to replace. The new version was created from scratch and includes more features, including a basic tutorial, flow charts, and campaign planning weeks in advance. It strives to stay on Google's good side by making its activity look as natural as possible. The app comes with a 14-day trial and a 30-day money-back guarantee.
- Only an HTTP connection is required.
- Prefer private proxies to avoid common issues with public proxy servers.
Tweet Attacks
Tweet Attacks Pro 4 (the current version of Tweet Attacks) is a piece of software that can manage up to thousands of Twitter accounts at a time. It allows for automated following, unfollowing, follow-backs, tweeting, retweeting, replying, liking, deleting, and just about anything else you might want to do on Twitter. It also allows these accounts to be individually customized, which eliminates the "egg" problem when running a network of dummy accounts. Pricing depends on which tier of the program you choose.
- Due to Twitter's authentication requirements, only an HTTP connection is required.
- Both private and public proxies are supported, although it is best to use a private proxy to avoid detection.
- It is recommended that you use a number of proxies to manage your accounts, although you do not have to have a dedicated proxy for each account.
Ticketmaster
This is a general category of Ticketmaster ticket-buying bots. There are many varieties, including those named TicketMaster, TicketMaster Spinner, and TicketBots. They all share the same needs, since they visit the same sites with the same goal: buying large quantities of tickets to shows and then reselling them for a profit. Reselling tickets is generally not illegal unless it is done at the venue, although some states have stricter laws about ticket resale.
- Requires HTTP connection to Ticketmaster website for authentication and display.
- Residential IP addresses are preferred, as Ticketmaster readily cancels purchases made from data center IPs and other non-local IPs that give off bot signals.
Twitter Account Creation
To use a bot like the Twitter Manager mentioned above, you will need to create Twitter accounts in bulk. There are a number of different bots that allow this, such as Twitter Mass Account Maker or Twitter Account Creator Bot. Like the Ticketmaster bot, these bots all have similar requirements.
- An HTTP connection is required to handle authentication and login to the Twitter servers.
- Residential IP addresses are preferred, usually private rather than public, although the occasional data center IP is not unexpected, given how many agencies and corporations use Twitter.
Facebook Account Creation
This is identical in many ways to the Twitter bots listed above.
Some common Facebook account bots include Facebook Account Creator and FBDevil.
- An HTTP connection is required to handle authentication and login to the Facebook servers.
- Residential IP addresses are preferred, and private addresses are generally preferred over public addresses.
Email Account Creation
Email accounts can be created in bulk in the same way as social profiles, although there are as many bots as there are email providers. Every provider is different, and every bot is different, so make sure you meet the requirements before buying or using a proxy list.
Generally, the requirements are the same as above for social: HTTP connection and residential IP. However, some email systems may use other connections or datacenter IPs.