Forum

BanOnFlooding or something else

Anton
14 April 2015, 22:32
Hello

Some crawlers can parse large sites in a single thread, with cookie support and randomized requests.
In one day that can add up to gigabytes of traffic or thousands of pages.
What is the most effective option to stop them?


Hugo Leisink
15 April 2015, 08:59
You can take a look at the ChallengeClient option and use it in 'javascript' mode.
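In your hiawatha.conf that looks roughly like this (the threshold and ban time are only example values, pick what fits your traffic):

ChallengeClient = 250, javascript, 300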
Bryan
15 April 2015, 23:43
This is an interesting feature -- the ChallengeClient option. How does it relate to ConnectionsPerIP and ConnectionsTotal? Does the threshold on ChallengeClient need to be set lower than ConnectionsPerIP in order to have any effect?
Hugo Leisink
15 April 2015, 23:48
It's not related. The threshold is the total number of connections, which can include multiple clients. So yes, a threshold higher than ConnectionsTotal makes no sense.
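As an illustration (the numbers are arbitrary):

ConnectionsTotal = 1000
# the ChallengeClient threshold counts open connections from all clients combined,
# so it should stay below ConnectionsTotal
ChallengeClient = 500, javascript, 300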
Kapageridis Stavros
17 April 2015, 22:35
Hi Hugo,
I have set ConnectionsTotal = 10000 in my hiawatha.conf and I want to enable the ChallengeClient option. Is the following correct?
ChallengeClient = 10000, javascript, 60, and (if it is correct) does it make any sense as protection against DDoS?
Anton
12 June 2015, 00:20
I have used phpcrawl against my site with a delay of 0.3 seconds. Maybe I have misconfigured something, but Hiawatha can't stop it.

My config

ServerId = www-data
ServerString = Server
ConnectionsTotal = 10000
ConnectionsPerIP = 25
SystemLogfile = /var/log/hiawatha/system.log
GarbageLogfile = /var/log/hiawatha/garbage.log
ExploitLogfile = /var/log/hiawatha/exploit.log

Binding {
Port = 80
}

BanOnGarbage = 1800
BanOnFlooding = 10/1:1800
BanOnMaxPerIP = 1800
ChallengeClient = 100, javascript, 1800
KickOnBan = yes
RebanDuringBan = yes
ReconnectDelay = 1
Anton
12 June 2015, 00:23
The phpcrawl library used for the tests is http://phpcrawl.cuab.de/
Hugo Leisink
12 June 2015, 01:03
@Kapageridis: you set a threshold of 10,000. That means that when you get 10,000 or more connections, Hiawatha will start challenging clients. I think 10,000 is a bit too high. I think you want just a few hundred for that.
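So with your ConnectionsTotal = 10000, something along these lines (the exact threshold is a matter of taste):

ChallengeClient = 250, javascript, 60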
Hugo Leisink
12 June 2015, 01:05
@Anton: with what settings did you deploy this phpcrawl? How many simultaneous connections, how many requests per second per connection? What kind of requests?

What kind of attacks do you want to block? Do you get any real attacks?
Anton
12 June 2015, 01:16
Hugo, I have used 1 thread with a 0.3-second delay.

My Hiawatha config is posted above.

Here is the PHP code that uses the phpcrawl library.
/**
 * The following code is a complete example of a resumable crawling process.
 *
 * You may test it by starting it from the command line (CLI, type "php resumable_example.php"),
 * aborting it (Ctrl+C) and starting it again.
 */

// Include the phpcrawl main class
include("libs/PHPCrawler.class.php");


// Extend the class and override the handleDocumentInfo()-method
class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // Just detect the linebreak for output ("\n" in CLI-mode, otherwise "<br>").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";

        // Print the URL and the HTTP status code, and append the page content to a file
        $parser_results = fopen('1.txt', 'a+');
        echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
        fwrite($parser_results, print_r($DocInfo->content, TRUE));
        fclose($parser_results);
        flush();
    }
}

$crawler = new MyCrawler();
//$crawler->setFollowMode(1);
$crawler->setUserAgentString("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");
$crawler->setURL("some-site-to-test.com");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png|css|js)$# i");
$crawler->enableCookieHandling(true);
$crawler->setRequestDelay(0.3);


// Important for resumable scripts/processes!
$crawler->enableResumption();

// At the first start of the script, retrieve the crawler-ID and store it
// (in a temporary file in this example)
if (!file_exists("/tmp/mycrawlerid.tmp"))
{
    $crawler_ID = $crawler->getCrawlerId();
    file_put_contents("/tmp/mycrawlerid.tmp", $crawler_ID);
}
// If the script was restarted again (after it was aborted), read the crawler-ID
// and pass it to the resume() method.
else
{
    $crawler_ID = file_get_contents("/tmp/mycrawlerid.tmp");
    $crawler->resume($crawler_ID);
}

// Start crawling
$crawler->goMultiProcessed(1);

// Delete the stored crawler-ID after the process is finished completely and successfully.
unlink("/tmp/mycrawlerid.tmp");

$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb;
Anton
12 June 2015, 01:20
I have tried to use wget against Hiawatha. With my config it gets banned very quickly, after 15-20 requested pages.
Anton
12 June 2015, 01:24
I have not had any serious attacks, but I check the logs periodically and there I can see some multi-threaded parsing cases. Maybe I don't completely understand the ChallengeClient option. I thought the phpcrawl test would be banned after 100+ requests, but I was able to download many more pages.
Anton
12 June 2015, 01:33
Maybe that is possible because of enableCookieHandling(true) in this crawler.
Anton
12 June 2015, 01:56
Earlier I wrote about multi-threaded parsing cases. Of course, I meant that Hiawatha successfully bans them, as I can see in system.log.
Hugo Leisink
12 June 2015, 23:11
The threshold of the ChallengeClient option is not about the number of requests, but about the number of simultaneous connections.
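To relate that to the config you posted:

# clients only get challenged once 100 connections are open at the same time;
# a crawler that sends one request at a time over a single connection never comes near that
ChallengeClient = 100, javascript, 1800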
Anton
13 June 2015, 15:33
I have tested httrack. It can download thousands of pages, just like phpcrawl. As I understand it, there is no option in Hiawatha to stop such bots, and it is not easy to implement something in a webserver to stop them.

Is there any server-side software that counts bandwidth per IP address? For example: if some IP address consumes more than 30 MB in 5 minutes, ban it for 1 hour, with some verified subnets of good bots allowed to exceed that limit.
Anton
13 June 2015, 17:22
I have thought about this problem again. It looks like the simplest way to solve it is to use something like this as a separate daemon: https://github.com/v2nek/python-ddos-evasive

With bandwidth limiting it is impossible to distinguish pictures, JS and pages.
Hugo Leisink
14 June 2015, 16:17
You can block clients that send a large number of requests at high speed via the BanOnFlooding option. But that only works for multiple requests within a single connection. If the attacker uses a separate connection for every request, you can use a combination of ConnectionsPerIP and ReconnectDelay.
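For example (the numbers are only an illustration):

# ban a client for 300 seconds when it sends more than 10 requests within 1 second over one connection
BanOnFlooding = 10/1:300

# allow at most 10 simultaneous connections per IP and keep counting a closed
# connection for another 30 seconds, so rapid reconnects also hit the per-IP limit
ConnectionsPerIP = 10
ReconnectDelay = 30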
Anton
14 June 2015, 19:24
I have changed

ReconnectDelay = 1
to
ReconnectDelay = 100

phpcrawl (0.3-second delay) was banned after downloading 25 pages, in line with

ConnectionsPerIP = 25

It looks like everything works.
As I currently understand it, if a user (bot) from some IP requests 25 pages (ConnectionsPerIP = 25) within 100 seconds (ReconnectDelay = 100), that IP will be banned for 1800 seconds (BanOnMaxPerIP = 1800).
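For reference, the relevant lines in my hiawatha.conf are now:

ConnectionsPerIP = 25
ReconnectDelay = 100
BanOnMaxPerIP = 1800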

Hugo Leisink
15 June 2015, 11:03
No, ConnectionsPerIP limits the number of simultaneous connections. ReconnectDelay makes Hiawatha remember closed connections for the specified number of seconds.

Example: if you set ConnectionsPerIP to 5 and ReconnectDelay to 10 seconds and you make 5 connections in a row, you have to wait 10 seconds after closing the first connection before you can make a sixth connection. Otherwise, the connection will be dropped by Hiawatha and, if you have set BanOnMaxPerIP, the client will even get banned.
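In config terms, that example looks like this:

ConnectionsPerIP = 5
ReconnectDelay = 10
# optional: ban clients that go over the limit instead of only dropping the connection
BanOnMaxPerIP = 1800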
This topic has been closed.