Wednesday, July 13, 2011

Use hive to catch grabber

Get the logs from the farm via flume & syslog, mapreduce them in hive for IP, how often / second, bytes, item and compare with "human" profiles. Get the data on the fly via sqlstream, processes back into Oracle and from there a loadbalancer could get the IPs for a smooth redirect and I process the data into a graphing system (connection from that IP):

Hourly I check geolocation, whois, provider. Using pig.latin. Ready for first testing in our labs. And, of course, not a really performant task (yet) ;-)