2pk03 over AI, ML, BigData and data processing

Showing posts from September, 2011

Hadoop log retention

By Anonymous - September 28, 2011

Some people have asked me about an "issue" in the MapReduce job history (/jobhistory.jsp): on high-traffic clusters the history page takes a while to load. Here is the mechanism behind it. The history files are kept for 30 days (hardcoded before Hadoop 0.21). That produces a lot of logs and wastes space on the Hadoop JobTracker. I have seen installations holding 20 GB of logs in the history; as a consequence, an audit of long-running jobs isn't really usable.

Starting with Hadoop 0.21, the cleanup is configurable:

Key: mapreduce.jobtracker.jobhistory.maxage
Default: 7 * 24 * 60 * 60 * 1000L (one week)

To keep the history for a 3-day period, use:

mapreduce.jobtracker.jobhistory.maxage 3 * 24 * 60 * 60 * 1000L

That means 3 days, 24 hours, 60 minutes, 60 seconds and 1000 milliseconds per second (the trailing L is the Java long suffix), which works out to 259200000 milliseconds.

Another way, though more of a hack, is a job in cron.d:

find /var/log/hadoop-0.20/history/done/ -type f -mtime +1 | xargs rm -f
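To double-check the retention arithmetic and to try the cron-style cleanup without deleting anything right away, a small sketch could look like this. The function names days_to_ms and clean_history are made up for illustration and are not part of Hadoop; the history directory you pass in is whatever your installation uses.

```shell
#!/bin/sh
# Convert a retention period in days to milliseconds:
# days * 24 hours * 60 minutes * 60 seconds * 1000 milliseconds.
days_to_ms() {
    echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# Hypothetical helper (not a Hadoop tool): list, or with "delete" as the
# third argument remove, regular files matching find's "-mtime +N" test.
# Note that "-mtime +1" matches files at least two full days old, because
# find counts whole 24-hour periods and "+1" means "more than one".
clean_history() {
    dir=$1
    days=$2
    mode=${3:-list}
    if [ "$mode" = "delete" ]; then
        find "$dir" -type f -mtime +"$days" -exec rm -f {} +
    else
        find "$dir" -type f -mtime +"$days" -print
    fi
}

# Value for a 3-day mapreduce.jobtracker.jobhistory.maxage
days_to_ms 3
```

Running clean_history with only a directory and a day count prints the candidates, so you can review them before calling it again with delete as the third argument.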

Analyze your IIS logs with Hive

By Anonymous - September 23, 2011

As you know, it's really easy to collect logs from an Apache-driven web farm into a Hive cluster and analyze them. But how does that work for IIS? Okay, let's take a look inside. IIS lets us collect logs in W3C format, which you enable in the administration console under the "Website" tab, "Active log format". There you can set the path where the logs will be stored, the fields to log, and much more. After a restart you should see the logs in the desired path. A good idea is to split the logs by hour, so you can run the jobs every hour on a fresh dataset.

For a small farm, a really easy way is to export the path as a Windows shared drive and connect your Hive server to it with the samba-utils:

mount -t cifs //Windows-Server/share -o user=name,password=passwd /mountpoint

Copy the file into HDFS:

hadoop dfs -copyFromLocal /mountpoint/filename <hdfs-dir>

(we assume iislog). Now you can proceed with the analysis; we use Hive here. Let's assume you want to know whic
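One detail worth handling before the copy step: W3C extended log files carry directive lines that start with '#' (for example the #Fields line), and those would otherwise land in the table as junk rows. A minimal sketch of a staging step, assuming the share is already mounted; strip_w3c_directives and stage_iis_log are hypothetical helper names, not part of Hadoop or Hive.

```shell
#!/bin/sh
# Drop W3C directive lines (those starting with '#') from a log on stdin.
strip_w3c_directives() {
    grep -v '^#'
}

# Hypothetical wrapper: clean one log file and copy it into an HDFS
# directory, using the same "hadoop dfs -copyFromLocal" call as above.
# $1 = local log file (e.g. on the mounted share), $2 = HDFS directory.
stage_iis_log() {
    src=$1
    hdfs_dir=$2
    cleaned=$(mktemp) || return 1
    strip_w3c_directives < "$src" > "$cleaned"
    hadoop dfs -copyFromLocal "$cleaned" "$hdfs_dir/$(basename "$src")"
    status=$?
    rm -f "$cleaned"
    return $status
}
```

Called once per hourly file, for instance from cron, this keeps each HDFS file named after its source log.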

Speed up Sqoop

By Anonymous - September 15, 2011

Sqoop [1] (SQL to Hadoop) makes it easy to connect an RDBMS to a Hadoop infrastructure. The newest plugin comes from Microsoft and lets us connect MS SQL Server and Hadoop with each other. As a cool feature, you can create a jar file from your job. It's pretty easy, just one line:

sqoop export --connect jdbc:<RDBMS>:thin:@<HOSTNAME>:<PORT>:<DB-NAME> --table <TABLENAME> --username <USERNAME> --password <PASSWORD> --export-dir <HDFS DIR WHICH CONTAINS DATA> --direct --fields-terminated-by '<TERMINATOR (Java)>' --package-name <JOBNAME>.<IDENTIFIER> --outdir <WHERE THE GENERATED CODE SHOULD BE WRITTEN> --bindir <BIN_DIR>

After the job has run, you'll find the generated Java code in --outdir and a jar package with the precompiled class in --bindir, so you can start tuning the code. Now let's start the job again, but use the precompiled class:

sqoop export --connect jdbc:<RDBMS>:thin:@<HOSTNAME>:<PORT>:<DB-NAME>
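Because the export line is long and easy to mistype, it can help to assemble it from arguments and print it for review before running it. The sketch below is only that: sqoop_export_cmd is a made-up helper, every value is supplied by the caller, the --fields-terminated-by option is left out for brevity, and it swaps --password for -P (Sqoop's option to read the password from the console) so the password does not end up in the printed line.

```shell
#!/bin/sh
# Hypothetical helper: print a sqoop export command line for review.
# Arguments: $1 JDBC URL, $2 table, $3 user, $4 HDFS export dir,
#            $5 package name, $6 outdir, $7 bindir.
# Nothing is executed here; the function only prints the command.
sqoop_export_cmd() {
    printf 'sqoop export --connect %s --table %s --username %s -P' "$1" "$2" "$3"
    printf ' --export-dir %s --direct' "$4"
    printf ' --package-name %s --outdir %s --bindir %s\n' "$5" "$6" "$7"
}
```

Once the printed line looks right, it can be pasted into a shell or passed to eval.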