Tuesday, May 29, 2012

Using filters in HBase to match certain columns

HBase is a column-oriented database that stores data by column rather than by row. To limit the output of a scan you can use filters. So far, so good.

But how does it work when you want to filter on more than one column, say two or more specific columns?
The trick is to use several SingleColumnValueFilter (SCVF) instances and combine them in a filter list, which joins them with a boolean AND. The idea is to include every row in which column "X" exists and does NOT hold the value DOESNOTEXIST; the filter looks like this:


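import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

List<Filter> list = new ArrayList<Filter>(2);
// Row must contain fam1:VALUE1 with a value other than "DOESNOTEXIST"
SingleColumnValueFilter filter1 = new SingleColumnValueFilter(Bytes.toBytes("fam1"),
 Bytes.toBytes("VALUE1"), CompareOp.NOT_EQUAL, Bytes.toBytes("DOESNOTEXIST"));
filter1.setFilterIfMissing(true); // skip rows that lack this column
list.add(filter1);
SingleColumnValueFilter filter2 = new SingleColumnValueFilter(Bytes.toBytes("fam2"),
 Bytes.toBytes("VALUE2"), CompareOp.NOT_EQUAL, Bytes.toBytes("DOESNOTEXIST"));
filter2.setFilterIfMissing(true);
list.add(filter2);
// A FilterList combines its filters with MUST_PASS_ALL (AND) by default
FilterList filterList = new FilterList(list);
Scan scan = new Scan();
scan.setFilter(filterList);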



Define a new filter list, add a filter for the column VALUE1 in the family fam1, and compare its value with NOT_EQUAL against DOESNOTEXIST. This means the filter matches every row that has the column VALUE1 with a value other than DOESNOTEXIST; because setFilterIfMissing(true) is set, rows that lack the column are dropped as well. You can add more filters to the list in the same way, start the scan, and you should only get back rows that match all of your conditions.
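
To make the matching rule easier to reason about, here is a minimal sketch of the same logic in plain Java. It is only a model and does not use the HBase API: the class RowFilterModel, the method passes and the "family:qualifier" map keys are names made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the filter logic; this is NOT the HBase API.
class RowFilterModel {
    // Sentinel value, mirroring the one used in the filters above.
    static final String SENTINEL = "DOESNOTEXIST";

    // A row is modeled as a map from "family:qualifier" to the cell value.
    // The row passes only if every required column is present (the
    // setFilterIfMissing(true) behavior) and its value differs from the
    // sentinel (the not-equal comparison). Requiring this for all columns
    // models a filter list that ANDs its filters.
    static boolean passes(Map<String, String> row, List<String> requiredColumns) {
        for (String column : requiredColumns) {
            String value = row.get(column);
            if (value == null || SENTINEL.equals(value)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList("fam1:VALUE1", "fam2:VALUE2");

        Map<String, String> both = new HashMap<String, String>();
        both.put("fam1:VALUE1", "a");
        both.put("fam2:VALUE2", "b");

        Map<String, String> onlyOne = new HashMap<String, String>();
        onlyOne.put("fam1:VALUE1", "a");

        System.out.println("row with both columns passes: " + passes(both, required));
        System.out.println("row with one column passes: " + passes(onlyOne, required));
    }
}
```

In this model a row with only one of the two columns is rejected, which is the behavior the filter list is meant to give you on the server side.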

Tuesday, May 15, 2012

Stop accepting new jobs in a Hadoop cluster (ACL)

To stop a Hadoop cluster from accepting new MR jobs, you first have to enable ACLs. Once that is done, you can set the queue's submit ACL to a single space character (' '), which allows nobody to submit. Since mapred-queue-acls.xml is polled regularly, you can change the queue ACL dynamically on a running system. This is useful for ops work such as putting the cluster into maintenance or adding and decommissioning nodes.

Enable ACLs

Edit the config file ($HADOOP/conf/mapred-queue-acls.xml) to fit your needs:

<configuration>
 <property>
   <name>mapred.queue.default.acl-submit-job</name>
   <value>user1,user2 group1,group2,admins</value>
 </property>

 <property>
   <name>mapred.queue.default.acl-administer-jobs</name>
   <value>admins</value>
 </property>

</configuration>


Enable ACLs for the cluster by setting mapred.acls.enabled to true in conf/mapred-site.xml.

Now simply edit the value of mapred.queue.default.acl-submit-job and replace the user and group list with a single space (' '):

<configuration>
 <property>
   <name>mapred.queue.default.acl-submit-job</name>
   <value> </value>
 </property>

 <property>
   <name>mapred.queue.default.acl-administer-jobs</name>
   <value>admins</value>
 </property>

</configuration>


This prevents all users from submitting new jobs, while jobs that have already been started keep running.
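
Why a single space blocks everybody is easier to see with a small model of how such an ACL string is read. The sketch below assumes the usual format of comma-separated users, then one space, then comma-separated groups, with "*" meaning everyone. It is a simplified illustration, not Hadoop's own implementation, and the class QueueAclModel and its method maySubmit are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of a queue ACL string; not Hadoop's own implementation.
class QueueAclModel {
    private final boolean allowAll;
    private final Set<String> users = new HashSet<String>();
    private final Set<String> groups = new HashSet<String>();

    // Assumed format: "user1,user2 group1,group2"; "*" allows everyone.
    QueueAclModel(String acl) {
        allowAll = acl.trim().equals("*");
        if (!allowAll) {
            // Split on the first space: users before it, groups after it.
            String[] parts = acl.split(" ", 2);
            addNames(users, parts[0]);
            if (parts.length > 1) {
                addNames(groups, parts[1]);
            }
        }
    }

    // Adds every non-empty, comma-separated name to the target set.
    private static void addNames(Set<String> target, String list) {
        for (String name : list.split(",")) {
            if (!name.trim().isEmpty()) {
                target.add(name.trim());
            }
        }
    }

    // A user may submit if listed directly or through one of their groups.
    boolean maySubmit(String user, Set<String> userGroups) {
        if (allowAll || users.contains(user)) {
            return true;
        }
        for (String group : userGroups) {
            if (groups.contains(group)) {
                return true;
            }
        }
        return false;
    }
}
```

With the value ' ' both the user list and the group list end up empty, so maySubmit returns false for every user. That matches the effect described above.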