A From-Scratch Hadoop Practice Diary (hbase in action) - 白红宇



Published: 2019-06-29


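I set up a pseudo-distributed cluster; the installation page also covers the other deployment modes. Installation reference:

Creating the table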

create 'dailystats','uid','sTime','eTime','calories','steps','activeValue','pm25suck','runDist','runDura','cycDist','cycDura','walkDist','walkDura','runCal','cycCal','walkCal','goadCal','goalSteps','goalActiveVal','locations','day'

Importing data

The default field separator is \t. Below is one of the ways to bulk import data: ImportTsv followed by LoadIncrementalHFiles / completebulkload.

Step 1: convert the TSV file into HFiles

sudo -su hdfs hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,sTime,eTime,calories,steps,activeValue,pm25suck,runDist,runDura,cycDist,cycDura,walkDist,walkDura,runCal,cycCal,walkCal,goadCal,goalSteps,goalActiveVal,locations,day -Dimporttsv.bulk.output=/user/hdfs/hbase_day_uid_file_head dailystats /user/hdfs/day_uid_file_head

Step 2: load the HFiles into HBase

sudo -su hdfs hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hdfs/hbase_day_uid_file_head dailystats

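I imported 150,000 users, about 1.4 million rows in total. The input file has to be uploaded to HDFS before step 1. Step 1 took about 10 minutes; step 2 finished within 2 seconds.

Step 2 can also be done as follows (untested):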
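To make the role of -Dimporttsv.columns more concrete, here is a minimal sketch of how a column list lines up with one tab-separated input line. It is an illustration only, not ImportTsv's actual code: the class name TsvColumnMapping and the shortened four-column list are hypothetical, and the sample values are borrowed from the row printed at the end of this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; it does not call any HBase code.
class TsvColumnMapping {
    static final String ROW_KEY = "HBASE_ROW_KEY";

    /** Pairs each tab-separated field with the column name at the same position. */
    static Map<String, String> map(String columnSpec, String line) {
        String[] columns = columnSpec.split(",");
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        if (columns.length != fields.length) {
            throw new IllegalArgumentException(
                "expected " + columns.length + " fields, got " + fields.length);
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            row.put(columns[i], fields[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        // Shortened column list; the real command above lists 21 entries.
        String spec = "HBASE_ROW_KEY,sTime,eTime,steps";
        String line = "100044\t1394985600.0\t1395072000.0\t3744";
        Map<String, String> row = map(spec, line);
        System.out.println("row key = " + row.get(ROW_KEY));
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (!e.getKey().equals(ROW_KEY)) {
                System.out.println(e.getKey() + " = " + e.getValue());
            }
        }
    }
}
```

The point of the sketch is that field order in the file must match the order of the column list exactly, and that the field in the HBASE_ROW_KEY position becomes the row key rather than a stored column.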

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar completebulkload 

Example:

hadoop jar ${HBASE_HOME}/hbase-0.92.1.jar completebulkload /user/hfile/test_hfile.log table_name

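There is also a way to export data from HBase and import it back. Unfortunately it does not work with plain text files; it works with the SequenceFiles that HBase itself produces, so it is only suitable for moving data from one HBase instance to another. Reference: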

14.1.7. Export

Export is a utility that will dump the contents of table to HDFS in a sequence file. Invoke via:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

Note: caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.

14.1.8. Import

Import is a utility that will load data that has been exported back into HBase. Invoke via:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

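The data file location can be a local directory or a path on the distributed file system, HDFS.

For a local directory, specify the path directly; the file:/// prefix is optional.

For HDFS, the path must be spelled out explicitly, for example hdfs://mymaster:9000/path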
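The rule above comes down to looking at the scheme of the location string. The following sketch uses only java.net.URI from the JDK and mirrors the rule as the text states it; the class name PathScheme is hypothetical. One caveat worth keeping in mind: inside an actual Hadoop job, a path without a scheme is resolved against the configured default file system, so treat the "local" answer for bare paths as an illustration of the text, not a guarantee.

```java
import java.net.URI;

// Hypothetical helper for illustration; it does not touch Hadoop's FileSystem API.
class PathScheme {
    /** Classifies a location string by its URI scheme. */
    static String describe(String location) {
        String scheme = URI.create(location).getScheme();
        if (scheme == null || scheme.equals("file")) {
            return "local";
        }
        if (scheme.equals("hdfs")) {
            return "hdfs";
        }
        return "other";
    }

    public static void main(String[] args) {
        String[] samples = {
            "/tmp/export",
            "file:///tmp/export",
            "hdfs://mymaster:9000/path"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + describe(s));
        }
    }
}
```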

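The problem with formatting and restarting Hadoop without dropping the table first

Possibly because of a hardware or software problem, importing data into HBase became extremely slow. I reformatted Hadoop and started HBase again, and ran into this: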

ERROR: Table already exists: dailystats!

Here is some help for this command:
Create table; pass table name, a dictionary of specifications per
column family, and optionally a dictionary of table configuration.
Dictionaries are described below in the GENERAL NOTES section.
Examples:

  hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
  hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
  hbase> # The above in shorthand would be the following:
  hbase> create 't1', 'f1', 'f2', 'f3'
  hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
  hbase> create 't1', 'f1', {SPLITS => ['10', '20', '30', '40']}
  hbase> create 't1', 'f1', {SPLITS_FILE => 'splits.txt'}
  hbase> # Optionally pre-split the table into NUMREGIONS, using
  hbase> # SPLITALGO ("HexStringSplit", "UniformSplit" or classname)
  hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}

You can also keep around a reference to the created table:

  hbase> t1 = create 't1', 'f1'

Which gives you a reference to the table named 't1', on which you can then
call methods.

hbase(main):005:0> enable 'dailystats'

ERROR: Table dailystats does not exist.'

Here is some help for this command:
Start enable of named table: e.g. "hbase> enable 't1'"

hbase(main):006:0> disable 'dailystats'

ERROR: Table dailystats does not exist.'

Here is some help for this command:
Start disable of named table: e.g. "hbase> disable 't1'"

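The first time I hit this I wanted to punch someone. It is a textbook case of split personality: does the table exist or not? The cause is that the state kept in ZooKeeper is out of sync. ZooKeeper apparently only gets updated when the table is explicitly dropped with drop table_name; formatting and restarting Hadoop directly, as I did, leaves the old table entry behind. Anyway, deleting ZooKeeper's data fixes it. With Cloudera, the ZooKeeper data lives in /var/lib/zookeeper/

Delete it, start everything again, and done!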

HBase Java API

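First, a rant: right now neither the web nor the books give you one complete example that shows how to do this. They jump straight to the "core code", which is really irresponsible. Or else it is all Eclipse point-and-click, when a production environment basically only has a command line, OK?

If you have gotten as far as setting up HBase, Java is certainly installed already. Also, the Hadoop, Hive, ZooKeeper, and HBase I use all come from Cloudera.

Here is a complete piece of test code first (from):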

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * @author 三劫散仙
 */
public class Test {

    static Configuration conf = null;

    static {
        conf = HBaseConfiguration.create(); // HBase configuration
        conf.set("hbase.zookeeper.quorum", "127.0.0.1"); // ZooKeeper address
    }

    public static void main(String[] args) throws Exception {
        Test t = new Test();
        // t.createTable("temp", new String[]{"name", "age"});
        // t.insertRow("temp", "2", "age", "myage", "100");
        // t.getOneDataByRowKey("temp", "2");
        t.showAll("temp");
    }

    /**
     * Creates a table with the given column families.
     */
    public void createTable(String tableName, String[] cols) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(conf); // client-side admin tool
        if (admin.tableExists(tableName)) {
            System.out.println("Table already exists.......");
        } else {
            HTableDescriptor table = new HTableDescriptor(tableName);
            for (String c : cols) {
                HColumnDescriptor col = new HColumnDescriptor(c); // column family name
                table.addFamily(col); // add the family to the table
            }
            admin.createTable(table); // create the table
            System.out.println("Table created!");
        }
        admin.close(); // release the admin connection on both paths
    }

    /**
     * Inserts one value. Batch inserts are recommended for larger volumes.
     *
     * @param tableName    table name
     * @param row          row key
     * @param columnFamily column family
     * @param column       column qualifier
     * @param value        the value to store
     */
    public void insertRow(String tableName, String row,
            String columnFamily, String column, String value) throws Exception {
        HTable table = new HTable(conf, tableName);
        Put put = new Put(Bytes.toBytes(row));
        // Arguments are: column family, column qualifier, value
        put.add(Bytes.toBytes(columnFamily), Bytes.toBytes(column),
                Bytes.toBytes(value));
        table.put(put);
        table.close();
        System.out.println("Inserted one value!");
    }

    /**
     * Deletes one row.
     *
     * @param tableName table name
     * @param rowkey    row key
     */
    public void deleteByRow(String tableName, String rowkey) throws Exception {
        HTable h = new HTable(conf, tableName);
        Delete d = new Delete(Bytes.toBytes(rowkey));
        h.delete(d); // delete a single row
        h.close();
    }

    /**
     * Deletes several rows.
     *
     * @param tableName table name
     * @param rowkeys   row keys
     */
    public void deleteByRow(String tableName, String[] rowkeys) throws Exception {
        HTable h = new HTable(conf, tableName);
        List<Delete> list = new ArrayList<Delete>();
        for (String k : rowkeys) {
            list.add(new Delete(Bytes.toBytes(k)));
        }
        h.delete(list); // delete all listed rows
        h.close(); // release resources
    }

    /**
     * Fetches and prints one row.
     *
     * @param tableName table name
     * @param rowkey    row key
     */
    public void getOneDataByRowKey(String tableName, String rowkey) throws Exception {
        HTable h = new HTable(conf, tableName);
        Get g = new Get(Bytes.toBytes(rowkey));
        Result r = h.get(g);
        for (KeyValue k : r.raw()) {
            printKeyValue(k);
        }
        h.close();
    }

    /**
     * Scans and prints all rows, or a specific range.
     *
     * @param tableName table name
     */
    public void showAll(String tableName) throws Exception {
        HTable h = new HTable(conf, tableName);
        Scan scan = new Scan();
        // To scan a specific range instead:
        // Scan scan = new Scan(Bytes.toBytes("startRow"), Bytes.toBytes("stopRow"));
        ResultScanner scanner = h.getScanner(scan);
        for (Result r : scanner) {
            System.out.println("==================================");
            for (KeyValue k : r.raw()) {
                printKeyValue(k);
            }
        }
        scanner.close();
        h.close();
    }

    /**
     * Prints one cell.
     * Labels: 行号 = row key, 时间戳 = timestamp, 列簇 = column family,
     * 列 = column qualifier, 值 = value.
     */
    private static void printKeyValue(KeyValue k) {
        System.out.println("行号:  " + Bytes.toStringBinary(k.getRow()));
        System.out.println("时间戳:  " + k.getTimestamp());
        System.out.println("列簇:  " + Bytes.toStringBinary(k.getFamily()));
        System.out.println("列:  " + Bytes.toStringBinary(k.getQualifier()));
        // Values are printed as strings; a value written with Bytes.toBytes(int)
        // would need Bytes.toInt(k.getValue()) instead.
        System.out.println("值:  " + Bytes.toString(k.getValue()));
    }
}

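The parts highlighted in red in the listing above have to be changed to match your own setup. Mine is pseudo-distributed, so the ZooKeeper address is simply 127.0.0.1; the table name in main also has to be changed to the table you want to use. I tested showAll and it works. The other functions are untested so far.

I could not find a list of which jars need to be loaded. Hunting for them one by one in the hadoop and hbase directories, following the error messages (jar -tf), was far too tedious. It had also been a long time since I last compiled Java directly on the command line, and I did not know how to add a directory to the classpath so that all jars in it get loaded. In the end I simply did this: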

javac -cp .:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hbase/* Test.java

This produces Test.class. Then run:

java -cp .:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hbase/* Test

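One row of the resulting output looks like this (行号 = row key, 时间戳 = timestamp, 列簇 = column family, 列 = column qualifier, 值 = value):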

==================================
行号:  100044
时间戳:  1395839067455
列簇:  activeValue
列:  
值:  75561
行号:  100044
时间戳:  1395839067455
列簇:  calories
列:  
值:  203.087463109
行号:  100044
时间戳:  1395839067455
列簇:  cycCal
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  cycDist
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  cycDura
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  day
列:  
值:  20140309
行号:  100044
时间戳:  1395839067455
列簇:  eTime
列:  
值:  1395072000.0
行号:  100044
时间戳:  1395839067455
列簇:  goadCal
列:  
值:  200
行号:  100044
时间戳:  1395839067455
列簇:  goalActiveVal
列:  
值:  0
行号:  100044
时间戳:  1395839067455
列簇:  goalSteps
列:  
值:  7000
行号:  100044
时间戳:  1395839067455
列簇:  locations
列:  
值:  116.352251956_39.9705142606|116.352283224_39.9705169167|116.352249753_39.9705249571|116.352235717_39.9704804303|116.352235717_39.9704804303|116.352255264_39.9705145666|116.352211069_39.9705208877|116.352298561_39.9705498454|116.352174589_39.9705077219|116.352163865_39.9704987024|116.352202454_39.9705452733|116.352271801_39.9705272967|116.352240134_39.9705372404|116.35215525_39.9705303882|116.352256465_39.9705083685|116.352227205_39.9705281169|116.35221338_39.9705648936|116.352255973_39.9705810705|116.352225497_39.9704947125|116.35230949_39.9705885665
行号:  100044
时间戳:  1395839067455
列簇:  pm25suck
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  runCal
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  runDist
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  runDura
列:  
值:  0.0
行号:  100044
时间戳:  1395839067455
列簇:  sTime
列:  
值:  1394985600.0
行号:  100044
时间戳:  1395839067455
列簇:  steps
列:  
值:  3744
行号:  100044
时间戳:  1395839067455
列簇:  walkCal
列:  
值:  187.210574021
行号:  100044
时间戳:  1395839067455
列簇:  walkDist
列:  
值:  3924.92937003
行号:  100044
时间戳:  1395839067455
列簇:  walkDura
列:  
值:  3122.49075413
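The locations value in this row packs a whole track into one string: points are separated by | and the two numbers of each point are joined by _. A minimal sketch of how such a value could be unpacked on the client side is below. The class name LocationsParser is hypothetical, and reading the first number as longitude and the second as latitude is an assumption based on the magnitudes; the post itself does not say.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; the field order (longitude, latitude) is assumed.
class LocationsParser {
    /** Parses "a_b|a_b|..." into a list of two-element arrays {a, b}. */
    static List<double[]> parse(String cell) {
        List<double[]> points = new ArrayList<>();
        if (cell.isEmpty()) {
            return points;
        }
        for (String point : cell.split("\\|")) { // | must be escaped in a regex
            String[] parts = point.split("_");
            if (parts.length != 2) {
                throw new IllegalArgumentException("malformed point: " + point);
            }
            points.add(new double[] {
                Double.parseDouble(parts[0]),
                Double.parseDouble(parts[1])
            });
        }
        return points;
    }

    public static void main(String[] args) {
        // First two points of the row shown above.
        String cell = "116.352251956_39.9705142606|116.352283224_39.9705169167";
        for (double[] p : parse(cell)) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```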

Reposted from: https://www.cnblogs.com/aquastar/p/3624027.html
