Odds and Ends about HBase

With Facebook using HBase to build its real-time messaging system, HBase, the Hadoop-based column-oriented store, continues to gain momentum.

The current stable release, HBase 0.90.2, runs only on the Hadoop 0.20.x line and does not yet support the newer 0.21.x. Moreover, the official Hadoop 0.20.2 (and 0.20.203.0) releases lack a critical feature: HDFS does not support durable sync, so HBase carries a significant risk of data loss. To run HBase in production there are two options. One is Cloudera's CDH3. Cloudera is to official Hadoop roughly what Percona is to MySQL, having made many improvements over the stock releases; Tom White, author of the classic Hadoop: The Definitive Guide, works at Cloudera, just as the authors of High Performance MySQL mostly come from Percona. The other option is to build Hadoop's branch-0.20-append source branch yourself; detailed instructions are available for this.

One of the optimizations in HBase, as in other BigTable-like systems, is the elimination of random disk writes. The price is that the most recent data is kept in in-memory tables, which demands a lot of memory. When there are many in-memory tables, each one gets flushed to disk while still small, producing many small data files; a range read then has to touch multiple files or even multiple nodes. To restore read performance, such systems include a compaction operation. To keep individual data files from growing too large (hbase.hregion.max.filesize, 256 MB by default; oversized files consume more memory during compactions and similar operations), HBase also implements a split operation. Both compaction and split tend to cause response-time jitter for online applications, so their policies should be tuned to the workload; it is advisable to trigger them manually during off-peak hours.
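One common way to follow that advice, sketched below under the assumption of 0.90-era property names, is to disable the periodic major-compaction timer in hbase-site.xml and fire compactions yourself during the low-traffic window:

```xml
<!-- hbase-site.xml (illustrative): a value of 0 disables the
     time-triggered major compaction so it can be run manually -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>
<!-- the split threshold discussed above, 256 MB default -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>268435456</value>
</property>
```

A cron job in the off-peak window can then run `major_compact 'mytable'` from the HBase shell (`mytable` is a placeholder name).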

When an HBase region server has been down longer than a configured timeout, the HMaster redistributes its regions to the surviving region servers. Since both the data and the write-ahead logs are persisted in HDFS, this does not lose data; however, the reassigned regions must replay the logs to rebuild the failed server's in-memory tables, and during that window those regions cannot serve requests. Furthermore, once redistribution has happened, the failed node rejoins the cluster as effectively a new region server, and regions have to be moved onto it again for balance. The timeout should therefore be tuned to the situation: in the common case a crashed node recovers simply by restarting, and if region redistribution plus log replay would take longer than, say, the ten minutes a restart needs, it is better to just wait for the restart. Making the region server's memstore highly available across nodes remains one of HBase's bigger challenges. OceanBase also keeps the latest updates in an in-memory table, but unlike HBase it uses a centralized UpdateServer, so it only has to get the failover of that one UpdateServer right to minimize the impact on business continuity. Distributed versus centralized, and which functions to distribute and which to centralize, with each system striking its own balance, is a key differentiator among today's distributed databases and stores. Fully decentralized systems like Cassandra look architecturally elegant but tend to bring more problems in practice.
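In 0.90-era HBase the detection timeout is governed by the region server's ZooKeeper session, so tuning it means adjusting that session timeout; a sketch, with an illustrative value rather than a recommendation:

```xml
<!-- hbase-site.xml (illustrative): how long a dead region server may go
     undetected before the master reassigns its regions, in milliseconds -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>180000</value> <!-- 3 minutes -->
</property>
```

A longer timeout trades slower failover for tolerance of brief GC pauses and quick restarts, which is exactly the trade-off discussed above.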

For Java applications, one of the biggest challenges in production operations is heap management. Depending on the GC mode, and on how much memory the in-memory tables and cache consume, you can see partial pauses, global stop-the-world pauses, or OOMs, and many of HBase's parameters exist precisely to handle these two situations. HBase uses the relatively new CMS collector (-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode).
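In a typical deployment these flags go into conf/hbase-env.sh; a sketch under the assumption of a 0.90-era JVM (the exact flag set, including the strictness flag -XX:+UseCMSInitiatingOccupancyOnly, is illustrative, not an official recommendation):

```shell
# conf/hbase-env.sh (illustrative): CMS settings for a region server.
# UseCMSInitiatingOccupancyOnly makes the JVM honor the fraction strictly
# instead of using its own adaptive estimate.
export HBASE_OPTS="$HBASE_OPTS \
  -XX:+UseConcMarkSweepGC \
  -XX:+CMSIncrementalMode \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly"
```

The value 70 for the occupancy fraction is explained below.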

By default a CMS collection is triggered when the old generation reaches 90% occupancy; this percentage is controlled by the -XX:CMSInitiatingOccupancyFraction=N flag. A concurrent mode failure happens in the following scenario: when the old generation reaches 90%, CMS begins a concurrent collection, while at the same time the young generation keeps promoting objects into the old generation at a high rate. If the old generation fills up before the CMS concurrent mark finishes, disaster strikes: with no memory left, CMS has to abandon the mark and trigger a JVM-wide stop-the-world pause (all threads suspended), then reclaim all garbage with a single-threaded copying collection. This can take a very long time. To avoid concurrent mode failures, GC should be triggered before occupancy reaches 90%, by lowering -XX:CMSInitiatingOccupancyFraction=N.

Picking the percentage can be done with simple arithmetic: if your hfile.block.cache.size and hbase.regionserver.global.memstore.upperLimit add up to 60% of the heap (the default), set it to 70-80; roughly 10 points above steady-state usage is about right.
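The arithmetic above can be sketched as a tiny helper; the 10% headroom and the 0.2/0.4 defaults come from the text, and nothing here queries a live cluster:

```python
# Pick -XX:CMSInitiatingOccupancyFraction roughly 10 points above the
# steady-state old-gen usage implied by the HBase cache/memstore settings.

def cms_occupancy_fraction(block_cache, memstore_upper, headroom=0.10):
    """block_cache:     hfile.block.cache.size (fraction of heap)
    memstore_upper:  hbase.regionserver.global.memstore.upperLimit
    Returns a whole percentage for the CMS trigger threshold."""
    return int(round((block_cache + memstore_upper + headroom) * 100))

# With the 0.90-era defaults (0.2 + 0.4 = 60% of the heap):
print(cms_occupancy_fraction(0.2, 0.4))  # 70
```

If your cache and memstore settings differ from the defaults, plug in your own fractions; the point is simply to leave CMS enough headroom to finish marking before the old generation fills.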

(The CMS GC explanation above is adapted from "HBase performance tuning".)

There are not many books on HBase yet. The second edition of Hadoop: The Definitive Guide devotes a chapter to it, and the most authoritative reference is the official online book.

These are some odds and ends from my recent reading on HBase, recorded here for future reference.

Cassandra 0.7 Is Worth Looking Forward To

The Cassandra wiki has long carried descriptions of the 0.7 features, several of them quite attractive, and on August 13 the Cassandra 0.7 beta1 release finally shipped; it can be downloaded here.

The major new features I care most about:

1. Keyspace and column family definitions can be changed on a live cluster; there is no longer any need to stop the cluster and edit the configuration file.
2. Secondary indexes are supported: columns can be indexed, and the get_indexed_slices API queries by column value.
3. A column family can now be truncated.
4. replica_placement_strategy and replication_factor can be set per keyspace.
5. The row cache delivers an 8x improvement in read performance. In tests of earlier versions, Cassandra's write performance was impressive while its reads fell short.
6. Hadoop-format output is supported, making it easier for a data warehouse to extract data from Cassandra.
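As a sketch of item 4, a per-keyspace definition in the beta's cassandra.yaml looks roughly like the fragment below; the attribute names follow the 0.7 beta1 sample configuration, Keyspace1/Standard1 are placeholder names, and the strategy class name varied across 0.7 builds (RackUnawareStrategy was later renamed SimpleStrategy):

```yaml
# cassandra.yaml (illustrative 0.7 beta1 fragment)
keyspaces:
    - name: Keyspace1
      replica_placement_strategy: org.apache.cassandra.locator.RackUnawareStrategy
      replication_factor: 3
      column_families:
          - name: Standard1
            compare_with: BytesType
```

Note that, per the 0.7.0 release notes quoted below, these yaml-defined keyspaces are later ignored on startup (CASSANDRA-44) and schema lives in the system table instead.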

In addition, the configuration file has switched from XML to YAML, which reads more smoothly. There are many small improvements in the code as well, which will take some time to go through. Looking forward to 0.7 going GA soon.

0.7.0
=====

Features
--------
    - Row keys are now bytes: keys stored by versions prior to 0.7.0 will be
      returned as UTF-8 encoded bytes. OrderPreservingPartitioner and
      CollatingOrderPreservingPartitioner continue to expect that keys contain
      UTF-8 encoded strings, but RandomPartitioner no longer expects strings.
    - A new ByteOrderedPartitioner supports bytes keys with arbitrary content,
      and orders keys by their byte value. 
    - Truncate thrift method allows clearing an entire ColumnFamily at once
    - DatacenterShardStrategy is ready for use, enabling 
      ConsistencyLevel.DCQUORUM and DCQUORUMSYNC.  See comments in
      `cassandra.yaml`.
    - row size limit increased from 2GB to 2 billion columns
    - Hadoop OutputFormat support
    - Streaming data for repair or node movement no longer requires 
      anticompaction step first
    - keyspace is per-connection in the thrift API instead of per-call
    - optional round-robin scheduling between keyspaces for multitenant
      clusters
    - dynamic endpoint snitch mitigates the impact of impaired nodes
    - significantly faster reads from row cache
    - introduced IntegerType that is both faster than LongType and
      allows integers of both less and more bits than Long's 64

Configuration
------------
    - Configuration file renamed to cassandra.yaml and log4j.properties to
      log4j-server.properties
    - Added 'bin/config-converter' to convert existing storage-conf.xml or
      cassandra.xml files to a cassandra.yaml file. When executed, it will
      create a cassandra.yaml file in any directory containing a matching
      xml file.
    - The ThriftAddress and ThriftPort directives have been renamed to
      RPCAddress and RPCPort respectively.
    - The keyspaces defined in cassandra.yaml are ignored on startup as a
      result of CASSANDRA-44.  A JMX method has been exposed in the 
      StorageServiceMBean to force a schema load from cassandra.yaml. It
      is a one-shot affair though and you should conduct it on a seed node
      before other nodes. Subsequent restarts will load the schema from the 
      system table and attempts to load the schema from YAML will be ignored.  
      You should only have to do this for one node since new nodes will receive
      schema updates on startup from the seed node you updated manually. 
    - EndPointSnitch was renamed to RackInferringSnitch.  A new SimpleSnitch
      has been added.
    - RowWarningThresholdInMB replaced with in_memory_compaction_limit_in_mb
    - GCGraceSeconds is now per-ColumnFamily instead of global
    - Configuration of DatacenterShardStrategy is now a part of the keyspace
      definition using the strategy_options attribute.
      The datacenter.properties file is no longer used.

JMX
---
    - StreamingService moved from o.a.c.streaming to o.a.c.service
    - GMFD renamed to GOSSIP_STAGE
    - {Min,Mean,Max}RowCompactedSize renamed to {Min,Mean,Max}RowSize
      since it no longer has to wait until compaction to be computed

Thrift API
----------
    - Row keys are now 'bytes': see the Features list.
    - The return type for login() is now AccessLevel.
    - The get_string_property() method has been removed.
    - The get_string_list_property() method has been removed.

Other
-----
    - If extending AbstractType, make sure you follow the singleton pattern
      followed by Cassandra core AbstractType extensions.
      e.g. BytesType has a variable called 'instance' and an empty constructor
      with default access

0.7.0-beta1
 * sstable versioning (CASSANDRA-389)
 * switched to slf4j logging (CASSANDRA-625)
 * access levels for authentication/authorization (CASSANDRA-900)
 * add ReadRepairChance to CF definition (CASSANDRA-930)
 * fix heisenbug in system tests, especially common on OS X (CASSANDRA-944)
 * convert to byte[] keys internally and all public APIs (CASSANDRA-767)
 * ability to alter schema definitions on a live cluster (CASSANDRA-44)
 * renamed configuration file to cassandra.xml, and log4j.properties to
   log4j-server.properties, which must now be loaded from
   the classpath (which is how our scripts in bin/ have always done it)
   (CASSANDRA-971)
 * change get_count to require a SlicePredicate. create multi_get_count
   (CASSANDRA-744)
 * re-organized endpointsnitch implementations and added SimpleSnitch
   (CASSANDRA-994)
 * Added preload_row_cache option (CASSANDRA-946)
 * add CRC to commitlog header (CASSANDRA-999)
 * removed multiget thrift method (CASSANDRA-739)
 * removed deprecated batch_insert and get_range_slice methods (CASSANDRA-1065)
 * add truncate thrift method (CASSANDRA-531)
 * http mini-interface using mx4j (CASSANDRA-1068)
 * optimize away copy of sliced row on memtable read path (CASSANDRA-1046)
 * replace constant-size 2GB mmaped segments and special casing for index 
   entries spanning segment boundaries, with SegmentedFile that computes 
   segments that always contain entire entries/rows (CASSANDRA-1117)
 * avoid reading large rows into memory during compaction (CASSANDRA-16)
 * added hadoop OutputFormat (CASSANDRA-1101)
 * efficient Streaming (no more anticompaction) (CASSANDRA-579)
 * split commitlog header into separate file and add size checksum to
   mutations (CASSANDRA-1179)
 * avoid allocating a new byte[] for each mutation on replay (CASSANDRA-1219)
 * revise HH schema to be per-endpoint (CASSANDRA-1142)
 * add joining/leaving status to nodetool ring (CASSANDRA-1115)
 * allow multiple repair sessions per node (CASSANDRA-1190)
 * add dynamic endpoint snitch (CASSANDRA-981)
 * optimize away MessagingService for local range queries (CASSANDRA-1261)
 * make framed transport the default so malformed requests can't OOM the 
   server (CASSANDRA-475)
 * significantly faster reads from row cache (CASSANDRA-1267)
 * take advantage of row cache during range queries (CASSANDRA-1302)
 * make GCGraceSeconds a per-ColumnFamily value (CASSANDRA-1276)
 * keep persistent row size and column count statistics (CASSANDRA-1155)
 * add IntegerType (CASSANDRA-1282)
 * page within a single row during hinted handoff (CASSANDRA-1327)
 * push DatacenterShardStrategy configuration into keyspace definition,
   eliminating datacenter.properties. (CASSANDRA-1066)
 * optimize forward slices starting with '' and single-index-block name 
   queries by skipping the column index (CASSANDRA-1338)
 * streaming refactor (CASSANDRA-1189)

The Tao of Cassandra Operations v0.2

In a few recent trial deployments of Cassandra we ran into some problems, and while tracking them down I found that parts of my earlier understanding were incomplete or off the mark, so on top of v0.1 I have revised and extended a small portion of the material. Judging from real-world use, there is still much work to do on the stability of Cassandra nodes, and many details of operating a real system still need to be standardized step by step. Corrections are welcome for anything in this slide deck that is wrong, missing, or in need of refinement.

The Tao of Cassandra Operations

Compared with traditional relational databases such as Oracle and MySQL, one considerable weakness of NoSQL is the scarcity of documentation. Cassandra fares relatively well in this respect, and this slide deck is an introductory overview I assembled from material found online, combined with my understanding from a few days of browsing the source code. The title may be a bit grand, and some of it may reflect flawed understanding, so treat this as version 0.1. Some of our products will start using Cassandra soon, and with real operational experience I hope this will gradually grow into something like a best-practices handbook.
