Cassandra 0.7 Is Worth Looking Forward To

The Cassandra wiki has long described some of the features planned for 0.7, several of which are quite attractive, and on August 13 the Cassandra 0.7 beta1 release finally shipped; it can be downloaded here.

The main new features I personally care about most:

1. Keyspace and ColumnFamily definitions can now be added and changed online; there is no longer any need to stop the cluster and edit the configuration file.
2. Secondary indexes are supported: columns can be indexed, and the new get_indexed_slices call queries rows by column value (see the sketch after this list).
3. A column family can be truncated.
4. replica_placement_strategy and replication_factor can be set per keyspace.
5. The row cache improves read performance by roughly 8x. In tests of earlier versions, Cassandra's write performance was impressive while reads were disappointing.
6. Hadoop-format output is supported, which makes it easier for a data warehouse to pull data out of Cassandra.
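
As a concrete illustration of item 2, here is a minimal sketch of a secondary-index query against the 0.7 Thrift API in Java. The keyspace ("Keyspace1"), column family ("Users") and indexed column ("state") are made-up examples, and the constructor argument order is assumed from the thrift IDL, so treat this as a sketch rather than copy-paste code.

    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class IndexedSliceExample {
        private static ByteBuffer utf8(String s) throws Exception {
            return ByteBuffer.wrap(s.getBytes("UTF-8"));
        }

        public static void main(String[] args) throws Exception {
            // 0.7 uses the framed transport by default, RPC port 9160
            TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();

            // the keyspace is now bound to the connection, not passed per call
            client.set_keyspace("Keyspace1");

            // equivalent of WHERE state = 'UT' on an indexed column
            IndexExpression expr = new IndexExpression(utf8("state"), IndexOperator.EQ, utf8("UT"));
            IndexClause clause = new IndexClause(Arrays.asList(expr), utf8(""), 100);

            // an unbounded slice of up to 100 columns from each matching row
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(new SliceRange(utf8(""), utf8(""), false, 100));

            List<KeySlice> rows = client.get_indexed_slices(
                    new ColumnParent("Users"), clause, predicate, ConsistencyLevel.ONE);
            System.out.println("rows matched: " + rows.size());

            transport.close();
        }
    }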

In addition, the configuration file has moved from XML to YAML, which is easier to read. There are also many smaller improvements in the code that will take some time to go through. Here's hoping 0.7 goes GA soon.

0.7.0
=====

Features
--------
    - Row keys are now bytes: keys stored by versions prior to 0.7.0 will be
      returned as UTF-8 encoded bytes. OrderPreservingPartitioner and
      CollatingOrderPreservingPartitioner continue to expect that keys contain
      UTF-8 encoded strings, but RandomPartitioner no longer expects strings.
    - A new ByteOrderedPartitioner supports bytes keys with arbitrary content,
      and orders keys by their byte value. 
    - Truncate thrift method allows clearing an entire ColumnFamily at once
      (see the sketch after this list)
    - DatacenterShardStrategy is ready for use, enabling 
      ConsistencyLevel.DCQUORUM and DCQUORUMSYNC.  See comments in
      `cassandra.yaml`.
    - row size limit increased from 2GB to 2 billion columns
    - Hadoop OutputFormat support
    - Streaming data for repair or node movement no longer requires 
      anticompaction step first
    - keyspace is per-connection in the thrift API instead of per-call
    - optional round-robin scheduling between keyspaces for multitenant
      clusters
    - dynamic endpoint snitch mitigates the impact of impaired nodes
    - significantly faster reads from row cache
    - introduced IntegerType that is both faster than LongType and
      allows integers of both less and more bits than Long's 64
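
Two of the items above, the per-connection keyspace and the new truncate call, change day-to-day client code. A minimal hedged sketch in Java, assuming a node on the default RPC port and an existing keyspace and column family named "Keyspace1"/"Users":

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class TruncateExample {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();

            // keyspace is bound once per connection instead of being passed on every call
            client.set_keyspace("Keyspace1");

            // clear all rows of the column family in a single call
            client.truncate("Users");

            transport.close();
        }
    }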

Configuration
-------------
    - Configuration file renamed to cassandra.yaml and log4j.properties to
      log4j-server.properties
    - Added 'bin/config-converter' to convert existing storage-conf.xml or
      cassandra.xml files to a cassandra.yaml file. When executed, it will
      create a cassandra.yaml file in any directory containing a matching
      xml file.
    - The ThriftAddress and ThriftPort directives have been renamed to
      RPCAddress and RPCPort respectively.
    - The keyspaces defined in cassandra.yaml are ignored on startup as a
      result of CASSANDRA-44.  A JMX method has been exposed in the
      StorageServiceMBean to force a schema load from cassandra.yaml (see
      the sketch after this section). It is a one-shot affair though, and
      you should conduct it on a seed node before other nodes. Subsequent
      restarts will load the schema from the system table and attempts to
      load the schema from YAML will be ignored. You should only have to
      do this for one node since new nodes will receive schema updates on
      startup from the seed node you updated manually.
    - EndPointSnitch was renamed to RackInferringSnitch.  A new SimpleSnitch
      has been added.
    - RowWarningThresholdInMB replaced with in_memory_compaction_limit_in_mb
    - GCGraceSeconds is now per-ColumnFamily instead of global
    - Configuration of DatacenterShardStrategy is now a part of the keyspace
      definition using the strategy_options attribute.
      The datacenter.properties file is no longer used.
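
The one-shot schema load described above is normally triggered from jconsole, but it can also be scripted over JMX. In the sketch below the MBean name follows the usual StorageService naming, while the operation name loadSchemaFromYAML and the JMX port 8080 are assumptions to verify against your build:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class LoadSchemaFromYaml {
        public static void main(String[] args) throws Exception {
            // 8080 is the JMX port used by the 0.7-era startup scripts; adjust if needed
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");

            // operation name assumed from the note above; confirm it in jconsole first
            mbs.invoke(storageService, "loadSchemaFromYAML", new Object[0], new String[0]);

            connector.close();
        }
    }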

JMX
---
    - StreamingService moved from o.a.c.streaming to o.a.c.service
    - GMFD renamed to GOSSIP_STAGE
    - {Min,Mean,Max}RowCompactedSize renamed to {Min,Mean,Max}RowSize
      since it no longer has to wait until compaction to be computed

Thrift API
----------
    - Row keys are now 'bytes': see the Features list.
    - The return type for login() is now AccessLevel.
    - The get_string_property() method has been removed.
    - The get_string_list_property() method has been removed.
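
The change to login() above is visible in client code: it now returns the caller's AccessLevel. A hedged sketch, assuming SimpleAuthenticator-style credentials; the "username"/"password" keys and the ordering relative to set_keyspace depend on the configured IAuthenticator:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.cassandra.thrift.AccessLevel;
    import org.apache.cassandra.thrift.AuthenticationRequest;
    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class LoginExample {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();

            // credential keys assumed to match SimpleAuthenticator's expectations
            Map<String, String> credentials = new HashMap<String, String>();
            credentials.put("username", "jsmith");
            credentials.put("password", "secret");

            // 0.7: login() returns the caller's AccessLevel instead of void
            AccessLevel level = client.login(new AuthenticationRequest(credentials));
            System.out.println("access level: " + level);

            client.set_keyspace("Keyspace1");
            // ... normal reads and writes follow ...

            transport.close();
        }
    }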

Other
-----
    - If extending AbstractType, make sure you follow the singleton pattern
      followed by Cassandra core AbstractType extensions.
      e.g. BytesType has a variable called 'instance' and an empty constructor
      with default access
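
A minimal sketch of that singleton convention. The comparator itself (a reversed bytewise ordering) and the exact set of methods to override are only illustrative; check BytesType in the 0.7 source tree for the authoritative shape before writing a real custom type:

    import java.nio.ByteBuffer;

    import org.apache.cassandra.db.marshal.AbstractType;
    import org.apache.cassandra.db.marshal.BytesType;

    // Follows the convention used by the core types: a static 'instance'
    // singleton plus an empty constructor with default (package-private) access.
    public class ReversedBytesType extends AbstractType {
        public static final ReversedBytesType instance = new ReversedBytesType();

        ReversedBytesType() {} // default access, as in BytesType

        // NOTE: the overrides below are illustrative; match them to the abstract
        // methods actually declared by AbstractType in the 0.7 tree you build against.
        public int compare(ByteBuffer o1, ByteBuffer o2) {
            return BytesType.instance.compare(o2, o1); // reverse the bytewise order
        }

        public String getString(ByteBuffer bytes) {
            return BytesType.instance.getString(bytes);
        }
    }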

0.7.0-beta1
 * sstable versioning (CASSANDRA-389)
 * switched to slf4j logging (CASSANDRA-625)
 * access levels for authentication/authorization (CASSANDRA-900)
 * add ReadRepairChance to CF definition (CASSANDRA-930)
 * fix heisenbug in system tests, especially common on OS X (CASSANDRA-944)
 * convert to byte[] keys internally and all public APIs (CASSANDRA-767)
 * ability to alter schema definitions on a live cluster (CASSANDRA-44)
 * renamed configuration file to cassandra.xml, and log4j.properties to
   log4j-server.properties, which must now be loaded from
   the classpath (which is how our scripts in bin/ have always done it)
   (CASSANDRA-971)
 * change get_count to require a SlicePredicate. create multi_get_count
   (CASSANDRA-744)
 * re-organized endpointsnitch implementations and added SimpleSnitch
   (CASSANDRA-994)
 * Added preload_row_cache option (CASSANDRA-946)
 * add CRC to commitlog header (CASSANDRA-999)
 * removed multiget thrift method (CASSANDRA-739)
 * removed deprecated batch_insert and get_range_slice methods (CASSANDRA-1065)
 * add truncate thrift method (CASSANDRA-531)
 * http mini-interface using mx4j (CASSANDRA-1068)
 * optimize away copy of sliced row on memtable read path (CASSANDRA-1046)
 * replace constant-size 2GB mmaped segments and special casing for index 
   entries spanning segment boundaries, with SegmentedFile that computes 
   segments that always contain entire entries/rows (CASSANDRA-1117)
 * avoid reading large rows into memory during compaction (CASSANDRA-16)
 * added hadoop OutputFormat (CASSANDRA-1101)
 * efficient Streaming (no more anticompaction) (CASSANDRA-579)
 * split commitlog header into separate file and add size checksum to
   mutations (CASSANDRA-1179)
 * avoid allocating a new byte[] for each mutation on replay (CASSANDRA-1219)
 * revise HH schema to be per-endpoint (CASSANDRA-1142)
 * add joining/leaving status to nodetool ring (CASSANDRA-1115)
 * allow multiple repair sessions per node (CASSANDRA-1190)
 * add dynamic endpoint snitch (CASSANDRA-981)
 * optimize away MessagingService for local range queries (CASSANDRA-1261)
 * make framed transport the default so malformed requests can't OOM the 
   server (CASSANDRA-475)
 * significantly faster reads from row cache (CASSANDRA-1267)
 * take advantage of row cache during range queries (CASSANDRA-1302)
 * make GCGraceSeconds a per-ColumnFamily value (CASSANDRA-1276)
 * keep persistent row size and column count statistics (CASSANDRA-1155)
 * add IntegerType (CASSANDRA-1282)
 * page within a single row during hinted handoff (CASSANDRA-1327)
 * push DatacenterShardStrategy configuration into keyspace definition,
   eliminating datacenter.properties. (CASSANDRA-1066)
 * optimize forward slices starting with '' and single-index-block name 
   queries by skipping the column index (CASSANDRA-1338)
 * streaming refactor (CASSANDRA-1189)
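
One item above that touches existing client code directly is the get_count change (CASSANDRA-744): it now requires a SlicePredicate, so a count is always taken over an explicit slice. A hedged sketch in Java; the keyspace, column family and key are made up:

    import java.nio.ByteBuffer;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class GetCountExample {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("Keyspace1");

            ByteBuffer key = ByteBuffer.wrap("jsmith".getBytes("UTF-8"));
            ByteBuffer empty = ByteBuffer.wrap(new byte[0]);

            // count at most 1000 columns over the whole row instead of relying on an
            // implicit "count everything"
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(new SliceRange(empty, empty, false, 1000));

            int count = client.get_count(key, new ColumnParent("Users"),
                                         predicate, ConsistencyLevel.ONE);
            System.out.println("columns in slice: " + count);

            transport.close();
        }
    }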



3 Comments

  • At 2011.03.07 22:35, theseus yang said:

    Hi Jiangfeng,
    I wonder whether you have looked at GemFire, the distributed caching and data management platform under SpringSource. I am convinced that GemFire performs better and more stably than Cassandra or HBase. Also, Taobao's Oracle RAC has now grown to 24 nodes, and I hear that the more RAC nodes you add, the worse the performance degradation gets, so how do you deal with the I/O bottleneck? I have always been puzzled about how Taobao pulls this off, and I hope we get a chance to talk. Thanks!

    • At 2011.03.09 16:27, NinGoo said:

      RAC is already a thing of the past; the shared-storage architecture stopped being able to meet the needs of large-scale data analysis long ago. Hadoop already has a big first-mover advantage among distributed computing platforms, and it really can meet the current needs for data analysis and computation.

      • At 2011.04.14 12:51, theseus yang said:

        Large enterprise applications are still built on the O.I.B stack today. It can no longer carry heavy business loads, and many enterprises are trying to move to distributed computing platforms, but without making major changes to the existing system architecture, without a downtime cutover, and without affecting the performance of the running production system, how do you migrate the Oracle data into a distributed caching system? Log sniffing? Is there any good way to migrate the data?

