Some Notes About Ops in Druid OLAP System

I have written about Druid a lot on this blog, including the architecture of Druid and how to change the time zone in Druid. This post focuses on the basic operations in Druid that we perform every day.

First, how to write a Hadoop MapReduce index spec file:


{
  "dataSchema" : {
    "dataSource" : "ingestion_test",
    "parser" : {
      "type" : "hadoopyString",
      "parseSpec" : {
        "format" : "tsv",
        "timestampSpec" : {
          "column" : "dt",
          "format" : "posix"
        },
        "dimensionsSpec" : {
          "dimensions": ["grade","src_flag","school_id","gender_desc","prov_name","city_name"]
        },
        "delimiter":"\u0001",
        "listDelimiter":"\u0002",
        "columns":  ["dt","uid","grade","src_flag","school_id","gender_desc","prov_name","city_name","flag"]
      }
    },
    "metricsSpec" : [
              {
                    "type": "count",
                    "name": "count_druid"
              },
              {
                    "type": "hyperUnique",
                    "name": "uv",
                    "fieldName" : "uid"
              },
              {
                    "type": "longSum",
                    "name": "count",
                    "fieldName" : "flag"
              }
    ],
    "granularitySpec" : {
      "type" : "uniform",
      "segmentGranularity" : "HOUR",
      "queryGranularity" : "NONE",
      "intervals" : [ "2017-3-13/2017-3-14" ]
    }
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "static",
      "paths" : "/data//000000_0"
    },
    "metadataUpdateSpec" : {
                "type":"mysql",
                "connectURI":"jdbc:mysql://ip:3306/druid",
                "password" : "password",
                "user" : "user",
                "segmentTable" : "druid_segments"
    },
    "segmentOutputPath" : "hdfs://ns1/user/druid/localStorage"
  },
  "tuningConfig" : {
    "type" : "hadoop",
    "workingPath": "hdfs://ns1/user/druid/localStorage/workdir",
    "partitionsSpec" : {
      "type" : "hashed",
      "numShards" : 3
    },
    "shardSpecs" : { },
    "leaveIntermediate" : false,
    "cleanupOnFailure" : true,
    "overwriteFiles" : false,
    "ignoreInvalidRows" : false,
    "jobProperties" : { },
    "combineText" : false,
    "persistInHeap" : false,
    "ingestOffheap" : false,
    "bufferSize" : 134217728,
    "aggregationBufferRatio" : 0.5,
    "rowFlushBoundary" : 300000,
    "useCombiner" : true,
    "buildV9Directly" : true
  }
}

In the spec file, you can control the number of reducers with the numShards parameter.
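
To actually run this spec, one option in Druid 0.8.x is the standalone command-line Hadoop indexer. Below is a minimal sketch: the spec file name my_hadoop_index_spec.json is hypothetical, and the heap size and classpath are assumptions modeled on the other commands in this post, so adjust them for your installation.

java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath config/_common:config/broker:/usr/local/hadoop-2.4.0/etc/hadoop:lib/* io.druid.cli.Main index hadoop my_hadoop_index_spec.json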

Second, an example spec file that writes directly to Druid using Tranquility:


{
  "dataSources": [
    {
      "spec": {
        "dataSchema": {
          "dataSource": "main_static_log_tranq1",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": {
                "column": "timestamp",
                "format": "posix"
              },
              "dimensionsSpec": {
                "dimensions": ["typeSignII",  "typeSignI", "typeSignIII", "typeSignIV",  "responseCode",  "processTotalTime", "serverIp", "terminal", "type", "service"],
                "dimensionExclusions": [],
                "spatialDimensions": []
              }
            }
          },
          "metricsSpec": [
            {
              "type": "count",
              "name": "count"
            },{
              "type": "doubleSum",
              "name": "mProcessTotalTime",
              "fieldName" : "mProcessTotalTime"
            }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "SIX_HOUR",
            "queryGranularity": "MINUTE"
          }
        },
        "tuningConfig": {
          "type": "realtime",
          "maxRowsInMemory": 100000,
          "intermediatePersistPeriod": "PT10m",
          "windowPeriod": "PT60m"
        }
      },
      "properties" : {
            "task.partitions" : "1",
            "task.replicants" : "2"
      }
    }
  ],
  "properties": {
    "zookeeper.connect": "10.39.2.161:2181",
    "druid.selectors.indexing.serviceName": "overlord",
    "druid.discovery.curator.path": "/druid/discovery",
    "druidBeam.overlordPollPeriod": "PT20S"
  }
}

You can set the number of partitions and replicas with the task.partitions and task.replicants properties.
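
One common way to run this spec is with Tranquility Server, which exposes an HTTP endpoint for event submission. The sketch below assumes that setup: the config file path conf/server.json, the default port 8200, and the event field values are all assumptions for illustration.

# start Tranquility Server with the config above (assumed to be saved as conf/server.json)
bin/tranquility server -configFile conf/server.json

# post one test event to the dataSource; the field values here are invented for illustration
curl -X POST -H 'Content-Type: application/json' --data '{"timestamp": 1490601600, "typeSignI": "api", "responseCode": "200", "serverIp": "10.39.2.161", "mProcessTotalTime": 12.5}' http://localhost:8200/v1/post/main_static_log_tranq1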

How to Submit a Merge Task to the Index Service in Druid

Normally, we use the index service instead of realtime nodes in Druid to ingest realtime data. If you have multiple partitions in one time interval, and each of them is small, you have to merge them together into one big segment to boost query efficiency.
For example, we have two segments in the same time interval, as shown below:

(figure: two segments in the same time interval)
What we want is to merge them into one segment. Here is how to write the merge.json file and submit it:


{
    "type": "merge",
    "dataSource": "main_static_log_tt",
    "aggregations": [
                {
                    "type": "count",
                    "name": "count"
                },{
                    "type": "doubleSum",
                    "name": "mProcessTotalTime",
                    "fieldName" : "mProcessTotalTime"
                }
    ],
    "rollup": "false",
    "segments": [
{"dataSource":"main_static_log_tt","interval":"2017-03-27T10:05:00.000Z/2017-03-27T10:06:00.000Z","version":"2017-03-27T10:05:00.000Z","loadSpec":{"type":"local","path":"/data0/test/file/main_static_log_tt/2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z/2017-03-27T10:05:00.000Z/0/index.zip"},"dimensions":"processTotalTime,responseCode,serverIp,typeSignI,typeSignII,typeSignIII,typeSignIV","metrics":"count,mProcessTotalTime","shardSpec":{"type":"none"},"binaryVersion":9,"size":129991,"identifier":"main_static_log_tt_2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z_2017-03-27T10:05:00.000Z"},
{"dataSource":"main_static_log_tt","interval":"2017-03-27T10:05:00.000Z/2017-03-27T10:06:00.000Z","version":"2017-03-27T10:05:00.000Z","loadSpec":{"type":"local","path":"/data0/test/file/main_static_log_tt/2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z/2017-03-27T10:05:00.000Z/1/index.zip"},"dimensions":"processTotalTime,responseCode,serverIp,typeSignI,typeSignII,typeSignIII,typeSignIV","metrics":"count,mProcessTotalTime","shardSpec":{"type":"none"},"binaryVersion":9,"size":190243,"identifier":"main_static_log_tt_2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z_2017-03-27T10:05:00.000Z_1"}
    ]
}

Remember to change the shardSpec type to none, because the merge task only merges segments of that type and ignores hashed or linear shard specs. We can work around this by simply changing the type to none, but that has some problems; in a later post I will talk about how to change the code to make it work properly.
After editing the JSON file, you can submit it to your Overlord node as below:

curl http://host:port/druid/indexer/v1/task -H "Content-Type:application/json" -X POST --data @merge.json
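
The POST returns a task id. You can then poll the Overlord's status endpoint to see whether the merge has finished; replace <task_id> with the id returned by the submit call:

curl http://host:port/druid/indexer/v1/task/<task_id>/status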

Kill the Job

Sometimes you just want to run a test, and later on you can kill the task to free the slot in the index service. Here is how to write the kill.json file and submit it:


{
    "type": "kill",
    "id": "sbsina",
    "dataSource": "main_static_log_tt",
    "interval": "2017-03-22T07:47:00.000Z/2017-03-28T07:48:00.000Z"
}

Submit it:

curl http://host:port/druid/indexer/v1/task -H "Content-Type:application/json" -X POST --data @kill.json
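
To confirm the kill task ran, and to see what else is occupying index-service slots, the Overlord also exposes task list endpoints:

curl http://host:port/druid/indexer/v1/runningTasks
curl http://host:port/druid/indexer/v1/pendingTasks
curl http://host:port/druid/indexer/v1/completeTasks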

Disable a Middle Manager Before Updating

Submit a POST request to the Middle Manager's HTTP port:

curl -X POST http://ip:port/druid/worker/v1/disable
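
After the update is done, the same worker API can bring the Middle Manager back and show its state. To my knowledge the companion endpoints below exist alongside /disable, but verify them against your Druid version:

# check whether the worker is currently accepting tasks
curl http://ip:port/druid/worker/v1/enabled
# list the tasks still running on this worker (wait for it to drain before restarting)
curl http://ip:port/druid/worker/v1/tasks
# re-enable the worker after the update
curl -X POST http://ip:port/druid/worker/v1/enable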

Druid Ingest Format Issue

When using Druid we need to index historical data. Since the historical data lives in Hive tables whose field delimiter is \001, Druid's ingest format has to support arbitrary delimiters. The following parser spec shows how:

"parser" : {
"type" : "hadoopyString",
"parseSpec" : {
"format" : "tsv",
"timestampSpec" : {
"column" : "dt",
"format" : "posix"
},
"dimensionsSpec" : {
"dimensions": ["grade","src_flag","school_name","gender_desc","prov_name","city_name","school_prov"]
},
"delimiter":"\u0001",
"listDelimiter":"\u0002",
"columns": ["dt","uid","grade","src_flag","school_name","gender_desc","prov_name","city_name","school_prov","flag"]
}
},

The parse format is set to tsv and the delimiter to \u0001. listDelimiter is only used for multi-value fields; we do not use it here, but define it as \u0002 anyway. Then just start the task.
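
If you want to sanity-check what a \001-delimited Hive row looks like before indexing it, a quick sketch is below; the field values are made up and simply follow the "columns" order above.

# write one fake row using ^A (\x01) as the field separator
printf '1489363200\x01100001\x01grade_1\x011\x01school_a\x01male\x01Beijing\x01Beijing\x01Beijing\x011\n' > sample_row
# cat -v renders \x01 as ^A so you can count the ten fields
cat -v sample_row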

Summary of Druid Batch Job Issues

Recently I have been going through the problems I ran into with Druid batch jobs, and I record them one by one below. The Druid version I use is 0.8.2, so the following solutions apply to 0.8.2.
1. Dependency issues
Because many of our machines cannot reach the external network, we have to modify Druid's source so that it downloads dependencies from our local repository. The file to modify is ExtensionsConfig.java, and the change is as follows:

   private List<String> remoteRepositories = ImmutableList.of(
-      "https://repo1.maven.org/maven2/",
-      "https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local"
+      "http://10.39.0.110:8081/nexus/content/groups/public",
+      "http://10.39.0.110:8081/nexus/content/repositories/thirdparty"
   );

With this change, the required packages will be downloaded from our internal repository.
After the change, we can pull the required packages to the local machine for job submission. Edit $DRUID_HOME/config/_common/common.runtime.properties:

druid.extensions.localRepository=/usr/local/druid-0.8.2/localRepo

This specifies the location of the local repository. Then run the following in the $DRUID_HOME directory:

java -cp  config/_common:config/broker:/usr/local/hadoop-2.4.0/etc/hadoop:lib/*  io.druid.cli.Main tools pull-deps

This downloads the dependencies.
If you need to add your own third-party dependencies, also edit $DRUID_HOME/config/_common/common.runtime.properties:

druid.extensions.coordinates=["com.sina.hivexec:hive-exec:0.13.0"]

This way the dependencies are added to the classpath when the job is submitted.
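
After pull-deps finishes, you can check that the artifacts actually landed in the local repository configured above; the path is the druid.extensions.localRepository value and the jar name matches the example coordinate:

find /usr/local/druid-0.8.2/localRepo -name 'hive-exec*'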

2. Counter issues
Our problem was that a third-party package caused an exception when fetching the MapReduce counters, which in turn failed the whole task. The current workaround is to modify the source to catch the exception, check whether it was caused by reading the counters and whether the job actually succeeded, and if so continue with the next step instead of throwing the exception and aborting.

3. Number of reducers
A brief description of Druid's flow, only for the case where the partitionsSpec type is hashed:
1. If numShards is not specified, there are two jobs. The first job uses HyperLogLog to estimate the cardinality of each time partition, and its reducers output the cardinality of each partition.
Based on targetPartitionSize, Druid then decides how many shards will run the second job, which builds the index; the basic flow is the same as on a realtime node, building the index from the log data. If there is only one shard, effectively only one reducer runs, which is slow. For example, if the cardinality is 20000 and "targetPartitionSize" : 10000, then each time partition gets 20000 / 10000 = 2 reducers.
2. If numShards is specified, there is only the index job, and each time partition starts numShards reducers. If you know the approximate data volume and cardinality, you can specify numShards directly.

4. Time zone issues
Since the job is submitted with the UTC time zone, the map and reduce stages must use UTC as well. To do that, edit mapred-site.xml on the machine that submits the job:

  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1280M -Duser.timezone=UTC</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1536M -Duser.timezone=UTC</value>
  </property>

How to Avoid Druid Filling Up the /tmp Directory

Recently I noticed that after I started some of the realtime nodes, Druid easily filled up the /tmp directory. The files all start with filePeon.
After investigating the code and the configuration of Druid, I found that Druid writes its index files under druid.indexer.task.baseDir, whose default value is System.getProperty("java.io.tmpdir").
So we can point java.io.tmpdir to another directory when we start the realtime node, as below:

java -Djava.io.tmpdir=/data0/druid/tmp -Xmx10g -Xms10g -XX:NewSize=2g -XX:MaxNewSize=2g -XX:MaxDirectMemorySize=25g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=config/realtime/wb_ad_interest_druid.spec -classpath config/_common:config/realtime:/usr/local/hadoop-2.4.0/etc/hadoop:lib/* io.druid.cli.Main server realtime
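
To confirm the change took effect, you can check where the filePeon files are now being created; the paths below are simply the ones used in this post.

# /tmp should no longer accumulate filePeon files
ls /tmp | grep filePeon
# the peon temp files should show up under the new directory instead
du -sh /data0/druid/tmp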

Druid System Architecture Notes (Part 2)

1. Overview

Following up on the previous post, Druid System Architecture Notes, which covered Druid's basic architecture and usage, this post describes using the Realtime Index Service instead of the previously introduced realtime nodes to handle realtime ingestion, index building, hand-off, and related tasks.
First, some differences between realtime nodes and the index service:
(figure: comparison of realtime nodes and the index service)
As the figure shows, once a Druid cluster grows large, using the Realtime Index Service becomes necessary.

2. Architecture and Workflow

Compared with the architecture described in the previous post, a Druid deployment that uses the Realtime Index Service adds several components. The current system architecture is shown below:
(figure: Druid push model)
The previous post mainly described Druid's pull model: individual Realtime Nodes pull data from Kafka and similar sources, build indexes, and hand segments off to Historical Nodes. As Druid workloads and cluster size grew, managing the Realtime Nodes became very tedious, so Druid developed a push model to solve this. I believe this is a problem most distributed systems eventually have to solve: making deployment and operations simple and automated.
This post focuses on the push model, which adds several roles: the Overlord node, the MiddleManager node, Peons, and the client-side Tranquility. The functions of each module and the overall workflow are described below.

(1) Roles

1. Tranquility
The client-side sender. Users send data to Druid in real time through Tranquility. Tranquility talks to ZooKeeper, interacts with the Overlord, and routes valid data to Peons according to the event timestamp.
2. Overlord
Assigns tasks to different Middle Managers, similar to the ResourceManager in YARN.
3. Middle Manager
Starts Peons for the tasks it is assigned and monitors the Peons' running state, similar to the NodeManager.
4. Peon
A Peon takes over most of the functionality of a Realtime Node; it is launched by the Middle Manager as a separate process.

(2) Workflow

1. The user's spec file is defined in Tranquility. Tranquility initializes from the spec, obtains the Overlord's address from ZooKeeper, and communicates with the Overlord.
2. When the Overlord receives a new ingestion task, it looks up the node information in ZooKeeper, picks a Middle Manager node on which to start a Peon, and writes the task information to ZooKeeper.
3. The Middle Manager watches ZooKeeper; when it sees a newly assigned task, it starts a Peon process and monitors the Peon's state.
4. A Peon's workflow is basically the same as a Realtime Node's, except that a Peon receives data over an HTTP interface, whereas a Realtime Node mostly uses internal threads to continuously pull data from Kafka.
5. Tranquility then obtains the Peon's host and port from ZooKeeper and keeps sending data to the Peon (the service registrations involved in steps 1, 2, and 5 can be inspected in ZooKeeper; see the sketch after this list).
6. Following the spec, the Peon periodically, or once enough rows have accumulated, builds an index from the data and hands it off to deep storage (HDFS).
7. The Coordinator then uses the Peon's information in ZooKeeper to write the segment metadata to the SQL metadata store and assigns Historical Nodes to pull the index data from deep storage.
8. The Historical Nodes pull the index data from deep storage to local disk and rebuild the index in memory; at this point the data flow is complete.
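
As a small aid for following the flow above, the service registrations that Tranquility and the Overlord use can be listed directly from ZooKeeper with the stock zkCli.sh client. This is only a sketch: the ZooKeeper address and the discovery path are taken from the Tranquility config earlier in this post, and the exact child nodes you see will depend on your deployment.

# list the services (overlord, middle managers, peons) registered under the discovery path
zkCli.sh -server 10.39.2.161:2181 ls /druid/discovery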

3. Summary

With the push model of the Realtime Index Service, deploying, operating, and managing Druid becomes much simpler and easier to use. Later posts will dig into the Druid code.