Some Notes About Ops in Druid OLAP System

I have talked about Druid a lot in blog, including the architecture of druid and how to change time zone in druid. This post will focus on the basic operation in druid which operate everyday.

First, how to write the Hadoop map reduce index spec file


{
  "dataSchema" : {
    "dataSource" : "ingestion_test",
    "parser" : {
      "type" : "hadoopyString",
      "parseSpec" : {
        "format" : "tsv",
        "timestampSpec" : {
          "column" : "dt",
          "format" : "posix"
        },
        "dimensionsSpec" : {
          "dimensions": ["grade","src_flag","school_id","gender_desc","prov_name","city_name"]
        },
        "delimiter":"\u0001",
        "listDelimiter":"\u0002",
        "columns":  ["dt","uid","grade","src_flag","school_id","gender_desc","prov_name","city_name","flag"]
      }
    },
    "metricsSpec" : [
              {
                    "type": "count",
                    "name": "count_druid"
              },
              {
                    "type": "hyperUnique",
                    "name": "uv",
                    "fieldName" : "uid"
              },
              {
                    "type": "longSum",
                    "name": "count",
                    "fieldName" : "flag"
              }
    ],
    "granularitySpec" : {
      "type" : "uniform",
      "segmentGranularity" : "HOUR",
      "queryGranularity" : "NONE",
      "intervals" : [ "2017-3-13/2017-3-14" ]
    }
  },
  "ioConfig" : {
    "type" : "hadoop",
    "inputSpec" : {
      "type" : "static",
      "paths" : "/data//000000_0"
    },
    "metadataUpdateSpec" : {
                "type":"mysql",
                "connectURI":"jdbc:mysql://ip:3306/druid",
                "password" : "password",
                "user" : "user",
                "segmentTable" : "druid_segments"
    },
    "segmentOutputPath" : "hdfs://ns1/user/druid/localStorage"
  },
  "tuningConfig" : {
    "type" : "hadoop",
    "workingPath": "hdfs://ns1/user/druid/localStorage/workdir",
    "partitionsSpec" : {
      "type" : "hashed",
      "numShards" : 3
    },
    "shardSpecs" : { },
    "leaveIntermediate" : false,
    "cleanupOnFailure" : true,
    "overwriteFiles" : false,
    "ignoreInvalidRows" : false,
    "jobProperties" : { },
    "combineText" : false,
    "persistInHeap" : false,
    "ingestOffheap" : false,
    "bufferSize" : 134217728,
    "aggregationBufferRatio" : 0.5,
    "rowFlushBoundary" : 300000,
    "useCombiner" : true,
    "buildV9Directly" : true
  }
}

In the spec file ,you can assign reduce number in numShareds parameters.

Second, the example spec file which directly write to druid using tranquility


{
  "dataSources": [
    {
      "spec": {
        "dataSchema": {
          "dataSource": "main_static_log_tranq1",
          "parser": {
            "type": "string",
            "parseSpec": {
              "format": "json",
              "timestampSpec": {
                "column": "timestamp",
                "format": "posix"
              },
              "dimensionsSpec": {
                "dimensions": ["typeSignII",  "typeSignI", "typeSignIII", "typeSignIV",  "responseCode",  "processTotalTime", "serverIp", "terminal", "type", "service"],
                "dimensionExclusions": [],
                "spatialDimensions": []
              }
            }
          },
          "metricsSpec": [
            {
              "type": "count",
              "name": "count"
            },{
              "type": "doubleSum",
              "name": "mProcessTotalTime",
              "fieldName" : "mProcessTotalTime"
            }
          ],
          "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "SIX_HOUR",
            "queryGranularity": "MINUTE"
          }
        },
        "tuningConfig": {
          "type": "realtime",
          "maxRowsInMemory": 100000,
          "intermediatePersistPeriod": "PT10m",
          "windowPeriod": "PT60m"
        }
      },
      "properties" : {
            "task.partitions" : "1",
            "task.replicants" : "2"
      }
    }
  ],
  "properties": {
    "zookeeper.connect": "10.39.2.161:2181",
    "druid.selectors.indexing.serviceName": "overlord",
    "druid.discovery.curator.path": "/druid/discovery",
    "druidBeam.overlordPollPeriod": "PT20S"
  }
}

You can assign partitions and replications using task.partitions and task.replicants parameters.

How to Submit a Merge Job to Index Service in Druid

Normally, we use index service instead of realtime nodes in Druid to ingest realtime data. If you have multiple partitions in one time, and each of them is small, you have to merge them together to form a big segments to boost query effeciency.
For example we have two segments in the same time interval, just as below

two segments on the same time interval
What we care is to merge them together. Here is how to write the merge.json file and submit:


{
    "type": "merge",
    "dataSource": "main_static_log_tt",
    "aggregations": [
                {
                    "type": "count",
                    "name": "count"
                },{
                    "type": "doubleSum",
                    "name": "mProcessTotalTime",
                    "fieldName" : "mProcessTotalTime"
                }
    ],
    "rollup": "false",
    "segments": [
{"dataSource":"main_static_log_tt","interval":"2017-03-27T10:05:00.000Z/2017-03-27T10:06:00.000Z","version":"2017-03-27T10:05:00.000Z","loadSpec":{"type":"local","path":"/data0/test/file/main_static_log_tt/2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z/2017-03-27T10:05:00.000Z/0/index.zip"},"dimensions":"processTotalTime,responseCode,serverIp,typeSignI,typeSignII,typeSignIII,typeSignIV","metrics":"count,mProcessTotalTime","shardSpec":{"type":"none"},"binaryVersion":9,"size":129991,"identifier":"main_static_log_tt_2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z_2017-03-27T10:05:00.000Z"},
{"dataSource":"main_static_log_tt","interval":"2017-03-27T10:05:00.000Z/2017-03-27T10:06:00.000Z","version":"2017-03-27T10:05:00.000Z","loadSpec":{"type":"local","path":"/data0/test/file/main_static_log_tt/2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z/2017-03-27T10:05:00.000Z/1/index.zip"},"dimensions":"processTotalTime,responseCode,serverIp,typeSignI,typeSignII,typeSignIII,typeSignIV","metrics":"count,mProcessTotalTime","shardSpec":{"type":"none"},"binaryVersion":9,"size":190243,"identifier":"main_static_log_tt_2017-03-27T10:05:00.000Z_2017-03-27T10:06:00.000Z_2017-03-27T10:05:00.000Z_1"}
    ]
}

Remember to change the shardSpec type to none, because the merge function only merge that type, it ignore hash or linear type. But we can avoid it, we just change the type to none, it has some problem, later post i will talk about how to change the code and make it work.
After edit the json file, you can submit to your overlord node as below:

curl http://host:port/druid/indexer/v1/task -H "Content-Type:application/json" -X POST --data @merge.json

Kill the Job

Some time, you just want to take a test and later on you can kill the task to free the slot in index service. Here is how to write kill.json file and submit it:


{
    "type": "kill",
    "id": "sbsina",
    "dataSource": "main_static_log_tt",
    "interval": "2017-03-22T07:47:00.000Z/2017-03-28T07:48:00.000Z"
}

Submit it:

curl http://host:port/druid/indexer/v1/task -H "Content-Type:application/json" -X POST --data @kill.json

Disable Middle Manager to Update

submit a post to middle manager http port:

curl -X POST http://ip:port/druid/worker/v1/disable

谈谈HftpFileSystem

最近在排查一个distcp反复失败的问题,我们知道distcp是对两个集群进行数据拷贝的工具,对于远程集群或者版本不一致的集群,我们使用hftp拉取数据,使用的方法如下:

hadoop distcp hftp://sourceFS:50070/src hdfs://destFS:50070/dest

Hftp的Schema是hftp开头的,实现的类为HftpFileSystem。
那Hftp的读取流程是如何呢,比如读取一个文件,流程如下:
流程说明
仔细说明一下hftp的流程:
1、首先Hftp会拼出NameNode的http URL,所有的Hftp交互都是通过http端口进行的,NameNode默认的http端口是50070,代码为

f = f.makeQualified(getUri(), getWorkingDirectory());
String path = "/data" + ServletUtil.encodePath(f.toUri().getPath());
String query = addDelegationTokenParam("ugi=" + getEncodedUgiParameter());
URL u = getNamenodeURL(path, query);
return new FSDataInputStream(new RangeHeaderInputStream(connectionFactory, u));

这个时候还没有与NameNode进行连接。当openInputStream的时候才会进行实际连接,代码为

protected InputStream openInputStream() throws IOException {
// Use the original url if no resolved url exists, eg. if
// it's the first time a request is made.
final boolean resolved = resolvedURL.getURL() != null;
final URLOpener opener = resolved? resolvedURL: originalURL;

final HttpURLConnection connection = opener.connect(startPos, resolved);

2、NameNode通过NamenodeWebHdfsMethods类来处理所有的HTTP请求,对于hftp发送的Open请求,NameNode的处理逻辑是根据位置选择一个最适合的DataNode,然后重定向把URL设置成该DataNode的http端口返回给Hftp client端。

final URI uri = redirectURI(namenode, ugi, delegation, username, doAsUser,
fullpath, op.getValue(), offset.getValue(), -1L, offset, length, bufferSize);
return Response.temporaryRedirect(uri).type(MediaType.APPLICATION_OCTET_STREAM).build();

至此,完成了Client与NameNode的交互,Client然后与DataNode进行交互。
3、Client根据重定向的URL与需要读取数据的DataNode建立连接,注意这一步走的还是HTTP端口,DataNode默认的HTTP端口是50075。

resolvedURL.setURL(getResolvedUrl(connection));

InputStream in = connection.getInputStream();

4、在DataNode内部通过DatanodeWebHdfsMethods类来处理HTTP请求,对于OPEN的请求,处理方式为

case OPEN:
{
final int b = bufferSize.getValue(conf);
final DFSClient dfsclient = newDfsClient(nnId, conf);
HdfsDataInputStream in = null;
try {
in = new HdfsDataInputStream(dfsclient.open(fullpath, b, true));
in.seek(offset.getValue());
} catch(IOException ioe) {
IOUtils.cleanup(LOG, in);
IOUtils.cleanup(LOG, dfsclient);
throw ioe;
}

final long n = length.getValue() != null ?
Math.min(length.getValue(), in.getVisibleLength() - offset.getValue()) :
in.getVisibleLength() - offset.getValue();

/**
* Allow the Web UI to perform an AJAX request to get the data.
*/
return Response.ok(new OpenEntity(in, n, dfsclient))
.type(MediaType.APPLICATION_OCTET_STREAM)
.header("Access-Control-Allow-Methods", "GET")
.header("Access-Control-Allow-Origin", "*")
.build();
}

内部如果打开了本地读那么就生成了一个本地读(BlockReaderLocal)的DFSClient来读取数据,如果没有开启本地读,那么就生成一个Socket连接来读取数据。
5,6、返回给client inputstream用于数据读取。至此一个Hftp读取的流程结束。

我们遇到的问题

当执行distcp的时候有一端为hftp,发现任务都fail read一个文件,该文件所在的机器状态不太正常,查看网络连接,50075端口有很多CLOSE_WAIT,并且datanode进程写入、读取都非常慢,判断当时情况网络连接不正常,因为NameNode的HTTP接口只返回了一个DataNode地址,并不是像RPC接口一样,返回DataNode列表,如果一个出现问题,还可以读取其他的DataNode。在这个时候由于这台出问题的DataNode导致了反复无法读取数据,当时解决的方法是重启该进程,使得读取到其他的DataNode上。
至于深层原因,根当时的DN状态有关,由于没有保存完整堆栈信息,只能在下次出错的时候继续排查,但是基本思路是DN的socket server响应慢,由于内部逻辑导致的。

Druid Ingest Format问题

 

Druid使用过程中需要对历史数据进行索引,由于历史数据都是hive表形式,分隔符为\001,所以需要druid对ingest的format 任意delimiter进行支持,以下是支持的形式:

"parser" : {
"type" : "hadoopyString",
"parseSpec" : {
"format" : "tsv",
"timestampSpec" : {
"column" : "dt",
"format" : "posix"
},
"dimensionsSpec" : {
"dimensions": ["grade","src_flag","school_name","gender_desc","prov_name","city_name","school_prov"]
},
"delimiter":"\u0001",
"listDelimiter":"\u0002",
"columns": ["dt","uid","grade","src_flag","school_name","gender_desc","prov_name","city_name","school_prov","flag"]
}
},

指定解析格式是tsv,delimiter为\u0001,listDelimiter是multi时候使用,目前没有使用,定义为\u0002,启动任务即可。