TensorFlow .so file conflict with a new PyArrow version

Recently I have been using TensorFlow to train models on an ECS machine. Everything was fine for the first few days; then suddenly TensorFlow stopped working entirely. Even a trivial command such as `import tensorflow` could not be executed.
Python crashed with a segmentation fault, so I enabled faulthandler to get a traceback.
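For reference, enabling it takes two lines (it can also be switched on with `python -X faulthandler` or the `PYTHONFAULTHANDLER` environment variable):

```python
import faulthandler

# On a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL), dump the
# Python traceback to stderr instead of dying silently.
faulthandler.enable()

# import tensorflow  # a segfault here would now print the Python frames
```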
The error was as follows:

Fatal Python error: Segmentation fault

Current thread 0x00007f5f2acd5740 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 922 in create_module
  File "<frozen importlib._bootstrap>", line 571 in module_from_spec
  File "<frozen importlib._bootstrap>", line 658 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 684 in _load
  File "/usr/lib/conda/envs/python3.6/lib/python3.6/imp.py", line 343 in load_dynamic
  File "/usr/lib/conda/envs/python3.6/lib/python3.6/imp.py", line 243 in load_module
  File "/usr/lib/conda/envs/python3.6/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28 in swig_import_helper
  File "/usr/lib/conda/envs/python3.6/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 32 in <module>

The code is simple:

_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)

It just loads a shared object (.so) file.

This confused me for two days. I went back over everything I had changed before TensorFlow stopped working, and finally remembered that I had just upgraded pyarrow from 0.10.0 to 0.14.0.
I downgraded from 0.14.0 back to 0.10.0, and everything worked again.
It is very hard to tell what the real cause of a Python segmentation fault is. If I had not remembered upgrading pyarrow, I would never have guessed it was a conflict between .so files; presumably the two wheels bundle incompatible versions of the same native libraries, and whichever loads first wins.
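In hindsight, one way to narrow such crashes down faster is to try each suspect import in a throwaway subprocess: a segfault in a C extension then only kills the child process, and shows up as a negative return code (the signal number) instead of taking down your shell or notebook. A minimal sketch using only the standard library:

```python
import subprocess
import sys

def import_ok(module_name):
    """Return True if `import <module_name>` succeeds in a fresh interpreter."""
    result = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # 0 means success; a native crash shows up as -signum (e.g. -11 for SIGSEGV).
    return result.returncode == 0

print(import_ok("json"))    # stdlib module, loads fine -> True
```

With this you can bisect recently upgraded packages without restarting the interpreter after every crash.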

Restoring a checkpoint in a TensorFlow session

Recently I wrote some small TensorFlow programs and ran into a problem where the session could not restore a checkpoint. The code was very simple, using:

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(task_index == 0),
                                       checkpoint_dir=checkpoint_dir,
                                       save_checkpoint_secs=20) as sess:

In theory, tf.train.MonitoredTrainingSession can both save and restore checkpoints. In my tests, saving worked fine, but every run started training a fresh model instead of continuing from the checkpoint.
After a lot of searching I finally found a way that works.
According to the TF documentation, saving and restoring should only require tf.train.Saver(). But when I implemented it following the docs, I got many errors. In the end, the following approach worked for me: first get the latest checkpoint file, i.e. the most recent one, which contains the model parameters,
then restore the model from it. Don't forget to reset the graph for this run beforehand.
The code is as follows:

checkpoint_dir = "hdfs://emr-header-1:9000/movie"

# Reset the graph for this run before building the model and the Saver.
tf.reset_default_graph()
# ... build the model (and its init op `init`) here, then create the Saver ...
saver = tf.train.Saver()
epoch = 0

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(task_index == 0),
                                       checkpoint_dir=checkpoint_dir,
                                       save_checkpoint_secs=20) as sess:
    sess.run(init)
    # Find the most recent checkpoint file and restore the parameters from it.
    latest_path = tf.train.latest_checkpoint(checkpoint_dir=checkpoint_dir)
    saver.restore(sess, latest_path)

pyspark with jupyter

First, generate the Jupyter config file:

jupyter-notebook --generate-config

Then edit the Jupyter config file (~/.jupyter/jupyter_notebook_config.py):

c.NotebookApp.port = 18888
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.allow_root = True

Of course, Spark itself must be set up correctly; in the EMR environment it is already fully configured. Then set the pyspark driver variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

Finally, just launch pyspark:

pyspark --master yarn