Greenplum cluster fails to start

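We had a GP cluster in which some segments had gone down. It kept running in that degraded state for a while; we ran gprecoverseg several times but interrupted the recovery process, and in the end gprecoverseg could not make progress at all: every attempt brought nodes down again. Then, pushing our luck, we copied the contents of a previously backed-up datadir back into place and restarted the GP cluster. Sure enough, it now would not start at all, failing with: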

20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-----------------------------------------------------
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-DBID:17  FAILED  host:'10.240.128.21' datadir:'/data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Mirror_seg4_17' with reason:'PG_CTL failed.'
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-DBID:23  FAILED  host:'10.240.128.21' datadir:'/data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Mirror_seg10_23' with reason:'PG_CTL failed.'
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-DBID:6  FAILED  host:'10.240.128.14' datadir:'/data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Primary_seg4_6' with reason:'Failure in segment mirroring; check segment logfile'
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-DBID:12  FAILED  host:'10.240.128.14' datadir:'/data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Primary_seg10_12' with reason:'Failure in segment mirroring; check segment logfile'
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-DBID:3  FAILED  host:'10.240.128.21' datadir:'/data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Primary_seg1_3' with reason:'PG_CTL failed.'
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-DBID:9  FAILED  host:'10.240.128.21' datadir:'/data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Primary_seg7_9' with reason:'PG_CTL failed.'
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-----------------------------------------------------


20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-----------------------------------------------------
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-   Successful segment starts                                                     = 14
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-Failed segment starts, from mirroring connection between primary and mirror   = 2    <<<<<<<<
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-Other failed segment starts                                                   = 4    <<<<<<<<
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-Skipped segment starts (segments are marked down in configuration)            = 2    <<<<<<<<
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-----------------------------------------------------
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-Successfully started 14 of 20 segment instances, skipped 2 other segments <<<<<<<<
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-----------------------------------------------------
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-Segment instance startup failures reported
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-Failed start 6 of 20 segment instances <<<<<<<<
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-Review /data/gprds/gpAdminLogs/gpstart_20180705.log
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-----------------------------------------------------
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-****************************************************************************
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-There are 2 segment(s) marked down in the database
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-To recover from this current state, review usage of the gprecoverseg
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-management utility which will recover failed segment instance databases.
20180705:21:19:47:021616 gpstart:10:gprds-[WARNING]:-****************************************************************************
20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:-Commencing parallel segment instance shutdown, please wait...
.............

20180705:21:20:11:021616 gpstart:10:gprds-[ERROR]:-gpstart error: Do not have enough valid segments to start the array.
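To see the pattern in the six failures at a glance, here is a minimal Python sketch that groups the FAILED lines above by host and reason. The regular expression is inferred only from the lines shown in this post; it is an assumption, not a documented gpstart output format, and the helper name `group_failures` is hypothetical.

```python
import re
from collections import defaultdict

# The six FAILED lines from the gpstart output above, with the
# "20180705:21:19:47:021616 gpstart:10:gprds-[INFO]:" prefix dropped.
BASE = "/data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10"
FAILED_LINES = f"""\
-DBID:17  FAILED  host:'10.240.128.21' datadir:'{BASE}/QE_Mirror_seg4_17' with reason:'PG_CTL failed.'
-DBID:23  FAILED  host:'10.240.128.21' datadir:'{BASE}/QE_Mirror_seg10_23' with reason:'PG_CTL failed.'
-DBID:6  FAILED  host:'10.240.128.14' datadir:'{BASE}/QE_Primary_seg4_6' with reason:'Failure in segment mirroring; check segment logfile'
-DBID:12  FAILED  host:'10.240.128.14' datadir:'{BASE}/QE_Primary_seg10_12' with reason:'Failure in segment mirroring; check segment logfile'
-DBID:3  FAILED  host:'10.240.128.21' datadir:'{BASE}/QE_Primary_seg1_3' with reason:'PG_CTL failed.'
-DBID:9  FAILED  host:'10.240.128.21' datadir:'{BASE}/QE_Primary_seg7_9' with reason:'PG_CTL failed.'
"""

# Assumed pattern, inferred from the six lines above only.
FAILED_RE = re.compile(
    r"DBID:(\d+)\s+FAILED\s+host:'([^']+)'\s+datadir:'([^']+)'\s+with reason:'([^']+)'"
)


def group_failures(text):
    """Map (host, reason) to the segment directory names, in log order."""
    groups = defaultdict(list)
    for m in FAILED_RE.finditer(text):
        _dbid, host, datadir, reason = m.groups()
        # Keep only the last path component, e.g. "QE_Mirror_seg4_17".
        groups[(host, reason)].append(datadir.rsplit("/", 1)[-1])
    return dict(groups)


if __name__ == "__main__":
    for (host, reason), segs in group_failures(FAILED_LINES).items():
        print(f"{host}  {reason!r}: {', '.join(segs)}")
```

Grouped this way, all four 'PG_CTL failed.' segments sit on 10.240.128.21, while the two primaries on 10.240.128.14 that report a mirroring failure are seg4 and seg10, whose mirrors (DBID 17 and 23) are among the failures on .21. The six lines also match the summary: 2 mirroring-connection failures plus 4 other failures, out of the 20 segments gpstart attempted (14 + 2 + 4 = 20), with 2 more skipped because they are marked down. That suggests, without proving, that 10.240.128.21 is the common factor.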

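Searching for "gpstart error: Do not have enough valid segments to start the array." turned up nothing useful in the first few results. Digging further, I came across http://blog.163.com/digoal@126… , where the author saw the same 'PG_CTL failed.' error; his fix was to increase the startup timeout.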

Going back to the gpstart documentation, it describes the same situation:

-t | --timeout <number_of_seconds>

  Specifies a timeout in seconds to wait for a segment instance to
  start up. If a segment instance was shutdown abnormally (due to
  power failure or killing its postgres database listener process,
  for example), it may take longer to start up due to the database
  recovery and validation process. If not specified, the default timeout
  is 60 seconds.

However, his log explicitly contains the message "database system was not properly shut down; automatic recovery in progress":

2011-03-24 01:51:49.539239 CST,,,p16689,th-507526368,,,,0,,,seg-1,,,,,"LOG","00000","database system was interrupted at 2011-03-23 18:35:06 CST",,,,,,,0,,"xlog.c",5623,
2011-03-24 01:51:49.627039 CST,,,p16689,th-507526368,,,,0,,,seg-1,,,,,"LOG","00000","checkpoint record is at 111/CFA98080",,,,,,,0,,"xlog.c",5700,
2011-03-24 01:51:49.627083 CST,,,p16689,th-507526368,,,,0,,,seg-1,,,,,"LOG","00000","redo record is at 111/CFA98080; undo record is at 0/0; shutdown FALSE",,,,,,,0,,"xlog.c",5739,
2011-03-24 01:51:49.627099 CST,,,p16689,th-507526368,,,,0,,,seg-1,,,,,"LOG","00000","next transaction ID: 0/21258188; next OID: 595602762",,,,,,,0,,"xlog.c",5743,
2011-03-24 01:51:49.627110 CST,,,p16689,th-507526368,,,,0,,,seg-1,,,,,"LOG","00000","next MultiXactId: 1; next MultiXactOffset: 0",,,,,,,0,,"xlog.c",5746,
2011-03-24 01:51:49.627142 CST,,,p16689,th-507526368,,,,0,,,seg-1,,,,,"LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,0,,"xlog.c",5829,
2011-03-24 01:51:49.627323 CST,,,p16689,th-507526368,,,,0,,,seg-1,,,,,"LOG","00000","redo starts at 111/CFA980D8",,,,,,,0,,"xlog.c",5893,
2011-03-24 01:51:49.729438 CST,"greenplum","postgres",p16690,th-507526368,"[local]",,2011-03-24 01:51:49 CST,0,,,seg-1,,,,,"FATAL","57P03","the database system is starting up",,,,,,,0,,"postmaster.c",1887,"Traceback 0: 0x98615e: /opt/greenplumdb/3.3.6.1/greenplum-db-3.3.6.1/bin/postgres errstart+0x3be
Traceback 1: 0x7c10e8: /opt/greenplumdb/3.3.6.1/greenplum-db-3.3.6.1/bin/postgres +0x7c10e8
Traceback 2: 0x7c2216: /opt/greenplumdb/3.3.6.1/greenplum-db-3.3.6.1/bin/postgres +0x7c2216
Traceback 3: 0x7c35e5: /opt/greenplumdb/3.3.6.1/greenplum-db-3.3.6.1/bin/postgres PostmasterMain+0x945
Traceback 4: 0x6e533b: /opt/greenplumdb/3.3.6.1/greenplum-db-3.3.6.1/bin/postgres main+0x48b
Traceback 5: 0x3f2501d994: /lib64/libc.so.6 __libc_start_main+0xf4
Traceback 6: 0x45c899: /opt/greenplumdb/3.3.6.1/greenplum-db-3.3.6.1/bin/postgres +0x45c899
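The timeout fix in that post hinges on one specific startup message. To check each failed segment quickly for it, a small hypothetical helper can scan a segment's pg_log CSV file for the crash-recovery markers quoted above versus the filerep wait message that shows up in our logs. The function names and the choice of marker strings are my own assumptions, taken only from the log lines in this post.

```python
from pathlib import Path

# Marker strings copied from the log lines quoted in this post (assumption:
# their presence is a good enough signal for a rough classification).
CRASH_RECOVERY_MARKERS = (
    "database system was not properly shut down; automatic recovery in progress",
    "redo starts at",
)
FILEREP_WAIT_MARKER = "TransitiontoPrimary: waiting for filerep startup"


def classify_startup(log_text: str) -> str:
    """Roughly classify why a segment is slow or failing to start."""
    if any(marker in log_text for marker in CRASH_RECOVERY_MARKERS):
        return "crash-recovery"        # the case a longer gpstart -t can help with
    if FILEREP_WAIT_MARKER in log_text:
        return "waiting-for-filerep"   # primary stuck waiting on its mirror
    return "unknown"


def classify_file(path: str) -> str:
    """Read a pg_log file (tolerating bad bytes) and classify it."""
    return classify_startup(Path(path).read_text(errors="replace"))
```

For example, `classify_file("/data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Primary_seg4_6/pg_log/postgresql-21.log")` would be run on the segment host itself, where that file lives.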

Our log, by contrast, shows:

[gprds@10 /data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Primary_seg4_6/pg_log]$ cat -n postgresql-21.log
2018-07-05 21:09:45.067408 CST,"gprds","postgres",p46768,th365434944,"[local]",,2018-07-05 21:09:45 CST,0,,,seg-1,,,,,"FATAL","57M01","the database system is in mirror or uninitialized mode",,,,,,,0,,"postmaster.c",2946,
2018-07-05 21:09:45.269644 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","setsockopt(TCP_KEEPCNT) failed: Invalid argument",,,,,,,0,,"pqcomm.c",1953,
2018-07-05 21:09:45.270808 CST,,,p46772,th365434944,"127.0.0.1","8902",2018-07-05 21:09:45 CST,0,,,seg-1,,,,,"LOG","00000","received transition request packet. processing the request",,,,,,,0,,"postmaster.c",2696,
2018-07-05 21:09:45.271015 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: shutting down filerep backends",,,,,,,0,,"primary_mirror_mode.c",1988,
2018-07-05 21:09:45.271044 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep backends to shutdown",,,,,,,0,,"primary_mirror_mode.c",1991,
2018-07-05 21:09:45.271068 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: shutting down filerep",,,,,,,0,,"primary_mirror_mode.c",1998,
2018-07-05 21:09:45.271101 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep to shutdown",,,,,,,0,,"primary_mirror_mode.c",2001,
2018-07-05 21:09:45.271131 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: completed filerep to shutdown",,,,,,,0,,"primary_mirror_mode.c",2004,
2018-07-05 21:09:45.271155 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: initializing XLog Startup",,,,,,,0,,"primary_mirror_mode.c",2011,
2018-07-05 21:09:45.271207 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: starting filerep",,,,,,,0,,"primary_mirror_mode.c",2049,
2018-07-05 21:09:45.271636 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:09:45.271911 CST,,,p46773,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'role not initialized' mirroring state 'not initialized' segment state 'not initialized' filerep state 'not initialized' process name(pid) 'filerep main process(46773)' 'cdbfilerep.c' 'L2598' 'FileRep_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.271954 CST,,,p46773,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'guc 'gp_segment_connect_timeout' value '600' ', mirroring role 'role not initialized' mirroring state 'not initialized' segment state 'not initialized' filerep state 'not initialized' process name(pid) 'filerep main process(46773)' 'cdbfilerep.c' 'L2724' 'FileRep_SetFileRepRetry'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.276103 CST,,,p46773,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'mirror transition, primary address(port) '10.240.128.14(40011)' mirror address(port) '10.240.128.21(40011)' ', mirroring role 'primary role' mirroring state 'sync' segment state 'not initialized' filerep state 'not initialized' process name(pid) 'filerep main process(46773)' 'cdbfilerep.c' 'L3514' 'FileRep_Main'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.276155 CST,,,p46773,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","mirror transition, primary address(port) '10.240.128.14(40011)' mirror address(port) '10.240.128.21(40011)'",,,,,"mirroring role 'primary role' mirroring state 'sync' segment state 'not initialized' process name(pid) 'filerep main process(46773)' filerep state 'not initialized' ",,0,,"cdbfilerep.c",3524,
2018-07-05 21:09:45.277155 CST,,,p46774,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery'filerep state 'initialization and recovery' process name(pid) 'primary receiver ack process(46774)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.277210 CST,,,p46774,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'start receiver ack', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary receiver ack process(46774)' 'cdbfilerepprimaryack.c' 'L123' 'FileRepAckPrimary_StartReceiver'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.277255 CST,,,p46774,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'primary address(port) '10.240.128.14(40011)' mirror address(port) '10.240.128.21(40011)' ', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary receiver ack process(46774)' 'cdbfilerepprimaryack.c' 'L134' 'FileRepAckPrimary_StartReceiver'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.277374 CST,,,p46775,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery'filerep state 'initialization and recovery' process name(pid) 'primary sender process(46775)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.277441 CST,,,p46775,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'start sender', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary sender process(46775)' 'cdbfilerepprimary.c' 'L1519' 'FileRepPrimary_StartSender'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.277478 CST,,,p46775,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'primary address(port) '10.240.128.14(40011)' mirror address(port) '10.240.128.21(40011)' ', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary sender process(46775)' 'cdbfilerepprimary.c' 'L1530' 'FileRepPrimary_StartSender'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.277520 CST,,,p46776,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery'filerep state 'initialization and recovery' process name(pid) 'primary consumer ack process(46776)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.277554 CST,,,p46776,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'run consumer', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary consumer ack process(46776)' 'cdbfilerepprimaryack.c' 'L907' 'FileRepAckPrimary_StartConsumer'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.277713 CST,,,p46777,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery'filerep state 'initialization and recovery' process name(pid) 'primary recovery process(46777)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.278881 CST,,,p46777,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'start recovery', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary recovery process(46777)' 'cdbfilerepprimaryrecovery.c' 'L41' 'FileRepPrimary_StartRecovery'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:09:45.278924 CST,,,p46777,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'run recovery', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary recovery process(46777)' 'cdbfilerepprimaryrecovery.c' 'L80' 'FileRepPrimary_StartRecoveryInSync'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:10:35.549605 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:11:25.828588 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:12:16.105980 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:13:06.385168 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:13:56.663586 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:14:46.941051 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:15:37.219287 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:16:27.498683 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:17:17.778104 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:18:08.056911 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:18:58.336861 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:19:45.023110 CST,,,p46775,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery'filerep state 'initialization and recovery' process name(pid) 'primary sender process(46775)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.023196 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"WARNING","01000","PostmasterPrimaryMirrorTransition (2) Finished with Error",,,,,,,0,,"primary_mirror_mode.c",1708,
2018-07-05 21:19:45.023225 CST,,,p46775,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery'filerep state 'fault' process name(pid) 'primary sender process(46775)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.023258 CST,,,p46775,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery'filerep state 'fault' process name(pid) 'primary sender process(46775)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.023291 CST,,,p46775,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary sender process(46775)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.023325 CST,,,p46776,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'run consumer', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary consumer ack process(46776)' 'cdbfilerepprimaryack.c' 'L966' 'FileRepAckPrimary_RunConsumer'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.023360 CST,,,p46777,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'run recovery of flat files', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary recovery process(46777)' 'cdbfilerepprimaryrecovery.c' 'L145' 'FileRepPrimary_RunRecoveryInSync'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.023416 CST,,,p46775,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary sender process(46775)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.023447 CST,,,p46776,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary consumer ack process(46776)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.023477 CST,,,p46776,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary consumer ack process(46776)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.023512 CST,,,p46777,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","failure is detected in segment mirroring, failover requested",,,,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' process name(pid) 'primary recovery process(46777)' filerep state 'initialization and recovery' ",,0,,"cdbfilerepprimary.c",265,
2018-07-05 21:19:45.028875 CST,,,p46772,th365434944,"127.0.0.1","8902",2018-07-05 21:09:45 CST,0,,,seg-1,,,,,"WARNING","01000","PrimaryMirrorTransitionRequest (2) Result: Transition to primary/mirror mode PrimarySegment, data state InSync resulted in Error",,,,,,,0,,"primary_mirror_mode.c",1324,
2018-07-05 21:19:45.042936 CST,,,p46774,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary receiver ack process(46774)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.042991 CST,,,p46774,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary receiver ack process(46774)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:58.308632 CST,,,p46754,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","received immediate shutdown request",,,,,,,0,,"postmaster.c",4112,
2018-07-05 21:19:58.309333 CST,,,p46773,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary receiver ack process' process pid '46774' exit status '2' ', mirroring role 'primary role' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(46773)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:58.309411 CST,,,p46773,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary consumer ack process' process pid '46776' exit status '2' ', mirroring role 'primary role' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(46773)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:58.309485 CST,,,p46773,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary recovery process' process pid '46777' exit status '2' ', mirroring role 'primaryrole' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(46773)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:58.309736 CST,,,p46773,th365434944,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary sender process' process pid '46775' exit status '2' ', mirroring role 'primary role' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(46773)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,

And for the other primary, QE_Primary_seg10_12:

[gprds@10 /data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Primary_seg10_12/pg_log]$ cat -n postgresql-21.log
2018-07-05 21:10:35.566722 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:11:25.844929 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:12:16.122051 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:13:06.399411 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:13:56.675466 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:14:46.953732 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:15:37.230129 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:16:27.505817 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:17:17.782397 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:18:08.060997 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:18:58.339081 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:19:45.028348 CST,,,p46782,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery'filerep state 'initialization and recovery' process name(pid) 'primary sender process(46782)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.028424 CST,,,p46782,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery'filerep state 'fault' process name(pid) 'primary sender process(46782)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.028470 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"WARNING","01000","PostmasterPrimaryMirrorTransition (2) Finished with Error",,,,,,,0,,"primary_mirror_mode.c",1708,
2018-07-05 21:19:45.028495 CST,,,p46782,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery'filerep state 'fault' process name(pid) 'primary sender process(46782)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.028528 CST,,,p46782,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary sender process(46782)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.028564 CST,,,p46782,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary sender process(46782)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.028596 CST,,,p46783,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'run consumer', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary consumer ack process(46783)' 'cdbfilerepprimaryack.c' 'L966' 'FileRepAckPrimary_RunConsumer'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.028630 CST,,,p46784,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'run recovery of flat files', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary recovery process(46784)' 'cdbfilerepprimaryrecovery.c' 'L145' 'FileRepPrimary_RunRecoveryInSync'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.028680 CST,,,p46783,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary consumer ack process(46783)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.028711 CST,,,p46783,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary consumer ack process(46783)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.028746 CST,,,p46784,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","failure is detected in segment mirroring, failover requested",,,,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' process name(pid) 'primary recovery process(46784)' filerep state 'initialization and recovery' ",,0,,"cdbfilerepprimary.c",265,
2018-07-05 21:19:45.037534 CST,,,p46779,th479004736,"127.0.0.1","41654",2018-07-05 21:09:45 CST,0,,,seg-1,,,,,"WARNING","01000","PrimaryMirrorTransitionRequest (2) Result: Transition to primary/mirror mode PrimarySegment, data state InSync resulted in Error",,,,,,,0,,"primary_mirror_mode.c",1324,
2018-07-05 21:19:45.060347 CST,,,p46781,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary receiver ack process(46781)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:45.060419 CST,,,p46781,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary receiver ack process(46781)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:58.308578 CST,,,p46753,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","received immediate shutdown request",,,,,,,0,,"postmaster.c",4112,
2018-07-05 21:19:58.309359 CST,,,p46780,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary receiver ack process' process pid '46781' exit status '2' ', mirroring role 'primary role' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(46780)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:58.309401 CST,,,p46780,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary consumer ack process' process pid '46783' exit status '2' ', mirroring role 'primary role' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(46780)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:58.309523 CST,,,p46780,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary recovery process' process pid '46784' exit status '2' ', mirroring role 'primaryrole' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(46780)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 21:19:58.309843 CST,,,p46780,th479004736,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary sender process' process pid '46782' exit status '2' ', mirroring role 'primary role' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(46780)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,

Recovery comes up in there too, but not in a way that makes the cause clear.

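Still, with nothing left to lose, I tried running gpstart -B 1 -t 3600 -v on the master: -B 1 starts the segments one at a time, -t 3600 raises the timeout to an hour, and -v prints verbose logs.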
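For scripting the same retry, here is a minimal Python sketch that builds and runs that exact invocation. It assumes it runs as the Greenplum admin user on the master host with gpstart on PATH. The helper names gpstart_command and run_gpstart are hypothetical, and only the -B, -t and -v flags come from the command above (my reading of the gpstart reference is that -t is the number of seconds to wait for each segment to start).

```python
import shlex
import subprocess
from typing import List


def gpstart_command(parallel: int = 1, timeout_s: int = 3600, verbose: bool = True) -> List[str]:
    """Build the argument list for gpstart.

    -B: how many segments to start in parallel (1 = one at a time)
    -t: seconds to wait for a segment to start
    -v: verbose output
    """
    if parallel < 1 or timeout_s < 1:
        raise ValueError("parallel and timeout_s must be positive integers")
    cmd = ["gpstart", "-B", str(parallel), "-t", str(timeout_s)]
    if verbose:
        cmd.append("-v")
    return cmd


def run_gpstart(**kwargs) -> int:
    """Run gpstart in the foreground (it may ask for confirmation) and return its exit code."""
    cmd = gpstart_command(**kwargs)
    print("running:", shlex.join(cmd))
    return subprocess.call(cmd)


if __name__ == "__main__":
    print(shlex.join(gpstart_command()))  # gpstart -B 1 -t 3600 -v
```

Starting serially is slow, but it makes the first segment to fail easy to spot in the output, which is how QE_Primary_seg10_12 stood out here.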

It still failed when it got to QE_Primary_seg10_12, with much the same errors as before:

2018-07-05 21:56:08.740151 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:56:59.017695 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:57:49.296098 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:58:39.574285 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 21:59:29.852117 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 22:00:20.129710 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 22:01:10.407472 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 22:02:00.687284 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 22:02:50.967451 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 22:03:41.246303 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 22:04:31.527052 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2054,
2018-07-05 22:05:18.008223 CST,,,p59333,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary sender process(59333)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:18.008310 CST,,,p59333,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'fault' process name(pid) 'primary sender process(59333)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:18.008353 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"WARNING","01000","PostmasterPrimaryMirrorTransition (2) Finished with Error",,,,,,,0,,"primary_mirror_mode.c",1708,
2018-07-05 22:05:18.008390 CST,,,p59333,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'fault' process name(pid) 'primary sender process(59333)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:18.008432 CST,,,p59333,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary sender process(59333)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:18.008469 CST,,,p59333,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary sender process(59333)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:18.008506 CST,,,p59337,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'run recovery of flat files', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary recovery process(59337)' 'cdbfilerepprimaryrecovery.c' 'L145' 'FileRepPrimary_RunRecoveryInSync'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:18.008568 CST,,,p59335,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'run consumer', mirroring role 'primary role' mirroring state 'sync' segment state 'initialization and recovery' filerep state 'initialization and recovery' process name(pid) 'primary consumer ack process(59335)' 'cdbfilerepprimaryack.c' 'L966' 'FileRepAckPrimary_RunConsumer'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:18.008607 CST,,,p59335,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary consumer ack process(59335)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:18.008642 CST,,,p59335,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary consumer ack process(59335)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:18.008682 CST,,,p59328,th-1729120192,"127.0.0.1","53902",2018-07-05 21:55:18 CST,0,,,seg-1,,,,,"WARNING","01000","PrimaryMirrorTransitionRequest (2) Result: Transition to primary/mirror mode PrimarySegment, data state InSync resulted in Error",,,,,,,0,,"primary_mirror_mode.c",1324,
2018-07-05 22:05:18.008717 CST,,,p59337,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","failure is detected in segment mirroring, failover requested",,,,,"mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' process name(pid) 'primary recovery process(59337)' filerep state 'initialization and recovery' ",,0,,"cdbfilerepprimary.c",265,
2018-07-05 22:05:18.028315 CST,,,p59331,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary receiver ack process(59331)' 'cdbfilerep.c' 'L2457' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:18.028373 CST,,,p59331,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'set filerep state', mirroring role 'primary role' mirroring state 'sync' segment state 'in fault' filerep state 'fault' process name(pid) 'primary receiver ack process(59331)' 'cdbfilerepservice.c' 'L565' 'FileRepSubProcess_SetState'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:32.044313 CST,,,p59312,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","received immediate shutdown request",,,,,,,0,,"postmaster.c",4112,
2018-07-05 22:05:32.044937 CST,,,p59330,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary consumer ack process' process pid '59335' exit status '2' ', mirroring role 'primary role' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(59330)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:32.044984 CST,,,p59330,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary recovery process' process pid '59337' exit status '2' ', mirroring role 'primary role'mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(59330)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:32.045051 CST,,,p59330,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary receiver ack process' process pid '59331' exit status '2' ', mirroring role 'primary role' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(59330)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,
2018-07-05 22:05:32.045410 CST,,,p59330,th-1729120192,,,,0,,,seg-1,,,,,"LOG","00000","'process exit, process name 'primary sender process' process pid '59333' exit status '2' ', mirroring role 'primary role' mirroring state 'sync' segment state 'in immediate shutdown' filerep state 'not initialized' process name(pid) 'filerep main process(59330)' 'cdbfilerep.c' 'L2129' 'LogChildExit'",,,,,,,0,,"cdbfilerep.c",1839,

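After reading the log carefully, the trouble appears to start where the filerep state turns to 'fault'. That is what makes the transition end with "data state InSync resulted in Error", after which the processes shut down and exit.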

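So what does a search for filerep state 'fault' turn up? Delightfully, I had wandered into no man's land: exactly one result on the entire web, https://github.com/greenplum-d… , and no real solution there either.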

From here on I was exploring on my own. My first idea was to look at the mirror of QE_Primary_seg10_12, and in its log directory, /data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Mirror_seg10_23/pg_log, I found this:

2018-07-05 13:54:14.508270 GMT,,,p40047,th63346752,,,,0,,,seg-1,,,,,"WARNING","01000","""logging_collector"": setting is ignored because it is defunct",,,,,,,,"set_config_option","guc.c",4641,
2018-07-05 13:54:14.508881 GMT,,,p40047,th63346752,,,,0,,,seg-1,,,,,"WARNING","01000","""log_destination"": setting is ignored because it is defunct",,,,,,,,"set_config_option","guc.c",4641,
2018-07-05 13:54:14.509455 GMT,,,p40047,th63346752,,,,0,,,seg-1,,,,,"WARNING","01000","""log_directory"": setting is ignored because it is defunct",,,,,,,,"set_config_option","guc.c",4641,
2018-07-05 13:54:14.509982 GMT,,,p40047,th63346752,,,,0,,,seg-1,,,,,"WARNING","01000","""log_destination"": setting is ignored because it is defunct",,,,,,,,"set_config_option","guc.c",4641,
2018-07-05 13:54:14.510511 GMT,,,p40047,th63346752,,,,0,,,seg-1,,,,,"WARNING","01000","""log_directory"": setting is ignored because it is defunct",,,,,,,,"set_config_option","guc.c",4641,
2018-07-05 13:54:14.511090 GMT,,,p40047,th63346752,,,,0,,,seg-1,,,,,"WARNING","01000","""logging_collector"": setting is ignored because it is defunct",,,,,,,,"set_config_option","guc.c",4641,
2018-07-05 13:54:14.511640 GMT,,,p40047,th63346752,,,,0,,,seg-1,,,,,"WARNING","01000","""log_destination"": setting is ignored because it is defunct",,,,,,,,"set_config_option","guc.c",4641,
2018-07-05 13:54:14.512169 GMT,,,p40047,th63346752,,,,0,,,seg-1,,,,,"WARNING","01000","""log_directory"": setting is ignored because it is defunct",,,,,,,,"set_config_option","guc.c",4641,
2018-07-05 13:54:14.512692 GMT,,,p40047,th63346752,,,,0,,,seg-1,,,,,"WARNING","01000","""log_destination"": setting is ignored because it is defunct",,,,,,,,"set_config_option","guc.c",4641,
2018-07-05 13:54:14.513231 GMT,,,p40047,th63346752,,,,0,,,seg-1,,,,,"WARNING","01000","""log_directory"": setting is ignored because it is defunct",,,,,,,,"set_config_option","guc.c",4641,
2018-07-05 13:54:14.514707 GMT,,,p40050,th63346752,,,,0,,,seg-1,,,,,"FATAL","53100","could not write lock file ""postmaster.pid"": No space left on device",,,,,,,,"CreateLockFile","miscinit.c",1019,1 0x8c9846 postgres errstart + 0x1f6
2    0x8da406 postgres <symbol not found> + 0x8da406
3    0x768d3a postgres PostmasterMain + 0x88a
4    0x488ddb postgres main + 0x3bb
5    0x7f8a0281eb35 libc.so.6 __libc_start_main + 0xf5
6    0x488ef9 postgres <symbol not found> + 0x488ef9

So the disk is out of space. A quick df -h to check:

[gprds@10 /data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Mirror_seg10_23/pg_log]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G  3.5G   16G  19% /
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G   16K   32G   1% /dev/shm
tmpfs            32G  3.2G   29G  11% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/sda3        20G  857M   18G   5% /usr/local
/dev/sda4       1.8T   14G  1.7T   1% /data
/dev/md127p1    9.1T  8.6T     0 100% /data1
tmpfs           6.3G     0  6.3G   0% /run/user/0
tmpfs           6.3G     0  6.3G   0% /run/user/1000

Good grief, the 9.1T disk has been eaten clean. What is going on here?
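Note that /data1 reports Use% 100 even though Used (8.6T) is below Size (9.1T). GNU df computes Use% as used / (used + avail), rounded up, where avail is what an unprivileged user can still write. Here is a minimal Python sketch of that calculation on top of shutil.disk_usage, whose free field is the space available to non-root users. The helper names are hypothetical.

```python
import shutil


def df_use_percent(used: int, avail: int) -> int:
    """Use% the way GNU df computes it: used / (used + avail), rounded up.

    Blocks reserved for root are in neither `used` nor `avail`, so Use%
    can hit 100% while Size is still larger than Used.
    """
    if used + avail == 0:
        return 0
    return -(-100 * used // (used + avail))  # integer ceiling division


def path_use_percent(path: str) -> int:
    """Use% for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # usage.free == space available to non-root users
    return df_use_percent(usage.used, usage.free)


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "/"
    pct = path_use_percent(target)
    print(f"{target}: {pct}% used" + ("  <-- full" if pct >= 100 else ""))
```

For the /data1 row, 8.6T used with 0 available gives 100%. The roughly 0.5T gap to Size is most likely space reserved for root or filesystem metadata, which the gprds user cannot write to anyway.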

[gprds@10 /data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10]$ du -h --max-depth=1
3.0T	./QE_Primary_seg7_9
78G	./QE_Mirror_seg4_17
5.5T	./QE_Primary_seg1_3
76G	./QE_Mirror_seg10_23
8.6T	.
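The du output already points at the two primary directories. If you want the same per-directory breakdown from a script, here is a minimal Python approximation of du --max-depth=1. It is a sketch with a hypothetical function name. It counts allocated blocks (st_blocks × 512, as du does on Linux) rather than apparent size, because core dumps are often sparse.

```python
import os
from typing import Dict, Set, Tuple


def dir_sizes(root: str) -> Dict[str, int]:
    """Roughly `du --max-depth=1 root`: bytes allocated under each immediate subdirectory.

    Uses st_blocks * 512 where the platform provides it, else st_size.
    Hard-linked files are counted once, as du does. Directory entries
    themselves are not counted, so totals run slightly low.
    """
    seen: Set[Tuple[int, int]] = set()
    sizes: Dict[str, int] = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue  # only subdirectories get a line, like du --max-depth=1
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):  # does not follow symlinks
                for name in filenames:
                    try:
                        st = os.lstat(os.path.join(dirpath, name))
                    except FileNotFoundError:
                        continue  # removed while we were walking
                    if (st.st_dev, st.st_ino) in seen:
                        continue
                    seen.add((st.st_dev, st.st_ino))
                    blocks = getattr(st, "st_blocks", None)
                    total += blocks * 512 if blocks is not None else st.st_size
            sizes[entry.name] = total
    return sizes


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, size in sorted(dir_sizes(root).items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size / 1024**3:10.1f}G  {name}")
```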

Looking inside the two big primary directories, core dumps lay strewn everywhere like casualties on a battlefield.

And I couldn't even delete them:

[gprds@10 /data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Primary_seg1_3]$ rm core.61416
mkdir: cannot create directory '/data1/.Trash/gprds/2018070522': No space left on device

It looked as if rm had been aliased, yet alias showed no such thing:

[gprds@10 /data1/gprds/RDSDIR/db_data/data/Greenplum/4.3.8.2/data/greenplum-da21lyj10/QE_Primary_seg1_3]$ alias
alias egrep='egrep --color=auto'
alias fgrep='fgrep --color=auto'
alias grep='grep --color=auto'
alias l.='ls -d .* --color=auto'
alias ll='ls -l --color=auto'
alias ls='ls --color=auto'
alias vi='vim'
alias which='alias | /usr/bin/which --tty-only --read-alias --show-dot --show-tilde'
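alias only lists aliases. The post doesn't say what was intercepting rm. One possibility, stated here as an assumption, is a shell function or a wrapper script placed ahead of the system rm on PATH; in bash, type -a rm shows aliases, functions and executables together. The minimal Python sketch below covers only the PATH case. It lists every executable named rm in the order the shell would search, so a trash-moving wrapper would show up first. The function name is hypothetical.

```python
import os
from typing import List, Optional


def executables_named(name: str, path_env: Optional[str] = None) -> List[str]:
    """Every executable called `name` on PATH, in the order the shell searches them."""
    if path_env is None:
        path_env = os.environ.get("PATH", "")
    hits: List[str] = []
    for directory in path_env.split(os.pathsep):
        candidate = os.path.join(directory or ".", name)  # an empty PATH entry means cwd
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK) and candidate not in hits:
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    found = executables_named("rm")
    print("\n".join(found) or "no rm on PATH")
    if found and found[0] not in ("/usr/bin/rm", "/bin/rm"):
        print("warning: the first rm on PATH is not the usual system rm")
```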

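Never mind. First I emptied one file with echo "" > core.61416, then deleted the remaining cores one by one, only to find that everything "deleted" had been moved into the Trash and no space was freed at all.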

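So I tried deleting the core files with unlink instead. It turned out to be just the thing: it really removes the file, though slowly, since deleting a single 10G core took quite a while. find ./ -name "core.48980" -delete was about as slow, but at least both of them actually got rid of the files.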
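unlink and find -delete both work because they go straight to the kernel's unlink call, so the Trash-moving rm never gets involved. The minimal Python sketch below does the same with os.unlink. It truncates each file first, which mirrors echo "" > core.61416 and releases the blocks even if some process still holds the core open (an unlinked but open file keeps its space until it is closed). The function name and the core.* pattern are assumptions based on the file names above, and it defaults to a dry run so the first call only lists what it would remove.

```python
import os
from pathlib import Path
from typing import List


def purge_core_files(directory: str, dry_run: bool = True) -> List[str]:
    """Delete core.* files directly inside `directory` and return the paths handled.

    os.unlink calls the unlink system call directly, so no rm wrapper or
    trash directory is involved. Each file is truncated to zero first,
    which frees its blocks even if another process still has it open.
    With dry_run=True (the default) nothing is changed.
    """
    handled: List[str] = []
    for path in sorted(Path(directory).glob("core.*")):
        if path.is_symlink() or not path.is_file():
            continue  # never follow links or touch directories named core.*
        if not dry_run:
            os.truncate(path, 0)
            os.unlink(path)
        handled.append(str(path))
    return handled


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    really = "--delete" in sys.argv[2:]
    for p in purge_core_files(target, dry_run=not really):
        print(("deleted " if really else "would delete ") + p)
```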

With space freed up, I tried starting the cluster again with gpstart -B 1 -t 3600 -v, and this time it finally came up.

That still didn't solve the cluster's data recovery problem, so, tempting fate once more, I copied the previously backed-up mirror datadir back into place.

Surprisingly, it still started, so I began the long gprecoverseg -F process all over again.
