Fixing a shard that fails to start after an ES node server loses power unexpectedly and reboots

Today two Elasticsearch node servers lost power unexpectedly and were rebooted, after which a translog turned out to be corrupted. This post records the recovery process.

1. Problem


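A single machine holds 200 million+ documents in one index with 20+ fields, and data is written to it continuously with bulk requests (bulk.size=5000). While this load was running, the machine unexpectedly lost power and went down.
After the machine was brought back and ES was restarted, a TranslogCorruptedException was thrown: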
[2018-04-18 16:29:25,950][WARN ][indices.cluster          ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [ocslog-2018.04.18][0] failed to recover shard
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
        at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
        ... 4 more
Caused by: java.io.EOFException
        at org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
        at org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
        at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
        at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
        at org.elasticsearch.index.translog.Translog$Index.readFrom(Translog.java:495)
        at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
        ... 5 more
[2018-04-18 16:29:25,959][WARN ][cluster.action.shard     ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] sending failed shard for [ocslog-2018.04.18][0], node[hROI0lMDQvqFNa2pcdlrGg], [P], s[INITIALIZING], indexUUID [opNRuZb0QS-UseDQlAadfA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[ocslog-2018.04.18][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
[2018-04-18 16:29:25,959][WARN ][cluster.action.shard     ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] received shard failed for [ocslog-2018.04.18][0], node[hROI0lMDQvqFNa2pcdlrGg], [P], s[INITIALIZING], indexUUID [opNRuZb0QS-UseDQlAadfA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[ocslog-2018.04.18][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]
[2018-04-18 16:29:26,304][WARN ][cluster.action.shard     ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] received shard failed for [ocslog-2018.04.18][0], node[hROI0lMDQvqFNa2pcdlrGg], [P], s[INITIALIZING], indexUUID [opNRuZb0QS-UseDQlAadfA], reason [master [elasticsearch_43_ssd][hROI0lMDQvqFNa2pcdlrGg][ocsbak][inet[/192.168.0.43:9300]]{tag=hot, max_local_storage_nodes=1, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]
[2018-04-18 16:29:26,624][WARN ][cluster.action.shard     ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] received shard failed for [ocslog-2018.04.18][0], node[hROI0lMDQvqFNa2pcdlrGg], [P], s[INITIALIZING], indexUUID [opNRuZb0QS-UseDQlAadfA], reason [master [elasticsearch_43_ssd][hROI0lMDQvqFNa2pcdlrGg][ocsbak][inet[/192.168.0.43:9300]]{tag=hot, max_local_storage_nodes=1, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]
[2018-04-18 16:29:27,174][WARN ][cluster.action.shard     ] [elasticsearch_43_ssd] [ocslog-2018.04.18][0] received shard failed for [ocslog-2018.04.18][0], node[hROI0lMDQvqFNa2pcdlrGg], [P], s[INITIALIZING], indexUUID [opNRuZb0QS-UseDQlAadfA], reason [master [elasticsearch_43_ssd][hROI0lMDQvqFNa2pcdlrGg][ocsbak][inet[/192.168.0.43:9300]]{tag=hot, max_local_storage_nodes=1, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]

Note: the primary shard [0] of ocslog-2018.04.18 failed to start, with the error TranslogCorruptedException[translog corruption while reading from stream]. The unexpected power loss appears to have left the translog in a corrupted state.
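
When more than one index is affected, it can help to pull the failing shards out of the log automatically. The following is a minimal Python sketch that assumes the ES 1.x log line layout shown above, where the first "[index][shard]" pair on a line names the shard. The function name corrupted_shards is hypothetical and not part of any Elasticsearch tooling.

```python
import re

# Matches the first "[index-name][shard-number]" pair on a log line,
# e.g. "[ocslog-2018.04.18][0]". Assumes the log layout shown above.
SHARD_RE = re.compile(r"\[([^\[\]]+)\]\[(\d+)\]")


def corrupted_shards(log_lines):
    """Return sorted (index, shard) pairs whose log lines mention translog corruption."""
    found = set()
    for line in log_lines:
        if "TranslogCorruptedException" not in line:
            continue
        match = SHARD_RE.search(line)
        if match:  # stack-trace lines carry no shard id and are skipped
            found.add((match.group(1), int(match.group(2))))
    return sorted(found)
```

Passing the lines of the node's log file, for example via open(path) on a log path of your own, yields the list of shards whose translog directories need attention.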

2. Solution

  • First, shut down the cluster.
  • Remove the translog recovery files of the ocslog-2018.04.18 index. Locate the translog directory, /home/local/elasticsearch/data/elasticsearch_ocs/nodes/0/indices/ocslog-2018.04.18/0/translog, which contains translog*.recovering files; back them up, then delete them.
  • Restart the cluster; the shard then recovers.
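
The backup-then-delete step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name backup_and_remove and the backup directory are hypothetical, the default pattern translog*.recovering follows the file names described in the step above, and the script must only be run while the ES node is stopped.

```python
import shutil
from pathlib import Path


def backup_and_remove(translog_dir, backup_dir, pattern="translog*.recovering"):
    """Copy each file matching `pattern` into backup_dir, then delete the original.

    Returns the names of the files that were backed up and removed.
    """
    src = Path(translog_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)

    handled = []
    for path in sorted(src.glob(pattern)):
        if not path.is_file():
            continue
        target = dst / path.name
        shutil.copy2(path, target)  # keeps timestamps along with the content
        # Delete the original only if the copy has the same size.
        if target.stat().st_size != path.stat().st_size:
            raise OSError(f"backup of {path} is incomplete, original kept")
        path.unlink()
        handled.append(path.name)
    return handled
```

For the shard in this post, the first argument would be the translog directory named above, and the second any directory outside the ES data path (an assumption; pick whatever location suits your setup). Copying before deleting means the original files can be put back if removing them does not help.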

3. Summary


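The ES translog records every change made to ES and is a key component of data durability and recovery. If the machine goes down while the translog is being written, the write does not finish normally and the translog file is left without a proper end, so reading it back during recovery fails with an EOFException (the innermost cause in the stack trace above). Keep in mind that the translog holds operations that have not yet been flushed to the Lucene index, so deleting it discards those operations; this is why the files are backed up first. For details, see: https://blog.csdn.net/jiao_fuyou/article/details/79997292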
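
To make the failure mode concrete, here is a toy Python sketch of a log whose last record was cut off mid-write. It assumes a made-up record layout (4-byte length, payload, 4-byte CRC32) chosen purely for illustration; it is not Elasticsearch's actual on-disk translog format, and write_record, read_exact and read_records are hypothetical names.

```python
import struct
import zlib


def write_record(stream, payload: bytes) -> None:
    """Append one record: 4-byte length, payload, 4-byte CRC32 of the payload."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)
    stream.write(struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF))


def read_exact(stream, size: int) -> bytes:
    """Read exactly `size` bytes or raise EOFError if the stream ends early."""
    data = stream.read(size)
    if len(data) < size:
        raise EOFError(f"expected {size} bytes, got {len(data)}")
    return data


def read_records(stream):
    """Read all records; a record truncated by a crash surfaces as EOFError."""
    records = []
    while True:
        header = stream.read(4)
        if not header:
            return records  # clean end: the file stops on a record boundary
        if len(header) < 4:
            raise EOFError("stream ended inside a record header")
        (length,) = struct.unpack(">I", header)
        payload = read_exact(stream, length)
        (stored_crc,) = struct.unpack(">I", read_exact(stream, 4))
        if stored_crc != zlib.crc32(payload) & 0xFFFFFFFF:
            raise ValueError("checksum mismatch: record is corrupted")
        records.append(payload)
```

Writing two records into an io.BytesIO buffer and dropping the last few bytes before reading reproduces the situation: the first record is read normally, and the second raises EOFError because the stream ends in the middle of the record.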
