Der's blag

repairing elasticsearch indices after a hardware failure

What do you do when your elasticsearch index is damaged?

This can happen if you have hardware issues and no replicas. You should have replicas.

If you don't, there is still something that can be done.

First, take a look at the cluster health:

# curl -s "localhost:9200/_cluster/health?level=shards&pretty"         
{
  "cluster_name" : "search-index",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 2,
  "active_shards" : 2,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0,
  "indices" : {
    "search" : {
      "status" : "red",
      "number_of_shards" : 4,
      "number_of_replicas" : 0,
      "active_primary_shards" : 2,
      "active_shards" : 2,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 2,
      "shards" : {
	"0" : {
	  "status" : "green",
	  "primary_active" : true,
	  "active_shards" : 1,
	  "relocating_shards" : 0,
	  "initializing_shards" : 0,
	  "unassigned_shards" : 0
	},
	"1" : {
	  "status" : "green",
	  "primary_active" : true,
	  "active_shards" : 1,
	  "relocating_shards" : 0,
	  "initializing_shards" : 0,
	  "unassigned_shards" : 0
	},
	"2" : {
	  "status" : "red",
	  "primary_active" : false,
	  "active_shards" : 0,
	  "relocating_shards" : 0,
	  "initializing_shards" : 0,
	  "unassigned_shards" : 1
	},
	"3" : {
	  "status" : "red",
	  "primary_active" : false,
	  "active_shards" : 0,
	  "relocating_shards" : 0,
	  "initializing_shards" : 0,
	  "unassigned_shards" : 1
	}
      }
    }
  }
}

Yay! The status is red and two shards (the ones on the faulty server) are damaged. But that does not tell the whole story. You have to make sure you don't have any other problems first:

  1. Look at the logs (in /var/log/elasticsearch/). If something seriously horrible happened, you will see tons of errors.

  2. Run the following query:

    curl -XGET localhost:9200/_cat/shards

    search 3 p UNASSIGNED
    search 2 p UNASSIGNED
    search 1 p STARTED 24436479 25.5gb 192.168.1.174 node-2
    search 0 p STARTED 24481100 25.6gb 192.168.1.174 node-2
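
One more check that can help here (a sketch, assuming a 5.x or newer cluster): the allocation explain API will tell you why a particular shard is unassigned, e.g. for primary shard 2 of the search index:

# curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '{"index": "search", "shard": 2, "primary": true}'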

If your data is really damaged (like in my case), none of this will help you one bit. In that case, continue reading.

Run this query:

# curl -XPOST 'localhost:9200/_cluster/reroute?explain=true&pretty'
{
  "acknowledged": true,
  "state": {
		...
    },
    "routing_table": {
      "indices": {
	"search": {
	  "shards": {
		...
	    "3": [
	      {
		"state": "UNASSIGNED",
		"primary": true,
		"node": null,
		"relocating_node": null,
		"shard": 3,
		"index": "search",
		"recovery_source": {
		  "type": "EXISTING_STORE"
		},
		"unassigned_info": {
		  "reason": "CLUSTER_RECOVERED",
		  "at": "2017-12-22T22:54:14.684Z",
		  "delayed": false,
		  "allocation_status": "no_valid_shard_copy"
		}
	      }
	    ],
	    "2": [
	      {
		"state": "UNASSIGNED",
		"primary": true,
		"node": null,
		"relocating_node": null,
		"shard": 2,
		"index": "search",
		"recovery_source": {
		  "type": "EXISTING_STORE"
		},
		"unassigned_info": {
		  "reason": "CLUSTER_RECOVERED",
		  "at": "2017-12-22T22:54:14.684Z",
		  "delayed": false,
		  "allocation_status": "no_valid_shard_copy"
		}
	      }
	    ]
	  }
	}
      }
    },
    "routing_nodes": {
      "unassigned": [
	{
	  "state": "UNASSIGNED",
	  "primary": true,
	  "node": null,
	  "relocating_node": null,
	  "shard": 3,
	  "index": "search",
	  "recovery_source": {
	    "type": "EXISTING_STORE"
	  },
	  "unassigned_info": {
	    "reason": "CLUSTER_RECOVERED",
	    "at": "2017-12-22T22:54:14.684Z",
	    "delayed": false,
	    "allocation_status": "no_valid_shard_copy"
	  }
	},
	{
	  "state": "UNASSIGNED",
	  "primary": true,
	  "node": null,
	  "relocating_node": null,
	  "shard": 2,
	  "index": "search",
	  "recovery_source": {
	    "type": "EXISTING_STORE"
	  },
	  "unassigned_info": {
	    "reason": "CLUSTER_RECOVERED",
	    "at": "2017-12-22T22:54:14.684Z",
	    "delayed": false,
	    "allocation_status": "no_valid_shard_copy"
	  }
	}
      ],
      "nodes": {
	...
      }
    }
  },
  "explanations": []
}

You can see that there is no surviving copy of these shards anywhere (allocation_status is no_valid_shard_copy), and the only explanation for this is your stupidity. It might still be possible to remove the corrupted files and recover most of the data. Do the following:

  1. Stop elasticsearch

  2. Run the Lucene CheckIndex tool (your path will differ slightly: the long directory name is the index UUID, and this example is only for shard #2):

    $ cd /usr/share/elasticsearch/lib/

    this will run for a long time; it will find the corrupted segments and drop whatever cannot be recovered

    $ java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/nodes/0/indices/Lf1TzlFTSDaYE_gQnrsTJQ/2/index/ -exorcise

    delete the corruption marker files that elasticsearch left next to the shard data

    $ find /var/lib/elasticsearch/nodes/0/indices/Lf1TzlFTSDaYE_gQnrsTJQ/2/ -name 'corrupted*' -delete

    reset file ownership if you ran the commands as root

    $ chown elasticsearch:elasticsearch -R /var/lib/elasticsearch/nodes/0

  3. If the previous steps (logs or shard state) indicated a corrupted translog, you can wipe it too:

    $ /usr/share/elasticsearch/bin/elasticsearch-translog truncate -d /var/lib/elasticsearch/nodes/0/indices/Lf1TzlFTSDaYE_gQnrsTJQ/2/translog

  4. Restart all nodes; they seem to remember the shard corruption and won't let the repaired shards start otherwise

  5. Start the failed node

  6. This should work now. Run the commands from the diagnosis section to be sure everything is OK. If the shards are still unassigned, see the note below.
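
If, after all of this, the shards still come back as no_valid_shard_copy, there is a heavier hammer (a sketch, not part of the steps above): the reroute API can force-allocate the stale copy that is physically present on the repaired node. The node name below is hypothetical; put in the node that actually holds the shard data, and be aware that accept_data_loss means exactly what it says:

# curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d '{"commands": [{"allocate_stale_primary": {"index": "search", "shard": 2, "node": "node-1", "accept_data_loss": true}}]}'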

Now go and create some replicas.
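
For example (a minimal sketch, assuming the index is still called search), one replica can be added with the index settings API; with a copy on the second node, losing one server no longer turns the cluster red:

# curl -XPUT 'localhost:9200/search/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 1}}'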