Der's blag

repairing elasticsearch indices after a hardware failure

What do you do when your elasticsearch index is damaged?

This can happen if you have hardware issues and no replicas. You should have replicas.

If you don't, there is still something that can be done.

First, take a look at the cluster health:

# curl -s "localhost:9200/_cluster/health?level=shards&pretty"         
{
  "cluster_name" : "search-index",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 2,
  "active_shards" : 2,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0,
  "indices" : {
    "search" : {
      "status" : "red",
      "number_of_shards" : 4,
      "number_of_replicas" : 0,
      "active_primary_shards" : 2,
      "active_shards" : 2,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 2,
      "shards" : {
	"0" : {
	  "status" : "green",
	  "primary_active" : true,
	  "active_shards" : 1,
	  "relocating_shards" : 0,
	  "initializing_shards" : 0,
	  "unassigned_shards" : 0
	},
	"1" : {
	  "status" : "green",
	  "primary_active" : true,
	  "active_shards" : 1,
	  "relocating_shards" : 0,
	  "initializing_shards" : 0,
	  "unassigned_shards" : 0
	},
	"2" : {
	  "status" : "red",
	  "primary_active" : false,
	  "active_shards" : 0,
	  "relocating_shards" : 0,
	  "initializing_shards" : 0,
	  "unassigned_shards" : 1
	},
	"3" : {
	  "status" : "red",
	  "primary_active" : false,
	  "active_shards" : 0,
	  "relocating_shards" : 0,
	  "initializing_shards" : 0,
	  "unassigned_shards" : 1
	}
      }
    }
  }
}

Yay! The status is red and two shards (the ones on the faulty server) are damaged. But that does not tell the whole story. You have to make sure you don't have any other problems first:

  1. Look at the logs (in /var/log/elasticsearch/). If something seriously horrible happened, you will see tons of errors.

  2. Run the following query:

    curl -XGET localhost:9200/_cat/shards

    search 3 p UNASSIGNED
    search 2 p UNASSIGNED
    search 1 p STARTED 24436479 25.5gb 192.168.1.174 node-2
    search 0 p STARTED 24481100 25.6gb 192.168.1.174 node-2
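
One more check that can help here (a sketch, assuming a 5.x or newer cluster): the allocation explain API will tell you why a particular shard is unassigned, e.g. for primary shard 2 of the search index:

# curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '{"index": "search", "shard": 2, "primary": true}'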

If your data is really damaged (like in my case), none of this will help you one bit. In that case, continue reading.

Run this query:

# curl -XPOST 'localhost:9200/_cluster/reroute?explain=true&pretty'
{
  "acknowledged": true,
  "state": {
		...
    },
    "routing_table": {
      "indices": {
	"search": {
	  "shards": {
		...
	    "3": [
	      {
		"state": "UNASSIGNED",
		"primary": true,
		"node": null,
		"relocating_node": null,
		"shard": 3,
		"index": "search",
		"recovery_source": {
		  "type": "EXISTING_STORE"
		},
		"unassigned_info": {
		  "reason": "CLUSTER_RECOVERED",
		  "at": "2017-12-22T22:54:14.684Z",
		  "delayed": false,
		  "allocation_status": "no_valid_shard_copy"
		}
	      }
	    ],
	    "2": [
	      {
		"state": "UNASSIGNED",
		"primary": true,
		"node": null,
		"relocating_node": null,
		"shard": 2,
		"index": "search",
		"recovery_source": {
		  "type": "EXISTING_STORE"
		},
		"unassigned_info": {
		  "reason": "CLUSTER_RECOVERED",
		  "at": "2017-12-22T22:54:14.684Z",
		  "delayed": false,
		  "allocation_status": "no_valid_shard_copy"
		}
	      }
	    ]
	  }
	}
      }
    },
    "routing_nodes": {
      "unassigned": [
	{
	  "state": "UNASSIGNED",
	  "primary": true,
	  "node": null,
	  "relocating_node": null,
	  "shard": 3,
	  "index": "search",
	  "recovery_source": {
	    "type": "EXISTING_STORE"
	  },
	  "unassigned_info": {
	    "reason": "CLUSTER_RECOVERED",
	    "at": "2017-12-22T22:54:14.684Z",
	    "delayed": false,
	    "allocation_status": "no_valid_shard_copy"
	  }
	},
	{
	  "state": "UNASSIGNED",
	  "primary": true,
	  "node": null,
	  "relocating_node": null,
	  "shard": 2,
	  "index": "search",
	  "recovery_source": {
	    "type": "EXISTING_STORE"
	  },
	  "unassigned_info": {
	    "reason": "CLUSTER_RECOVERED",
	    "at": "2017-12-22T22:54:14.684Z",
	    "delayed": false,
	    "allocation_status": "no_valid_shard_copy"
	  }
	}
      ],
      "nodes": {
	...
      }
    }
  },
  "explanations": []
}

You can see that there is no surviving copy of these shards anywhere (allocation_status is no_valid_shard_copy), and the only explanation for this is your stupidity. It might still be possible to remove the corrupted files and recover most of the data. Do the following:

  1. Stop elasticsearch

  2. Run the Lucene CheckIndex tool (your path will differ slightly: the long directory name is the index UUID, and this example is only for shard #2):

    $ cd /usr/share/elasticsearch/lib/

    this will run for a long time; it will find the corrupted segments and drop whatever cannot be recovered

    $ java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/nodes/0/indices/Lf1TzlFTSDaYE_gQnrsTJQ/2/index/ -exorcise

    delete the corruption marker files that elasticsearch left next to the shard data

    $ find /var/lib/elasticsearch/nodes/0/indices/Lf1TzlFTSDaYE_gQnrsTJQ/2/ -name 'corrupted*' -delete

    reset file ownership if you ran the commands as root

    $ chown elasticsearch:elasticsearch -R /var/lib/elasticsearch/nodes/0

  3. If the previous steps (logs or shard state) indicated a corrupted translog, you can wipe it too:

    $ /usr/share/elasticsearch/bin/elasticsearch-translog truncate -d /var/lib/elasticsearch/nodes/0/indices/Lf1TzlFTSDaYE_gQnrsTJQ/2/translog

  4. Restart all nodes; they seem to remember the shard corruption and won't let the repaired shards start otherwise

  5. Start the failed node

  6. This should work now. Run the commands from the diagnosis section to be sure everything is OK. If the shards are still unassigned, see the note below.
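
If, after all of this, the shards still come back as no_valid_shard_copy, there is a heavier hammer (a sketch, not part of the steps above): the reroute API can force-allocate the stale copy that is physically present on the repaired node. The node name below is hypothetical; put in the node that actually holds the shard data, and be aware that accept_data_loss means exactly what it says:

# curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d '{"commands": [{"allocate_stale_primary": {"index": "search", "shard": 2, "node": "node-1", "accept_data_loss": true}}]}'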

Now go and create some replicas.
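
For example (a minimal sketch, assuming the index is still called search), one replica can be added with the index settings API; with a copy on the second node, losing one server no longer turns the cluster red:

# curl -XPUT 'localhost:9200/search/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 1}}'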