Deleting Documents from Elasticsearch

Elasticsearch provides REST API methods for deleting documents or an entire index. There are two main ways to delete documents: by using a query to find matching documents, or by deleting a document directly by its ID.

How ES document deletion works:

Lucene simply marks a bit in a per-segment bitset to record that the document is deleted. All subsequent searches simply skip any deleted documents. It is not until segments are merged that the bytes consumed by deleted documents are reclaimed.
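You can observe this: the cat segments API reports deleted-but-not-yet-reclaimed documents per segment, and a force merge with only_expunge_deletes asks Lucene to merge away segments carrying deletes. (The index name below is an example; run a force merge sparingly, as it is I/O-heavy.)

```
# Show live vs. deleted document counts per segment
curl -X GET "localhost:9200/_cat/segments/my-index-000001?v&h=index,segment,docs.count,docs.deleted"

# Merge only segments containing deletes, reclaiming their space
curl -X POST "localhost:9200/my-index-000001/_forcemerge?only_expunge_deletes=true"
```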

Option 1: Elasticsearch delete by query API

ES document deletion with a query: here we run term queries inside a bool/should clause to match the products to delete.



POST /indexId/_delete_by_query
{
  "query": {
    "bool": {
      "should": [
        { "term": { "product_id": "ID#123" } },
        { "term": { "product_id": "ID#124" } }
      ]
    }
  }
}

Java implementation:

DeleteByQueryRequest deleteByQueryRequest = new DeleteByQueryRequest(indexName);

// bool/should acts as an OR over the term clauses
BoolQueryBuilder boolQueryBuilder = new BoolQueryBuilder();
boolQueryBuilder.should(new TermQueryBuilder("id", "value1"));
boolQueryBuilder.should(new TermQueryBuilder("id", "value2"));

deleteByQueryRequest.setQuery(boolQueryBuilder);

elasticsearchClientManager.getElasticsearchClient(clusterMetadata)
        .deleteByQuery(deleteByQueryRequest, RequestOptions.DEFAULT);

Behavior and limitations of DeleteByQuery:

  1. If a document changes between the time that the snapshot is taken and the delete operation is processed, it results in a version conflict and the delete fails. (Set conflicts=proceed to count version conflicts and continue instead of halting.)
  2. If a search or bulk request is rejected, the requests are retried up to 10 times with exponential backoff. If the maximum retry limit is reached, processing halts and all failed requests are returned in the response. Any delete requests that completed successfully are not rolled back.
  3. To control the rate at which delete by query issues batches of delete operations, you can set requests_per_second to any positive decimal number. This pads each batch with a wait time to throttle the rate. Set requests_per_second to -1 to disable throttling.
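As a sketch of how that throttling behaves: each batch is followed by a sleep so that the batch as a whole fits the target rate. The helper below is illustrative only (it is not part of any Elasticsearch client API); the numbers follow the worked example in the ES reference docs.

```java
// Illustrative only: how _delete_by_query pads each batch when
// requests_per_second is set. Not an Elasticsearch client API.
public class ThrottleMath {
    // wait_time = batch_size / requests_per_second - time spent writing the batch
    static double waitSeconds(int batchSize, double requestsPerSecond, double writeSeconds) {
        return batchSize / requestsPerSecond - writeSeconds;
    }

    public static void main(String[] args) {
        // Default scroll batch size is 1000; with requests_per_second=500 and a
        // batch that took 0.5 s to write, ES sleeps 1.5 s before the next batch.
        System.out.println(waitSeconds(1000, 500, 0.5));
    }
}
```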

Option 2: Delete and bulk delete APIs

ES delete API:
ES can delete a single document by its ID. The bulk API batches many such requests into a single call, which reduces per-request overhead.


curl -X DELETE "localhost:9200/my-index-000001/_doc/1?timeout=5m&pretty"

The primary shard assigned to perform the delete operation might not be available when the delete operation is executed. Some reasons for this might be that the primary shard is currently recovering from a store or undergoing relocation.

By default, the delete operation will wait on the primary shard to become available for up to 1 minute before failing and responding with an error. The timeout parameter can be used to explicitly specify how long it waits.

Example:

curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "delete" : { "_index" : "test", "_id" : "1" } }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "delete" : { "_index" : "test", "_id" : "3" } }
'
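For reference, the bulk body above is newline-delimited JSON: one single-line action object per document, and the body must end with a newline. A minimal sketch of assembling it by hand (the high-level REST client does this for you; the index name and IDs are examples):

```java
import java.util.List;

public class BulkDeleteBody {
    // Build the NDJSON body for a _bulk request made up only of deletes.
    static String build(String index, List<String> ids) {
        StringBuilder body = new StringBuilder();
        for (String id : ids) {
            // Each delete action is a single-line JSON object.
            body.append("{\"delete\":{\"_index\":\"").append(index)
                .append("\",\"_id\":\"").append(id).append("\"}}\n");
        }
        return body.toString(); // _bulk requires the trailing newline
    }

    public static void main(String[] args) {
        System.out.print(build("test", List.of("1", "2", "3")));
    }
}
```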

Java implementation:


BulkRequest request = new BulkRequest();

// index name ("indexID_1") and document IDs are examples
request.add(new DeleteRequest("indexID_1", "1"));
request.add(new DeleteRequest("indexID_1", "2"));
request.add(new DeleteRequest("indexID_1", "3"));

elasticsearchClientManager.getElasticsearchClient(clusterMetadata)
        .bulk(request, RequestOptions.DEFAULT);

Metrics to monitor

  • Average and p95 time taken for delete requests to complete
  • Number of deletion failures due to ES errors, for example the primary shard being unavailable
  • Elasticsearch JVM memory and CPU usage when delete requests arrive in bulk

About the author

Muaaz
