Hey everyone!
We played around a bit last time with our radar data to build a model that we could train outside Elasticsearch, loading it through Eland and then applying it using an ingest pipeline.
But since our data is in the form of vectors, could we actually exploit Elasticsearch vector database functionality and perform a sort of K-Nearest Neighbors classifier?
Of course we can! Let’s dive right into this (as usual, the code shown in the article can be found attached as a Jupyter Notebook).
We start by creating the connection to Elasticsearch, using the Python Elasticsearch client.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from collections import Counter
import json
# Connect to the Elasticsearch cluster (replace the placeholders with your credentials)
es = Elasticsearch(
    hosts=["https://localhost:9200"],
    basic_auth=("<username>", "<password>")
)
In terms of authentication, use whatever’s most suitable for your case, keeping in mind that the client also supports certificate-based authentication.
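For example, a minimal sketch of a certificate-based connection could look like the following (the file paths are of course just placeholders for your own certificates):
# Hypothetical alternative: TLS client-certificate authentication
# (the paths are placeholders, adapt them to your setup)
es = Elasticsearch(
    hosts=["https://localhost:9200"],
    ca_certs="/path/to/http_ca.crt",    # CA certificate used to verify the cluster
    client_cert="/path/to/client.crt",  # client certificate
    client_key="/path/to/client.key"    # client private key
)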
As we did last time, we can use pandas to load the dataset and inspect it a bit.
radar_data = pd.read_csv("ionosphere.csv")
radar_data.head()
| attribute1 | attribute2 | attribute3 | … | attribute34 | class |
|---|---|---|---|---|---|
| 0.99539 | -0.05889 | 0.85243 | … | -0.45300 | good |
| 1 | -0.18829 | 0.93035 | … | -0.02447 | bad |
| 1 | -0.03365 | 1 | … | -0.38238 | good |
| 1 | -0.45161 | 1 | … | 1 | bad |
Okay, as we can see from the structure of the dataset, we have the numerical attributes spread among the different columns. But to exploit the vector functionalities of Elasticsearch we need them all to be in a single column containing the actual vector of features.
So let’s reshape the dataset a bit to bring it into this form, while also extracting features (X) and labels (y) into different variables:
# Each sample becomes a single 34-dimensional feature vector (X)...
X = radar_data.iloc[:, :34].values.tolist()
# ...while the labels (y) are kept in a separate list
y = radar_data["class"].to_list()
To be able to classify new samples based on their neighbors, we need to divide the dataset into two different subsets: a training set, whose samples (and known classes) we will index into Elasticsearch, and a test set, which we will use to evaluate the classifier.
In our case, we can for example decide to use 80% of the dataset for training purposes and the other 20% for testing:
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=0,
shuffle=True
)
Okay, time to move to the Elasticsearch side of the game. We’d now like to index our documents to then be able to apply a KNN search approach to the documents.
As usual, in order to index the data in the format we'd like, we need to define the correct mappings for our index. In our case, the most important part is the type of the field that will hold our attributes, since it needs to be of type dense_vector.
We will thus upload the following index template, using the Elasticsearch Python client:
{
  "mappings": {
    "properties": {
      "attributes": {
        "type": "dense_vector",
        "dims": 34,
        "index": true,
        "similarity": "l2_norm"
      },
      "class": {
        "type": "keyword"
      }
    }
  }
}
# We read the JSON template
template = None
with open("index_template.json", "r") as template_file:
    template = json.load(template_file)

# and then we load it using the associated function,
# unless a template with this name already exists
if (template is not None and
        not es.indices.exists_index_template(name="radar-data")):
    es.indices.put_index_template(
        name="radar-data",
        index_patterns="radar-data",
        priority=500,
        template=template
    )
With the template in place, we can finally index our data. Please note that, since we'd like to index more than one document, we can use the bulk API, which expects us to provide a list of actions, one per document.
We can assemble the actual document in the _source field, since it’s composed of just two columns, and then use the bulk helper to perform the bulk request.
# Create the list of actions to be sent to Elasticsearch
actions = [
{
"_index": "radar-data",
"_source": {
"attributes": attributes,
"class": c
}
} for attributes, c in zip(X_train, y_train)
]
# Perform the bulk operations
bulk(es, actions=actions)
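As a quick optional sanity check (and to make sure the freshly indexed documents are immediately visible to the searches that follow), we can refresh the index and compare the document count with the size of our training set:
# Make the newly indexed documents searchable right away
es.indices.refresh(index="radar-data")
# The count should match the number of training samples
print(es.count(index="radar-data")["count"], len(X_train))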
Okay, now it’s time to move to the classifier itself. First of all, what is a K-Nearest Neighbors classifier?
Well, the idea is quite simple: given a new element that we would like to classify, we base the classification on the known elements closest to it, for example taking the class expressed by the majority of its neighbors (a plain majority vote, which works nicely here since we only have two classes).
In our case, since we are dealing with vectors, an appropriate vector distance metric is used: in the mapping above we chose l2_norm (Euclidean distance), though other measures such as cosine similarity are also available.
Okay, so why is it actually called K-Nearest Neighbors? The K plays a very important role, since it determines how many classified neighbors we should take into consideration when looking at a certain new data point.
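Before delegating everything to Elasticsearch, here is a tiny plain-Python sketch of these two ingredients (distance between vectors and majority vote), just to make the idea concrete. The helper functions are mine, not part of any library:
import math
from collections import Counter

def euclidean_distance(a, b):
    # Straight-line (L2) distance between two feature vectors,
    # matching the l2_norm similarity we chose in the mapping
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def majority_vote(neighbor_labels):
    # The predicted class is simply the most common label
    # among the k nearest neighbors
    return Counter(neighbor_labels).most_common(1)[0][0]

majority_vote(["good", "bad", "good", "good", "bad"])  # -> 'good'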
So this is the core idea, and the nice part is that we can actually do the whole job using Elasticsearch.
How? Using a KNN query of course! This type of query allows us to obtain the K nearest neighbors of a data point.
Note: to be more precise, by using this type of query we are performing an approximate KNN search. Why approximate? Because the query accepts an additional parameter besides k: num_candidates, the number of candidate neighbors to consider on each shard. To improve the performance of such a search, Elasticsearch first retrieves that many candidate neighbors from each shard (based on the HNSW algorithm), and then computes the similarity with respect to the provided vector only for those candidates, not for the full set of documents present in the shard. This boosts performance but can affect accuracy, since the result may not always contain the exact k nearest neighbors.
Furthermore, we can compute the most frequent class among the returned neighbors directly with a terms aggregation, which saves us from any particular post-processing of the response beyond extracting the result.
# For every test sample, retrieve its 5 nearest neighbors and let the
# terms aggregation tell us the most frequent class among them
y_pred = []
for test_instance in X_test:
    search_results = es.search(
        index="radar-data",
        knn={
            "field": "attributes",
            "query_vector": test_instance,
            "k": 5,                # number of neighbors used for the vote
            "num_candidates": 50   # candidates considered per shard
        },
        fields=["class"],
        source=False,
        aggregations={
            "top_class": {
                "terms": {
                    "field": "class",
                    "size": 1
                }
            }
        }
    )
    # The single bucket returned by the aggregation holds the majority class
    aggs_result = search_results["aggregations"]
    pred_class = aggs_result["top_class"]["buckets"][0]["key"]
    y_pred.append(pred_class)
And now that we’ve accumulated the predicted classes, we can calculate the accuracy using the accuracy_score function:
accuracy_score(y_test, y_pred)
which in our case returns about 83%, not too bad compared with the 89% we obtained from the decision tree in our last experiment.
In this article we saw how it’s possible to use Elasticsearch as a vector database to perform KNN searches, and how this can already be used out-of-the-box as a KNN classifier.
And since we didn't explore many different values for the k and num_candidates parameters, as our goal was not yet to obtain the best possible classifier, feel free to play with them and explore further optimizations 😀
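If you'd like a starting point for such experiments, here is a small hypothetical helper that wraps the search above into a function parameterised by k and num_candidates, reusing the index and the test split we already have:
def knn_predict(test_instance, k, num_candidates):
    # Approximate KNN search + majority vote via the terms aggregation
    result = es.search(
        index="radar-data",
        knn={
            "field": "attributes",
            "query_vector": test_instance,
            "k": k,
            "num_candidates": num_candidates
        },
        source=False,
        aggregations={
            "top_class": {"terms": {"field": "class", "size": 1}}
        }
    )
    return result["aggregations"]["top_class"]["buckets"][0]["key"]

# Compare a few values of k on the test set
for k in (3, 5, 7, 11):
    y_pred_k = [knn_predict(x, k, num_candidates=50) for x in X_test]
    print(k, accuracy_score(y_test, y_pred_k))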
Are you passionate about performance metrics or other modern IT challenges? Do you have the experience to drive solutions like the one above? Our customers often present us with problems that need customized solutions. In fact, we’re currently hiring for roles just like this as well as other roles here at Würth Phoenix.