
Improving Business Performance with Machine Learning

by Juan Jose Munoz | Jun 2024


Because we are using an unsupervised learning algorithm, there is no widely available measure of accuracy. However, we can use domain knowledge to validate our groups.

Visually inspecting the groups, we can see that some benchmarking groups have a mix of Economy and Luxury hotels, which doesn’t make business sense: the demand for these hotels is fundamentally different.

We can scroll through the data and note some of those differences, but can we come up with our own measure of accuracy?

We want to create a function to measure the consistency of the recommended benchmark sets across each feature. One way of doing this is to calculate the variance of each feature within each set. For each cluster, we can compute the average of each feature variance, and then average the cluster variances to get a total model score.

From our domain knowledge, we know that to set up a comparable benchmark set, we need to prioritize hotels of the same brand, possibly in the same market, and in the same country; if we use different markets or countries, then the market tier should be the same.

With that in mind, we want our measure to have a higher penalty for variance in those features. To do so, we will use a weighted average to calculate each benchmark set variance. We will also print the variance of the key features and secondary features separately.

To sum up, to create our accuracy measure, we need to:

  1. Calculate variance for categorical variables: one common approach is to use an “entropy-based” measure, where higher diversity in categories indicates higher entropy (variance).
  2. Calculate variance for numerical variables: we can compute the standard deviation or the range (difference between maximum and minimum values). This measures the spread of numerical data within each cluster.
  3. Normalize the data: normalize the variance scores for each category before applying weights to ensure that no single feature dominates the weighted average due to scale differences alone.
  4. Apply weights for different metrics: weight each type of variance based on its importance to the clustering logic.
  5. Calculate weighted averages: compute the weighted average of these variance scores for each cluster.
  6. Aggregate scores across clusters: the total score is the average of these weighted variance scores across all clusters or rows. A lower average score indicates that our model effectively groups similar hotels together, minimizing intra-cluster variance.
from collections import Counter

import numpy as np
import pandas as pd
from scipy.stats import entropy
from sklearn.preprocessing import MinMaxScaler

def categorical_variance(data):
    """
    Calculate entropy for a categorical variable from a list.
    A higher entropy value indicates data with diverse classes.
    A lower entropy value indicates a more homogeneous subset of data.
    """
    # Count the frequency of each unique value and convert to probabilities
    value_counts = Counter(data)
    total_count = sum(value_counts.values())
    probabilities = [count / total_count for count in value_counts.values()]
    return entropy(probabilities)
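
# Quick intuition check (illustrative values, assuming scipy's default
# natural-log entropy): a homogeneous list scores 0, while an even
# 50/50 split scores ln(2).
# categorical_variance(["Brand A"] * 4)         -> 0.0
# categorical_variance(["Brand A", "Brand B"])  -> ~0.693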

# Set scoring weights, giving higher weights to the most important features
scoring_weights = {"BRAND": 0.3,
                   "Room_count": 0.025,
                   "Market": 0.25,
                   "Country": 0.15,
                   "Market Tier": 0.15,
                   "HCLASS": 0.05,
                   "Demand": 0.025,
                   "Price range": 0.025,
                   "distance_to_airport": 0.025}

def calculate_weighted_variance(df, weights):
    """
    Calculate the weighted variance score for clusters in the dataset.
    """
    # Initialize a DataFrame to store the variances
    variance_df = pd.DataFrame()

    # 1. Calculate variances for numerical features
    numerical_features = ['Room_count', 'Demand', 'Price range', 'distance_to_airport']
    for feature in numerical_features:
        variance_df[feature] = df[feature].apply(np.var)

    # 2. Calculate entropy for categorical features
    categorical_features = ['BRAND', 'Market', 'Country', 'Market Tier', 'HCLASS']
    for feature in categorical_features:
        variance_df[feature] = df[feature].apply(categorical_variance)

    # 3. Normalize the variance and entropy values
    scaler = MinMaxScaler()
    normalized_variances = pd.DataFrame(scaler.fit_transform(variance_df),
                                        columns=variance_df.columns,
                                        index=variance_df.index)

    # 4. Compute the weighted averages, keeping primary (categorical)
    #    and secondary (numerical) features separate
    cat_weights = {feature: weights[feature] for feature in categorical_features}
    num_weights = {feature: weights[feature] for feature in numerical_features}

    cat_weighted_scores = normalized_variances[categorical_features].mul(cat_weights)
    df['cat_weighted_variance_score'] = cat_weighted_scores.sum(axis=1)

    num_weighted_scores = normalized_variances[numerical_features].mul(num_weights)
    df['num_weighted_variance_score'] = num_weighted_scores.sum(axis=1)

    return df['cat_weighted_variance_score'].mean(), df['num_weighted_variance_score'].mean()

To keep our code clean and track our experiments, let’s also define a function to store the results of our experiments.

# Define a function to store the results of our experiments
def model_score(data: pd.DataFrame,
                weights: dict = scoring_weights,
                model_name: str = "model_0"):
    cat_score, num_score = calculate_weighted_variance(data, weights)
    results = {"Model": model_name,
               "Primary features score": cat_score,
               "Secondary features score": num_score}
    return results

model_0_score = model_score(results_model_0, scoring_weights)
model_0_score

Baseline model results.

Now that we have a baseline, let’s see if we can improve our model.

Improving our Model Through Experimentation

Up until now, we did not have to know what was going on under the hood when we ran this code:

nns = NearestNeighbors()
nns.fit(data_scaled)
# kneighbors returns a (distances, indices) tuple; [1] keeps the indices
nns_results_model_0 = nns.kneighbors(data_scaled)[1]

To improve our model, we will need to understand the model parameters and how we can interact with them to get better benchmark sets.

Let’s start by looking at the scikit-learn documentation and source code:

# The below is taken directly from the scikit-learn source

from sklearn.neighbors._base import KNeighborsMixin, NeighborsBase, RadiusNeighborsMixin

class NearestNeighbors_(KNeighborsMixin, RadiusNeighborsMixin, NeighborsBase):
    """Unsupervised learner for implementing neighbor searches.

    Parameters
    ----------
    n_neighbors : int, default=5
        Number of neighbors to use by default for :meth:`kneighbors` queries.

    radius : float, default=1.0
        Range of parameter space to use by default for :meth:`radius_neighbors`
        queries.

    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
        Algorithm used to compute the nearest neighbors:

        - 'ball_tree' will use :class:`BallTree`
        - 'kd_tree' will use :class:`KDTree`
        - 'brute' will use a brute-force search.
        - 'auto' will attempt to decide the most appropriate algorithm
          based on the values passed to :meth:`fit` method.

        Note: fitting on sparse input will override the setting of
        this parameter, using brute force.

    leaf_size : int, default=30
        Leaf size passed to BallTree or KDTree. This can affect the
        speed of the construction and query, as well as the memory
        required to store the tree. The optimal value depends on the
        nature of the problem.

    metric : str or callable, default='minkowski'
        Metric to use for distance computation. Default is "minkowski", which
        results in the standard Euclidean distance when p = 2. See the
        documentation of `scipy.spatial.distance
        <https://docs.scipy.org/doc/scipy/reference/spatial.distance.html>`_ and
        the metrics listed in
        :class:`~sklearn.metrics.pairwise.distance_metrics` for valid metric
        values.

    p : float (positive), default=2
        Parameter for the Minkowski metric from
        sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is
        equivalent to using manhattan_distance (l1), and euclidean_distance
        (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

    metric_params : dict, default=None
        Additional keyword arguments for the metric function.
    """

    def __init__(
        self,
        *,
        n_neighbors=5,
        radius=1.0,
        algorithm="auto",
        leaf_size=30,
        metric="minkowski",
        p=2,
        metric_params=None,
        n_jobs=None,
    ):
        super().__init__(
            n_neighbors=n_neighbors,
            radius=radius,
            algorithm=algorithm,
            leaf_size=leaf_size,
            metric=metric,
            p=p,
            metric_params=metric_params,
            n_jobs=n_jobs,
        )

There are quite a few things going on here.

The NearestNeighbors class inherits from NeighborsBase, which is the base class for nearest neighbor estimators. This class handles the common functionality required for nearest-neighbor searches, such as

  • n_neighbors (the number of neighbors to use)
  • radius (the radius for radius-based neighbor searches)
  • algorithm (the algorithm used to compute the nearest neighbors, such as ‘ball_tree’, ‘kd_tree’, or ‘brute’)
  • metric (the distance metric to use)
  • metric_params (additional keyword arguments for the metric function)

The NearestNeighbors class also inherits from the KNeighborsMixin and RadiusNeighborsMixin classes. These mixin classes add specific neighbor-search functionality to NearestNeighbors:

  • KNeighborsMixin provides functionality to find a fixed number k of nearest neighbors to a point. It does this by finding the distances to the neighbors and their indices, and by constructing a graph of connections between points based on each point’s k-nearest neighbors.
  • RadiusNeighborsMixin is based on the radius neighbors algorithm, which finds all neighbors within a given radius of a point. This method is useful in scenarios where the focus is on capturing all points within a meaningful distance threshold rather than a fixed number of points.

Based on our scenario, KNeighborsMixin provides the functionality we need.
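
To make the difference between the two query styles concrete, here is a minimal sketch on hypothetical toy points (not our hotel data); both methods are part of the standard scikit-learn API:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2D points standing in for scaled feature vectors
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [1.0, 1.0]])

nn = NearestNeighbors(n_neighbors=2, radius=0.3)
nn.fit(points)

# KNeighborsMixin: a fixed number of neighbors per point
# (each point is returned as its own nearest neighbor)
distances, indices = nn.kneighbors(points)

# RadiusNeighborsMixin: all neighbors within a distance threshold,
# so the neighbor count can vary from point to point
radius_distances, radius_indices = nn.radius_neighbors(points)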

Before we can improve our model, we need to understand one key parameter: the distance metric.

The documentation mentions that the NearestNeighbors algorithm uses the “Minkowski” distance by default and gives us a reference to the SciPy API.

In scipy.spatial.distance, we can see two mathematical representations of “Minkowski” distance:

$\|u - v\|_p = \left( \sum_i |u_i - v_i|^p \right)^{1/p}$

This formula calculates the p-th root of the sum of powered differences across all elements.

The second mathematical representation of “Minkowski” distance is:

$\|u - v\|_p = \left( \sum_i w_i \left( |u_i - v_i|^p \right) \right)^{1/p}$

This is very similar to the first one, but it introduces weights $w_i$ for the differences, emphasizing or de-emphasizing specific dimensions. This is useful when certain features are more relevant than others. By default, the w parameter is None, which gives all features the same weight of 1.0.

This is a great option for improving our model as it allows us to pass domain knowledge to our model and emphasize similarities that are most relevant to users.
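
As a quick sanity check of both formulas, we can compute them directly with SciPy’s minkowski function; the vectors and weights below are purely illustrative:

import numpy as np
from scipy.spatial.distance import minkowski

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])

# Unweighted Minkowski with p=2: the standard Euclidean distance
print(minkowski(u, v, p=2))  # sqrt(1 + 4 + 9) ≈ 3.742

# Weighted Minkowski: emphasize the first dimension, mute the last
w = [3.0, 1.0, 0.5]
print(minkowski(u, v, p=2, w=w))  # sqrt(3*1 + 1*4 + 0.5*9) ≈ 3.391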

If we look at the formulas, we see the parameter p. This parameter affects the “path” the algorithm takes to calculate the distance. By default, p=2, which represents the Euclidean distance.

You can think of the Euclidean distance as calculating the distance by drawing a straight line between two points. This is usually the shortest distance; however, it is not always the most desirable way of calculating the distance, especially in higher-dimensional spaces. For more information on why this is the case, there is this great paper online: https://bib.dbvis.de/uploadedFiles/155.pdf

Another common value for p is 1. This represents the Manhattan distance. You can think of it as the distance between two points measured along a grid-like path.

On the other hand, if we increase p towards infinity, we end up with the Chebyshev distance, defined as the maximum absolute difference between any corresponding elements of the vectors. It essentially measures the worst-case difference, making it useful in scenarios where you want to ensure that no single feature varies too much.
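
To see how the choice of p plays out, here is a small illustrative comparison of the three distances on a single pair of points:

import numpy as np
from scipy.spatial.distance import chebyshev, cityblock, euclidean

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

print(cityblock(u, v))  # Manhattan (p=1): |3| + |4| = 7.0
print(euclidean(u, v))  # Euclidean (p=2): sqrt(9 + 16) = 5.0
print(chebyshev(u, v))  # Chebyshev (p→∞): max(|3|, |4|) = 4.0

Note how the distance shrinks as p grows: the larger p is, the more the biggest single coordinate difference dominates the result.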

By reading and getting familiar with the documentation, we have uncovered a few possible options to improve our model.

By default, n_neighbors is 5; however, for our benchmark sets, we want to compare each hotel to the 3 most similar hotels. To do so, we need to set n_neighbors = 4 (the subject hotel + 3 peers):

nns_1 = NearestNeighbors(n_neighbors=4)
nns_1.fit(data_scaled)
nns_1_results_model_1 = nns_1.kneighbors(data_scaled)[1]
results_model_1 = clean_results(nns_results=nns_1_results_model_1,
                                encoders=encoders,
                                data=data_clean)
model_1_score = model_score(results_model_1, scoring_weights, model_name="baseline_k_4")
model_1_score
Slight improvement in our primary features. Image by author

Based on the documentation, we can pass weights to the distance calculation to emphasize the relationship across some features. Based on our domain knowledge, we have identified the features we want to emphasize, in this case, Brand, Market, Country, and Market Tier.

# Set up weights for the distance calculation
weights_dict = {"BRAND": 5,
                "Room_count": 2,
                "Market": 4,
                "Country": 3,
                "Market Tier": 3,
                "HCLASS": 1.5,
                "Demand": 1,
                "Price range": 1,
                "distance_to_airport": 1}

# Transform the weights dictionary into a list, keeping the scaled data column order
weights = [weights_dict[idx] for idx in list(scaler.get_feature_names_out())]

nns_2 = NearestNeighbors(n_neighbors=4, metric_params={'w': weights})
nns_2.fit(data_scaled)
nns_2_results_model_2 = nns_2.kneighbors(data_scaled)[1]
results_model_2 = clean_results(nns_results=nns_2_results_model_2,
                                encoders=encoders,
                                data=data_clean)
model_2_score = model_score(results_model_2, scoring_weights, model_name="baseline_with_weights")
model_2_score

Primary features score keeps improving. Image by author

Passing domain knowledge to the model via weights improved the score significantly. Next, let’s test the impact of the distance measure.

So far, we have been using the Euclidean distance. Let’s see what happens if we use the Manhattan distance instead.

nns_3 = NearestNeighbors(n_neighbors=4, p=1, metric_params={'w': weights})
nns_3.fit(data_scaled)
nns_3_results_model_3 = nns_3.kneighbors(data_scaled)[1]
results_model_3 = clean_results(nns_results=nns_3_results_model_3,
                                encoders=encoders,
                                data=data_clean)
model_3_score = model_score(results_model_3, scoring_weights, model_name="Manhattan_with_weights")
model_3_score
Significant decrease in primary score. image by author

Decreasing p to 1 resulted in some good improvements. Let’s see what happens as p approaches infinity.

To use the Chebyshev distance, we would normally change the metric parameter to “chebyshev”. However, the default sklearn Chebyshev metric doesn’t have a weight parameter. To get around this, we will define a custom weighted_chebyshev metric.

# Define the custom weighted Chebyshev distance function
def weighted_chebyshev(u, v, w):
    """Calculate the weighted Chebyshev distance between two points."""
    return np.max(w * np.abs(u - v))

nns_4 = NearestNeighbors(n_neighbors=4, metric=weighted_chebyshev, metric_params={'w': weights})
nns_4.fit(data_scaled)
nns_4_results_model_4 = nns_4.kneighbors(data_scaled)[1]
results_model_4 = clean_results(nns_results=nns_4_results_model_4,
                                encoders=encoders,
                                data=data_clean)
model_4_score = model_score(results_model_4, scoring_weights, model_name="Chebyshev_with_weights")
model_4_score

Better than the baseline but higher than the previous experiment. Image by author

We managed to decrease the primary feature variance scores through experimentation.

Let’s visualize the results.

results_df = pd.DataFrame([model_0_score, model_1_score, model_2_score,
                           model_3_score, model_4_score]).set_index("Model")
results_df.plot(kind='barh')
Experimentation results. Image by author

Using Manhattan distance with weights seems to give the most accurate benchmark sets according to our needs.

The last step before implementing the benchmark sets would be to examine the sets with the highest Primary features scores and identify what steps to take with them.

# Histogram of Primary features score
results_model_3["cat_weighted_variance_score"].plot(kind="hist")
Score distribution. Image by author
exceptions = results_model_3[results_model_3["cat_weighted_variance_score"] >= 0.4]

print(f"There are {exceptions.shape[0]} benchmark sets with significant variance across the primary features")

Image by author

These 18 cases will need to be reviewed to ensure the benchmark sets are relevant.

As you can see, with a few lines of code and some understanding of nearest neighbor search, we managed to build internal benchmark sets. We can now distribute the sets and start measuring hotels’ KPIs against their benchmark sets.

You don’t always have to focus on the most cutting-edge machine learning methods to deliver value. Very often, simple machine learning can deliver great value.

What are some low-hanging fruits in your business that you could easily tackle with Machine learning?

References

World Bank. “World Development Indicators.” Retrieved June 11, 2024, from https://datacatalog.worldbank.org/search/dataset/0038117

Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (n.d.). On the Surprising Behavior of Distance Metrics in High Dimensional Space. IBM T. J. Watson Research Center and Institute of Computer Science, University of Halle. Retrieved from https://bib.dbvis.de/uploadedFiles/155.pdf

SciPy v1.10.1 Manual. scipy.spatial.distance.minkowski. Retrieved June 11, 2024, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.minkowski.html

GeeksforGeeks. Haversine formula to find distance between two points on a sphere. Retrieved June 11, 2024, from https://www.geeksforgeeks.org/haversine-formula-to-find-distance-between-two-points-on-a-sphere/

scikit-learn. Neighbors Module. Retrieved June 11, 2024, from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors


