In the realm of Natural Language Processing (NLP), understanding the semantic relationship between sentences is crucial for various applications, from search engines and chatbots to sentiment analysis and text summarization. This article delves into a practical implementation of sentence similarity using the powerful BERT model and a Flask web application, allowing you to easily generate sentence embeddings and calculate cosine similarity scores.
BERT (Bidirectional Encoder Representations from Transformers) has revolutionized NLP by capturing contextual relationships within text. Unlike traditional word embeddings that assign a single vector to each word, BERT generates dynamic embeddings that adapt to the surrounding words, leading to a deeper understanding of sentence meaning.
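To see this contextuality concretely, here's a quick standalone sketch (separate from the app we'll build; the sentences and the word "bank" are just an illustration) comparing the vectors BERT assigns to the same word in two different contexts:

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def token_vector(sentence, word):
    # Return the hidden state of the given word's token within the sentence.
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    return hidden[tokens.index(word)]

v1 = token_vector('I deposited cash at the bank.', 'bank')
v2 = token_vector('We sat on the river bank.', 'bank')
# The same word gets different vectors; the similarity is noticeably below 1.0.
print(torch.cosine_similarity(v1, v2, dim=0).item())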
We’ll create a simple Flask application that allows users to input two sentences and receive their similarity score and embeddings.
First, ensure you have Python installed. Then, install the necessary libraries:
pip install flask transformers torch scikit-learn
Here's the backend Flask application code:
from flask import Flask, request, jsonify, render_template_string
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
app = Flask(__name__)

# Load the BERT tokenizer and model once at startup (the weights download on first run).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
def get_embedding(sentence):
    # Tokenize, run BERT, and mean-pool the token vectors into one sentence vector.
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, padding=True, max_length=128)
    with torch.no_grad():  # inference only; no gradients needed
        outputs = model(**inputs)
    # last_hidden_state has shape [1, seq_len, 768]; the mean over tokens gives [1, 768].
    return outputs.last_hidden_state.mean(dim=1).numpy()
@app.route('/')
def index():
    return render_template_string('''
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>BERT Sentence Similarity</title>
<style>
/* CSS styles omitted for brevity */
</style>
</head>
<body>
<div class="container">
<h1>BERT Sentence Similarity Calculator</h1>
<form id="similarity-form">
<label for="sentence1">Sentence 1:</label>
<textarea id="sentence1" name="sentence1" rows="4" required></textarea>
<label for="sentence2">Sentence 2:</label>
<textarea id="sentence2" name="sentence2" rows="4" required></textarea>
<button type="submit">Calculate Similarity</button>
<button type="button" id="clear-button">Clear</button>
</form>
<div id="results-section" style="display: none;">
<h2>Results</h2>
<p id="similarity-score"></p>
<h3>Sentence 1 Embedding:</h3>
<div class="embedding-table-container">
<table id="embedding1-table">
<thead>
<tr><th>Index</th><th>Value</th></tr>
</thead>
<tbody></tbody>
</table>
</div>
<h3>Sentence 2 Embedding:</h3>
<div class="embedding-table-container">
<table id="embedding2-table">
<thead>
<tr><th>Index</th><th>Value</th></tr>
</thead>
<tbody></tbody>
</table>
</div>
</div>
<div class="collapsible-header" id="info-header">How it Works</div>
<div class="collapsible-content" id="info-content">
<h3>Understanding the Application</h3>
<p>This application uses the BERT (Bidirectional Encoder Representations from Transformers) model to convert sentences into numerical vectors, known as embeddings. These embeddings capture the semantic meaning of the sentences.</p>
<p>The cosine similarity is then calculated between these two sentence embeddings. Cosine similarity measures the cosine of the angle between two vectors. A score close to 1 indicates high similarity, while a score close to 0 indicates low similarity.</p>
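<!-- Added for clarity: the formula behind the score reported above. -->
<p>Formally, for sentence embeddings A and B, the score is (A · B) / (||A|| ||B||), i.e. the dot product divided by the product of the vector lengths.</p>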
<h3>Key Components:</h3>
<ul>
<li><strong>BERT Model & Tokenizer:</strong> Loaded from the Hugging Face Transformers library to process text.</li>
<li><strong>get_embedding(sentence) function:</strong> Converts a given sentence into its BERT embedding.</li>
<li><strong>Flask Routes:</strong> Handle web requests for the main page (/), embedding calculation (/embed), and similarity calculation (/similarity).</li>
<li><strong>Cosine Similarity:</strong> Used from sklearn.metrics.pairwise to quantify the semantic similarity between two embeddings.</li>
</ul>
</div>
</div>
<script>
document.getElementById('similarity-form').addEventListener('submit', async function(event) {
event.preventDefault();
const sentence1 = document.getElementById('sentence1').value;
const sentence2 = document.getElementById('sentence2').value;
const response = await fetch('/similarity', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({ sentence1, sentence2 })
});
const data = await response.json();
document.getElementById('similarity-score').textContent = `Similarity Score: ${data.similarity.toFixed(4)}`;
const embedding1TableBody = document.querySelector('#embedding1-table tbody');
embedding1TableBody.innerHTML = '';
data.embedding1[0].forEach((value, index) => {
const row = embedding1TableBody.insertRow();
const indexCell = row.insertCell();
const valueCell = row.insertCell();
indexCell.textContent = index;
valueCell.textContent = value.toFixed(6);
});
const embedding2TableBody = document.querySelector('#embedding2-table tbody');
embedding2TableBody.innerHTML = '';
data.embedding2[0].forEach((value, index) => {
const row = embedding2TableBody.insertRow();
const indexCell = row.insertCell();
const valueCell = row.insertCell();
indexCell.textContent = index;
valueCell.textContent = value.toFixed(6);
});
document.getElementById('results-section').style.display = 'block';
});
document.getElementById('clear-button').addEventListener('click', function() {
document.getElementById('sentence1').value = '';
document.getElementById('sentence2').value = '';
document.getElementById('results-section').style.display = 'none';
document.getElementById('similarity-score').textContent = '';
document.querySelector('#embedding1-table tbody').innerHTML = '';
document.querySelector('#embedding2-table tbody').innerHTML = '';
});
document.getElementById('info-header').addEventListener('click', function() {
const content = document.getElementById('info-content');
if (content.style.display === 'block') {
content.style.display = 'none';
} else {
content.style.display = 'block';
}
});
</script>
</body>
</html>
''')
@app.route('/embed', methods=['POST'])
def embed():
    # Return the BERT embedding for a single sentence.
    data = request.json
    sentence = data.get('sentence', '')
    embedding = get_embedding(sentence)
    return jsonify({'embedding': embedding.tolist()})
@app.route('/similarity', methods=['POST'])
def similarity():
    # Embed both sentences, compute their cosine similarity, and return everything as JSON.
    data = request.json
    sentence1 = data.get('sentence1', '')
    sentence2 = data.get('sentence2', '')
    embedding1 = get_embedding(sentence1)
    embedding2 = get_embedding(sentence2)
    similarity_score = cosine_similarity(embedding1, embedding2)[0][0]
    return jsonify({'similarity': float(similarity_score), 'embedding1': embedding1.tolist(), 'embedding2': embedding2.tolist()})
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
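Before walking through the routes, a quick aside on the metric itself: cosine similarity is the dot product of two vectors divided by the product of their norms. This tiny standalone check (using only numpy and scikit-learn, which the packages installed above already pull in) confirms that cosine_similarity matches the manual formula:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 4.0, 6.0]])   # same direction as a
c = np.array([[-1.0, 0.0, 1.0]])  # different direction

# Manual formula: (a . b) / (||a|| * ||b||)
manual = (a @ b.T) / (np.linalg.norm(a) * np.linalg.norm(b))
print(manual.item())                  # 1.0
print(cosine_similarity(a, b)[0][0])  # 1.0, matches the manual formula
print(cosine_similarity(a, c)[0][0])  # ~0.38: the vectors point in different directions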
Let's break down the key pieces:

- get_embedding(sentence): tokenizes the input sentence, passes it through the BERT model, and returns the mean of the last hidden state as the sentence embedding.
- /: renders the HTML form for inputting sentences.
- /embed: receives a sentence via a POST request and returns its embedding.
- /similarity: receives two sentences, calculates their embeddings, computes the cosine similarity using sklearn.metrics.pairwise.cosine_similarity, and returns the results.

Save the code as app.py and run it from your terminal:
python app.py
Open your browser and navigate to http://localhost:5000. You'll see the form where you can input two sentences and get their similarity score and embeddings.
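You can also exercise the /similarity endpoint directly from the command line (assuming the server is running with the default host and port):

curl -X POST http://localhost:5000/similarity \
  -H "Content-Type: application/json" \
  -d '{"sentence1": "The cat sat on the mat.", "sentence2": "A cat was sitting on the rug."}'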
The HTML portion of the code provides a user-friendly interface. It includes:

- Two text areas for entering the sentences, plus Calculate and Clear buttons.
- A results section that displays the similarity score and each sentence's embedding values in a table.
- A collapsible "How it Works" panel describing the application.
This application can be extended for various NLP tasks:

- Semantic search: rank documents or FAQ entries by their similarity to a query (see the sketch below).
- Duplicate and paraphrase detection: flag sentence pairs whose similarity exceeds a threshold.
- Sentence clustering: group related sentences as a building block for summarization or topic analysis.
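As a minimal sketch of the first idea, here's a toy semantic search that reuses get_embedding from app.py (assuming app.py sits in the same directory; the corpus is hypothetical):

from sklearn.metrics.pairwise import cosine_similarity
from app import get_embedding  # importing app.py loads the BERT model but doesn't start the server

corpus = [
    'How do I reset my password?',
    'The weather is lovely today.',
    'Steps to recover a forgotten password.',
]
query = 'I forgot my login password'

# Embed the query once, then rank every corpus sentence by cosine similarity.
query_vec = get_embedding(query)
scores = [cosine_similarity(query_vec, get_embedding(s))[0][0] for s in corpus]
for score, sentence in sorted(zip(scores, corpus), reverse=True):
    print(f'{score:.4f}  {sentence}')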
By combining the power of BERT with the simplicity of Flask, you can easily build practical NLP applications that leverage sentence similarity. This project serves as a foundation for exploring more advanced NLP techniques and building innovative solutions.