Sentence Similarity: BERT & Flask in Action

Measuring how close two sentences are in meaning underpins many Natural Language Processing (NLP) applications, from search engines and chatbots to sentiment analysis and text summarization. This article walks through a practical implementation: a Flask web application that uses BERT to generate sentence embeddings and computes the cosine similarity between them.

Why BERT?

BERT (Bidirectional Encoder Representations from Transformers) captures contextual relationships within text. Unlike static word embeddings, which assign a single vector to each word regardless of context, BERT produces embeddings that depend on the surrounding words. The word "bank", for example, gets a different vector in "river bank" than in "bank account", which gives sentence-level comparisons more to work with.

Building the Flask Application

We’ll create a simple Flask application that allows users to input two sentences and receive their similarity score and embeddings.

1. Setting up the Environment:

First, ensure you have Python installed. Then, install the necessary libraries:

pip install flask transformers torch scikit-learn
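
To confirm the installation before writing any application code, a quick check like the following can help. It is only a sketch: the helper name missing_packages is made up for this article. Note that scikit-learn is imported under the name sklearn.

```python
import importlib.util


def missing_packages(import_names):
    """Return the names from import_names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names for the pip packages: flask, transformers, torch, scikit-learn
    required = ["flask", "transformers", "torch", "sklearn"]
    missing = missing_packages(required)
    print("All set" if not missing else f"Missing: {', '.join(missing)}")
```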

2. The Python Code (app.py):

Here's the complete Flask application. The HTML page, including its JavaScript, is embedded in the script as a template string:

from flask import Flask, request, jsonify, render_template_string
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

app = Flask(__name__)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

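def get_embedding(sentence):
    # Tokenize one sentence; anything beyond 128 tokens is truncated.
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, padding=True, max_length=128)
    # Inference only, so skip gradient tracking to save memory and time.
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token vectors into one sentence vector of shape (1, 768).
    return outputs.last_hidden_state.mean(dim=1).numpy()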

@app.route('/')
def index():
    return render_template_string('''
        <!DOCTYPE html>
        <html lang="en">
        <head>
            <meta charset="UTF-8">
            <meta name="viewport" content="width=device-width, initial-scale=1.0">
            <title>BERT Sentence Similarity</title>
            <style>
                /* CSS styles as defined in this HTML file */
            </style>
        </head>
        <body>
            <div class="container">
                <h1>BERT Sentence Similarity Calculator</h1>
                <form id="similarity-form">
                    <label for="sentence1">Sentence 1:</label>
                    <textarea id="sentence1" name="sentence1" rows="4" required></textarea>

                    <label for="sentence2">Sentence 2:</label>
                    <textarea id="sentence2" name="sentence2" rows="4" required></textarea>

                    <button type="submit">Calculate Similarity</button>
                    <button type="button" id="clear-button">Clear</button>
                </form>

                <div id="results-section" style="display: none;">
                    <h2>Results</h2>
                    <p id="similarity-score"></p>

                    <h3>Sentence 1 Embedding:</h3>
                    <div class="embedding-table-container">
                        <table id="embedding1-table">
                            <thead>
                                <tr><th>Index</th><th>Value</th></tr>
                            </thead>
                            <tbody></tbody>
                        </table>
                    </div>

                    <h3>Sentence 2 Embedding:</h3>
                    <div class="embedding-table-container">
                        <table id="embedding2-table">
                            <thead>
                                <tr><th>Index</th><th>Value</th></tr>
                            </thead>
                            <tbody></tbody>
                        </table>
                    </div>
                </div>

                <div class="collapsible-header" id="info-header">How it Works</div>
                <div class="collapsible-content" id="info-content">
                    <h3>Understanding the Application</h3>
                    <p>This application uses the BERT (Bidirectional Encoder Representations from Transformers) model to convert sentences into numerical vectors, known as embeddings. These embeddings capture the semantic meaning of the sentences.</p>
                    <p>The cosine similarity is then calculated between these two sentence embeddings. Cosine similarity measures the cosine of the angle between two vectors and ranges from -1 to 1. A score close to 1 indicates high similarity, while a score close to 0 indicates little or no similarity.</p>
                    <h3>Key Components:</h3>
                    <ul>
                        <li><strong>BERT Model & Tokenizer:</strong> Loaded from the Hugging Face Transformers library to process text.</li>
                        <li><strong>get_embedding(sentence) function:</strong> Converts a given sentence into its BERT embedding.</li>
                        <li><strong>Flask Routes:</strong> Handles web requests for the main page (/), embedding calculation (/embed), and similarity calculation (/similarity).</li>
                        <li><strong>Cosine Similarity:</strong> Used from sklearn.metrics.pairwise to quantify the semantic similarity between two embeddings.</li>
                    </ul>
                </div>
            </div>

            <script>
                document.getElementById('similarity-form').addEventListener('submit', async function(event) {
                    event.preventDefault();
                    const sentence1 = document.getElementById('sentence1').value;
                    const sentence2 = document.getElementById('sentence2').value;

                    const response = await fetch('/similarity', {
                        method: 'POST',
                        headers: {
                            'Content-Type': 'application/json'
                        },
                        body: JSON.stringify({ sentence1, sentence2 })
                    });

                    const data = await response.json();

                    document.getElementById('similarity-score').textContent = `Similarity Score: ${data.similarity.toFixed(4)}`;

                    const embedding1TableBody = document.querySelector('#embedding1-table tbody');
                    embedding1TableBody.innerHTML = '';
                    data.embedding1[0].forEach((value, index) => {
                        const row = embedding1TableBody.insertRow();
                        const indexCell = row.insertCell();
                        const valueCell = row.insertCell();
                        indexCell.textContent = index;
                        valueCell.textContent = value.toFixed(6);
                    });

                    const embedding2TableBody = document.querySelector('#embedding2-table tbody');
                    embedding2TableBody.innerHTML = '';
                    data.embedding2[0].forEach((value, index) => {
                        const row = embedding2TableBody.insertRow();
                        const indexCell = row.insertCell();
                        const valueCell = row.insertCell();
                        indexCell.textContent = index;
                        valueCell.textContent = value.toFixed(6);
                    });

                    document.getElementById('results-section').style.display = 'block';
                });

                document.getElementById('clear-button').addEventListener('click', function() {
                    document.getElementById('sentence1').value = '';
                    document.getElementById('sentence2').value = '';
                    document.getElementById('results-section').style.display = 'none';
                    document.getElementById('similarity-score').textContent = '';
                    document.querySelector('#embedding1-table tbody').innerHTML = '';
                    document.querySelector('#embedding2-table tbody').innerHTML = '';
                });

                document.getElementById('info-header').addEventListener('click', function() {
                    const content = document.getElementById('info-content');
                    if (content.style.display === 'block') {
                        content.style.display = 'none';
                    } else {
                        content.style.display = 'block';
                    }
                });
            </script>
        </body>
        </html>
    ''')

@app.route('/embed', methods=['POST'])
def embed():
    # Fall back to an empty dict if the body is missing or not valid JSON.
    data = request.get_json(silent=True) or {}
    sentence = data.get('sentence', '')
    embedding = get_embedding(sentence)
    return jsonify({'embedding': embedding.tolist()})

@app.route('/similarity', methods=['POST'])
def similarity():
    data = request.get_json(silent=True) or {}
    sentence1 = data.get('sentence1', '')
    sentence2 = data.get('sentence2', '')
    embedding1 = get_embedding(sentence1)
    embedding2 = get_embedding(sentence2)
    # Both embeddings have shape (1, 768), so the result is a 1x1 matrix.
    similarity_score = cosine_similarity(embedding1, embedding2)[0][0]
    return jsonify({
        'similarity': float(similarity_score),
        'embedding1': embedding1.tolist(),
        'embedding2': embedding2.tolist(),
    })

if __name__ == '__main__':
    # 0.0.0.0 listens on all network interfaces; use 127.0.0.1 to keep the app local-only.
    app.run(host='0.0.0.0', port=5000)

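3. Understanding the Code:

The script loads the bert-base-uncased tokenizer and model once at startup, so every request reuses them. get_embedding(sentence) tokenizes a sentence, runs it through BERT, and averages the token vectors in last_hidden_state into a single sentence embedding. Three routes handle requests: / serves the page, /embed returns the embedding of one sentence, and /similarity returns the cosine similarity of two sentences together with both embeddings.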
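
The two numerical steps in app.py, the mean pooling inside get_embedding and the cosine_similarity call, can be illustrated with plain Python lists. This is a sketch on made-up 3-dimensional vectors rather than BERT's 768-dimensional output, and the function names mean_pool and cosine are illustrative, not part of the application.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one sentence vector,
    mirroring last_hidden_state.mean(dim=1) for a single sentence."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "token embeddings" for three short sentences (made-up numbers).
sentence_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
sentence_b = [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0]]
sentence_c = [[0.0, 4.0, 0.0]]

vec_a = mean_pool(sentence_a)  # [2.0, 0.0, 1.0]
vec_b = mean_pool(sentence_b)  # [2.0, 1.0, 1.0]
vec_c = mean_pool(sentence_c)  # [0.0, 4.0, 0.0]

print(cosine(vec_a, vec_b))  # vectors point in similar directions: close to 1
print(cosine(vec_a, vec_c))  # vectors are orthogonal: 0.0
```

Only the direction of the vectors matters here: scaling either vector by a positive constant leaves the score unchanged.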

4. Running the Application:

Save the code as app.py and run it from your terminal:

python app.py

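Open your browser and navigate to http://127.0.0.1:5000. (The server binds to 0.0.0.0, which means "all interfaces"; it is a listen address, not one to browse to.) You'll see the form where you can input two sentences and get their similarity score and embeddings. The first run downloads the bert-base-uncased weights, so startup may take a while.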
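
The /similarity endpoint can also be called without the browser. The sketch below uses only the standard library and assumes the server from app.py is running locally on port 5000; BASE_URL, build_similarity_request, and fetch_similarity are names chosen for this example.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # assumption: app.py is running on this machine


def build_similarity_request(sentence1, sentence2, base_url=BASE_URL):
    """Build the same JSON POST request that the page's script sends to /similarity."""
    payload = json.dumps({"sentence1": sentence1, "sentence2": sentence2}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/similarity",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_similarity(sentence1, sentence2):
    """Send the request and return the similarity score from the JSON response.

    Raises urllib.error.URLError if the server is not reachable.
    """
    request = build_similarity_request(sentence1, sentence2)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["similarity"]
```

With the server running, fetch_similarity("A man is playing a guitar.", "Someone plays an instrument.") returns the score as a float.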

5. HTML Structure and Functionality

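The HTML portion of the code provides a user-friendly interface. It includes a form with two text areas plus Calculate Similarity and Clear buttons, a results section that shows the similarity score and each embedding as an index/value table, and a collapsible "How it Works" panel. The embedded script posts the two sentences to /similarity as JSON and fills in the results without reloading the page.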

6. Practical Applications:

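This application can be extended for various NLP tasks, such as the search, chatbot, sentiment analysis, and summarization use cases mentioned in the introduction.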
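
As one illustration of such an extension, the same cosine score can rank a set of candidate sentences against a query, which is the core of a simple semantic search. The sketch below works on precomputed vectors; in the application these would come from get_embedding. The function rank_by_similarity and the toy vectors are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_by_similarity(query_vec, candidates):
    """Return (label, score) pairs sorted from most to least similar to query_vec.

    candidates maps a label (for example the sentence text) to its embedding vector.
    """
    scored = [(label, cosine(query_vec, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors standing in for real sentence embeddings.
query = [1.0, 0.0]
candidates = {
    "unrelated": [0.0, 1.0],
    "close match": [0.9, 0.1],
    "partial match": [0.5, 0.5],
}

for label, score in rank_by_similarity(query, candidates):
    print(f"{score:.3f}  {label}")
```

For more than a handful of candidates, computing each candidate's embedding once and storing it avoids rerunning BERT on every query.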

7. Further Improvements:

By combining the power of BERT with the simplicity of Flask, you can easily build practical NLP applications that leverage sentence similarity. This project serves as a foundation for exploring more advanced NLP techniques and building innovative solutions.