This one's not so exciting for non-programmers, but those who do program may like it.
I promise more content that's useful for non-programmers. For example, haiper.ai is likely your best video generator (free, lightly watermarked), and ideogram.ai and piclumen are excellent graphics generators. I still use canva.com. Here goes.
from transformers import pipeline

# Load the model into an extractive question-answering pipeline.
# Note: Legal-BERT is an encoder-only masked-language model, so it cannot
# free-generate text with model.generate() the way GPT-style models can.
# Extractive QA, which pulls the answer span out of the context you supply,
# is the closest runnable use. The QA head on this base checkpoint is
# untrained, so expect rough answers until the model is fine-tuned
# (see the fine-tuning guide below).
model_name = "pile-of-law/legalbert-large-1.7M-2"
print(f"Loading {model_name}...")
qa_pipeline = pipeline("question-answering", model=model_name)
print("Model loaded successfully.")

def generate_legal_text(context, question):
    """
    Extract an answer to the question from the provided legal context.
    """
    result = qa_pipeline(question=question, context=context)
    # The pipeline returns the answer span directly, so no beam search or
    # post-processing of generated text is needed.
    return result["answer"]
print("\nLegal-BERT Model ready. Type 'quit' to exit.")
while True:
user_context = input("\nEnter the legal context or relevant text (or 'quit' to exit): ")
if user_context.lower() == 'quit':
break
user_question = input("\nEnter your legal question: ")
if user_question.lower() == 'quit':
break
try:
generated_answer = generate_legal_text(user_context, user_question)
print(f"\nAnswer: {generated_answer}\n")
except Exception as e:
print(f"An error occurred: {e}")
print("Thank you for using Legal-BERT!")
To fine-tune Legal-BERT for use as a legal consultant, you need to prepare training data and then run a short training job. Below is a guide to formatting your Q&A legal documents as training data, followed by the fine-tuning steps.
Preparing Your Training Data
Data Format
You can use either CSV or JSON formats for your training data. Here are templates for both formats:
CSV Format
question,answer
"What is a contract?", "A contract is a legally binding agreement between two or more parties."
"What is tort law?", "Tort law deals with civil wrongs and damages."
JSON Format
[
  {
    "question": "What is a contract?",
    "answer": "A contract is a legally binding agreement between two or more parties."
  },
  {
    "question": "What is tort law?",
    "answer": "Tort law deals with civil wrongs and damages."
  }
]
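Either format loads into the same structure through the Hugging Face datasets library, so both work with the fine-tuning script below. A quick sanity check, assuming hypothetical file names:
from datasets import load_dataset

# The file names are hypothetical placeholders; both loaders should yield
# identical {'question': ..., 'answer': ...} records.
csv_data = load_dataset('csv', data_files='legal_qa.csv')
json_data = load_dataset('json', data_files='legal_qa.json')
print(csv_data['train'][0])
print(json_data['train'][0])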
Collecting Training Data
Legal Textbooks: Extract key concepts, definitions, and explanations from your textbooks.
Q&A Documents: Use existing Q&A documents or create your own by formulating questions based on legal topics relevant to your needs.
Court Cases: Summarize court cases into questions and answers that highlight legal principles or outcomes.
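However you collect the material, letting Python's csv module handle the quoting avoids the escaping mistakes that often break hand-written CSV files. A minimal sketch (the file name and rows are placeholders for your own data):
import csv

# Placeholder Q&A pairs; substitute rows drawn from your own sources.
pairs = [
    ("What is a contract?", "A contract is a legally binding agreement between two or more parties."),
    ("What is tort law?", "Tort law deals with civil wrongs and damages."),
]

with open("legal_qa.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["question", "answer"])  # header row matching the template above
    writer.writerows(pairs)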
Fine-Tuning the Model
Steps to Fine-Tune Legal-BERT
Set Up Your Environment:
Ensure you have Python installed along with the transformers library from Hugging Face.
Install the required libraries: pip install torch transformers datasets
Load Your Dataset: Use the Hugging Face datasets library to load your CSV or JSON file.
Fine-Tuning Script: Below is an example script to fine-tune the model using the Hugging Face Trainer API:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('csv', data_files='your_data.csv')  # change 'csv' to 'json' if using JSON format

# Load model and tokenizer
model_name = "pile-of-law/legalbert-large-1.7M-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Tokenization function. Extractive QA training needs start/end token labels,
# so in addition to tokenizing each question/answer pair we mark the entire
# answer segment as the target span. (With real span-annotated data you would
# instead locate the answer's position inside a longer context passage.)
def preprocess_function(examples):
    encodings = tokenizer(
        examples['question'],
        examples['answer'],
        truncation=True,
        padding='max_length',
        max_length=384,
    )
    start_positions, end_positions = [], []
    for i in range(len(examples['question'])):
        # sequence_ids marks which tokens came from the answer (sequence 1)
        answer_tokens = [j for j, s in enumerate(encodings.sequence_ids(i)) if s == 1]
        start_positions.append(answer_tokens[0])
        end_positions.append(answer_tokens[-1])
    encodings['start_positions'] = start_positions
    encodings['end_positions'] = end_positions
    return encodings

# Tokenize the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Set training arguments (no evaluation is configured here, since this
# example dataset has no validation split)
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
)

# Start training
trainer.train()
Evaluate and Save the Model: After training, evaluate the model's performance on a validation set if available, and save it for future use:
trainer.save_model("fine-tuned-legalbert")
tokenizer.save_pretrained("fine-tuned-legalbert")  # save the tokenizer too, so the checkpoint can be reloaded on its own
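Once saved, the fine-tuned checkpoint can be loaded straight back into the same question-answering pipeline used in the interactive script above. A minimal sketch (the question and context are placeholders):
from transformers import pipeline

# Reload the fine-tuned model and tokenizer saved above.
qa = pipeline("question-answering", model="fine-tuned-legalbert")
result = qa(
    question="What is a contract?",
    context="A contract is a legally binding agreement between two or more parties.",
)
print(result["answer"], result["score"])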
Conclusion
By following these steps, you can fine-tune Legal-BERT using your legal documents formatted as either CSV or JSON files. This process will tailor the model to better respond to legal inquiries based on your specific context and data.
Citations: [1] https://huggingface.co/pile-of-law/legalbert-large-1.7M-2
I apologize. I had intended to post this to my artificial intelligence Substack, https://artificialint.substack.com
I wish the UI were less "clean" and more "clear". My AI readers likely don't care about international relations, and my international relations readers likely don't care about AI, at least not about programming it.
Again, I am sorry. Thank you for understanding.