This one's not so exciting for non-programmers, but those who do program may like it.
I promise more content that's useful for non-programmers. For example, haiper.ai is likely your best video generator (free, lightly watermarked), and ideogram.ai and piclumen are excellent graphics generators. I still use canva.com. Here goes.
from transformers import pipeline

# Load the model into an extractive question-answering pipeline.
# Note: Legal-BERT is an encoder-only masked-language model, so it cannot
# free-generate text with model.generate() the way GPT-style models can.
# Extractive QA, which pulls the answer span out of the context you supply,
# is the closest runnable use. The QA head on this base checkpoint is
# untrained, so expect rough answers until the model is fine-tuned
# (see the fine-tuning guide below).
model_name = "pile-of-law/legalbert-large-1.7M-2"
print(f"Loading {model_name}...")
qa_pipeline = pipeline("question-answering", model=model_name)
print("Model loaded successfully.")

def generate_legal_text(context, question):
    """
    Extract an answer to the question from the provided legal context.
    """
    result = qa_pipeline(question=question, context=context)
    # The pipeline returns the answer span directly, so no beam search or
    # post-processing of generated text is needed.
    return result["answer"]
print("\nLegal-BERT Model ready. Type 'quit' to exit.")
while True:
user_context = input("\nEnter the legal context or relevant text (or 'quit' to exit): ")
if user_context.lower() == 'quit':
break
user_question = input("\nEnter your legal question: ")
if user_question.lower() == 'quit':
break
try:
generated_answer = generate_legal_text(user_context, user_question)
print(f"\nAnswer: {generated_answer}\n")
except Exception as e:
print(f"An error occurred: {e}")
print("Thank you for using Legal-BERT!")
To fine-tune Legal-BERT for use as a legal consultant, you need to prepare training data and then run a short training job. Below is a guide to formatting your Q&A legal documents as training data, followed by the fine-tuning steps.
Preparing Your Training Data
Data Format
You can use either CSV or JSON formats for your training data. Here are templates for both formats:
CSV Format
question,answer
"What is a contract?", "A contract is a legally binding agreement between two or more parties."
"What is tort law?", "Tort law deals with civil wrongs and damages."
JSON Format
[
  {
    "question": "What is a contract?",
    "answer": "A contract is a legally binding agreement between two or more parties."
  },
  {
    "question": "What is tort law?",
    "answer": "Tort law deals with civil wrongs and damages."
  }
]
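Either format loads into the same structure through the Hugging Face datasets library, so both work with the fine-tuning script below. A quick sanity check, assuming hypothetical file names:
from datasets import load_dataset

# The file names are hypothetical placeholders; both loaders should yield
# identical {'question': ..., 'answer': ...} records.
csv_data = load_dataset('csv', data_files='legal_qa.csv')
json_data = load_dataset('json', data_files='legal_qa.json')
print(csv_data['train'][0])
print(json_data['train'][0])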
Collecting Training Data
Legal Textbooks: Extract key concepts, definitions, and explanations from your textbooks.
Q&A Documents: Use existing Q&A documents or create your own by formulating questions based on legal topics relevant to your needs.
Court Cases: Summarize court cases into questions and answers that highlight legal principles or outcomes.
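However you collect the material, letting Python's csv module handle the quoting avoids the escaping mistakes that often break hand-written CSV files. A minimal sketch (the file name and rows are placeholders for your own data):
import csv

# Placeholder Q&A pairs; substitute rows drawn from your own sources.
pairs = [
    ("What is a contract?", "A contract is a legally binding agreement between two or more parties."),
    ("What is tort law?", "Tort law deals with civil wrongs and damages."),
]

with open("legal_qa.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(["question", "answer"])  # header row matching the template above
    writer.writerows(pairs)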
Fine-Tuning the Model
Steps to Fine-Tune Legal-BERT
Set Up Your Environment:
Ensure you have Python installed along with the transformers library from Hugging Face.
Install the required libraries: pip install torch transformers datasets
Load Your Dataset: Use the Hugging Face datasets library to load your CSV or JSON file.
Fine-Tuning Script: Below is an example script to fine-tune the model using the Hugging Face Trainer API:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('csv', data_files='your_data.csv')  # change 'csv' to 'json' if using JSON format

# Load model and tokenizer
model_name = "pile-of-law/legalbert-large-1.7M-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Tokenization function. Extractive QA training needs start/end token labels,
# so in addition to tokenizing each question/answer pair we mark the entire
# answer segment as the target span. (With real span-annotated data you would
# instead locate the answer's position inside a longer context passage.)
def preprocess_function(examples):
    encodings = tokenizer(
        examples['question'],
        examples['answer'],
        truncation=True,
        padding='max_length',
        max_length=384,
    )
    start_positions, end_positions = [], []
    for i in range(len(examples['question'])):
        # sequence_ids marks which tokens came from the answer (sequence 1)
        answer_tokens = [j for j, s in enumerate(encodings.sequence_ids(i)) if s == 1]
        start_positions.append(answer_tokens[0])
        end_positions.append(answer_tokens[-1])
    encodings['start_positions'] = start_positions
    encodings['end_positions'] = end_positions
    return encodings

# Tokenize the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Set training arguments (no evaluation is configured here, since this
# example dataset has no validation split)
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
)

# Start training
trainer.train()
Evaluate and Save the Model: After training, evaluate the model's performance on a validation set if available, and save it for future use:
trainer.save_model("fine-tuned-legalbert")
tokenizer.save_pretrained("fine-tuned-legalbert")  # save the tokenizer too, so the checkpoint can be reloaded on its own
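Once saved, the fine-tuned checkpoint can be loaded straight back into the same question-answering pipeline used in the interactive script above. A minimal sketch (the question and context are placeholders):
from transformers import pipeline

# Reload the fine-tuned model and tokenizer saved above.
qa = pipeline("question-answering", model="fine-tuned-legalbert")
result = qa(
    question="What is a contract?",
    context="A contract is a legally binding agreement between two or more parties.",
)
print(result["answer"], result["score"])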
Conclusion
By following these steps, you can fine-tune Legal-BERT using your legal documents formatted as either CSV or JSON files. This process will tailor the model to better respond to legal inquiries based on your specific context and data.
Citations: [1] https://huggingface.co/pile-of-law/legalbert-large-1.7M-2
I apologize. I had intended to post this to my artificial intelligence Substack, https://artificialint.substack.com
I wish the UI were less "clean" and more "clear". My AI readers likely don't care about international relations, and my international relations readers likely don't care about AI, at least not about programming it.
Again, I am sorry. Thank you for understanding.