Text Extraction Tool
Learn how to use the text extraction tool with Peargent agents
Overview
The Text Extraction Tool is a built-in Peargent Tool that enables Agents to extract plain text from various document formats. It supports HTML, PDF, DOCX, TXT, and Markdown files, as well as URLs. The tool can optionally extract metadata such as title, author, page count, and word and character counts.
Supported Formats
- HTML/XHTML - Web pages with metadata extraction (title, description, author)
- PDF - PDF documents with metadata (title, author, subject, page count)
- DOCX - Microsoft Word documents with document properties
- TXT - Plain text files with automatic encoding detection
- Markdown - Markdown files with title extraction from headers
- URLs - HTTP/HTTPS web resources with built-in SSRF protection
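Before handing an input to the tool, it can be useful to check whether it looks like one of the formats listed above. The following is a minimal sketch of such a pre-flight check, not the tool's own detection logic: the helper name `classify_source`, the extension table, and the returned format names other than "pdf" are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping for illustration; the tool's own detection may differ.
EXTENSION_TO_FORMAT = {
    ".html": "html",
    ".htm": "html",
    ".xhtml": "html",
    ".pdf": "pdf",
    ".docx": "docx",
    ".txt": "txt",
    ".md": "markdown",
    ".markdown": "markdown",
}


def classify_source(file_path: str) -> Optional[str]:
    """Return "url", a guessed format name, or None if the input looks unsupported."""
    scheme = urlparse(file_path).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    # Fall back to the file extension, compared case-insensitively.
    return EXTENSION_TO_FORMAT.get(Path(file_path).suffix.lower())
```

A caller could skip inputs for which this returns None instead of sending them to the tool and handling the error afterwards.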
Usage with Agents
The Text Extraction Tool is most powerful when integrated with Agents. Agents can use the tool to automatically extract and process document content.
Creating an Agent with Text Extraction
To use the text extraction tool with an agent, you need to configure it with a Model and pass the tool to the agent's tools parameter:
from peargent import create_agent
from peargent.tools import text_extractor
from peargent.models import gemini

# Create an agent with text extraction capability
agent = create_agent(
    name="DocumentAnalyzer",
    description="Analyzes documents and extracts key information",
    persona=(
        "You are a document analysis expert. When asked about a document, "
        "use the text extraction tool to extract its content, then analyze "
        "and summarize the information."
    ),
    model=gemini("gemini-2.5-flash-lite"),
    tools=[text_extractor]
)

# Use the agent to analyze a document
response = agent.run("Summarize the key points from document.pdf")
print(response)

Examples
Example 1: Extract Text with Metadata
from peargent.tools import text_extractor

# Extract text and metadata from an HTML file
result = text_extractor.run({
    "file_path": "article.html",
    "extract_metadata": True
})

if result["success"]:
    print(f"Title: {result['metadata']['title']}")
    print(f"Author: {result['metadata']['author']}")
    print(f"Word Count: {result['metadata']['word_count']}")
    print(f"Content:\n{result['text']}")
else:
    print(f"Error: {result['error']}")

Example 2: Extract from URL
from peargent.tools import text_extractor

# Extract text from a web page
result = text_extractor.run({
    "file_path": "https://example.com/article",
    "extract_metadata": True
})

if result["success"]:
    print(f"Website Title: {result['metadata']['title']}")
    print(f"Content: {result['text'][:500]}...")

Example 3: Extract with Length Limit
from peargent.tools import text_extractor

# Extract text but limit to 1000 characters
result = text_extractor.run({
    "file_path": "long_document.pdf",
    "extract_metadata": True,
    "max_length": 1000
})

print(f"Text (max 1000 chars): {result['text']}")

Example 4: Batch Processing Multiple Files
import os

from peargent.tools import text_extractor

documents = ["doc1.pdf", "doc2.docx", "doc3.html"]

for file_path in documents:
    if os.path.exists(file_path):
        result = text_extractor.run({
            "file_path": file_path,
            "extract_metadata": True
        })
        if result["success"]:
            print(f"\n{file_path} ({result['format']})")
            print(f"Words: {result['metadata'].get('word_count', 'N/A')}")
            print(f"Preview: {result['text'][:150]}...")
        else:
            print(f"Error processing {file_path}: {result['error']}")

Example 5: Agent Document Analysis
from peargent import create_agent
from peargent.tools import text_extractor
from peargent.models import gemini

# Create a document analysis agent
agent = create_agent(
    name="ResearchAssistant",
    description="Analyzes research papers and extracts key information",
    persona=(
        "You are a research assistant specializing in document analysis. "
        "When given a document, extract its content and identify: "
        "1) Main topic, 2) Key findings, 3) Methodology, 4) Conclusions"
    ),
    model=gemini("gemini-2.5-flash-lite"),
    tools=[text_extractor]
)

# Ask the agent to analyze a research paper
response = agent.run(
    "Please analyze research_paper.pdf and provide a structured summary"
)
print(response)

Common Use Cases
- Document Summarization: Extract text from documents and have agents summarize them
- Information Extraction: Extract specific information (emails, phone numbers, etc.) from documents
- Content Analysis: Analyze document sentiment, topics, or keywords
- Batch Processing: Process multiple documents programmatically
- Web Scraping: Extract text from web pages while preserving structure
- Research Assistance: Analyze research papers and academic documents
- Compliance Review: Extract and review document contents for compliance checking
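As an illustration of the Information Extraction use case, the text returned by the tool can be post-processed with ordinary Python. The sketch below pulls email-like strings out of a block of text; the helper `find_emails` and its deliberately simple pattern are assumptions for illustration and are not part of Peargent. In practice the input would be the "text" field of a successful extraction result.

```python
import re
from typing import List

# Deliberately simple pattern for illustration; it will not match every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text: str) -> List[str]:
    """Return unique email-like strings in order of first appearance."""
    seen = {}
    for match in EMAIL_PATTERN.findall(text):
        # Deduplicate case-insensitively but keep the first spelling seen.
        seen.setdefault(match.lower(), match)
    return list(seen.values())


# Stand-in for result["text"] from a successful extraction.
sample = "Contact alice@example.com or bob@example.org. Alice again: alice@example.com."
print(find_emails(sample))
```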
Parameters
The text extraction tool accepts the following parameters:
- file_path (string, required): Path to the file or URL to extract text from
- extract_metadata (boolean, optional, default: False): Whether to extract metadata like title, author, page count, etc.
- max_length (integer, optional): Maximum text length to return. If exceeded, text is truncated with "..." appended
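The truncation behavior described for max_length can be pictured with a small stand-alone function. This is a sketch under one stated assumption: the text is cut at max_length characters and "..." is appended afterwards, so the result can be three characters longer than the limit. Whether the tool counts the ellipsis toward the limit is not specified here, and `truncate_text` is a hypothetical name, not a Peargent function.

```python
from typing import Optional


def truncate_text(text: str, max_length: Optional[int] = None) -> str:
    """Cut text to max_length characters and append "..." if anything was removed."""
    if max_length is None or len(text) <= max_length:
        # No limit given, or the text already fits: return it unchanged.
        return text
    # Assumption: the ellipsis is added on top of the limit, not counted within it.
    return text[:max_length] + "..."
```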
Return Value
The tool returns a dictionary with the following structure:
{
    "text": "Extracted plain text content",
    "metadata": {
        "title": "Document Title",
        "author": "Author Name",
        # ... additional metadata depending on format
    },
    "format": "pdf",  # Detected file format
    "success": True,
    "error": None
}

Metadata by Format
Different document formats provide different metadata:
HTML/XHTML:
- title - Page title
- description - Meta description tag
- author - Meta author tag
- word_count - Number of words
- char_count - Number of characters
PDF:
- title - Document title
- author - Document author
- subject - Document subject
- creator - Application that created the PDF
- producer - PDF producer
- creation_date - When the document was created
- page_count - Total number of pages
- word_count - Total word count
- char_count - Total character count
DOCX:
- title - Document title
- author - Document author
- subject - Document subject
- created - Creation date and time
- modified - Last modification date and time
- word_count - Total word count
- char_count - Total character count
- paragraph_count - Number of paragraphs
TXT/Markdown:
- encoding - Text encoding used
- word_count - Total word count
- char_count - Total character count
- line_count - Total line count
- title - (Markdown only) Title extracted from first heading
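Because the available keys differ by format, code that reads metadata is more robust if it treats every field as optional. The sketch below formats a one-line summary from a result dictionary shaped like the return value described earlier. The helper `describe_result` is hypothetical, and the sample dictionaries (including the error message) are hand-written for illustration rather than real tool output.

```python
from typing import Any, Dict

# Fields worth reporting if present; no single format provides all of them.
SUMMARY_KEYS = ("title", "author", "page_count", "paragraph_count", "line_count", "word_count")


def describe_result(result: Dict[str, Any]) -> str:
    """Build a short summary that tolerates missing metadata keys."""
    if not result.get("success"):
        return f"Extraction failed: {result.get('error')}"
    metadata = result.get("metadata") or {}
    parts = [f"format={result.get('format', 'unknown')}"]
    for key in SUMMARY_KEYS:
        if metadata.get(key) is not None:
            parts.append(f"{key}={metadata[key]}")
    return ", ".join(parts)


# Hand-written sample dictionaries, not real tool output.
pdf_result = {
    "text": "...",
    "metadata": {"title": "Report", "author": "A. Writer", "page_count": 3, "word_count": 1200},
    "format": "pdf",
    "success": True,
    "error": None,
}
failed_result = {"text": "", "metadata": {}, "format": None, "success": False, "error": "File not found"}

print(describe_result(pdf_result))
print(describe_result(failed_result))
```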
Troubleshooting
ImportError for document libraries
If you encounter ImportError when extracting specific formats, install the required dependencies:
# For all formats (quoted so shells such as zsh do not expand the brackets)
pip install "peargent[text-extraction]"
# Or individually
pip install beautifulsoup4 pypdf python-docx

SSRF Protection Errors
If you receive an "Access to localhost is not allowed" error, the URL points to a local address that the tool's SSRF protection blocks. Use a public URL instead:
from peargent.tools import text_extractor

# This will fail
result = text_extractor.run({"file_path": "http://localhost:8000/doc"})

# Use a public URL instead
result = text_extractor.run({"file_path": "https://example.com/doc"})

Encoding Issues with Text Files
The tool detects the encoding of text files automatically, including non-standard encodings. If the extracted text is still garbled, re-save the file in a standard encoding such as UTF-8 and run the extraction again.
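If a file keeps producing garbled output, one option is to convert it to UTF-8 yourself before extraction. The following is a minimal sketch using only the standard library. The helper name `reencode_to_utf8` and the default candidate encodings are assumptions chosen for illustration, and the right candidates depend on where the file came from.

```python
from pathlib import Path
from typing import Sequence


def reencode_to_utf8(source: str, target: str,
                     candidates: Sequence[str] = ("utf-8", "cp1252")) -> str:
    """Decode source with the first candidate that works, write target as UTF-8.

    Returns the name of the encoding that succeeded.
    """
    raw = Path(source).read_bytes()
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
        Path(target).write_text(text, encoding="utf-8")
        return encoding
    raise ValueError(f"None of {list(candidates)} could decode {source}")
```

The converted file can then be passed to the tool through the file_path parameter as in the earlier examples. Note that a successful decode does not prove the guess was right, so it is worth checking the converted text by eye.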