
# Introduction
Did you know that a large portion of valuable information still lives in unstructured text, such as research papers, clinical notes, and financial reports? Extracting reliable, structured information from these texts has always been a challenge. LangExtract is an open-source Python library (released by Google) that solves this problem using large language models (LLMs). You define what to extract via a simple prompt and a few examples, and it then uses an LLM (like Google’s Gemini, OpenAI models, or local models) to pull that information out of documents of any length. It also stands out for its support for very long documents (through chunking and multi-pass processing) and its interactive visualization of results. Let’s explore this library in more detail.
# 1. Installing and Setting Up
To install LangExtract locally, first ensure you have Python 3.10+ installed. The library is available on PyPI. For an isolated environment, you may first create and activate a virtual environment:

```bash
python -m venv langextract_env
source langextract_env/bin/activate  # On Windows: .\langextract_env\Scripts\activate
```

Then install the library:

```bash
pip install langextract
```
Installation from source and a Docker-based setup are also available; see the project’s README for details.
# 2. Setting Up API Keys (for Cloud Models)
LangExtract itself is free and open-source, but if you use cloud-hosted LLMs (like Google Gemini or OpenAI GPT models), you must supply an API key. You can set the LANGEXTRACT_API_KEY environment variable or store it in a .env file in your working directory. For example:
```bash
export LANGEXTRACT_API_KEY="YOUR_API_KEY_HERE"
```
or in a .env file:
```bash
cat >> .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF
echo '.env' >> .gitignore
```
On-device LLMs via Ollama or other local backends do not require an API key. To enable OpenAI, install the extra with `pip install langextract[openai]`, set your `OPENAI_API_KEY`, and pass an OpenAI `model_id`. For Vertex AI (enterprise users), service-account authentication is supported.
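Because the library reads keys from the environment at run time, it can help to verify they are set before making any calls. Below is a minimal, stdlib-only check; only the variable names come from the setup above, and the helper itself is not part of LangExtract:

```python
import os

def check_api_key(var_name: str = "LANGEXTRACT_API_KEY") -> bool:
    """Return True if the given API-key environment variable is set and non-empty."""
    return bool(os.environ.get(var_name))

# Fail fast with a clear message instead of a mid-run authentication error.
if not check_api_key("LANGEXTRACT_API_KEY"):
    print("Warning: LANGEXTRACT_API_KEY is not set; cloud models will fail.")
```

The same check works for `OPENAI_API_KEY` by passing that name instead.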
# 3. Defining an Extraction Task
LangExtract works by you telling it what information to extract. You do this by writing a clear prompt description and supplying one or more ExampleData annotations that show what a correct extraction looks like on sample text. For instance, to extract characters, emotions, and relationships from a line of literature, you might write:
```python
import langextract as lx

prompt = """
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context."""

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? ...",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"},
            ),
        ],
    )
]
```
These examples (taken from LangExtract’s README) tell the model exactly what kind of structured output is expected. You can create similar examples for your domain.
# 4. Running the Extraction
Once your prompt and examples are defined, you simply call the lx.extract() function. The key arguments are:
- `text_or_documents`: Your input text, a list of texts, or even a URL string (LangExtract can fetch and process text from a Project Gutenberg or other URL).
- `prompt_description`: The extraction instructions (a string).
- `examples`: A list of `ExampleData` objects that illustrate the desired output.
- `model_id`: The identifier of the LLM to use (e.g. `"gemini-2.5-flash"` for Google Gemini Flash, an Ollama model like `"gemma2:2b"`, or an OpenAI model like `"gpt-4o"`).
- Other optional parameters: `extraction_passes` (to re-run extraction for higher recall on long texts), `max_workers` (to process chunks in parallel), `fence_output`, `use_schema_constraints`, etc.
For example:
```python
input_text = '''JULIET. O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.
ROMEO. Shall I hear more, or shall I speak at this?
JULIET. 'Tis but thy name that is my enemy;
Thou art thyself, though not a Montague.
What's in a name? That which we call a rose
By any other name would smell as sweet.'''
```
```python
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
```
This sends the prompt and examples along with the text to the chosen LLM and returns a result object. LangExtract automatically handles splitting long texts into chunks, batching calls in parallel, and merging the outputs.
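To see the spirit of that chunking step, here is a deliberately naive fixed-size chunker with overlap. This is only an illustration of the idea; LangExtract’s own chunking is internal and more sophisticated (it is aware of sentence and token boundaries), so treat this as a sketch rather than its actual algorithm:

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap`
    characters so that entities near a boundary appear whole in at least one chunk."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # last chunk reached the end of the text
        start += max_chars - overlap
    return chunks
```

Each chunk would then be sent to the model independently (in parallel), and the per-chunk results merged, which is why the overlap matters: it keeps boundary-straddling entities intact.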
# 5. Handling Output and Visualization
The output of lx.extract() is a Python object (often called result) that contains the extracted entities and attributes. You can inspect it programmatically or save it for later. LangExtract also provides helper functions to save results: for example, you can write the results to a JSONL (JSON Lines) file (one document per line) and generate an interactive HTML review. For example:
```python
lx.io.save_annotated_documents(
    [result], output_name="extraction_results.jsonl", output_dir="."
)

# Generate an interactive HTML visualization
html = lx.visualize("extraction_results.jsonl")
with open("viz.html", "w") as f:
    f.write(html if isinstance(html, str) else html.data)
```
This writes an extraction_results.jsonl file and an interactive viz.html file. The JSONL format is convenient for large datasets and further processing, and the HTML file highlights each extracted span in context (color-coded by class) for easy human inspection.
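Because each line of a JSONL file is an independent JSON object, downstream processing needs nothing beyond the standard library. A small reader sketch follows; note that the exact field names inside each record depend on LangExtract’s output schema, so inspect a line before relying on specific keys:

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                records.append(json.loads(line))
    return records
```

This one-record-per-line layout is also why JSONL scales well: you can stream a large results file line by line instead of loading it whole.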

# 6. Supporting Input Formats
LangExtract is flexible about input. You can supply:
- Plain text strings: Any text you load into Python (e.g. from a file or database) can be processed.
- URLs: As shown above, you can pass a URL (e.g. a Project Gutenberg link) as `text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt"`. LangExtract will download and extract from that document.
- List of texts: Pass a Python list of strings to process multiple documents in one call.
- Rich text or Markdown: Since LangExtract works at the text level, you could also feed in Markdown or HTML if you pre-process it to raw text. (LangExtract itself doesn’t parse PDFs or images; you need to extract the text first.)
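For the HTML case, a minimal pre-processing step can be sketched with the standard library’s `html.parser`; this keeps only visible text and joins it with spaces (real pages often warrant a dedicated parser such as BeautifulSoup, and note this simple version does not filter script or style content):

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Accumulate the text content of an HTML document, dropping the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    """Strip markup so the result can be fed to lx.extract() as plain text."""
    collector = _TextCollector()
    collector.feed(html)
    return " ".join(collector.parts)
```

The resulting plain string can then be passed as `text_or_documents` like any other input.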
# 7. Conclusion
LangExtract makes it easy to turn unstructured text into structured data. With high accuracy, clear source mapping, and simple customization, it works well when rule-based methods fall short. It is especially useful for complex or domain-specific extractions. While there is room for improvement, LangExtract is already a strong tool for extracting grounded information in 2025.
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.