LLM-powered Text to Dataset Creator

You can extract precise datasets from web pages and PDF directories quickly with basic Python skills.

Datasets extracted from PDFs or web pages can be invaluable when the data is difficult to source.

They can be used to build indicators of sentiment from news stories, pricing shares from investor reports, and interpreting the hawkish/dovish tone of central bank announcements.

However, creating structured text-based datasets from a large number of PDF documents or web links can be painstaking and may require convoluted Python code.

Attempting such tasks directly in a ChatGPT UI can also quickly hit context window limits or upload limits when you have a large number of documents.

Instead you can leverage large language models (LLMs) with Python and OpenAI's API to extract information from an entire PDF directory or list of web URLs in a few minutes.

It provides the output in a precise data table in a specific format (e.g., string, timestamp, number) and then export it to a CSV file.

The Python script can then be automated to run on a regular basis (e.g. monthly) or when a new file is added without manual overhead.

Included in this guide are two notebook templates that you can follow interactively in your browser with Google Colab:

Text2Data webloader template: use an LLM to extract data from a list of URLs.
- Demo included: Using announcements from Central Banks as an example, I'll illustrate how we can automate the classification of Fed statements on a scale from very dovish (-1) to very hawkish (1).
Text2Data PDF loader template: use an LLM to extract and analyze information from an entire directory of PDFs.
- Demo included: systematically collecting structured data on R&D expenditure from financial statements.

All you require is an OpenAI API key. Costs and tokens used are transparently displayed within the notebook.

Try it out and automate your data extraction tasks.

Check out my blog post too for an in-depth guide on the templates.

Add to cart

A simple template on how to use OpenAI models with Python to extract datasets from web links and PDF directories.

LLM-powered Text to Dataset Creator

Included: 2 interactive Python notebooks via Google Colab