Textricator is a tool to extract text from documents and generate structured data.
If you have a bunch of PDFs with the same format (or one big, consistently formatted PDF) and you want to extract the data to CSV or JSON, Textricator can help! It can even work on OCR'ed documents!
Textricator is released under the GNU Affero General Public License Version 3.
Textricator is deployed to Maven Central under the group and artifact IDs io.mfj:textricator.
This application is actively used and developed by Measures for Justice. We welcome feedback, bug reports, and contributions. Create an issue, send a pull request, or email us at email@example.com. If you use Textricator, please let us know. Send us your mailing address and we will mail you a sticker.
io.mfj.textricator.Textricator is the main entry point for library usage.
io.mfj.textricator.cli.TextricatorCli is the command-line interface.
The CLI has three subcommands, one for each of Textricator's main features:
text - Extract text from the PDF and generate JSON.
table - Parse the text that is in columns and rows. See Table section.
form - Parse the text with a configured finite state machine. See Form section.
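To give a feel for what parsing with a finite state machine means in the form subcommand, here is a minimal conceptual sketch in Python. It is not Textricator's API or configuration format: the state names, the `classify` rule, and the sample lines are all assumptions invented for illustration. The idea is that each piece of extracted text moves the parser into a new state, and only configured transitions are allowed.

```python
# Conceptual sketch only: hypothetical states and rules, not Textricator's config format.

# Each state maps to the states that are allowed to follow it.
TRANSITIONS = {
    "START": ["NAME"],
    "NAME": ["CHARGE"],
    "CHARGE": ["CHARGE", "NAME"],
}


def classify(line):
    """Toy condition: a line starting with 'Name:' begins a new record."""
    return "NAME" if line.startswith("Name:") else "CHARGE"


def parse(lines):
    """Walk the lines through the state machine and build structured records."""
    state = "START"
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        next_state = classify(line)
        if next_state not in TRANSITIONS[state]:
            raise ValueError(f"unexpected {next_state} after {state}: {line!r}")
        if next_state == "NAME":
            # A NAME line opens a new record.
            records.append({"name": line[len("Name:"):].strip(), "charges": []})
        else:
            # A CHARGE line is attached to the most recent record.
            records[-1]["charges"].append(line)
        state = next_state
    return records


sample = ["Name: Jane Doe", "Theft", "Trespass", "Name: John Roe", "Fraud"]
records = parse(sample)
```

Because transitions are explicit, text that arrives in an unexpected order (for example, a charge before any name) is rejected instead of being silently attached to the wrong record.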