What do the outputs look like?
We used a real academic PDF to generate the outputs below. Each format serves a different purpose — from raw readability to tokenized inputs for LLMs.
Original PDF (Page 1)

Verbatim TXT
Human-readable and searchable, often used for compliance and chain-of-custody. Preserves page numbers, headers, and footers.
Social Sciences & Humanities Open (2024) 100864 Contents lists available at ScienceDirect Social Sciences & Humanities Open journal homepage: www.sciencedirect.com/ journal/social-sciences-and-humanities-open Review Article Stock market prediction using artificial intelligence: A systematic review of systematic reviews Chin Yang Lin , João Alexandre Lobo Marques *
Structured TXT
Human-readable, with page numbers, headers and footers stripped, punctuation preserved. Used for audit and narrative.
Social Sciences & Humanities Open journal homepage: www.sciencedirect.com/journal/social-sciences-and-humanities-open Review Article Stock market prediction using artificial intelligence: A systematic review of systematic reviews Chin Yang Lin , Joao Alexandre Lobo Marques * Faculty of Business and Law, University of Saint Joseph, Macau ARTICLE INFO ABSTRACT Keywords: Machine learning Deep learning Support vector machines (SVM) Long short-term memory (LSTM) Neural networks (NN
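One common way to strip running headers, footers, and page numbers is to drop lines that repeat across most pages, along with lines that contain only digits. The sketch below illustrates that heuristic. It is an assumption about how such a step could work, not a description of how the Structured TXT output above was produced; the function name `strip_repeated_lines` and the `min_share` parameter are hypothetical.

```python
from collections import Counter


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines that recur on many pages, plus bare page numbers.

    `pages` is a list of pages, each a list of text lines.
    `min_share` is a hypothetical tuning knob: the fraction of pages a
    line must appear on before it is treated as a header or footer.
    """
    # Count the number of pages on which each distinct line appears.
    page_counts = Counter(line for page in pages for line in set(page))
    # Require at least two pages so single-page documents lose nothing.
    threshold = max(2, int(min_share * len(pages)))

    cleaned = []
    for page in pages:
        kept = [
            line
            for line in page
            if page_counts[line] < threshold and not line.strip().isdigit()
        ]
        cleaned.append(kept)
    return cleaned
```

Punctuation and casing are left untouched here, which matches the description of this format as still human-readable.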
Standardized TXT
Lowercased, depunctuated, stripped to essentials — ideal for readability and ML scoring. No further pre-processing required.
social sciences humanities open contents lists available at sciencedirect stock market prediction using artificial intelligence a systematic review of systematic reviews chin yang lin joao alexandre lobo marques machine learning deep learning support vector machines long short term memory lstm
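As a rough illustration of what "lowercased, depunctuated" can mean in practice, the sketch below folds accents, lowercases the text, and replaces punctuation with spaces. It is a minimal approximation under stated assumptions, not the pipeline that produced the sample above; the function name `standardize` is hypothetical, and the real output may apply further rules that this sketch does not reproduce.

```python
import re
import unicodedata


def standardize(text):
    """Lowercase, fold accents, and replace punctuation with single spaces."""
    # Decompose accented characters and drop the combining marks (João -> Joao).
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then collapse every run of non-alphanumerics into one space.
    return re.sub(r"[^a-z0-9]+", " ", folded.lower()).strip()
```

For example, `standardize("Long short-term memory (LSTM)")` returns `long short term memory lstm`.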
Tokenized TXT
One token per line, ready for LLM input, n-gram pipelines, and statistical modelling. Good for sentiment, reading-level, and other analyses.
social
sciences
humanities
open
contents
lists
available
at
sciencedirect
stock
market
prediction
using
artificial
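The one-token-per-line format described above is easy to produce from standardized text and to consume downstream. The sketch below assumes simple whitespace tokenization, which may differ from the tokenizer actually used here, and the helper names `to_token_lines` and `bigram_counts` are hypothetical. It shows how such a file could feed an n-gram count.

```python
from collections import Counter


def to_token_lines(standardized):
    """Split standardized text on whitespace and emit one token per line."""
    return "\n".join(standardized.split())


def bigram_counts(token_lines):
    """Count adjacent token pairs in a one-token-per-line string."""
    tokens = token_lines.splitlines()
    return Counter(zip(tokens, tokens[1:]))
```

Because each line holds exactly one token, a consumer needs only `splitlines()` to recover the token sequence, with no further parsing.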