PDF Scraping:
We may need to scrape a pdf using Java. This involves parsing tables and other areas of the pdf.
How can we proceed?
We can proceed this way:
1. Converting pdf to html/xml
2. Parsing the converted html/xml using the Jsoup open source Jar (Jsoup-1.7.2).
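As a rough illustration of step 2, the sketch below parses an HTML string with Jsoup and collects the text of each paragraph. The class name HtmlParagraphs, the method name and the choice of the "p" selector are made up for this example. It assumes the jsoup jar is on the classpath and that the converted html keeps its text in <p> elements.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper for this post, not part of Jsoup or PDFBox.
class HtmlParagraphs {

    // Returns the text of every non-empty <p> element, in document order.
    static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<String>();
        // Assumption: the converted html stores its text in <p> elements.
        for (Element p : doc.select("p")) {
            String text = p.text().trim();
            if (!text.isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }
}
```

To read a converted file instead of a string, Jsoup.parse(new File("1.html"), "UTF-8") can be used in place of Jsoup.parse(html).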
Converting pdf to html:
To convert pdf to html we can use the pdfbox jar (open source). The disadvantage of pdfbox is that it does not reproduce tables exactly as they appear in the pdf, and it converts all content into plain <p> tags. The resulting output is also not predictable in some cases.
For more options on Apache pdfbox, refer to:
http://pdfbox.apache.org/commandline/
Command:
java -jar pdfbox-app-1.8.4.jar ExtractText -html 1.pdf 1.html
Ensure that the pdfbox jar and 1.pdf exist in the same directory, and run this command from that directory.
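If the conversion has to be triggered from Java code rather than from a shell, one option is to launch the same command with ProcessBuilder. This is only a sketch: the class and method names are invented here, and it assumes that java is on the PATH and that the jar and the pdf sit in the given working directory.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the pdfbox command line used in this post.
class PdfToHtmlCommand {

    // Builds: java -jar <jar> ExtractText -html <pdf> <html>
    static List<String> build(String jar, String pdf, String html) {
        return Arrays.asList("java", "-jar", jar, "ExtractText", "-html", pdf, html);
    }

    // Runs the command in workingDir and returns the process exit code.
    static int run(File workingDir, String jar, String pdf, String html)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(build(jar, pdf, html));
        pb.directory(workingDir); // assumption: jar and pdf live in this directory
        pb.inheritIO();           // forward pdfbox output and errors to our console
        return pb.start().waitFor();
    }
}
```

A call such as PdfToHtmlCommand.run(new File("."), "pdfbox-app-1.8.4.jar", "1.pdf", "1.html") would then produce 1.html next to the pdf, with a non-zero exit code signalling a failure.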
Converting pdf to xml:
You can convert any pdf to xml using the PDFTextStream jar. It is free to some extent.
You can download the jar and also find sample code here:
http://snowtide.com/downloads
The advantage of the PDFTextStream jar is that it converts any pdf to xml with x, y coordinate values. So it is easy to parse the xml and extract table values using jsoup.
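To show why coordinates help with tables, here is a small sketch that groups text fragments into rows by their y value and orders each row by x. The Fragment class, the TableRows class and the yTolerance parameter are assumptions made for this example. They do not reflect the actual xml layout produced by PDFTextStream, so the step that reads the fragments out of the xml is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one piece of text and its position on the page.
class Fragment {
    final String text;
    final double x;
    final double y;

    Fragment(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical helper that rebuilds table rows from positioned text.
class TableRows {

    // A fragment joins the current row when its y is within yTolerance of the
    // first fragment of that row. Each row is then sorted left to right by x.
    static List<List<String>> toRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<Fragment>(fragments);
        // Assumption: y grows upwards, so a larger y is nearer the top of the page.
        Collections.sort(sorted, new Comparator<Fragment>() {
            public int compare(Fragment a, Fragment b) {
                return Double.compare(b.y, a.y);
            }
        });

        List<List<Fragment>> rows = new ArrayList<List<Fragment>>();
        double rowY = 0;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(rowY - f.y) > yTolerance) {
                rows.add(new ArrayList<Fragment>());
                rowY = f.y; // first fragment of the row sets the reference y
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> result = new ArrayList<List<String>>();
        for (List<Fragment> row : rows) {
            Collections.sort(row, new Comparator<Fragment>() {
                public int compare(Fragment a, Fragment b) {
                    return Double.compare(a.x, b.x);
                }
            });
            List<String> cells = new ArrayList<String>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }
}
```

The tolerance is there because text on the same visual row rarely shares exactly the same y value. A suitable value depends on the font size and line spacing of the pdf, so it has to be tuned per document.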
Thanks for reading this post…!