Extracting the actual in-text title from a PDF
There seem to be a lot of questions about extracting the title of a PDF (using metadata). However, a large share of titles do not seem to exist in the metadata. I found this out when using http://pybrary.net/pypdf/pythondoc-pypdf.pdf.html .
Is there any way to retrieve the in-text title of a PDF? I tried exporting to a text file and searching there, but there is no consistent formatting. Is there a way to export the PDF document with its formatting, and then check for font size >= 14?
This is a fair question. Applications that create PDFs don't seem to make useful use of the available metadata fields.
Take pdflatex as an example: when one sets \title{...}, \author{...} in the preamble, this information is not reflected in the metadata. After a quick search, the solution appears to be to introduce a block in the preamble that is read by pdflatex [1]:
\pdfinfo { /Title (...) /Author (...) ... }
...which is then placed in the relevant metadata fields of the PDF. It is unusual that this is necessary, though.
I cannot speak for word processors such as Word or Writer. One presumes that such metadata fields have to be set manually by the user.
Perhaps a heuristic approach is the only way to tackle the problem if the PDFs were not generated by you. [2] seems similar to what you want, but I guess it depends on how the PDFs were published; the tool seems to be oriented toward scientific papers.
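As a rough illustration of such a heuristic, here is a minimal Python sketch of the font-size idea from the question. It assumes the third-party pdfminer.six package for reading the layout. The function names (guess_title, first_page_spans), the 14 pt threshold, and the 0.5 pt tolerance are illustrative choices and do not come from any tool mentioned here. It treats the largest text on the first page as the title, which will not hold for every document.

```python
def guess_title(spans, min_size=14.0, tolerance=0.5):
    """Guess a title from (text, font_size) pairs given in reading order.

    Hypothetical helper: keeps lines at or above min_size, then joins
    every line set in the largest size found (within tolerance).
    Returns None when no line is large enough.
    """
    candidates = [
        (text.strip(), size)
        for text, size in spans
        if size >= min_size and text.strip()
    ]
    if not candidates:
        return None
    largest = max(size for _, size in candidates)
    parts = [text for text, size in candidates if largest - size < tolerance]
    return " ".join(parts)


def first_page_spans(path):
    """Yield (line_text, largest_char_size) for each text line on page 1.

    Assumes pdfminer.six is installed; imported lazily so guess_title
    can be used on its own.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(path, maxpages=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                if sizes:
                    yield line.get_text(), max(sizes)


# Usage (the file name is a placeholder):
# title = guess_title(first_page_spans("paper.pdf"))
```

Documents whose largest first-page text is a journal banner or a logo caption will fool this, so it is probably best used as a fallback when the metadata title is empty.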
I hope this at least helps.
[1] http://wlug.org.nz/pdflatexnotes
[2] http://www.molspaces.com/d_cb2bib-metadata.php
pdf title extraction