Convert DjVu to PDF in Ubuntu
Converting DJVU to PDF
I want to convert a DJVU document into a PDF document, separating and preserving the text layer and the images while also keeping the structure from the DJVU. How can I do this in Ubuntu?
(I will then be using Calibre to convert to ePub/Mobi, so if there were a Calibre plug-in for this entire process that would be perfect for me!)
Note 1: Printing from Evince, exporting from DjView, or anything using ddjvu are not adequate solutions, as they discard the text layer and save only images.
Note 2: Using DjVuLibre seems to extract only the text layer; pictures are not extracted. Similarly, copying the text "manually" loses both the document structure and the pictures.
Method 1
Simply use DjView and export as PDF:
- Go to Synaptic Package Manager
- Install djview4
- Run DjView (Applications - Graphics - DjView4)
- Open your .djvu document
- Menu - Export As: PDF
Method 2
- Open the djvu file in Evince
- Select Print, then Print to File
- Change .ps to .pdf and click Print
Method 3
- Go to Synaptic Package Manager
- Install djvulibre-bin libdjvulibre21 okular-extra-backends evince libevdocument3 libevview3
Go to a terminal and write
Go to the directory containing the djvu file. Right-click and choose the "Open In Terminal" option. A terminal will open.
In that terminal write
There is also an online DjVu to PDF converter.
Here is one way, which would require some not so common tools:
We can use the djvu2hocr command (from the ocrodjvu package) to extract the hidden text layer from a DjVu file. It does not do any OCR; it just extracts the existing text layer together with its geometry. For example:
djvu2hocr -p 10 sample.djvu | sed 's/ocrx/ocr/g' > pg10.html
The sed step corrects the class names in the hOCR output (which is just a simple HTML file).
Now we extract DjVu page to TIFF format with:
ddjvu -format=tiff -page=10 sample.djvu pg10.tif
so that we end up with these files in our work folder: pg10.html and pg10.tif
This is where pdfbeads comes into play, and we simply execute:
pdfbeads -o pg10.pdf
This nifty program then takes care of everything inside the folder (HTML and TIFF files with the same base name) and produces the output PDF file, along with some by-products. The resulting PDF matches the input DjVu file and has the text layer inside.
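The single-page steps above can be extended to a whole document. The following is only a sketch under a few assumptions: djvused (from djvulibre-bin) is used to get the page count, the function names page_base and djvu_to_pdf are made up for illustration, and pdfbeads is assumed to pick up every image in the current folder when given no file arguments, as described above.

```shell
# Sketch only: assumes djvused, djvu2hocr, ddjvu and pdfbeads are installed.

# Zero-padded base name so that files sort in page order (pg0001, pg0002, ...).
page_base() {
    printf 'pg%04d' "$1"
}

# Hypothetical helper: convert every page of a DjVu file into one PDF.
# Usage: djvu_to_pdf input.djvu output.pdf
djvu_to_pdf() (
    set -eu
    src=$(readlink -f "$1")        # absolute paths, since we cd below
    dst=$(readlink -f "$2")
    work=$(mktemp -d)
    trap 'rm -rf "$work"' EXIT     # remove the scratch folder when done

    pages=$(djvused "$src" -e n)   # "n" prints the number of pages
    cd "$work"
    p=1
    while [ "$p" -le "$pages" ]; do
        base=$(page_base "$p")
        # Text layer with geometry, class names fixed as in the sed step.
        djvu2hocr -p "$p" "$src" | sed 's/ocrx/ocr/g' > "$base.html"
        # Page image under the same base name as the hOCR file.
        ddjvu -format=tiff -page="$p" "$src" "$base.tif"
        p=$((p + 1))
    done
    pdfbeads -o "$dst"
)
```

After pasting the functions into a shell, a call such as djvu_to_pdf sample.djvu sample.pdf would run the whole pipeline. The body runs in a subshell, so the cd and the cleanup trap do not affect the calling shell. Expect large books to take a while and to need a fair amount of temporary disk space for the TIFF files.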
Another, saner but slower, approach is to use a regular OCR GUI tool. gscan2pdf (> 1.0) is suggested as a possible candidate on a Linux PC.
Using DjVuLibre, one can extract the text layer via the terminal command:
djvutxt myfile.djvu > myfile-ocr.txt
or
djvused myfile.djvu -e 'print-pure-txt' > myfile.txt
(both do the same thing, and were found here)
Formatting requires some effort (as many symbols are not converted properly) and pictures are not recovered.
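Before spending effort on formatting, it can help to check which pages actually carry a text layer. A minimal sketch, assuming djvutxt accepts the -page= option and djvused is available for the page count; has_text and pages_without_text are hypothetical helper names, not part of DjVuLibre.

```shell
# Hypothetical helper: succeeds if stdin contains at least one letter or digit.
has_text() {
    grep -q '[[:alnum:]]'
}

# Hypothetical helper: list the pages whose text layer is empty.
# Usage: pages_without_text myfile.djvu
pages_without_text() {
    pages=$(djvused "$1" -e n)     # "n" prints the number of pages
    p=1
    while [ "$p" -le "$pages" ]; do
        djvutxt -page="$p" "$1" | has_text || echo "page $p: no text"
        p=$((p + 1))
    done
}
```

If pages_without_text myfile.djvu lists every page, the file has no usable text layer and only an OCR-based approach will produce searchable output.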
There is djvu2pdf, but it relies on ghostscript, so it might be just another printing option. I still suggest you give it a look, in case it's more clever than I'm giving it credit for.
It's not in the repos but you can download a deb from the makers' site: http://0x2a.at/s/projects/djvu2pdf
** Insert mandatory notice about downloading/installing things from outside the repos here **
The easiest way: use gscan2pdf to import the djvu, then OCR it with tesseract, and finally save it as a pdf. The OCR'd text in the pdf might be slightly different from the original djvu, and the conversion may take a while, but this method is a no-brainer and it works.