This seems a simple enough question. For new books, publishers clearly just use the files they already have on their computers to create a download in the correct format, or send them to a small publishing unit to print a copy from the master files. It’s not much trouble to convert current books, but for ones even a decade old, this may become significantly harder…
Computer technology changes relatively quickly, so the programs used to produce the files used for printing may no longer be readable by more current machines. Now, most publishers probably kept updating and converting files as they went if the author was still producing, but if a file was converted multiple times it’s not unusual for errors to creep in. These are errors that just won’t be caught by the computer program itself; they’ll only be caught by an editor actually reading the document.
For even older materials, there may be no computerized document at all. At that point, it’s just as much trouble to redo the old book as it is to do a new manuscript, likely even more so, since most authors today turn in an already computerized document, not a typed one. This leaves publishers staring at the printed copy and debating whether it’s worthwhile to have someone retype the whole thing.
What about scanning it? This is a bit hit or miss, since most programs for converting images to text are easily deceived by dust, foxing, creases, unusual characters, or even a page that just isn’t sitting perfectly straight in the scanner. There are high-end programs that do a much better job… but you’ll pay for those. Even those aren’t entirely accurate and may turn sections into utter garble.
Just leaving the scanned pages as images sometimes works… but if the page wasn’t pristine or wasn’t scanned straight, a reprint may be filled with print speckles, pages where some of the text is sliced off because the image was crooked, or any number of other visual problems. Trying to clean the scans to fix those can take just as long as, or longer than, retyping the page itself.
All this comes back to the idea that once a book is digitized, SOMEONE must read the manuscript and make sure the words are correct. This is a lot of man-hours for each book. The older the book, the harder it is likely to be, as scans will be less clean and require more careful reading. Once the labor cost is factored in, it becomes more obvious why publishers don’t mine their undigitized backlist more often. There’s a fairly large labor investment which may not pay out. (This is also why some print-on-demand titles are such poor quality: the labor was put into scanning… but not into cleaning and editing. You’re getting a fancy photocopy.)
And this is totally ignoring any issues with copyright or royalties paid to the author! For some marginal titles, even if the publisher could afford the labor cost to bring them back into print and pay royalties, the BOOKKEEPING alone may wipe out any anticipated profit.
So the major hurdle to digitizing older books right now is the labor involved… but YOU, yes you, reading this post, HAVE helped digitize documents without ever realizing it. Ever had to solve a reCAPTCHA? You helped digitize a document (most likely part of the New York Times archives or something drawn from Google Books).
The single-word CAPTCHAs you solve to prove you’re human when posting on the internet aren’t part of that process, but the two-word reCAPTCHA is: one word is already known to the service, and the other is drawn from a scanned document. You only need to get the known word right to prove you’re human. Your answer for the other word is sent back to the reCAPTCHA service to help confirm what it is. Once a certain number of answers agree with each other (for example, four people say the word is “bog” and one says it’s “dog”), the word is marked as confirmed and new words are shown.
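To make the voting idea concrete, here is a minimal Python sketch of how answers for one scanned word might be tallied. This is only an illustration of the agreement rule described above, not reCAPTCHA’s actual code: the function name `confirm_word` and the threshold of four matching answers are assumptions made up for the example.

```python
from collections import Counter


def confirm_word(answers, min_agreeing=4):
    """Return the agreed-upon transcription of a scanned word, or None.

    `answers` is a list of strings typed by different people for the
    same scanned word. `min_agreeing` is a hypothetical threshold: the
    real service only needs "a certain number" of matching answers.
    """
    # Normalize so "Bog " and "bog" count as the same answer.
    cleaned = [a.strip().lower() for a in answers if a.strip()]
    if not cleaned:
        return None

    # Find the most common answer and how many people gave it.
    word, votes = Counter(cleaned).most_common(1)[0]

    # Only confirm the word once enough people agree on it.
    return word if votes >= min_agreeing else None


# Four people say "bog", one says "dog": the word is confirmed as "bog".
print(confirm_word(["bog", "bog", "dog", "bog", "bog"]))
```

With only two or three answers in, the function returns `None`, which corresponds to the word staying in rotation and being shown to more people.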
200 million CAPTCHAs are solved every day… but even at that rate, capturing a few seconds of so many people’s time, it still takes a looooooooooooong time… and an editor still needs to read the final document for grammar and formatting errors. Even if the individual words are right, that doesn’t mean the punctuation and spacing are right, or that diagrams and illustrations are placed correctly. That takes an editor.
There’s certainly software out there that can do a large part of the work, but because there really is no substitute for paying an editor to actually READ the text, converting backstock to ebooks isn’t the easy money that it first appears to be. As programs improve, you’ll start to see publishers mine the backlist for “new” work… but right now it’s often more expensive to do that than to try out a new author!