How I converted Venmurasu Webpages into a series of ebooks

I am a huge fan of the author Jeyamohan. I have long wanted to read his magnum opus, “Venmurasu”, a retelling of the Mahabharata as a novel that spans 26 volumes and 22,000 pages and took more than seven years to write. The book is published as a web app, and anyone can read it for free on this website.

Though the site is comfortable to read and saves the reading history using cookies and localStorage, that wasn’t enough for me. I wanted the book to be available offline, to highlight passages, and to read it on all my devices. So naturally I looked for an ebook version. Since none was readily available, I put my Python skills to work.

I decided to scrape the web app, extract the text, clean it up and arrange it in the proper order. The formatted text would be stored as a text file, which can easily be converted into an ebook using Calibre.

Python has a beautiful HTML parsing library called Beautiful Soup. The script below also uses the requests library to download the pages. Both can be installed by typing

python -m pip install beautifulsoup4 requests

Upon opening the web app in a browser, I was able to work out how the content was organized. The URL format is

https://venmurasu.in/{{bookname}}/chapter-{{no}}

I opened every book manually in the browser and noted down how many chapters it had. Then I created a comma-separated text file in the format

bookname, first_chapter_no, last_chapter_no

For example:
muthalaavin,1,120
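
As a rough sketch of how one such line maps onto chapter URLs, the snippet below parses the list with the standard csv module and expands each entry into its URLs. The helper names `parse_book_list` and `chapter_urls` and the dictionary keys are hypothetical and are not part of the script later in this post; the URL pattern is the one shown above.

```python
import csv
import io

# URL pattern observed on the site: https://venmurasu.in/{bookname}/chapter-{no}
URL_TEMPLATE = "https://venmurasu.in/{}/chapter-{}"


def parse_book_list(text):
    """Turn 'bookname,first,last' lines into a list of dicts (hypothetical helper)."""
    books = []
    # skipinitialspace tolerates 'name, 1, 120' as well as 'name,1,120'
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not any(field.strip() for field in row):
            continue  # ignore blank lines, e.g. a trailing newline at the end of the file
        title, first, last = (field.strip() for field in row)
        books.append({"title": title, "first": int(first), "last": int(last)})
    return books


def chapter_urls(book):
    """List every chapter URL of one book, including the last chapter."""
    return [
        URL_TEMPLATE.format(book["title"], number)
        for number in range(book["first"], book["last"] + 1)
    ]
```

Calling `chapter_urls(parse_book_list("muthalaavin,1,120")[0])` would yield the 120 URLs from chapter-1 to chapter-120 for that entry.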

I now had a text file with 26 book names and their chapter ranges. All that remained was to read this file in Python, get each book's name and chapter range, and construct the URLs for the parser. Here is the code, with comments explaining each step:

import requests
from bs4 import BeautifulSoup

#Name of the file containing information about the books
text_file='venmurasu.txt'

with open(text_file) as file:
    lines=file.readlines()   #Reading all the lines in the file. Returns a list

books=[]
for line in lines:
    if not line.strip():
        continue                #Skipping blank lines, such as a trailing newline at the end of the file
    info=line.strip().split(',')    #Splitting every line into three distinct values
    book={
        'title': info[0].strip(),
        'start_no': int(info[1]),  #Values read from a file are strings, so I am converting that into int here
        'end_no':int(info[2])+1   #Converting to int and adding 1 so that range() includes the last chapter as well
    }

    books.append(book)          #This holds a list of dicts with all information about the books

for book in books:

    contents=[]                 #Empty list to hold the content of the particular book

    print("Parsing Book ", book['title'])

    for i in range(book['start_no'],book['end_no']):

        #Creating the url for the book and the chapter
        URL="https://venmurasu.in/{}/chapter-{}".format(book['title'],i)
        page = requests.get(URL)
        print("Parsing Chapter ", i)

        #Using the BS for parsing the page
        soup = BeautifulSoup(page.content, "html.parser")

        #Finding the content
        results=soup.find_all('p')

        for result in results:
            #Stripping away the markup and keeping only the text
            contents.append(result.text)

    #Once the content is captured, writing that into a file with two line spacing for every paragraph
    with open(book['title']+'.txt', 'w',encoding='utf-8') as f:
        for content in contents:
            f.write(content)
            f.write("\n\n")
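
The loop above makes one request per chapter and stops with an exception if any single download fails. One possible way to make it more forgiving is a small retry wrapper. This is only a sketch: `fetch_with_retry`, its parameters and the `get_page` function in the comments are hypothetical names, not part of the original script.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying up to `attempts` times with a pause in between."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise              # out of retries: let the caller see the error
            time.sleep(delay)      # short pause before trying the same URL again


# Possible use inside the chapter loop (requests is already imported by the script):
#
# def get_page(url):
#     response = requests.get(url, timeout=30)
#     response.raise_for_status()   # turn HTTP errors such as 404 or 500 into exceptions
#     return response
#
# page = fetch_with_retry(get_page, URL)
```

Passing the download function in as an argument keeps the wrapper independent of requests, so it can be tried out with a fake function before pointing it at the real site.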

By running the script I had all the books as text files. I could then convert each of them into an ebook with Calibre. Once I had all the files in EPUB format, I sent them to my Kindle via Send to Kindle.
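
I did the conversion through the Calibre application, but Calibre also ships a command-line tool, `ebook-convert`, so the step could in principle be scripted. The sketch below assumes `ebook-convert` is on the PATH and that the text files sit in the current folder next to venmurasu.txt; the function names and the choice of metadata options are my own assumptions for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(txt_path, author="Jeyamohan"):
    """Build the ebook-convert command for one text file (hypothetical helper)."""
    txt_path = Path(txt_path)
    epub_path = txt_path.with_suffix(".epub")   # muthalaavin.txt -> muthalaavin.epub
    return [
        "ebook-convert", str(txt_path), str(epub_path),
        "--title", txt_path.stem,               # file name without extension as the title
        "--authors", author,
    ]


def convert_all(folder=".", skip=("venmurasu.txt",)):
    """Convert every .txt file in `folder` to EPUB, skipping the book list itself."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed and on PATH?")
    for txt_path in sorted(Path(folder).glob("*.txt")):
        if txt_path.name in skip:
            continue
        subprocess.run(build_convert_command(txt_path), check=True)
```

Running `convert_all()` in the folder with the generated text files would then produce one .epub file per book.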

Now I can read the books on all my devices, highlight, annotate and share my notes.

Note

  1. Eventually the ebook version was released on the Amazon Kindle platform by the author himself. Here is the link.
  2. I am not promoting piracy in this post. This is simply a utility script that I created for my own convenience. I have not shared and will not share the books I created with anyone or anywhere without proper permission.