Softpage... and How it's made.

Softpage... and How it's made.

ยท

6 min read

undraw_code_thinking_1jeh.png

'Learn from the mistakes of others. You can't live long enough to make them all yourself.' - Eleanor Roosevelt

I knew someone said this, just googled the name, giving credits is something you learn in developer community. This what is going to happen for you throughout this post.

It started as announcement from python faculty, make something for 15 marks of innovative assignment, just week before pandemic vacation started in 1st week of March. I wanted to make something that makes checking the assignment task easy for faculty kind of automate it if possible. So less time gets wasted in this and they don't indulge in this kind of silly thing, rather they can focus on contributing to other important stuff/experimenting compare to just announce lab title and get back to assessment.

All I had was just a simple concept, get the file, get similar topic from google and then compare. How to do it? What is NLP? What else will be needed? I had no idea, just got myself started, by reading file.

I thought OCR is good thing, but not for my use case. Pytesserasct is Google's image to text extraction library for python. I tried its not good to read bulk hand written text. It's not there yet to read End-Moment-Engineer's-Attempt. Only God and writer himself can read it smoothly, there are chances of God making error on the way though.๐Ÿ˜œ

OCR not good choice, So '.txt' files worked (why didn't I thought that earlier) and then I found out that NLP exist, came across with libraries like spacy and nltk which are really good and kind of must you should know if you want to do something properly in NLP (Natural Language Processing). This helped me get how to extract necessary part of sentence. I am thinking to add pdf and MS doc support as well.

So now pandemic is real and we all are home, and college discovered that zoom exist. New task everyday still I managed to learn tkinter (GUI library for python), and with web scrapping my Google Comparator or any other not so good though version of this software was ready. It summarized, compared online and locally but was very rough implementation, even I won't trust it but it served the purpose of innovative assignment.

Now when classes were online, we came to know that faculty are using Plagiarism detection tools like turnitin and one factor was also I don't have any proper project to put on GitHub and resume and haven't made something in really long time, also it's 6 semester so placements are close and to test my python and many other minor factors thought that lets make this thing useful and in somewhat a proper project , Can't say it's good but I tried.

So now let's dive into how I made this thing called Softpage .

undraw_Code_review_re_woeb.png

Day 1 of version 2.0 for this, my words for my own code "What is this shit?" I didn't wanted to work on this. My friend told me once "Companies spend lot of money to rewrite old codebase" I realized it now.

So yeah now, lets decide what we need and what we have to make, but be flexible, in middle you can have a better idea of what could be improvement and what could be clutter to your work. Be sure of what project should serve but be flexible on how and what features you provide.

I thought why don't just use something from rapidapi it will work right and less code would look clean. Searched 3 hours of 'Free plagiarism API', 'plagiarism detection API', 'Online plagiarism API free' and few other combinations found many but all were paid. 2 weeks ago I was like why they are charging money there should be something for free at least, Now I realize it, if you want to use my code I will also charge money now, even though it's not good but I have worked on it and it gives some output so yeah why not.

If you want the world value you, "Start by yourself". - anonymous

My first version had 1 file, easy to write, easy to work... for a while, then it's clutter. Hard to find where's function I was working last night. Another keep the names of at least functions on what purpose they serve, it's okay to use x, z, a[] but when it's part of global scope, use meaningful conventions.

Divide Project in small features, features in functions and code those function, think the flow of how you want to provide functionality and code functions serving the task in manageable segments. Once one feature is done, put all the function serving it in one file, name it after the feature. That's what I did in new Softpage . It is good to provide enabling/disabling when some part of code is taking sometime to execute.

Once I used React Native, It had hotloading, nice feature. Just save the code and it'll automatically reflect to your build. I wanted that now in python as well.. there's not a proper way to achieve it. Python is not good for GUI program, it's backend mostly executed by console and Jupyter notebook. I tried one thing is to on close check if file is changed and then re-execute it, in loop if not then exit. I would not recommend this, better look at code twice before execution and then run it.

So I said about searching the free API for Plagiarism, I don't exactly remember how I got this intuition about this way, from Stackoverflow or on a walk I don't remember. But here's how I developed it. Take the text, convert sentence into minimal form like followed, following becomes follow, remove stopwords from nltk corpus, take necessary parts of sentence, make hash of it, store hash for every sentence, now compare hashes. You can achieve time complexity of nlogn from this if code efficiently. My first implementation was executing in 70 seconds but final one was 11 seconds so. Online and Offline Plagiarism both use same algorithm, Online uses wrap of scrapping and preprocessing.

Paraphrasing is real tough thing to implement, few approaches I'll discuss below which did not work but you should know what not to do. Also I should mention there is not free API for this as well. You can use free SEO tools available online but cannot make it part of software. I tried to make use of thesaurus and pyDictionary and applied it on random parts of sentence, did not work though, was slow and output is not satisfactory. Tried above approach along with converting active to passive sounds good but there is no proper implementation that converts active voice sentence into passive sentence. If it exist please comment below. Found one Google Translate API on Pypi, it does not work, only gives result for example, which is also not good enough to use as paraphraser. I have one model ready now in my colab notebook, will find a way to put it in so easy to use locally. Yeah, still haven't added this feature.๐Ÿ˜‚๐Ÿ‘€๐Ÿคฆโ€โ™‚๏ธ

Summarizing uses one rapidapi, ofcource they were giving better output than my implementation and available for free so. Code execution, which was most unreliable part of previous implementation is one that needed very less changes and was actually good. Faced many issues in this as well, how to open command prompt by passing command worked, other ways were not possible. Uses popen for subprocess and passes file name to execute.

Could not find a better way to do it, If exist please comment or any other recommendation you have, please comment or put it in pull request Softpage.

I really had fun building it and learned a lot of things, not a single thing I knew before is used in this project, learned everything first time. Every tech-stack used here.

Have a good day. Happy Coding.๐Ÿ’ป

Minor update, Paraphrasing feature is added now.

Softpage Github Repo