New Beginnings: My Lambda Labs Journey
The organization we worked with is Human Rights First (HRF), an independent advocacy and action organization that challenges America to live up to its ideals. HRF believes American leadership is essential in the global struggle for human rights, so it presses the U.S. government and private companies to respect human rights and the rule of law. When they fail, HRF steps in to demand reform, accountability, and justice. Around the world, it works where it can best harness American influence to secure core freedoms. Susan Rice, former U.S. National Security Advisor, had this to say about the organization: "For more than three decades, Human Rights First has been a clarion voice in defense of human dignity and the rights and freedom of people everywhere."
The problem we were assigned to solve for HRF was to create a web tool, backed by data science, that aggregates data on asylum cases, lets users explore that data, and predicts and visualizes how a judge might rule on a specific asylum case, as well as which elements of a case most influence a favorable or unfavorable ruling. To solve it, we focused on scraping aggregated PDFs as well as individual PDF case files to extract plain text and the name of the judge who made the ruling.
Some of the fears I had going into the project were the typical ones: Will I know what to do? Do I have enough data science skills and knowledge to even begin working on this project, let alone finish it successfully? Will I fully understand what the stakeholder wants and deliver a good product at the end? Will my teammates be easy to work with, and will everyone take responsibility for reaching MVP?
To break down our product roadmap into shippable tasks, we had a team meeting with our project TPL, the web team, and our team, the data science team. We discussed why we were building this product, what we hoped to accomplish, how it would help users, how feasible it was to fulfill the stakeholders' requirements with what we had so far, and a plan for executing the strategy we agreed upon. We discussed the strategic direction we would take to build the product, what each team would be responsible for, and the general order of what we would build. We prioritized the tasks we created by order of importance and came up with a baseline set of functionality that must exist in order to begin fulfilling the expectations set. We used a Trello board to create a product roadmap showing how we would accomplish our goals.
Above is a snippet of a section of our team's Trello board. It shows our Ideas column; Ready for Work, where we move the ideas we are ready to work on; In Progress, the tasks we are currently working on; Blockers, issues we are having with our work and may need help with; Ready for Review, where we move finished tasks for review by other team members; and lastly Deployed, where we move our finished products.
Digging Into the Work
Our team is made up of two groups: the web team and the data science team. As part of the data science team, our first MVP goal was aggregate case file scraping: users upload PDFs containing compilations of links to case files on the web, and those linked case files are then automatically scraped so that each individual case PDF is downloaded and imported. Second, uploaded case file PDFs should be converted to plaintext and associated with the raw PDF in the database, and the process should recognize the judge's name and associate it with the case. As a stretch goal, the uploaded case file PDFs should be scraped for additional relevant structured fields, and records in the database should be searchable and filterable on those fields.
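To make the first step of that pipeline concrete, here is a minimal sketch of pulling case-file links out of an aggregate PDF once its text has been extracted. This is an illustration of the idea rather than our production code; the `find_case_links` helper and the assumption that links end in `.pdf` are mine.

```python
import re

# Match http(s) URLs that point at PDF case files in the extracted text.
# Assumes links end in ".pdf" -- an illustrative simplification.
PDF_LINK_RE = re.compile(r"https?://\S+?\.pdf", re.IGNORECASE)

def find_case_links(text: str) -> list:
    """Return the unique case-file URLs found in aggregate-PDF text."""
    seen, links = set(), []
    for url in PDF_LINK_RE.findall(text):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

# Each URL would then be downloaded (e.g. with requests.get) and the
# resulting PDF stored alongside its plaintext in the database.
```

Deduplicating while preserving order matters here because aggregate documents often cite the same decision more than once.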
A feature we worked on together as a team was debugging an issue with the OCR function created to extract the judge's name from the text. The problem was with one of the OCR dependencies on Heroku; it later turned out we were supposed to use Amazon Web Services (AWS), not Heroku. We had to scrap that work and individually pick up different features on the Trello board while another teammate worked on the AWS API, in order to save time. This happened in the second week of our project, so we were running low on time.
The first feature I worked on individually was a function to extract the judge's name from the PDF files. While working on this, I ran into trouble trying to convert the PDF files to images, extract the text from the images, save the extracted text to a dataframe, and print it. I had wanted to use spaCy named entity recognition to extract the name from the text, but I kept getting an error. Most of the solutions I found while researching the error mentioned editing the policy.xml file, and I could not figure out where to find that file, so I explored another way to build this feature using the same spaCy named entity recognition method.
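To give a sense of what this feature boils down to, here is a simplified, regex-based sketch of pulling a judge's name out of decision text. It is not the spaCy NER approach I pursued, and the header patterns (`JUDGE:` and `Before: ..., Immigration Judge`) are assumptions about how a decision might label the judge; in the real pipeline, spaCy's PERSON entities would do this matching instead.

```python
import re

# Hypothetical header formats a decision might use to name the judge.
# These patterns are illustrative assumptions, not the real file formats.
JUDGE_PATTERNS = [
    re.compile(r"JUDGE[:\s]+([A-Z][a-z]+(?: [A-Z][a-z]+)+)"),
    re.compile(r"Before[:\s]+([A-Z][a-z]+(?: [A-Z][a-z]+)+),\s*Immigration Judge"),
]

def extract_judge(text: str):
    """Return the first judge name matched in decision text, or None."""
    for pattern in JUDGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None
```

A regex like this is brittle against unseen layouts, which is exactly why a trained NER model is the better fit for this feature.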
Another feature I worked on was parsing the PDF documents. One of my teammates was using Optical Character Recognition (OCR) to parse the PDF documents to text, but this was proving trickier to work with than other libraries and techniques for getting text out of PDFs. I had a meeting with our Labs manager, Ryan Herr, and he also advised me to steer clear of the OCR method as much as possible. Through research, I found another tool, pdfminer, which is what I used to convert the PDFs to text. PDFMiner turned out to be pretty simple to use, and I was able to successfully parse all the documents I tested with the function. I was really excited about having code that worked. However, I hit a technical challenge when a teammate later realized the function could not read another PDF document he had that I had not tested; he updated the code to handle different types of PDF documents.
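For anyone curious, here is a minimal sketch of the pdfminer approach, using the high-level API of the `pdfminer.six` package. The file name and the `normalize_text` helper are illustrative assumptions on my part; our actual function handled more document types than this.

```python
from pathlib import Path

def normalize_text(raw: str) -> str:
    """Collapse the ragged whitespace pdfminer leaves in extracted text."""
    return " ".join(raw.split())

def pdf_to_text(path: str) -> str:
    """Extract and normalize the plain text of one case-file PDF."""
    # Imported here so the text helper above works without pdfminer.six.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    if not Path(path).is_file():
        raise FileNotFoundError(path)
    return normalize_text(extract_text(path))

# Example (assumes a decision.pdf exists locally):
# text = pdf_to_text("decision.pdf")
```

Note that pdfminer only works on PDFs with an embedded text layer; scanned image-only PDFs still need OCR, which is part of why both techniques ended up in the project.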
I also contributed to writing the README for our GitHub repository to communicate important information about our project.
Lessons Learned
We were successful in meeting our MVP goals despite all the blockers we experienced, together and on the individual features we worked on. We built an application that uses OCR to scan input court decisions for certain values, such as the name of the judge and the location of the court. These values are then inserted into a database for future use.
The plan for the database is that lawyers and others assisting asylum seekers can search it for information about a judge and the judge's rulings. They can then tailor their arguments before that judge in a way that increases the chances of their client's asylum case being approved. As of now, I do not foresee any technical challenges with this product.
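The store-and-search flow described above can be sketched with Python's standard-library `sqlite3` module. The table name and columns here are illustrative assumptions; the real product's schema and database (hosted on AWS) differ.

```python
import sqlite3

# Illustrative schema: one row per scanned decision.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cases (id INTEGER PRIMARY KEY, judge TEXT, court TEXT, plaintext TEXT)"
)

def add_case(judge: str, court: str, plaintext: str) -> None:
    """Insert the fields extracted from one court decision."""
    conn.execute(
        "INSERT INTO cases (judge, court, plaintext) VALUES (?, ?, ?)",
        (judge, court, plaintext),
    )

def cases_by_judge(judge: str) -> list:
    """Return (court, plaintext) rows for every decision by this judge."""
    rows = conn.execute(
        "SELECT court, plaintext FROM cases WHERE judge = ?", (judge,)
    )
    return rows.fetchall()
```

A query like `cases_by_judge("Jane Doe")` is the kind of lookup a lawyer would run before tailoring arguments to a particular judge.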
This project was a very interesting one, and there were definitely lessons learned working on it. For starters, I think there could have been more communication and better cohesiveness among us on the DS team. I did not receive any feedback from teammates in this group, but I have received some in past teamwork. It was mostly positive, but one area of concern from most of my team members was that I should speak up more often. I totally appreciate feedback, and I plan to continue developing my communication skills. I do communicate about our projects, give updates, and share my opinions and ideas in team meetings, but I guess I need to get better at making small talk as well.
This project as a whole has definitely increased my interest in pursuing a career in data science. I learned a lot working on it, and it also opened my eyes to strengths I need to build upon, as well as growth areas where I should improve my skill set to become a really good data scientist.