Experiences in Judicial Data Mining
DAKSH's Experience with the Rule of Law Project
*Kavya Murthy and Ramya Sridhar Tirumalai
It is quite likely that the numbers of pending cases in the Indian courts (60,000 cases in the Supreme Court, 41 lakh cases in the High Courts and 2.5 crore cases in the subordinate courts) are by now well known to the reader. These numbers are usually cited when discussing the effects of judicial pendency, which is said to affect the quality of the delivery of justice in the country. However, while it is widely agreed that pendency is a severe problem facing the judiciary, these numbers do not reflect the local, specific, or detailed causes of pendency.
The Rule of Law Project at DAKSH
Our work in the Rule of Law Project at DAKSH has been to collect, amalgamate, and analyse judicial data towards a study of the problem of pendency in Indian courts. The resulting database not only brings judicial data together but is also the first publicly available database of judicial data on a single platform. We also study this data for granular analysis of the causes of delay and pendency.
In the process, we faced numerous stumbling blocks. Though we had expected hurdles given the nature of our work, the sheer variety of challenges we faced was unanticipated.
While the High Courts and the subordinate courts in India make data available in the public domain, the databases are separate for each High Court and vary in form and content. For subordinate courts, the form and content of the data are uniform, but the data is stored in separate databases. The first challenge for us has been to bring this multifarious data into a single location. Beginning in January 2015, DAKSH started to collect and analyse data from the High Courts and the subordinate courts. As of March 2016, our database contains information for 21 of 24 High Courts and 417 subordinate courts of approximately 4600 courts.1
- Data Mining for High Courts
We began with the assumption that the daily cause list is a repository of all cases appearing in the High Courts. We started to collate the cause list data for all courts whose cause lists were digitally available.
Challenges in Data Mining in High Courts
Over the last 15 months, as we have been collecting and amalgamating judicial data, we have faced multiple challenges. The problems that we encountered ranged from the quantity of the data that we worked with to its quality. The setbacks that confronted us and have impeded analysis can be broadly summarised as follows:
1 There is no case data available online for three of the 24 High Courts. Our process of data collection for subordinate courts is ongoing and will include as many courts as possible. Data for subordinate courts is available only for 4589 courts of a total of 16400 court rooms said to be in operation in India. The DAKSH database can be accessed at
- Non-availability of Basic Data
The non-availability of a large enough set of scientific data seriously affects our ability to assemble and analyse data across High Courts, thus hampering wide-ranging and significant analysis of pendency. This has been noted by the Law Commission of India in its 245th Report, where the following observation was made: '(The) Lack of complete data was a great handicap in making critical analysis and more meaningful suggestions.'2
The data that is available is varied and problematic to use. These problems are compounded by the fact that many High Court websites do not make available multiple pieces of basic data. For example, out of the first 10 High Courts that we started analysing, only half made data available on 'Date Filed'. This data element refers to the date on which a case was registered in the particular court. Of the five courts that do not make this data available, one has the information but chooses to limit access with a captcha. The date on which a case is filed is arguably the most vital piece of information, as in its absence the exact duration for which a case has been pending cannot be calculated. The said courts do not provide any explanation as to why it is not available.
Captcha and public information
Placing case data behind a captcha is problematic. In general, captchas are used by a system to verify whether the user is a human or not. Case data is public and thus needs to be made available without preventing automated access. Once again, as with the case of lack of standardisation, the non-availability of such basic information gravely affects the prospect and scope of judicial data analysis.
Very few High Courts provide another key data point: details of the statute that a case is registered under. This is another confusing omission since the details of the legislation are mandatorily included in legal processes. Without information on the statute under which a case was filed, detailed subject-wise analysis is impossible.
Yet another gaping void in case data from many courts is the absence of details of past hearings for each case. Only some court websites provide information on the history of the case from the date it enters the said High Court. While a few courts also offer the ability to see data about disposed cases, most do not.
Importance of mapping lifecycle of cases
Benchmarking pendency means quantifying and constructing the life cycle of a case from the date of its institution to the date of its disposal. By building a case's life cycle and studying the orders passed at each stage, a more comprehensive understanding of the operation of the judicial system can be built.
To build and analyse the life cycle of a case, the full date of filing (including the day, month, and year) is crucial. To identify the reasons for delay through the life cycle of each case, the details of each hearing are key to understanding the manner and stages through which cases progress in the system. The order sheets of cases will provide information on the proceedings in each hearing, such as the reasons for which adjournments were sought. Providing access to this information would go a long way towards building the life cycles of cases.
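A minimal sketch of how such a life cycle could be represented and measured is given here. It is only an illustration: the `CaseRecord` structure, its field names, and the sample case are assumptions made for this example and do not describe the DAKSH database schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CaseRecord:
    """Hypothetical minimal case record; field names are illustrative only."""
    case_number: str
    date_filed: Optional[date]            # None when the court does not publish it
    hearing_dates: List[date] = field(default_factory=list)
    date_disposed: Optional[date] = None  # None while the case is still pending


def days_in_system(case: CaseRecord, as_of: date) -> Optional[int]:
    """Days from filing to disposal, or to `as_of` if the case is pending."""
    if case.date_filed is None:
        # Without the date of filing the duration cannot be calculated.
        return None
    end = case.date_disposed or as_of
    return (end - case.date_filed).days


def hearing_gaps(case: CaseRecord) -> List[int]:
    """Days between consecutive hearings, taken in chronological order."""
    ordered = sorted(case.hearing_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


# Invented sample case, used only to exercise the functions.
example = CaseRecord(
    case_number="WP 123/2012",
    date_filed=date(2012, 3, 1),
    hearing_dates=[date(2012, 6, 1), date(2012, 4, 2), date(2013, 1, 15)],
)
print(days_in_system(example, as_of=date(2016, 3, 1)))  # 1461
print(hearing_gaps(example))                            # [60, 228]
```

Returning `None` when the date of filing is missing, rather than guessing, keeps cases with unpublished filing dates visibly separate from cases whose duration is actually known.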
2 Law Commission of India. 2014. 'Arrears and Backlog: Creating Additional Judicial (wo)manpower', p. 16, Report no. 245, Law Commission of India, available online at http://lawcommissionofindia.nic.in/reports/Report245.pdf&embedded=true (accessed on 14 March 2016).
- Lack of Data Standardisation
To simplify, every High Court is an island in the sea of Indian judicial data. The underlying problem is undoubtedly the complete lack of standardisation in the High Court data that is available on the internet. There is significant variation in site layout and navigation, data availability, and data format. This lack of standardisation was puzzling, since all High Courts are constituents of the same judicial system, which logically implies that the manner in which they organise and present data should be similar, if not identical.
That said, one has to bear in mind that this lack of standardisation in High Court data may not affect most users of the system, namely litigants and lawyers, who focus on their own cases and may not be concerned with data from a bigger, analytical perspective.
The first place where ambiguity becomes evident is within the High Court websites. On a particular website, the same data will be displayed differently at different places. To illustrate, we can turn to case type lists. In order to classify and identify cases, the courts themselves have created categories known as case types. Lists of these are available in two places on High Court websites. One list can be found on the case status pages of the websites, where the status of cases currently pending is made available. The other list can be found in the cause list, which is the daily list of cases that will be heard in all court halls of that High Court. The lists of case types on the case status pages and in the cause lists are more often than not different. This difference is bewildering, since it is assumed that case nomenclature within a court would be standard.
There is also a tremendous variation in data availability from court to court. We looked at the data elements that each court makes available on a case and found that there are over 30 distinct obtainable data elements (for example, Combined Case Number, Case Type, Date of Filing, Petitioner and Respondent details). Of these, fewer than a third are found in all High Courts. In addition to current case information, there are some High Courts that provide lower court information and links to orders. This, too, is not a standard feature.
Websites differ from each other even on the most basic information. While both case status pages and cause lists feature on most court websites, four High Courts, namely the High Courts of Jammu and Kashmir, Manipur, Meghalaya, and Sikkim, do not have case status pages. This means that obtaining any current case information in these states is not possible.
Another major fount of judicial information on the websites that differs in form from court to court is the daily cause list. The daily cause lists contain a number of important elements such as the case number, the stage the case is currently in, petitioner and respondent names, the name of the lawyer and the name of the judge. However, the data elements found on cause lists are not uniform across High Courts. For example, the High Court of Delhi provides the name of the case ('X' vs. 'Y'), whereas the High Court of Kerala makes only the case number available.
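The kind of comparison described above can be sketched with plain set operations. The court labels and field inventories below are invented placeholders, not the actual fields published by any High Court; the sketch only shows how one might find the elements common to all courts and the elements each court lacks.

```python
from typing import Dict, Set

# Hypothetical inventory of data elements per court (invented for illustration).
fields_by_court: Dict[str, Set[str]] = {
    "Court A": {"Combined Case Number", "Case Type", "Date of Filing", "Petitioner"},
    "Court B": {"Combined Case Number", "Case Type", "Petitioner", "Respondent"},
    "Court C": {"Combined Case Number", "Case Type", "Stage"},
}


def common_fields(inventory: Dict[str, Set[str]]) -> Set[str]:
    """Data elements that every court makes available."""
    return set.intersection(*inventory.values()) if inventory else set()


def all_fields(inventory: Dict[str, Set[str]]) -> Set[str]:
    """Every distinct data element seen in at least one court."""
    return set().union(*inventory.values())


def missing_by_court(inventory: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each court, the elements that other courts publish but it does not."""
    universe = all_fields(inventory)
    return {court: universe - fields for court, fields in inventory.items()}


print(sorted(common_fields(fields_by_court)))
print(len(all_fields(fields_by_court)))
print(sorted(missing_by_court(fields_by_court)["Court C"]))
```

Only the elements returned by `common_fields` can support a comparison across every court; anything else restricts analysis to the subset of courts that publish it.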
The dearth of standardised and collated data obstructs even rudimentary analysis of case pendency in courts. At the moment, we do not have complete answers to many questions on pendency such as 'what are the kinds of cases pending for the longest/shortest time?', 'how long are cases pending for?' and 'which case types constitute a majority of pending cases?'
To carry out comparative analysis between different kinds of cases, and to identify what kinds of cases are pending in our courts and how long they have been pending, we need to standardise case data. It is only then that we will be able to postulate sustainable solutions for pendency.
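One way to picture such standardisation is a lookup key that maps each court's own case type label to a shared category. The sketch below is an assumption-laden illustration: the court labels, abbreviations, and standard categories are invented, and a real key would have to be compiled with the help of people who know each court's nomenclature.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical key: (court, court-specific label) -> standard case type.
# The lookup is per court because the same short form may mean
# different things in different courts.
CASE_TYPE_KEY: Dict[Tuple[str, str], str] = {
    ("Court A", "WP"): "Writ Petition",
    ("Court A", "CRL.A"): "Criminal Appeal",
    ("Court B", "W.P.(C)"): "Writ Petition",
    ("Court B", "CRA"): "Criminal Appeal",
}


def _normalise(label: str) -> str:
    """Upper-case the label and collapse stray whitespace."""
    return " ".join(label.upper().split())


def standard_case_type(court: str, raw_label: str) -> Optional[str]:
    """Return the standard case type, or None if the label is not in the key."""
    return CASE_TYPE_KEY.get((court, _normalise(raw_label)))


def count_by_standard_type(cases: Iterable[Tuple[str, str]]) -> Counter:
    """Count cases per standard type; unmapped labels are counted as 'UNMAPPED'."""
    return Counter(standard_case_type(c, l) or "UNMAPPED" for c, l in cases)


sample = [("Court A", " wp "), ("Court B", "w.p.(c)"), ("Court B", "CRA"), ("Court B", "XYZ")]
print(count_by_standard_type(sample))
```

Counting unmapped labels explicitly, instead of dropping them, shows how much of the data still lies outside the key.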
- Quality of Information
The quality of a large portion of available High Court data is poor, to say the least. Sizeable parts of this data are rendered unusable because they are riddled with inaccuracies and mistakes. Thorough verification and clean-up is a prerequisite for getting this data into analysable form. Cleaning and standardising the data is an enormously difficult task, due to the massive volume of High Court data as well as the number and variety of errors it contains. Errors contained in the data can be roughly grouped as follows:
- Incorrect spellings: There are a substantial number of incorrect spellings in the data. This is particularly visible in fields such as district name, judge name, and name of the current stage of the case. Manual entry of data is the most probable cause of the huge number of misspellings.
- Wrongly entered information: Many a time we have come across data that should be under one field mistakenly entered under another. For instance, most courts have a field known as 'stage', which indicates the current procedural status of the case. Several times we have found this information in the field that should contain the legislation under which the case is filed. Sometimes it is clear that wrong information has been entered, for instance, if the name of the stage is entered where the name of the judge should appear. Often, however, the fact that the information is in the wrong field is not apparent. This happens when the two fields in question are not very clearly defined, such as stage name and case category.
Dates and numbers are other fields where information can be wrongly entered. Figure 1 below illustrates this problem. This case showed up as the oldest case in our database as per its date of filing, 1 September 1900. However, from the information in the second column ('Type/No/Year/District') it becomes clear that the case was filed in 1990.
Figure 1: Screenshot of wrongly entered date of filing in the High Court of Allahabad
- Incomplete information: Many a time information in a field is incomplete. This problem is particularly rampant in the data fields of judge name and statute name. Since there are a multitude of similar judge and statute names, completing the information by ourselves is not an option. These incomplete case records become irrelevant for analysis.
- Data field specific problems: Certain data fields have problems specific to themselves. For example, many courts have a data field called 'case category' assigned to each case. In some courts, there is a defined case category to carry the name of the statute that the case is filed under, whereas in other courts, categorisation makes no reference to the statute and carries information on subject matter instead.
- Short forms or abbreviations: Whole sections of case-related data are expressed in short forms or abbreviations. Not only do these short forms vary from court to court, there is often no centralised key available to understand this data, which makes it undecipherable for users who do not have a legal background. In addition, due to the sheer volume and variation of data, even those possessing legal knowledge have no way of knowing whether they are interpreting the data correctly. A very good example of this problem is the case types that each court uses to categorise and label cases. Most of these case types are in the form of abbreviations. There is no key provided to make sense of the case type list on each High Court website. In the absence of a key, it is unfeasible for anyone other than a local lawyer to understand the case types.
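Some of the error categories above lend themselves to simple automated checks. The following is a sketch under stated assumptions, not a description of how DAKSH cleaned its data: the function names, the list of known district names, the similarity cutoff, and the sample case reference are all hypothetical. The first check suggests a likely correct spelling from a known list; the second flags a date of filing whose year disagrees with the year embedded in a 'Type/No/Year/District' style reference.

```python
import difflib
from datetime import date
from typing import List, Optional


def closest_known_name(raw: str, known_names: List[str], cutoff: float = 0.85) -> Optional[str]:
    """Suggest the closest known spelling, or None if nothing is similar enough."""
    candidate = raw.strip().title()
    matches = difflib.get_close_matches(candidate, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def year_from_case_reference(reference: str) -> Optional[int]:
    """Read the year from a 'Type/No/Year/District' reference (assumed layout)."""
    parts = [p.strip() for p in reference.split("/")]
    if len(parts) >= 3 and len(parts[2]) == 4 and parts[2].isdigit():
        return int(parts[2])
    return None


def filing_year_mismatch(reference: str, date_filed: date) -> bool:
    """True when the reference carries a year that differs from the filing year."""
    year = year_from_case_reference(reference)
    return year is not None and year != date_filed.year


districts = ["Allahabad", "Lucknow", "Varanasi"]   # illustrative list only
print(closest_known_name("alahabad", districts))   # Allahabad
print(closest_known_name("xyz", districts))        # None

# Invented reference mirroring the 1900 versus 1990 problem described above.
print(filing_year_mismatch("WRIT/1234/1990/ALLAHABAD", date(1900, 9, 1)))  # True
```

Both checks only flag records for review. Given how many judge and statute names resemble one another, a suggested match should be confirmed by a person rather than applied automatically.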
- Conclusion
To date, the DAKSH database contains details of nearly 18 lakh cases and 59 lakh hearings from 21 High Courts. Using the information we have collected from the High Court websites, we have arrived at a range of statistics such as the average pendency for different kinds of cases, the number of days between hearings, and the average days to disposal. We have cleaned up, standardised, and added depth to the data so as to enable more focused analysis. The DAKSH database is a suitable starting point for facilitating change in the system.
The lack of standardised, high-quality, error-free data hinders overall analysis of the judicial system, since it becomes nearly impossible to collate information. In essence, comparative analysis cannot be carried out without comparable elements.
The notion that in the sea of judicial data, each High Court is an island is one that needs to go. It is only when the judicial system is viewed as a whole that optimal data management, efficiency, and productivity can be achieved.
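Statistics of this kind reduce to simple date arithmetic once the data is clean. The sketch below illustrates two of them, average days to disposal and average pendency per case type. The record layout, function names, and sample records are assumptions made for this example; they are not drawn from the DAKSH database, and the averages actually reported may be defined differently.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical record: (case type, date filed, date disposed or None if pending).
Record = Tuple[str, date, Optional[date]]


def average_days_to_disposal(records: Iterable[Record]) -> Dict[str, float]:
    """Mean days from filing to disposal per case type, disposed cases only."""
    durations: Dict[str, List[int]] = defaultdict(list)
    for case_type, filed, disposed in records:
        if disposed is not None:
            durations[case_type].append((disposed - filed).days)
    return {case_type: mean(days) for case_type, days in durations.items()}


def average_pendency(records: Iterable[Record], as_of: date) -> Dict[str, float]:
    """Mean days pending per case type as of a given date, pending cases only."""
    durations: Dict[str, List[int]] = defaultdict(list)
    for case_type, filed, disposed in records:
        if disposed is None:
            durations[case_type].append((as_of - filed).days)
    return {case_type: mean(days) for case_type, days in durations.items()}


# Invented sample records.
records: List[Record] = [
    ("Writ Petition", date(2014, 1, 1), date(2014, 1, 31)),
    ("Writ Petition", date(2014, 1, 1), date(2014, 3, 2)),
    ("Writ Petition", date(2015, 1, 1), None),
    ("Criminal Appeal", date(2015, 1, 1), None),
]
print(average_days_to_disposal(records))
print(average_pendency(records, as_of=date(2015, 4, 11)))
```

Disposed and pending cases are averaged separately because a pending case has no end date yet; mixing the two would understate how long cases take.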
*Kavya and Ramya are associates at Daksh Society, Bangalore, and have extensive experience in judicial data mining.