Stackalytics

Mission
The Stackalytics project is on a mission to provide transparent and meaningful statistics regarding contributions to both OpenStack itself and projects related to OpenStack. But what does "transparent and meaningful" mean? Transparency is important so that the community can have confidence that all calculations are correct and fair. So "transparent" means that anyone can double check the methods of calculation Stackalytics uses. Meanwhile, results must be meaningful to be useful. "Meaningful" means that anyone may submit a correction that will adjust the influence of appropriate statistical data. For example, auto-generated code, mass renaming, automatic refactoring, auto-generated config files, and so on can artificially inflate various statistics. Stackalytics makes it possible to avoid these problems as they're discovered.

Description
Stackalytics is a service that collects and processes development activity data, such as commits, lines of code changed, code reviews, and blueprints, and presents it in a convenient web dashboard. The Stackalytics dashboard makes it possible to view data by project, company, contributor, and other factors.

The primary data sources for Stackalytics are the OpenStack Git repositories and the Gerrit review history.

Git commits history
Stackalytics processes three major metrics for OpenStack contribution:
 * Number of commits
 * Number of modified files
 * Number of modified lines

Basic code related statistics are retrieved from the output of the following command:

git log --pretty="commit_id:'%H%ndate:%at%nauthor:%an%nauthor_email:%ae%nsubject:%s%nmessage:%b%n'" --shortstat -M --no-merges

The output from this command looks something like this:

commit_id:b5a416ac344160512f95751ae16e6612aefd4a57
date:1369119386
author:Akihiro MOTOKI
author_email:motoki@da.jp.nec.com
subject:Remove class-based import in the code repo
message:Fixes bug 1167901
This commit also removes backslashes for line break.
Change-Id: Id26fdfd2af4862652d7270aec132d40662efeb96
diff_stat:
 21 files changed, 340 insertions(+), 408 deletions(-)

This commit changes 21 files and 340 + 408 = 748 lines of code (LOC). That is, LOC is the sum of insertions and deletions.
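
As a rough illustration of the arithmetic above, the following Python sketch pulls the file count and LOC out of a --shortstat summary line. The regular expression and the function name parse_shortstat are assumptions made for this example, not code taken from Stackalytics.

```python
import re

# Matches a `git log --shortstat` summary line. The insertions or deletions
# part may be absent when a commit only adds or only removes lines.
SHORTSTAT_RE = re.compile(
    r"(?P<files>\d+) files? changed"
    r"(?:, (?P<ins>\d+) insertions?\(\+\))?"
    r"(?:, (?P<dels>\d+) deletions?\(-\))?"
)


def parse_shortstat(line):
    """Return (files_changed, loc) for one shortstat line, or None."""
    match = SHORTSTAT_RE.search(line)
    if match is None:
        return None
    files = int(match.group("files"))
    insertions = int(match.group("ins") or 0)
    deletions = int(match.group("dels") or 0)
    # LOC is the sum of insertions and deletions.
    return files, insertions + deletions


print(parse_shortstat(" 21 files changed, 340 insertions(+), 408 deletions(-)"))
# prints (21, 748)
```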

Company affiliation
Company affiliation for each commit author is determined according to the following rules:


 * First Stackalytics checks the domain of author email. If the domain is in the Stackalytics configuration file (default_data.json), then the affiliation for the commit is determined based on the email address.
 * If the email domain doesn't provide enough information, Stackalytics retrieves the author profile from LaunchPad using the email address. If LaunchPad does not identify the author, then the commit is affiliated with *independent, unless the next step provides enough information to determine the author's affiliation.
 * Next the LaunchPad ID is used as a primary key for further author identification. Stackalytics stores profiles for known contributors in the same configuration file (default_data.json). This profile has a historical list of contributor affiliations. For example:

{
    "launchpad_id": "boris-42",
    "companies": [
        {
            "company_name": "*independent",
            "end_date": "2013-Apr-10"
        },
        {
            "company_name": "Mirantis",
            "end_date": null
        }
    ],
    "user_name": "Boris Pavlovic",
    "emails": ["boris@pavlovic.me"]
},

As shown above, each entry in the companies list has a company_name and an end_date. This information is enough to determine the affiliation for any given commit based on its date, so that an individual who has worked for more than one company can have their commits properly attributed.

(Note: The optional "gerrit_id" has been added for situations in which a user's launchpad_id and gerrit_id don't match; this parameter name will be changed, probably to "alternate_names", in a future version of Stackalytics to allow for better flexibility.)


 * Finally, if all checks above fail, the commit is affiliated with *independent.
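
The rules above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: DOMAIN_TO_COMPANY and PROFILES_BY_EMAIL are hypothetical stand-ins for the data in default_data.json, the LaunchPad lookup is left out, and treating end_date as an exclusive UTC boundary is a guess rather than documented behaviour.

```python
from datetime import datetime, timezone

# Hypothetical stand-ins for the data kept in default_data.json.
DOMAIN_TO_COMPANY = {"mirantis.com": "Mirantis"}
PROFILES_BY_EMAIL = {
    "boris@pavlovic.me": [
        {"company_name": "*independent", "end_date": "2013-Apr-10"},
        {"company_name": "Mirantis", "end_date": None},
    ],
}


def company_at(companies, commit_ts):
    """Pick the first affiliation whose end_date lies after the commit."""
    for entry in companies:
        if entry["end_date"] is None:  # open-ended, current affiliation
            return entry["company_name"]
        end = datetime.strptime(entry["end_date"], "%Y-%b-%d")
        # Assumption: end_date is midnight UTC and is exclusive.
        if commit_ts < end.replace(tzinfo=timezone.utc).timestamp():
            return entry["company_name"]
    return "*independent"


def affiliation(email, commit_ts):
    """Resolve a company name for an author email and commit timestamp."""
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in DOMAIN_TO_COMPANY:  # rule 1: known email domain
        return DOMAIN_TO_COMPANY[domain]
    # Rules 2 and 3: the LaunchPad lookup is omitted in this sketch;
    # the stored profile is found directly by email instead.
    companies = PROFILES_BY_EMAIL.get(email.lower())
    if companies:
        return company_at(companies, commit_ts)
    return "*independent"  # rule 4: nothing matched


print(affiliation("boris@pavlovic.me", 1369119386))
# prints Mirantis
```

With the sample profile, a commit dated before 2013-Apr-10 resolves to *independent, while the May 2013 timestamp used above resolves to Mirantis.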

Note: if you are updating your profile please keep old emails if you have any contribution associated with them.

Gerrit reviews history
Gerrit provides a command line interface for retrieval of review source data. Authorized users can connect to review.openstack.org via ssh and execute the following command:

gerrit query --all-approvals --patch-sets --format JSON module branch:master limit:100

This command outputs a list of the latest reviews on the module. Stackalytics parses this information and stores it in run-time storage. The web front end polls run-time storage for any new records and changes in existing records, and retrieves data into memory for real-time processing.

Stackalytics provides the following analytics for reviews:
 * Number of reviews
 * Statistics of positive and negative reviews
 * Ratio of positive to negative reviews.
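
A minimal Python sketch of how these three numbers could be derived from the query output is shown below. It assumes the output contains one JSON record per line, that votes sit under patchSets and approvals with a string value, and that a trailing record of type "stats" closes the list; the sample data and the review_stats function are invented for illustration and are not the Stackalytics implementation.

```python
import json


def review_stats(query_output, vote_type="Code-Review"):
    """Count positive and negative votes in line-delimited JSON records."""
    positive = negative = 0
    for line in query_output.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("type") == "stats":  # summary row, not a change
            continue
        for patch_set in record.get("patchSets", []):
            for approval in patch_set.get("approvals", []):
                if approval.get("type") != vote_type:
                    continue
                value = int(approval["value"])
                if value > 0:
                    positive += 1
                elif value < 0:
                    negative += 1
    return {
        "reviews": positive + negative,
        "positive": positive,
        "negative": negative,
        # The ratio is undefined when there are no negative reviews.
        "ratio": positive / negative if negative else None,
    }


# Invented sample data in the assumed layout.
SAMPLE = "\n".join([
    json.dumps({"id": "I1", "patchSets": [{"number": "1", "approvals": [
        {"type": "Code-Review", "value": "2"},
        {"type": "Code-Review", "value": "-1"},
        {"type": "Verified", "value": "1"}]}]}),
    json.dumps({"id": "I2", "patchSets": [{"number": "1", "approvals": [
        {"type": "Code-Review", "value": "1"}]}]}),
    json.dumps({"type": "stats", "rowCount": 2}),
])

print(review_stats(SAMPLE))
# prints {'reviews': 3, 'positive': 2, 'negative': 1, 'ratio': 2.0}
```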

Mail lists activity
Stackalytics polls mailing list activity via the web-based OpenStack-dev archive. In the Stackalytics dashboard, mailing list activity is presented in the same way as commits and reviews. Stackalytics searches the following places to choose the OpenStack module to which an email is related, in order:


 * Module name in brackets in email subject
 * HTTP links to blueprints or bugs in email body
 * Module name without brackets in email subject

If none of the above locations yields a known module, the email is attributed to the 'unknown' module.
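
The search order described above could look roughly like this in Python. The module list, the regular expressions, and the LaunchPad URL shapes matched here are assumptions chosen for the example; the actual parser may recognise other forms.

```python
import re

KNOWN_MODULES = {"nova", "neutron", "cinder"}  # hypothetical module list

BRACKET_RE = re.compile(r"\[([\w-]+)\]")
# Assumed link shapes: blueprints.launchpad.net/<module>/... and
# bugs.launchpad.net/<module>/...
LINK_RE = re.compile(r"https?://(?:blueprints|bugs)\.launchpad\.net/([\w-]+)")
WORD_RE = re.compile(r"[\w-]+")


def first_known(candidates):
    """Return the first candidate that is a known module, else None."""
    for name in candidates:
        if name.lower() in KNOWN_MODULES:
            return name.lower()
    return None


def detect_module(subject, body):
    """Apply the three lookups in order and fall back to 'unknown'."""
    # 1. Module name in brackets in the subject.
    module = first_known(BRACKET_RE.findall(subject))
    if module is None:
        # 2. Links to blueprints or bugs in the body.
        module = first_known(LINK_RE.findall(body))
    if module is None:
        # 3. Module name without brackets in the subject.
        bare_subject = BRACKET_RE.sub(" ", subject)
        module = first_known(WORD_RE.findall(bare_subject))
    return module or "unknown"


print(detect_module("[openstack-dev] [nova] Scheduler question", ""))
# prints nova
```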

The list of tracked web pages with mail archives is managed in default_data.json under the section marked 'mail_lists'.

Activity on Blueprints and Bugs
LaunchPad provides an API for tracking blueprints and bugs. Stackalytics polls LaunchPad and provides a standard drill-down analytic over blueprint and bug activity. There are four metrics in the Stackalytics dashboard that describe blueprints and bugs: 'Completed Blueprints', 'Drafted Blueprints', 'Bugs Filed' and 'Bugs Resolved'. 'Completed Blueprints' shows blueprints that have been completed and lists the person who made the effort to deliver each blueprint into upstream code, while 'Drafted Blueprints' shows all blueprints in LaunchPad and lists the original author of each blueprint.

Besides this standard analytic, there are two additional reports. The first shows the activity history for a particular blueprint (available via link on blueprint name), and the second provides a reference analytic over blueprints for a particular module (Blueprint popularity).

Commit metrics corrections and a common sense approach
The recent history of contributions to OpenStack shows that the lines of code (LOC) metric is not reliable due to commits that were not representative, such as code auto-generation or automatic code refactoring. The most well known examples are:


 * Rename Quantum to Neutron Change-Id: Ib86e068aa8e4f48993809b6b25444407b7c1f17e
 * Updated translations from Transifex Change-Id: I4810c45d15413bdf21b9f68f59096c907bb1e624

Stackalytics provides a framework for a community-driven correction process. It works like this. Corrections are stored in the corrections.json JSON file in the Stackalytics repo. These corrections look something like this:

{
    "corrections": [
        {
            "commit_id": "ee3fe4e836ca1c81e50a8324a9b5f982de4fa97f",
            "correction_comment": "Reset LOC to 0",
            "lines_added": 0,
            "lines_deleted": 0
        }
    ]
}

The structure of these records is self-descriptive, and any OpenStack contributor can file a bug and provide a patchset for this file in order to apply a particular correction. This patchset goes through the standard review process and as soon as it merges into the upstream project, the changes are immediately visible in the Stackalytics data. Note that this process is driven by the community and should not be used for improper manipulation of statistics. Corrected commits are marked with comments in RED in the web dashboard and are fully transparent, should anyone else wish to make further challenges.
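
To make the effect of a correction record concrete, here is a small Python sketch that overlays corrections onto commit records. The shape of the commit dictionary (commit_id, lines_added, lines_deleted, loc) and the function names are assumptions for illustration; only the corrections.json fields come from the example above.

```python
import json

CORRECTIONS_JSON = """
{
    "corrections": [
        {
            "commit_id": "ee3fe4e836ca1c81e50a8324a9b5f982de4fa97f",
            "correction_comment": "Reset LOC to 0",
            "lines_added": 0,
            "lines_deleted": 0
        }
    ]
}
"""


def load_corrections(text):
    """Index correction records by commit_id."""
    return {c["commit_id"]: c for c in json.loads(text)["corrections"]}


def apply_correction(commit, corrections):
    """Return a copy of the commit with any matching correction applied."""
    correction = corrections.get(commit["commit_id"])
    if correction is None:
        return commit
    corrected = dict(commit)
    for field in ("lines_added", "lines_deleted"):
        if field in correction:
            corrected[field] = correction[field]
    corrected["loc"] = corrected["lines_added"] + corrected["lines_deleted"]
    # Keep the comment so a dashboard can flag the commit as corrected.
    corrected["correction_comment"] = correction.get("correction_comment", "")
    return corrected


corrections = load_corrections(CORRECTIONS_JSON)
commit = {  # hypothetical commit record with made-up line counts
    "commit_id": "ee3fe4e836ca1c81e50a8324a9b5f982de4fa97f",
    "lines_added": 12000,
    "lines_deleted": 11000,
    "loc": 23000,
}
print(apply_correction(commit, corrections)["loc"])
# prints 0
```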

This framework was designed in order to make statistical data more reliable and representative. The following common sense approach should be used:
 * Commits that contain auto-generated files should be adjusted in order to represent the amount of effort actually produced by the contributor, not including generated output.
 * Commits that contain the result of automatic code refactoring should be adjusted accordingly.
 * Commits that are the result of improperly renamed files (shell rename instead of git rename) should be zeroed.
 * Commits with binary and 3rd party files should be adjusted accordingly.

The default correction file contains manually filtered commits for the last two release cycles (Grizzly and Havana) with LOC higher than 3000.

Tracked projects and classification
Stackalytics is able to track any project that uses the standard OpenStack development infrastructure of Git, Gerrit, and Launchpad. At the highest level, all projects are divided into two major groups: OpenStack (official projects listed in governance's projects.yaml) and OpenStack Others (hosted in openstack, but not listed officially). This classification is determined by the organization account in Git (http://git.openstack.org/cgit). Stackalytics stores a list of projects in its main config file. Any OpenStack contributor can file a bug and provide a patchset for the addition of an untracked project. The repos section represents the list of tracked projects. It has the following format:

"repos": [
    {
        "module": "nova",
        "uri": "git://github.com/openstack/nova.git",
        "releases": [
            {
                "release_name": "Essex",
                "tag_from": "2011.3",
                "tag_to": "2012.1"
            },
            {
                "release_name": "Folsom",
                "branch": ["stable/folsom"],
                "tag_from": "2012.1",
                "tag_to": "2012.2"
            },
            {
                "release_name": "Grizzly",
                "branch": ["stable/grizzly"],
                "tag_from": "2012.2",
                "tag_to": "2013.1"
            },
            {
                "release_name": "Havana",
                "tag_from": "2013.1",
                "tag_to": "HEAD"
            }
        ]
    }
]

It is essential to provide git tags or the commit_id for known release cycles. Otherwise Stackalytics will attribute commits to release cycles according to the commit date. The end dates of release cycles are specified in the "releases" section of the Stackalytics configuration file default_data.json. By default Stackalytics tracks the "master" branch, but it is possible to specify the name of the stable branch for each particular release. OpenStack projects are classified according to the official programs list at https://git.openstack.org/cgit/openstack/governance/plain/reference/programs.yaml. For OpenStack projects the following groups were identified:
 * integrated - has property integrated-since in https://git.openstack.org/cgit/openstack/governance/plain/reference/programs.yaml
 * incubated - has property incubated-since
 * documentation - listed in section Documentation
 * infra - projects such as zuul and the jenkins-job-builder that are necessary for the overall project infrastructure, but are not technically part of OpenStack itself. Listed in section Infrastructure.
 * other - projects that are hosted in the OpenStack GitHub repo and are not classified by the rules above.
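
The grouping rules in the list above can be expressed as a short Python function. The input dictionary is a hypothetical, flattened view of one project entry (the real programs.yaml layout is not reproduced here), and checking the rules in the listed order is an assumption of this sketch.

```python
def classify(project):
    """Map one project entry to a group, checking rules in listed order."""
    if "integrated-since" in project:
        return "integrated"
    if "incubated-since" in project:
        return "incubated"
    # "section" is a hypothetical key naming the section the project is in.
    if project.get("section") == "Documentation":
        return "documentation"
    if project.get("section") == "Infrastructure":
        return "infra"
    return "other"


# Hypothetical entries; the release names are illustrative values only.
print(classify({"name": "nova", "integrated-since": "austin"}))
# prints integrated
print(classify({"name": "zuul", "section": "Infrastructure"}))
# prints infra
```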

Stackalytics is able to configure most OpenStack projects automatically. For this purpose the Stackalytics configuration file default_data.json contains the "project_sources" section. It is a list of Git repos with attributes required for project classification. For example, the record:

{
    "organization": "openstack",
    "exclude": ["openstack", "gantt", "python-ganttclient"]
},

says that all projects in the openstack Git organization, except those listed under "exclude", will be attributed to "project_type": "openstack". Release cycle boundaries for automatically imported projects are determined by commit dates. By default auto-imported projects track only the "master" branch.
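
Date-based attribution of commits to release cycles can be sketched in Python as below. The RELEASES list is a hypothetical stand-in for the "releases" section of default_data.json, with illustrative end dates in the same date format used for end_date in contributor profiles; treating the end date as inclusive at midnight UTC is an assumption.

```python
from bisect import bisect_left
from datetime import datetime, timezone

# Hypothetical release list, ordered by end date (illustrative values).
RELEASES = [("Grizzly", "2013-Apr-04"), ("Havana", "2013-Oct-17")]


def to_timestamp(date_str):
    """Convert a date such as 2013-Apr-04 to a UTC Unix timestamp."""
    parsed = datetime.strptime(date_str, "%Y-%b-%d")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


END_TIMESTAMPS = [to_timestamp(end) for _, end in RELEASES]


def release_for(commit_ts):
    """Return the first release whose end date is not before the commit."""
    index = bisect_left(END_TIMESTAMPS, commit_ts)
    if index == len(RELEASES):
        return None  # commit is newer than every configured release
    return RELEASES[index][0]


print(release_for(1369119386))
# prints Havana
```

Keeping the end timestamps in a sorted list lets each lookup use a binary search, which matters little for a handful of releases but keeps the function simple.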

Stackalytics provides a mechanism to group several projects under one line item in the modules drop-down. These groups can be configured manually in the module_groups section. For all official programs, the corresponding module groups are created automatically based on the official program list.

OpenStack foundation members
Stackalytics polls the openstack.org member directory for new registrations and provides a drill-down report on this data. Users can get statistics on new members and companies that joined within the last week, month, or quarter. Member affiliation is determined by heuristics based on the user profile at openstack.org.

DriverLog
Stackalytics provides a report on external CI test run statuses based on data provided by the DriverLog API.

Release Notes
Release 0.5/0.6
 * Implemented module classification based on programs.yaml with retrospective integrated/incubated attribution
 * Fixed performance and memory consumption issues
 * Added support for co-authored commits
 * Added metrics on filed and resolved bugs
 * Added drill-down report on OpenStack foundation members
 * Fixed misc bugs

Release 0.4
 * Added review stats report that shows top reviewers with breakdown by marks and disagreement ratio against core's decision
 * Added open reviews report that shows top longest reviews and whole backlog summary
 * Added activity report with engineer's activity log and punch-card of usual online hours (in UTC). The same report is available for companies
 * Fixed review stats calculation, now Approve marks are counted separately
 * Fixed commit date calculation, now it is date of merge, not commit
 * Minor improvements in filter selectors
 * Incorporated 21 updates to user and company profiles in default data

Release 0.3
 * Added polling of mailing list and standard analytic in dashboard for this data source.
 * Added analytics over blueprints. Implemented report on blueprints popularity. It shows how many times blueprint was mentioned in emails or commit messages.
 * Added report on Top Mentors. It shows statistics of reviews on new contributors' patches.
 * Added documentation on API.
 * Implemented project grouping. Now several projects can be grouped into one meta-module.
 * Implemented tracking of stable branches for older releases.
 * Improved Bug ID parser.
 * Implemented fail-over handling for GitHub.
 * Fixed some bugs and added affiliations for a bunch of people. Added corrections of statistics for Havana release cycle.

Release 0.2
 * Changed internal architecture. Got rid of persistent storage in Mongo. Configuration file default_data.json now serves as persistent storage.
 * Added polling of Gerrit for retrieval of review related source data.
 * Implemented basic statistics for reviews: number of reviews over time, number of negative/positive reviews, ratio of positive and negative reviews.
 * Redesigned navigation and layout of statistics pages.
 * Implemented auto-assignment of commits to release cycles based on dates.
 * Implemented automated retrieval of projects list from GitHub API.
 * Cleanup of default_data.json (reduced from 14k LOC to 4k)
 * Added to corrections.json all commits which are over 3k LOC and look auto-generated.
 * Implemented feature request Sort bugs ID as numbers, not as strings
 * Fixed bunch deletion from memcached

Release 0.1
 * Changed internal architecture. Main features: advanced real time processing and horizontal scalability.
 * Got rid of all 3rd party non-Apache libraries and published sources on StackForge under Apache2 license.
 * Improved release cycle tracking by means of Git tags instead of approximate date periods.
 * Changed projects classification to two-level structure. OpenStack (core, incubator, documentation, other) and StackForge.
 * Implemented correction mechanism that allows users to tweak metrics for particular commits.
 * Added a bunch of new projects (Tempest, documentation, Puppet recipes).
 * Added company affiliated contribution breakdown on users profile page.

How To's
Stackalytics/HowToRun - how to install Stackalytics and run it in dev or prod environments

Source
https://github.com/openstack/stackalytics

Pending Code Reviews
https://review.openstack.org/#q,status:open+stackalytics,n,z

Project space
https://launchpad.net/stackalytics

Blueprints
https://blueprints.launchpad.net/stackalytics

Bugs
https://bugs.launchpad.net/stackalytics

API docs
http://stackalytics.readthedocs.org/en/latest/

Web-site
http://stackalytics.com/