Difference between revisions of "Stackalytics"

Revision as of 13:06, 14 August 2013

Mission

The Stackalytics project is on a mission to provide transparent and meaningful statistics regarding contributions to both OpenStack itself and projects related to OpenStack. But what does "transparent and meaningful" mean? Transparency is important so that the community can have confidence that all calculations are correct and fair, so "transparent" means that anyone can double-check the methods of calculation Stackalytics uses. Meanwhile, results must be meaningful to be useful, so "meaningful" means that anyone who discovers a problem with a method for calculating statistics may submit a correction that adjusts the influence of the affected data. For example, auto-generated code, mass renaming, automatic refactoring, auto-generated config files, and so on can artificially inflate various statistics. Stackalytics makes it possible to correct these problems as they're discovered.

Description

Stackalytics is a service that collects and processes development activity data such as commits, lines of code changed, and code reviews, and makes it possible to visualize them in a convenient web dashboard. The Stackalytics dashboard makes it possible to view data by project, company, contributor, and other factors.

The primary data sources for Stackalytics are the OpenStack Git repositories and the Gerrit review history.

Git commits history

Stackalytics processes four major metrics for OpenStack contribution:

  • Number of commits
  • Number of modified files
  • Number of modified lines
  • Number of reviews and statistics of positive and negative reviews

Basic code related statistics are retrieved from the output of the following command:

git log --pretty="commit_id:%H%ndate:%at%nauthor:%an%nauthor_email:%ae%nsubject:%s%nmessage:%b%ndiff_stat:" --shortstat -M --no-merges

The output from this command looks something like this:

commit_id:b5a416ac344160512f95751ae16e6612aefd4a57
date:1369119386
author:Akihiro MOTOKI
author_email:motoki@da.jp.nec.com
subject:Remove class-based import in the code repo
message:Fixes bug 1167901
This commit also removes backslashes for line break.
Change-Id: Id26fdfd2af4862652d7270aec132d40662efeb96
diff_stat:
21 files changed, 340 insertions(+), 408 deletions(-)

This commit changes 21 files and 340 + 408 = 748 lines of code (LOC). In other words, LOC is the sum of insertions and deletions.
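The LOC arithmetic above can be sketched in Python. This is a minimal illustration, not the parser Stackalytics itself uses; the function name parse_shortstat and the regular expression are assumptions made for this example. It also covers the case where git omits the insertions or deletions part because the count is zero.

```python
import re

# Matches the summary line printed by `git log --shortstat`.
# Git drops the "insertions" or "deletions" part when its count is zero,
# so both groups are optional.
SHORTSTAT_RE = re.compile(
    r"(?P<files>\d+) files? changed"
    r"(?:, (?P<added>\d+) insertions?\(\+\))?"
    r"(?:, (?P<deleted>\d+) deletions?\(-\))?"
)


def parse_shortstat(line):
    """Return (files_changed, lines_added, lines_deleted, loc) for one summary line.

    Hypothetical helper for illustration only.
    """
    match = SHORTSTAT_RE.search(line)
    if match is None:
        raise ValueError("not a --shortstat summary line: %r" % line)
    files = int(match.group("files"))
    added = int(match.group("added") or 0)
    deleted = int(match.group("deleted") or 0)
    # LOC is defined as the sum of insertions and deletions.
    return files, added, deleted, added + deleted


print(parse_shortstat("21 files changed, 340 insertions(+), 408 deletions(-)"))
# prints (21, 340, 408, 748)
```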

Company affiliation for each commit author is determined according to the following rules:

  • First, Stackalytics checks the domain of the author's email address. If the domain is listed in the Stackalytics configuration file (default_data.json), then the affiliation for the commit is determined from the email address.
  • If the email domain doesn't provide enough information, Stackalytics retrieves the author profile from Launchpad using the email address. If Launchpad does not identify the author, then the commit is affiliated with *independent, unless the next step provides enough information to determine the author's affiliation.
  • Next, the Launchpad ID is used as a primary key for further author identification. Stackalytics stores profiles for known contributors in the same configuration file (default_data.json). Each profile has a historical list of contributor affiliations. For example:
      {
           "launchpad_id": "boris-42",
           "companies": [
               {
                   "company_name": "*independent",
                   "end_date": "2013-Apr-10"
               },
               {
                   "company_name": "Mirantis",
                   "end_date": null
               }
           ],
           "user_name": "Boris Pavlovic",
           "emails": [
               "boris@pavlovic.me"
           ]
       },

As shown above, each entry in the companies list has a company_name and an end_date. This information is enough to determine the affiliation of any given commit based on its date, so that an individual who has worked for more than one company has his or her commits properly attributed.

  • Finally, if all of the checks above fail, the commit is affiliated with *independent.
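The date-based lookup in a contributor profile can be sketched as follows. This is an illustrative sketch rather than the Stackalytics implementation: the function names are hypothetical, and it assumes that the companies list is ordered chronologically and that end_date marks the first day on which the affiliation no longer applies (the text above does not say whether the date is inclusive).

```python
from datetime import datetime, timezone

# Month abbreviations as used in end_date values such as "2013-Apr-10".
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_end_date(text):
    """Convert '2013-Apr-10' to a UTC datetime; None means 'still current'."""
    if text is None:
        return None
    year, month, day = text.split("-")
    return datetime(int(year), MONTHS[month], int(day), tzinfo=timezone.utc)


def company_for_commit(profile, commit_timestamp):
    """Pick the affiliation that was active at the commit's Unix timestamp.

    Assumes profile["companies"] is in chronological order (an assumption).
    """
    commit_date = datetime.fromtimestamp(commit_timestamp, tz=timezone.utc)
    for entry in profile["companies"]:
        end = parse_end_date(entry["end_date"])
        if end is None or commit_date < end:
            return entry["company_name"]
    # No entry covers the commit date: fall back to the default affiliation.
    return "*independent"


profile = {
    "launchpad_id": "boris-42",
    "companies": [
        {"company_name": "*independent", "end_date": "2013-Apr-10"},
        {"company_name": "Mirantis", "end_date": None},
    ],
}

# 1369119386 is the `date` value from the sample git log output (21 May 2013).
print(company_for_commit(profile, 1369119386))  # prints Mirantis
```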

Commits metrics corrections and a common sense approach

The recent history of contributions to OpenStack shows that the lines of code (LOC) metric is not reliable on its own, because some commits are not representative of the effort involved, such as those produced by code auto-generation or automatic code refactoring.

Stackalytics provides a framework for a community-driven correction process. Corrections are stored in the corrections.json file in the Stackalytics StackForge repo. These corrections look something like this:

{
   "corrections": [
       {
           "commit_id": "ee3fe4e836ca1c81e50a8324a9b5f982de4fa97f",
           "correction_comment": "Reset LOC to 0",
           "lines_added": 0,
           "lines_deleted": 0
       }
   ]
}

The structure of these records is self-descriptive, and any OpenStack contributor can file a bug and provide a patchset for this file in order to apply a particular correction. This patchset goes through the standard review process, and as soon as it merges into the upstream project, the changes are immediately visible in the Stackalytics data. Note that this process is driven by the community and should not be used for improper manipulation of statistics. Corrected commits are marked with comments in red in the web dashboard and are fully transparent, should anyone else wish to challenge them further.
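To illustrate how such records could be applied, here is a minimal Python sketch. It is not taken from the Stackalytics code base: the function apply_corrections and the in-memory commit dictionaries (with lines_added and lines_deleted keys mirroring the correction fields) are assumptions made for this example.

```python
import json


def apply_corrections(commits, corrections_json):
    """Return copies of the commits with matching corrections applied.

    `commits` is a list of dicts that each carry "commit_id", "lines_added"
    and "lines_deleted" (a hypothetical in-memory shape for this sketch).
    """
    records = json.loads(corrections_json)["corrections"]
    by_commit = {record["commit_id"]: record for record in records}

    corrected = []
    for commit in commits:
        fix = by_commit.get(commit["commit_id"])
        if fix is None:
            corrected.append(commit)
            continue
        updated = dict(commit)
        # Override only the metrics that the correction record specifies.
        for key in ("lines_added", "lines_deleted"):
            if key in fix:
                updated[key] = fix[key]
        # Keep the comment so the dashboard can show why the numbers changed.
        updated["correction_comment"] = fix.get("correction_comment", "")
        corrected.append(updated)
    return corrected


corrections_json = """
{
   "corrections": [
       {
           "commit_id": "ee3fe4e836ca1c81e50a8324a9b5f982de4fa97f",
           "correction_comment": "Reset LOC to 0",
           "lines_added": 0,
           "lines_deleted": 0
       }
   ]
}
"""

commits = [
    # The line counts below are made-up sample values.
    {"commit_id": "ee3fe4e836ca1c81e50a8324a9b5f982de4fa97f",
     "lines_added": 5000, "lines_deleted": 12},
    {"commit_id": "b5a416ac344160512f95751ae16e6612aefd4a57",
     "lines_added": 340, "lines_deleted": 408},
]

for commit in apply_corrections(commits, corrections_json):
    print(commit["commit_id"][:8], commit["lines_added"] + commit["lines_deleted"])
```

Commits without a matching record pass through unchanged, so the corrections file only needs to list the exceptions.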

This framework was designed in order to make statistical data more reliable and representative. The following common sense approach should be used:

  • Commits that contain auto-generated files should be adjusted to represent the amount of effort actually produced by the contributor, not including generated output.
  • Commits that contain the result of automatic code refactoring should be adjusted accordingly.
  • Commits that are the result of improperly renamed files (shell rename instead of git rename) should be zeroed.
  • Commits with binary and 3rd party files should be adjusted accordingly.

The default corrections file contains manually filtered commits from the last two release cycles (Grizzly and Havana) with a LOC higher than 3000.

Tracked projects and classification

Stackalytics is able to track any project that uses the standard OpenStack development infrastructure of Git, Gerrit, and Launchpad. At the high level, all projects are divided into two major groups: OpenStack and StackForge. This classification is determined by the GitHub organization account. Stackalytics stores a list of projects in its persistent storage and uses a configuration file for the initial setup. Any OpenStack contributor is able to file a bug and provide a patchset for the addition of an untracked project. The repos section represents the list of tracked projects. It has the following format:

   "repos": [
       {
           "branches": ["master"],
           "module": "nova",
           "project_type": "openstack",
           "project_group": "core",
           "uri": "git://github.com/openstack/nova.git",
           "releases": [
               {
                   "release_name": "Essex",
                   "tag_from": "2011.3",
                   "tag_to": "2012.1"
               },
               {
                   "release_name": "Folsom",
                   "tag_from": "2012.1",
                   "tag_to": "2012.2"
               },
               {
                   "release_name": "Grizzly",
                   "tag_from": "2012.2",
                   "tag_to": "2013.1"
               },
               {
                   "release_name": "Havana",
                   "tag_from": "2013.1",
                   "tag_to": "HEAD"
               }
           ]
       }
   ]

It is essential to provide Git tags or commit IDs for known release cycles. Otherwise, Stackalytics assigns commits to release cycles according to the commit date. The end dates of release cycles are specified in the Stackalytics configuration file default_data.json (section "releases"). The project_type field represents the high-level openstack/stackforge classification. The project_group field provides a way to classify projects in more detail. OpenStack projects are classified according to their official OpenStack Program description. For OpenStack projects the following groups were identified:

  • core - projects that are hosted in the OpenStack GitHub repo and are listed in OpenStack Program description as core projects.
  • incubation - projects that are hosted in the OpenStack GitHub repo and listed in OpenStack Program description as incubation projects.
  • documentation - project documentation, which has been extracted into a dedicated group because the statistical comparison with code was not appropriate.
  • other - projects that are hosted in the OpenStack GitHub repo and are not classified by the rules above.

Second-level classification for StackForge projects requires additional research and is not used at the moment.

Stackalytics is able to configure most OpenStack projects automatically. For this purpose, the Stackalytics configuration file default_data.json contains the section "project_sources". It is a list of GitHub organizations with the attributes required for project classification. For example, the record:

       {
           "organization": "openstack",
           "project_type": "openstack",
           "project_group": "other"
       },

says that all projects from the OpenStack GitHub organization are attributed to "project_type": "openstack" and "project_group": "other" unless they are listed in the "repos" section. For automatically imported projects, the boundaries of release cycles are determined by commit dates.
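The precedence between the two sections can be sketched in Python. This is only an illustration of the rule described above, under assumptions: the function classify_projects and the list of discovered module names are hypothetical, and the sketch does not call the GitHub API.

```python
def classify_projects(discovered_modules, project_source, repos):
    """Map each module name to its (project_type, project_group).

    An explicit entry in `repos` wins; every other module found in the
    organization falls back to the defaults of the `project_sources` record.
    """
    explicit = {repo["module"]: repo for repo in repos}
    default = (project_source["project_type"], project_source["project_group"])

    classification = {}
    for module in discovered_modules:
        repo = explicit.get(module)
        if repo is not None:
            classification[module] = (repo["project_type"], repo["project_group"])
        else:
            classification[module] = default
    return classification


project_source = {
    "organization": "openstack",
    "project_type": "openstack",
    "project_group": "other",
}
repos = [
    {"module": "nova", "project_type": "openstack", "project_group": "core"},
]

# "example-module" stands in for any project that has no entry in "repos".
print(classify_projects(["nova", "example-module"], project_source, repos))
# prints {'nova': ('openstack', 'core'), 'example-module': ('openstack', 'other')}
```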

Release Notes

Release 0.2

  • Changed internal architecture. Got rid of persistent storage in Mongo. Configuration file default_data.json now serves as persistent storage.
  • Added polling of Gerrit for retrieval of review related source data.
  • Implemented basic statistics for reviews: number of reviews over time, number of negative/positive reviews, ratio of positive and negative reviews.
  • Redesigned navigation and layout of statistics pages.
  • Implemented auto-assignment of commits to release cycles based on dates.
  • Implemented automated retrieval of projects list from GitHub API.
  • Cleaned up default_data.json (reduced from 14k LOC to 4k).
  • Added to corrections.json all commits that are over 3k LOC and look auto-generated.
  • Implemented feature request "Sort bugs ID as numbers, not as strings" (https://bugs.launchpad.net/stackalytics/+bug/1204926)
  • Fixed bunch deletion from memcached (https://bugs.launchpad.net/stackalytics/+bug/1209211)

Release 0.1

  • Changed internal architecture. Main features: advanced real time processing and horizontal scalability.
  • Got rid of all 3rd party non-Apache libraries and published sources on StackForge under Apache2 license.
  • Improved release cycle tracking by means of Git tags instead of approximate date periods.
  • Changed project classification to a two-level structure: OpenStack (core, incubation, documentation, other) and StackForge.
  • Implemented a correction mechanism that allows users to tweak metrics for particular commits.
  • Added a bunch of new projects (Tempest, documentation, Puppet recipes).
  • Added a company-affiliated contribution breakdown to the user profile page.

How To's

Stackalytics/HowToRun - how to install Stackalytics and run it in dev or prod environments

Code

Source

https://github.com/stackforge/stackalytics

Pending Code Reviews

https://review.openstack.org/#q,status:open+stackalytics,n,z

Project space

https://launchpad.net/stackalytics

Blueprints

https://blueprints.launchpad.net/stackalytics

Bugs

https://bugs.launchpad.net/stackalytics

Web-site

http://stackalytics.com/