Difference between revisions of "Stackalytics"

Revision as of 20:47, 20 July 2013

Mission

The Stackalytics project is on a mission to provide transparent and meaningful statistics regarding contribution to both OpenStack itself and projects related to OpenStack. But what does "transparent and meaningful" mean? Transparency is important so that the community can have confidence that all calculations are correct and fair, so "transparent" means that anyone can double-check the methods of calculation Stackalytics uses. Meanwhile, results must be meaningful to be useful, so if someone does discover a problem with a method for calculating statistics, "meaningful" means that anyone may submit a correction that will adjust the influence of the affected statistical data. For example, auto-generated code, mass renaming, automatic refactoring, auto-generated config files, and so on can artificially inflate various statistics. Stackalytics makes it possible to correct these problems as they're discovered.

Description

Stackalytics is a service that collects and processes development activity data such as commits, lines of code changed, and code reviews, and makes it possible to visualize them in a convenient web dashboard. The Stackalytics dashboard makes it possible to view data by project, company, contributor, and other factors.

The primary data sources for Stackalytics are the OpenStack Git repositories and the Gerrit review history.

Git commits history

Stackalytics processes three major metrics for OpenStack contribution:

Number of commits
Number of modified files
Number of modified lines

These statistics are retrieved from the output of the following command:

git log --pretty="commit_id:%H%ndate:%at%nauthor:%an%nauthor_email:%ae%nauthor_email:%ae%nsubject:%s%nmessage:%b%ndiff_stat:" --shortstat -M --no-merges

The output from this command looks something like this:

commit_id:b5a416ac344160512f95751ae16e6612aefd4a57
date:1369119386
author:Akihiro MOTOKI
author_email:motoki@da.jp.nec.com
author_email:motoki@da.jp.nec.com
subject:Remove class-based import in the code repo
message:Fixes bug 1167901
This commit also removes backslashes for line break.
Change-Id: Id26fdfd2af4862652d7270aec132d40662efeb96
diff_stat:
21 files changed, 340 insertions(+), 408 deletions(-)

This commit changes 21 files and 340 + 408 = 748 LOC (lines of code). In other words, LOC is the sum of insertions and deletions.
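The LOC arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the Stackalytics implementation: the function name `parse_shortstat` and the regular expression are assumptions introduced here, and the sample text is the record shown above.

```python
import re

# Sample record copied from the output shown above.
SAMPLE = """commit_id:b5a416ac344160512f95751ae16e6612aefd4a57
date:1369119386
author:Akihiro MOTOKI
author_email:motoki@da.jp.nec.com
author_email:motoki@da.jp.nec.com
subject:Remove class-based import in the code repo
message:Fixes bug 1167901
This commit also removes backslashes for line break.
Change-Id: Id26fdfd2af4862652d7270aec132d40662efeb96
diff_stat:
 21 files changed, 340 insertions(+), 408 deletions(-)
"""

# git omits the insertions or deletions part when it is zero,
# so both groups are optional.
SHORTSTAT = re.compile(
    r"(?P<files>\d+) files? changed"
    r"(?:, (?P<added>\d+) insertions?\(\+\))?"
    r"(?:, (?P<deleted>\d+) deletions?\(-\))?"
)


def parse_shortstat(text):
    """Return files changed, lines added, lines deleted and LOC for one record."""
    match = SHORTSTAT.search(text)
    if match is None:
        return {"files_changed": 0, "lines_added": 0, "lines_deleted": 0, "loc": 0}
    added = int(match.group("added") or 0)
    deleted = int(match.group("deleted") or 0)
    return {
        "files_changed": int(match.group("files")),
        "lines_added": added,
        "lines_deleted": deleted,
        "loc": added + deleted,  # LOC is the sum of insertions and deletions
    }


stats = parse_shortstat(SAMPLE)
print(stats)
```

For the sample record this yields 21 files changed and a LOC of 748, matching the calculation above.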

Company affiliation for each commit author is determined according to the following rules:

First, Stackalytics checks the domain of the author's email address. If the domain is in Stackalytics persistent storage, then the affiliation for the commit is determined from the email address.
If the email domain doesn't provide enough information, Stackalytics retrieves the author profile from Launchpad using the email address. If Launchpad does not identify the author, then the commit is affiliated with *independent, unless the next step provides enough information to determine the author's affiliation.
Next, the Launchpad ID is used as a primary key for further author identification. Stackalytics stores profiles for known contributors in its persistent storage. Each profile has a historical list of contributor affiliations. For example:

      {
           "launchpad_id": "boris-42",
           "companies": [
               {
                   "company_name": "*independent",
                   "end_date": "2013-Apr-10"
               },
               {
                   "company_name": "Mirantis",
                   "end_date": null
               }
           ],
           "user_name": "Boris Pavlovic",
           "emails": [
               "boris@pavlovic.me"
           ]
       },

As shown above, Stackalytics has a company_name and an end_date. This information is enough to determine the affiliation on any given commit based on its date, so that an individual who has worked for more than one company can have his or her commits properly attributed.

Finally, if all the checks above fail, the commit is affiliated with *independent.
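The date-based lookup described above might look roughly like the following Python sketch. It assumes that the companies list is ordered chronologically, that an end_date of null marks the current affiliation, and that the end date itself is exclusive (midnight UTC); none of these details are confirmed by this page. The helper names `parse_end_date` and `company_for_commit` are hypothetical.

```python
from datetime import datetime, timezone

# Profile copied from the example above.
PROFILE = {
    "launchpad_id": "boris-42",
    "companies": [
        {"company_name": "*independent", "end_date": "2013-Apr-10"},
        {"company_name": "Mirantis", "end_date": None},
    ],
    "user_name": "Boris Pavlovic",
    "emails": ["boris@pavlovic.me"],
}


def parse_end_date(value):
    """Convert "2013-Apr-10" to a UTC datetime; None means open-ended.

    %b depends on the current locale; the default C locale accepts "Apr".
    """
    if value is None:
        return None
    return datetime.strptime(value, "%Y-%b-%d").replace(tzinfo=timezone.utc)


def company_for_commit(profile, commit_timestamp):
    """Pick the affiliation that was active at the commit's author date."""
    commit_date = datetime.fromtimestamp(commit_timestamp, tz=timezone.utc)
    for entry in profile["companies"]:
        end = parse_end_date(entry["end_date"])
        # Assumption: entries are sorted by date and the end date is exclusive.
        if end is None or commit_date < end:
            return entry["company_name"]
    # Fallback when no entry matches, as in the final rule above.
    return "*independent"


print(company_for_commit(PROFILE, 1356998400))  # a commit dated 2013-01-01 UTC
print(company_for_commit(PROFILE, 1369119386))  # the date of the sample commit
```

With this profile, a commit from January 2013 resolves to *independent, while one from late May 2013 resolves to Mirantis.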

Commits metrics corrections and common sense approach

The recent history of contribution to OpenStack shows that the LOC metric is not reliable when commits are not representative of real effort, such as code auto-generation or automatic code refactoring. The best-known examples are:

Rename Quantum to Neutron Change-Id: Ib86e068aa8e4f48993809b6b25444407b7c1f17e
Updated translations from Transifex Change-Id: I4810c45d15413bdf21b9f68f59096c907bb1e624

Stackalytics provides a framework for a community-driven correction process. The Stackalytics StackForge repo contains a JSON file, corrections.json, with records like the following:

{
   "corrections": [
       {
           "commit_id": "ee3fe4e836ca1c81e50a8324a9b5f982de4fa97f",
           "correction_comment": "Reset LOC to 0",
           "lines_added": 0,
           "lines_deleted": 0
       }
   ]
}

The structure of these records is self-descriptive. Any OpenStack contributor can file a bug and submit a patchset for this file to apply a particular correction. The patchset goes through the standard review process and applies to Stackalytics data as soon as it is merged upstream. This process is driven by the community and should not be used to manipulate statistics improperly. Corrected commits are marked with comments in red in the web dashboard and remain fully transparent for further challenges.
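As a rough illustration of how such records could be applied, the Python sketch below overrides the line counts of matching commits and recomputes LOC. The function `apply_corrections` and the in-memory commit records are hypothetical, and the line counts given for the corrected commit are invented for the example; how Stackalytics applies corrections internally is not described here.

```python
import json

# Same structure as the corrections.json record shown above.
CORRECTIONS_JSON = """
{
   "corrections": [
       {
           "commit_id": "ee3fe4e836ca1c81e50a8324a9b5f982de4fa97f",
           "correction_comment": "Reset LOC to 0",
           "lines_added": 0,
           "lines_deleted": 0
       }
   ]
}
"""


def apply_corrections(commits, corrections):
    """Return copies of the commit records with corrections applied."""
    by_id = {item["commit_id"]: item for item in corrections}
    corrected = []
    for commit in commits:
        record = dict(commit)  # leave the input records untouched
        fix = by_id.get(record["commit_id"])
        if fix is not None:
            for field in ("lines_added", "lines_deleted"):
                if field in fix:
                    record[field] = fix[field]
            # Keep the comment so the correction stays visible.
            record["correction_comment"] = fix.get("correction_comment", "")
        record["loc"] = record["lines_added"] + record["lines_deleted"]
        corrected.append(record)
    return corrected


# Hypothetical commit records; the first one's line counts are made up.
commits = [
    {"commit_id": "ee3fe4e836ca1c81e50a8324a9b5f982de4fa97f",
     "lines_added": 1000, "lines_deleted": 1000},
    {"commit_id": "b5a416ac344160512f95751ae16e6612aefd4a57",
     "lines_added": 340, "lines_deleted": 408},
]

corrections = json.loads(CORRECTIONS_JSON)["corrections"]
for record in apply_corrections(commits, corrections):
    print(record["commit_id"][:8], record["loc"], record.get("correction_comment", ""))
```

Keeping the correction comment on the record mirrors the idea that corrected commits stay visible and open to challenge.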

This framework was designed to make statistical data more reliable and representative. The following common sense approach is supposed to be used:

Commits that contain auto-generated files should be adjusted to represent the effort spent on the generator, not the output it produced.
Commits that contain the result of automatic code refactoring should be adjusted accordingly.
Commits that result from improper file renames (a shell rename instead of a git rename) should be zeroed.
Commits with binary and third-party files should be adjusted accordingly.

Tracked projects and classification

Stackalytics can track any project that uses the standard OpenStack development infrastructure (Git, Gerrit, Launchpad). At a high level, all projects are divided into two major groups: OpenStack and StackForge. This classification is determined by the GitHub organization account. Stackalytics stores a list of projects in its persistent storage and uses the following config file for initial setup. Any OpenStack contributor can file a bug and submit a patchset to add an untracked project. The repos section lists the tracked projects. It has the following format:

   "repos": [
       {
           "branches": ["master"],
           "module": "nova",
           "project_type": "openstack",
           "project_group": "core",
           "uri": "git://github.com/openstack/nova.git",
           "releases": [
               {
                   "release_name": "Essex",
                   "tag_from": "2011.3",
                   "tag_to": "2012.1"
               },
               {
                   "release_name": "Folsom",
                   "tag_from": "2012.1",
                   "tag_to": "2012.2"
               },
               {
                   "release_name": "Grizzly",
                   "tag_from": "2012.2",
                   "tag_to": "2013.1"
               },
               {
                   "release_name": "Havana",
                   "tag_from": "2013.1",
                   "tag_to": "HEAD"
               }
           ]
       }
 ]

It is essential to provide Git tags or commit IDs for known release cycles; otherwise Stackalytics cannot properly attribute contributions to release cycles. The project_type field represents the high-level classification. The project_group field classifies projects in more detail. OpenStack projects are classified according to the official OpenStack Program description. For OpenStack projects, the following groups were identified:

core - projects that are hosted in the OpenStack GitHub repo and listed in the OpenStack Program description as core projects.
incubation - projects that are hosted in the OpenStack GitHub repo and listed in the OpenStack Program description as incubation projects.
documentation - extracted into a dedicated group because documentation cannot be meaningfully compared with code.
other - projects that are hosted in the OpenStack GitHub repo and are not classified by the rules above.

Second-level classification for StackForge projects requires additional research and is not used at the moment.
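To show how the tag_from and tag_to fields could drive release tracking, the Python sketch below turns each release entry into a Git revision range of the form tag_from..tag_to. Whether Stackalytics builds its ranges exactly this way is an assumption; `release_ranges` and `git_log_command` are hypothetical helpers, and the repo entry is abbreviated from the config shown above.

```python
# Abbreviated copy of the "nova" entry from the repos section above.
REPO = {
    "module": "nova",
    "project_type": "openstack",
    "project_group": "core",
    "releases": [
        {"release_name": "Essex", "tag_from": "2011.3", "tag_to": "2012.1"},
        {"release_name": "Folsom", "tag_from": "2012.1", "tag_to": "2012.2"},
        {"release_name": "Grizzly", "tag_from": "2012.2", "tag_to": "2013.1"},
        {"release_name": "Havana", "tag_from": "2013.1", "tag_to": "HEAD"},
    ],
}


def release_ranges(repo):
    """Map each release name to a Git revision range "tag_from..tag_to"."""
    return {
        release["release_name"]: "{}..{}".format(release["tag_from"], release["tag_to"])
        for release in repo["releases"]
    }


def git_log_command(revision_range):
    """Build an argument list for git log limited to one release cycle.

    The --pretty format from the earlier section is left out for brevity.
    """
    return ["git", "log", "--shortstat", "-M", "--no-merges", revision_range]


for name, revision_range in release_ranges(REPO).items():
    print(name, " ".join(git_log_command(revision_range)))
```

A range such as 2013.1..HEAD selects the commits reachable from HEAD but not from the 2013.1 tag, which is why an open release cycle can use HEAD as its tag_to value.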

Release Notes

Release 0.1

Changed the internal architecture. Main features: advanced real-time processing and horizontal scalability.
Removed all third-party non-Apache libraries and published the sources on StackForge under the Apache 2 license.
Improved release cycle tracking by using Git tags instead of approximate date periods.
Changed project classification to a two-level structure: OpenStack (core, incubation, documentation, other) and StackForge.
Implemented a correction mechanism that allows tweaking the metrics of particular commits.
Added a number of new projects (Tempest, documentation, Puppet recipes).
Added a company-affiliated contribution breakdown to the user profile page.

How To's

Stackalytics/HowToRun - how to install Stackalytics and run it in dev or prod environments

Code

Source

Pending Code Reviews

Project space

Blueprints

Bugs

Web-site