Showing posts with label Data. Show all posts

Monday, November 17, 2014

Getting started with AzureML with the End-to-End tutorial...

Continuous Learning - End-to-End Predictive Model in AzureML using Linear Regression

Machine Learning (ML) is one of the most popular fields in the Computer Science discipline, but it is also one of the most feared by developers. The fear stems primarily from ML being seen as a scientific field that requires deep mathematical expertise most of us have forgotten. In today's world, ML has two disciplines: ML and Applied ML. My goal is to make Machine Learning easier for developers to understand through simple applications; in other words, to bridge the gap between a developer and a data scientist. In this blog, I will provide you with a step-by-step guide for building a Linear Regression model in AzureML to predict the price of a car. You will also learn the basics of AzureML along the way, as well as how to apply it in the real world by creating a Windows Universal client app.
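To see what the AzureML Linear Regression module is doing conceptually, here's a minimal local sketch of fitting a line to predict price from one feature. The numbers and the `fit_line` helper are made up for illustration; they are not the tutorial's Automobile price dataset or AzureML's actual implementation:

```python
# Ordinary least squares for one feature (price vs. engine size) --
# the same idea AzureML's Linear Regression module applies at scale.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical training data: engine size vs. price.
engine_size = [97, 109, 130, 152, 183]
price       = [5572, 7295, 9988, 13495, 17450]

slope, intercept = fit_line(engine_size, price)
predicted = slope * 120 + intercept   # predict the price of a 120-unit engine
print(round(predicted))
```

In AzureML the same train/score split happens visually: a Train Model module learns the coefficients and a Score Model module applies them to new rows.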

What is AzureML?

AzureML is meant to democratize Machine Learning and build a new ecosystem and marketplace for monetizing algorithms.  You can find more information about AzureML here.

Why AzureML?

Because it is one of the simplest tools to use for Machine Learning. AzureML reduces the barriers to entry for anyone who wants to try out Machine Learning. You don’t have to be a data scientist to build Machine Learning models anymore.

Logical Machine Learning Flow

The figure below illustrates a typical machine learning process with the end result in mind.






AzureML is a new and highly productive tool for Machine Learning. It may be the only tool that lets you publish a machine learning web service directly from your design environment. Machine Learning is a vast topic, and the Linear Regression model discussed in this article only scratches the surface. In this article, I went over a stale dataset to showcase AzureML as a predictive analytics tool; you can apply the same procedures and components to Classification and Clustering models. Finally, my goal was to write about Applied Machine Learning. I am not a Data Scientist, but with all of today's productive tools, I feel I can put to work some of the great algorithms that scientists have already invented.

Some more Datasets you can play around with

  1. Daily and Sports Activities Data Set link
  2. Farm Ads Data Set link
  3. Arcene Data Set link
  4. Bag of Words Data Set link


There's a free tier for Azure ML that was announced the week before last, so if you've been yearning to play in a Machine Learning sandbox, Azure ML and this post will get you started!

Thursday, September 25, 2014

Adventure Works, 2014

Jimmy May, Aspiring Geek: SQL Server Performance, Best Practices, & Productivity - AdventureWorks 2014 Sample Databases Are Now Available

Recently, feedback from the SQL community on Twitter prompted me to look in vain for SQL Server 2014 versions of the AdventureWorks sample databases we’ve all grown to know & love.

I searched Codeplex, then used the bing & even the google in an effort to locate them, yet all I could find were samples on different sites highlighting specific technologies, an incomplete collection inconsistent with the experience we users had learned to expect.  I began pinging internally & learned that an update to AdventureWorks wasn’t even on the road map.

Fortunately, SQL Marketing manager Luis Daniel Soto Maldonado (t) lent a sympathetic ear & got the update ball rolling; his direct report Darmodi Komo recently announced the release of the shiny new sample databases for OLTP, DW, Tabular, and Multidimensional models to supplement the extant In-Memory OLTP sample DB.

What Success Looks Like

In my correspondence with the team, here’s how I defined success:

1. Sample AdventureWorks DBs hosted on Codeplex showcasing SQL Server 2014’s latest-&-greatest features, including:

  • In-Memory OLTP (aka Hekaton)
  • Clustered Columnstore
  • Online Operations
  • Resource Governor IO

2. Where it makes sense to do so, consolidate the DBs (e.g., showcasing Columnstore likely involves a separate DW DB)

3. Documentation to support experimenting with these features

As Microsoft Senior SDE Bonnie Feinberg (b) stated, “I think it would be great to see an AdventureWorks for SQL 2014.  It would be super helpful for third-party book authors and trainers.  It also provides a common way to share examples in blog posts and forum discussions, for example.”


Adventure Works 2014 Sample Databases

Adventure Works 2014 sample databases are an upgrade from the 2012 version. To learn how to install the databases, see Readme for Adventure Works 2014 Sample Databases.docx.


Having training and sample data that is safe to demo and use is always, always, always nice...


Related Past Post XRef:
Community AdventureWorks on Azure one year later, where you, the community, not only helped keep it going BUT also donated GBP351.49 to War Child charity
Community Driven Read-Only AdventureWorks2012 now available on SQL Azure
103 SQL Server 2005 Samples and AdventureWorks Sample Databases Download

Wednesday, September 03, 2014

Using Brent Ozar's magic SQL steps to query and find unanswered StackExchange questions

Brent Ozar Unlimited - Finding Unanswered StackExchange Questions with SQL

You love Q&A sites like Stack Overflow and the rest of the Stack Exchange network, but sometimes it’s hard to find interesting questions that need to be answered. So many people just sit around hitting refresh, knocking out the new incoming questions as soon as they come in. What’s a database person to do?

Use the power of the SQL. The Stack Exchange Data Explorer lets you run real T-SQL queries against a recently restored copy of the StackExchange databases. Here’s my super-secret 3-step process to find questions that I have a shot at answering.

Step 1. Find out how old the restored database is....


Step 2. Find questions everybody’s talking about....



Step 3. Find questions that people keep looking at....
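Brent's actual T-SQL isn't reproduced here, but the gist of the three steps is simple filtering and ranking. Here's a hypothetical Python stand-in; the field names (`AnswerCount`, `ViewCount`) mirror the real Stack Exchange schema, while the rows themselves are invented:

```python
# Hypothetical stand-in for the T-SQL steps: surface unanswered
# questions that people keep looking at, ranked by views.
# Column names mirror the Stack Exchange data dump's Posts table.

questions = [
    {"Id": 1, "Title": "Index rebuild vs reorg?", "AnswerCount": 0, "ViewCount": 950},
    {"Id": 2, "Title": "Why is my query slow?",   "AnswerCount": 3, "ViewCount": 400},
    {"Id": 3, "Title": "Shrink tempdb safely?",   "AnswerCount": 0, "ViewCount": 120},
]

# Keep only unanswered questions, then sort by how often they're viewed.
unanswered = [q for q in questions if q["AnswerCount"] == 0]
unanswered.sort(key=lambda q: q["ViewCount"], reverse=True)

for q in unanswered:
    print(q["Id"], q["Title"], q["ViewCount"])
```

The real queries add a date filter against the restore point from Step 1, which is why knowing the age of the restored database matters.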



Why web query when you can just SQL your way through StackExchange? I don't know about you, but I often dream in SQL (no lie... sigh), so this approach to StackExchange struck a chord with me. Now, if only I was actually smart enough to provide good answers... :O


Related Past Post XRef:
SELECT * FROM StackExchange. There's the easy way and the hard, yet much more data fun, way...
Stacks and stacks of data - Your copy of the Stack Overflow’s (and family) public data is a download away

The Stack Family (StackOverflow, SuperUser, etc) gets OData’d via Stack Exchange Data Explorer
Build something awesome with the new StackExchange v2 API and win something awesome...
Stacking up the Open Source Projects, Stack Exchange is...

Tuesday, April 29, 2014

Nine to Mine - Nine free Data Mining/Analysis eBooks

CodeCondo - 9 Free Books for Learning Data Mining & Data Analysis

Data mining and data analysis are two terms that very often give the impression of being very hard to understand – complex – and that you’re required to have the highest-grade education in order to understand them.


By learning from these books, you will quickly uncover the ‘secrets’ of data mining and data analysis, and hopefully be able to make better judgement of what they do, and how they can help you in your working projects, both now and in the future.

I just want to say that, in order to learn these complex subjects, you need to have a completely open mind and be open to every possibility, because that is usually where all the learning happens – and no doubt your brain is going to set itself on fire, multiple times.



Learn Data Science from Free Books

There is no better way to learn than from books, and then going out into the world and putting that newfound knowledge to the test; otherwise we’re bound to forget what we actually learned. This is a beautiful list of books that every aspiring data scientist should take note of and add to their list of learning materials.

What books have you read in order to help you begin your own journey in data mining and analysis? I’m sure that the community would love to hear more, and I’m eager to see what I potentially let slip through my fingers myself.

Some light reading for the week...

(via KDNuggets - 9 Free Books for Learning Data Mining and Data Analysis)


Related Past Post XRef:
"Theory and Applications for Advanced Text Mining" Open eBook...
Free Big Data eBook of the Day, "Mining of Massive Datasets"

Wednesday, February 26, 2014

I see data visualizations... Power BI, Power Map and Power Q&A [Oh my]

SQL Server Blog - Data Visualizations

A couple of weeks back was a really exciting time for us. Less than a year after we released Office 365 for Businesses, we announced the general availability of Power BI for Office 365. You may have read previous blog articles by Quentin Clark on Making Big Data Work for Everyone and Kamal Hathi on Simplifying Business Intelligence through Power BI for Office 365. In this article, we’ll outline how we think about visualizations.

Why Visualizations Matter

While a list of items is great for entering or auditing data, data visualizations are a great way to distill information down to what matters most, in a form that can be understood quickly.


Visualizations in Productivity Apps

We have the privilege of having the largest community of users of productivity applications in the world. Thanks...


Faster Creation of Visualizations

Excel 2007 introduced the ability to set the style of a chart with one click and leverage richer graphics such as shadows, anti-aliased lines, and transparency.

Office 2013 was one of our most ground-breaking releases.


Richer Interactivity

Part of my role at Microsoft involves presenting on various topics to stakeholders, and increasingly most of these include data visualizations. Only a few years back, I remember ...


Visualizations on All Data

In addition, both data volumes and the types of data customers want to visualize have expanded as well.

Excel 2013 also introduced the Data Model, opening the door for workbooks that contained significantly larger datasets than before, with richer ways to express business logic directly within the workbook.

Increasingly, we have access to geospatial data, and the recently introduced Power Map brings a new 3D visualization tool for mapping, exploring, and interacting with geographical and temporal data to Excel, enabling people to discover and share new insights such as trends, patterns, and outliers in their data over time...


We are very excited to have introduced Power Q&A as part of the Power BI launch. This innovative experience makes it even easier to understand your data by providing a natural language experience that interprets your question and immediately serves up the correct answer on the fly in the form of an interactive chart or graph. These visualizations change dynamically as you modify the question, creating a truly interactive experience with your data.




Visualizations Everywhere

As customers create insights and share them, we have also invested in ensuring SharePoint 2013 and Office 365 provide the same full-fidelity rendering as the desktop client, so the visualizations remain beautiful wherever they’re consumed.

What’s Next?


The Power Q&A looks interesting. I'd love to be able to provide that kind of thing in my apps. But let's see how it plays out over a version or two...


Related Past Post XRef:
Going with the GeoFlow for Excel 2013... Free 3D visualization add-in for mapping, exploring, and interacting with geographical/temporal data

Friday, January 17, 2014

SELECT * FROM StackExchange. There's the easy way and the hard, yet much more data fun, way...

Brent Ozar - How to Query the StackExchange Databases

During next week’s Watch Brent Tune Queries webcast, I’m using my favorite demo database: Stack Overflow. The Stack Exchange folks are kind enough to make all of their data available via BitTorrent for Creative Commons usage as long as you properly attribute the source.

There’s two ways you can get started writing queries against Stack’s databases – the easy way and the hard way.

The Easy Way to Query

Point your browser over to the Stack Exchange Data Explorer, and the available database list shows the number of questions and answers, plus the date of the database you’ll be querying:


The Hard Way to Query StackOverflow.COM

First, you’ll need to download a copy of the most recent XML data dump. These files are pretty big – around 15GB total – so there’s no direct download for the entire repository. There’s two ways you can get the September 2013 export:

I strongly recommend working with a smaller site’s data first like DBA.StackExchange. If you decide to work with the monster’s data, you’re going to temporarily need:

  • ~15GB of space for the download
  • ~60GB after the exports are expanded with 7zip. They’re XML, so they compress extremely well for download, but holy cow, XML is wordy.
  • ~50GB for the SQL Server database (and this will stick around)

Next, you need a tool to load that XML into the database platform of your choosing. For Microsoft SQL Server, I use Jeremiah’s improved version of the old Sky Sanders’ SODDI. Sky stopped updating his version a few years ago, and it’s no longer compatible with the current Stack dumps. Jeremiah’s current download is here, and it works with the September 2013 data dump.
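If you'd rather see what a loader is dealing with, the dump format itself is simple: each table is a single XML file whose rows are `<row .../>` elements carrying the columns as attributes. A hedged Python sketch of streaming rows out of such a file (the inline sample stands in for a real, multi-gigabyte Posts.xml):

```python
import io
import xml.etree.ElementTree as ET

# Each Stack Exchange dump file is one big element whose children are
# <row> elements with columns as attributes. iterparse streams them,
# so a ~60GB expanded file never has to fit in memory at once.
sample = io.BytesIO(b"""<?xml version="1.0"?>
<posts>
  <row Id="1" PostTypeId="1" Title="First question" Score="5" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="3" />
</posts>""")

rows = []
for _, elem in ET.iterparse(sample):
    if elem.tag == "row":
        rows.append(dict(elem.attrib))   # e.g. hand off to a bulk INSERT here
        elem.clear()                     # free the element as we stream

print(len(rows), rows[0]["Title"])
```

A real loader like SODDI does the same walk, mapping attributes to columns and bulk-inserting batches into SQL Server.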




Why Go to All This Work?

When I’m teaching performance tuning of queries and indexes, there’s no substitute for a local copy of the database. I want to show the impact of new indexes, analyze execution plans with SQL Sentry Plan Explorer, and run load tests with HammerDB.

That’s what we do in our SQL Server Performance Troubleshooting class – specifically, in my modules on How to Think Like the Engine, What Queries are Killing My Server, T-SQL Anti-patterns, and My T-SQL Tuning Process. Forget AdventureWorks – it’s so much more fun to use real data to discover tag patterns, interesting questions, and helpful users.

A great resource – both Brent's post and, of course, the data itself – for when you need some "safe" data, yet in a large enough volume to be meaningful...


Related Past Post XRef:
Stacks and stacks of data - Your copy of the Stack Overflow’s (and family) public data is a download away

The Stack Family (StackOverflow, SuperUser, etc) gets OData’d via Stack Exchange Data Explorer
Build something awesome with the new StackExchange v2 API and win something awesome...
Stacking up the Open Source Projects, Stack Exchange is...

Tuesday, November 19, 2013

A word or two or 10 about Word Clouds

Beyond Search - Easily Generate Your Own Word Clouds

Word clouds have become inescapable, and it is easy to see why – many people find such a blending of text and visual information easy to understand. But how, exactly, can you generate one of these content confections? Smashing Apps shares its collection of “10 Amazing Word Cloud Generators.”


VocabGrabber is different. It doesn’t even make a particularly pretty picture. As the name implies, VocabGrabber uses your text to build a list of vocabulary words, complete with examples of usage pulled directly from the content. This could be a useful tool for students, or anyone learning something new that comes with specialized terminology. If your learning materials are digital, a simple cut-and-paste can generate a handy list of terms and in-context examples. A valuable find in a list full of fun and useful tools.

Smashing Apps - 10 Amazing Word Cloud Generators


In this session, we are presenting 10 amazing word cloud generators for you. A word cloud can be defined as a graphical representation of word frequency, whereas word cloud generators are simply tools to map data, such as words and tags, in a visual and engaging way. These generators come with different features, including different fonts, shapes, layouts, and editing capabilities.

Without any further ado, here we are presenting a fine collection of 10 amazing and useful word cloud generators for you. Leave us a comment and let us know what you think of the proliferation of design inspiration in general on the web. Your comments are always more than welcome. Let us have a look. Enjoy!
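Under the hood, every one of these generators starts from the same thing: a word-frequency count that drives the font sizes. A minimal sketch of that first step in plain Python (this is the general idea, not any particular generator's API):

```python
import re
from collections import Counter

def word_frequencies(text, top=5):
    """Count words case-insensitively; cloud font sizes scale with these counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)

text = "Data mining and data analysis: mining data is fun, analysis too."
print(word_frequencies(text, top=3))
```

Everything a generator adds after this – fonts, shapes, layout – is presentation on top of that frequency table.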



Make sure you click through as SmashingApps has done a great job with blurbs and snap for each one.


Related Past Post XRef:
Wordle’ing Terms of Service Agreements – How a ToS would look as a word/tag cloud
Bipin shows us that creating a tag cloud doesn't have to be hard to do (in ASP.Net)
Interactive WinForm Tag Cloud Control (Think “Cool, I can add a Word/Tag Cloud thing to my WinForm app!”)
"WordCloud - A Squarified Treemap of Word Frequency" - Something like this would be cool in a Feed Reader...
Feed Stream Analysis - Web Feed/Post Analysis to Group Like/Related Posts
"Statistical parsing of English sentences"
"A Model for Weblog Research"

Tuesday, November 12, 2013

"The Field Guide to Data Science" Free eBook of the Day (Think "The non-Scientist's Guide to Data Science")

Booz Allen Hamilton - The Field Guide to Data Science


Understanding the DNA of Data Science

Data Science is the competitive advantage of the future for organizations interested in turning their data into a product through analytics. Industries from health, to national security, to finance, to energy can be improved by creating better data analytics through Data Science. The winners and the losers in the emerging data economy are going to be determined by their Data Science teams.

Booz Allen Hamilton created The Field Guide to Data Science to help organizations of all types and missions understand how to make use of data as a resource. The text spells out what Data Science is and why it matters to organizations as well as how to create Data Science teams. Along the way, our team of experts provides field-tested approaches, personal tips and tricks, and real-life case studies. Senior leaders will walk away with a deeper understanding of the concepts at the heart of Data Science. Practitioners will add to their toolboxes.

In The Field Guide to Data Science, our Booz Allen experts provide their insights in the following areas:

  • Start Here for the Basics provides an introduction to Data Science, including what makes Data Science unique from other analysis approaches. We will help you understand Data Science maturity within an organization and how to create a robust Data Science capability.
  • Take Off the Training Wheels is the practitioner's guide to Data Science. We share our established processes, including our approach to decomposing complex Data Science problems: the Fractal Analytic Model. We conclude with the Guide to Analytic Selection to help you select the right analytic techniques to conquer your toughest challenges.
  • Life in the Trenches gives a first hand account of life as a Data Scientist. We share insights on a variety of Data Science topics through illustrative case studies. We provide tips and tricks from our own experiences on these real-life analytic challenges.
  • Putting it All Together highlights our successes creating Data Science solutions for our clients. It follows several projects from data to insights and shows the impact Data Science can have on your organization.




When I first saw this title, I thought it was going to be one of those make-my-brain-hurt kinds of books, but heck, even I can read it! It's actually not dry and is kind of entertaining! If you have "data" (and who doesn't anymore), this free ebook might be a good read for you. And really, it won't make your brain explode...


(via KDNuggets - Booz Allen "Field Guide to Data Science" - free download)

Friday, October 04, 2013

Where your Local Government can get naked... (well, as in Budget Transparency, that is) - Simi Valley



What an awesome way to grok my home town's budget. While you'd think "budget = boring," this site makes it actually fun to look at, explore, and spelunk the budget. It's very eye-opening to see where all the money is going...

Thursday, September 26, 2013

Get a big jump into Big Data with the "Getting Started with Microsoft Big Data" series

Channel 9 - Getting Started with Microsoft Big Data

Developers, take this course to get an overview of Microsoft Big Data tools as part of the Windows Azure HDInsight and Storage services. As a developer, you'll learn how to create map-reduce programs and automate the workflow of processing Big Data jobs. As a SQL developer, you'll learn how Hive can make you instantly productive with Hadoop data.


Added to the billion-and-one things I need to learn ASAP. When I find the time (and the "want to"), this series looks like a great way to get started. I've done a tiny bit of Hadoop, and I already know I'm going to need all the help I can get up this learning curve...

Monday, August 26, 2013

Cool LA Metro Rail Ridership Visualization (and news too)

LA Metro Ridership



(via reddit/LosAngeles - Metro Rail Network Ridership - class project from last spring I've wanted to show off for awhile (crossposting from dataisbeautiful))

Also of note:

APIs / Feeds / Data


Developer Resources

Welcome to Metro’s developer site – this is a website for technical individuals and entities who are using transportation and multi-modal data in interesting ways. Since first releasing our transit data in the summer of 2009, numerous developers have incorporated our data into their applications — you can see a list of  featured applications here.

New Items!

Getting Started

Become a member: Joining is FREE and will allow you to comment and have direct communication with the developers responsible for each data set.

Get an API Key: You must have a valid API Key to utilize the Trip Planner Information Feed. You will be assigned a key at registration. Check your profile page to retrieve your API key.

Read the Policies: Please familiarize yourself with our Terms and Conditions, and Policies for using the various data and this website.

Read the Trip Planner Information Feed documentation: The web service offers data from 65+ Southern California transit agencies.

Read the FAQ: Questions about this site, the data, or the tools needed to utilize the data.

Monday, August 19, 2013

Fuzzy Lookup Add-In for Excel (Insert lame "Fuzzy, wuzzy was an Excel..." snip here)

Microsoft Downloads - Fuzzy Lookup Add-In for Excel

The Fuzzy Lookup Add-In for Excel performs fuzzy matching of textual data in Excel.


Date Published: 8/16/2013, 1.5 MB

The Fuzzy Lookup Add-In for Excel was developed by Microsoft Research and performs fuzzy matching of textual data in Microsoft Excel. It can be used to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables. The matching is robust to a wide variety of errors including spelling mistakes, abbreviations, synonyms and added/missing data. For instance, it might detect that the rows “Mr. Andrew Hill”, “Hill, Andrew R.” and “Andy Hill” all refer to the same underlying entity, returning a similarity score along with each match. While the default configuration works well for a wide variety of textual data, such as product names or customer addresses, the matching may also be customized for specific domains or languages.
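Microsoft hasn't published the add-in's internals, but the core idea – a similarity score between text values that tolerates reordering and noise – is easy to sketch with Python's standard-library difflib. This illustrates the concept only; it is not the add-in's actual algorithm:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score in [0, 1]; token-sort first so word order doesn't matter."""
    def norm(s):
        return " ".join(sorted(s.lower().replace(",", "").replace(".", "").split()))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# The example names from the add-in's description.
target = "Andrew Hill"
for name in ["Mr. Andrew Hill", "Hill, Andrew R.", "Andy Hill", "Jane Smith"]:
    print(name, round(similarity(target, name), 2))
```

A fuzzy join would then keep pairs whose score clears a threshold, exactly the way the add-in returns a similarity score along with each match.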

Supported Operating System

Windows 7, Windows Server 2008, Windows Vista

  • Preinstalled Software (Prerequisites): Microsoft Excel 2010
  • ...

Sounds like something I might be able to use... It would be even better if this were a .NET assembly I could call directly. I'll have to look at this and see what my programming options are...

Wednesday, July 31, 2013

Opening the U.S. Code, does the U.S. House, release in XML it does...

E Pluribus Unum - U.S. House of Representatives publishes U.S. Code as open government data

Three years on, Republicans in Congress continue to follow through on promises to embrace innovation and transparency in the legislative process. Today, the United States House of Representatives has made the United States Code available in bulk Extensible Markup Language (XML).

“Providing free and open access to the U.S. Code in XML is another win for open government,” said Speaker John Boehner and Majority Leader Eric Cantor, in a statement posted to “And we want to thank the Office of Law Revision Counsel for all of their work to make this project a reality. Whether it’s our ‘read the bill’ reforms, streaming debates and committee hearings live online, or providing unprecedented access to legislative data, we’re keeping our pledge to make Congress more transparent and accountable to the people we serve.”

House Democratic leaders praised the House of Representatives Office of the Law Revision Counsel (OLRC) for the release of the U.S. Code in XML, demonstrating strong bipartisan support for such measures.

“OLRC has taken an important step towards making our federal laws more open and transparent,” said Whip Steny H. Hoyer, in a statement.


“Just this morning, Josh Tauberer updated our public domain U.S. Code parser to make use of the new XML version of the US Code,” said Mill. “The XML version’s consistent design meant we could fix bugs and inaccuracies that will contribute directly to improving the quality of GovTrack’s and Sunlight’s work, and enables more new features going forward that weren’t possible before. The public will definitely benefit from the vastly more reliable understanding of our nation’s laws that today’s XML release enables.” (More from Tom Lee at the Sunlight Labs blog.)


“Last year, we reported that House Republicans had the transparency edge on Senate Democrats and the Obama administration,” he said. “(House Democrats support the Republican leadership’s efforts.) The release of the U.S. Code in XML joins projects like and in producing actual forward motion on transparency in Congress’s deliberations, management, and results.

For over a year, I’ve been pointing out that there is no machine-readable federal government organization chart. Having one is elemental transparency, and there’s some chance that the Obama administration will materialize with the Federal Program Inventory. But we don’t know yet if agency and program identifiers will be published. The Obama administration could catch up or overtake House Republicans with a little effort in this area. Here’s hoping they do.”

House of Representatives - US Code Most Current Release Point

Public Law 113-21
(Titles in bold are updated at this release point)

Information about the currency of United States Code titles is available on the Currency page.


The United States Code in XML uses the USLM Schema. That schema is explained in greater detail in the USLM Schema User Guide. For rendering the XML files, a Stylesheet (CSS) file is provided.

Each update of the United States Code is a "release point". This page contains links to downloadable files for the most current release point. The available formats are XML, XHTML, and PCC (photocomposition codes, sometimes called GPO locators). Certain limitations currently exist. Although older PDF files (generated through Microcomp) are available on the Annual Historical Archives page, the new PDF files for this page (to be generated through XSL-FO) are not yet available. In addition, the five appendices contained in the United States Code are not yet available in the XML format.

Links to files for prior release points are available on the Prior Release Points page. Links to older files are available on the Annual Historical Archives page.


While pretty cool, I was expecting something different. Seems the XML is really pretty much XHTML. So while it IS XML, it's still a display markup schema...


Guess we'll have to wait for this to complete, Legislative Data Challenge - Win $5k challenge by helping the Library of Congress make US laws machine readable.... Still I applaud the effort!


Related Past Post XRef:
Legislative Data Challenge - Win $5k challenge by helping the Library of Congress make US laws machine readable...
From A to W... The US Gov goes Git (and API crazy too). There's an insane amount of data, APIs, and OSS projects from the US Government...

Monday, July 29, 2013

Building big bucks with big data... "Big Data, Analytics, and the Future of Marketing & Sales" Free eBook (With audio & video)

McKinsey - Chief Marketing & Sales Officer Forum - eBook: Big Data, Analytics, and the Future of Marketing & Sales


The goldmine of data available today represents a turning point for marketing and sales leaders

Table of Contents


  • Putting big data and advanced analytics to work (& Article)

Business Opportunities

  • Use Big Data to find new micromarkets (Article)
  • Value of big data and advanced analytics (Video)
  • Big data, better decisions (Presentation)
  • Marketing’s $200 billion opportunity (Video)
  • Smart analytics: How marketing drives short-term and long-term growth (Article)
  • Know your customers wherever they are (Article)

Insights and action

  • Five steps to squeeze more ROI from your marketing (Article)
  • Case: advanced analytics disproves common wisdom (Video)
  • Getting to “the price is right” (Article)
  • Gilt Groupe: Using Big Data, mobile, and social media to reinvent shopping (Interview)
  • Under the retail microscope: Seeing your customers for the first time (Article)
  • The sales science behind Big Data (Video)
  • Name your price: The power of Big Data and analytics (Article)
  • Data: The real promise of social/local/mobile (Video)
  • Getting beyond the buzz: Is your social media working? (Article)
  • Big Data & advanced analytics: Success stories from the front lines (Article)

How to get organized and get started

  • Get started with Big Data: Tie strategy to performance (Article)
  • What you need to make Big Data work: The pencil (Article)
  • Need for speed: Algorithmic marketing and customer data overload (Article)
  • Simplify Big Data – or it’ll be useless for sales (Article)
  • The challenges of harnessing big data to better understand customers (Video)
  • Contributors
  • Connect with us

Not a dev thing, but still, big data is big, right?

Wednesday, July 17, 2013

Legislative Data Challenge - Win $5k challenge by helping the Library of Congress make US laws machine readable...

Nextgov - Contest Aims to Make Proposed U.S. Laws Machine Readable Worldwide

The Library of Congress is crowdsourcing an initiative to make it easier for software programs around the world to read, understand and categorize federal legislation.

The library is offering a $5,000 prize to the contestant whose entry best fits U.S. legislation into Akoma Ntoso, an internationally-developed framework that aims to be the standard for presenting legislative data in machine-readable formats.


News from the Library of Congress - Library of Congress Announces Legislative Data Challenge

The Library of Congress, at the request of the U.S. House of Representatives, is utilizing the platform to advance the exchange of legislative information worldwide.

Akoma Ntoso ( is a framework used in many other countries around the world to annotate and format electronic versions of parliamentary, legislative and judiciary documents. The challenge, "Markup of U.S. Legislation in Akoma Ntoso", invites competitors to apply the Akoma Ntoso schema to U.S. federal legislative information so it can be more broadly accessed and analyzed alongside legislative documents created elsewhere.

"The Library works closely with the Congress and related agencies to make America’s federal legislative record more widely available through," said Robert Dizard Jr., Deputy Librarian of Congress. "This challenge will build on that accessibility goal by advancing the possibilities related to international frameworks. American legislators, analysts, and the public can benefit from international standards that reflect U.S. legislation, thereby allowing better comparative legislative information. We are initiating this effort as people around the world are working to share legislative information across nations and other jurisdictions."

Utilizing U.S. bill text, challenge participants will attempt to mark up the text into electronic versions using the Akoma Ntoso framework. Participants will be expected to identify any issues that appear when applying the Akoma Ntoso schema to U.S. bill text, recommend solutions to resolve those issues, and provide information on the tools used to create the markup.

The challenge, which opened today and closes Oct. 31, 2013, is extended to participants 18 years of age or older. For the official rules and more detailed information about the challenge or to enter a submission, visit

The competition’s three judges are experts in either U.S. legislation XML standards or the Akoma Ntoso legal schema. The Library of Congress will announce the winner of the $5,000 prize on Dec. 19, 2013.


Akoma Ntoso

Akoma Ntoso (“linked hearts” in the Akan language of West Africa) defines a “machine readable” set of simple technology-neutral electronic representations (in XML format) of parliamentary, legislative and judiciary documents.

Akoma Ntoso XML schemas make “visible” the structure and semantic components of relevant digital documents so as to support the creation of high-value information services that deliver the power of ICTs to increase efficiency and accountability in the parliamentary, legislative and judiciary contexts.

Akoma Ntoso is an initiative of the "Africa i-Parliament Action Plan", a programme of UN/DESA.
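To make the "machine readable" part concrete, here's a tiny Python sketch that wraps one section of bill text in Akoma Ntoso-style XML. The element names (akomaNtoso, bill, section, num, heading) are drawn from the AKN vocabulary, but treat this as a loose illustration, not a schema-valid document; the namespace URI and overall structure are assumptions for the sketch.

```python
import xml.etree.ElementTree as ET

# Illustrative only: element names loosely follow the Akoma Ntoso
# vocabulary; the real schema is far richer (metadata, lifecycle,
# references, etc.). The namespace URI here is an assumption.
NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"

def mark_up_section(sec_num, heading, text):
    """Wrap one section of bill text in Akoma Ntoso-style XML."""
    ET.register_namespace("", NS)  # serialize with a default namespace
    root = ET.Element(f"{{{NS}}}akomaNtoso")
    bill = ET.SubElement(root, f"{{{NS}}}bill")
    body = ET.SubElement(bill, f"{{{NS}}}body")
    section = ET.SubElement(body, f"{{{NS}}}section")
    ET.SubElement(section, f"{{{NS}}}num").text = sec_num
    ET.SubElement(section, f"{{{NS}}}heading").text = heading
    content = ET.SubElement(section, f"{{{NS}}}content")
    ET.SubElement(content, f"{{{NS}}}p").text = text
    return ET.tostring(root, encoding="unicode")

print(mark_up_section("2", "Definitions",
                      "In this Act, the term 'agency' means..."))
```

The point of the challenge is exactly the hard part this sketch skips: deciding how real U.S. bill text maps onto those elements.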


I'm trying really hard to be supportive of this and not be snarky (like at least with this, something will read the laws congress passes... OH darn, see what I mean? ;)

Monday, July 15, 2013

Gestalt your way to better data visualization by following the Gestalt Laws

Six Revisions - How to Make Data Visualization Better with Gestalt Laws

People love order. We love to make sense of the world around us.

The human mind’s affinity for making sense of the objects it sees can be explained in a theory called Gestalt psychology. Gestalt psychology, also referred to as gestaltism, is a set of laws that accounts for how we perceive or intuit patterns and draw conclusions from the things we see.

These laws can help designers produce better designs.

In this guide, we will talk about how to apply the principles of Gestalt to create better charts, graphs, and data visualization graphics.

For broader implementation tips of Gestalt laws, please read Gestalt Principles Applied in Design.


Gestalt laws originate from the field of psychology. Today, however, this set of laws finds relevance in a multitude of disciplines and industries like design, linguistics, musicology, architecture, visual communication, and more.

These laws provide us a framework for explaining how human perception works.

Understanding and applying these laws within the scope of charting and data visualization can help our users identify patterns that matter, quickly and efficiently.

None of the Gestalt laws work in isolation, and in any given scenario, you can find the interplay of two or more of these laws.

Let us cover some of the Gestalt laws that are relevant to enhancing data visualization graphics.



To sum up the lessons we can derive from these Gestalt laws:

  1. Law of Prägnanz: Keep it simple. Arrange data logically wherever possible.
  2. Law of Continuity: Arrange objects in a line to facilitate grouping and comparison.
  3. Law of Similarity: Use similar characteristics (color, size, shape, etc.) to establish relationships and to encourage groupings of objects.
  4. Law of Focal Point: Use distinctive characteristics (like a different color or a different shape) to highlight and create focal points.
  5. Law of Proximity: Know what your chart’s information priority is, and then create groupings through proximity to support that priority.
  6. Law of Isomorphic Correspondence: Keep in mind your user and their preconceived notions and experiences. Stick to well-established conventions and best practices.
  7. Law of Figure/Ground: Ensure there is enough contrast between your foreground and background so that charts and graphs are more legible.
  8. Law of Common Fate: Use direction and/or movement to establish or negate relationships.


The title of my post should have been "Break the Gestalt Laws, go directly to the Data Visualization jail, do not..." Anyway, great write up, advice and guidance...

Thursday, July 11, 2013

A little Hadoop, HDInsight, Mahout, some .Net and a little StackOverflow and you have...

Amazedsaint's Tech Journal - Building A Recommendation Engine - Machine Learning Using Windows Azure HDInsight, Hadoop And Mahout

Feel like helping some one today?

Let us help the Stack Exchange guys suggest questions to a user that he can answer, based on his answering history, much like the way Amazon suggests products based on your previous purchase history. If you don’t know what Stack Exchange does: they run a number of Q&A sites, including the massively popular Stack Overflow.

Our objective here is to see how we can analyze the past answers of a user to predict questions that he may answer in the future. Stack Exchange’s current recommendation logic may work better than ours, but that won’t prevent us from helping them for our own learning purposes.

We’ll be doing the following tasks.

  • Extracting the required information from Stack Exchange data set
  • Using the required information to build a Recommender

But let us start with the basics. If you are totally new to Apache Hadoop and Hadoop on Azure, I recommend you read these introductory articles before you begin, where I explain HDInsight and the MapReduce model in a bit more detail.


Conclusion

In this example, we did a lot of manual work to upload the required input files to HDFS and trigger the Recommender job manually. In fact, you could automate this entire workflow leveraging the Hadoop For Azure SDK. But that is for another post, stay tuned. Real-life analysis has much more to do, including writing map/reducers for extracting and dumping data to HDFS, automating creation of Hive tables, performing operations using HiveQL or Pig, etc. However, we just examined the steps involved in doing something meaningful with Azure, Hadoop and Mahout.

You may also access this data in your Mobile App or ASP.NET Web application, either by using Sqoop to export this to SQL Server, or by loading it to a Hive table as I explained earlier. Happy Coding and Machine Learning!! Also, if you are interested in scenarios where you could tie your existing applications with HDInsight to build end-to-end workflows, get in touch with me.


Just the article I've been looking for. It provides a nice start to finish view of playing with HDInsight and Mahout, which is something I was pulling my hair out over a few months ago...

Thursday, June 13, 2013

Getting into the flow, surfing restaurant inspections with GeoFlow and Microsoft Data Explorer (Think "Web Data + Excel + 3D = Good Food")

Microsoft Business Intelligence - Surfing Restaurant Inspections with Microsoft Data Explorer and GeoFlow

Father’s Day is approaching and you might be thinking about a good place to have a nice lunch with your Dad… We would like to show you how Data Explorer and Geoflow can help you gather some insights to make a good decision.

In order to achieve this, we will look at publicly available data about Food Establishment Inspections for the past 7 years and we will also leverage the Yelp API to bring ratings and reviews for restaurants. For the purpose of this post, we will focus on the King County area (WA) but you can try to find local data about Food Establishment inspections for your area too.

What you will learn in this post:

  • Import data from the Yelp Web API (JSON) using Data Explorer.
  • Import public data about Food Establishment Inspections from a CSV file.
  • Reshape the data in your queries.
  • Parameterize the Yelp query by turning it into a function, using the Data Explorer formula language, so you can reuse it to retrieve information about different types of restaurants as well as different geographical locations.
  • Invoke a function given a set of user-defined inputs in an Excel table.
  • Combine (Merge) two queries.
  • Load the final query into the Data Model.
  • Visualize the results in Geoflow.
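Outside of Excel, the same import-reshape-merge pipeline looks roughly like this in Python. The JSON shape and field names below only mimic a Yelp-style response and a King County inspection export; they are invented for the sketch and are not the real Yelp API schema:

```python
import csv
import io
import json

# Stand-ins for the two sources. A real version would fetch the JSON
# from the Yelp API and read the inspections from a downloaded CSV.
yelp_json = '''{"businesses": [
    {"name": "Pho Hut",     "rating": 4.5},
    {"name": "Burger Barn", "rating": 3.0}
]}'''

inspections_csv = """name,inspection_date,violations
Pho Hut,2013-05-01,0
Burger Barn,2013-04-12,3
"""

# Import + reshape the JSON: keep just the fields we care about.
ratings = {b["name"]: b["rating"]
           for b in json.loads(yelp_json)["businesses"]}

# Merge: attach a rating to each inspection row, like the Merge step
# between the two Data Explorer queries.
merged = []
for row in csv.DictReader(io.StringIO(inspections_csv)):
    row["rating"] = ratings.get(row["name"])
    merged.append(row)

for row in merged:
    print(row["name"], row["rating"], row["violations"])
```

Data Explorer's formula language does the reshape/merge steps declaratively; this is just the imperative equivalent for intuition.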



You know you want to play with this... Just admit it. Makes me want to install Office 2013 just so I can...  :)


Related Past Post XRef:
Going with the GeoFlow for Excel 2013... Free 3D visualization add-in for mapping, exploring, and interacting with geographical/temporal data

Friday, May 24, 2013

From A to W... The US Gov goes Git (and API crazy too). There's an insane amount of data, APIs and OSS projects from the US Government...

Nextgov - White House Releases New Tools for Digital Strategy Anniversary

The White House marked the one-year anniversary of its digital government strategy Thursday with a slate of new releases, including a catalog of government APIs, a toolkit for developing government mobile apps and a new framework for ensuring the security of government mobile devices.

Those releases correspond with three main goals for the digital strategy: make more information available to the public; serve customers better; and improve the security of federal computing.


DATA.Gov - Developer Resources




Government Open Source Projects




That list of APIs and projects just blows my mind... I mean... wow. If you're looking to wander through some code, there HAS to be something here that you'll find interesting. There's something for every language, platform and interest, I think...


Related Past Post XRef:
Happy Birthday You’ve grown so much in the last year… (from 47 to 272,677 datasets)