<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.1.3" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: Stop Collecting So Much Data…</title>
	<link>http://coremarkanalytics.com/blog/2007/07/02/stop-collecting-so-much-data%e2%80%a6/</link>
	<description>Web Analytics Blog - Paving the way to understanding web data as it relates to statistics and other methodologies.</description>
	<pubDate>Sun, 05 Feb 2012 01:56:33 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.1.3</generator>

	<item>
		<title>By: Jacques Warren</title>
		<link>http://coremarkanalytics.com/blog/2007/07/02/stop-collecting-so-much-data%e2%80%a6/#comment-22</link>
		<author>Jacques Warren</author>
		<pubDate>Tue, 03 Jul 2007 12:39:22 +0000</pubDate>
		<guid>http://coremarkanalytics.com/blog/2007/07/02/stop-collecting-so-much-data%e2%80%a6/#comment-22</guid>
					<description>Hi Wendi,

Nice post. I just read that article in CIO Insight after reading Jim Novo's blog. I think you illustrate well how false dependencies can be made. 

Your points:

1. is right on: I have long been an advocate for adding attitudinal analysis to the behavioral one. But wouldn't "reasons" for visits add to the extra variables? Isn't there a risk of adding to the false dependencies? That said, I am a big proponent of free will, and believe there is still some left in us, the consumers. This means I find a lot of explanation in the "why" people say they do what they do.

3. Nice advice. Too much friction definitely doesn't help the whole process.

4. "variables that make sense", yes, but I think this is the whole question here: how does one recognize what makes sense when it comes to the stuff that's not obvious (e.g. the weather)? Isn't there an element of discovery and learning?

5. Could you explain a little more?

6. ?</description>
		<content:encoded><![CDATA[<p>Hi Wendi,</p>
<p>Nice post. I just read that article in CIO Insight after reading Jim Novo&#8217;s blog. I think you illustrate well how false dependencies can be made. </p>
<p>Your points:</p>
<p>1. is right on: I have long been an advocate for adding attitudinal analysis to the behavioral one. But wouldn&#8217;t &#8220;reasons&#8221; for visits add to the extra variables? Isn&#8217;t there a risk of adding to the false dependencies? That said, I am a big proponent of free will, and believe there is still some left in us, the consumers. This means I find a lot of explanation in the &#8220;why&#8221; people say they do what they do.</p>
<p>3. Nice advice. Too much friction definitely doesn&#8217;t help the whole process.</p>
<p>4. &#8220;variables that make sense&#8221;, yes, but I think this is the whole question here: how does one recognize what makes sense when it comes to the stuff that&#8217;s not obvious (e.g. the weather)? Isn&#8217;t there an element of discovery and learning?</p>
<p>5. Could you explain a little more?</p>
<p>6. ?</p>
]]></content:encoded>
				</item>
	<item>
		<title>By: Wendi</title>
		<link>http://coremarkanalytics.com/blog/2007/07/02/stop-collecting-so-much-data%e2%80%a6/#comment-23</link>
		<author>Wendi</author>
		<pubDate>Tue, 03 Jul 2007 13:42:22 +0000</pubDate>
		<guid>http://coremarkanalytics.com/blog/2007/07/02/stop-collecting-so-much-data%e2%80%a6/#comment-23</guid>
					<description>Hi Jacques, Thanks for the thoughts.  

4. When building models, you don't want to include variables that may not make sense when you try to explain them. In some cases you have to be cognizant of classes protected by law: you can't build a credit score with demographics like age, race, etc. Also, I am thinking of this from a practicality standpoint. Try to make the model as simple as possible; that makes it easier to apply to future events. But you are right that it does take away some of the surprise element.

5. When you use data mining techniques to build models, you need to test the predictive accuracy, computational speed, robustness, scalability, and interpretability (point #4 above). Think of validation as taking a test twice. With the same questions, you'd expect a better score the second time around; with a second set of questions on the same topic, you may or may not do better. So many people don't hold back enough data to conduct a sound validation of the model. The validation process helps identify the accuracy of the overall model on data it has not seen.

6. There is a DOE (design of experiments) technique called Repeated Measures that I think applies better to transactional data, or data that has an order/sequence. This technique is not used enough (in my opinion). So many people just use aggregate data, which may be enough, but being able to predict when something will happen with better precision is golden. Peter Fader touches on this in his interview.

Hope this helps!  
Wendi</description>
		<content:encoded><![CDATA[<p>Hi Jacques, Thanks for the thoughts.  </p>
<p>4. When building models, you don&#8217;t want to include variables that may not make sense when you try to explain them. In some cases you have to be cognizant of classes protected by law: you can&#8217;t build a credit score with demographics like age, race, etc. Also, I am thinking of this from a practicality standpoint. Try to make the model as simple as possible; that makes it easier to apply to future events. But you are right that it does take away some of the surprise element.</p>
<p>5. When you use data mining techniques to build models, you need to test the predictive accuracy, computational speed, robustness, scalability, and interpretability (point #4 above). Think of validation as taking a test twice. With the same questions, you&#8217;d expect a better score the second time around; with a second set of questions on the same topic, you may or may not do better. So many people don&#8217;t hold back enough data to conduct a sound validation of the model. The validation process helps identify the accuracy of the overall model on data it has not seen.</p>
<p>6. There is a DOE (design of experiments) technique called Repeated Measures that I think applies better to transactional data, or data that has an order/sequence. This technique is not used enough (in my opinion). So many people just use aggregate data, which may be enough, but being able to predict when something will happen with better precision is golden. Peter Fader touches on this in his interview.</p>
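<p>A minimal sketch of the holdout idea in point 5, in Python. The data, the 70/30 split, and the 1-nearest-neighbor "model" are illustrative assumptions, not anything from the original post: a model that memorizes its training data scores perfectly on the questions it has already seen, and only held-back data reveals the real error.</p>

```python
import random

def predict(train, x):
    """1-nearest-neighbor: copy the y of the closest training x (a pure memorizer)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def mean_squared_error(train, data):
    """Average squared gap between predictions and the observed y values."""
    return sum((predict(train, x) - y) ** 2 for x, y in data) / len(data)

# Hypothetical data: y = 2x plus noise (made up for illustration).
rng = random.Random(0)
points = []
for _ in range(200):
    x = rng.uniform(0, 10)
    points.append((x, 2.0 * x + rng.gauss(0, 1.0)))

# Hold back 30% of the rows before fitting anything.
rng.shuffle(points)
train, holdout = points[:140], points[140:]

train_mse = mean_squared_error(train, train)      # same questions twice
holdout_mse = mean_squared_error(train, holdout)  # new questions, same topic
print(f"training MSE: {train_mse:.3f}")
print(f"holdout MSE:  {holdout_mse:.3f}")
```

<p>The training error is zero by construction, so it says nothing about how the model will do on future events; the holdout error is the number to report.</p>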
<p>Hope this helps!<br />
Wendi</p>
]]></content:encoded>
				</item>
	<item>
		<title>By: Matt Gershoff</title>
		<link>http://coremarkanalytics.com/blog/2007/07/02/stop-collecting-so-much-data%e2%80%a6/#comment-4811</link>
		<author>Matt Gershoff</author>
		<pubDate>Thu, 24 Jul 2008 03:38:17 +0000</pubDate>
		<guid>http://coremarkanalytics.com/blog/2007/07/02/stop-collecting-so-much-data%e2%80%a6/#comment-4811</guid>
					<description>Wendi,

If the task at hand is to make predictions, why would you worry about covariance among your independent variables (assuming they are not perfectly collinear, in which case you are not adding any information)? Won’t this just increase the variance around the parameter estimates? Does that matter if they are not policy (control) variables? Also, is this assuming that you are using regression as a data mining method?
There are various non-OLS methods that work in very large/infinite feature spaces and might be better than OLS for classification and prediction problems.
I agree with you about rule 5, with one caveat: rather than making this a hard-and-fast rule, the analyst should estimate the marginal benefit of the data vs. the cost to acquire it, and in fact that should be done for all of the data. If data is really expensive (time, hassle, $$$) but improves predictions so that the incremental return is greater than the cost of the data, then by all means get that extra data.</description>
		<content:encoded><![CDATA[<p>Wendi,</p>
<p>If the task at hand is to make predictions, why would you worry about covariance among your independent variables (assuming they are not perfectly collinear, in which case you are not adding any information)? Won’t this just increase the variance around the parameter estimates? Does that matter if they are not policy (control) variables? Also, is this assuming that you are using regression as a data mining method?<br />
There are various non-OLS methods that work in very large/infinite feature spaces and might be better than OLS for classification and prediction problems.<br />
I agree with you about rule 5, with one caveat: rather than making this a hard-and-fast rule, the analyst should estimate the marginal benefit of the data vs. the cost to acquire it, and in fact that should be done for all of the data. If data is really expensive (time, hassle, $$$) but improves predictions so that the incremental return is greater than the cost of the data, then by all means get that extra data.</p>
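<p>The collinearity question above can be illustrated with a small simulation. This is only a sketch under made-up assumptions (two standard-normal predictors, true coefficients of 1, unit noise, no intercept, and the hypothetical helper names below): it refits OLS on fresh samples and compares how much the estimate of b1 moves around with how much a prediction moves around.</p>

```python
import math
import random
import statistics

def fit_two_predictors(x1, x2, y):
    """OLS without intercept for y = b1*x1 + b2*x2, via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * t for a, t in zip(x1, y))
    s2y = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def simulate(rho, trials=300, n=50, seed=0):
    """Refit on fresh samples; return the spread of b1 and of the prediction at (1, 1)."""
    rng = random.Random(seed)
    b1_values, predictions = [], []
    for _ in range(trials):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 has correlation rho with x1 by construction.
        x2 = [rho * a + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for a in x1]
        y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]  # true b1 = b2 = 1
        b1, b2 = fit_two_predictors(x1, x2, y)
        b1_values.append(b1)
        predictions.append(b1 * 1.0 + b2 * 1.0)
    return statistics.stdev(b1_values), statistics.stdev(predictions)

coef_sd_low, pred_sd_low = simulate(rho=0.0)
coef_sd_high, pred_sd_high = simulate(rho=0.95)
print(f"rho=0.00: sd(b1)={coef_sd_low:.3f}, sd(prediction)={pred_sd_low:.3f}")
print(f"rho=0.95: sd(b1)={coef_sd_high:.3f}, sd(prediction)={pred_sd_high:.3f}")
```

<p>In a setup like this the coefficient spread grows sharply as rho approaches 1, while the prediction at a point that follows the same correlation pattern as the training data stays tight. Predictions at points that break that pattern, such as (1, -1), would not be so stable.</p>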
]]></content:encoded>
				</item>
</channel>
</rss>

